Important Metrics
We track a number of metrics on our operations dashboard. The distinct categories of metrics are described below.
Service Level Indicators
SLIs are carefully defined quantitative measures of some aspect of the level of service that is provided. You can find more information on SLIs in the Site Reliability Engineering Book.
SLI | Query | Status |
---|---|---|
Availability | min(up{job="ArgoCD Metrics"}) | Done |
Reconciliation Performance | sum(increase(argocd_app_reconcile_bucket{dest_server=\~"$Server", job="ArgoCD Metrics"}[10m])) by (le) | Needs Update to use Histogram |
Length of Work Queue | increase(workqueue_unfinished_work_seconds{}[10m]) | Done |
SLI Descriptions
- Availability: This tells us whether important ArgoCD pods are up.
- Reconciliation Performance: For ArgoCD, reconciliation means making the state of the application match what is stored in Git. This is usually done by managing Kubernetes objects. This SLI shows how long our reconciliations take, and whether the system is running in a degraded state or needs to be scaled up.
- Length of Work Queue: This tells us how many seconds of work are in progress but have not yet been observed by work_duration. Large values usually indicate stuck threads.
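
The Reconciliation Performance query returns per-bucket counts grouped by `le`, so a latency figure still has to be derived from those buckets. The following is a minimal Python sketch of one way to do that, using linear interpolation inside a bucket in the spirit of PromQL's `histogram_quantile`. It is a simplification, not Prometheus's exact implementation. The function name `estimate_quantile` and the sample bucket bounds and counts are hypothetical and are not taken from a real ArgoCD instance.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` mirrors the output of a `... by (le)` query: each count covers
    all observations less than or equal to its upper bound.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if bound == math.inf:
                # Cannot interpolate into the open-ended bucket, so fall
                # back to the highest finite upper bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical 10-minute bucket increases: (le in seconds, cumulative count).
sample_buckets = [(0.25, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(estimate_quantile(0.5, sample_buckets))  # about 0.45 seconds for this sample
```

A quantile that lands in the `+Inf` bucket is reported as the highest finite bound, so the result understates latency whenever the configured buckets are too small for the observed reconciliation times.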
Usage Metrics
These metrics track usage of the ArgoCD instance.
Usage Metric | Query | Status |
---|---|---|
Number of Managed Servers | count(count by (dest_server) (argocd_app_info{project=\~"$Project"})) | Done |
Number of Managed Apps | count((argocd_app_info{project=\~"$Project"})) | Done |
Number of Source Repos | count(count by (repo) (argocd_app_info{project=\~"$Project"})) | Done |
Rate of Change | To Be Implemented | Not Done |
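
The nested `count(count by (...))` pattern in the queries above counts distinct label values: the inner aggregation leaves one series per `dest_server` or `repo`, and the outer `count` counts those series. A rough Python equivalent is sketched below, assuming each `argocd_app_info` series is represented as a plain dict of labels. The helper name `usage_summary` and all sample label values are made up for illustration.

```python
def usage_summary(app_info_series):
    """Summarise usage from a list of label dicts, one per argocd_app_info series."""
    return {
        # Distinct destination clusters, like count(count by (dest_server) (...)).
        "managed_servers": len({s["dest_server"] for s in app_info_series}),
        # One series per application, like count(argocd_app_info{...}).
        "managed_apps": len(app_info_series),
        # Distinct source repositories, like count(count by (repo) (...)).
        "source_repos": len({s["repo"] for s in app_info_series}),
    }


# Hypothetical series; the label values are illustrative only.
sample_series = [
    {"name": "app-a", "dest_server": "https://cluster-1", "repo": "https://git.example/one"},
    {"name": "app-b", "dest_server": "https://cluster-1", "repo": "https://git.example/two"},
    {"name": "app-c", "dest_server": "https://cluster-2", "repo": "https://git.example/one"},
]
print(usage_summary(sample_series))
```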
Service Level Indicators for App Owners
These metrics are possible SLIs for the app owners.
SLI | Query | Status |
---|---|---|
App Sync Failures | ceil(increase(argocd_app_sync_total{dest_server=\~"\$Server", project=\~"$Project", phase=\~"Error\|Failed"}[10m])) | Done |
Possible Service Level Indicators
These metrics are interesting to look at but their usefulness has not yet been determined.
Metric | Query | Possible Usage | Status |
---|---|---|---|
Percentage of Apps in Sync | sum(argocd_app_info{sync_status="Synced", dest_server=\~"\$Server", dest_namespace=\~"\$Namespace", project=\~"\$Project"})/sum(argocd_app_info{dest_server=\~"\$Server", dest_namespace=\~"\$Namespace", project=\~"\$Project"}) | App Owner SLI | Done |
Cumulative GRPC Success Rate | sum(grpc_server_handled_total{grpc_code="OK",job="ArgoCD Metrics"})/sum(grpc_server_handled_total{job="ArgoCD Metrics"}) | SRE SLI | Done |
Application Time Spent out of Sync | To be Implemented | App Owner SLI | Not Done |
App Reconciliation Work Duration | sum(increase(workqueue_work_duration_seconds_bucket{name="app_reconciliation_queue"}[10m])) by (le) | SRE SLI | Update to use Histogram |
App Operation Processing Work Duration | sum(increase(workqueue_work_duration_seconds_bucket{name="app_operation_processing_queue"}[10m])) by (le) | SRE SLI | Update to use Histogram |
Metric Descriptions
- Percentage of Apps in Sync: A large percentage of apps staying out of sync for a long time indicates a problem. Because the cause can lie with either the app owners or the SREs, it doesn't make sense to use this as an SRE SLI.
- Cumulative GRPC Success Rate: Seeing a large drop in the success rate could indicate issues with the underlying system.
- Application Time Spent out of Sync: Similar to Percentage of Apps in Sync, this can have multiple causes, so it is difficult to define an SRE SLI based on it.
- App Reconciliation Work Duration: This gives us a histogram of how much time (in seconds) it takes for ArgoCD to process items in the App Reconciliation queue.
- App Operation Processing Work Duration: This gives us a histogram of how much time (in seconds) it takes for ArgoCD to process items in the App Operation Processing queue.
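
The Percentage of Apps in Sync query divides the number of `Synced` series by the number of all series that pass the same label filters. A minimal Python sketch of that ratio follows, assuming each `argocd_app_info` series is a dict of labels. It uses `re.fullmatch` because PromQL's `=~` matcher is fully anchored. Python's regex syntax is not identical to the RE2 syntax Prometheus uses, so treat the matching as an approximation. The function name `fraction_in_sync` and the sample data are hypothetical.

```python
import re


def fraction_in_sync(series, project_pattern=".*"):
    """Return the fraction of matching apps whose sync_status is "Synced".

    Returns None when no series match the project filter.
    """
    matching = [s for s in series if re.fullmatch(project_pattern, s["project"])]
    if not matching:
        return None
    synced = sum(1 for s in matching if s["sync_status"] == "Synced")
    # Note: this returns 0.0 when apps exist but none are Synced, whereas the
    # PromQL division returns no data in that case because the numerator
    # selects no series.
    return synced / len(matching)


# Hypothetical series; project names and statuses are illustrative only.
sample_apps = [
    {"project": "team-a", "sync_status": "Synced"},
    {"project": "team-a", "sync_status": "Synced"},
    {"project": "team-a", "sync_status": "OutOfSync"},
    {"project": "team-b", "sync_status": "OutOfSync"},
]
print(fraction_in_sync(sample_apps, "team-a"))
```

If the dashboard should show 0 instead of an empty panel when nothing is in sync, that case needs separate handling in the query itself.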