28 changes: 27 additions & 1 deletion docs/operations/_index.md
@@ -4,4 +4,30 @@ linkTitle: "Operations"
no_section_index_title: true
weight: 8
menu:
---
---

This section covers day-2 operation of a Cortex cluster. Start here if you are
running Cortex in production.

## Core operator guides

- [Monitoring Cortex]({{< relref "./monitoring-cortex.md" >}}) — install the
bundled dashboards, alert rules, and recording rules.
- [Troubleshooting]({{< relref "./troubleshooting.md" >}}) — symptom-driven
decision tree for the write path, read path, storage, and rings.
- [Upgrading]({{< relref "./upgrading.md" >}}) — version-to-version upgrade
procedure, component ordering, and downgrade caveats.

## Specialized topics

- [Scaling the Query Frontend]({{< relref "./scalable-query-frontend.md" >}})
- [Query Auditor]({{< relref "./query-auditor.md" >}}) — detect query
correctness regressions.
- [Query Tee]({{< relref "./query-tee.md" >}}) — compare two Cortex deployments
side-by-side.
- [Requests Mirroring with Envoy]({{< relref
"./requests-mirroring-to-secondary-cluster.md" >}})

For component-level operational guidance (HA pairs, shuffle sharding, zone
replication, capacity planning, encryption), see the [Guides]({{< relref
"../guides/" >}}) section.
132 changes: 132 additions & 0 deletions docs/operations/monitoring-cortex.md
@@ -0,0 +1,132 @@
---
title: "Monitoring Cortex"
linkTitle: "Monitoring Cortex"
weight: 1
slug: monitoring-cortex
---

This page describes the bundled assets Cortex ships for monitoring a production
deployment — Grafana dashboards, Prometheus alerting rules, and recording rules
— and how to install them. The assets live in the repository and are kept in
sync with the code; they are the same artifacts the Cortex maintainers use to
operate their own clusters.

## What ships with Cortex

| Asset | Source | Purpose |
|-------|--------|---------|
| Dashboards (JSON) | `docs/getting-started/dashboards/` | Drop-in Grafana dashboards covering every Cortex component |
| Alert rules | `docs/getting-started/alerts.yaml` | 50+ PrometheusRule alerts grouped by component |
| Recording rules | `docs/getting-started/cortex-jsonnet/cortex-mixin/recording_rules.libsonnet` | Pre-aggregated series used by the dashboards and alerts |
| Jsonnet mixin | `docs/getting-started/cortex-jsonnet/cortex-mixin/` | The source of truth — generates the JSON/YAML above |

## Dashboards

Each dashboard JSON in `docs/getting-started/dashboards/` is ready to import into
Grafana via **Dashboards → Import → Upload JSON file**.

| Dashboard | What to watch |
|-----------|---------------|
| `cortex-writes.json` | End-to-end write path: distributor QPS, ingestion rate, ingester push errors and latency, samples appended, WAL writes. The first dashboard to open during a write incident. |
| `cortex-reads.json` | End-to-end read path: query QPS at the frontend, scheduler queue length, querier execution latency, store-gateway and ingester sub-queries. |
| `cortex-queries.json` | Per-query breakdowns: chunks/series fetched, bytes processed, queries by tenant. Useful for hunting expensive queries. |
| `cortex-slow-queries.json` | The slowest queries in the last interval, including the PromQL and the tenant. Pair with the query-frontend logs. |
| `cortex-compactor.json` | Compactor run progress, blocks compacted vs. failed, sync errors. |
| `cortex-compactor-resources.json` | CPU, memory, disk, and goroutines for the compactor pods. |
| `cortex-object-store.json` | Object-store request rate, latency, and error rate broken down by operation (Get, Iter, Upload). |
| `cortex-rollout-progress.json` | Rolling-deployment progress for stateful sets (ingester, store-gateway, compactor). |
| `cortex-scaling.json` | Suggested replica counts derived from current load — pair with [Capacity Planning]({{< relref "../guides/capacity-planning.md" >}}). |
| `cortex-config.json` | The runtime configuration currently in effect, by tenant. |
| `alertmanager.json` | Alertmanager-specific: notification rate, replication, ring health. |
| `ruler.json` | Ruler-specific: evaluation rate, missed evaluations, push and query errors. |

Dashboards assume a Prometheus datasource named `Cortex`; either name your
datasource that way or edit the dashboard variables on import. Several
dashboards rely on the recording rules described below — install those first or
some panels will be empty.
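
For more than a handful of dashboards, the manual upload can be scripted. The following is a minimal sketch, assuming Grafana's HTTP API endpoint `POST /api/dashboards/db` is reachable and accepts a body of the form `{"dashboard": ..., "overwrite": ...}`. The helper names `build_import_payload` and `payloads_for_directory` are hypothetical and not part of Cortex or Grafana.

```python
import json
from pathlib import Path


def build_import_payload(dashboard_json: str) -> dict:
    """Wrap one dashboard JSON document in a body for POST /api/dashboards/db.

    Assumption: the target Grafana accepts the documented
    {"dashboard": ..., "overwrite": ...} request shape.
    """
    dashboard = json.loads(dashboard_json)
    # Clear the numeric id so Grafana does not try to match an id that
    # only exists in the instance the dashboard was exported from.
    dashboard["id"] = None
    # overwrite=True replaces an existing dashboard with the same uid.
    return {"dashboard": dashboard, "overwrite": True}


def payloads_for_directory(directory: str) -> dict:
    """Map each *.json file name in `directory` to its import payload."""
    return {
        path.name: build_import_payload(path.read_text())
        for path in sorted(Path(directory).glob("*.json"))
    }
```

Each payload can then be sent with any HTTP client, using an API token for the Grafana instance. The sketch does not touch the datasource, so the naming requirement described above still applies.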

## Alerts

The bundled alerts in `docs/getting-started/alerts.yaml` are grouped by concern:

| Group | Examples |
|-------|----------|
| `cortex_alerts` | `CortexIngesterUnhealthy`, `CortexRequestErrors`, `CortexRequestLatency`, `CortexQueriesIncorrect`, `CortexInconsistentRuntimeConfig`, `CortexKVStoreFailure`, `CortexMemoryMapAreasTooHigh` |
| `cortex_ingester_instance_alerts` | `CortexIngesterReachingSeriesLimit`, `CortexIngesterReachingTenantsLimit`, `CortexDistributorReachingInflightPushRequestLimit` |
| `cortex-rollout-alerts` | `CortexRolloutStuck` |
| `cortex-provisioning` | `CortexProvisioningTooManyActiveSeries`, `CortexProvisioningTooManyWrites`, `CortexAllocatingTooMuchMemory` |
| `ruler_alerts` | `CortexRulerTooManyFailedPushes`, `CortexRulerTooManyFailedQueries`, `CortexRulerMissedEvaluations`, `CortexRulerFailedRingCheck` |
| `gossip_alerts` | `CortexGossipMembersMismatch` |
| `etcd_alerts` | `EtcdAllocatingTooMuchMemory` |
| `alertmanager_alerts` | `CortexAlertmanagerSyncConfigsFailing`, `CortexAlertmanagerRingCheckFailing`, `CortexAlertmanagerPartialStateMergeFailing`, `CortexAlertmanagerReplicationFailing`, `CortexAlertmanagerPersistStateFailing`, `CortexAlertmanagerInitialSyncFailed` |
| `cortex_blocks_alerts` | `CortexIngesterHasNotShippedBlocks`, `CortexIngesterHasUnshippedBlocks`, `CortexIngesterTSDBHeadCompactionFailed`, `CortexIngesterTSDBWALCorrupted`, `CortexQuerierHasNotScanTheBucket`, `CortexQuerierHighRefetchRate`, `CortexStoreGatewayHasNotSyncTheBucket`, `CortexBucketIndexNotUpdated`, `CortexTenantHasPartialBlocks` |
| `cortex_compactor_alerts` | `CortexCompactorHasNotSuccessfullyCleanedUpBlocks`, `CortexCompactorHasNotSuccessfullyRunCompaction`, `CortexCompactorHasNotUploadedBlocks` |

Every alert in the file defines a `for` duration, a `severity` label, and a
short summary annotation. Treat these as a starting point: tune the thresholds,
and decide which alerts page and which open a ticket, to match your SLOs.

### Installing the alerts

The alerts file is a standard Prometheus rule file. In Kubernetes with the
Prometheus Operator, wrap it in a `PrometheusRule` resource; an example lives in
`docs/getting-started/prometheusrule.yaml`. With a self-hosted Prometheus, add
the file to `rule_files:` in `prometheus.yml`.
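
As an illustration of the wrapping step, here is a minimal sketch that nests a rule file under the `spec:` key of a `PrometheusRule` manifest. It assumes the rule file has a top-level `groups:` key, as standard Prometheus rule files do. The function name, resource name, and namespace are hypothetical, and a real deployment usually also needs labels that match the operator's rule selector.

```python
import textwrap


def wrap_in_prometheus_rule(rule_file_text: str, name: str, namespace: str) -> str:
    """Return a PrometheusRule manifest whose spec is the given rule file.

    Assumption: rule_file_text starts with a top-level `groups:` key,
    which is also the key a PrometheusRule expects under `spec:`.
    """
    header = (
        "apiVersion: monitoring.coreos.com/v1\n"
        "kind: PrometheusRule\n"
        "metadata:\n"
        f"  name: {name}\n"
        f"  namespace: {namespace}\n"
        "spec:\n"
    )
    # Indent the whole rule file by two spaces so `groups:` becomes a
    # child of `spec:` without changing the YAML structure below it.
    body = textwrap.indent(rule_file_text.strip("\n"), "  ")
    return header + body + "\n"
```

Working on the text rather than parsing the YAML keeps the sketch free of third-party dependencies, at the cost of trusting the input layout. The example in `docs/getting-started/prometheusrule.yaml` remains the reference.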

If you also run a Cortex ruler, the same file can be loaded into Cortex itself
via `cortextool rules load` (see [Sharded Ruler]({{< relref
"../guides/sharded_ruler.md" >}})).

## Recording rules

The dashboards depend on a set of pre-aggregated metrics defined in
`docs/getting-started/cortex-jsonnet/cortex-mixin/recording_rules.libsonnet`.
These collapse per-instance counters into per-cluster and per-tenant rates so
the dashboards stay fast on large deployments. Install them the same way as the
alerts, in the same Prometheus instance.

Skipping the recording rules will leave several dashboard panels blank or
extremely slow.

## The Jsonnet mixin

If you already manage Prometheus rules and dashboards via Jsonnet/Tanka, import
`docs/getting-started/cortex-jsonnet/cortex-mixin/` directly:

```jsonnet
local cortexMixin = import 'cortex-mixin/mixin.libsonnet';

{
  prometheusAlerts+:: cortexMixin.prometheusAlerts,
  prometheusRules+:: cortexMixin.prometheusRules,
  grafanaDashboards+:: cortexMixin.grafanaDashboards,
}
```

The mixin honours the standard [monitoring-mixin
contract](https://github.com/monitoring-mixins/docs), so it composes with mixins
for Kubernetes, etcd, Memcached, and the other dependencies a Cortex cluster
typically runs alongside.

The mixin's `_config` block exposes knobs for the datasource name, single-binary
vs. microservices mode, namespace/cluster labels, and per-component selectors.
See `cortex-mixin/config.libsonnet` for the full list.

## Tracing

Dashboards and alerts cover the RED metrics: request rate, errors, and duration.
For end-to-end request tracing, configure Cortex's OpenTelemetry/Jaeger exporter
as described in [Tracing]({{< relref "../guides/tracing.md" >}}). The
`cortex-slow-queries.json` dashboard surfaces a query ID that maps directly to a
trace when tracing is enabled, making it easy to pivot from "this query was
slow" to "here is where it spent its time."

## Related

- [Capacity Planning]({{< relref "../guides/capacity-planning.md" >}}) — sizing
inputs to feed the scaling dashboard.
- [Tracing]({{< relref "../guides/tracing.md" >}}) — span exporter setup.
- [Query Auditor]({{< relref "./query-auditor.md" >}}) — detecting query
correctness regressions.
- [Query Tee]({{< relref "./query-tee.md" >}}) — comparing two Cortex
deployments side-by-side.
187 changes: 187 additions & 0 deletions docs/operations/troubleshooting.md
@@ -0,0 +1,187 @@
---
title: "Troubleshooting Cortex"
linkTitle: "Troubleshooting"
weight: 2
slug: troubleshooting
---

A decision tree for the most common production issues. Each section starts with
the symptom an operator sees, names the metrics and logs to inspect, and points
to the upstream fix.

The [bundled dashboards and alerts]({{< relref "./monitoring-cortex.md" >}})
surface most of the signals referenced below. Install them first if you have
not already.

## Write path

### Distributors return 5xx on `/api/v1/push`

1. **Confirm where the error originates.** Distributor logs include the cause:
ingester unreachable, rate-limit exceeded, validation error. Filter for
`level=warn` and `level=error` on the distributor.
2. **Check ingester health on the ring page** (`/ring` on any distributor). All
ingesters should be in state `ACTIVE`. `UNHEALTHY` or missing ingesters
point at a partition between distributor and ingester, or at the KV store.
3. **Check the `CortexIngesterUnhealthy` alert.** If it is firing, follow it:
the offending ingester is in the alert's labels.
4. **Inspect `cortex_distributor_ingester_append_failures_total`.** A non-zero
rate that matches the 5xx rate confirms ingester-side rejection.
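
The comparison in step 4 can be run against the Prometheus that scrapes Cortex. The sketch below only builds instant-query URLs for the standard Prometheus HTTP API (`/api/v1/query`). The 5xx expression is an assumption: it presumes the distributor exposes `cortex_request_duration_seconds_count` with `route` and `status_code` labels and that the push route is labelled `api_v1_push`; adjust it to your deployment.

```python
from urllib.parse import urlencode

# Metric named in the step above.
APPEND_FAILURES = (
    "sum(rate(cortex_distributor_ingester_append_failures_total[5m]))"
)

# Assumption: metric name and label values may differ in your deployment.
PUSH_5XX = (
    'sum(rate(cortex_request_duration_seconds_count'
    '{route="api_v1_push",status_code=~"5.."}[5m]))'
)


def instant_query_url(prometheus_base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given expression."""
    base = prometheus_base_url.rstrip("/")
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"
```

Fetching both URLs and comparing the two values shows whether the append-failure rate tracks the 5xx rate, which is the condition step 4 describes.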

If the cause is `per-user limit exceeded`, raise the limit in `runtime_config`
([Overrides]({{< relref "../guides/overrides.md" >}})) rather than scaling out.

### Samples are accepted but never appear in queries

1. **Verify the tenant header.** The push and the query must use the same
`X-Scope-OrgID`. The single most common cause of "missing data" is a
tenant-ID mismatch.
2. **Check `cortex_ingester_memory_series` on the receiving ingester.** If
non-zero for the tenant, the data is in memory and queries should see it.
3. **Confirm time-range overlap.** Ingesters serve recent data from the TSDB
head and from local on-disk blocks until they age out per
`-blocks-storage.tsdb.retention-period` (default `6h`). Queriers stop
consulting ingesters entirely for time ranges older than
`-limits.query-ingesters-within` (per-tenant, when set). Older data must
have been shipped and must be visible to the store-gateway via the bucket
index — check `cortex_ingester_shipper_uploads_total`, the
`CortexIngesterHasNotShippedBlocks` alert, and
`CortexBucketIndexNotUpdated`.
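
Because a tenant mismatch is the most common cause, it can be worth checking mechanically. This is a minimal sketch that compares the `X-Scope-OrgID` value taken from a push request and from a query request. The helper names are hypothetical; header names are matched case-insensitively, while tenant IDs are compared exactly.

```python
from typing import Mapping, Optional

TENANT_HEADER = "X-Scope-OrgID"


def tenant_of(headers: Mapping[str, str]) -> Optional[str]:
    """Return the tenant ID from HTTP headers, or None if it is absent."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == TENANT_HEADER.lower():
            return value.strip()
    return None


def same_tenant(push_headers: Mapping[str, str],
                query_headers: Mapping[str, str]) -> bool:
    """True only if both requests carry the same, non-missing tenant ID."""
    push_tenant = tenant_of(push_headers)
    return push_tenant is not None and push_tenant == tenant_of(query_headers)
```

For example, `same_tenant({"X-Scope-OrgID": "team-a"}, {"x-scope-orgid": "team-a"})` is true, while a query sent as `team-b`, or without the header, is flagged as a mismatch.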

### Distributor `inflight push requests` rejected

The `CortexDistributorReachingInflightPushRequestLimit` alert fires when a
distributor's in-flight push requests approach
`-distributor.instance-limits.max-inflight-push-requests`. Either scale
distributors horizontally or, if CPU and memory have headroom, raise the limit.

## Read path

### Queries time out at the frontend

1. **Look at `cortex-reads.json` and `cortex-slow-queries.json`.** They show
queue depth, per-step latency, and the offending PromQL.
2. **If the frontend queue is full** (`CortexFrontendQueriesStuck` or
`CortexSchedulerQueriesStuck`): there are not enough queriers, or queriers
are blocked on something downstream. Check querier CPU, then ingester and
store-gateway latency.
3. **If the queue is empty but queries are still slow:** the bottleneck is in
the querier or below. Look at chunks fetched per query and bytes scanned —
an expensive query may need the protections in [Protecting Cortex from
Heavy Queries]({{< relref "../guides/protecting-cortex-from-heavy-queries.md"
>}}).

### Queries return partial or no data for old time ranges

Old data lives in object storage and is served by the store-gateway. Check:

- `CortexStoreGatewayHasNotSyncTheBucket` — a stale store-gateway will not see
recently uploaded blocks.
- `CortexBucketIndexNotUpdated` — the compactor maintains the bucket index;
querier and store-gateway use it to discover blocks.
- `CortexQuerierHighRefetchRate` — symptom of store-gateways missing blocks
the querier expected to find.

### Queries return incorrect results

`CortexQueriesIncorrect` fires when the same query, run through the query-tee
against two backends, returns different results. Cortex ships a [Query
Auditor]({{< relref "./query-auditor.md" >}}) for this case; pair it with the
[Query Tee]({{< relref "./query-tee.md" >}}) to identify which deployment is
wrong.

## Storage path

### Ingester is not shipping blocks

The `CortexIngesterHasNotShippedBlocks` and `CortexIngesterHasUnshippedBlocks`
alerts catch this. Common causes:

- Object-store credentials are misconfigured: check the ingester logs for
`403`/`AccessDenied`.
- A new block has not been cut yet. Ingesters cut blocks every
`-blocks-storage.tsdb.block-ranges-period` (default `2h`); a recently
started ingester has nothing to ship until the first block-range elapses.
- Disk pressure: check `cortex_ingester_tsdb_*` metrics and pod disk usage.

### TSDB head compaction or WAL errors

`CortexIngesterTSDBHeadCompactionFailed`, `CortexIngesterTSDBWALCorrupted`, and
`CortexIngesterTSDBWALWritesFailed` indicate disk-level problems. Treat the
affected ingester as a failed replica: cordon it, let traffic move to the
other replicas in the ring, then restore from a healthy ingester or replay
the WAL on a fresh volume. Do **not** restart in place if the WAL is corrupt —
you will lose the in-memory series.

### Compactor falls behind

`CortexCompactorHasNotSuccessfullyRunCompaction` means recent blocks are
piling up and queries will get slower over time. Check:

- Compactor CPU and memory headroom — compaction is CPU-bound.
- Object-store latency on the compactor (it does a lot of small reads/writes).
- The `cortex-compactor.json` dashboard for per-tenant progress.

See [Partitioning Compactor]({{< relref "../guides/partitioning-compactor.md"
>}}) for scaling out.

## Hash ring and KV store

### `CortexKVStoreFailure` is firing

The component named in the alert cannot reach the KV store backend (Consul,
etcd, or memberlist). Steps:

1. From an affected pod, hit the KV backend's health endpoint directly.
2. If the backend is up, look for network policy or DNS changes since the alert
started.
3. With memberlist, check `cortex_memberlist_client_messages_received_total`
and `cortex_memberlist_client_messages_sent_total` on each pod; a partition
shows up as one-sided traffic.
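
The "one-sided traffic" check in step 3 can be sketched as follows. The input is assumed to be per-pod rates already computed from the two counters above (for example with `rate(...[5m])`); the function name and the `floor` parameter are hypothetical.

```python
from typing import Dict, List, Tuple


def one_sided_pods(rates: Dict[str, Tuple[float, float]],
                   floor: float = 0.0) -> List[str]:
    """Return pods that send without receiving, or receive without sending.

    `rates` maps pod name -> (sent_per_second, received_per_second).
    A rate counts as active only if it is strictly above `floor`.
    """
    suspects = []
    for pod, (sent, received) in sorted(rates.items()):
        # Healthy pods are active in both directions; idle pods in
        # neither. Exactly one active direction suggests a partition.
        if (sent > floor) != (received > floor):
            suspects.append(pod)
    return suspects
```

A pod with no traffic in either direction is not flagged here, since that more likely points to a pod that is down or not scraped than to a partition.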

### Ingesters keep joining and leaving the ring

`CortexGossipMembersMismatch` indicates members disagree on cluster membership.
This is almost always a misconfigured `join_members:` list (some pods do not
list a bootstrap peer that resolves) or a packet-loss issue between zones.
[Gossip Ring Getting Started]({{< relref "../guides/gossip-ring-getting-started.md"
>}}) walks through the canonical configuration.

## Alertmanager

`CortexAlertmanagerSyncConfigsFailing`, `CortexAlertmanagerReplicationFailing`,
and the `*Persist*` / `*InitialSync*` alerts trace to the Alertmanager's
storage backend or its peer replication. Inspect the alertmanager logs for the
specific operation that failed; the alert annotations include the storage
endpoint that returned the error.

## Ruler

A spike in `CortexRulerMissedEvaluations` typically means a ruler tenant has
too many rules for the assigned shards. Either shard more aggressively (see
[Sharded Ruler]({{< relref "../guides/sharded_ruler.md" >}})) or move
heavy-evaluation tenants to the
[query-frontend-backed rule evaluation path]({{< relref
"../guides/rule-evaluations-via-query-frontend.md" >}}) so they share the
query path's capacity rather than the ruler's local one.

## Multi-tenant noisy-neighbour

If one tenant is degrading the cluster for everyone:

1. Use `cortex-queries.json` filtered by tenant to confirm the source.
2. Apply tenant-specific limits via `runtime_config` ([Overrides]({{< relref
"../guides/overrides.md" >}})). Cortex re-reads the file every
`-runtime-config.reload-period` (default `10s`), so no restart is needed.
3. For longer-term isolation, move the tenant to its own shuffle shard
([Shuffle Sharding]({{< relref "../guides/shuffle-sharding.md" >}})).
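
To make step 2 concrete, here is a minimal sketch that renders a per-tenant `overrides:` block for the runtime configuration file. The tenant ID and the limit values are illustrative assumptions, and the function name is hypothetical. `ingestion_rate` and `ingestion_burst_size` are used only as example limit names; check the Overrides guide for the limits your version supports.

```python
from typing import Dict, Union

Number = Union[int, float]


def render_overrides(limits_by_tenant: Dict[str, Dict[str, Number]]) -> str:
    """Render a runtime-config `overrides:` block as YAML text.

    Assumption: tenant IDs and limit names are plain strings that need
    no YAML quoting, and all limit values are numbers.
    """
    lines = ["overrides:"]
    for tenant, limits in sorted(limits_by_tenant.items()):
        lines.append(f"  {tenant}:")
        for name, value in sorted(limits.items()):
            lines.append(f"    {name}: {value}")
    return "\n".join(lines) + "\n"
```

Calling `render_overrides({"tenant-a": {"ingestion_rate": 50000, "ingestion_burst_size": 100000}})` yields a block that can be merged into the runtime configuration file. Sorting the keys keeps the output stable, so it diffs cleanly in version control.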

## When the answer isn't here

- Search recent CHANGELOG entries for the component you suspect — many subtle
bugs are documented there before they show up in an issue.
- Check [GitHub issues](https://github.com/cortexproject/cortex/issues) for the
alert name or error string; production issues are frequently filed verbatim.
- Ask in the
[#cortex Slack channel](https://cloud-native.slack.com/messages/cortex) with
the alert name, the dashboard timeframe, and a relevant log line.