cortexproject · CharlieTLe · Jun 30, 2026 · Jul 1, 2026
diff --git a/docs/operations/_index.md b/docs/operations/_index.md
@@ -4,4 +4,30 @@ linkTitle: "Operations"
 no_section_index_title: true
 weight: 8
 menu:
----
+---
+
+This section covers day-2 operation of a Cortex cluster. Start here if you are
+running Cortex in production.
+
+## Core operator guides
+
+- [Monitoring Cortex]({{< relref "./monitoring-cortex.md" >}}) — install the
+  bundled dashboards, alert rules, and recording rules.
+- [Troubleshooting]({{< relref "./troubleshooting.md" >}}) — symptom-driven
+  decision tree for the write path, read path, storage, and rings.
+- [Upgrading]({{< relref "./upgrading.md" >}}) — version-to-version upgrade
+  procedure, component ordering, and downgrade caveats.
+
+## Specialized topics
+
+- [Scaling the Query Frontend]({{< relref "./scalable-query-frontend.md" >}})
+- [Query Auditor]({{< relref "./query-auditor.md" >}}) — detect query
+  correctness regressions.
+- [Query Tee]({{< relref "./query-tee.md" >}}) — compare two Cortex deployments
+  side-by-side.
+- [Requests Mirroring with Envoy]({{< relref
+  "./requests-mirroring-to-secondary-cluster.md" >}})
+
+For component-level operational guidance (HA pairs, shuffle sharding, zone
+replication, capacity planning, encryption), see the [Guides]({{< relref
+"../guides/" >}}) section.
diff --git a/docs/operations/monitoring-cortex.md b/docs/operations/monitoring-cortex.md
@@ -0,0 +1,132 @@
+---
+title: "Monitoring Cortex"
+linkTitle: "Monitoring Cortex"
+weight: 1
+slug: monitoring-cortex
+---
+
+This page describes the bundled assets Cortex ships for monitoring a production
+deployment — Grafana dashboards, Prometheus alerting rules, and recording rules
+— and how to install them. The assets live in the repository and are kept in
+sync with the code; they are the same artifacts the Cortex maintainers use to
+operate their own clusters.
+
+## What ships with Cortex
+
+| Asset | Source | Purpose |
+|-------|--------|---------|
+| Dashboards (JSON) | `docs/getting-started/dashboards/` | Drop-in Grafana dashboards covering every Cortex component |
+| Alert rules | `docs/getting-started/alerts.yaml` | 50+ PrometheusRule alerts grouped by component |
+| Recording rules | `docs/getting-started/cortex-jsonnet/cortex-mixin/recording_rules.libsonnet` | Pre-aggregated series used by the dashboards and alerts |
+| Jsonnet mixin | `docs/getting-started/cortex-jsonnet/cortex-mixin/` | The source of truth — generates the JSON/YAML above |
+
+## Dashboards
+
+Each dashboard JSON in `docs/getting-started/dashboards/` is ready to import into
+Grafana via **Dashboards → Import → Upload JSON file**.
+
+| Dashboard | What to watch |
+|-----------|---------------|
+| `cortex-writes.json` | End-to-end write path: distributor QPS, ingestion rate, ingester push errors and latency, samples appended, WAL writes. The first dashboard to open during a write incident. |
+| `cortex-reads.json` | End-to-end read path: query QPS at the frontend, scheduler queue length, querier execution latency, store-gateway and ingester sub-queries. |
+| `cortex-queries.json` | Per-query breakdowns: chunks/series fetched, bytes processed, queries by tenant. Useful for hunting expensive queries. |
+| `cortex-slow-queries.json` | The slowest queries in the last interval, including the PromQL and the tenant. Pair with the query-frontend logs. |
+| `cortex-compactor.json` | Compactor run progress, blocks compacted vs. failed, sync errors. |
+| `cortex-compactor-resources.json` | CPU, memory, disk, and goroutines for the compactor pods. |
+| `cortex-object-store.json` | Object-store request rate, latency, and error rate broken down by operation (Get, Iter, Upload). |
+| `cortex-rollout-progress.json` | Rolling-deployment progress for stateful sets (ingester, store-gateway, compactor). |
+| `cortex-scaling.json` | Suggested replica counts derived from current load — pair with [Capacity Planning]({{< relref "../guides/capacity-planning.md" >}}). |
+| `cortex-config.json` | The runtime configuration currently in effect, by tenant. |
+| `alertmanager.json` | Alertmanager-specific: notification rate, replication, ring health. |
+| `ruler.json` | Ruler-specific: evaluation rate, missed evaluations, push and query errors. |
+
+Dashboards assume a Prometheus datasource named `Cortex`; either name your
+datasource that way or edit the dashboard variables on import. Several
+dashboards rely on the recording rules described below — install those first or
+some panels will be empty.
+
+## Alerts
+
+The bundled alerts in `docs/getting-started/alerts.yaml` are grouped by concern:
+
+| Group | Examples |
+|-------|----------|
+| `cortex_alerts` | `CortexIngesterUnhealthy`, `CortexRequestErrors`, `CortexRequestLatency`, `CortexQueriesIncorrect`, `CortexInconsistentRuntimeConfig`, `CortexKVStoreFailure`, `CortexMemoryMapAreasTooHigh` |
+| `cortex_ingester_instance_alerts` | `CortexIngesterReachingSeriesLimit`, `CortexIngesterReachingTenantsLimit`, `CortexDistributorReachingInflightPushRequestLimit` |
+| `cortex-rollout-alerts` | `CortexRolloutStuck` |
+| `cortex-provisioning` | `CortexProvisioningTooManyActiveSeries`, `CortexProvisioningTooManyWrites`, `CortexAllocatingTooMuchMemory` |
+| `ruler_alerts` | `CortexRulerTooManyFailedPushes`, `CortexRulerTooManyFailedQueries`, `CortexRulerMissedEvaluations`, `CortexRulerFailedRingCheck` |
+| `gossip_alerts` | `CortexGossipMembersMismatch` |
+| `etcd_alerts` | `EtcdAllocatingTooMuchMemory` |
+| `alertmanager_alerts` | `CortexAlertmanagerSyncConfigsFailing`, `CortexAlertmanagerRingCheckFailing`, `CortexAlertmanagerPartialStateMergeFailing`, `CortexAlertmanagerReplicationFailing`, `CortexAlertmanagerPersistStateFailing`, `CortexAlertmanagerInitialSyncFailed` |
+| `cortex_blocks_alerts` | `CortexIngesterHasNotShippedBlocks`, `CortexIngesterHasUnshippedBlocks`, `CortexIngesterTSDBHeadCompactionFailed`, `CortexIngesterTSDBWALCorrupted`, `CortexQuerierHasNotScanTheBucket`, `CortexQuerierHighRefetchRate`, `CortexStoreGatewayHasNotSyncTheBucket`, `CortexBucketIndexNotUpdated`, `CortexTenantHasPartialBlocks` |
+| `cortex_compactor_alerts` | `CortexCompactorHasNotSuccessfullyCleanedUpBlocks`, `CortexCompactorHasNotSuccessfullyRunCompaction`, `CortexCompactorHasNotUploadedBlocks` |
+
+Every alert in the file sets a `for` duration, a `severity` label, and a short
+summary in its annotations. Treat these as a starting point: tune the
+thresholds (and which alerts page vs. ticket) to your SLOs.
+
+### Installing the alerts
+
+The alerts file is a standard Prometheus rule file. In Kubernetes with the
+Prometheus Operator, wrap it in a `PrometheusRule` resource; an example lives in
+`docs/getting-started/prometheusrule.yaml`. With a self-hosted Prometheus, add
+the file to `rule_files:` in `prometheus.yml`.
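+
+As a rough sketch of both options (the file path, resource name, namespace, and
+`release` label below are assumptions, not values taken from the repository):
+
+```yaml
+# Option 1: self-hosted Prometheus, in prometheus.yml.
+# The path is hypothetical; point it at wherever you copied alerts.yaml.
+rule_files:
+  - /etc/prometheus/rules/cortex-alerts.yaml
+---
+# Option 2: Prometheus Operator. The metadata values are hypothetical; the
+# labels must match the ruleSelector of your Prometheus resource.
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: cortex-alerts
+  namespace: monitoring
+  labels:
+    release: prometheus
+spec:
+  # Copy the list under `groups:` from alerts.yaml here, unchanged.
+  groups: []
+```
+
+Running `promtool check rules` against the rule file before loading it catches
+syntax errors early.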
+
+If you also run a Cortex ruler, the same file can be loaded into Cortex itself
+via `cortextool rules load` (see [Sharded Ruler]({{< relref
+"../guides/sharded_ruler.md" >}})).
+
+## Recording rules
+
+The dashboards depend on a set of pre-aggregated metrics defined in
+`docs/getting-started/cortex-jsonnet/cortex-mixin/recording_rules.libsonnet`.
+These collapse per-instance counters into per-cluster/per-tenant rates so the
+dashboards stay fast on large deployments. Install them the same way as the
+alerts, in the same Prometheus instance that evaluates the alerts.
+
+Skipping the recording rules will leave several dashboard panels blank or
+extremely slow.
+
+## The Jsonnet mixin
+
+If you already manage Prometheus rules and dashboards via Jsonnet/Tanka, import
+`docs/getting-started/cortex-jsonnet/cortex-mixin/` directly:
+
+```jsonnet
+local cortexMixin = import 'cortex-mixin/mixin.libsonnet';
+
+{
+  prometheusAlerts+:: cortexMixin.prometheusAlerts,
+  prometheusRules+:: cortexMixin.prometheusRules,
+  grafanaDashboards+:: cortexMixin.grafanaDashboards,
+}
+```
+
+The mixin honours the standard [monitoring-mixin
+contract](https://github.com/monitoring-mixins/docs), so it composes with mixins
+for Kubernetes, etcd, Memcached, and the other dependencies a Cortex cluster
+typically runs alongside.
+
+The mixin's `_config` block exposes knobs for the datasource name, single-binary
+vs. microservices mode, namespace/cluster labels, and per-component selectors.
+See `cortex-mixin/config.libsonnet` for the full list.
+
+## Tracing
+
+Dashboards and alerts cover RED metrics (rate, errors, duration). For
+end-to-end request tracing, configure Cortex's OpenTelemetry/Jaeger exporter as
+described in [Tracing]({{< relref "../guides/tracing.md" >}}). The
+`cortex-slow-queries.json` dashboard surfaces a query ID that maps directly to a
+trace when tracing is enabled, making it easy to pivot from "this query was
+slow" to "here is where it spent its time."
+
+## Related
+
+- [Capacity Planning]({{< relref "../guides/capacity-planning.md" >}}) — sizing
+  inputs to feed the scaling dashboard.
+- [Tracing]({{< relref "../guides/tracing.md" >}}) — span exporter setup.
+- [Query Auditor]({{< relref "./query-auditor.md" >}}) — detecting query
+  correctness regressions.
+- [Query Tee]({{< relref "./query-tee.md" >}}) — comparing two Cortex
+  deployments side-by-side.
diff --git a/docs/operations/troubleshooting.md b/docs/operations/troubleshooting.md
@@ -0,0 +1,187 @@
+---
+title: "Troubleshooting Cortex"
+linkTitle: "Troubleshooting"
+weight: 2
+slug: troubleshooting
+---
+
+A decision tree for the most common production issues. Each section starts with
+the symptom an operator sees, names the metrics and logs to inspect, and points
+to the upstream fix.
+
+The [bundled dashboards and alerts]({{< relref "./monitoring-cortex.md" >}})
+surface most of the signals referenced below. Install them first if you have
+not already.
+
+## Write path
+
+### Distributors return 5xx on `/api/v1/push`
+
+1. **Confirm where the error originates.** Distributor logs include the cause:
+   ingester unreachable, rate-limit exceeded, validation error. Filter for
+   `level=warn` and `level=error` on the distributor.
+2. **Check ingester health on the ring page** (`/ring` on any distributor). All
+   ingesters should be in state `ACTIVE`. `UNHEALTHY` or missing ingesters
+   point at a partition between distributor and ingester, or at the KV store.
+3. **Check the `CortexIngesterUnhealthy` alert.** If it is firing, follow it:
+   the offending ingester is in the alert's labels.
+4. **Inspect `cortex_distributor_ingester_append_failures_total`.** A non-zero
+   rate that matches the 5xx rate confirms ingester-side rejection.
+
+If the cause is `per-user limit exceeded`, raise the limit in `runtime_config`
+([Overrides]({{< relref "../guides/overrides.md" >}})) rather than scaling out.
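+
+A minimal sketch of such an override, assuming a tenant ID of `tenant-a` and
+illustrative values (size them from your own traffic):
+
+```yaml
+# runtime_config file, re-read periodically by Cortex.
+overrides:
+  tenant-a:                              # hypothetical tenant ID
+    ingestion_rate: 350000               # samples per second; illustrative value
+    ingestion_burst_size: 3500000        # illustrative value
+    max_global_series_per_user: 3000000  # illustrative value
+```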
+
+### Samples are accepted but never appear in queries
+
+1. **Verify the tenant header.** The push and the query must use the same
+   `X-Scope-OrgID`. The single most common cause of "missing data" is a
+   tenant-ID mismatch.
+2. **Check `cortex_ingester_memory_series` on the receiving ingester.** If
+   non-zero for the tenant, the data is in memory and queries should see it.
+3. **Confirm time-range overlap.** Ingesters serve recent data from the TSDB
+   head and from local on-disk blocks until they age out per
+   `-blocks-storage.tsdb.retention-period` (default `6h`). Queriers stop
+   consulting ingesters entirely for time ranges older than
+   `-limits.query-ingesters-within` (per-tenant, when set). Older data must
+   have been shipped and must be visible to the store-gateway via the bucket
+   index — check `cortex_ingester_shipper_uploads_total`, the
+   `CortexIngesterHasNotShippedBlocks` alert, and
+   `CortexBucketIndexNotUpdated`.
+
+### Distributor `inflight push requests` rejected
+
+The `CortexDistributorReachingInflightPushRequestLimit` alert fires when
+distributors near `-distributor.instance-limits.max-inflight-push-requests`.
+Either scale distributors horizontally or raise the limit if CPU and memory
+have headroom.
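+
+If you raise the limit, the YAML form looks roughly like this (the value is
+illustrative, not a recommendation):
+
+```yaml
+distributor:
+  instance_limits:
+    # YAML equivalent of -distributor.instance-limits.max-inflight-push-requests.
+    max_inflight_push_requests: 3000   # illustrative value
+```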
+
+## Read path
+
+### Queries time out at the frontend
+
+1. **Look at `cortex-reads.json` and `cortex-slow-queries.json`.** They show
+   queue depth, per-step latency, and the offending PromQL.
+2. **If the frontend queue is full** (`CortexFrontendQueriesStuck` or
+   `CortexSchedulerQueriesStuck` is firing), there are not enough queriers, or
+   queriers are blocked on something downstream. Check querier CPU, then
+   ingester and store-gateway latency.
+3. **If the queue is empty but queries are still slow,** the bottleneck is in
+   the querier or below. Look at chunks fetched per query and bytes scanned;
+   an expensive query may need the protections in [Protecting Cortex from
+   Heavy Queries]({{< relref "../guides/protecting-cortex-from-heavy-queries.md"
+   >}}).
+
+### Queries return partial or no data for old time ranges
+
+Old data lives in object storage and is served by the store-gateway. Check:
+
+- `CortexStoreGatewayHasNotSyncTheBucket` — a stale store-gateway will not see
+  recently uploaded blocks.
+- `CortexBucketIndexNotUpdated` — the compactor maintains the bucket index;
+  querier and store-gateway use it to discover blocks.
+- `CortexQuerierHighRefetchRate` — symptom of store-gateways missing blocks
+  the querier expected to find.
+
+### Queries return incorrect results
+
+`CortexQueriesIncorrect` fires when the same query, run through the query-tee
+against two backends, returns different results. Cortex ships a [Query
+Auditor]({{< relref "./query-auditor.md" >}}) for this case; pair it with the
+[Query Tee]({{< relref "./query-tee.md" >}}) to bisect which deployment is
+wrong.
+
+## Storage path
+
+### Ingester is not shipping blocks
+
+The `CortexIngesterHasNotShippedBlocks` and `CortexIngesterHasUnshippedBlocks`
+alerts catch this. Common causes:
+
+- Object-store credentials are misconfigured: check the ingester logs for
+  `403`/`AccessDenied`.
+- A new block has not been cut yet. Ingesters cut blocks every
+  `-blocks-storage.tsdb.block-ranges-period` (default `2h`); a recently
+  started ingester has nothing to ship until the first block-range elapses.
+- Disk pressure: check `cortex_ingester_tsdb_*` metrics and pod disk usage.
+
+### TSDB head compaction or WAL errors
+
+`CortexIngesterTSDBHeadCompactionFailed`, `CortexIngesterTSDBWALCorrupted`, and
+`CortexIngesterTSDBWALWritesFailed` indicate disk-level problems. Treat the
+affected ingester as a failed replica: cordon it, let traffic move to the
+other replicas in the ring, then restore from a healthy ingester or replay
+the WAL on a fresh volume. Do **not** restart in place if the WAL is corrupt —
+you will lose the in-memory series.
+
+### Compactor falls behind
+
+`CortexCompactorHasNotSuccessfullyRunCompaction` means recent blocks are
+piling up and queries will get slower over time. Check:
+
+- Compactor CPU and memory headroom — compaction is CPU-bound.
+- Object-store latency on the compactor (it does a lot of small reads/writes).
+- The `cortex-compactor.json` dashboard for per-tenant progress.
+
+See [Partitioning Compactor]({{< relref "../guides/partitioning-compactor.md"
+>}}) for scaling out.
+
+## Hash ring and KV store
+
+### `CortexKVStoreFailure` is firing
+
+The component named in the alert cannot reach the KV store backend (Consul,
+etcd, or memberlist). Steps:
+
+1. From an affected pod, hit the KV backend's health endpoint directly.
+2. If the backend is up, look for network policy or DNS changes since the alert
+   started.
+3. With memberlist, check `cortex_memberlist_client_messages_received_total`
+   and `cortex_memberlist_client_messages_sent_total` on each pod; a partition
+   shows up as one-sided traffic.
+
+### Ingesters keep joining and leaving the ring
+
+`CortexGossipMembersMismatch` indicates members disagree on cluster membership.
+This is almost always a misconfigured `join_members:` list (some pods do not
+list a bootstrap peer that resolves) or a packet-loss issue between zones.
+[Gossip Ring Getting Started]({{< relref "../guides/gossip-ring-getting-started.md"
+>}}) walks through the canonical configuration.
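+
+A minimal sketch of a consistent setup, assuming a headless Kubernetes Service
+named `cortex-gossip-ring` in namespace `cortex` that selects every ring member
+(both names are hypothetical):
+
+```yaml
+memberlist:
+  bind_port: 7946
+  join_members:
+    # Every pod must be able to resolve this name. It should return the IPs of
+    # all ring members so that any reachable peer can bootstrap the join.
+    - cortex-gossip-ring.cortex.svc.cluster.local:7946
+```
+
+Using the same `join_members` value on every component avoids the case where
+some pods list a peer that does not resolve.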
+
+## Alertmanager
+
+`CortexAlertmanagerSyncConfigsFailing`, `CortexAlertmanagerReplicationFailing`,
+and the `*Persist*` / `*InitialSync*` alerts trace to the Alertmanager's
+storage backend or its peer replication. Inspect the alertmanager logs for the
+specific operation that failed; the alert annotations include the storage
+endpoint that returned the error.
+
+## Ruler
+
+A spike in `CortexRulerMissedEvaluations` typically means a ruler tenant has
+too many rules for the assigned shards. Either shard more aggressively (see
+[Sharded Ruler]({{< relref "../guides/sharded_ruler.md" >}})) or move
+heavy-evaluation tenants to the
+[query-frontend-backed rule evaluation path]({{< relref
+"../guides/rule-evaluations-via-query-frontend.md" >}}) so they share the
+query path's capacity rather than the ruler's local one.
+
+## Multi-tenant noisy-neighbour
+
+If one tenant is degrading the cluster for everyone:
+
+1. Use `cortex-queries.json` filtered by tenant to confirm the source.
+2. Apply tenant-specific limits via `runtime_config` ([Overrides]({{< relref
+   "../guides/overrides.md" >}})). Limits take effect within seconds — no
+   restart needed.
+3. For longer-term isolation, move the tenant to its own shuffle shard
+   ([Shuffle Sharding]({{< relref "../guides/shuffle-sharding.md" >}})).
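+
+Steps 2 and 3 can be sketched as a single `runtime_config` entry, assuming a
+noisy tenant with ID `tenant-b`. All values are illustrative, and the shard
+sizes only take effect once shuffle sharding is enabled on the relevant
+components, as described in the Shuffle Sharding guide:
+
+```yaml
+overrides:
+  tenant-b:                                # hypothetical tenant ID
+    # Step 2: cap what a single query may fetch.
+    max_fetched_series_per_query: 100000   # illustrative value
+    max_fetched_chunks_per_query: 2000000  # illustrative value
+    # Step 3: confine the tenant to a subset of instances.
+    ingestion_tenant_shard_size: 3         # illustrative value
+    max_queriers_per_tenant: 5             # illustrative value
+    store_gateway_tenant_shard_size: 3     # illustrative value
+```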
+
+## When the answer isn't here
+
+- Search recent CHANGELOG entries for the component you suspect — many subtle
+  bugs are documented there before they show up in an issue.
+- Check [GitHub issues](https://github.com/cortexproject/cortex/issues) for the
+  alert name or error string; production issues are frequently filed verbatim.
+- Ask in the
+  [#cortex Slack channel](https://cloud-native.slack.com/messages/cortex) with
+  the alert name, the dashboard timeframe, and a relevant log line.