<!-- generated: do not edit manually! -->
<!-- see scripts/README.gotmpl -->

# Coder Observability Chart

> [!NOTE]
> This Helm chart is in BETA; use with caution

## Overview

This chart contains a highly opinionated set of integrations between Grafana, Loki, Prometheus, Alertmanager, and
Grafana Agent.

Dashboards, alerts, and runbooks are preconfigured for monitoring [Coder](https://coder.com/) installations.

Out of the box:

- Metrics are scraped from all pods that have a `prometheus.io/scrape=true` annotation.
- Logs are scraped from all pods in the Kubernetes cluster.
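
For your own workloads, the annotation has to be on the pod itself, so on a Deployment it belongs in the pod template rather
than in the Deployment's top-level metadata. A minimal sketch; the name `example-app`, the image, and port `8080` are
hypothetical, and `prometheus.io/port` (used for Coder below) is assumed to select the port that gets scraped:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app # hypothetical workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      # Annotations go on the pod template, not the Deployment's own metadata,
      # because metrics are scraped from individual pods.
      annotations:
        prometheus.io/scrape: "true"
        # Assumption: the port on which the container serves its metrics endpoint.
        prometheus.io/port: "8080"
    spec:
      containers:
        - name: app
          image: example.com/example-app:latest # hypothetical image
          ports:
            - containerPort: 8080
```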

## Installation

```bash
helm repo add coder-observability https://helm.coder.com/observability
helm upgrade --install coder-observability coder-observability/coder-observability --version {{ template "chart.version" . }} --namespace coder-observability --create-namespace
```

## Requirements

### General

- Helm 3.7+

### Coder

<details open>
<summary>Kubernetes-based deployments</summary>

If your installation is not in a namespace named `coder`, set the following in your `values.yaml`:

```yaml
global:
  coder:
    controlPlaneNamespace: <your namespace>
    externalProvisionersNamespace: <your namespace>
```

</details>

<details>
<summary>Non-Kubernetes deployments (click to expand)</summary>

Ensure your Coder installation is accessible to the resources created by this chart.

Set `global.coder.scrapeMetrics` so that metrics can be scraped from your installation, for example:

```yaml
global:
  coder:
    scrapeMetrics:
      hostname: your.coder.host
      port: 2112
      scrapeInterval: 15s
      additionalLabels:
        job: coder
```

To scrape logs from a process running outside Kubernetes, mount the log file(s) into the Grafana Agent pods and
configure Grafana Agent to read them. Here is an example configuration:

```yaml
grafana-agent:
  agent:
    mounts:
      extra:
        - mountPath: /var/log
          name: logs
          readOnly: true
  controller:
    volumes:
      extra:
        - hostPath:
            path: /var/log
          name: logs

  extraBlocks: |-
    loki.source.file "coder_log" {
      targets    = [
        {__path__ = "/var/log/coder.log", job="coder"},
      ]
      forward_to = [loki.write.loki.receiver]
    }
```

</details>

Ensure these environment variables are set in your Coder deployment:

- `CODER_PROMETHEUS_ENABLE=true`
- `CODER_PROMETHEUS_COLLECT_AGENT_STATS=true`
- `CODER_LOGGING_HUMAN=/dev/stderr` (only the `human` log format is currently supported; see
  [issue #8](https://github.com/coder/observability/issues/8))
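
If you deploy Coder with the `coder/coder` Helm chart, one way to set these variables is through its `coder.env` list.
A minimal sketch, assuming `coder.env` takes standard Kubernetes `name`/`value` entries; merge it with any variables you
already set there:

```yaml
coder:
  env:
    # Expose Coder's Prometheus metrics endpoint.
    - name: CODER_PROMETHEUS_ENABLE
      value: "true"
    # Include workspace agent statistics in the exported metrics.
    - name: CODER_PROMETHEUS_COLLECT_AGENT_STATS
      value: "true"
    # Emit human-format logs on stderr so they appear in the pod logs that get scraped.
    - name: CODER_LOGGING_HUMAN
      value: "/dev/stderr"
```

If you change `CODER_PROMETHEUS_ADDRESS`, keep its port consistent with the `prometheus.io/port` annotation described
below.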

Ensure these annotations exist on the pods of your Coder and provisioner deployments:

- `prometheus.io/scrape=true`
- `prometheus.io/port=2112` (ensure this matches the port defined by `CODER_PROMETHEUS_ADDRESS`)

If you use the [`coder/coder` Helm chart](https://github.com/coder/coder/tree/main/helm), you can set these annotations
as follows:

```yaml
coder:
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "2112"
```

For more details, see the
[Coder documentation on exposing Prometheus metrics](https://coder.com/docs/v2/latest/admin/prometheus).

### Postgres

You may configure the Helm chart to monitor your Coder deployment's Postgres server. Ensure that the resources created
by this Helm chart can access your Postgres server.

Create a secret with your Postgres password and reference it as follows, along with the other connection details:

```yaml
global:
  postgres:
    hostname: <your postgres server host>
    port: <postgres port>
    database: <coder database>
    username: <database username>
    mountSecret: <your secret name here>
```
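
For illustration, a filled-in version might look like this. Every value is hypothetical: an in-cluster Postgres
Service named `postgres` in the `coder` namespace, a database and user both named `coder`, and a secret named
`pg-secret` like the one shown below:

```yaml
global:
  postgres:
    hostname: postgres.coder.svc.cluster.local # hypothetical in-cluster Service DNS name
    port: 5432 # default Postgres port
    database: coder
    username: coder
    mountSecret: pg-secret # name of the Secret holding PGPASSWORD
```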

The secret must contain a `PGPASSWORD` key holding your password, because the secret is used to create a `PGPASSWORD`
environment variable:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: pg-secret
  namespace: coder-observability
data:
  PGPASSWORD: <base64-encoded password>
```
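
Values under `data` must be base64-encoded. Alternatively, Kubernetes accepts the plaintext value under `stringData`
and encodes it when the secret is written. A sketch using the same secret name and namespace as above; the password is
a placeholder:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: pg-secret # must match global.postgres.mountSecret
  namespace: coder-observability # the namespace this chart is installed into
type: Opaque
stringData:
  # Plaintext here; the API server stores it base64-encoded under `data`.
  PGPASSWORD: "<your password>"
```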

<details>
<summary>Postgres metrics (click to expand)</summary>

[`postgres-exporter`](https://github.com/prometheus-community/postgres_exporter) scrapes metrics from your Postgres
server. To inspect the metrics it exposes, run:

```bash
kubectl -n coder-observability port-forward statefulset/postgres-exporter 9187

curl http://localhost:9187/metrics
```

</details>

### Grafana

To access Grafana, run:

```bash
kubectl -n coder-observability port-forward svc/grafana 3000:80
```

Then open http://localhost:3000/ in your web browser.

By default, Grafana is configured to allow anonymous access; if you want password authentication, define this in
your `values.yaml`:

```yaml
grafana:
  admin:
    existingSecret: grafana-admin
    userKey: username
    passwordKey: password
  grafana.ini:
    auth.anonymous:
      enabled: false
```

You will also need to define a secret as follows:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin # this matches the "existingSecret" field above
stringData:
  username: "<your username>" # this matches the "userKey" field above
  password: "<your password>" # this matches the "passwordKey" field above
```

To expose Grafana through an Ingress, define this in your `values.yaml` (with these settings, Grafana is served under
the `/grafana` sub-path):

```yaml
grafana:
  grafana.ini:
    server:
      domain: observability.example.com
      root_url: "%(protocol)s://%(domain)s/grafana"
      serve_from_sub_path: true
  ingress:
    enabled: true
    hosts:
      - "observability.example.com"
    path: "/"
```
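
If the Ingress also terminates TLS, a sketch along these lines may work. It assumes the Grafana subchart's
`ingress.tls` list (secret name plus hosts) and a pre-existing TLS secret named `observability-tls` (hypothetical).
Because `%(protocol)s` expands to Grafana's own serving protocol, which stays `http` when TLS ends at the Ingress,
`root_url` spells out `https` explicitly:

```yaml
grafana:
  grafana.ini:
    server:
      domain: observability.example.com
      # Hardcode https: Grafana itself still serves plain HTTP behind the Ingress.
      root_url: "https://%(domain)s/grafana"
      serve_from_sub_path: true
  ingress:
    enabled: true
    hosts:
      - "observability.example.com"
    path: "/"
    tls:
      - secretName: observability-tls # hypothetical secret containing tls.crt and tls.key
        hosts:
          - "observability.example.com"
```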

### Prometheus

To access Prometheus, run:

```bash
kubectl -n coder-observability port-forward svc/prometheus 9090:80
```

Then open http://localhost:9090/graph in your web browser.

#### Native Histograms

Native histograms are an **experimental** Prometheus feature that removes the need to predefine bucket boundaries and instead provides higher-resolution, adaptive buckets (see the [Prometheus docs](https://prometheus.io/docs/specs/native_histograms/) for details).

Unlike classic histograms, which can be exposed in the plain-text format, **native histograms require the protobuf protocol**.
Because Grafana Agent scrapes the metrics and remote-writes them to Prometheus, Prometheus must run with native histogram support, and Grafana Agent must negotiate protobuf when scraping and send native histograms over remote write.
Native histograms are **disabled by default**; when you enable them globally, the Helm chart automatically updates the Grafana Agent configuration accordingly.

To enable native histograms, define this in your `values.yaml`:

```yaml
global:
  telemetry:
    metrics:
      native_histograms: true

prometheus:
  server:
    extraFlags:
      - web.enable-lifecycle
      - enable-feature=remote-write-receiver
      - enable-feature=native-histograms
```

After updating the values, you may need to restart Grafana Agent so that it picks up the new configuration:

```bash
kubectl -n coder-observability rollout restart daemonset/grafana-agent
```

⚠️ **Important**: Classic and native histograms cannot be aggregated together.
If you switch from classic to native histograms, dashboards may need to account for the transition. See [Prometheus migration guidelines](https://prometheus.io/docs/specs/native_histograms/#migration-considerations) for details.
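
To illustrate the difference, the rule group below computes the same 95th percentile from each representation of
`coderd_workspace_creation_duration_seconds` (the example metric used in the validation steps). It is written in the
standard Prometheus rule-file format; the group and record names are hypothetical, and how you load the file (for
example through the Prometheus subchart's `serverFiles`) depends on your setup:

```yaml
groups:
  - name: coder-workspace-creation # hypothetical group name
    rules:
      # Classic histogram: aggregate the per-bucket `_bucket` series, keeping the `le` label.
      - record: coder:workspace_creation_duration_seconds:p95_classic
        expr: histogram_quantile(0.95, sum by (le) (rate(coderd_workspace_creation_duration_seconds_bucket[5m])))
      # Native histogram: the bucket layout is stored in each sample, so no `le` label is needed.
      - record: coder:workspace_creation_duration_seconds:p95_native
        expr: histogram_quantile(0.95, sum(rate(coderd_workspace_creation_duration_seconds[5m])))
```

Over a time range that spans the switch, each query only covers the period recorded in its own format, so a dashboard
panel may need both.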

<details>
<summary>Validate Prometheus Native Histograms</summary>

1) Check the Prometheus flags:

   Open http://localhost:9090/flags and confirm that `--enable-feature` includes `native-histograms`.

2) Inspect histogram metrics:

   * Classic histograms expose series with the `_bucket`, `_sum`, and `_count` suffixes.
   * Native histograms are exposed directly under the metric name.
   * Example: query `coderd_workspace_creation_duration_seconds` in http://localhost:9090/graph.

3) Check Grafana Agent (if remote write is enabled):

   To confirm, run:

   ```bash
   kubectl -n coder-observability port-forward svc/grafana-agent 3030:80
   ```

   Then open http://localhost:3030 and check that:

   * the scrape configuration defined in `prometheus.scrape.cadvisor` has `enable_protobuf_negotiation: true`
   * the remote write configuration defined in `prometheus.remote_write.default` has `send_native_histograms: true`

</details>

## Subcharts

{{ template "chart.requirementsTable" . }}

Each subchart can be disabled by setting the `enabled` field to `false`.

| Subchart        | Setting                 |
|-----------------|-------------------------|
| `grafana`       | `grafana.enabled`       |
| `grafana-agent` | `grafana-agent.enabled` |
| `loki`          | `loki.enabled`          |
| `prometheus`    | `prometheus.enabled`    |
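
For instance, if you already run Grafana elsewhere, a minimal sketch for skipping the bundled instance would be:

```yaml
grafana:
  # Skip deploying the bundled Grafana subchart; the other subcharts are still installed.
  enabled: false
```

Other components may rely on the one you disable (the preconfigured dashboards, for example, are meant to be served by
the bundled Grafana), so check what you lose before turning a subchart off.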

## Values

The `global` values are the values which pertain to this chart, while the rest pertain to the subcharts.
These values represent only the values _set_ in this chart. For the full list of available values, please see each
subchart.

For example, this chart sets `grafana.replicas` by default; it is one of hundreds of values available in the Grafana
subchart, which are listed in its [configuration reference](https://github.com/grafana/helm-charts/tree/main/charts/grafana#configuration).
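
As a sketch of how the two kinds of values sit side by side in one `values.yaml`: `global` keys pertain to this chart,
while each top-level subchart key is passed through to that subchart. `prometheus.server.retention` is assumed here from
the upstream prometheus-community chart's values, and `15d` is only an example:

```yaml
# Pertains to this chart.
global:
  coder:
    controlPlaneNamespace: coder
# Passed through to the prometheus subchart; see that chart's own values reference.
prometheus:
  server:
    retention: 15d # example value: how long Prometheus keeps data
```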

{{ template "chart.valuesTable" . }}

{{ template "helm-docs.versionFooter" . }}