From 7e0ff93fea51a24f0c327391fc85008c923473ca Mon Sep 17 00:00:00 2001 From: Kat Batuigas Date: Wed, 13 May 2026 20:39:54 -0700 Subject: [PATCH 1/5] Start OOM doc draft --- .../pages/troubleshoot/memory-management.adoc | 53 +++++++++++++++++++ 1 file changed, 53 insertions(+) create mode 100644 modules/sql/pages/troubleshoot/memory-management.adoc diff --git a/modules/sql/pages/troubleshoot/memory-management.adoc b/modules/sql/pages/troubleshoot/memory-management.adoc new file mode 100644 index 000000000..7cf7d6791 --- /dev/null +++ b/modules/sql/pages/troubleshoot/memory-management.adoc @@ -0,0 +1,53 @@ += Troubleshoot memory-related query cancellations +:description: Recover from query cancellations triggered by Redpanda SQL's automatic out-of-memory protection. +:page-topic-type: how-to + +// TODO: SME — confirm page title and nav label. Now that the page is symptom-led troubleshooting, the previous "Memory management" framing is too broad. +// Options: +// "Troubleshoot memory-related query cancellations" (current; matches Troubleshoot section voice) +// "Recover from OOM cancellation" (concise; uses internal term) +// Keep "Memory management" (matches current nav label but doesn't signal action) + +Redpanda SQL automatically cancels running queries on a node when the node's memory usage approaches its configured limit. If your application sees the following error, your queries have hit this protection: + +[source,text] +---- +cancelled due to OOM prevention +---- + +// TODO: SME — confirm the exact client-facing error envelope. The string above is the error reason raised internally by the engine. Clients connecting through `psql` or a PostgreSQL driver typically receive it wrapped in a PostgreSQL error message. Confirm: +// - Is a SQLSTATE code set on this error? If so, which one? +// - Does the message reach the client verbatim, or is the wording different? + +Only queries running on the affected node at the time of reclamation are cancelled. 
Other nodes in the cluster continue to serve queries. The node resumes accepting new queries immediately after reclamation completes, so in most cases you can retry the failed query and it succeeds. + +== If the error keeps happening + +If queries are repeatedly cancelled with this error, the workload is consistently pressing a node against its memory limit. + +// TODO: SME — runbook depth. Confirm which of the following actions to recommend, and in what order. Suggested guidance to validate: +// - Reduce query concurrency on the affected workload. +// - Simplify the query — narrow the scan range, add filters, reduce parallel CTEs. +// - Scale up the cluster. +// Also confirm: is there a heuristic for choosing among them (for example, look at oxla_process_memory_total over time)? + +== Why this happens + +Redpanda SQL monitors each node's resident memory usage and triggers a brief reclamation phase when the node approaches its memory limit. During reclamation, the node cancels its running queries and frees memory so it can keep serving new queries. The protection runs on each node independently and is always on. There is no configuration option to enable, disable, or tune it. + +// TODO: SME — confirm whether `memory.max` and `memory.max_non_query` are exposed through the BYOC layer at GA. Per OXLA-9109, the configurable threshold was descoped before ship. If neither is exposed to users (even via support), this section stands as-is. If either is reachable (for example via a support-only path), note it here so users understand what controls exist. + +== Monitor memory usage + +Use the following Prometheus gauge to track each node's resident memory and watch for sustained growth toward the node's limit: + +[cols="1,3"] +|=== +| Metric | Description + +| `oxla_process_memory_total` +| Process Resident Set Size (RSS) in bytes, reported per node. +|=== + +// TODO: Once the Redpanda SQL metrics are finalized, verify where they should be documented. 
+ From 76f83485c9e5fcccdc0b348717f441bc6beba89c Mon Sep 17 00:00:00 2001 From: Kat Batuigas Date: Tue, 19 May 2026 17:22:42 -0700 Subject: [PATCH 2/5] Review pass --- modules/ROOT/nav.adoc | 2 +- .../pages/troubleshoot/memory-management.adoc | 16 ++++++++++------ 2 files changed, 11 insertions(+), 7 deletions(-) diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc index 142049a66..db5de0d33 100644 --- a/modules/ROOT/nav.adoc +++ b/modules/ROOT/nav.adoc @@ -358,7 +358,7 @@ ** xref:sql:manage/index.adoc[Manage Redpanda SQL] ** xref:sql:troubleshoot/index.adoc[Troubleshoot] *** xref:sql:troubleshoot/degraded-state-handling.adoc[] -*** xref:sql:troubleshoot/memory-management.adoc[Memory Management] +*** xref:sql:troubleshoot/memory-management.adoc[OOM cancellations] * xref:develop:index.adoc[Develop] ** xref:develop:kafka-clients.adoc[] diff --git a/modules/sql/pages/troubleshoot/memory-management.adoc b/modules/sql/pages/troubleshoot/memory-management.adoc index 7cf7d6791..9694abc2b 100644 --- a/modules/sql/pages/troubleshoot/memory-management.adoc +++ b/modules/sql/pages/troubleshoot/memory-management.adoc @@ -1,6 +1,7 @@ = Troubleshoot memory-related query cancellations -:description: Recover from query cancellations triggered by Redpanda SQL's automatic out-of-memory protection. +:description: Recover from query cancellations triggered by Redpanda SQL's automatic OOM prevention. :page-topic-type: how-to +:personas: platform_admin, data_engineer // TODO: SME — confirm page title and nav label. Now that the page is symptom-led troubleshooting, the previous "Memory management" framing is too broad. // Options: @@ -8,7 +9,7 @@ // "Recover from OOM cancellation" (concise; uses internal term) // Keep "Memory management" (matches current nav label but doesn't signal action) -Redpanda SQL automatically cancels running queries on a node when the node's memory usage approaches its configured limit. 
If your application sees the following error, your queries have hit this protection: +Redpanda SQL automatically cancels running queries on a node when the node's memory usage approaches its memory limit. If your application sees the following error, your queries have hit this protection: [source,text] ---- @@ -23,7 +24,7 @@ Only queries running on the affected node at the time of reclamation are cancell == If the error keeps happening -If queries are repeatedly cancelled with this error, the workload is consistently pressing a node against its memory limit. +If queries are repeatedly cancelled with this error, the workload is consistently approaching the node's memory limit. // TODO: SME — runbook depth. Confirm which of the following actions to recommend, and in what order. Suggested guidance to validate: // - Reduce query concurrency on the affected workload. @@ -35,8 +36,6 @@ If queries are repeatedly cancelled with this error, the workload is consistentl Redpanda SQL monitors each node's resident memory usage and triggers a brief reclamation phase when the node approaches its memory limit. During reclamation, the node cancels its running queries and frees memory so it can keep serving new queries. The protection runs on each node independently and is always on. There is no configuration option to enable, disable, or tune it. -// TODO: SME — confirm whether `memory.max` and `memory.max_non_query` are exposed through the BYOC layer at GA. Per OXLA-9109, the configurable threshold was descoped before ship. If neither is exposed to users (even via support), this section stands as-is. If either is reachable (for example via a support-only path), note it here so users understand what controls exist. 
- == Monitor memory usage Use the following Prometheus gauge to track each node's resident memory and watch for sustained growth toward the node's limit: @@ -49,5 +48,10 @@ Use the following Prometheus gauge to track each node's resident memory and watc | Process Resident Set Size (RSS) in bytes, reported per node. |=== -// TODO: Once the Redpanda SQL metrics are finalized, verify where they should be documented. +// TODO: Once the Redpanda SQL metrics are finalized, verify where they should be documented and add a cross-link from "Suggested reading" below to that page. + +== Suggested reading + +* xref:reference:sql/sql-statements/show-execs.adoc[SHOW EXECS]: inspect currently running queries on the cluster. +* xref:reference:sql/sql-statements/show-nodes.adoc[SHOW NODES]: list the SQL engine's nodes and their state. From 5bb851156731d33ee6d82597bf7fd5caa2ecfc45 Mon Sep 17 00:00:00 2001 From: Kat Batuigas Date: Tue, 19 May 2026 17:26:45 -0700 Subject: [PATCH 3/5] Capitalization --- modules/ROOT/nav.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc index db5de0d33..ed74db7ac 100644 --- a/modules/ROOT/nav.adoc +++ b/modules/ROOT/nav.adoc @@ -358,7 +358,7 @@ ** xref:sql:manage/index.adoc[Manage Redpanda SQL] ** xref:sql:troubleshoot/index.adoc[Troubleshoot] *** xref:sql:troubleshoot/degraded-state-handling.adoc[] -*** xref:sql:troubleshoot/memory-management.adoc[OOM cancellations] +*** xref:sql:troubleshoot/memory-management.adoc[OOM Cancellations] * xref:develop:index.adoc[Develop] ** xref:develop:kafka-clients.adoc[] From f7291d220e4e4a11653d47ae05cbdc185042822b Mon Sep 17 00:00:00 2001 From: Kat Batuigas Date: Tue, 19 May 2026 17:28:21 -0700 Subject: [PATCH 4/5] Rename file --- modules/ROOT/nav.adoc | 2 +- .../{memory-management.adoc => oom-cancellations.adoc} | 0 2 files changed, 1 insertion(+), 1 deletion(-) rename modules/sql/pages/troubleshoot/{memory-management.adoc => oom-cancellations.adoc} 
(100%) diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc index ed74db7ac..7f10aa408 100644 --- a/modules/ROOT/nav.adoc +++ b/modules/ROOT/nav.adoc @@ -358,7 +358,7 @@ ** xref:sql:manage/index.adoc[Manage Redpanda SQL] ** xref:sql:troubleshoot/index.adoc[Troubleshoot] *** xref:sql:troubleshoot/degraded-state-handling.adoc[] -*** xref:sql:troubleshoot/memory-management.adoc[OOM Cancellations] +*** xref:sql:troubleshoot/oom-cancellations.adoc[OOM Cancellations] * xref:develop:index.adoc[Develop] ** xref:develop:kafka-clients.adoc[] diff --git a/modules/sql/pages/troubleshoot/memory-management.adoc b/modules/sql/pages/troubleshoot/oom-cancellations.adoc similarity index 100% rename from modules/sql/pages/troubleshoot/memory-management.adoc rename to modules/sql/pages/troubleshoot/oom-cancellations.adoc From f63c3df74ec728d60d218d0b01b5b0c72c01b12a Mon Sep 17 00:00:00 2001 From: Kat Batuigas Date: Tue, 19 May 2026 17:54:08 -0700 Subject: [PATCH 5/5] Update page attributes --- .../sql/pages/troubleshoot/oom-cancellations.adoc | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/modules/sql/pages/troubleshoot/oom-cancellations.adoc b/modules/sql/pages/troubleshoot/oom-cancellations.adoc index 9694abc2b..7298c3078 100644 --- a/modules/sql/pages/troubleshoot/oom-cancellations.adoc +++ b/modules/sql/pages/troubleshoot/oom-cancellations.adoc @@ -1,7 +1,10 @@ -= Troubleshoot memory-related query cancellations += Troubleshoot Memory-related Query Cancellations :description: Recover from query cancellations triggered by Redpanda SQL's automatic OOM prevention. -:page-topic-type: how-to +:page-topic-type: troubleshooting :personas: platform_admin, data_engineer +:learning-objective-1: Identify when query cancellations are caused by OOM prevention +:learning-objective-2: Recover from OOM-cancelled queries +:learning-objective-3: Monitor node memory usage to anticipate cancellations // TODO: SME — confirm page title and nav label. 
Now that the page is symptom-led troubleshooting, the previous "Memory management" framing is too broad. // Options: @@ -22,6 +25,12 @@ cancelled due to OOM prevention Only queries running on the affected node at the time of reclamation are cancelled. Other nodes in the cluster continue to serve queries. The node resumes accepting new queries immediately after reclamation completes, so in most cases you can retry the failed query and it succeeds. +Use this page to: + +* [ ] {learning-objective-1} +* [ ] {learning-objective-2} +* [ ] {learning-objective-3} + == If the error keeps happening If queries are repeatedly cancelled with this error, the workload is consistently approaching the node's memory limit.
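The retry guidance on this page (retry the failed query, and treat repeated cancellations as a sign of sustained memory pressure) could be sketched on the client side as follows. This is a minimal illustration under stated assumptions, not a confirmed client pattern: `run_with_retry`, `is_oom_cancellation`, and the `run_query` callable are hypothetical names, and matching on the message text is a placeholder because the exact client-facing error envelope (including any SQLSTATE) is still an open question in the draft.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Error reason quoted in the page. Assumption: it reaches the client somewhere
# in the exception text; the exact client-facing wording is unconfirmed.
OOM_MARKER = "cancelled due to OOM prevention"


def is_oom_cancellation(exc: Exception) -> bool:
    """Return True if the exception text contains the OOM-prevention reason."""
    return OOM_MARKER in str(exc)


def run_with_retry(
    run_query: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run run_query, retrying only when it fails with an OOM cancellation.

    Hypothetical helper: run_query is any zero-argument callable that executes
    the statement through your driver and raises an exception on failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except Exception as exc:
            # Re-raise unrelated errors immediately, and OOM cancellations
            # once the attempt budget is used up.
            if not is_oom_cancellation(exc) or attempt == max_attempts:
                raise
            # Exponential backoff: base, 2 * base, 4 * base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Capping the number of attempts is deliberate: if the final attempt still fails with this error, the exception propagates so the caller can treat it as the repeated-cancellation case rather than retrying indefinitely.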