
Improve memory estimates in cloud storage #29773

Open

ballard26 wants to merge 3 commits into redpanda-data:dev from ballard26:cs-imprv-mem-est

Conversation

@ballard26
Contributor

@ballard26 ballard26 commented Mar 8, 2026

Previously, the per-segment memory estimate was sizeof(remote_segment) + 4KB, which significantly underestimated actual usage. The biggest missing contributor was the hydration wait lists in both the segment and its chunks: each can cache one free chunk (~20KB) even when empty. Assuming at least one segment wait list and one chunk wait list per segment, this leaves ~40KB of memory usage unaccounted for. In workloads with many small segments, this allowed Redpanda to materialize far more segments than memory could support, resulting in an OOM.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

The memory semaphore that limits materialized remote segments was
significantly underestimating per-segment memory usage. The projection
only accounted for sizeof(remote_segment) + 4KB for the index, missing
coroutine frames, heap-allocated subobjects (paths, coarse index,
segment_chunks), and ~40KB per segment from chunked_fifo's free chunk
caching policy in the expiring_fifo wait lists. This caused the system
to materialize far more segments than actual memory could support,
leading to OOM under workloads with many small segments and prefetch
enabled.
Contributor

Copilot AI left a comment


Pull request overview

This PR improves memory estimates for materialized remote segments in cloud storage to prevent OOM in workloads with many small segments. The previous estimate of sizeof(remote_segment) + 4KB significantly underestimated actual usage, primarily missing the ~40KB overhead from expiring_fifo retained free chunks in both segment and chunk hydration wait lists.

Changes:

  • Added estimate_wait_list_overhead() static method and improved estimate_memory_use() to account for heap-allocated paths, coarse index, segment chunks, and waiter FIFO overhead.
  • Updated projected_remote_segment_memory_usage() in materialized_resources.cc to include subobject, coroutine, and waiter overhead (~56KB total projected overhead vs previous ~4KB).
  • Fixed two typos ("extimate" → "estimate") in log messages and a minor code ordering cleanup in remote_partition.cc.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

  • src/v/cloud_storage/remote_segment.h — Added estimate_wait_list_overhead() static method declaration.
  • src/v/cloud_storage/remote_segment.cc — Implemented estimate_wait_list_overhead() and enhanced estimate_memory_use() with heap path, index, chunk, and waiter overhead accounting.
  • src/v/cloud_storage/materialized_resources.cc — Updated projected memory estimate with subobject, coroutine, and waiter overhead; fixed two "extimate" → "estimate" typos.
  • src/v/cloud_storage/remote_partition.cc — Moved _ts_probe.segment_materialized() after the log statement for consistency.

Comment on lines +377 to +378
// retry chain, etc.)
res += sizeof(segment_chunks);

Copilot AI Mar 8, 2026


sizeof(remote_segment) already includes sizeof(std::optional<segment_chunks>), which provides inline storage for the entire segment_chunks struct. Adding sizeof(segment_chunks) again when _chunks_api is engaged double-counts the inline struct size.

If the intent is to account for heap allocations from segment_chunks's internal data structures (e.g., the _chunks btree_map root node, _prefetches chunked_vector buffer, etc.), consider estimating that heap overhead directly instead of using sizeof(segment_chunks).

For a memory-estimation function aimed at preventing OOM, over-counting is safer than under-counting, so this isn't critical, but it does make the estimate less accurate than intended.

Suggested change:

  -    // retry chain, etc.)
  -    res += sizeof(segment_chunks);
  +    // retry chain, etc.). Account approximately for per-chunk overhead.
@vbotbuildovich
Collaborator

CI test results

Test results on build #81488:

  • CloudTopicsL0GCNodeFailureTest.test_node_failure_mid_gc, arguments {"cloud_storage_type": 2}, integration — FLAKY (passed 10/11). Test passes after retries; no significant increase in flaky rate (baseline=0.0269, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000).
    Job: https://buildkite.com/redpanda/redpanda/builds/81488#019ccf7b-5bbf-4238-a410-f661d05b4acf
    History: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCNodeFailureTest&test_method=test_node_failure_mid_gc
  • src/v/cloud_storage/tests/remote_partition_fuzz_test, unit — FAIL (passed 0/1).
    Job: https://buildkite.com/redpanda/redpanda/builds/81488#019ccf5c-ce7d-480c-b7ff-e8d6f824eeda

@Lazin
Contributor

Lazin commented Mar 19, 2026

"Test timed out at 2026-03-08 21:59:34 UTC"

I don't think it's caused by the change in this PR.

@Lazin
Contributor

Lazin commented Mar 19, 2026

There is a small chance that the new memory accounting logic limits concurrency, making the test naturally slower. But this is an old test, and I don't think it stresses memory hard enough for that.

