feat: Dynamic memory snapshots by Pijukatel · Pull Request #1715 · apify/crawlee-python

Pijukatel · 2026-02-05T13:52:34Z

Description

Add a `Ratio` type that represents the maximum memory as a fraction of the system's available memory.
Allow `Snapshotter.max_memory_size` and `MemorySnapshot.max_memory_size` to be initialized with either a `Ratio` (dynamic memory) or a `ByteSize` (fixed memory).
When a `Ratio` is used, `MemorySnapshot.is_overloaded` takes the current available memory into account. (Previously, only the initial available memory was considered.)

Top level usage in Crawlers:
Fixed memory

BasicCrawler(configuration=Configuration(memory_mbytes=1024))

Dynamic memory

BasicCrawler(configuration=Configuration(available_memory_ratio=0.5))
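
The difference between the two modes can be sketched without Crawlee itself. The following is only an illustration: `ByteSize` and `Ratio` here are hypothetical stand-in dataclasses, not the library's classes, and `resolve_max_memory` is an assumed helper name. It shows a fixed limit ignoring the system total, while a ratio is resolved again against whatever total is currently reported.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ByteSize:
    """Hypothetical stand-in for a fixed memory size in bytes."""

    bytes: int


@dataclass(frozen=True)
class Ratio:
    """Hypothetical stand-in for a fraction of total system memory."""

    value: float


def resolve_max_memory(limit: ByteSize | Ratio, total_bytes: int) -> ByteSize:
    """Return the effective memory limit for the given system total."""
    if isinstance(limit, Ratio):
        # Dynamic: the limit follows the currently reported total.
        return ByteSize(int(total_bytes * limit.value))
    # Fixed: the limit is independent of the system total.
    return limit


GIB = 1024**3
fixed = ByteSize(1 * GIB)
dynamic = Ratio(0.5)

# Simulate the container being scaled from 4 GiB to 8 GiB.
limits_fixed = [resolve_max_memory(fixed, t).bytes for t in (4 * GIB, 8 * GIB)]
limits_dynamic = [resolve_max_memory(dynamic, t).bytes for t in (4 * GIB, 8 * GIB)]
```

In this sketch the fixed limit stays at 1 GiB for both totals, whereas the ratio-based limit moves from 2 GiB to 4 GiB.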

Issues

Closes: Snapshotter does not account for dynamic memory scaling (e.g., K8s burstable QoS) #1704

Testing

Unit test

Checklist

CI passed

codecov · 2026-02-05T13:57:39Z

Codecov Report

❌ Patch coverage is 71.87500% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.42%. Comparing base (8c0dae6) to head (3f27a14).
⚠️ Report is 3 commits behind head on master.

Files with missing lines	Patch %	Lines
src/crawlee/_autoscaling/snapshotter.py	65.38%	9 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1715      +/-   ##
==========================================
- Coverage   92.47%   92.42%   -0.06%     
==========================================
  Files         156      156              
  Lines       10602    10621      +19     
==========================================
+ Hits         9804     9816      +12     
- Misses        798      805       +7

Flag	Coverage Δ
unit	`92.42% <71.87%> (-0.06%)`	⬇️

Pijukatel · 2026-02-10T15:53:40Z

E2E tests: https://github.com/apify/crawlee-python/actions/runs/21871925606

Copilot

Pull request overview

This PR introduces dynamic memory snapshot support to address autoscaling limitations in environments with variable memory allocations (e.g., Kubernetes burstable QoS). It adds a Ratio type that allows the autoscaler to dynamically query available system memory rather than being locked to an initial baseline.

Changes:

Introduced Ratio type for representing dynamic memory as a proportion of total system memory
Modified Snapshotter and MemorySnapshot to accept either ByteSize (fixed) or Ratio (dynamic) for memory limits
Added logic to dynamically evaluate memory overload based on current available memory when using Ratio

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
`src/crawlee/_utils/byte_size.py`	Adds `Ratio` Pydantic model with validation for memory ratios (0.0 < value ≤ 1.0)
`src/crawlee/_autoscaling/snapshotter.py`	Updates `max_memory_size` parameter to accept `ByteSize \| Ratio` and dynamically calculates memory limits when using `Ratio`
`src/crawlee/_autoscaling/_types.py`	Modifies `MemorySnapshot.is_overloaded` to dynamically query system memory when `max_memory_size` is a `Ratio`
`tests/unit/_autoscaling/test_snapshotter.py`	Adds comprehensive test simulating memory scale-up/scale-down scenarios with mocked memory info

src/crawlee/_autoscaling/_types.py

tests/unit/_autoscaling/test_snapshotter.py

Mantisus

Looks good! I have a few comments below

Mantisus · 2026-02-10T18:15:39Z

src/crawlee/_autoscaling/_types.py

+    """The maximum memory that can be used by `AutoscaledPool`.
+
+    When of type `ByteSize` then it is used as fixed memory size. When of type `Ratio` then it allows for dynamic memory
+    scaling based on the available system memory.


I believe the available_memory_ratio docstring in Configuration should also mention this dynamic scaling behavior.

Mantisus · 2026-02-10T18:25:45Z

src/crawlee/_utils/byte_size.py

+class Ratio(BaseModel):
+    """Represents ratio of memory."""
+
+    value: Annotated[float, Field(gt=0.0, le=1.0)]


I believe we could add gt=0.0 and le=1.0 constraints to available_memory_ratio in Configuration for consistency.

Mantisus · 2026-02-10T18:52:29Z

src/crawlee/_autoscaling/_types.py

+            # The snapshot overload is decided not when the snapshot was taken, but when `is_overloaded` property is
+            # accessed. This allows for dynamic memory scaling. The same memory snapshot that used to be overloaded in
+            # the past can become non-overloaded if the available memory was increased.
+            max_memory_size = ByteSize(int(get_memory_info().total_size.bytes * self.max_memory_size.value))


I believe get_memory_info() should be wrapped in asyncio.to_thread since it uses psutil.

Also, I don't think historical snapshots should be affected by memory scaling - the overload was real at the moment the snapshot was taken. Once it's older than _SNAPSHOT_HISTORY, it won't influence the Snapshotter anyway. The 30-second inertia from old memory values seems reasonable to me.

Good point. No need to be so "realtime-ish"
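
The first suggestion in this thread, moving the psutil-backed call off the event loop, could look roughly like the sketch below. `read_memory_blocking` is a hypothetical stand-in for the real query (it only sleeps and returns a made-up total), so this shows the `asyncio.to_thread` pattern rather than the code that ended up in the PR.

```python
import asyncio
import threading
import time


def read_memory_blocking() -> tuple[int, str]:
    """Hypothetical stand-in for a blocking, psutil-style system query."""
    time.sleep(0.05)  # Simulate a slow system call.
    return 8 * 1024**3, threading.current_thread().name


async def read_memory() -> tuple[int, str]:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(read_memory_blocking)


async def main() -> tuple[int, str, str]:
    total_bytes, worker_name = await read_memory()
    return total_bytes, worker_name, threading.current_thread().name


total_bytes, worker_name, loop_thread_name = asyncio.run(main())
```

The returned thread names differ, which indicates that the blocking function ran outside the thread that drives the event loop.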

Mantisus · 2026-02-10T19:03:01Z

src/crawlee/_autoscaling/snapshotter.py

-        self._evaluate_memory_load(event_data.memory_info.current_size, event_data.memory_info.created_at)
+
+        if isinstance(self._max_memory_size, Ratio):
+            max_memory_size = ByteSize(int(get_memory_info().total_size.bytes * self._max_memory_size.value))


I think we could skip calling get_memory_info() when event_data.memory_info is MemoryInfo and just use memory_info.total_size.bytes instead.

Mantisus

LGTM!

vdusek · 2026-02-12T07:30:23Z

src/crawlee/_autoscaling/_types.py


-    max_memory_size: ByteSize
-    """The maximum memory that can be used by `AutoscaledPool`."""
+    max_memory_size: ByteSize | Ratio


In _snapshot_memory, we now always resolve Ratio to ByteSize, so I believe there should be only ByteSize. Also, don't forget to update the docstring.

Suggested change

-    max_memory_size: ByteSize | Ratio
+    max_memory_size: ByteSize
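
A sketch of what this suggestion implies, under stated assumptions: `MemorySnapshotSketch` and `take_snapshot` are hypothetical, simplified names, and the 0.9 overload threshold is an assumed value. The ratio is resolved once when the snapshot is created, so the snapshot holds only byte values and its overload status does not change afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySnapshotSketch:
    """Hypothetical, simplified snapshot that stores an already resolved limit."""

    current_size_bytes: int
    max_memory_size_bytes: int
    max_used_memory_ratio: float  # Assumed overload threshold.

    @property
    def is_overloaded(self) -> bool:
        # Depends only on values captured when the snapshot was taken.
        return self.current_size_bytes / self.max_memory_size_bytes > self.max_used_memory_ratio


def take_snapshot(used_bytes: int, total_bytes: int, ratio: float) -> MemorySnapshotSketch:
    """Resolve the ratio against the total known at snapshot time."""
    return MemorySnapshotSketch(
        current_size_bytes=used_bytes,
        max_memory_size_bytes=int(total_bytes * ratio),
        max_used_memory_ratio=0.9,
    )


GIB = 1024**3
# 1.9 GiB used while the limit is 0.5 * 4 GiB = 2 GiB: overloaded.
before_scale_up = take_snapshot(int(1.9 * GIB), 4 * GIB, 0.5)
# Same usage after the total grows to 8 GiB (limit 4 GiB): not overloaded.
after_scale_up = take_snapshot(int(1.9 * GIB), 8 * GIB, 0.5)
```

Older snapshots keep their original verdict here, which matches the earlier review remark that an overload was real at the moment the snapshot was taken.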

vdusek · 2026-02-12T07:39:34Z

src/crawlee/_utils/byte_size.py

@@ -1,14 +1,22 @@
 from __future__ import annotations


I don't think the Ratio belongs here. A ratio is not a byte size - it's a proportion of memory. We should place Ratio in the _autoscaling/_types.py where it is (only) used.

vdusek · 2026-02-12T07:40:34Z

src/crawlee/_utils/byte_size.py

 _BYTES_PER_TB = _BYTES_PER_KB**4


+class Ratio(BaseModel):


This should probably be a plain dataclass instead of a Pydantic model.
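
One way this could look, as a minimal sketch rather than the code that was merged: a frozen dataclass that enforces the same bounds as the Pydantic `Field(gt=0.0, le=1.0)` constraint shown in the diff. Since dataclasses do not validate fields on their own, the check lives in `__post_init__`; the error message wording is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ratio:
    """Hypothetical dataclass variant: a fraction of total memory in (0.0, 1.0]."""

    value: float

    def __post_init__(self) -> None:
        # Dataclasses do not validate fields, so enforce the bounds explicitly.
        if not 0.0 < self.value <= 1.0:
            raise ValueError(f'Ratio must be in (0.0, 1.0], got {self.value}')
```

With this sketch, `Ratio(0.5)` and `Ratio(1.0)` are accepted, while `Ratio(0.0)` and `Ratio(1.5)` raise `ValueError`.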

vdusek · 2026-02-12T07:44:21Z

src/crawlee/_autoscaling/snapshotter.py

+                # This is just a hypothetical case that should not happen in practice.
+                # `LocalEventManager` should always provide `MemoryInfo` in the event data.
+                # When running on Apify, `self._max_memory_size` is always `ByteSize`, not `Ratio`.


Please log a warning then.
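
A rough sketch of the requested behavior, combined with the earlier suggestion to reuse the total carried by the event data. All names are hypothetical: `MemoryInfoSketch` stands in for the real event payload, and `fallback_total_bytes` stands in for the value a separate system query would return.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class MemoryInfoSketch:
    """Hypothetical event payload that already carries the system total."""

    total_size_bytes: int
    current_size_bytes: int


def resolve_limit_bytes(ratio: float, memory_info: object, fallback_total_bytes: int) -> int:
    """Resolve a ratio-based limit, preferring the total from the event data."""
    if isinstance(memory_info, MemoryInfoSketch):
        # No extra system query needed: the event already reports the total.
        return int(memory_info.total_size_bytes * ratio)
    # Not expected in practice, so make it visible instead of passing silently.
    logger.warning('Event data has no total memory size; falling back to a separate query.')
    return int(fallback_total_bytes * ratio)
```

The warning makes the "should not happen" branch observable in logs if it is ever taken.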

vdusek · 2026-02-12T07:46:32Z

tests/unit/_autoscaling/test_snapshotter.py

            assert prev_time <= curr_time, f'Items at indices {i - 1} and {i} are not in chronological order'
+
+
+_initial_memory_info = get_memory_info()


Can we use a fixture so that this does not run at import time?
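
A minimal sketch of the fixture approach, assuming pytest and using a hypothetical `read_memory_info` stand-in instead of the real `get_memory_info()`. The query then runs when a test requests the fixture, not when the test module is imported.

```python
import pytest


def read_memory_info() -> dict[str, int]:
    """Hypothetical stand-in for the real system memory query."""
    return {'total_size_bytes': 8 * 1024**3}


@pytest.fixture
def initial_memory_info() -> dict[str, int]:
    # Evaluated when a test requests the fixture, not when the module is imported.
    return read_memory_info()


def test_uses_initial_memory_info(initial_memory_info: dict[str, int]) -> None:
    assert initial_memory_info['total_size_bytes'] > 0
```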

janbuchar · 2026-02-12T12:08:56Z

Once you folks are done here, we need to make sure that this is also implemented in the JS port

Draft of dynamic memory snapshots

23c803d

github-actions bot assigned Pijukatel Feb 5, 2026

github-actions bot added this to the 133rd sprint - Tooling team milestone Feb 5, 2026

github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics. labels Feb 5, 2026

Pijukatel mentioned this pull request Feb 5, 2026

Snapshotter does not account for dynamic memory scaling (e.g., K8s burstable QoS) #1704

Open

Pijukatel requested review from Mantisus and vdusek February 10, 2026 15:46

Pijukatel marked this pull request as ready for review February 10, 2026 15:46

Pijukatel marked this pull request as draft February 10, 2026 15:48

Merge remote-tracking branch 'origin/master' into dynamic-snapshotter

7fc2463

Pijukatel marked this pull request as ready for review February 10, 2026 15:53

vdusek requested a review from Copilot February 10, 2026 18:47

Copilot started reviewing on behalf of vdusek February 10, 2026 18:47 View session

Copilot AI reviewed Feb 10, 2026

Mantisus suggested changes Feb 10, 2026

Review comments and simplification.

3f27a14

Pijukatel force-pushed the dynamic-snapshotter branch from 6c11d73 to 3f27a14 Compare February 11, 2026 10:05

Pijukatel requested a review from Mantisus February 11, 2026 10:16

Mantisus approved these changes Feb 11, 2026

vdusek requested changes Feb 12, 2026

4 participants
