fix: prevent memory leak by closing unused context #1640
Conversation
When scraping many URLs continuously, browser contexts accumulated in memory and were never cleaned up. The existing cleanup only ran when browsers went idle, which never happened under continuous load. See: #943.

Key changes:
- browser_manager.py: Add _context_refcounts tracking, cleanup_contexts(), and release_context() methods
- async_crawler_strategy.py: Release context ref in finally block after crawl
- deploy/docker/api.py: Trigger context cleanup after each request

This fixes, or at least drastically improves, the memory leaks in my testing.
Signed-off-by: Martichou <[email protected]>
Pull request overview
This pull request addresses a memory leak issue where browser contexts accumulate in memory and are never cleaned up under continuous load. The fix introduces reference counting for contexts and adds periodic cleanup mechanisms.
- Implements reference counting to track active usage of browser contexts
- Adds cleanup_contexts() method to periodically close idle contexts
- Triggers context cleanup after each API request to prevent unbounded memory growth
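The reference-counting lifecycle summarized above can be sketched roughly as follows. This is an illustrative stand-in, not the PR's actual code: the method names (_context_refcounts, cleanup_contexts, release_context) follow the PR description, but the context objects, acquire helper, and locking are simplified placeholders rather than Playwright internals.

```python
import asyncio

class ContextPool:
    """Illustrative sketch of refcounted browser-context tracking."""

    def __init__(self):
        self.contexts_by_config = {}   # config signature -> context (stand-in objects)
        self._context_refcounts = {}   # config signature -> number of active users
        self._lock = asyncio.Lock()

    async def acquire(self, signature):
        # Get or create the context for this config, and mark it as in use.
        async with self._lock:
            ctx = self.contexts_by_config.setdefault(signature, object())
            self._context_refcounts[signature] = self._context_refcounts.get(signature, 0) + 1
            return ctx

    async def release_context(self, signature):
        # Decrement the refcount when a crawl finishes (clamped at zero).
        async with self._lock:
            current = self._context_refcounts.get(signature, 0)
            self._context_refcounts[signature] = max(0, current - 1)

    async def cleanup_contexts(self):
        # Drop only contexts no request is currently using; return what was removed.
        async with self._lock:
            idle = [sig for sig, n in self._context_refcounts.items() if n == 0]
            for sig in idle:
                self.contexts_by_config.pop(sig, None)
                self._context_refcounts.pop(sig, None)
            return idle
```

The key property is that cleanup only ever touches contexts whose refcount has dropped to zero, so a cleanup pass triggered after each request cannot close a context mid-crawl.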
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| crawl4ai/browser_manager.py | Adds reference counting system (_context_refcounts), cleanup_contexts() method for closing idle contexts, and release_context() method for decrementing refcounts |
| crawl4ai/async_crawler_strategy.py | Adds release_context() call in finally block to decrement refcount when crawl completes |
| deploy/docker/api.py | Triggers cleanup_contexts() after each request to limit context accumulation, with whitespace cleanup |
| README.md | Adds new sponsor (Thor Data) - unrelated to memory leak fix |
```python
# Release the context reference so cleanup can work
if not self.browser_config.use_managed_browser:
    try:
        config_signature = self.browser_manager._make_config_signature(config)
        await self.browser_manager.release_context(config_signature)
    except Exception:
        pass  # Don't fail on cleanup
```
Copilot
AI
Jan 1, 2026
The release_context call here creates a reference counting imbalance when using session_id. Looking at browser_manager.py get_page(), when a session_id is provided and already exists, the function returns early (line 1063-1066) without incrementing the refcount. However, this release_context will still be called, decrementing a counter that was never incremented. This will cause the refcount to go negative (though clamped to 0 by the max() call in release_context), potentially allowing contexts to be cleaned up while still in use by sessions. The condition should also check that no session_id is being used, similar to: if not self.browser_config.use_managed_browser and not config.session_id:
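The guard Copilot suggests can be sketched as a small predicate. This is a hypothetical helper written for illustration, not code from the PR; the point is simply that sessions skip the refcount increment in get_page, so they must also skip the decrement.

```python
def should_release_context(use_managed_browser: bool, session_id) -> bool:
    """Only release the context refcount when no managed browser is in use
    AND no session_id pinned the context. Session reuse returns early from
    get_page without incrementing, so releasing here would underflow."""
    return not use_managed_browser and not session_id
```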
```python
# If force=True and we still have too many, close active ones too
if force and len(self.contexts_by_config) - len(contexts_to_close) > max_contexts:
    remaining_excess = len(self.contexts_by_config) - len(contexts_to_close) - max_contexts
    contexts_to_close.extend(active_contexts[:remaining_excess])
```
Copilot
AI
Jan 1, 2026
The documentation states that force will "close contexts even if they have pages (but never if refcount > 0)", but the implementation at lines 1207-1209 will actually close active contexts when force=True, and active_contexts includes contexts with refcount > 0 (added at line 1171-1173). This means force=True can close contexts that are actively being used by requests, contradicting the docstring and potentially causing "Target closed" errors during active crawls. The condition should filter out contexts with refcount > 0 from active_contexts before extending contexts_to_close.
Suggested change:

```diff
-# If force=True and we still have too many, close active ones too
-if force and len(self.contexts_by_config) - len(contexts_to_close) > max_contexts:
-    remaining_excess = len(self.contexts_by_config) - len(contexts_to_close) - max_contexts
-    contexts_to_close.extend(active_contexts[:remaining_excess])
+# If force=True and we still have too many, close additional contexts
+# but never close contexts with refcount > 0 (they may be in active use).
+if force and len(self.contexts_by_config) - len(contexts_to_close) > max_contexts:
+    remaining_excess = len(self.contexts_by_config) - len(contexts_to_close) - max_contexts
+    # From active_contexts, only consider those whose refcount is 0 for forced closure
+    force_closable_active = [
+        (sig, ctx)
+        for sig, ctx in active_contexts
+        if self._context_refcounts.get(sig, 0) == 0
+    ]
+    contexts_to_close.extend(force_closable_active[:remaining_excess])
```
```python
except Exception:
    pass
```
Copilot
AI
Jan 1, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
-except Exception:
-    pass
+except Exception as e:
+    # Ignore individual page close failures but record them for diagnostics
+    self.logger.warning(
+        message="Error closing page during context cleanup: {error}",
+        tag="WARNING",
+        params={"error": str(e)}
+    )
```
Signed-off-by: Martichou <[email protected]>
Hi @Martichou — thanks for this PR and the detailed memory profiling data. Your diagnosis of the problem was spot on: browser contexts in contexts_by_config accumulate unboundedly in long-running crawlers.

We ended up implementing the fix directly on develop.

1. Config signature shrink (the bigger win)

The root cause of excessive context creation was that _make_config_signature() hashed roughly 60 CrawlerRunConfig fields, while only 7 of them actually affect the browser context. Switching from a blacklist to a whitelist means non-context fields (word_count_threshold, css_selector, screenshot, verbose, ...) no longer trigger unnecessary context creation.

2. Refcount tracking + LRU eviction (your core idea)

Following your reference-counting concept, we added:
- _context_refcounts, incremented under _contexts_lock in get_page and decremented in release_page_with_context
- LRU eviction (_evict_lru_context_locked) that caps contexts at _max_contexts=20, evicting only idle contexts (refcount == 0), oldest first

3. Storage state path leak fix

Also fixed a separate leak where the storage_state path created a temporary context on every request; it is now explicitly closed after clone_runtime_state.

Test coverage

Added 20 tests (8 signature, 5 eviction logic, 7 real-browser integration) — all passing. The real-browser tests verify: the same context is reused for non-context config changes, refcounts drop to 0 after a crawl, LRU eviction caps contexts, and concurrent crawls are tracked correctly.

Your PR was the catalyst for this investigation — really appreciate the contribution and the benchmarking script. Closing in favor of the direct implementation on develop.
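The whitelist-based signature idea from point 1 can be sketched like this. The sketch uses a plain dict as a stand-in for CrawlerRunConfig, and make_config_signature is an illustrative free function, not the real _make_config_signature method; the seven field names come from the commit message.

```python
import hashlib
import json

# The 7 fields that actually affect a browser context, per the maintainer's list.
CONTEXT_FIELDS = (
    "proxy_config", "locale", "timezone_id", "geolocation",
    "override_navigator", "simulate_user", "magic",
)

def make_config_signature(config: dict) -> str:
    """Whitelist approach: hash only context-affecting fields, so changes to
    fields like css_selector or screenshot reuse the same browser context."""
    relevant = {k: config.get(k) for k in CONTEXT_FIELDS}
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()
```

With a blacklist, every newly added non-context field silently creates new contexts; the whitelist fails safe in the other direction, which is why it is the bigger win.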
contexts_by_config accumulated browser contexts unboundedly in long-running crawlers (Docker API). Two root causes fixed:

1. _make_config_signature() hashed ~60 CrawlerRunConfig fields but only 7 affect the browser context (proxy_config, locale, timezone_id, geolocation, override_navigator, simulate_user, magic). Switched from blacklist to whitelist — non-context fields like word_count_threshold, css_selector, screenshot, verbose no longer cause unnecessary context creation.

2. No eviction mechanism existed between close() calls. Added refcount tracking (_context_refcounts, incremented under _contexts_lock in get_page, decremented in release_page_with_context) and LRU eviction (_evict_lru_context_locked) that caps contexts at _max_contexts=20, evicting only idle contexts (refcount==0) oldest-first.

Also fixed: storage_state path leaked a temporary context every request (now explicitly closed after clone_runtime_state).

Closes #943. Credit to @Martichou for the investigation in #1640.
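The eviction policy from point 2 can be sketched as follows. This is an illustrative free-function version of what the commit calls _evict_lru_context_locked, with plain dicts standing in for the manager's state; it assumes the caller already holds the contexts lock.

```python
from collections import OrderedDict

def evict_lru_context(contexts: OrderedDict, refcounts: dict, max_contexts: int = 20):
    """Evict idle contexts (refcount == 0), oldest first, until the pool is
    at or under max_contexts. Returns the evicted signatures. Contexts with
    refcount > 0 are never evicted, even if the cap is still exceeded."""
    evicted = []
    for sig in list(contexts):  # OrderedDict iterates in insertion (oldest-first) order
        if len(contexts) - len(evicted) <= max_contexts:
            break
        if refcounts.get(sig, 0) == 0:
            evicted.append(sig)
    for sig in evicted:
        del contexts[sig]
    return evicted
```

Note the policy deliberately under-evicts when busy contexts dominate: staying temporarily over the cap is safer than closing a context mid-crawl.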
Summary
When scraping many URLs continuously, browser contexts accumulate in memory and are never cleaned up. The existing cleanup mechanism only runs when browsers go idle, which never happens under continuous load. This causes memory to grow unbounded until the process crashes or becomes unresponsive.
Fixes #943
List of files changed and why
- browser_manager.py: Add _context_refcounts tracking, cleanup_contexts(), and release_context() methods
- async_crawler_strategy.py: Release context ref in finally block after crawl
- deploy/docker/api.py: Trigger context cleanup after each request

How Has This Been Tested?
This has been tested locally by running the following script and comparing before/after memory usage between the master version and the patched version, both running through docker compose.
The script simply performs 100 scrapes with a concurrency of 8 and reports the status code distribution:
https://gist.github.com/Martichou/27555055d130d1c65f6a8457fbeb2a22
Result of the test:
Unpatched version:
Patched version:
It may not have eliminated every leak (about 1% growth between runs, for unknown reasons), but closing the browser via the kill-browser endpoint makes memory go back down to 10%.
Checklist: