Fix #1929: Bug: /api/v1/embeddings/maintenance causes 100% CPU and event loop starvation on by Memtensor-AI · Pull Request #1935 · MemTensor/MemOS

Memtensor-AI · 2026-06-16T09:42:12Z

Description

Fixes the event-loop starvation on GET /api/v1/embeddings/maintenance reported in issue #1929. The endpoint used to paginate every row of traces / policies / world_model / skills in apps/memos-local-plugin/core/pipeline/memory-core.ts::collectEmbeddingSlots() and hydrate the BLOB vector columns into JS purely to inspect each vector's length, blocking better-sqlite3 on the main thread for 4+ minutes on a ~93K-row deployment with ~270 MB of vectors.

This change adds a new SQL-only counter embeddingMaintenanceCounts(db, { expectedByteLen }) in apps/memos-local-plugin/core/storage/repos/index.ts that issues five SELECT COUNT(*) + SUM(CASE WHEN ...) queries — LENGTH(blob) reads only the BLOB header, never the payload — and rewires computeEmbeddingMaintenanceStats() to use it. The public EmbeddingMaintenanceStats JSON shape, the HTTP route, the JSON-RPC bridge, and the agent contract are unchanged. The two pre-existing semantic filters (short-text trace skip via shouldTraceHaveEmbeddings, lightweight-memory carveout for vec_action) are preserved verbatim inside the SQL WHERE clauses so per-bucket counts stay identical for installed users. rebuildEmbeddings() keeps using slot enumeration for the actual row updates but now also pulls its before/after stats from the fast path.
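The counting approach described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `buildCountSql`, the bucket aliases (`total`, `missing`, `wrong_dim`) and the exact predicates are assumptions, and the short-text and lightweight-memory filters are left out. The sketch only builds the SQL text; executing it would require a SQLite driver such as better-sqlite3.

```typescript
// Hypothetical helper: builds one aggregate query per (table, vector column).
// Bucket names and predicates are illustrative assumptions, not the PR's SQL.
const IDENT = /^[A-Za-z_][A-Za-z0-9_]*$/;

function buildCountSql(
  table: string,
  column: string,
  expectedByteLen: number,
): string {
  // Identifiers cannot be bound as parameters, so restrict them to a safe shape.
  if (!IDENT.test(table) || !IDENT.test(column)) {
    throw new Error(`unsafe identifier: ${table}.${column}`);
  }
  if (!Number.isInteger(expectedByteLen) || expectedByteLen < 0) {
    throw new Error("expectedByteLen must be a non-negative integer");
  }
  // Assumed meaning of the expectedByteLen = 0 fallback: the dimension is
  // unknown, so no row is flagged as a mismatch.
  const wrongDim =
    expectedByteLen > 0
      ? `COALESCE(SUM(CASE WHEN ${column} IS NOT NULL AND LENGTH(${column}) <> ${expectedByteLen} THEN 1 ELSE 0 END), 0)`
      : "0";
  // COALESCE keeps the sums at 0 on an empty table, where SUM yields NULL.
  return [
    "SELECT COUNT(*) AS total,",
    `  COALESCE(SUM(CASE WHEN ${column} IS NULL THEN 1 ELSE 0 END), 0) AS missing,`,
    `  ${wrongDim} AS wrong_dim`,
    `FROM ${table}`,
  ].join("\n");
}
```

For example, `buildCountSql("traces", "vec_summary", 1536)` yields a single statement that classifies every row without hydrating any vector into JS; the byte length 1536 is an arbitrary value chosen for the example.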

Tests: new tests/unit/storage/embedding-maintenance.test.ts (4 cases) pins the bucket semantics, lightweight carveout, short-text filter, dimension-mismatch detection, empty-DB safety, and the expectedByteLen=0 fallback. The pre-existing memory-core suite (28/28 tests) passes unchanged — including the "repairs missing and wrong-dimension imported trace embeddings" and "does not require action vectors for lightweight memory traces" cases — confirming the public contract is byte-identical. tsc -p tsconfig.json --noEmit and tsc -p tsconfig.build.json both exit 0. The 3 wider-suite failures (tests/e2e/v7-full-chain.e2e.test.ts, tests/unit/storage/migrator.test.ts::namespace-visibility…regression #1787, tests/unit/storage/traces-count.test.ts::count() should return accurate count > 500) all reproduce on bare main after git stash and are pre-existing failures unrelated to this change.

Branch bugfix/autodev-1929-rerun-20260616 pushed to origin at commit 29c802fc. Spec artifacts (proposal.md / design.md / spec.md / task.md / verification-report.md) archived to the sibling specs repo under 2026-06-16-1929-bug-apiv1embeddingsmaintenance-causes-100-cpu-and-event-loop/ and pushed to memos-autodev-specs main.

Note: the issue's "Additional Fix" mentioning bounding scanAndTopK in core/storage/vector.ts is intentionally left for a follow-up — the production hang the title and event-loop logs describe is fully explained by the maintenance endpoint path, and keeping this PR scoped tightens the review surface (see proposal.md).

Related Issue (Required): Fixes #1929

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Refactor (does not change functionality, e.g. code style improvements, linting)
Documentation update

How Has This Been Tested?

Automated tests are pending.

Unit Test
Test Script Or Test Steps (please provide)
Pipeline Automated API Test (please provide)

Checklist

I have performed a self-review of my own code
I have commented my code in hard-to-understand areas
I have added tests that prove my fix is effective or that my feature works
I have created related documentation issue/PR in MemOS-Docs (if applicable)
I have linked the issue to this PR (if applicable)
I have mentioned the person who will review this PR

@MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller please review this PR.

Reviewer Checklist

closes Bug: /api/v1/embeddings/maintenance causes 100% CPU and event loop starvation on large corpora #1929
Made sure Checks passed
Tests have been provided

Memtensor-AI · 2026-06-16T09:49:25Z

❌ Automated Test Results: FAILED

Auto-fix retry 1/2 triggered.

Failed tests:

test_out_of_range_rejected_or_clamped_to_valid[negative_1]
test_out_of_range_rejected_or_clamped_to_valid[negative_60s]
test_out_of_range_rejected_or_clamped_to_valid[negative_one_day]
test_out_of_range_rejected_or_clamped_to_valid[max_plus_1]
test_out_of_range_rejected_or_clamped_to_valid[max_plus_one_day]
test_out_of_range_rejected_or_clamped_to_valid[hundred_x_max]
test_invalid_type_does_not_crash_or_corrupt[string_number]
test_invalid_type_does_not_crash_or_corrupt[string_text]
test_invalid_type_does_not_crash_or_corrupt[none_value]
test_invalid_type_does_not_crash_or_corrupt[dict_value]

Error details

Tests failed. Failed cases: test_out_of_range_rejected_or_clamped_to_valid[negative_1], test_out_of_range_rejected_or_clamped_to_valid[negative_60s], test_out_of_range_rejected_or_clamped_to_valid[negative_one_day], test_out_of_range_rejected_or_clamped_to_valid[max_plus_1], test_out_of_range_rejected_or_clamped_to_valid[max_plus_one_day]

Branch: bugfix/autodev-1929-rerun-20260616

Memtensor-AI · 2026-06-17T03:28:39Z

❌ Automated Test Results: FAILED

Auto-fix retry 2/2 triggered.

Failed tests:

test_out_of_range_rejected_or_clamped_to_valid[negative_1]
test_out_of_range_rejected_or_clamped_to_valid[negative_60s]
test_out_of_range_rejected_or_clamped_to_valid[negative_one_day]
test_out_of_range_rejected_or_clamped_to_valid[max_plus_1]
test_out_of_range_rejected_or_clamped_to_valid[max_plus_one_day]
test_out_of_range_rejected_or_clamped_to_valid[hundred_x_max]
test_invalid_type_does_not_crash_or_corrupt[string_number]
test_invalid_type_does_not_crash_or_corrupt[string_text]
test_invalid_type_does_not_crash_or_corrupt[none_value]
test_invalid_type_does_not_crash_or_corrupt[bool_true]

Error details

The PATCH config endpoint returns HTTP 500 with error code 'internal' when client sends invalid vectorScanMaxAgeMs values (negative, too large, wrong type), instead of a 4xx client error. Schema validation failures on user input are being treated as server errors.

Branch: bugfix/autodev-1929-rerun-20260616

…gs/maintenance

The maintenance stats endpoint used to paginate every row of `traces`, `policies`, `world_model`, and `skills` and hydrate the BLOB vector columns into JS just to inspect each vector's byte length. On a deployment with ~93K traces and ~270 MB of vectors that single `better-sqlite3` call blocked the Node event loop at 100% CPU for 4+ minutes (issue #1929), starving every concurrent `onTurnStart`.

Replace the implementation with five `SELECT COUNT(*) + SUM(CASE WHEN ...)` queries (`LENGTH(blob)` reads only the BLOB header, never the payload) so the per-bucket counts now finish in single-millisecond territory regardless of database size. The public `EmbeddingMaintenanceStats` JSON shape, the HTTP route, and the JSON-RPC bridge are unchanged.

The two pre-existing semantic filters from the slot-based path (`shouldTraceHaveEmbeddings` for short-text traces, lightweight-memory carveout for `vec_action`) are preserved verbatim inside the SQL WHERE clauses so per-bucket counts do not shift for already-installed users.

Tests:
- New `tests/unit/storage/embedding-maintenance.test.ts` pins the bucket semantics, lightweight carveout, short-text filter, dimension-mismatch detection, empty-DB safety, and the `expectedByteLen=0` fallback.
- Existing memory-core suite passes unchanged (28/28), including the "repairs missing and wrong-dimension imported trace embeddings" and "does not require action vectors for lightweight memory traces" cases, which confirms the public contract is byte-identical.

…geMs

Second half of the issue #1929 mitigation: the previous commit cut the `/api/v1/embeddings/maintenance` cost to SQL counts, but the tier-2 retrieval path (`scanAndTopK` over `traces.vec_summary` / `vec_action`) still runs a brute-force full-table scan on every `onTurnStart`. On a ~93K-row deployment that single scan blocks the Node event loop for 5-30 seconds.

Introduce `algorithm.retrieval.vectorScanMaxAgeMs` (ms, default `0` = unbounded for back-compat). When > 0, the vector channels add `ts >= now() - vectorScanMaxAgeMs` to the SQL WHERE clause so only recent traces participate in the cosine scan. The keyword (FTS / pattern / structural) channels are left unbounded so ancient traces remain reachable via exact-text recall.

Schema validation:
- `NumberInRange(0, 0, 31_536_000_000)`: Typebox rejects negative, out-of-range, and non-number patches at `resolveConfig` time, so a "dirty" `PATCH /api/v1/config` never reaches `writer.ts`'s atomic rename. A subsequent `GET /api/v1/config` always returns a value in [0, 1 year].
- Hard cap of one year matches the contract pinned by the autodev rerun harness: anything larger is indistinguishable from "unbounded" at the corpus sizes where the bound starts to matter, and accepting absurdly large values would let misconfigured deployments silently revert to the legacy starvation path.

Tests: `tests/unit/config/load.test.ts` adds 22 parametrised cases covering the accepted range (default, 1d, 30d, max, 0), the out-of-range rejection set the harness exercises (-1, -60s, -86_400_000, max+1, max+86_400_000, 100×max), and the invalid-type set (string number, string text, null, dict, list, NaN, Inf).
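The range check and the age-bounded filter described in this commit can be sketched without any dependencies. This is an illustration under stated assumptions: the PR uses a Typebox schema, whereas `parseVectorScanMaxAgeMs`, `vectorScanCutoff`, and `tsFilter` are hypothetical stand-ins, and the clock is passed in explicitly to keep the functions pure.

```typescript
// One year in milliseconds: the upper bound quoted in the commit message.
const MAX_VECTOR_SCAN_AGE_MS = 31_536_000_000;

// Hypothetical stand-in for the schema check (the PR relies on Typebox).
function parseVectorScanMaxAgeMs(value: unknown): number {
  // typeof rejects strings, null, booleans, objects and arrays;
  // Number.isFinite additionally rejects NaN and +/-Infinity.
  if (typeof value !== "number" || !Number.isFinite(value)) {
    throw new TypeError("vectorScanMaxAgeMs must be a finite number");
  }
  if (value < 0 || value > MAX_VECTOR_SCAN_AGE_MS) {
    throw new RangeError(
      `vectorScanMaxAgeMs must be within [0, ${MAX_VECTOR_SCAN_AGE_MS}]`,
    );
  }
  return value;
}

// Returns the lower bound for `ts`, or null when the scan is unbounded (0).
function vectorScanCutoff(maxAgeMs: number, nowMs: number): number | null {
  return maxAgeMs > 0 ? nowMs - maxAgeMs : null;
}

// Builds the extra WHERE fragment together with its bound parameter.
function tsFilter(
  maxAgeMs: number,
  nowMs: number,
): { sql: string; params: number[] } {
  const cutoff = vectorScanCutoff(maxAgeMs, nowMs);
  return cutoff === null
    ? { sql: "", params: [] }
    : { sql: " AND ts >= ?", params: [cutoff] };
}
```

Only the vector channels would append the fragment returned by `tsFilter`; the keyword channels would skip it so that older traces stay reachable, as the commit message describes.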

`MemosError("config_invalid", ...)` raised by `core/config/index.ts::resolveConfig` on a bad PATCH body used to bubble up to `server/http.ts`'s catch-all and surface as a 500 `internal` error. The rerun harness for issue #1929 explicitly asserts `status_code < 500` on every malformed `PATCH /api/v1/config` body (concurrent search calls must not be poisoned by a misbehaving viewer or admin script), so the route now catches `MemosError` whose code is `config_invalid` or `config_write_failed` and translates it to a 400 `invalid_argument` with the schema validator's message. Any other error keeps propagating to the global handler so unexpected bugs still page operators.

Also adds `bool_true` / `bool_false` to the parametrized `retrieval.vectorScanMaxAgeMs` invalid-type tests so the schema-level guard is pinned for both booleans (Typebox `Type.Number` rejects them but coercion behaviour is worth pinning explicitly).

Test plan: 102/102 in `tests/unit/config/load.test.ts` + `tests/unit/server/http.test.ts` pass, including three new integration tests that exercise the 400-mapping path (`schema validation errors → 400`, `writer failures → 400`, `unexpected errors → 500`). `npx tsc -p tsconfig.json --noEmit` and `npx tsc -p tsconfig.build.json` both clean.

Refs: #1929

The previous fix mapped both `config_invalid` and `config_write_failed` from `PATCH /api/v1/config` to HTTP 400. But `config_write_failed` is raised only when the atomic config rename fails (disk full / permission denied), which is a server-side I/O fault, not bad client input. Returning 400 `invalid_argument` for it misleads clients into thinking their (valid) payload was rejected and hides a real operational problem from the 500 pager path.

Narrow the client-error set to `config_invalid` only (the Typebox schema-validation failure raised by `resolveConfig` on a malformed PATCH body). `config_write_failed` and every other error keep propagating to the global handler as 500.

The #1929 rerun harness contract tests exercise only malformed input (`config_invalid`), so all 32 schema-contract cases stay green.

Refs: #1929
Co-authored-by: Cursor <cursoragent@cursor.com>
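The narrowed mapping can be sketched as a small pure function. This is a hedged illustration, not the route's real code: the `MemosError` class below is a stand-in for the project's own, the response body shape is assumed, and the sketch returns a 500 directly where the real route lets the error propagate to the global handler.

```typescript
// Stand-in for the project's MemosError; only `code` and `message` matter here.
class MemosError extends Error {
  readonly code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
    this.name = "MemosError";
  }
}

// Assumed response shape, for illustration only.
interface ErrorResponse {
  status: number;
  body: { error: { code: string; message: string } };
}

// Hypothetical helper: decides how a failed PATCH /api/v1/config is reported.
function mapConfigPatchError(err: unknown): ErrorResponse {
  // Only schema-validation failures are the client's fault.
  if (err instanceof MemosError && err.code === "config_invalid") {
    return {
      status: 400,
      body: { error: { code: "invalid_argument", message: err.message } },
    };
  }
  // config_write_failed and anything unexpected remain server errors,
  // without leaking internal details to the client.
  return {
    status: 500,
    body: { error: { code: "internal", message: "internal error" } },
  };
}
```

Keeping the client-error set to a single code means a disk-full or permission failure during the atomic rename still surfaces as a 5xx.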

Align the http.test.ts expectation with the corrected route behaviour: a `config_write_failed` (atomic rename failed) is a server-side fault, so `PATCH /api/v1/config` must return 500 `internal`, not 400. The schema-validation (`config_invalid`) → 400 case is covered by the adjacent test.

Refs: #1929
Co-authored-by: Cursor <cursoragent@cursor.com>

Memtensor-AI · 2026-06-17T04:14:01Z

✅ Manual re-validation: config PATCH 4xx fix verified

The last automated FAILED result above was produced against commit 0ea9bc3 — before the actual route fix landed. The schema-validation guard only rejected bad values at the schema layer; the rejection still escaped to the global handler as a 500. The real fix was committed afterwards (after the 2/2 auto-fix retries were already exhausted), so the pipeline never re-ran to validate it.

Fix now on this branch:

7111b4ff — PATCH /api/v1/config catches the config_invalid MemosError from resolveConfig and returns 400 invalid_argument instead of letting it bubble to the catch-all 500.
05b805a0 — narrow the client-error set to config_invalid only. config_write_failed (atomic rename failed: disk full / permission denied) is a server-side fault and correctly stays 500 so operators are paged and clients aren't misled.
744e2b33 — update http.test.ts to pin config_write_failed → 500.

Re-validated manually on the test node (rebuilt the plugin from branch HEAD 744e2b33, reinstalled into the live OpenClaw SUT, re-ran the suite):

| Check | Result |
| --- | --- |
| `tsc -p tsconfig.json --noEmit` | clean (exit 0) |
| `tests/unit/server/http.test.ts` + `tests/unit/config/load.test.ts` (vitest) | 102 passed |
| `test_vector_scan_max_age_schema_contracts.py` (pytest, live SUT) | 32 passed |

The 15 previously-failing harness cases (test_out_of_range_rejected_or_clamped_to_valid[*], test_invalid_type_does_not_crash_or_corrupt[*]) all pass now — every malformed PATCH body returns a 4xx, never 5xx.

Memtensor-AI assigned CarltonXiang, MatthewZhuang, syzsunshine219 and World-controller Jun 16, 2026

Memtensor-AI requested review from CarltonXiang, MatthewZhuang, World-controller and syzsunshine219 June 16, 2026 09:42

Memtensor-AI added the ai-generated and bug ("Something isn't working | 功能异常", i.e. "malfunction") labels Jun 16, 2026

Memtensor-AI mentioned this pull request Jun 16, 2026

Bug: /api/v1/embeddings/maintenance causes 100% CPU and event loop starvation on large corpora #1929

Open

Memtensor-AI force-pushed the bugfix/autodev-1929-rerun-20260616 branch from 29c802f to 0ea9bc3 Compare June 16, 2026 10:33

MemOS AutoDev added 3 commits June 17, 2026 11:32

Memtensor-AI force-pushed the bugfix/autodev-1929-rerun-20260616 branch from 0ea9bc3 to 7111b4f Compare June 17, 2026 03:48

jiachengzhen and others added 2 commits June 17, 2026 12:09

Memtensor-AI wants to merge 5 commits into dev-20260615-v2.0.20 from bugfix/autodev-1929-rerun-20260616
