Skip to content

feat(#1423): Diagram.trace() for upstream restriction propagation#1471

Merged
dimitri-yatsenko merged 5 commits into
masterfrom
feat/1423-diagram-trace
Jul 2, 2026
Merged

feat(#1423): Diagram.trace() for upstream restriction propagation#1471
dimitri-yatsenko merged 5 commits into
masterfrom
feat/1423-diagram-trace

Conversation

@dimitri-yatsenko

Copy link
Copy Markdown
Member

Summary

T2.2.a of the provenance trinity (datajoint-docs#183 spec). Upstream mirror of `Diagram.cascade()`: walks the FK graph from a restricted seed to every ancestor with OR convergence — an ancestor entity is included if reachable through any FK path from the seed.

Closes #1423. Slated for DataJoint 2.3.

Branch dependency

This branch is stacked on `fix/1429-cascade-part-part-renamed-fk` (#1468) for the upward propagation primitives (`_apply_propagation_rule_upward`, `_find_real_edge_props`). After #1468 merges, this branch will be rebased onto master before review.

What's added

Component File
`Diagram.trace(table_expr)` — classmethod, mirror of `cascade()` `src/datajoint/diagram.py`
`_propagate_restrictions_upstream(start_node)` — multi-pass walk over `in_edges`, applies upward rules at each real edge; alias-node transparent `src/datajoint/diagram.py`
`getitem(key)` — accepts Table class/instance (→ pre-restricted `QueryExpression`) or string (→ pre-restricted `FreeTable`); raises on non-ancestor `src/datajoint/diagram.py`
Bugfix in `_apply_propagation_rule_upward` Backward Rule 3: previous code projected child to its own PK, which excluded non-primary FK columns. Now projects to the FK columns via `proj(*attr_map.keys())`. `src/datajoint/diagram.py`
`load_all_upstream()` — symmetric to `load_all_downstream` `src/datajoint/dependencies.py`
`find_upstream_schemas_sql()` on all three adapters `src/datajoint/adapters/{base,mysql,postgres}.py`
8 integration tests covering single/multi-hop, renamed FK, OR convergence, non-ancestor rejection, string indexing, counts, leaf-seed `tests/integration/test_trace.py` (new)

Tests

  • 8/8 trace tests pass on MySQL.
  • Regression set (cascade_delete + cascading_delete + dependencies + semantic_matching): 36/36 pass — no regressions from the Rule 3 fix.

Sequencing

This is T2.2.a. After it merges:

  • T2.2.b (`self.upstream` in `make()`) — branches from master, wires the trace into `AutoPopulate._populate_one()`.
  • T2.2.c (`strict_provenance` config) — branches from master after T2.2.b, adds the runtime gates at fetch + insert chokepoints.
  • Docs spec datajoint-docs#183 flips from draft when this PR opens.

Test plan

  • 8/8 new trace tests pass on MySQL
  • 36/36 regression tests pass on MySQL — no Rule 3 regression
  • CI green (lint, test matrix on MySQL + PostgreSQL, unit-tests)
  • Manual smoke test against a multi-schema pipeline

Implements T2.2.a of the provenance trinity (datajoint-docs#183). Upstream
mirror of Diagram.cascade(): walks the FK graph from a restricted seed to
every ancestor with OR convergence — an ancestor entity is included if
reachable through any FK path from the seed.

Reuses the upward propagation primitives added by #1468
(_apply_propagation_rule_upward / _find_real_edge_props) applied here in
a generalized form (any child → any parent, not just Part → Master).

Branch note: stacked on fix/1429-cascade-part-part-renamed-fk (#1468)
for the upward primitives. Will rebase onto master after #1468 lands.

What's added:

- src/datajoint/diagram.py:
  - New @classmethod Diagram.trace(table_expr) — mirror of cascade(),
    walks ancestors instead of descendants, trims to ancestor subgraph.
  - New _propagate_restrictions_upstream(start_node) — multi-pass walk
    over in_edges, applies the upward rules at each real edge.
    Alias-node transparent.
  - New __getitem__(key) — supports both Table subclass/instance
    (returns pre-restricted QueryExpression) and string (returns
    pre-restricted FreeTable). Raises DataJointError for tables outside
    the trace's subgraph.
  - Bugfix in _apply_propagation_rule_upward Backward Rule 3: previous
    code projected child to its OWN PK (child_ft.proj()) which excluded
    non-primary FK columns. Now projects to the FK columns via
    proj(*attr_map.keys()), correctly carrying them into the parent
    restriction for non-primary-FK cases. Caught by
    test_trace_or_convergence_two_paths.

- src/datajoint/dependencies.py:
  - New load_all_upstream() — symmetric to load_all_downstream.
    Iteratively discovers upstream schemas reachable via reverse FK
    edges, expanding the graph until convergence.

- src/datajoint/adapters/{base,mysql,postgres}.py:
  - New find_upstream_schemas_sql(schemas_list) on each adapter,
    symmetric to find_downstream_schemas_sql.

- tests/integration/test_trace.py (new, 8 tests covering single-hop,
  multi-hop, renamed FK, OR convergence across two paths, non-ancestor
  rejection, string indexing → FreeTable, counts(), leaf-table seed).

All 8 trace tests pass on MySQL. Regression: test_cascade_delete +
test_cascading_delete + test_dependencies + test_semantic_matching
— 36 tests pass, no regressions from the Rule 3 fix.

Slated for DataJoint 2.3.
The trace-mode __getitem__ I added shadowed networkx.DiGraph's standard
adjacency-dict lookup for ALL Diagrams, not just trace results. ERD
tests (and any other code that does diagram[node_name] for adjacency)
were getting DataJointError("not in this trace's subgraph") instead of
the adjacency dict.

Fix: short-circuit non-trace diagrams (no _mode attribute or _mode != "trace")
to super().__getitem__(key) before any trace-specific logic runs.

Tests:
- 5 previously-failing erd tests now pass (test_erd, test_diagram_algebra,
  test_repr_svg, test_make_image, test_part_table_parsing).
- 8/8 trace tests still pass.
dimitri-yatsenko added a commit that referenced this pull request Jun 23, 2026
Implements T2.2.c of the provenance trinity, completing the trio
(Diagram.trace → self.upstream → strict_provenance).

When dj.config["strict_provenance"] = True, runtime gates enforce the
upstream-only convention inside make():
- Reads must target a table in the active trace's allowed set
  (declared ancestors + self + self's Parts).
- Writes must target self or self's Parts.
- Inserted rows' PK columns that overlap with the current key must
  equal the key's values (key-consistency rule).

Default is False. Existing make() bodies are unaffected.

Branch stacked on feat/1424-self-upstream (#1473) → feat/1423-diagram-trace
(#1471) → fix/1429-cascade-part-part-renamed-fk (#1468). Will rebase
onto master after the chain merges.

What's added:

- src/datajoint/provenance.py (new): the runtime context module.
  - `_active_strict_make` ContextVar holding (target, allowed_tables,
    key) for the currently-executing make() invocation. ContextVar
    chosen over threading.local to propagate correctly across
    contextvars-aware concurrency boundaries.
  - `push_strict_make_context` / `pop_strict_make_context` — context
    lifecycle managed by `_populate_one`'s try/finally.
  - `assert_read_allowed(query_expression)` — read gate. Recursively
    discovers base tables via the QueryExpression's `_support` chain
    and checks each against the allowed set.
  - `assert_write_allowed(target_table, rows)` — write gate. Verifies
    the target is self or one of self's Part tables, and checks the
    key-consistency rule on each dict row.

- src/datajoint/settings.py: new `strict_provenance: bool` field on
  Config (default False), env-var `DJ_STRICT_PROVENANCE`, ENV_VAR_MAPPING
  entry.

- src/datajoint/autopopulate.py: in `_populate_one`, push the strict
  context (when the flag is on) just before the make() invocation
  block. The allowed table set = trace's ancestor nodes ∪ {self.full_table_name}
  ∪ {self's Parts}. Pop in the existing `finally` block.

- src/datajoint/expression.py: `QueryExpression.cursor` now calls
  `assert_read_allowed(self)` before issuing SQL. No-op outside make().

- src/datajoint/table.py: `Table.insert` calls `assert_write_allowed(self, rows)`
  after the existing `_allow_insert` check. No-op outside make().

Part-table detection uses class `__dict__` traversal (filtered to Part
subclasses) instead of `dir/getattr` to avoid triggering the
`_JobsDescriptor` (which would lazy-declare ~~table inside the populate
transaction — caught by the first test iteration).

Documented limitation (deferred): the read gate does not distinguish
reads that came through `self.upstream` from reads of the same ancestor
via a direct expression. Both are allowed if the table is in the
allowed set. The intent is to catch reads from *undeclared*
dependencies; tightening the "must come through self.upstream" path
requires propagating an attribution marker through QueryExpression
composition and is left for a follow-up release.

Tests in tests/integration/test_strict_provenance.py (6 new):

- test_strict_compliant_make_passes — make() reading via self.upstream
  and writing self.insert1 with matching key runs cleanly under strict.
- test_strict_blocks_read_from_undeclared_table — read from an unrelated
  table raises with "strict_provenance ... undeclared" message.
- test_strict_blocks_write_to_other_table — insert into a non-self,
  non-Part target raises "not permitted".
- test_strict_blocks_write_with_mismatched_key — row PK that disagrees
  with the current key raises "does not match the current make() key".
- test_strict_writes_to_part_table_pass — self.PartName.insert(...) works.
- test_strict_off_by_default_no_change — default-off regression check;
  the canonical "direct (Ancestor & key).fetch1()" pattern still works
  when strict_provenance is unset.

Regression: 17/17 autopopulate tests pass with strict_provenance unset
(default). 6/6 new strict tests pass with strict_provenance=True.
8/8 trace tests + 9/9 cascade tests unaffected.

Slated for DataJoint 2.3.
@ttngu207

ttngu207 commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

I traced the core logic against the forward cascade engine and it holds up well: the reverse-topological single-pass walk correctly mirrors cascade, OR-convergence is a genuine union (not intersection), the renamed-FK reversal is a clean inverse of the forward Rule 2, alias nodes are traversed transparently, and the Rule-3 fix is correct. No correctness issues found — nice.

One thing I'd add before merge: coverage on the cases that most stress the design's novelty. A regression in the reverse-topo walk or load_all_upstream would tend to drop an OR arm and yield a subset, which passes shallow assertions — so these are the high-value ones to pin:

  • True multi-hop diamond OR-convergence — an ancestor reachable via two multi-hop paths, not just the adjacent two-edge case in test_trace_or_convergence_two_paths.
  • Cross-schema ancestorsload_all_upstream() + find_upstream_schemas_sql on all three adapters are added, but no test spans two schemas, so the unloaded-ancestor-schema discovery path is unexercised.
  • Upward Part-of-Part chains (trace of a Part reaching its Master through intermediate Parts).
  • An isolated non-aliased secondary FK — the exact path the Rule-3 fix changes (proj(*attr_map.keys())). It's currently only incidentally covered inside the OR test; a direct single-hop secondary-FK test asserting the ancestor rows would guard it explicitly.

Not blocking — the logic is sound — but for the foundation of the trinity I think these are worth having.

ttngu207
ttngu207 previously approved these changes Jul 1, 2026
MilagrosMarin
MilagrosMarin previously approved these changes Jul 1, 2026

@MilagrosMarin MilagrosMarin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 03dd0a7b on top of the already-approved #1468. All four coverage gaps @ttngu207 flagged still hold on the current tip.

Verified ✅

  • Structure. _propagate_restrictions_upstream (diagram.py:468) walks in_edges in reverse topo order (children before parents), applies _apply_propagation_rule_upward at each real edge, transparently steps through alias nodes. Multi-pass loop terminates on DAG.
  • OR-convergence. Multiple FK paths append to _cascade_restrictions[parent]; _restricted_table OR-combines the list via ft.restrict(list) → OrList. Same storage shape as cascade; OR semantics fall out of the reuse.
  • Rule 3 fix (child_ft.proj()child_ft.proj(*attr_map.keys()) at diagram.py:896). Old code projected to child's own PK, so non-primary FK columns dropped out and the parent's natural-join found no shared cols — restriction silently lost. New code carries FK columns. Cascade regression (36/36) confirms unaffected: Rule 3 only fires when child_attrs ⊄ parent_pk, which is rare in cascade's Part-to-Master walks.
  • __getitem__ gate on _mode == "trace". Non-trace diagrams fall through to networkx adjacency lookup; existing patterns keep working.
  • load_all_upstream. Structural mirror of load_all_downstream. Both adapter SQLs reverse the FK-schema query correctly.

Coverage gaps (concur with @ttngu207)

Verified by inspection of tests/integration/test_trace.py:

  1. Multi-hop diamond OR-convergence — the OR test uses two adjacent-edge paths; a Leaf → {Middle1, Middle2} → Root shape would exercise multi-pass accumulation.
  2. Cross-schema ancestors — no test spans two schemas; load_all_upstream's discovery path is unexercised.
  3. Upward Part-of-Part chainsPart doesn't appear in the file.
  4. Direct Rule 3 hit test — only exercised inside the mixed OR test; a single-hop non-primary-FK trace with an explicit assertion would guard the exact code path this PR touches.

Not blocking, but worth having before the trinity's later pieces (self.upstream, strict_provenance) build on trace — a subset-shaped regression here would silently loosen provenance downstream.

Small observations

  • Docstring drift. Rule 3's numbered docstring (diagram.py:866) still says child.proj() — text should mirror the fixed code.
  • Alias transparency reimplemented inline in _propagate_restrictions_upstream (lines 508–530) rather than reusing _find_real_edge_props at diagram.py:1007. Small refactor opportunity for future maintenance.

Bottom line

Substantive logic sound. Multi-hop diamond and cross-schema tests would be worth adding before merge; the other two coverage items and the observations can be follow-ups. Approving with the same non-blocking stance as @ttngu207.

dimitri-yatsenko and others added 2 commits July 1, 2026 12:09
…cstring

Addresses the before-merge coverage @ttngu207 and @MilagrosMarin flagged on
#1471 (the two they singled out as worth adding before merge):

- test_trace_multi_hop_diamond_or_convergence: an ancestor reached via two
  MULTI-HOP arms (Leaf -> {Left, Right} -> Root), asserting the OR-union
  {1, 2}. Guards the multi-pass accumulation in the reverse-topo walk, where a
  regression would silently drop an arm and yield a subset.
- test_trace_cross_schema_ancestor: seed and ancestor in different schemas,
  exercising load_all_upstream's unloaded-ancestor-schema discovery. Runs on
  both MySQL and PostgreSQL via schema_by_backend.

Also fixes the Backward Rule 3 docstring drift (diagram.py): it described
child.proj() but the code projects child.proj(*attr_map.keys()) to carry the
FK columns.

Upward Part-of-Part chains and an isolated non-aliased secondary-FK test remain
as noted follow-ups.
@dimitri-yatsenko dimitri-yatsenko dismissed stale reviews from MilagrosMarin and ttngu207 via 2d81534 July 2, 2026 15:34

@MilagrosMarin MilagrosMarin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2d81534a lands both coverage gaps we flagged before-merge (multi-hop diamond, cross-schema ancestor) plus the Rule 3 docstring fix. The diamond uses a renamed root_id2 on Right to force real multi-hop OR reconvergence, and excluded Root=3 proves it. Part-of-Part upward + isolated non-primary-FK explicitly deferred.

Re-approving.

@dimitri-yatsenko dimitri-yatsenko merged commit cf77c1e into master Jul 2, 2026
5 checks passed
@dimitri-yatsenko dimitri-yatsenko deleted the feat/1423-diagram-trace branch July 2, 2026 16:26
dimitri-yatsenko added a commit that referenced this pull request Jul 2, 2026
…verage

Addresses the review feedback from @ttngu207 and @MilagrosMarin on #1473:

1. test_upstream_cleared_after_make now captures the actual populate instance
   inside make() and asserts its _upstream is cleared afterward. The old test
   probed a fresh Derived().upstream (never set -> class default None), so it
   passed even if the finally-reset were deleted. Now it has teeth.
2. test_upstream_cleared_after_make_raises (new): forces make() to raise and
   asserts the populate instance's upstream is still cleared -- exercising the
   exception path the finally block exists for (previously zero coverage).
3. test_upstream_seen_across_tripartite_make now asserts, via id(self._upstream)
   grouped per key, that all three phases share one upstream object -- proving
   the docstring's 'constructed once, shared across phases' claim instead of
   only checking the result.
4. test_upstream_rejects_non_ancestor now exercises both the class-form and
   string-form Diagram.__getitem__ branches.

Validated locally on MySQL and PostgreSQL (full test_autopopulate.py: 18 passed,
2 skipped). Upward Part-of-Part and isolated-secondary-FK coverage remain
follow-ups (the diamond-OR and cross-schema gaps were closed at the trace layer
in #1471).
@dimitri-yatsenko dimitri-yatsenko added the feature Indicates new features label Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Indicates new features

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement Diagram.trace() for upstream restriction propagation

3 participants