[Refactor] New rule parser to support AST#104
Conversation
bfc2e15 to
0fd8c98
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces RuleParserV2, a new rule parsing implementation that generates AST node structures instead of JSON, along with a comprehensive refactoring of variable node classes.
Changes:
- New RuleParserV2 module that parses rules into AST structures with ElementVariableNode and SetVariableNode for rule variables
- Renamed VarNode → ElementVariableNode and VarSetNode → SetVariableNode throughout the codebase for semantic clarity
- Updated internal token naming from V/VL to EV/SV (ElementVariable/SetVariable) to align with new v2 semantics
- Comprehensive test coverage (745 lines) for RuleParserV2 including parametrized tests over entire rule catalog
- Code cleanup in test_rule_parser.py (import consolidation and formatting)
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| core/rule_parser_v2.py | New RuleParserV2 implementation with 6-step pipeline for rule parsing and AST generation |
| tests/test_rule_parser_v2.py | Comprehensive test suite covering extendToFullSQL, replaceVars, parsing, variable substitution, and rule catalog validation |
| core/ast/node.py | Renamed VarNode/VarSetNode to ElementVariableNode/SetVariableNode with updated docstrings |
| core/ast/init.py | Updated exports to use new class names |
| tests/test_rule_parser.py | Import consolidation and code cleanup (formatting and indentation fixes) |
| tests/test_ast.py | Updated imports to use new ElementVariableNode/SetVariableNode names |
| tests/ast_util.py | Updated imports and comments to reference new variable node class names |
| core/query_parser.py | Updated imports and TODO comment to reference new class names |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
HazelYuAhiru
left a comment
There was a problem hiding this comment.
Overall LGTM! Just with a small comment.
| class VarNode(Node): | ||
| """VarSQL variable node""" | ||
| class ElementVariableNode(Node): | ||
| """Rule element variable ``<name>`` (see ``VarType.ElementVariable`` in rule_parser_v2).""" |
There was a problem hiding this comment.
ElementVariableNode don't have their own __eq__ or __hash__, so they both inherit Node's implementation, which only compares type and children. The .name field is invisible to equality checks so something like ElementVariableNode("x") == ElementVariableNode("y") will return True
| class VarSetNode(Node): | ||
| """VarSQL variable set node""" | ||
| class SetVariableNode(Node): | ||
| """Rule set variable ``<<name>>`` (see ``VarType.SetVariable`` in rule_parser_v2).""" |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if node.type == NodeType.FUNCTION: | ||
| f = node | ||
| if not isinstance(f, FunctionNode): | ||
| return node | ||
| new_args = [RuleParserV2._substitute_placeholders(a, rev) for a in f.children] | ||
| return FunctionNode(f.name, _args=new_args, _alias=f.alias) | ||
|
|
||
| if node.type == NodeType.LIST: | ||
| ln = node | ||
| if not isinstance(ln, ListNode): | ||
| return node | ||
| return ListNode([RuleParserV2._substitute_placeholders(c, rev) for c in ln.children]) | ||
|
|
There was a problem hiding this comment.
RuleParserV2._substitute_placeholders currently substitutes tokens in ColumnNode/TableNode/aliases but does not handle other places where internal EV/SV tokens can legally appear (e.g., LIMIT/OFFSET values like LIMIT <x3> and literals like '%<y>%'). This can leave internal tokens in the resulting v2 AST, making the output inconsistent with the goal of returning an AST with external variable names/nodes. Consider extending substitution to cover LimitNode/OffsetNode (and any other node fields that can carry identifiers) and adding test assertions for these cases so token leakage is caught.
| from core.ast.node import ( | ||
| Node, QueryNode, SelectNode, FromNode, WhereNode, TableNode, ColumnNode, | ||
| LiteralNode, DataTypeNode, TimeUnitNode, IntervalNode, | ||
| CaseNode, WhenThenNode, | ||
| OperatorNode, UnaryOperatorNode, FunctionNode, GroupByNode, HavingNode, | ||
| OrderByNode, OrderByItemNode, LimitNode, OffsetNode, SubqueryNode, | ||
| VarNode, VarSetNode, JoinNode, ListNode | ||
| ElementVariableNode, SetVariableNode, JoinNode, ListNode | ||
| ) | ||
| # TODO: implement VarNode, VarSetNode | ||
| # TODO: implement ElementVariableNode, SetVariableNode |
There was a problem hiding this comment.
core/query_parser.py imports ElementVariableNode and SetVariableNode but never references them (only mentioned in a TODO). If QueryParser isn’t responsible for emitting these node types yet, consider removing the unused imports (and adjusting the TODO) to avoid dead code / potential lint failures.
| Known limitation: ``_substitute_placeholders`` does not replace a ColumnNode's | ||
| name when the column is qualified (``parent_alias`` is set) but the parent alias | ||
| itself is a literal not present in the internal-to-external mapping. We only flag | ||
| unqualified columns and qualified columns whose parent alias is also an internal | ||
| token. | ||
| """ | ||
| internal_tokens = set(result.mapping.values()) | ||
|
|
||
| for tree_label, tree in [("pattern", result.pattern_ast), ("rewrite", result.rewrite_ast)]: | ||
| for n in _walk(tree): | ||
| if isinstance(n, ColumnNode): | ||
| if n.parent_alias is None: | ||
| assert not _TOKEN_RE.match(n.name), ( | ||
| f"{tree_label} AST has raw internal token {n.name!r} " | ||
| f"as unqualified ColumnNode" | ||
| ) | ||
| elif n.parent_alias in internal_tokens: | ||
| assert not _TOKEN_RE.match(n.parent_alias), ( | ||
| f"{tree_label} AST has raw internal token {n.parent_alias!r} " | ||
| f"as ColumnNode.parent_alias" | ||
| ) | ||
| assert not _TOKEN_RE.match(n.name), ( | ||
| f"{tree_label} AST has raw internal token {n.name!r} " | ||
| f"as ColumnNode.name (parent_alias was also an internal token)" | ||
| ) | ||
| # else: parent_alias is a literal (e.g. "t") and name is an internal | ||
| # token — known gap in _substitute_placeholders; skip. |
There was a problem hiding this comment.
The docstring for _assert_no_internal_tokens describes a “known limitation” where _substitute_placeholders doesn’t replace ColumnNode.name for qualified columns when parent_alias is a literal. In the current RuleParserV2 implementation, the pa is not None and nm in rev branch does replace the name in that case, so this comment appears outdated/misleading. Please update the comment (or tighten the assertion) to reflect the actual substitution behavior so the test doesn’t mask real token leaks.
| Known limitation: ``_substitute_placeholders`` does not replace a ColumnNode's | |
| name when the column is qualified (``parent_alias`` is set) but the parent alias | |
| itself is a literal not present in the internal-to-external mapping. We only flag | |
| unqualified columns and qualified columns whose parent alias is also an internal | |
| token. | |
| """ | |
| internal_tokens = set(result.mapping.values()) | |
| for tree_label, tree in [("pattern", result.pattern_ast), ("rewrite", result.rewrite_ast)]: | |
| for n in _walk(tree): | |
| if isinstance(n, ColumnNode): | |
| if n.parent_alias is None: | |
| assert not _TOKEN_RE.match(n.name), ( | |
| f"{tree_label} AST has raw internal token {n.name!r} " | |
| f"as unqualified ColumnNode" | |
| ) | |
| elif n.parent_alias in internal_tokens: | |
| assert not _TOKEN_RE.match(n.parent_alias), ( | |
| f"{tree_label} AST has raw internal token {n.parent_alias!r} " | |
| f"as ColumnNode.parent_alias" | |
| ) | |
| assert not _TOKEN_RE.match(n.name), ( | |
| f"{tree_label} AST has raw internal token {n.name!r} " | |
| f"as ColumnNode.name (parent_alias was also an internal token)" | |
| ) | |
| # else: parent_alias is a literal (e.g. "t") and name is an internal | |
| # token — known gap in _substitute_placeholders; skip. | |
| This assertion checks identifier-bearing AST fields that should contain external | |
| names after placeholder substitution, including both qualified and unqualified | |
| ``ColumnNode.name`` values. | |
| """ | |
| internal_tokens = set(result.mapping.values()) | |
| for tree_label, tree in [("pattern", result.pattern_ast), ("rewrite", result.rewrite_ast)]: | |
| for n in _walk(tree): | |
| if isinstance(n, ColumnNode): | |
| assert not _TOKEN_RE.match(n.name), ( | |
| f"{tree_label} AST has raw internal token {n.name!r} as ColumnNode.name" | |
| ) | |
| if n.parent_alias in internal_tokens: | |
| assert not _TOKEN_RE.match(n.parent_alias), ( | |
| f"{tree_label} AST has raw internal token {n.parent_alias!r} " | |
| f"as ColumnNode.parent_alias" | |
| ) |
| from .node import ( | ||
| Node, | ||
| TableNode, | ||
| SubqueryNode, | ||
| ColumnNode, | ||
| LiteralNode, | ||
| VarNode, | ||
| VarSetNode, | ||
| ElementVariableNode, | ||
| SetVariableNode, | ||
| OperatorNode, |
There was a problem hiding this comment.
core/ast/init.py lists 'OrderByItemNode' in all, but it is not imported from .node in this module. This makes from core.ast import OrderByItemNode fail at runtime despite being advertised as public API. Either import OrderByItemNode here or remove it from all to keep exports consistent.
Summary
This PR introduces and stabilizes RuleParser v2 along with dedicated v2 test coverage.
Highlights
<x>)<<y>>)test_rule_parser_v2.py.Testing
tests/test_rule_parser_v2.py