Skip to content

[Refactor] New rule parser to support AST#104

Open
colinthebomb1 wants to merge 6 commits intomainfrom
colin/rule-parser-parse-v2
Open

[Refactor] New rule parser to support AST#104
colinthebomb1 wants to merge 6 commits intomainfrom
colin/rule-parser-parse-v2

Conversation

@colinthebomb1
Copy link
Copy Markdown
Collaborator

Summary

This PR introduces and stabilizes RuleParser v2 along with dedicated v2 test coverage.

Highlights

  • New RuleParserV2 that works with our AST Nodes.
  • Similar structure to original rule parser, but added step for substituing variables with corresponding nodes.
  • Aligns v2 variable semantics and node naming for clarity:
    • element variables (<x>)
    • set variables (<<y>>)
  • New comprehensive v2 tests in test_rule_parser_v2.py.

Testing

  • Ran tests/test_rule_parser_v2.py
  • Ran full test suite to confirm compatibility with existing behavior
  • All tests passing

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces RuleParserV2, a new rule parsing implementation that generates AST node structures instead of JSON, along with a comprehensive refactoring of variable node classes.

Changes:

  • New RuleParserV2 module that parses rules into AST structures with ElementVariableNode and SetVariableNode for rule variables
  • Renamed VarNode → ElementVariableNode and VarSetNode → SetVariableNode throughout the codebase for semantic clarity
  • Updated internal token naming from V/VL to EV/SV (ElementVariable/SetVariable) to align with new v2 semantics
  • Comprehensive test coverage (745 lines) for RuleParserV2 including parametrized tests over entire rule catalog
  • Code cleanup in test_rule_parser.py (import consolidation and formatting)

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file
File Description
core/rule_parser_v2.py New RuleParserV2 implementation with 6-step pipeline for rule parsing and AST generation
tests/test_rule_parser_v2.py Comprehensive test suite covering extendToFullSQL, replaceVars, parsing, variable substitution, and rule catalog validation
core/ast/node.py Renamed VarNode/VarSetNode to ElementVariableNode/SetVariableNode with updated docstrings
core/ast/init.py Updated exports to use new class names
tests/test_rule_parser.py Import consolidation and code cleanup (formatting and indentation fixes)
tests/test_ast.py Updated imports to use new ElementVariableNode/SetVariableNode names
tests/ast_util.py Updated imports and comments to reference new variable node class names
core/query_parser.py Updated imports and TODO comment to reference new class names

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Collaborator

@HazelYuAhiru HazelYuAhiru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM! Just with a small comment.

class VarNode(Node):
"""VarSQL variable node"""
class ElementVariableNode(Node):
"""Rule element variable ``<name>`` (see ``VarType.ElementVariable`` in rule_parser_v2)."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ElementVariableNode don't have their own __eq__ or __hash__, so they both inherit Node's implementation, which only compares type and children. The .name field is invisible to equality checks so something like ElementVariableNode("x") == ElementVariableNode("y") will return True

class VarSetNode(Node):
"""VarSQL variable set node"""
class SetVariableNode(Node):
"""Rule set variable ``<<name>>`` (see ``VarType.SetVariable`` in rule_parser_v2)."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +443 to +455
if node.type == NodeType.FUNCTION:
f = node
if not isinstance(f, FunctionNode):
return node
new_args = [RuleParserV2._substitute_placeholders(a, rev) for a in f.children]
return FunctionNode(f.name, _args=new_args, _alias=f.alias)

if node.type == NodeType.LIST:
ln = node
if not isinstance(ln, ListNode):
return node
return ListNode([RuleParserV2._substitute_placeholders(c, rev) for c in ln.children])

Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RuleParserV2._substitute_placeholders currently substitutes tokens in ColumnNode/TableNode/aliases but does not handle other places where internal EV/SV tokens can legally appear (e.g., LIMIT/OFFSET values like LIMIT <x3> and literals like '%<y>%'). This can leave internal tokens in the resulting v2 AST, making the output inconsistent with the goal of returning an AST with external variable names/nodes. Consider extending substitution to cover LimitNode/OffsetNode (and any other node fields that can carry identifiers) and adding test assertions for these cases so token leakage is caught.

Copilot uses AI. Check for mistakes.
Comment on lines 1 to +9
from core.ast.node import (
Node, QueryNode, SelectNode, FromNode, WhereNode, TableNode, ColumnNode,
LiteralNode, DataTypeNode, TimeUnitNode, IntervalNode,
CaseNode, WhenThenNode,
OperatorNode, UnaryOperatorNode, FunctionNode, GroupByNode, HavingNode,
OrderByNode, OrderByItemNode, LimitNode, OffsetNode, SubqueryNode,
VarNode, VarSetNode, JoinNode, ListNode
ElementVariableNode, SetVariableNode, JoinNode, ListNode
)
# TODO: implement VarNode, VarSetNode
# TODO: implement ElementVariableNode, SetVariableNode
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

core/query_parser.py imports ElementVariableNode and SetVariableNode but never references them (only mentioned in a TODO). If QueryParser isn’t responsible for emitting these node types yet, consider removing the unused imports (and adjusting the TODO) to avoid dead code / potential lint failures.

Copilot uses AI. Check for mistakes.
Comment on lines +91 to +117
Known limitation: ``_substitute_placeholders`` does not replace a ColumnNode's
name when the column is qualified (``parent_alias`` is set) but the parent alias
itself is a literal not present in the internal-to-external mapping. We only flag
unqualified columns and qualified columns whose parent alias is also an internal
token.
"""
internal_tokens = set(result.mapping.values())

for tree_label, tree in [("pattern", result.pattern_ast), ("rewrite", result.rewrite_ast)]:
for n in _walk(tree):
if isinstance(n, ColumnNode):
if n.parent_alias is None:
assert not _TOKEN_RE.match(n.name), (
f"{tree_label} AST has raw internal token {n.name!r} "
f"as unqualified ColumnNode"
)
elif n.parent_alias in internal_tokens:
assert not _TOKEN_RE.match(n.parent_alias), (
f"{tree_label} AST has raw internal token {n.parent_alias!r} "
f"as ColumnNode.parent_alias"
)
assert not _TOKEN_RE.match(n.name), (
f"{tree_label} AST has raw internal token {n.name!r} "
f"as ColumnNode.name (parent_alias was also an internal token)"
)
# else: parent_alias is a literal (e.g. "t") and name is an internal
# token — known gap in _substitute_placeholders; skip.
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring for _assert_no_internal_tokens describes a “known limitation” where _substitute_placeholders doesn’t replace ColumnNode.name for qualified columns when parent_alias is a literal. In the current RuleParserV2 implementation, the pa is not None and nm in rev branch does replace the name in that case, so this comment appears outdated/misleading. Please update the comment (or tighten the assertion) to reflect the actual substitution behavior so the test doesn’t mask real token leaks.

Suggested change
Known limitation: ``_substitute_placeholders`` does not replace a ColumnNode's
name when the column is qualified (``parent_alias`` is set) but the parent alias
itself is a literal not present in the internal-to-external mapping. We only flag
unqualified columns and qualified columns whose parent alias is also an internal
token.
"""
internal_tokens = set(result.mapping.values())
for tree_label, tree in [("pattern", result.pattern_ast), ("rewrite", result.rewrite_ast)]:
for n in _walk(tree):
if isinstance(n, ColumnNode):
if n.parent_alias is None:
assert not _TOKEN_RE.match(n.name), (
f"{tree_label} AST has raw internal token {n.name!r} "
f"as unqualified ColumnNode"
)
elif n.parent_alias in internal_tokens:
assert not _TOKEN_RE.match(n.parent_alias), (
f"{tree_label} AST has raw internal token {n.parent_alias!r} "
f"as ColumnNode.parent_alias"
)
assert not _TOKEN_RE.match(n.name), (
f"{tree_label} AST has raw internal token {n.name!r} "
f"as ColumnNode.name (parent_alias was also an internal token)"
)
# else: parent_alias is a literal (e.g. "t") and name is an internal
# token — known gap in _substitute_placeholders; skip.
This assertion checks identifier-bearing AST fields that should contain external
names after placeholder substitution, including both qualified and unqualified
``ColumnNode.name`` values.
"""
internal_tokens = set(result.mapping.values())
for tree_label, tree in [("pattern", result.pattern_ast), ("rewrite", result.rewrite_ast)]:
for n in _walk(tree):
if isinstance(n, ColumnNode):
assert not _TOKEN_RE.match(n.name), (
f"{tree_label} AST has raw internal token {n.name!r} as ColumnNode.name"
)
if n.parent_alias in internal_tokens:
assert not _TOKEN_RE.match(n.parent_alias), (
f"{tree_label} AST has raw internal token {n.parent_alias!r} "
f"as ColumnNode.parent_alias"
)

Copilot uses AI. Check for mistakes.
Comment on lines 8 to 16
from .node import (
Node,
TableNode,
SubqueryNode,
ColumnNode,
LiteralNode,
VarNode,
VarSetNode,
ElementVariableNode,
SetVariableNode,
OperatorNode,
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

core/ast/init.py lists 'OrderByItemNode' in all, but it is not imported from .node in this module. This makes from core.ast import OrderByItemNode fail at runtime despite being advertised as public API. Either import OrderByItemNode here or remove it from all to keep exports consistent.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants