|
| 1 | +# Writing a Treesitter Parser |
| 2 | + |
| 3 | +## Treesitter CLI Util |
| 4 | + |
| 5 | +## AST & CST |
| 6 | + |
| 7 | +- A tree that contains only **named nodes** is an abstract syntax tree (AST). |
| 8 | +- A tree that contains **both named and unnamed nodes** is a concrete syntax tree (CST). |
| 9 | + |
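The difference between the two views can be sketched with a plain-object tree. This is only an illustration: the object shape, `toAst`, and `toSexp` are hypothetical helpers made up for this note, not part of the Treesitter API.

```javascript
// Hypothetical plain-object model of a syntax tree; this is not the tree-sitter API.
// `named: true` marks nodes produced by named rules, `named: false` marks literals.
const cst = {
  type: "binary_expression", named: true, children: [
    { type: "number", named: true, children: [] },
    { type: "+", named: false, children: [] },
    { type: "number", named: true, children: [] },
  ],
};

// Drop every unnamed node to get the AST view of the same tree.
function toAst(node) {
  return {
    ...node,
    children: node.children.filter((c) => c.named).map(toAst),
  };
}

// Render a tree as an S-expression; unnamed nodes print as quoted strings.
function toSexp(node) {
  if (!node.named) return JSON.stringify(node.type);
  const kids = node.children.map(toSexp);
  return `(${[node.type, ...kids].join(" ")})`;
}

console.log(toSexp(cst));        // CST view: (binary_expression (number) "+" (number))
console.log(toSexp(toAst(cst))); // AST view: (binary_expression (number) (number))
```

The AST view is derived from the CST by filtering, so no information has to be parsed twice.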
| 10 | +## Grammar Structure |
| 11 | + |
| 12 | +- `name`: the name of the parser |
| 13 | +- `rules`: the rules used to generate nodes |
| 14 | +- See the [doc](https://tree-sitter.github.io/tree-sitter/creating-parsers/2-the-grammar-dsl.html) for the other properties of a grammar. |
| 15 | + |
| 16 | +## Tree Structure |
| 17 | + |
| 18 | +A Treesitter syntax tree (and its queries) is generally written as S-expressions in the style of Scheme, a Lisp dialect. |
| 19 | +Each `()` opens a new list; each element of the list is either a node type (a named node) or a string literal (an unnamed node). |
| 20 | +The **node type** is the name of the rule that matched that section of the source. |
| 21 | + |
| 22 | +- **field**: An element may carry a *field name*, such as `kind: "const"`, which gives the element (node) a **descriptive** name in context. |
| 23 | +- **token**: Each atomic node is considered a token, such as `(number)` and `(comment)` in the example. |
| 24 | + |
| 25 | +> [!NOTE] |
| 26 | +> An unnamed node can have a *field name*; the field name describes the node's role in the tree, not the nominal identity of that node. |
| 27 | +
|
| 28 | +```query |
| 29 | +; generated tree for javascript code |
| 30 | +; const foo = 1 + 2; // this is foo |
| 31 | +(program ; [0, 0] - [1, 0] |
| 32 | + (lexical_declaration ; [0, 0] - [0, 18] |
| 33 | + kind: "const" ; [0, 0] - [0, 5] |
| 34 | + (variable_declarator ; [0, 6] - [0, 17] |
| 35 | + name: (identifier) ; [0, 6] - [0, 9] |
| 36 | + "=" ; [0, 10] - [0, 11] |
| 37 | + value: (binary_expression ; [0, 12] - [0, 17] |
| 38 | + left: (number) ; [0, 12] - [0, 13] |
| 39 | + operator: "+" ; [0, 14] - [0, 15] |
| 40 | + right: (number))) ; [0, 16] - [0, 17] |
| 41 | + ";") ; [0, 17] - [0, 18] |
| 42 | + (comment)) ; [0, 19] - [0, 33] |
| 43 | +``` |
| 44 | + |
| 45 | + |
| 46 | + |
| 47 | +## Writing Rules |
| 48 | + |
| 49 | +1. **The top-level rule**: the most general wrapper rule, covering all content that can be parsed. |
| 50 | + - **The top-level rule MUST be the first rule declared in the `rules` field.** |
| 51 | + - The name of the top-level rule is arbitrary and usually follows the language specification. |
| 52 | + - `C#`, for example, uses `compilation_unit` as the name of its top-level rule. |
| 53 | +```js |
| 54 | +module.exports = grammar({ |
| 55 | + name: 'c_sharp', |
| 56 | + rules: { |
| 57 | + /*...*/ |
| 58 | + compilation_unit: $ => seq( // must be the first rule // [!code highlight] |
| 59 | + optional($.shebang_directive), |
| 60 | + repeat($._top_level_item), |
| 61 | + ), |
| 62 | + _top_level_item: $ => prec(2, choice( |
| 63 | + $._top_level_item_no_statement, |
| 64 | + $.global_statement, |
| 65 | + )), |
| 66 | + /*...*/ |
| 67 | + } |
| 68 | +}); |
| 69 | +``` |
| 70 | + |
| 71 | +### Named & Unnamed Nodes |
| 72 | + |
| 73 | +A node generated by a rule that is assigned to a property of `rules` is called a *named node*. |
| 74 | +A node generated by a rule written inline as a string literal or regex is an *unnamed node*. |
| 75 | + |
| 76 | +```js |
| 77 | +module.exports = grammar({ |
| 78 | + name: 'foo', |
| 79 | + rules: { |
| 80 | + if_statement: $ => seq("if", "(", $._expression, ")", $._statement), // "if", "(" and ")" become unnamed nodes |
| 81 | + }, |
| 82 | +}); |
| 83 | +``` |
| 84 | + |
| 85 | +> [!NOTE] |
| 86 | +> Unnamed nodes are not shown in the default Treesitter tree output, but they do exist in the structure and can be inspected. |
| 87 | +> They are identified by their literal text rather than by a rule name. |
| 88 | +
|
| 89 | +### Aliased Rule |
| 90 | + |
| 91 | + |
| 92 | +### Tokenized Rule |
| 93 | + |
| 94 | +`token(rule)` turns a complex rule into an **atomic** node: tree-sitter matches the rule but does not generate a concrete subtree for it. |
| 95 | +The following rule makes a comment appear as `(comment)` in the concrete tree, without the unnamed nodes that matched the pattern. |
| 96 | + |
| 97 | +```js |
| 98 | +module.exports = grammar({ |
| 99 | + name: 'foo', |
| 100 | + rules: { |
| 101 | + /* ... */ |
| 102 | + comment: _ => token(choice( |
| 103 | + seq('//', /[^\n\r]*/), |
| 104 | + seq( |
| 105 | + '/*', |
| 106 | + /[^*]*\*+([^/*][^*]*\*+)*/, |
| 107 | + '/', |
| 108 | + ), |
| 109 | + )), |
| 110 | + }, |
| 111 | +}); |
| 112 | +``` |
| 113 | + |
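To get a feel for what the `comment` token accepts, its two branches can be tried as ordinary JavaScript regular expressions. This is a rough sketch only: tree-sitter compiles the patterns with its own lexer, so JavaScript's `RegExp` is a stand-in here, and `matchComment` is a hypothetical helper name.

```javascript
// JavaScript stand-ins for the two branches of the `comment` token.
// Tree-sitter uses its own regex engine, so this only approximates the lexer.
const lineComment = /\/\/[^\n\r]*/;
const blockComment = /\/\*[^*]*\*+([^/*][^*]*\*+)*\//;

// Return the comment matched at the very start of `text`, or null if none.
function matchComment(text) {
  for (const pattern of [lineComment, blockComment]) {
    const m = new RegExp("^(?:" + pattern.source + ")").exec(text);
    if (m) return m[0];
  }
  return null;
}

console.log(matchComment("// this is foo\nconst x = 1")); // the line comment only
console.log(matchComment("/* a * b */ rest"));            // the block comment only
console.log(matchComment("/* never closed"));             // null
```

Because the whole match is a single token, the `//`, `/*` and `/` pieces never show up as separate nodes in the tree.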
| 114 | +### Node Description |
| 115 | + |
| 116 | +A field is a **descriptive name** for the meaning of a node in a certain context. |
| 117 | + |
| 118 | +The following rule gives a descriptive name to each named child of the function node. |
| 119 | + |
| 120 | +```js |
| 121 | +module.exports = grammar({ |
| 122 | + name: 'foo', |
| 123 | + rules: { |
| 124 | + /* ... */ |
| 125 | + function_definition: $ => |
| 126 | + seq( |
| 127 | + "func", |
| 128 | + field("name", $.identifier), |
| 129 | + field("parameters", $.parameter_list), |
| 130 | + field("return_type", $._type), |
| 131 | + field("body", $.block), |
| 132 | + ) |
| 133 | + }, |
| 134 | +}); |
| 135 | +``` |
| 136 | + |
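What fields buy you is lookup by role instead of by position. The sketch below models the node from the rule above as a plain object. The object shape, the `type_identifier` node type, and `childForField` are assumptions made for illustration. The real bindings expose a comparable lookup (for example `childForFieldName` in the JavaScript bindings), but check their documentation for the exact signature.

```javascript
// Hypothetical plain-object model of a node produced by `function_definition`.
// Each child records the field name it was given in the rule, or null if none.
const functionNode = {
  type: "function_definition",
  children: [
    { field: null, type: "func", named: false },
    { field: "name", type: "identifier", named: true },
    { field: "parameters", type: "parameter_list", named: true },
    { field: "return_type", type: "type_identifier", named: true }, // assumed type
    { field: "body", type: "block", named: true },
  ],
};

// Look a child up by its role in the parent rather than by position or type.
function childForField(node, fieldName) {
  return node.children.find((c) => c.field === fieldName) ?? null;
}

console.log(childForField(functionNode, "name").type);
console.log(childForField(functionNode, "missing"));
```

The `"func"` keyword has no field, so it can only be reached by walking the children.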