# Lark Reference

## What is Lark?

Lark is a general-purpose parsing library. It's written in Python, and supports two parsing algorithms: Earley (default) and LALR(1).

Lark is a re-write of my previous parsing library, [PlyPlus](https://github.com/erezsh/plyplus).

## Grammar

Lark accepts its grammars in [EBNF](https://www.wikiwand.com/en/Extended_Backus%E2%80%93Naur_form) form.

The grammar is a list of rules and tokens, each on its own line.

Rules can be defined on multiple lines when using the *OR* operator ( | ).

Comments start with // and last to the end of the line (C++ style).

Lark begins the parse with the rule 'start', unless specified otherwise in the options.

### Tokens

Tokens are defined in terms of:

    NAME : "string" or /regexp/
                   
    NAME.ignore : ..

.ignore is a flag that drops the token before it reaches the parser (typically used for whitespace).

Example:

    IF: "if"

    INTEGER : /[0-9]+/

    WHITESPACE.ignore: /[ \t\n]+/
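
To make the effect of the .ignore flag concrete, here is a minimal sketch of a lexer loop written with Python's standard `re` module. It is not Lark's lexer; the `TOKEN_SPECS` table and the `tokenize` function are hypothetical names that only mirror the three tokens from the example above.

```python
import re

# Hypothetical token table mirroring the example above:
# (token name, regular expression, ignore flag).
TOKEN_SPECS = [
    ("IF", r"if", False),
    ("INTEGER", r"[0-9]+", False),
    ("WHITESPACE", r"[ \t\n]+", True),
]

def tokenize(text):
    """Yield (name, value) pairs, dropping tokens flagged as ignored."""
    pos = 0
    while pos < len(text):
        for name, pattern, ignore in TOKEN_SPECS:
            match = re.compile(pattern).match(text, pos)
            if match:
                # Ignored tokens are consumed but never reach the parser.
                if not ignore:
                    yield name, match.group()
                pos = match.end()
                break
        else:
            raise ValueError(f"No token matches at position {pos}")

tokens = list(tokenize("if 42\n"))
print(tokens)
```

Only the IF and INTEGER tokens come out of the generator; the whitespace is matched and discarded.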

### Rules

Each rule is defined in terms of:

    name : list of items to match
         | another list of items    -> optional_alias
         | etc.

An alias is a name for the specific rule alternative. It affects tree construction.

An item is a:
    
 - rule
 - token
 - (item item ..) - Group items
 - [item item ..] - Maybe. Same as: "(item item ..)?"
 - item? - Zero or one instances of item ("maybe")
 - item\* - Zero or more instances of item
 - item+ - One or more instances of item


Example:

    float: "-"? DIGIT* "." DIGIT+ exp
         | "-"? DIGIT+ exp

    exp: "-"? ("e" | "E") DIGIT+

    DIGIT: /[0-9]/

## Tree Construction

Lark builds a tree automatically based on the structure of the grammar. It also accepts some hints.

In general, Lark will place each rule as a branch, and its matches as the children of the branch.

Using item+ or item\* will result in a list of items.

Example:

    expr: "(" expr ")"
        | NAME+

    NAME: /\w+/

Lark will parse "((hello world))" as:

    expr
        expr
            expr
                "hello"
                "world"

The brackets do not appear in the tree by design.
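
The tree shape can be reproduced with a small hand-written recursive-descent parser. This is only a sketch of the idea, assuming a tree is a `(rule_name, children)` tuple; `parse_expr` is a hypothetical helper and not part of Lark, which uses Earley or LALR(1) rather than recursive descent.

```python
import re

def parse_expr(tokens, pos=0):
    """Parse  expr: "(" expr ")" | NAME+  and return (tree, next_pos)."""
    if pos < len(tokens) and tokens[pos] == "(":
        child, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        # The brackets are matched but not stored in the tree.
        return ("expr", [child]), pos + 1
    names = []
    while pos < len(tokens) and re.fullmatch(r"\w+", tokens[pos]):
        names.append(tokens[pos])  # NAME+ collects a list of items
        pos += 1
    if not names:
        raise SyntaxError("expected NAME")
    return ("expr", names), pos

# Split the input into names and brackets, skipping whitespace.
tokens = re.findall(r"\w+|[()]", "((hello world))")
tree, end = parse_expr(tokens)
print(tree)
```

Each pair of brackets adds one `expr` branch, and the innermost `expr` holds the list of names.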

Tokens that won't appear in the tree are:

 - Unnamed strings (like "keyword" or "+")
 - Tokens whose name starts with an underscore (like \_DIGIT)

Tokens that *will* appear in the tree are:

 - Unnamed regular expressions (like /[0-9]/)
 - Named tokens whose name starts with a letter (like DIGIT)
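
The four cases above can be summarized as a single predicate. This is a sketch of the rule as stated here, not code taken from Lark; `appears_in_tree` is a hypothetical name, and an unnamed token is represented by `name=None`.

```python
def appears_in_tree(name, is_regexp):
    """Return True if a matched token is kept as a child in the tree."""
    if name is None:
        # Unnamed tokens: strings are dropped, regular expressions are kept.
        return is_regexp
    # Named tokens are kept unless the name starts with an underscore.
    return not name.startswith("_")

print(appears_in_tree(None, False))     # unnamed string such as "+"
print(appears_in_tree(None, True))      # unnamed regexp such as /[0-9]/
print(appears_in_tree("_DIGIT", True))  # underscore-prefixed token
print(appears_in_tree("DIGIT", True))   # ordinary named token
```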

## Shaping the tree

a. Rules whose name begins with an underscore will be inlined into their containing rule.

Example:

    start: "(" _greet ")"
    _greet: /\w+/ /\w+/

Lark will parse "(hello world)" as:

    start
        "hello"
        "world"


b. Rules that receive a question mark (?) at the beginning of their definition will be inlined if they have a single child.

Example:

    start: greet greet
    ?greet: "(" /\w+/ ")"
          | /\w+/ /\w+/

Lark will parse "hello world (planet)" as:

    start
        greet
            "hello"
            "world"
        "planet"

c. Aliases - options in a rule can receive an alias. It will then be used as the branch name for the option.

Example:

    start: greet greet
    greet: "hello" -> hello
         | "world"

Lark will parse "hello world" as:

    start
        hello
        greet
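
The three shaping rules can be sketched as one function that decides what a matched rule contributes to its parent. This is an illustration of the behavior described above, not Lark's implementation: `build` is a hypothetical helper, branches are `(name, children)` tuples, and the sketch assumes that an alias has no effect on a rule that gets inlined.

```python
def build(rule, children, alias=None):
    """Return the list of nodes a match of `rule` adds to its parent."""
    if rule.startswith("_"):
        # (a) Underscore rules are always inlined.
        return list(children)
    if rule.startswith("?") and len(children) == 1:
        # (b) Question-mark rules are inlined when they have a single child.
        return list(children)
    # (c) Otherwise create a branch, named by the alias if one is given.
    name = alias or rule.lstrip("?")
    return [(name, list(children))]

# (a) "(hello world)" with  start: "(" _greet ")"
tree_a = ("start", build("_greet", ["hello", "world"]))

# (b) "hello world (planet)" with  start: greet greet
tree_b = ("start", build("?greet", ["hello", "world"]) + build("?greet", ["planet"]))

# (c) "hello world" with  greet: "hello" -> hello | "world"
tree_c = ("start", build("greet", [], alias="hello") + build("greet", []))

print(tree_a)
print(tree_b)
print(tree_c)
```

In case (c) both branches have no children because the unnamed strings "hello" and "world" are filtered out of the tree.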

## Lark Options

When initializing the Lark object, you can provide it with keyword options:

- start - The start symbol (Default: "start")
- parser - Decides which parser engine to use, "earley" or "lalr". (Default: "earley")
           Note: Both will use Lark's lexer.
- transformer - Applies the transformer to every parse tree (only allowed with parser="lalr")
- only\_lex - Don't build a parser. Useful for debugging (Default: False)
- postlex - Lexer post-processing (Default: None)
- profile - Measure run-time usage in Lark. Read results from the profiler property (Default: False)

To be supported:

- debug
- cache\_grammar
- keep\_all\_tokens