
# Lark Reference

## What is Lark?

Lark is a general-purpose parsing library. It's written in Python, and supports two parsing algorithms: Earley (default) and LALR(1).

Lark is a re-write of my previous parsing library, [PlyPlus](https://github.com/erezsh/plyplus).

## Grammar

Lark accepts its grammars in [EBNF](https://www.wikiwand.com/en/Extended_Backus%E2%80%93Naur_form) form.

The grammar is a list of rules and tokens, each on its own line.

Rules can be defined on multiple lines when using the *OR* operator ( | ).

Comments start with // and last to the end of the line (C++ style).

Lark begins the parse with the rule 'start', unless specified otherwise in the options.

### Tokens

Tokens are defined in terms of:

    NAME : "string" or /regexp/

    NAME.ignore : ..

.ignore is a flag that drops the token before it reaches the parser. It is usually applied to whitespace.

Example:

    IF: "if"

    INTEGER : /[0-9]+/

    WHITESPACE.ignore: /[ \t\n]+/
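
To make the effect of the .ignore flag concrete, here is a minimal sketch of a lexer for the three tokens above, written with Python's standard `re` module. It is not Lark's lexer; the `TOKENS` table and the `tokenize` function are hypothetical names introduced only for illustration.

```python
import re

# Token definitions mirroring the example above: (name, pattern, ignore flag).
# The string "if" is escaped so it matches literally; regexps are used as written.
TOKENS = [
    ("IF", re.escape("if"), False),
    ("INTEGER", r"[0-9]+", False),
    ("WHITESPACE", r"[ \t\n]+", True),
]

def tokenize(text):
    """Return (name, value) pairs, skipping tokens flagged as ignored."""
    result = []
    pos = 0
    while pos < len(text):
        for name, pattern, ignore in TOKENS:
            match = re.compile(pattern).match(text, pos)
            if match:
                if not ignore:
                    result.append((name, match.group()))
                pos = match.end()
                break
        else:
            # No definition matched at this position.
            raise ValueError(f"No token matches at position {pos}")
    return result

print(tokenize("if 42\n"))  # [('IF', 'if'), ('INTEGER', '42')]
```

The whitespace is consumed like any other token, but it never appears in the output, so a parser reading this stream would not have to account for it.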

### Rules

Each rule is defined in terms of:

    name : list of items to match
         | another list of items    -> optional_alias
         | etc.

An alias is a name for the specific rule alternative. It affects tree construction.

An item is a:

 - rule
 - token
 - (item item ..) - Group items
 - [item item ..] - Maybe. Same as: "(item item ..)?"
 - item? - Zero or one instances of item ("maybe")
 - item\* - Zero or more instances of item
 - item+ - One or more instances of item
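
The grouping and repetition operators read much like their regular-expression counterparts. As a rough analogy only (items in a grammar are rules and tokens, not characters), the sketch below maps each operator to a Python `re` pattern, using the hypothetical stand-ins `a` and `b` for two items.

```python
import re

# Analogy only: "a" stands in for `item`, "b" for a second item.
ANALOGUES = {
    "(item item)": r"(?:ab)",
    "[item item]": r"(?:ab)?",   # same as "(item item)?"
    "item?": r"a?",
    "item*": r"a*",
    "item+": r"a+",
}

def accepts(operator, text):
    """True if the regex analogue of `operator` matches all of `text`."""
    return re.fullmatch(ANALOGUES[operator], text) is not None

print(accepts("[item item]", ""))  # True: the whole group is optional
print(accepts("item+", ""))        # False: at least one instance is required
```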


Example:

    float: "-"? DIGIT* "." DIGIT+ exp?
         | "-"? DIGIT+ exp

    exp: ("e" | "E") "-"? DIGIT+

    DIGIT: /[0-9]/

## Tree Construction

Lark builds a tree automatically based on the structure of the grammar. It also accepts some hints.

In general, Lark will place each rule as a branch, and its matches as the children of the branch.

Using item+ or item\* will result in a list of items.

Example:

    expr: "(" expr ")"
        | NAME+

    NAME: /\w+/

Lark will parse "((hello world))" as:

    expr
        expr
            expr
                "hello"
                "world"

The brackets do not appear in the tree by design.

Tokens that won't appear in the tree are:

 - Unnamed strings (like "keyword" or "+")
 - Tokens whose name starts with an underscore (like \_DIGIT)

Tokens that *will* appear in the tree are:

 - Unnamed regular expressions (like /[0-9]/)
 - Named tokens whose name starts with a letter (like DIGIT)
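
The shape of the tree for the expr grammar can be reproduced with a small hand-written recursive-descent parser. This is only a sketch of the resulting structure, not how Lark parses: `parse_expr` and `_expr` are hypothetical helpers, branches are represented as `(name, children)` tuples, and spaces between names are assumed to be skipped.

```python
import re

# Grammar being mimicked:
#   expr: "(" expr ")" | NAME+
#   NAME: /\w+/

def parse_expr(text):
    """Parse `text` and return a nested ("expr", children) tuple."""
    # Lex into "(", ")" and NAME strings. Characters matching none of
    # these, such as spaces, are skipped (an assumption of this sketch).
    tokens = re.findall(r"\(|\)|\w+", text)
    tree, pos = _expr(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return tree

def _expr(tokens, pos):
    """Parse one expr starting at `pos`; return (tree, next position)."""
    if pos < len(tokens) and tokens[pos] == "(":
        child, pos = _expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        # The unnamed strings "(" and ")" are left out of the branch.
        return ("expr", [child]), pos + 1
    names = []
    while pos < len(tokens) and tokens[pos] not in ("(", ")"):
        names.append(tokens[pos])  # NAME is a named token, so it is kept
        pos += 1
    if not names:
        raise SyntaxError("expected '(' or NAME")
    return ("expr", names), pos

print(parse_expr("((hello world))"))
```

Each pair of brackets contributes one expr branch and the innermost NAME+ match contributes one more, so the call above prints three nested expr branches with "hello" and "world" as the leaves.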

## Shaping the tree

a. Rules whose name begins with an underscore will be inlined into their containing rule.

Example:

    start: "(" _greet ")"
    _greet: /\w+/ /\w+/

Lark will parse "(hello world)" as:

    start
        "hello"
        "world"


b. Rules that receive a question mark (?) at the beginning of their definition will be inlined if they have a single child.

Example:

    start: greet greet
    ?greet: "(" /\w+/ ")"
          | /\w+/ /\w+/

Lark will parse "hello world (planet)" as:

    start
        greet
            "hello"
            "world"
        "planet"

c. Aliases - options in a rule can receive an alias. It will then be used as the branch name for the option.

Example:

    start: greet greet
    greet: "hello" -> hello
         | "world"

Lark will parse "hello world" as:

    start
        hello
        greet
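
The three shaping rules can be summarised as a single post-processing step over an unshaped tree. The sketch below is an illustration under stated assumptions, not Lark's implementation: `Node` and `shape` are hypothetical names, a node records the rule name exactly as written in the grammar (including a leading `_` or `?`), and tokens are plain strings.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class Node:
    rule: str                            # e.g. "start", "_greet" or "?greet"
    children: List[Union["Node", str]]   # sub-nodes or token strings
    alias: Optional[str] = None          # set when the matched option has "-> alias"

def shape(node):
    """Return the list of items that replaces `node` inside its parent."""
    if isinstance(node, str):
        return [node]                    # tokens are leaves
    kids = [item for child in node.children for item in shape(child)]
    if node.rule.startswith("_"):
        return kids                      # (a) always inlined
    if node.rule.startswith("?") and len(kids) == 1:
        return kids                      # (b) inlined when it has a single child
    name = node.alias or node.rule.lstrip("?")
    return [(name, kids)]                # (c) the alias replaces the branch name

# Example (b): "hello world (planet)"
raw = Node("start", [Node("?greet", ["hello", "world"]),
                     Node("?greet", ["planet"])])
print(shape(raw))
# [('start', [('greet', ['hello', 'world']), 'planet'])]
```

Returning a list rather than a single node is what lets an inlined rule splice any number of children into its parent.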

## Lark Options

When initializing the Lark object, you can provide it with keyword options:

- start - The start symbol (Default: "start")
- parser - Decides which parser engine to use, "earley" or "lalr". (Default: "earley")
           Note: Both will use Lark's lexer.
- transformer - Applies the transformer to every parse tree (only allowed with parser="lalr")
- only\_lex - Don't build a parser. Useful for debugging (Default: False)
- postlex - Lexer post-processing (Default: None)
- profile - Measure run-time usage in Lark. Read results from the profiler property (Default: False)

To be supported:

- debug
- cache\_grammar
- keep\_all\_tokens