Lark Reference

What is Lark?

Lark is a general-purpose parsing library. It's written in Python, and supports two parsing algorithms: Earley (default) and LALR(1).

Lark is a re-write of my previous parsing library, PlyPlus.

Grammar

Lark accepts its grammars in EBNF form.

The grammar is a list of rules and tokens, each on its own line.

Rules can be defined on multiple lines when using the OR operator ( | ).

Comments start with // and run to the end of the line (C++ style).

Lark begins the parse with the rule 'start', unless specified otherwise in the options.

Tokens

Tokens are defined in terms of:

NAME : "string" or /regexp/
               
NAME.ignore : ..

The .ignore flag drops the token before it reaches the parser. It is typically used for whitespace.

Example:

IF: "if"

INTEGER : /[0-9]+/

WHITESPACE.ignore: /[ \t\n]+/
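
To make the effect of the .ignore flag concrete, here is a minimal sketch of a lexer built on Python's `re` module. It is not Lark's lexer; the `TOKENS` table and the `tokenize` helper are hypothetical names that mirror the three definitions above.

```python
import re

# (name, compiled pattern, ignore flag), mirroring the example above.
# Illustrative stand-in only; Lark's real lexer works differently.
TOKENS = [
    ("IF", re.compile(r"if"), False),
    ("INTEGER", re.compile(r"[0-9]+"), False),
    ("WHITESPACE", re.compile(r"[ \t\n]+"), True),
]

def tokenize(text):
    """Return (name, value) pairs, skipping tokens flagged as ignored."""
    result = []
    pos = 0
    while pos < len(text):
        # Try the definitions in order; the first one that matches wins.
        for name, pattern, ignore in TOKENS:
            match = pattern.match(text, pos)
            if match:
                if not ignore:
                    result.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
    return result

print(tokenize("if 42\n"))  # [('IF', 'if'), ('INTEGER', '42')]
```

The whitespace is still matched and consumed, but it never appears in the output, so a parser sitting behind this lexer would only see IF and INTEGER.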

Rules

Each rule is defined in terms of:

name : list of items to match
     | another list of items    -> optional_alias
     | etc.

An alias is a name for the specific rule alternative. It affects tree construction.

An item is a:

  • rule
  • token
  • (item item ..) - Group items
  • [item item ..] - Maybe. Same as: "(item item ..)?"
  • item? - Zero or one instances of item ("maybe")
  • item* - Zero or more instances of item
  • item+ - One or more instances of item

Example:

float: "-"? DIGIT* "." DIGIT+ exp?
     | "-"? DIGIT+ exp

exp: ("e" | "E") "-"? DIGIT+

DIGIT: /[0-9]/
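
The ?, * and + operators behave much like the regex quantifiers of the same name. As a rough illustration, the sequence `"-"? DIGIT* "." DIGIT+` from the first alternative can be written as a single Python regular expression. The names `MANTISSA` and `is_mantissa` are made up for this sketch and are not part of Lark.

```python
import re

# "-"? DIGIT* "." DIGIT+  written as one regular expression:
#   "-"?    ->  -?       zero or one minus sign
#   DIGIT*  ->  [0-9]*   zero or more digits
#   "."     ->  \.       a literal dot
#   DIGIT+  ->  [0-9]+   one or more digits
MANTISSA = re.compile(r"-?[0-9]*\.[0-9]+")

def is_mantissa(text):
    """True if the whole string matches the sequence above."""
    return MANTISSA.fullmatch(text) is not None

for sample in ["3.14", "-.5", "7.", "42"]:
    print(sample, is_mantissa(sample))
```

Here "3.14" and "-.5" match, while "7." fails because DIGIT+ needs at least one digit after the dot, and "42" fails because the dot is mandatory.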

Tree Construction

Lark builds a tree automatically based on the structure of the grammar. It also accepts some hints.

In general, Lark will place each rule as a branch, and its matches as the children of the branch.

Using item+ or item* will result in a list of items.

Example:

expr: "(" expr ")"
    | NAME+

NAME: /\w+/

Lark will parse "((hello world))" as:

expr
    expr
        expr
            "hello"
            "world"

The brackets do not appear in the tree by design.
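
To show where the branches come from, here is a hand-written recursive-descent sketch of the `expr` grammar in plain Python. It is not how Lark parses (Lark uses Earley or LALR(1)); `tokenize`, `parse_expr` and `parse` are hypothetical helpers, and the tree is modelled as nested `(name, children)` tuples.

```python
import re

def tokenize(text):
    """Split the input into "(", ")" and word tokens.

    Anything else (such as spaces) is silently skipped by findall.
    """
    return re.findall(r"\(|\)|\w+", text)

def parse_expr(tokens, pos=0):
    """Parse  expr: "(" expr ")" | NAME+  and return (tree, next_position)."""
    if pos < len(tokens) and tokens[pos] == "(":
        child, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        # The parentheses are unnamed strings, so they are left out of the tree.
        return ("expr", [child]), pos + 1
    names = []
    while pos < len(tokens) and tokens[pos] not in ("(", ")"):
        names.append(tokens[pos])  # NAME+ collects a flat list of matches
        pos += 1
    if not names:
        raise SyntaxError("expected NAME")
    return ("expr", names), pos

def parse(text):
    tokens = tokenize(text)
    tree, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return tree

print(parse("((hello world))"))
# ('expr', [('expr', [('expr', ['hello', 'world'])])])
```

With two pairs of parentheses the result has three nested expr branches: one per pair, plus the innermost NAME+ match.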

Tokens that won't appear in the tree are:

  • Unnamed strings (like "keyword" or "+")
  • Tokens whose name starts with an underscore (like _DIGIT)

Tokens that will appear in the tree are:

  • Unnamed regular expressions (like /[0-9]/)
  • Named tokens whose name starts with a letter (like DIGIT)

Shaping the tree

a. Rules whose name begins with an underscore will be inlined into their containing rule.

Example:

start: "(" _greet ")"
_greet: /\w+/ /\w+/

Lark will parse "(hello world)" as:

start
    "hello"
    "world"

b. Rules that receive a question mark (?) at the beginning of their definition will be inlined if they have a single child.

Example:

start: greet greet
?greet: "(" /\w+/ ")"
      | /\w+/ /\w+/

Lark will parse "hello world (planet)" as:

start
    greet
        "hello"
        "world"
    "planet"

c. Aliases - each alternative in a rule can receive an alias, which is then used as the branch name for that alternative.

Example:

start: greet greet
greet: "hello" -> hello
     | "world"

Lark will parse "hello world" as:

start
    hello
    greet
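
Aliases only rename a branch, but the two inlining rules (a) and (b) change the shape of the tree. The sketch below models them as a pass over an already-built tree of `(name, children)` tuples. This is purely illustrative and is not how Lark implements them internally; the `shape` function and its `maybe_inline` parameter are hypothetical.

```python
def shape(node, maybe_inline=()):
    """Apply inlining rules (a) and (b) to a raw tree.

    A node is either a string (a matched token) or a (name, children) tuple.
    `maybe_inline` names the rules that were declared with a leading "?".
    Returns a list, so that an inlined rule can splice several children
    into its parent.
    """
    if isinstance(node, str):
        return [node]
    name, children = node
    shaped = []
    for child in children:
        shaped.extend(shape(child, maybe_inline))
    if name.startswith("_"):
        return shaped  # rule (a): always inlined
    if name in maybe_inline and len(shaped) == 1:
        return shaped  # rule (b): inlined only when there is a single child
    return [(name, shaped)]

# Example (a): "(hello world)"
raw_a = ("start", [("_greet", ["hello", "world"])])
print(shape(raw_a))
# [('start', ['hello', 'world'])]

# Example (b): "hello world (planet)"
raw_b = ("start", [("greet", ["hello", "world"]), ("greet", ["planet"])])
print(shape(raw_b, maybe_inline={"greet"}))
# [('start', [('greet', ['hello', 'world']), 'planet'])]
```

In example (b) the first greet keeps its branch because it has two children, while the second has only one and is replaced by it.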

Lark Options

When initializing the Lark object, you can provide it with keyword options:

  • start - The start symbol (Default: "start")
  • parser - Decides which parser engine to use, "earley" or "lalr". (Default: "earley") Note: Both will use Lark's lexer.
  • transformer - Applies the transformer to every parse tree (only allowed with parser="lalr")
  • only_lex - Don't build a parser. Useful for debugging (Default: False)
  • postlex - Lexer post-processing (Default: None)
  • profile - Measure run-time usage in Lark. Read results from the profiler property (Default: False)
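
The defaults and the one stated constraint (a transformer requires parser="lalr") can be summarised in a small sketch. This is not Lark's own option handling: `DEFAULTS` and `resolve_options` are hypothetical, the `None` default for transformer is an assumption, and rejecting unknown names with a `ValueError` is a choice made for this example.

```python
# Defaults as listed above. The transformer default (None) is assumed,
# since the list does not state one.
DEFAULTS = {
    "start": "start",
    "parser": "earley",
    "transformer": None,
    "only_lex": False,
    "postlex": None,
    "profile": False,
}

def resolve_options(**overrides):
    """Merge keyword options with the defaults and check the constraints."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    options = {**DEFAULTS, **overrides}
    if options["parser"] not in ("earley", "lalr"):
        raise ValueError('parser must be "earley" or "lalr"')
    if options["transformer"] is not None and options["parser"] != "lalr":
        raise ValueError('transformer is only allowed with parser="lalr"')
    return options

print(resolve_options(parser="lalr", start="expr"))
```

Calling `resolve_options(transformer=some_object)` without also passing parser="lalr" raises a `ValueError`, which reflects the restriction noted in the list.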

To be supported:

  • debug
  • cache_grammar
  • keep_all_tokens