10 Grammar Reference
Erez Shinan edited this page 2018-05-01 15:43:45 +03:00

Overview

A grammar is a list of rules and terminals, that together define a language.

Rules define the structure of the language, while terminals define its alphabet.

A terminal may be a string, a regular expression, or a concatenation of these and other terminals.

A rule is a list of rules and terminals, that are below it in the hierarchy of the search.

A parsing algorithm is an algorithm that takes a grammar and a sequence of symbols (members of the alphabet), and searches for a structure that is allowed by the grammar, and matches the entirety of the sequence.

Syntax

Grammars in Lark are based on EBNF syntax, with several enhancements.

Lark grammars are composed of a list of definitions and directives, each on its own line. A definition is either a named rule, or a named terminal. Rule definitions can be extended to the next line by using the OR operator (signified by a pipe: | ).

Comments start with // and last to the end of the line (C++ style)

Lark begins the parse with the rule 'start', unless specified otherwise in the options.

Names of rules are always in lowercase, while names of terminals are always in uppercase. This distinction has practical effects for tree construction, and when using a lexer (aka tokenizer or scanner).

Rules

Syntax:

name : items to match
     | more items      -> optional_alias
     | etc.

An alias is a name for the specific rule alternative. It affects tree construction.

Names of rules and aliases are always in lowercase.

Each item is one of:

  • rule name
  • terminal name
  • literal (an anonymous terminal).
  • (item item ..) - Group items
  • [item item ..] - Maybe. Same as: "(item item ..)?"
  • item? - Zero or one instances of item ("maybe")
  • item* - Zero or more instances of item
  • item+ - One or more instances of item
  • item ~ n - Exactly n instances of item
  • item ~ n..m - Between n to m instances of item

Examples:

hello_world: "hello" "world"
mul: [mul "*"] number     //# Left-recursion is allowed!
expr: expr operator expr
    | value               //# Multi-line, belongs to expr

four_words: word ~ 4

Terminals

Syntax:

NAME : literals to match

Terminals are defined just like rules, but cannot contain rules.

Literals can be one of:

  • "string"
  • /regular expression+/
  • "case-insensitive string"i
  • /re with flags/imulx
  • literal range ("a".."z", "1..9", etc.)

Terminals when using a lexer:

When using a lexer (standard or contextual), it is the grammar-author's responsibility to make sure the literals don't collide, or that if they do, they are matched in the desired order. Literals are matched in an order according to the following criteria:

  1. Highest priority first (priority is specified as: TERM.number: ...)
  2. Length of literal (for regexps, the longest theoretical match is used)
  3. Length of literal / pattern
  4. Name

Examples:

IF: "if"
INTEGER : /[0-9]+/
INTEGER2 : ("0".."9")+          //# Same as INTEGER
DECIMAL.2: INTEGER "." INTEGER  //# Will be matched before INTEGER
WHITESPACE: (" " | /\t/ )+
SQL_SELECT: "select"i

Directives

%ignore

All occurrences of the terminal will be ignored, and won't be part of the parse.

Syntax:

%ignore TERMINAL

Examples:

%ignore " "
COMMENT: "#" /[^\n]/*
%ignore COMMENT

%import

Allows to import terminals from a built-in collection.

Future versions will allow to import from arbitrary files.

Future versions will allow to import rules and macros*.

Syntax:

%import module.TOKEN

Example:

%import common.NUMBER