11 Features
Erez Shinan edited this page 2018-06-26 17:13:11 +03:00
This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Parsers

Lark implements two algorithms: Earley and LALR(1)

Earley Parser

An Earley Parser is a chart parser capable of parsing any context-free grammar at O(n^3), and O(n^2) when the grammar is unambiguous. It can parse most LR grammars at O(n). Most programming languages are LR, and can be parsed at a linear time.

Lark provides 3 lexer modes for Earley:

  • Dynamic (default; see below)

  • Standard (fast, but removes some ambiguity)

Dynamic lexing

The Earley parser runs on top of a "skipping" chart parser, which allows it to run regular expressions, instead of matching characters one-by-one.

This is an improvement to Earley that is unique to Lark.

Ambiguity

Lark can handle any ambiguity in the grammar. The user may request all derivations (using ambiguity='explicit'), or let Lark automatically choose the most fitting derivation.

Lark also supports user-defined rule priority to steer the automatic ambiguity resolution.

LALR(1) Parser

LALR(1) is a very efficient, true-and-tested parsing algorithm. It's incredibly fast and requires very little memory.

It can parse most programming languages (For example: Python and Java).

Lark comes with an efficient implementation that easily competes with PLY.

Lark provides 2 lexer modes for LALR(1):

  • Contextual (default; see below)

  • Standard (basic, same as PLY)

Contextual lexing

The contextual lexer communicates with the parser, and uses the parser's lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals. Its surprisingly effective at resolving common terminal collisions, and allows to parse languages that LALR(1) was previously incapable of parsing.

This is an improvement to LALR(1) that is unique to Lark.

CYK Parser

A CYK parser can parse any context-free grammar at O(n^3*|G|).

Its too slow to be practical for simple grammars, but it offers good performance for highly ambiguous grammars.

Other features

  • EBNF grammar, with extra features such as %import and %ignore (See: Grammar Reference)
  • Builds a parse-tree (AST) automagically based on the grammar (See: Tree Construction)
  • Stand-alone parser generator - create a small parser, independent of Lark, to embed in your project.
  • Automatic line & column tracking in lexer
  • Automatic token collision resolution (for standard usage)
  • Standard library of terminals (strings, numbers, names, etc.)
  • Unicode fully supported
  • Extensive test suite
  • Python 2 & Python 3 compatible
  • Pure-Python implementation

Experimental features

  • Automatic reconstruction of input from parse-tree (see examples)
  • Import grammars from Nearley.js

Planned features (not implemented yet)

  • Generate code in other languages than Python
  • Grammar composition
  • LALR(k) parser
  • "Look-back" Enhancement for LALR(1)
  • Full regexp-collision support using NFAs
  • Automatically produce syntax-highlighters for grammars, for popular IDEs