The parser will try to match each rule (left-part) by matching its items (right-part) sequentially, trying each alternative (In practice, the parser is predictive so we don't have to try every alternative).
Unfortunately, this means that it will also filter out literals like "true" and "false", and we will lose that information. The next section, "Shaping the tree" deals with this issue, and others.
1. Those little arrows signify *aliases*. An alias is a name for a specific part of the rule. In this case, we will name the *true/false/null* matches, and this way we won't lose the information. We also alias *SIGNED_NUMBER* to mark it for later processing.
2. The question-mark prefixing *value* ("?value") tells the tree-builder to inline this branch if it has only one member. In this case, *value* will always have only one member, and will always be inlined.
3. We turned the *ESCAPED_STRING* terminal into a rule. This way it will appear in the tree as a branch. This is equivalent to aliasing (like we did for the number), but now *string* can also be used elsewhere in the grammar (namely, in the *pair* rule).
>>> text = '{"key": ["item0", "item1", 3.14, true]}'
>>> print( json_parser.parse(text).pretty() )
dict
pair
string "key"
list
string "item0"
string "item1"
number 3.14
true
```
Ah! That is much much nicer.
## Part 4 - Evaluating the tree
It's nice to have a tree, but what we really want is a JSON object.
The way to do it is to evaluate the tree, using a Transformer.
A transformer is a class with methods corresponding to branch names. For each branch, the appropriate method will be called with the children of the branch as its argument, and its return value will replace the branch in the tree.
So let's write a partial transformer, that handles lists and dictionaries:
By now, we have a fully working JSON parser, that can accept a string of JSON, and return its Pythonic representation.
But how fast is it?
Now, of course there are JSON libraries for Python written in C, and we can never compete with them. But since this is applicable to any parser you would write in Lark, let's see how far we can take this.
The first step for optimizing is to have a benchmark. For this benchmark I'm going to take data from [json-generator.com/](http://www.json-generator.com/). I took their default suggestion and changed it to 5000 objects. The result is a 6.6MB sparse JSON file.
$ time python tutorial_json.py json_data > /dev/null
real 0m36.257s
user 0m34.735s
sys 0m1.361s
That's unsatisfactory time for a 6MB file. Maybe if we were parsing configuration or a small DSL, but we're trying to handle large amount of data here.
Well, turns out there's quite a bit we can do about it!
### Step 2 - LALR(1)
So far we've been using the Earley algorithm, which is the default in Lark. Earley is powerful but slow. But it just so happens that our grammar is LR-compatible, and specifically LALR(1) compatible.
It's important to note that not all grammars are LR-compatible, and so you can't always switch to LALR(1). But there's no harm in trying! If Lark lets you build the grammar, it means you're good to go.
### Step 3 - Tree-less LALR(1)
So far, we've built a full parse tree for our JSON, and then transformed it. It's a convenient method, but it's not the most efficient in terms of speed and memory. Luckily, Lark lets us avoid building the tree when parsing with LALR(1).
We've used the transformer we've already written, but this time we plug it straight into the parser. Now it can avoid building the parse tree, and just send the data straight into our transformer. The *parse()* method now returns the transformed JSON, instead of a tree.
Let's benchmark it:
real 0m4.866s
user 0m4.722s
sys 0m0.121s
That's a measurable improvement! Also, this way is more memory efficient. Check out the benchmark table at the end to see just how much.
As a general practice, it's recommended to work with parse trees, and only skip the tree-builder when your transformer is already working.
### Step 4 - PyPy
PyPy is a JIT engine for running Python, and it's designed to be a drop-in replacement.
Lark is written purely in Python, which makes it very suitable for PyPy.
Let's get some free performance:
$ time pypy tutorial_json.py json_data > /dev/null
I added a few other parsers for comparison. PyParsing and funcparselib fair pretty well in their memory usage (they don't build a tree), but they can't compete with the run-time speed of LALR(1).