From c319ace48d1b0edea506a5364fd04816480e84a7 Mon Sep 17 00:00:00 2001
From: julienmalard
Date: Fri, 26 Jun 2020 11:47:00 -0400
Subject: [PATCH] Update README.md

---
 README.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/README.md b/README.md
index 1c7062c..02b89d7 100644
--- a/README.md
+++ b/README.md
@@ -176,6 +176,27 @@ You can use the output as a regular python module:
 0.38981434460254655
 ```
 
+### Using Unicode character classes with `regex`
+Python's built-in `re` module has a few persistent known bugs and does not support
+advanced regex features such as Unicode character classes (`\p{...}`).
+Installing with `pip install lark-parser[regex]` adds the `regex` module alongside `lark`,
+where it can act as a drop-in replacement for `re`.
+
+Any instance of `Lark` instantiated with `regex=True` will use the `regex` module
+instead of `re`. For example, we can use Unicode character classes to match PEP-3131 compliant Python identifiers.
+```python
+>>> from lark import Lark
+>>> g = Lark(r"""
+                    ?start: NAME
+                    NAME: ID_START ID_CONTINUE*
+                    ID_START: /[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}_]+/
+                    ID_CONTINUE: ID_START | /[\p{Mn}\p{Mc}\p{Nd}\p{Pc}·]+/
+                """, regex=True)
+
+>>> g.parse('வணக்கம்')
+'வணக்கம்'
+
+```
 
 ## License
 
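
To illustrate what the `\p{...}` classes in the grammar above select, here is a minimal standard-library sketch that does not use `lark` or `regex`. It checks the same Unicode general categories with `unicodedata`. The names `is_name`, `ID_START_CATEGORIES`, `ID_CONTINUE_CATEGORIES`, and `re_accepts_unicode_properties` are hypothetical and chosen only for this illustration; the sketch approximates the `NAME` rule and is not how `lark` matches terminals internally.

```python
import re
import unicodedata

# General categories listed in ID_START and ID_CONTINUE in the grammar above.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATEGORIES = ID_START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def re_accepts_unicode_properties():
    """Report whether the running interpreter's `re` compiles a \\p{..} class.

    Assumption: on CPython releases contemporary with this patch, `\\p` is
    rejected as a bad escape, so this is expected to return False there.
    """
    try:
        re.compile(r"\p{Lu}")
    except re.error:
        return False
    return True


def is_name(text):
    """Approximate the NAME rule: one start character, then continue characters."""
    if not text:
        return False

    def is_start(ch):
        # The grammar adds "_" to the ID_START categories.
        return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES

    def is_continue(ch):
        # The grammar adds the middle dot (U+00B7) to the ID_CONTINUE categories.
        return (
            is_start(ch)
            or ch == "\u00b7"
            or unicodedata.category(ch) in ID_CONTINUE_CATEGORIES
        )

    return is_start(text[0]) and all(is_continue(ch) for ch in text[1:])
```

With this sketch, `is_name('வணக்கம்')` is true because the Tamil letters are in category `Lo` and the virama signs are in `Mn`, whereas `is_name('1abc')` is false because a digit (`Nd`) may continue a name but not start one.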