spaCy/spacy/lang/tokenizer_exceptions.py

import re

from .char_classes import ALPHA_LOWER, ALPHA
from ..symbols import ORTH, POS, TAG, LEMMA, SPACE


# URL validation regex courtesy of: https://mathiasbynens.be/demo/url-regex
# and https://gist.github.com/dperini/729294 (Diego Perini, MIT License)
# A few mods to this regex to account for use cases represented in test_urls
URL_PATTERN = (
    # fmt: off
    r"^"
    # protocol identifier (mods: make optional and expand schemes)
    # (see: https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml)
    r"(?:(?:[\w\+\-\.]{2,})://)?"
    # mailto:user or user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    # IP address exclusion
    # private & local networks
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    # IP address dotted notation octets
    # excludes the 0.0.0.0/8 network (first octet 0)
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    # MH: Do we really need this? Seems excessive, and seems to have caused
    # Issue #957
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|"
    # host & domain names
    # mods: match is case-sensitive, so include [A-Z]
      "(?:"  # noqa: E131
        "(?:"
          "[A-Za-z0-9\u00a1-\uffff]"
          "[A-Za-z0-9\u00a1-\uffff_-]{0,62}"
        ")?"
        "[A-Za-z0-9\u00a1-\uffff]\\."
      ")+"
    # TLD identifier
    # mods: use ALPHA_LOWER instead of a wider range so that this doesn't match
    # strings like "lower.Upper", which can be split on "." by infixes in some
    # languages
    r"(?:[" + ALPHA_LOWER + "]{2,63})"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:[/?#]\S*)?"
    r"$"
    # fmt: on
).strip()

TOKEN_MATCH = None
URL_MATCH = re.compile("(?u)" + URL_PATTERN).match


BASE_EXCEPTIONS = {}


for exc_data in [
    {ORTH: " ", POS: SPACE, TAG: "_SP"},
    {ORTH: "\t", POS: SPACE, TAG: "_SP"},
    {ORTH: "\\t", POS: SPACE, TAG: "_SP"},
    {ORTH: "\n", POS: SPACE, TAG: "_SP"},
    {ORTH: "\\n", POS: SPACE, TAG: "_SP"},
    {ORTH: "\u2014"},
    {ORTH: "\u00a0", POS: SPACE, LEMMA: "  ", TAG: "_SP"},
]:
    BASE_EXCEPTIONS[exc_data[ORTH]] = [exc_data]


for orth in [
    "'",
    '\\")',
    "<space>",
    "''",
    "C++",
    "a.",
    "b.",
    "c.",
    "d.",
    "e.",
    "f.",
    "g.",
    "h.",
    "i.",
    "j.",
    "k.",
    "l.",
    "m.",
    "n.",
    "o.",
    "p.",
    "q.",
    "r.",
    "s.",
    "t.",
    "u.",
    "v.",
    "w.",
    "x.",
    "y.",
    "z.",
    "ä.",
    "ö.",
    "ü.",
]:
    BASE_EXCEPTIONS[orth] = [{ORTH: orth}]


emoticons = set(
    r"""
:)
:-)
:))
:-))
:)))
:-)))
(:
(-:
=)
(=
:]
:-]
[:
[-:
:o)
(o:
:}
:-}
8)
8-)
(-8
;)
;-)
(;
(-;
:(
:-(
:((
:-((
:(((
:-(((
):
)-:
=(
>:(
:')
:'-)
:'(
:'-(
:/
:-/
=/
=|
:|
:-|
:1
:P
:-P
:p
:-p
:O
:-O
:o
:-o
:0
:-0
:()
>:o
:*
:-*
:3
:-3
=3
:>
:->
:X
:-X
:x
:-x
:D
:-D
;D
;-D
=D
xD
XD
xDD
XDD
8D
8-D

^_^
^__^
^___^
>.<
>.>
<.<
._.
;_;
-_-
-__-
v.v
V.V
v_v
V_V
o_o
o_O
O_o
O_O
0_o
o_0
0_0
o.O
O.o
O.O
o.o
0.0
o.0
0.o
@_@
<3
<33
<333
</3
(^_^)
(-_-)
(._.)
(>_<)
(*_*)
(¬_¬)
ಠ_ಠ
ಠ︵ಠ
(ಠ_ಠ)
¯\(ツ)/¯
(╯°□°）╯︵┻━┻
><(((*>
""".split()
)


for orth in emoticons:
    BASE_EXCEPTIONS[orth] = [{ORTH: orth}]