spaCy/spacy/lang/it/punctuation.py

# coding: utf8
from __future__ import unicode_literals

from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import ALPHA, HYPHENS, CONCAT_QUOTES
from ..char_classes import ALPHA_LOWER, ALPHA_UPPER


# Straight and typographic apostrophes that mark elision (c'è, l’ha).
ELISION = "'’"


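# Match an apostrophe plus two digits ('90) or digits plus ° (1°) as a single
# prefix, ahead of the shared prefix rules.
_prefixes = [r"'[0-9][0-9]", r"[0-9]+°"] + BASE_TOKENIZER_PREFIXES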


_infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
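    + [
        # Arithmetic operator between digits (1+2, 3*4).
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        # Period between a lowercase letter or quote and an uppercase letter or quote.
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        # Comma between two letters.
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # Hyphen between a letter and a following lowercase letter.
        r"(?<=[{a}])(?:{h})(?=[{al}])".format(a=ALPHA, h=HYPHENS, al=ALPHA_LOWER),
        # Colon, angle bracket, equals sign, or slash between a letter or digit and a letter.
        r"(?<=[{a}0-9])[:<>=\/](?=[{a}])".format(a=ALPHA),
        # Zero-width split after letter + apostrophe (c'è -> c' è, l'ha -> l' ha).
        r"(?<=[{a}][{el}])(?=[{a}0-9\"])".format(a=ALPHA, el=ELISION),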
    ]
)

TOKENIZER_PREFIXES = _prefixes
TOKENIZER_INFIXES = _infixes
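
The elision rule at the end of `_infixes` is zero-width: it consumes no characters and only marks the position after a letter plus apostrophe. The sketch below tries to show that behaviour with the standard `re` module alone. The reduced `ALPHA` class and the `split_on_elision` helper are assumptions made for illustration. They are not part of spaCy, whose tokenizer applies infix patterns through its own machinery.

```python
import re

# Assumption: a small stand-in for spaCy's much larger ALPHA character class.
ALPHA = "a-zA-ZàèéìòùÀÈÉÌÒÙ"
ELISION = "'’"

# Same shape as the last infix rule above: a zero-width position preceded by
# letter + apostrophe and followed by a letter, digit, or double quote.
ELISION_INFIX = re.compile(
    r"(?<=[{a}][{el}])(?=[{a}0-9\"])".format(a=ALPHA, el=ELISION)
)


def split_on_elision(word):
    """Hypothetical helper: split a string at every zero-width elision match."""
    pieces = []
    start = 0
    for match in ELISION_INFIX.finditer(word):
        pieces.append(word[start:match.start()])
        start = match.end()
    pieces.append(word[start:])
    return pieces


print(split_on_elision("c'è"))        # ["c'", 'è']
print(split_on_elision("l’ha"))       # ['l’', 'ha']
print(split_on_elision("dell'arte"))  # ["dell'", 'arte']
print(split_on_elision("po'"))        # ["po'"]
```

The apostrophe stays attached to the elided word because the match lies after it. A trailing apostrophe, as in `po'`, is left alone because the lookahead finds nothing to its right.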

Improve Italian & Urdu tokenization accuracy (#3228)

## Description

1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), raising the F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added a unit test to check this behaviour.

2. Added an Urdu-specific punctuation character as a suffix, raising the F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added a unit test to check this behaviour.

### Types of change
Enhancement of Italian & Urdu tokenization

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

2019-02-04 21:39:25 +00:00

Tidy up and auto-format

2020-03-25 11:28:12 +00:00

Improve Italian tokenization (#5204)

Improve Italian tokenization for UD_Italian-ISDT.

2020-03-25 10:28:02 +00:00