spaCy

Commit Graph

Author	SHA1	Message	Date
Sofie	9745b0d523	Improve Italian & Urdu tokenization accuracy (#3228 ) ## Description 1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), bringing F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added unit test to check this behaviour. 2. Added specific Urdu punctuation character as suffix, improving F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added unit test to check this behaviour. ### Types of change Enhancement of Italian & Urdu tokenization ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-02-04 22:39:25 +01:00
Sofie	a3efa3e8d9	Improve Catalan tokenization accuracy (#3225 ) * small hyphen clean up for French * catalan infix similar to french	2019-02-04 20:37:19 +11:00

Author

SHA1

Message

Date

Sofie

9745b0d523

Improve Italian & Urdu tokenization accuracy (#3228 )

## Description

1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), bringing F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added unit test to check this behaviour.

2. Added specific Urdu punctuation character as suffix, improving F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added unit test to check this behaviour.

### Types of change
Enhancement of Italian & Urdu tokenization

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

2019-02-04 22:39:25 +01:00

Sofie

a3efa3e8d9

Improve Catalan tokenization accuracy (#3225 )

* small hyphen clean up for French

* catalan infix similar to french

2019-02-04 20:37:19 +11:00

2 Commits