Remove corpus-specific tag maps from the language data for languages
without custom tokenizers. For languages with custom word segmenters
that also provide tags (Japanese and Korean), the tag maps for the
custom tokenizers are kept as the default.
The default tag maps for languages without custom tokenizers are now the
default tag map from `lang/tag_map/py`, UPOS -> UPOS.
Update Polish tokenizer for UD_Polish-PDB, which is a relatively major
change from the existing tokenizer. Unused exceptions files and
conflicting test cases removed.
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Improved stop words list
* Removed some wrong stop words form list
* Improved stop words list
* Removed some wrong stop words form list
* Improved Polish Tokenizer (#38)
* Add tests for polish tokenizer
* Add polish tokenizer exceptions
* Don't split any words containing hyphens
* Fix test case with wrong model answer
* Remove commented out line of code until better solution is found
* Add source srx' license
* Rename exception_list.py to match spaCy conventionality
* Add a brief explanation of where the exception list comes from
* Add newline after reach exception
* Rename COPYING.txt to LICENSE
* Delete old files
* Add header to the license
* Agreements signed
* Stanisław Giziński agreement
* Krzysztof Kowalczyk - signed agreement
* Mateusz Olko agreement
* Add DoomCoder's contributor agreement
* Improve like number checking in polish lang
* like num tests added
* all from SI system added
* Final licence and removed splitting exceptions
* Added polish stop words to LEX_ATTRA
* Add encoding info to pl tokenizer exceptions
<!--- Provide a general summary of your changes in the title. -->
## Description
- [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means)
Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results.
At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information.
### Types of change
enhancement, code style
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.