Commit Graph

15 Commits

Author SHA1 Message Date
Paul O'Leary McCann 756b66b7c0 Reduce size of language data (#4141)
* Move Turkish lemmas to a json file

Rather than a large dict in Python source, the data is now a big json
file. This includes a method for loading the json file, falling back to
a compressed file, and an update to MANIFEST.in that excludes json in
the spacy/lang directory.

This focuses on Turkish specifically because it has the most language
data in core.

* Transition all lemmatizer.py files to json

This covers all lemmatizer.py files of a significant size (>500k or so).
Small files were left alone.

None of the affected files have logic, so this was pretty
straightforward.

One unusual thing is that the lemma data for Urdu doesn't seem to be
used anywhere. That may require further investigation.

* Move large lang data to json for fr/nb/nl/sv

These are the languages that use a lemmatizer directory (rather than a
single file) and are larger than English.

For most of these languages there were many language data files, in
which case only the large ones (>500k or so) were converted to json. It
may or may not be a good idea to migrate the remaining Python files to
json in the future.

* Fix id lemmas.json

The contents of this file were originally just copied from the Python
source, but that used single quotes, so it had to be properly converted
to json first.

* Add .json.gz to gitignore

This covers the json.gz files built as part of distribution.

* Add language data gzip to build process

Currently this gzip data on every build; it works, but it should be
changed to only gzip when the source file has been updated.

* Remove Danish lemmatizer.py

Missed this when I added the json.

* Update to match latest explosion/srsly#9

The way gzipped json is loaded/saved in srsly changed a bit.

* Only compress language data if necessary

If a .json.gz file exists and is newer than the corresponding json file,
it's not recompressed.

* Move en/el language data to json

This only affected files >500kb, which was nouns for both languages and
the generic lookup table for English.

* Remove empty files in Norwegian tokenizer

It's unclear why, but the Norwegian (nb) tokenizer had empty files for
adj/adv/noun/verb lemmas. This may have been a result of copying the
structure of the English lemmatizer.

This removed the files, but still creates the empty sets in the
lemmatizer. That may not actually be necessary.

* Remove dubious entries in English lookup.json

" furthest" and " skilled" - both prefixed with a space - were in the
English lookup table. That seems obviously wrong so I have removed them.

* Fix small issues with en/fr lemmatizers

The en tokenizer was including the removed _nouns.py file, so that's
removed.

The fr tokenizer is unusual in that it has a lemmatizer directory with
both __init__.py and lemmatizer.py. lemmatizer.py had not been converted
to load the json language data, so that was fixed.

* Auto-format

* Auto-format

* Update srsly pin

* Consistently use pathlib paths
2019-08-20 14:54:11 +02:00
Ines Montani 5d0b60999d Merge branch 'master' into develop 2019-02-07 20:54:07 +01:00
Ines Montani 338d659bd0 Store JSON schemas in Python and tidy up (#3235) 2019-02-07 19:44:31 +11:00
Paul Ganssle 021d04069a Build metadata modernization - pyproject.toml and python_requires (#3167)
* Added pyproject.toml

This adds the build requirements metadata to the repo, which can be used
with any build tools that implement PEP 517 and PEP 518 (e.g. pip, tox).
It is no longer necessary to have the build dependencies installed when
installing from source.

* Add python_requires for 2.7, 3.4+

This directive specifies in the build metadata which version of CPython
is supported by this version of spaCy, which pip will take into account
when determining what version to download. This will allow you to safely
drop old versions of Python without `pip install spaCy` breaking for those
versions.

* Add Python 3.7 to the trove classifiers
2019-01-16 17:42:09 +01:00
Matthew Honnibal 9fc8ce0c4d Add schemas to MANIFEST 2018-12-19 01:18:50 +01:00
Ines Montani 3832c8a2c1 💫 Use README.md instead of README.rst (#2968)
* Auto-format setup.py

* Use README.md instead of README.rst
2018-11-26 22:04:35 +01:00
ines d208bcef96 Add entry point-style auto alias for "spacy"
Simplest way to run commands as spacy xxx instead of python -m spacy
xxx, while avoiding environment conflicts
2017-08-14 12:18:39 +02:00
ines 7f8c2ef3c1 Remove buildbot.json for now 2017-03-17 14:35:10 +01:00
Henning Peters 1fe29c6919 cleanup 2016-03-13 18:12:32 +01:00
Henning Peters 49f499ca1c cleanup 2016-03-12 14:30:24 +01:00
Henning Peters 5701686272 cleanup 2016-03-12 13:47:10 +01:00
Henning Peters 74dc02a0e6 fix windows readme 2015-12-21 21:58:53 +01:00
Henning Peters c17ce6c119 (re-)include cython sources, murmurhash header discovery 2015-12-21 12:40:44 +01:00
Henning Peters ac318b568c new approach to dependency headers 2015-12-13 11:49:17 +01:00
Matthew Honnibal d5d1578e44 * Add manifest file 2015-01-30 16:49:02 +11:00