479 Commits

Author SHA1 Message Date
adrianeboyd 521f361052 Switch to new gold.align method (#5334)
* Switch from original `_align` to new simpler alignment algorithm from
  #4526

* Remove alignment normalizations beyond whitespace and lowercasing
2020-04-21 19:31:03 +02:00
Ines Montani e0cf4796a5 Move lookup tables out of the core library (#4346)
* Add default to util.get_entry_point

* Tidy up entry points

* Read lookups from entry points

* Remove lookup tables and related tests

* Add lookups install option

* Remove lemmatizer tests

* Remove logic to process language data files

* Update setup.cfg
2019-10-01 00:01:27 +02:00
Ines Montani ba186299e1 Tidy up and modernize setup and config (#4344)
* Tidy up and modernize setup and config

* Update setup.cfg

* Re-add pyproject.toml

* Delete .flake8

* Move static meta from about to setup.cfg

* Update setup.cfg

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
2019-09-30 20:10:55 +02:00
Matthew Honnibal 84837c1680 Use include_package_data in setup.py 2019-09-30 14:56:44 +02:00
Matthew Honnibal b6ec291bde Require preshed 3.0.2 2019-09-28 22:23:24 +02:00
Matthew Honnibal 4c383ab77e Require newer preshed 2019-09-28 22:08:05 +02:00
Matthew Honnibal 96dd143a18 Install json.gz files 2019-09-28 16:35:39 +02:00
Ines Montani 80d554f2e2 Remove unsupported version [ci skip] 2019-09-19 01:14:42 +02:00
Ines Montani 7e3ac2cd41 Merge branch 'master' into develop 2019-09-12 15:35:25 +02:00
Ines Montani 0760c41393 Change st_ctime to st_mtime 2019-09-12 15:35:01 +02:00
Matthew Honnibal c181a94e75 Require thinc 7.1.1 2019-09-10 20:12:24 +02:00
Matthew Honnibal 28741ff5db Require preshed v3.0.0 2019-09-10 19:13:07 +02:00
Matthew Honnibal 4e2f07a655 Merge branch 'develop' into feature/lemmatizer 2019-08-25 21:03:25 +02:00
Matthew Honnibal b8edc8dffb Require thinc 7.1 2019-08-25 14:54:09 +02:00
Matthew Honnibal c308cf3e3e Merge branch 'master' into feature/lemmatizer 2019-08-25 13:52:27 +02:00
Matthew Honnibal f9075a6fd1 Update to blis 0.4 and thinc 7.1 2019-08-25 13:50:47 +02:00
Wannaphong Phatthiyaphaibun d53c3fcbc1 Add Thai Language tokenizers (#4191)
Add th (pythainlp)
2019-08-25 11:35:21 +02:00
Matthew Honnibal bcd08f20af Merge changes from master 2019-08-21 14:18:52 +02:00
Paul O'Leary McCann 756b66b7c0 Reduce size of language data (#4141)
* Move Turkish lemmas to a json file

Rather than a large dict in Python source, the data is now a big json
file. This includes a method for loading the json file, falling back to
a compressed file, and an update to MANIFEST.in that excludes json in
the spacy/lang directory.

This focuses on Turkish specifically because it has the most language
data in core.

* Transition all lemmatizer.py files to json

This covers all lemmatizer.py files of a significant size (>500k or so).
Small files were left alone.

None of the affected files have logic, so this was pretty
straightforward.

One unusual thing is that the lemma data for Urdu doesn't seem to be
used anywhere. That may require further investigation.

* Move large lang data to json for fr/nb/nl/sv

These are the languages that use a lemmatizer directory (rather than a
single file) and are larger than English.

For most of these languages there were many language data files, in
which case only the large ones (>500k or so) were converted to json. It
may or may not be a good idea to migrate the remaining Python files to
json in the future.

* Fix id lemmas.json

The contents of this file were originally just copied from the Python
source, but that used single quotes, so it had to be properly converted
to json first.

* Add .json.gz to gitignore

This covers the json.gz files built as part of distribution.

* Add language data gzip to build process

Currently this gzips the data on every build; it works, but it should be
changed to only gzip when the source file has been updated.

* Remove Danish lemmatizer.py

Missed this when I added the json.

* Update to match latest explosion/srsly#9

The way gzipped json is loaded/saved in srsly changed a bit.

* Only compress language data if necessary

If a .json.gz file exists and is newer than the corresponding json file,
it's not recompressed.

* Move en/el language data to json

This only affected files >500kb, which was nouns for both languages and
the generic lookup table for English.

* Remove empty files in Norwegian tokenizer

It's unclear why, but the Norwegian (nb) tokenizer had empty files for
adj/adv/noun/verb lemmas. This may have been a result of copying the
structure of the English lemmatizer.

This removes the files, but still creates the empty sets in the
lemmatizer. That may not actually be necessary.

* Remove dubious entries in English lookup.json

" furthest" and " skilled" - both prefixed with a space - were in the
English lookup table. That seems obviously wrong so I have removed them.

* Fix small issues with en/fr lemmatizers

The en tokenizer was including the removed _nouns.py file, so that's
removed.

The fr tokenizer is unusual in that it has a lemmatizer directory with
both __init__.py and lemmatizer.py. lemmatizer.py had not been converted
to load the json language data, so that was fixed.

* Auto-format

* Auto-format

* Update srsly pin

* Consistently use pathlib paths
2019-08-20 14:54:11 +02:00
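
The two mechanisms described in #4141 (recompressing language data only when the .json.gz copy is stale, and loading the plain json file with a fallback to the compressed copy) can be illustrated with a minimal sketch. This is not the code from the commit: the function names `compress_if_stale` and `load_language_data` are hypothetical, and the sketch uses only the Python standard library rather than srsly.

```python
import gzip
import json
from pathlib import Path


def compress_if_stale(json_path):
    """Gzip json_path to <name>.json.gz unless a newer copy already exists.

    Hypothetical helper; returns True if the file was (re)compressed.
    """
    json_path = Path(json_path)
    gz_path = json_path.with_name(json_path.name + ".gz")
    # Skip the work when the compressed copy is newer than its source.
    if gz_path.exists() and gz_path.stat().st_mtime > json_path.stat().st_mtime:
        return False
    with json_path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        dst.write(src.read())
    return True


def load_language_data(json_path):
    """Load the plain json file, falling back to the .json.gz copy.

    Hypothetical helper; raises FileNotFoundError if neither file exists.
    """
    json_path = Path(json_path)
    if json_path.exists():
        with json_path.open("r", encoding="utf8") as f:
            return json.load(f)
    gz_path = json_path.with_name(json_path.name + ".gz")
    with gzip.open(gz_path, "rt", encoding="utf8") as f:
        return json.load(f)
```

Under these assumptions, a build step would call `compress_if_stale` on each large json file, and an installed package that ships only the .json.gz files would still load its data through `load_language_data`.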
Ines Montani 123929b58b Update Thinc version pin 2019-07-12 00:15:35 +02:00
Ines Montani cda9fc3dae Update Thinc version pin 2019-07-11 15:53:13 +02:00
cedar101 58f06e6180 Korean support (#3901)
* start lang/ko

* add test codes

* using natto-py

* add test_ko_tokenizer_full_tags()

* spaCy contributor agreement

* external dependency for ko

* collections.namedtuple for python version < 3.5

* case fix

* tuple unpacking

* add jongseong(final consonant)

* apply mecab option

* Remove Pipfile for now
Co-authored-by: Ines Montani <ines@ines.io>
2019-07-09 22:23:16 +02:00
Ines Montani 5d6b4bb3bd Update srsly pin 2019-06-07 11:14:32 +02:00
Ines Montani a7fd42d937 Make jsonschema dependency optional (#3784) 2019-05-30 14:34:58 +02:00
Ines Montani a8416c46f7 Use string name in setup.py
Hopefully this will trick GitHub's parser into recognising it as a Python package and show us the dependents / "used by" statistics 🤞
2019-05-28 17:11:39 +02:00
Ines Montani 04658ebbb2 Relax jsonschema pin (closes #3628) 2019-05-03 11:58:58 +02:00
svlandeg 12d4caf341 Merge remote-tracking branch 'upstream/master' into feature/el-framework 2019-03-22 13:44:36 +01:00
Ines Montani c81923ee30 Update wasabi pin 2019-03-22 13:31:58 +01:00
svlandeg cf34113250 very minimal KB functionality working 2019-03-22 11:36:44 +01:00
Matthew Honnibal 02d7b41893 Fix GPU installation. Closes #3437 2019-03-20 00:59:27 +01:00
Matthew Honnibal 932d7dde1c Fix compile error 2019-03-07 14:34:54 +01:00
Matthew Honnibal ef3110a444 Fix compile error 2019-03-07 10:45:55 +01:00
Matthew Honnibal fc1cc4c529 Move morphologizer under spacy/pipes 2019-03-07 01:36:26 +01:00
Matthew Honnibal 3993f41cc4 Update morphology branch from develop 2019-03-07 00:14:43 +01:00
Ines Montani 55bb570f51 Add [ja] to extras_require 2019-02-25 09:37:05 +01:00
Matthew Honnibal 55bb3cc482 Require thinc 7.0.2 2019-02-23 13:10:09 +01:00
Matthew Honnibal 808ae7521b Require thinc 7.0.1 2019-02-16 17:29:57 +01:00
Matthew Honnibal eea3001b98 Depend on thinc 7.0.1.dev2 2019-02-16 17:02:30 +01:00
Matthew Honnibal f456b673d4 Require thinc 7.0.1.dev1 2019-02-16 16:22:26 +01:00
Matthew Honnibal 11e826ac3b Require thinc v7.0.1.dev0 2019-02-16 15:47:02 +01:00
Matthew Honnibal 4c49f5f7b0 Update Thinc dependency 2019-02-15 12:39:08 +01:00
Matthew Honnibal bed956c698 Drop regex dependency 2019-02-13 23:08:22 +11:00
Ines Montani a9f8d17632 💫 Break up large pipeline.pyx (#3246)
* Break up large pipeline.pyx

* Merge some components back together

* Fix typo
2019-02-10 12:14:51 +01:00
Ines Montani 5d0b60999d Merge branch 'master' into develop 2019-02-07 20:54:07 +01:00
Ines Montani 1ea4df459d 💫 Break up large matcher.pyx (#3236)
* Break up large matcher.pyx

* Remove unused function
2019-02-07 19:42:25 +11:00
Paul Ganssle 021d04069a Build metadata modernization - pyproject.toml and python_requires (#3167)
* Added pyproject.toml

This adds the build requirements metadata to the repo, which can be used
with any build tools that implement PEP 517 and PEP 518 (e.g. pip, tox).
It is no longer necessary to have the build dependencies installed when
installing from source.

* Add python_requires for 2.7, 3.4+

This directive specifies in the build metadata which versions of CPython
are supported by this version of spaCy, which pip will take into account
when determining what version to download. This will allow you to safely
drop old versions of Python without `pip install spaCy` breaking for those
versions.

* Add Python 3.7 to the trove classifiers
2019-01-16 17:42:09 +01:00
Mathieu Morey f07b577fbd Support CUDA 10 (#3126)
* ENH support CUDA 10

* Update _instructions.jade
2019-01-09 03:10:45 +01:00
Matthew Honnibal b7ce85a6f3 Fix packaging of json schemas 2018-12-19 13:54:02 +01:00
Matthew Honnibal 35ff889852 Fix OSX wheel building 2018-12-19 13:14:57 +01:00
Matthew Honnibal a2b75036e9 Try to make sure json schemas are packaged 2018-12-19 01:08:51 +01:00