spaCy

Commit Graph

Author	SHA1	Message	Date
Ines Montani	79540e1eea	Remove bin/spacy from MANIFEST	2020-07-01 22:15:18 +02:00
Sofie Van Landeghem	06f0a8daa0	Default settings to configurations (#4995) * fix grad_clip naming * cleaning up pretrained_vectors out of cfg * further refactoring Model init's * move Model building out of pipes * further refactor to require a model config when creating a pipe * small fixes * making cfg in nn_parser more consistent * fixing nr_class for parser * fixing nn_parser's nO * fix printing of loss * architectures in own file per type, consistent naming * convenience methods default_tagger_config and default_tok2vec_config * let create_pipe access default config if available for that component * default_parser_config * move defaults to separate folder * allow reading nlp from package or dir with argument 'name' * architecture spacy.VocabVectors.v1 to read static vectors from file * cleanup * default configs for nel, textcat, morphologizer, tensorizer * fix imports * fixing unit tests * fixes and clean up * fixing defaults, nO, fix unit tests * restore parser IO * fix IO * 'fix' serialization test * add .cfg to manifest fix example configs with additional arguments * replace Morpohologizer with Tagger * add IO bit when testing overfitting of tagger (currently failing) * fix IO - don't initialize when reading from disk * expand overfitting tests to also check IO goes OK * remove dropout from HashEmbed to fix Tagger performance * add defaults for sentrec * update thinc * always pass a Model instance to a Pipe * fix piped_added statement * remove obsolete W029 * remove obsolete errors * restore byte checking tests (work again) * clean up test * further test cleanup * convert from config to Model in create_pipe * bring back error when component is not initialized * cleanup * remove calls for nlp2.begin_training * use thinc.api in imports * allow setting charembed's nM and nC * fix for hardcoded nM/nC + unit test * formatting fixes * trigger build	2020-02-27 18:42:27 +01:00
Sofie Van Landeghem	1c01842588	add pyx and pxd files to the distribution (#5000)	2020-02-11 17:42:17 -05:00
Ines Montani	ba186299e1	Tidy up and modernize setup and config (#4344) * Tidy up and modernize setup and config * Update setup.cfg * Re-add pyproject.toml * Delete .flake8 * Move static meta from about to setup.cfg * Update setup.cfg Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>	2019-09-30 20:10:55 +02:00
Ines Montani	69c674bdbf	Update MANIFEST.in	2019-09-30 16:33:07 +02:00
Ines Montani	b8eca6cd11	Update MANIFEST.in	2019-09-30 16:27:12 +02:00
Matthew Honnibal	512e2208dc	Include .txt files	2019-09-30 15:24:25 +02:00
Paul O'Leary McCann	756b66b7c0	Reduce size of language data (#4141) * Move Turkish lemmas to a json file Rather than a large dict in Python source, the data is now a big json file. This includes a method for loading the json file, falling back to a compressed file, and an update to MANIFEST.in that excludes json in the spacy/lang directory. This focuses on Turkish specifically because it has the most language data in core. * Transition all lemmatizer.py files to json This covers all lemmatizer.py files of a significant size (>500k or so). Small files were left alone. None of the affected files have logic, so this was pretty straightforward. One unusual thing is that the lemma data for Urdu doesn't seem to be used anywhere. That may require further investigation. * Move large lang data to json for fr/nb/nl/sv These are the languages that use a lemmatizer directory (rather than a single file) and are larger than English. For most of these languages there were many language data files, in which case only the large ones (>500k or so) were converted to json. It may or may not be a good idea to migrate the remaining Python files to json in the future. * Fix id lemmas.json The contents of this file were originally just copied from the Python source, but that used single quotes, so it had to be properly converted to json first. * Add .json.gz to gitignore This covers the json.gz files built as part of distribution. * Add language data gzip to build process Currently this gzip data on every build; it works, but it should be changed to only gzip when the source file has been updated. * Remove Danish lemmatizer.py Missed this when I added the json. * Update to match latest explosion/srsly#9 The way gzipped json is loaded/saved in srsly changed a bit. * Only compress language data if necessary If a .json.gz file exists and is newer than the corresponding json file, it's not recompressed. * Move en/el language data to json This only affected files >500kb, which was nouns for both languages and the generic lookup table for English. * Remove empty files in Norwegian tokenizer It's unclear why, but the Norwegian (nb) tokenizer had empty files for adj/adv/noun/verb lemmas. This may have been a result of copying the structure of the English lemmatizer. This removed the files, but still creates the empty sets in the lemmatizer. That may not actually be necessary. * Remove dubious entries in English lookup.json " furthest" and " skilled" - both prefixed with a space - were in the English lookup table. That seems obviously wrong so I have removed them. * Fix small issues with en/fr lemmatizers The en tokenizer was including the removed _nouns.py file, so that's removed. The fr tokenizer is unusual in that it has a lemmatizer directory with both __init__.py and lemmatizer.py. lemmatizer.py had not been converted to load the json language data, so that was fixed. * Auto-format * Auto-format * Update srsly pin * Consistently use pathlib paths	2019-08-20 14:54:11 +02:00
Ines Montani	5d0b60999d	Merge branch 'master' into develop	2019-02-07 20:54:07 +01:00
Ines Montani	338d659bd0	Store JSON schemas in Python and tidy up (#3235)	2019-02-07 19:44:31 +11:00
Paul Ganssle	021d04069a	Build metadata modernization - pyproject.toml and python_requires (#3167) * Added pyproject.toml This adds the build requirements metadata to the repo, which can be used with any build tools that implement PEP 517 and PEP 518 (e.g. pip, tox). It is no longer necessary to have the build dependencies installed when installing from source. * Add python_requires for 2.7, 3.4+ This directive specifies in the build metadata which version of CPython is supported by this version of spaCy, which pip will take into account when determining what version to download. This will allow you to safely drop old versions of Python without `pip install spaCy` breaking for those versions. * Add Python 3.7 to the trove classifiers	2019-01-16 17:42:09 +01:00
Matthew Honnibal	9fc8ce0c4d	Add schemas to MANIFEST	2018-12-19 01:18:50 +01:00
Ines Montani	3832c8a2c1	💫 Use README.md instead of README.rst (#2968) * Auto-format setup.py * Use README.md instead of README.rst	2018-11-26 22:04:35 +01:00
ines	d208bcef96	Add entry point-style auto alias for "spacy" Simplest way to run commands as spacy xxx instead of python -m spacy xxx, while avoiding environment conflicts	2017-08-14 12:18:39 +02:00
ines	7f8c2ef3c1	Remove buildbot.json for now	2017-03-17 14:35:10 +01:00
Henning Peters	1fe29c6919	cleanup	2016-03-13 18:12:32 +01:00
Henning Peters	49f499ca1c	cleanup	2016-03-12 14:30:24 +01:00
Henning Peters	5701686272	cleanup	2016-03-12 13:47:10 +01:00
Henning Peters	74dc02a0e6	fix windows readme	2015-12-21 21:58:53 +01:00
Henning Peters	c17ce6c119	(re-)include cython sources, murmurhash header discovery	2015-12-21 12:40:44 +01:00
Henning Peters	ac318b568c	new approach to dependency headers	2015-12-13 11:49:17 +01:00
Matthew Honnibal	d5d1578e44	* Add manifest file	2015-01-30 16:49:02 +11:00

22 Commits