11457 Commits

Author SHA1 Message Date
adrianeboyd 993758c58f
Remove unnecessary iterator in Language.pipe (#5101)
Remove iterator over `raw_texts` with `itertools.tee()` in
`Language.pipe` that is never consumed and consumes memory
unnecessarily.
2020-03-08 13:22:25 +01:00
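
The memory cost mentioned in #5101 follows from how `itertools.tee` buffers items for any branch that lags behind. The block below is a minimal standalone sketch of that behaviour, not the spaCy code itself; the generator `texts` and the variable names are made up for illustration.

```python
import itertools


def texts():
    # Hypothetical stand-in for a stream of raw texts.
    for i in range(5):
        yield f"text {i}"


# tee() splits one iterator into two independent branches.
consumed, never_consumed = itertools.tee(texts())

# Only the first branch is used for real work.
processed = [t.upper() for t in consumed]

# Every item pulled through `consumed` was also buffered for the
# other branch, so all five items are still held in memory here.
leftover = list(never_consumed)
```

If the second branch is never read, the buffer grows with the input and is only released once the branch is garbage-collected, which is why dropping the unused branch saves memory.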
Ines Montani cd79c7bd26
Merge pull request #5110 from dhpollack/dhp/fix-minor-svg-error
fix typo in svg file - caused documentation build error
2020-03-06 15:32:43 +01:00
Sofie Van Landeghem 1a2b8fc264
set vector of merged entity (#5085)
* merge_entities sets the vector in the vocab for the merged token

* add unit test

* import unicode_literals

* move code to _merge function

* only set vector if vocab has non-zero vectors
2020-03-06 14:45:28 +01:00
adrianeboyd c95ce96c44
Update sentence recognizer (#5109)
* Update sentence recognizer

* rename `sentrec` to `senter`
* use `spacy.HashEmbedCNN.v1` by default
* update to follow `Tagger` modifications
* remove component methods that can be inherited from `Tagger`
* add simple initialization and overfitting pipeline tests

* Update serialization test for senter
2020-03-06 14:45:02 +01:00
Sofie Van Landeghem 6ac9fc0619
Unit test for NEL functionality (#5114)
* empty begin_training for sentencizer

* overfitting unit test for entity linker

* fixed NEL IO by storing the entity_vector_length in the cfg
2020-03-06 14:42:23 +01:00
David Pollack 80004930ed fix typo in svg file 2020-03-05 17:04:33 +01:00
Matthew Honnibal 3440a72ecb
Update Makefile (#5099) 2020-03-04 19:28:16 +01:00
Ines Montani 31faab3647
Merge pull request #5097 from mirfan899/master
Basque language support added.
2020-03-04 17:20:23 +01:00
Ines Montani 3adc511cb0
Merge pull request #5070 from explosion/refactor/simplify-warnings
Simplify warnings
2020-03-04 17:11:18 +01:00
Ines Montani b0cfab317f Merge branch 'develop' into refactor/simplify-warnings 2020-03-04 16:38:55 +01:00
Ines Montani 99d8ee506f
Merge pull request #5100 from adrianeboyd/feature/bump-srsly-1.0.2
Require srsly >=1.0.2
2020-03-04 16:32:52 +01:00
Adriane Boyd 4d655b1d45 Require srsly >=1.0.2 2020-03-04 13:50:37 +01:00
Muhammad Irfan 224a7f8e94 examples 2020-03-04 15:49:06 +05:00
Muhammad Irfan 03376c9d9b Basque language added and tested. 2020-03-04 11:58:56 +05:00
adrianeboyd 9be90dbca3
Improve token head verification (#5079)
* Improve token head verification

Improve the verification for valid token heads when heads are set:

* in `Token.head`: heads come from the same document
* in `Doc.from_array()`: head indices are within the bounds of the
document

* Improve error message
2020-03-03 21:44:51 +01:00
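
The bounds check described in #5079 can be sketched roughly as follows. This is an illustrative example only: the function name `check_head_offsets` is hypothetical, and it assumes heads are given as offsets relative to each token's own position, which may not match the actual implementation.

```python
def check_head_offsets(heads):
    """Return True if every head stays inside the document, else raise.

    `heads` is assumed to be a list of relative offsets: the head of
    token i is the token at position i + heads[i].
    """
    n = len(heads)
    for i, offset in enumerate(heads):
        target = i + offset
        if not 0 <= target < n:
            raise ValueError(
                f"token {i}: head index {target} is outside "
                f"the document (length {n})"
            )
    return True
```

Calling `check_head_offsets([1, 0, -1])` passes, while `check_head_offsets([0, 2])` raises because token 1 would point at position 3 in a two-token document.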
adrianeboyd 8c20dae6f7
Fix model-final/model-best meta from train CLI (#5093)
* Fix model-final/model-best meta

* include speed and accuracy from final iteration
* combine with speeds from base model if necessary

* Include token_acc metric for all components
2020-03-03 21:43:25 +01:00
Sofie Van Landeghem a0998868ff
prevent updating cfg if the Model was already defined (#5078) 2020-03-03 13:58:56 +01:00
Sofie Van Landeghem d307e9ca58
take care of global vectors in multiprocessing (#5081)
* restore load_nlp.VECTORS in the child process

* add unit test

* fix test

* remove unnecessary import

* add utf8 encoding

* import unicode_literals
2020-03-03 13:58:22 +01:00
adrianeboyd d078b47c81
Break out of infinite loop as intended (#5077) 2020-03-03 12:29:05 +01:00
adrianeboyd 697bec764d
Normalize IS_SENT_START to SENT_START for Matcher (#5080) 2020-03-03 12:22:39 +01:00
adrianeboyd 2281c4708c
Restore empty tokenizer properties (#5026)
* Restore empty tokenizer properties

* Check for types in tokenizer.from_bytes()

* Add test for setting empty tokenizer rules
2020-03-02 11:55:02 +01:00
Sofie Van Landeghem c6b12ab02a
Bugfix/get doc (#5049)
* new (broken) unit test

* fixing get_doc method
2020-03-02 11:49:28 +01:00
Ines Montani 648f61d077
Tidy up compiler flags and imports (#5071) 2020-03-02 11:48:10 +01:00
Ines Montani 7efaa76168 Update errors.py 2020-02-28 12:23:31 +01:00
Ines Montani 37691e6d5d Simplify warnings 2020-02-28 12:20:23 +01:00
Ines Montani 5da3ad682a Tidy up and auto-format 2020-02-28 11:57:41 +01:00
adrianeboyd 65d7bab10f
Initialize all values in a2b/b2a in new align (#5063) 2020-02-27 18:43:00 +01:00
Sofie Van Landeghem 06f0a8daa0
Default settings to configurations (#4995)
* fix grad_clip naming

* cleaning up pretrained_vectors out of cfg

* further refactoring Model init's

* move Model building out of pipes

* further refactor to require a model config when creating a pipe

* small fixes

* making cfg in nn_parser more consistent

* fixing nr_class for parser

* fixing nn_parser's nO

* fix printing of loss

* architectures in own file per type, consistent naming

* convenience methods default_tagger_config and default_tok2vec_config

* let create_pipe access default config if available for that component

* default_parser_config

* move defaults to separate folder

* allow reading nlp from package or dir with argument 'name'

* architecture spacy.VocabVectors.v1 to read static vectors from file

* cleanup

* default configs for nel, textcat, morphologizer, tensorizer

* fix imports

* fixing unit tests

* fixes and clean up

* fixing defaults, nO, fix unit tests

* restore parser IO

* fix IO

* 'fix' serialization test

* add *.cfg to manifest

* fix example configs with additional arguments

* replace Morphologizer with Tagger

* add IO bit when testing overfitting of tagger (currently failing)

* fix IO - don't initialize when reading from disk

* expand overfitting tests to also check IO goes OK

* remove dropout from HashEmbed to fix Tagger performance

* add defaults for sentrec

* update thinc

* always pass a Model instance to a Pipe

* fix piped_added statement

* remove obsolete W029

* remove obsolete errors

* restore byte checking tests (work again)

* clean up test

* further test cleanup

* convert from config to Model in create_pipe

* bring back error when component is not initialized

* cleanup

* remove calls for nlp2.begin_training

* use thinc.api in imports

* allow setting charembed's nM and nC

* fix for hardcoded nM/nC + unit test

* formatting fixes

* trigger build
2020-02-27 18:42:27 +01:00
Matthew Honnibal b4e0d2bf50
Improve Makefile (#5067)
* Improve pex making

* Update gitignore
2020-02-26 20:59:10 +01:00
Adriane Boyd 9f740a9891 Add a few more Danish tokenizer exceptions 2020-02-26 14:59:03 +01:00
Ines Montani 1c212215cd
Merge pull request #5064 from adrianeboyd/feature/german-tokenization
Improve German tokenization
2020-02-26 13:41:44 +01:00
Ines Montani f39ddda193
Merge pull request #5062 from svlandeg/bugfix/merge-conflicts
Fix sync between master and develop
2020-02-26 13:41:16 +01:00
Ines Montani 56978f5cd8
Merge pull request #5060 from svlandeg/feature/update-thinc
update thinc
2020-02-26 13:40:23 +01:00
Adriane Boyd d1f703d78d Improve German tokenization
Improve German tokenization with respect to Tiger.
2020-02-26 13:06:52 +01:00
Ines Montani 54da6a2a07 Update pyproject.toml 2020-02-26 12:51:53 +01:00
Ines Montani ed9358420e Merge branch 'master' into pr/5060 2020-02-26 12:51:29 +01:00
adrianeboyd ff184b7a9c
Add tag_map argument to CLI debug-data and train (#4750) (#5038)
Add an argument for a path to a JSON-formatted tag map, which is used to
update and extend the default language tag map.
2020-02-26 12:10:38 +01:00
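
The "update and extend" behaviour described in #5038 can be pictured as a dictionary merge. The sketch below is an assumption about the general idea, not the CLI's actual code: the helper `merge_tag_map` and the tag entries are hypothetical, and the JSON is read from a string here rather than from a file path.

```python
import json

# Hypothetical default tag map for a language.
default_tag_map = {"NN": {"pos": "NOUN"}, "VB": {"pos": "VERB"}}

# Hypothetical contents of a user-supplied JSON tag map file.
custom_json = '{"NN": {"pos": "PROPN"}, "XX": {"pos": "X"}}'


def merge_tag_map(default, json_text):
    # Copy first so the default map is left untouched.
    merged = dict(default)
    # Entries from the JSON override existing tags and add new ones.
    merged.update(json.loads(json_text))
    return merged


merged = merge_tag_map(default_tag_map, custom_json)
```

Here `NN` is overridden, `XX` is added, and `VB` keeps its default entry.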
svlandeg 18ff97589d update spacy to 2.2.4.dev0 2020-02-26 10:50:05 +01:00
svlandeg 62406a9513 update from thinc 7.4.0.dev2 to 7.4.0 2020-02-26 10:30:35 +01:00
svlandeg fc6e34c3a1 fix bugs from porting master to develop 2020-02-26 08:44:22 +01:00
Ines Montani c7e3c034d2
Merge pull request #5061 from explosion/fix/pyproject-toml-master
Update pyproject.toml
2020-02-25 20:22:26 +01:00
Ines Montani 192b8d45a1
Merge pull request #5008 from svlandeg/fix/build_dependencies
Re-add pyproject.toml and add tests for dependency version consistency
2020-02-25 16:52:18 +01:00
Ines Montani dc36ec98a4 Update pyproject.toml 2020-02-25 16:46:14 +01:00
Ines Montani b6a6cff708 Add blis to pyproject.toml 2020-02-25 16:17:23 +01:00
Ines Montani 912572e04a Only copy if file exists (not if installed from sdist etc.) 2020-02-25 16:01:58 +01:00
Ines Montani 436b26fe0f Revert other changes 2020-02-25 15:48:29 +01:00
Ines Montani c1a5ece65f Tidy up setup and update requirements tests 2020-02-25 15:46:39 +01:00
Ines Montani 5d21d3e8b9 Merge branch 'develop' into pr/5008 2020-02-25 15:24:47 +01:00
Ines Montani acb4e3c7ba
Merge pull request #5039 from adrianeboyd/typo/website-token-api-shape
Fix formatting in Token API
2020-02-25 14:57:25 +01:00
Ines Montani d50152b917
Merge pull request #5019 from questoph/master
Optimizing tokenization for Luxembourgish (dealing with apostrophe infixes)
2020-02-25 14:48:50 +01:00