10804 Commits

Author SHA1 Message Date
Sofie Van Landeghem 22b9e12159 Ensure the NER remains consistent after resizing (#4330)
* test and fix for second bug of issue 4042

* fix for first bug in 4042

* crashing test for Issue 4313

* forgot one instance of resize

* remove prints

* undo uncomment

* delete test for 4313 (uses third party lib)

* add fix for Issue 4313

* unit test for 4313
2019-09-27 20:57:13 +02:00
adrianeboyd 3906785b49 Initialize low data warning for debug-data parser (#4331) 2019-09-27 20:56:49 +02:00
Ines Montani 59beab8405 Update v2-2.md [ci skip] 2019-09-27 18:10:43 +02:00
Ines Montani 206e8a5ac7 Also apply hotfix to Ukrainian lemmatizer 2019-09-27 18:03:26 +02:00
Ines Montani acd5bcb0b3 Tidy up fixtures 2019-09-27 17:57:59 +02:00
Ines Montani b21b2e27e5 Hotfix Russian lemmatizer 2019-09-27 17:56:12 +02:00
Matthew Honnibal a4d4c4bfa4 Set version to v2.2.0.dev11 2019-09-27 16:40:26 +02:00
Ines Montani 685e4b2554 Update v2-2.md [ci skip] 2019-09-27 16:35:01 +02:00
Ines Montani aad66d9bb9 Document PhraseMatcher.remove [ci skip] 2019-09-27 16:34:53 +02:00
adrianeboyd c23edf302b Replace PhraseMatcher with trie-based search (#4309)
* Replace PhraseMatcher with Aho-Corasick

Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.

Fixes #4308.

* Restore support for pickling

* Fix internal keyword add/remove for numpy arrays

* Add missing loop for match ID set in search loop

* Remove cruft in matching loop for partial matches

There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.

* Replace dict trie with MapStruct trie

* Fix how match ID hash is stored/added

* Update fix for match ID vocab

* Switch from map_get_unless_missing to map_get

* Switch from numpy array to Token.get_struct_attr

Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.

Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches

* Implement full remove()

Remove unnecessary trie paths and free unused maps.

Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.

* Store docs internally only as attr lists

* Reduces size for pickle

* Remove duplicate keywords store

Now that docs are stored as lists of attr hashes, there's no need to
have the duplicate _keywords store.
2019-09-27 16:22:34 +02:00
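
The trie-based approach described in the commit above (c23edf302b) can be illustrated with a minimal, self-contained sketch. This is not spaCy's implementation: the class name `TriePhraseMatcher`, the `_TERMINAL` sentinel, and the use of plain Python dicts and string tokens are assumptions made for illustration, whereas the commit describes a MapStruct trie over attribute hashes. The sketch also scans from every start position rather than using Aho-Corasick failure links. It only shows the general idea: phrases are stored as paths in a trie, a reserved terminal key holds the match IDs, and `remove()` prunes unused paths and raises `KeyError` for an unknown match ID.

```python
# Hypothetical, simplified token-level trie matcher (not spaCy's actual code).
_TERMINAL = object()  # reserved key marking the end of a stored phrase


class TriePhraseMatcher:
    def __init__(self):
        self._trie = {}     # nested dicts: token -> child node
        self._phrases = {}  # match ID -> set of token tuples

    def add(self, match_id, phrases):
        """Store each phrase (a sequence of tokens) under match_id."""
        for tokens in phrases:
            tokens = tuple(tokens)
            if not tokens:
                continue
            node = self._trie
            for tok in tokens:
                node = node.setdefault(tok, {})
            node.setdefault(_TERMINAL, set()).add(match_id)
            self._phrases.setdefault(match_id, set()).add(tokens)

    def remove(self, match_id):
        """Remove a match ID and prune trie paths that are no longer used."""
        if match_id not in self._phrases:
            raise KeyError(match_id)
        for tokens in self._phrases.pop(match_id):
            # Collect the nodes along the phrase's path, root first.
            path = [self._trie]
            for tok in tokens:
                path.append(path[-1][tok])
            ids = path[-1].get(_TERMINAL)
            if ids is not None:
                ids.discard(match_id)
                if not ids:
                    del path[-1][_TERMINAL]
            # Walk back up, deleting nodes that became empty.
            for i in range(len(tokens) - 1, -1, -1):
                if path[i + 1]:
                    break
                del path[i][tokens[i]]

    def find_matches(self, tokens):
        """Return (match_id, start, end) for every stored phrase in tokens."""
        matches = []
        for start in range(len(tokens)):
            node = self._trie
            for end in range(start, len(tokens)):
                node = node.get(tokens[end])
                if node is None:
                    break
                for match_id in sorted(node.get(_TERMINAL, ())):
                    matches.append((match_id, start, end + 1))
        return matches


matcher = TriePhraseMatcher()
matcher.add("CITY", [["New", "York"], ["New", "York", "City"]])
matcher.add("ORG", [["York", "City", "Council"]])
words = "I love New York City Council".split()
matches = matcher.find_matches(words)
```

Storing only token tuples per match ID, rather than a second keyword store, loosely mirrors the commit's note about removing the duplicate `_keywords` store, though the data structures here are much simpler.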
adrianeboyd d844030fd8 Update UD bin scripts (#4315)
* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)
2019-09-27 16:20:38 +02:00
tamuhey b408b5b29e Refactor language update (#4316)
* refactor: separate formatting docs and golds in Language.update

* fix return typo
2019-09-27 16:20:21 +02:00
Matthew Honnibal 105a91975b Fix sdist command 2019-09-27 15:52:26 +02:00
Ines Montani 3624153591 Update languages.json [ci skip] 2019-09-27 15:15:41 +02:00
EarlGreyT 1e9e2d8aa1 fix typo in first token (#4327)
* fix typo in first token

The head of 'in' is review which has an offset of 4 and not 44

* added contributor agreement
2019-09-27 14:49:36 +02:00
Jaydeep Borkar 6a06a3fa6a Update stop_words.py and add name in contributors (#4325)
* Update stop_words.py and add name in contributors

* add jaydeepborkar.md in contributors directory

* Reset template [ci skip]


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-27 11:57:27 +02:00
Ajinkya Kale 975aebd7e4 typo fix for wordnet_annotator (#4326) 2019-09-27 11:52:53 +02:00
Ines Montani eb0649e38e Fix tag [ci skip] 2019-09-26 16:22:33 +02:00
Ines Montani da9a869d3f Update vectors name docs [ci skip] 2019-09-26 16:21:32 +02:00
Matthew Honnibal 58533f01bf Set version to v2.2.0.dev10 2019-09-26 03:03:50 +02:00
Matthew Honnibal 27ace84f4a Support model name in init-model 2019-09-26 03:01:32 +02:00
Matthew Honnibal d0b30bf8cd Merge branch 'master' of https://github.com/explosion/spaCy 2019-09-25 21:14:30 +02:00
Matthew Honnibal eced2f3211 Set version to v2.2.0.dev9 2019-09-25 21:14:07 +02:00
Em Zhan aafa091541 Fix typo in documentation (#4322)
* Fix typo 'probj' instead of 'pobj'

* Add spaCy contributor agreement for zqianem
2019-09-25 19:42:18 +02:00
Matthew Honnibal 1251b57dbb Fix vectors name arg to init-model 2019-09-25 14:21:27 +02:00
Matthew Honnibal 92ed4dc5e0 Allow vectors name to be set in init-model (#4321)
* Allow vectors name to be specified in init-model

* Document --vectors-name argument to init-model

* Update website/docs/api/cli.md

Co-Authored-By: Ines Montani <ines@ines.io>
2019-09-25 13:11:00 +02:00
Eric Semeniuc 09816f8323 update sense2vec version (#4320) 2019-09-25 12:17:54 +02:00
Ines Montani 52904b7270 Raise if on_match is not callable or None 2019-09-24 23:06:24 +02:00
Ines Montani 38de08c7a9 Update README.md [ci skip] 2019-09-24 14:31:09 +02:00
Sofie Van Landeghem 42340740e3 update neuralcoref example (#4317) 2019-09-24 10:47:17 +02:00
Ines Montani 16aa092fb5 Improve Morphology errors (#4314)
* Improve Morphology errors

* Also clean up some other errors

* Update errors.py
2019-09-21 14:37:06 +02:00
Ines Montani 9bf69bfbb2 Remove test 2019-09-19 17:38:41 +02:00
Ines Montani 197406de1d Update v2-2.md [ci skip] 2019-09-19 14:33:58 +02:00
Ines Montani c1030b1ad2 Update README.md [ci skip] 2019-09-19 13:35:12 +02:00
Ines Montani 0f9e253a69 Update README.md [ci skip] 2019-09-19 13:34:37 +02:00
Ines Montani f2d224756b Update README.md [ci skip] 2019-09-19 12:52:26 +02:00
Ines Montani 80d554f2e2 Remove unsupported version [ci skip] 2019-09-19 01:14:42 +02:00
Ines Montani 8cd3763678 Update about.py [ci skip] 2019-09-19 01:02:25 +02:00
Ines Montani ddc09b08ed Update v2-2.md [ci skip] 2019-09-19 00:58:30 +02:00
Matthew Honnibal f52b857953 Update version 2019-09-19 00:56:35 +02:00
Matthew Honnibal e34b4a38b0 Fix set labels meta 2019-09-19 00:56:07 +02:00
Matthew Honnibal 9d399fe63a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-09-19 00:04:06 +02:00
Matthew Honnibal 7d510c833e Fix orth replacement 2019-09-19 00:03:24 +02:00
Ines Montani 89d1dc4afa Merge branch 'master' into develop 2019-09-18 22:12:24 +02:00
Sean Löfgren 31c683d87d add return_matches and as_tuples back to Matcher.pipe (#4303)
* add contributor agreement [ci skip]

* add return_matches and as_tuples back to Matcher.pipe
2019-09-18 22:00:33 +02:00
Matthew Honnibal 42df49133d Also lower-case in orth variants 2019-09-18 21:54:51 +02:00
Matthew Honnibal 19d99fc9e7 Set version to v2.2.0.dev7 2019-09-18 21:43:59 +02:00
Matthew Honnibal e2047576c4 Fix merge conflict 2019-09-18 21:42:11 +02:00
Matthew Honnibal 46c02d25b1 Merge changes to test_ner 2019-09-18 21:41:24 +02:00
Sofie Van Landeghem de5a9ecdf3 Distinction between outside, missing and blocked NER annotations (#4307)
* remove duplicate unit test

* unit test (currently failing) for issue 4267

* bugfix: ensure doc.ents preserves kb_id annotations

* fix in setting doc.ents with empty label

* rename

* test for presetting an entity to a certain type

* allow overwriting Outside + blocking presets

* fix actions when previous label needs to be kept

* fix default ent_iob in set entities

* cleaner solution with U- action

* remove debugging print statements

* unit tests with explicit transitions and is_valid testing

* remove U- from move_names explicitly

* remove unit tests with pre-trained models that don't work

* remove (working) unit tests with pre-trained models

* clean up unit tests

* move unit tests

* small fixes

* remove two TODO's from doc.ents comments
2019-09-18 21:37:17 +02:00