10814 Commits

Author SHA1 Message Date
Adriane Boyd 0b7e52c797 Move more of special case retokenize to cdef nogil
Move as much of the special case retokenization to nogil as possible.
2019-09-27 09:26:20 +02:00
Adriane Boyd 72c2f98dc9 Switch special case reload threshold to variable
Refer to variable instead of hard-coded threshold
2019-09-27 09:24:52 +02:00
Adriane Boyd 669bc1a314 Switch to local cdef functions for span filtering 2019-09-26 21:00:46 +02:00
Adriane Boyd ae348bee43 Switch to PhraseMatcher.find_matches 2019-09-26 14:43:22 +02:00
Adriane Boyd 63b014d09f Merge branch 'feature/hashmatcher' into bugfix/tokenizer-special-cases-matcher 2019-09-26 14:34:09 +02:00
Adriane Boyd 3fdb22d832 Implement full remove()
Remove unnecessary trie paths and free unused maps.

Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.
2019-09-26 11:31:03 +02:00
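
The `remove()` behaviour described in the commit above (prune trie paths that are no longer needed, raise `KeyError` for an unknown match ID) can be illustrated with a minimal sketch. This is not the spaCy implementation: the class `PhraseTrie`, the `TERMINAL` key, and the nested-dict layout are all hypothetical names chosen for illustration.

```python
# Hypothetical sketch of a token-level trie stored as nested dicts.
# TERMINAL is an assumed reserved key; a real token equal to it would collide.
TERMINAL = "_end_"


class PhraseTrie:
    def __init__(self):
        self.root = {}
        self.patterns = {}  # match ID -> set of token tuples

    def add(self, key, tokens):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})
        # Several match IDs may end at the same node.
        node.setdefault(TERMINAL, set()).add(key)
        self.patterns.setdefault(key, set()).add(tuple(tokens))

    def remove(self, key):
        # Raise KeyError for a match ID that was never added.
        if key not in self.patterns:
            raise KeyError(key)
        for tokens in self.patterns.pop(key):
            # Record the path from the root to the node ending this pattern.
            path = [self.root]
            for tok in tokens:
                path.append(path[-1][tok])
            ids = path[-1][TERMINAL]
            ids.discard(key)
            if not ids:
                del path[-1][TERMINAL]
            # Walk back up and delete nodes that no longer lead anywhere.
            for depth in range(len(tokens), 0, -1):
                if path[depth]:
                    break
                del path[depth - 1][tokens[depth - 1]]
```

Removing a pattern only deletes the nodes that no other pattern shares, so `["new", "york"]` survives the removal of `["new", "york", "city"]`.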
Adriane Boyd 230699e4fe Merge branch 'feature/ud-script-update' into bugfix/tokenizer-special-cases-matcher 2019-09-25 11:10:30 +02:00
Adriane Boyd 7862a6eb01 Restructure imports to export find_matches 2019-09-25 11:03:58 +02:00
Adriane Boyd 3c6f1d7e3a Switch from numpy array to Token.get_struct_attr
Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.

Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)
2019-09-25 09:41:27 +02:00
Adriane Boyd d995a7849e Switch from map_get_unless_missing to map_get 2019-09-24 16:20:24 +02:00
Adriane Boyd 34550ef662 Update fix for match ID vocab 2019-09-24 16:07:38 +02:00
Adriane Boyd d4141302b6 Fix how match ID hash is stored/added 2019-09-24 15:36:26 +02:00
Adriane Boyd 39540ed1ce Replace dict trie with MapStruct trie 2019-09-24 14:39:50 +02:00
Adriane Boyd a7e9c0fd3e Remove cruft in matching loop for partial matches
There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.
2019-09-23 09:11:13 +02:00
Adriane Boyd c38c330585 Add missing loop for match ID set in search loop 2019-09-21 15:57:38 +02:00
Adriane Boyd ede32c01e2 Update UD bin scripts
* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)
2019-09-21 12:20:22 +02:00
Adriane Boyd 97327bd268 Remove final traces of UD script modifications 2019-09-21 12:13:31 +02:00
Adriane Boyd 046a62741a Remove UD script modifications
Only used for timing/testing, should be a separate PR
2019-09-21 11:09:00 +02:00
Adriane Boyd d92e8c8ac8 Update error message number 2019-09-20 20:36:53 +02:00
Adriane Boyd 73ca0ce4f3 Merge remote-tracking branch 'upstream/master' into bugfix/tokenizer-special-cases-matcher 2019-09-20 16:44:33 +02:00
Adriane Boyd d3990d080c Improve efficiency of special cases handling
* Use PhraseMatcher instead of Matcher
* Improve efficiency of merging/splitting special cases in document
  * Process merge/splits in one pass without repeated token shifting
  * Merge in place if no splits
2019-09-20 16:39:30 +02:00
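
The "one pass without repeated token shifting" idea from the commit above can be sketched on a plain Python list. This is only an illustration under assumptions: `merge_spans` is a hypothetical helper, tokens are plain strings, and merged tokens are formed by simple concatenation; the actual change operates on spaCy's internal token structs.

```python
def merge_spans(tokens, spans):
    """Merge non-overlapping (start, end) spans in one left-to-right pass.

    Each token is copied to the output exactly once, instead of deleting
    from the middle of the list for every span (which shifts the tail
    each time).
    """
    out = []
    pos = 0  # index of the next input token not yet copied
    for start, end in sorted(spans):
        if start < pos:
            raise ValueError("spans must not overlap")
        out.extend(tokens[pos:start])             # untouched tokens before the span
        out.append("".join(tokens[start:end]))    # the merged token
        pos = end
    out.extend(tokens[pos:])                      # untouched tail
    return out


print(merge_spans(["hi", ":", "-", ")", "ok", ":", ")"], [(1, 4), (5, 7)]))
```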
Adriane Boyd e74963acd4 Add test for #4248, clean up test 2019-09-20 09:20:57 +02:00
Adriane Boyd 3a4e1f5ca7 Fix internal keyword add/remove for numpy arrays 2019-09-20 09:18:38 +02:00
Adriane Boyd 0d851db6d9 Restore support for pickling 2019-09-19 20:20:53 +02:00
Adriane Boyd 3931368ce8 Merge remote-tracking branch 'upstream/master' into feature/hashmatcher 2019-09-19 17:42:17 +02:00
Ines Montani 9bf69bfbb2 Remove test 2019-09-19 17:38:41 +02:00
Adriane Boyd 0d9740e826 Replace PhraseMatcher with Aho-Corasick
Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.

Fixes #4308.
2019-09-19 16:49:05 +02:00
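
As a rough illustration of matching phrases over hashed token attributes, as described in the commit above, here is a minimal trie scan in the FlashText style. It is a sketch under assumptions, not the spaCy code: `hash_token`, `build_trie`, and `find_matches` are hypothetical helpers, Python's built-in `hash` stands in for the vocabulary hash, and the scan restarts at every token rather than using Aho-Corasick failure links.

```python
def hash_token(text):
    # Stand-in for a vocabulary hash of the chosen attribute (assumption).
    return hash(text)


# Unique sentinel marking the end of a phrase; it cannot collide with a hash.
TERMINAL = object()


def build_trie(patterns):
    """patterns: dict mapping match ID -> list of token lists."""
    root = {}
    for match_id, phrases in patterns.items():
        for tokens in phrases:
            node = root
            for tok in tokens:
                node = node.setdefault(hash_token(tok), {})
            node.setdefault(TERMINAL, set()).add(match_id)
    return root


def find_matches(root, tokens):
    """Return (match_id, start, end) for every phrase found in tokens."""
    hashes = [hash_token(t) for t in tokens]
    matches = []
    for start in range(len(hashes)):
        node = root
        for end in range(start, len(hashes)):
            node = node.get(hashes[end])
            if node is None:
                break  # no pattern continues with this token
            for match_id in sorted(node.get(TERMINAL, ())):
                matches.append((match_id, start, end + 1))
    return matches
```

Because every match ID ending at a node is reported, overlapping and nested phrases are all returned; a caller that wants only the longest match would need to filter the result.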
Ines Montani 197406de1d Update v2-2.md [ci skip] 2019-09-19 14:33:58 +02:00
Ines Montani c1030b1ad2 Update README.md [ci skip] 2019-09-19 13:35:12 +02:00
Ines Montani 0f9e253a69 Update README.md [ci skip] 2019-09-19 13:34:37 +02:00
Ines Montani f2d224756b Update README.md [ci skip] 2019-09-19 12:52:26 +02:00
Ines Montani 80d554f2e2 Remove unsupported version [ci skip] 2019-09-19 01:14:42 +02:00
Ines Montani 8cd3763678 Update about.py [ci skip] 2019-09-19 01:02:25 +02:00
Ines Montani ddc09b08ed Update v2-2.md [ci skip] 2019-09-19 00:58:30 +02:00
Matthew Honnibal f52b857953 Update version 2019-09-19 00:56:35 +02:00
Matthew Honnibal e34b4a38b0 Fix set labels meta 2019-09-19 00:56:07 +02:00
Matthew Honnibal 9d399fe63a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-09-19 00:04:06 +02:00
Matthew Honnibal 7d510c833e Fix orth replacement 2019-09-19 00:03:24 +02:00
Ines Montani 89d1dc4afa Merge branch 'master' into develop 2019-09-18 22:12:24 +02:00
Sean Löfgren 31c683d87d add return_matches and as_tuples back to Matcher.pipe (#4303)
* add contributor agreement [ci skip]

* add return_matches and as_tuples back to Matcher.pipe
2019-09-18 22:00:33 +02:00
Matthew Honnibal 42df49133d Also lower-case in orth variants 2019-09-18 21:54:51 +02:00
Matthew Honnibal 19d99fc9e7 Set version to v2.2.0.dev7 2019-09-18 21:43:59 +02:00
Matthew Honnibal e2047576c4 Fix merge conflict 2019-09-18 21:42:11 +02:00
Matthew Honnibal 46c02d25b1 Merge changes to test_ner 2019-09-18 21:41:24 +02:00
Sofie Van Landeghem de5a9ecdf3 Distinction between outside, missing and blocked NER annotations (#4307)
* remove duplicate unit test

* unit test (currently failing) for issue 4267

* bugfix: ensure doc.ents preserves kb_id annotations

* fix in setting doc.ents with empty label

* rename

* test for presetting an entity to a certain type

* allow overwriting Outside + blocking presets

* fix actions when previous label needs to be kept

* fix default ent_iob in set entities

* cleaner solution with U- action

* remove debugging print statements

* unit tests with explicit transitions and is_valid testing

* remove U- from move_names explicitly

* remove unit tests with pre-trained models that don't work

* remove (working) unit tests with pre-trained models

* clean up unit tests

* move unit tests

* small fixes

* remove two TODO's from doc.ents comments
2019-09-18 21:37:17 +02:00
Moshe Hazoom 72463b062f Improve speed of _merge method (#4300)
* make merge more efficient

* fix offsets

* merge works with relative indices

* remove printing

* Add the SCA

* fix SCA date

* more cythonize _retokenize.pyx

* more cythonize _retokenize.pyx

* fix only declaration in _retokenize.pyx

* switch back to absolute head

* switch back to absolute head

* fix comment

* merge from origin repo
2019-09-18 21:34:34 +02:00
Ines Montani 63a584c6d4 Update README.md [ci skip] 2019-09-18 21:34:24 +02:00
tamuhey 875f3e5d8c remove redundant __call__ method in pipes.TextCategorizer (#4305)
* remove redundant __call__ method in pipes.TextCategorizer

Because the parent __call__ method behaves in the same way.

* fix: Pipe.__call__ arg

* fix: invalid arg in Pipe.__call__

* modified:   spacy/tests/regression/test_issue4278.py (#4278)

* deleted:    Pipfile
2019-09-18 21:31:27 +02:00
Ines Montani d84763727c Remove unused setting [ci skip] 2019-09-18 21:24:14 +02:00
Ines Montani 9c940eab94 Update version in examples [ci skip] 2019-09-18 21:23:26 +02:00