Commit Graph

5610 Commits

Author SHA1 Message Date
Ines Montani 5651a0d052 💫 Replace {Doc,Span}.merge with Doc.retokenize (#3280)
* Add deprecation warning to Doc.merge and Span.merge

* Replace {Doc,Span}.merge with Doc.retokenize
2019-02-15 10:29:44 +01:00
Ines Montani f146121092 💫 Make handling of [Pipe].labels consistent (#3273)
* Make handling of [Pipe].labels consistent

* Un-xfail passing test

* Update spacy/pipeline/pipes.pyx

Co-Authored-By: ines <ines@ines.io>

* Update spacy/pipeline/pipes.pyx

Co-Authored-By: ines <ines@ines.io>

* Update spacy/tests/pipeline/test_pipe_methods.py

Co-Authored-By: ines <ines@ines.io>

* Update spacy/pipeline/pipes.pyx

Co-Authored-By: ines <ines@ines.io>

* Move error message to spacy.errors

* Fix textcat labels and test

* Make EntityRuler.labels return tuple as well
2019-02-15 06:03:19 +11:00
Ines Montani 3d577b77c6 Auto-formatting 2019-02-14 19:56:38 +01:00
Ines Montani 2569339a98 Formatting and whitespace [ci skip] 2019-02-14 18:05:07 +01:00
Ines Montani e104e47c21 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-02-14 15:35:34 +01:00
Ines Montani 0cd01a8c5e Merge branch 'master' into develop 2019-02-14 15:35:20 +01:00
Ines Montani 2e31921d0a 💫 Add base Language classes for more languages (#3276)
* Add base classes for more languages

* Add test for language class initialization

Make sure language can be initialized – otherwise, it's difficult to catch serious errors in the test suite, because languages are lazy-loaded
2019-02-15 01:31:19 +11:00
Grivaz 39815513e2 Add split one token into several (resolves #2838) (#3253)
* Add split one token into several (resolves #2838)

* Improve error message for token splitting

* Make retokenizer.split() tests use a Token object

Change retokenizer.split() to use a Token object, instead of an index.

* Pass Token into retokenize.split()

Tweak retokenize.split() API so that we pass the `Token` object, not the index.

* Fix token.idx in retokenize.split()

* Test that token.idx is correct after split

* Fix token.idx for split tokens

* Fix retokenize.split()

* Fix retokenize.split

* Fix retokenize.split() test
2019-02-15 01:27:13 +11:00
Ines Montani 743ecf728c Tidy up conftest 2019-02-14 13:27:13 +01:00
Ines Montani 106d95b01a Fix typo 2019-02-14 12:26:56 +01:00
Ines Montani 11d6b874db Update stop_words.py 2019-02-14 12:25:19 +01:00
Ines Montani 60c2a3bb65 Also raise original error message in util.get_lang_class
Otherwise, the true error that happens within a Language subclass is swallowed, because if it's imported lazily like that, it'll always be an ImportError
2019-02-13 16:52:25 +01:00
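The lazy-import problem described in the entry above can be sketched with the standard library alone. This is a minimal illustration, not spaCy's actual `util.get_lang_class`: the `package` parameter, the `Language` attribute name, and the error wording are assumptions made for the example.

```python
import importlib


def get_lang_class(lang, package="langs"):
    """Lazily import `<package>.<lang>` and return its `Language` attribute.

    Hypothetical helper for illustration only; the package layout and the
    attribute name are assumptions, not spaCy's real code.
    """
    module_name = "{}.{}".format(package, lang)
    try:
        module = importlib.import_module(module_name)
    except ImportError as err:
        # Include the original message and chain the exception, so a failure
        # inside the language module is not reduced to "language not found".
        raise ImportError(
            "Can't import language '{}': {}".format(lang, err)
        ) from err
    return module.Language
```

Chaining with `from err` keeps the original traceback available, which is what makes a broken import inside a language module distinguishable from a missing language.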
Ines Montani 4d2438f985 Tidy up and auto-format 2019-02-13 15:29:08 +01:00
Ines Montani fbf9f1edf1 Also raise error in Span.__reduce__ 2019-02-13 13:22:05 +01:00
Ines Montani 2d0c3c73f4 Raise better error if token is pickled (resolves #2833) (#3267) 2019-02-13 11:27:04 +01:00
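One way to raise a clearer error when a view-like object is pickled is to override `__reduce__`. The sketch below uses a hypothetical stand-in class; the exception type and message are assumptions and may differ from what spaCy raises.

```python
import pickle


class Token:
    """Hypothetical stand-in for a view object tied to a parent document."""

    def __init__(self, doc, i):
        self.doc = doc
        self.i = i

    def __reduce__(self):
        # pickle reaches this method when serializing the object; raising
        # here replaces an obscure failure with an actionable message.
        raise NotImplementedError(
            "Token objects can't be pickled on their own; "
            "pickle the parent document instead."
        )
```

With this in place, `pickle.dumps(Token(["a", "b"], 0))` fails with the message above rather than an unrelated serialization error.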
Ines Montani 2f45bd94c0 Auto-formatting 2019-02-12 18:30:11 +01:00
Ines Montani 0184a95340 Merge branch 'master' into develop 2019-02-12 18:29:24 +01:00
Akhilesh a78db10941 add kannada support (#3264)
* add kannada support

* add few more stop words

* add support for Kannada Language
2019-02-12 18:28:39 +01:00
Ines Montani b589b945db Fix PhraseMatcher pickling and length (resolves #3248) (#3252) 2019-02-12 18:27:54 +01:00
Ines Montani 483dddc9bc 💫 Add token match pattern validation via JSON schemas (#3244)
* Add custom MatchPatternError

* Improve validators and add validation option to Matcher

* Adjust formatting

* Never validate in Matcher within PhraseMatcher

If we do decide to make validate default to True, the PhraseMatcher's Matcher shouldn't ever validate. Here, we create the patterns automatically anyways (and it's currently unclear whether the validation has performance impacts at a very large scale).
2019-02-13 01:47:26 +11:00
Ines Montani ad2a514cdf Show warning if phrase pattern Doc was overprocessed (#3255)
In most cases, the PhraseMatcher will match on the verbatim token text or as of v2.1, sometimes the lowercase text. This means that we only need a tokenized Doc, without any other attributes.

If phrase patterns are created by processing large terminology lists with the full `nlp` object, this easily can make things a lot slower, because all components will be applied, even if we don't actually need the attributes they set (like part-of-speech tags, dependency labels).

The warning message also includes a suggestion to use nlp.make_doc or nlp.tokenizer.pipe for even faster processing. For now, the validation has to be enabled explicitly by setting validate=True.
2019-02-13 01:45:31 +11:00
Matthew Honnibal 6ec834dc72 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-02-13 01:14:44 +11:00
Matthew Honnibal 43fa039d96 xfail regression test for model labels 2019-02-13 01:14:26 +11:00
Matthew Honnibal bc300d4e31 Add test for issue 3209 2019-02-13 01:13:01 +11:00
Ines Montani 34a3cc26a9 Add xfailing test for reverse pattern (see #1971) 2019-02-12 14:49:59 +01:00
Ines Montani fe39fd4d13 Make warning tests more explicit 2019-02-10 14:02:19 +01:00
Ines Montani a9f8d17632 💫 Break up large pipeline.pyx (#3246)
* Break up large pipeline.pyx

* Merge some components back together

* Fix typo
2019-02-10 12:14:51 +01:00
Ines Montani e7593b791e Fix import 2019-02-08 20:50:52 +01:00
Ines Montani 0754b848fe Actually xfail test for #1971 2019-02-08 20:50:35 +01:00
Ines Montani 414a69b736 Add xfailing test (see #1971, #2675, #2671) 2019-02-08 20:50:01 +01:00
Ines Montani ea07f3022e Only run noun chunks iterator in Span if available (closes #3199) 2019-02-08 18:33:16 +01:00
Ines Montani ff36b14cb2 Fix whitespace 2019-02-08 18:31:31 +01:00
Ines Montani f4ce7bb7e9 Fix typo and deprecation message (resolves #3195) [ci skip] 2019-02-08 18:09:23 +01:00
Ines Montani 694139aad3 Fix formatting [ci skip] 2019-02-08 16:32:36 +01:00
Ines Montani 2898768757 Remove unused attribute [ci skip] 2019-02-08 16:31:30 +01:00
Ines Montani 586c56fc6c Tidy up regression tests 2019-02-08 15:51:13 +01:00
Ines Montani 25602c794c Tidy up and fix small bugs and typos 2019-02-08 14:14:49 +01:00
Ines Montani 9e652afa4b Merge branch 'master' into develop 2019-02-08 13:28:09 +01:00
Björn Lennartsson 647f0140c7 Fixed tag map for Swedish Talbanken (#3186) 2019-02-08 14:28:59 +11:00
Stanisław Giziński 1448ad100c Improved polish tokenizer and stop words. (#2974)
* Improved stop words list

* Removed some wrong stop words from list

* Improved stop words list

* Removed some wrong stop words from list

* Improved Polish Tokenizer (#38)

* Add tests for polish tokenizer

* Add polish tokenizer exceptions

* Don't split any words containing hyphens

* Fix test case with wrong model answer

* Remove commented out line of code until better solution is found

* Add source srx' license

* Rename exception_list.py to match spaCy conventionality

* Add a brief explanation of where the exception list comes from

* Add newline after each exception

* Rename COPYING.txt to LICENSE

* Delete old files

* Add header to the license

* Agreements signed

* Stanisław Giziński agreement

* Krzysztof Kowalczyk - signed agreement

* Mateusz Olko agreement

* Add DoomCoder's contributor agreement

* Improve like number checking in polish lang


* like num tests added

* all from SI system added

* Final licence and removed splitting exceptions

* Added polish stop words to LEX_ATTRA

* Add encoding info to pl tokenizer exceptions
2019-02-08 14:27:21 +11:00
Ines Montani 402d133c90 Add Ukrainian unicode 2019-02-07 21:11:58 +01:00
Ines Montani e2d93e4852 Merge branch 'master' into develop 2019-02-07 21:10:08 +01:00
Ines Montani 2499da97e8 Format 2019-02-07 21:07:02 +01:00
Julia Makogon b41d64825a Ukrainian language added. Small fixes in Russian (#3241)
* Classes for Ukrainian; small fix in Russian.

* Contributor agreement
2019-02-07 21:05:11 +01:00
Ines Montani 77efee0295 Auto-format 2019-02-07 21:00:04 +01:00
Ines Montani 5d0b60999d Merge branch 'master' into develop 2019-02-07 20:54:07 +01:00
Matthew Honnibal dbeebfa3a2 Set version to v2.1.0a7.dev1 2019-02-08 01:54:01 +11:00
Ines Montani 338d659bd0 Store JSON schemas in Python and tidy up (#3235) 2019-02-07 19:44:31 +11:00
Ines Montani 1ea4df459d 💫 Break up large matcher.pyx (#3236)
* Break up large matcher.pyx

* Remove unused function
2019-02-07 19:42:25 +11:00
Ines Montani a9bf5d9fd8 Add xfailing test for set value with operator [ci skip] 2019-02-06 13:40:11 +01:00