Commit Graph

9551 Commits

Author SHA1 Message Date
Ines Montani 2d0c3c73f4 Raise better error if token is pickled (resolves #2833) (#3267) 2019-02-13 11:27:04 +01:00
Ines Montani 2f45bd94c0 Auto-formatting 2019-02-12 18:30:11 +01:00
Ines Montani 0184a95340 Merge branch 'master' into develop 2019-02-12 18:29:24 +01:00
Akhilesh a78db10941 add kannada support (#3264)
* add kannada support

* add few more stop words

* add support for Kannada Language
2019-02-12 18:28:39 +01:00
Ines Montani b589b945db Fix PhraseMatcher pickling and length (resolves #3248) (#3252) 2019-02-12 18:27:54 +01:00
Ines Montani 5dd39d8697 Update universe.json 2019-02-12 18:05:51 +01:00
Abhijit Balaji 75a40f56fc added spacy-langdetect to universe.json (#3266) 2019-02-12 18:04:38 +01:00
Ines Montani 483dddc9bc 💫 Add token match pattern validation via JSON schemas (#3244)
* Add custom MatchPatternError

* Improve validators and add validation option to Matcher

* Adjust formatting

* Never validate in Matcher within PhraseMatcher

If we do decide to make validate default to True, the PhraseMatcher's Matcher should never validate. Here, we create the patterns automatically anyway (and it's currently unclear whether the validation has a performance impact at very large scale).
2019-02-13 01:47:26 +11:00
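
To make the opt-in validation described in this commit concrete, here is a minimal sketch in plain Python. It is not spaCy's implementation: it swaps the JSON schema check for a hand-rolled key check, and `ALLOWED_KEYS`, `validate_pattern`, `add_pattern` and the local `MatchPatternError` class are all hypothetical names used only for illustration.

```python
from typing import Any, Dict, List

# Hypothetical set of allowed token attribute keys, chosen for illustration only.
ALLOWED_KEYS = {"ORTH", "LOWER", "TEXT", "POS", "TAG", "OP"}


class MatchPatternError(ValueError):
    """Local stand-in for an error raised when a pattern fails validation."""

    def __init__(self, key: str, errors: List[str]) -> None:
        self.key = key
        self.errors = errors
        details = "\n".join(errors)
        super().__init__(f"Invalid token pattern for rule '{key}':\n{details}")


def validate_pattern(pattern: Any) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(pattern, list):
        return ["pattern must be a list of token dicts"]
    errors = []
    for i, token in enumerate(pattern):
        if not isinstance(token, dict):
            errors.append(f"token {i}: expected a dict, got {type(token).__name__}")
            continue
        for key in token:
            if key not in ALLOWED_KEYS:
                errors.append(f"token {i}: unknown key {key!r}")
    return errors


def add_pattern(store: Dict[str, list], key: str, pattern: Any, validate: bool = False) -> None:
    """Store a pattern under a key, checking it first only when validate=True."""
    if validate:
        errors = validate_pattern(pattern)
        if errors:
            raise MatchPatternError(key, errors)
    store.setdefault(key, []).append(pattern)
```

Keeping `validate` off by default mirrors the commit's caution about possible performance costs: callers who build patterns automatically can skip the check, while those writing patterns by hand can turn it on.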
Ines Montani ad2a514cdf Show warning if phrase pattern Doc was overprocessed (#3255)
In most cases, the PhraseMatcher will match on the verbatim token text or, as of v2.1, sometimes the lowercase text. This means that we only need a tokenized Doc, without any other attributes.

If phrase patterns are created by processing large terminology lists with the full `nlp` object, this can easily make things a lot slower, because all components will be applied, even if we don't actually need the attributes they set (like part-of-speech tags and dependency labels).

The warning message also includes a suggestion to use `nlp.make_doc` or `nlp.tokenizer.pipe` for even faster processing. For now, the validation has to be enabled explicitly by setting `validate=True`.
2019-02-13 01:45:31 +11:00
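
The idea behind this warning can be sketched without spaCy. The toy classes below are stand-ins, not the library's API: `ToyDoc`, `make_doc`, `full_pipeline` and `ToyPhraseMatcher` are hypothetical names, and tokenization is plain whitespace splitting. The sketch only shows that matching needs token texts, and that a pattern carrying extra annotations can trigger a warning when `validate=True`.

```python
import warnings
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyDoc:
    """Toy stand-in for a tokenized document (not spaCy's Doc)."""
    words: List[str]
    is_tagged: bool = False
    is_parsed: bool = False


def make_doc(text: str) -> ToyDoc:
    # Tokenize only: cheap, and all the matcher needs.
    return ToyDoc(text.split())


def full_pipeline(text: str) -> ToyDoc:
    # Pretend that a tagger and parser also ran over the text.
    return ToyDoc(text.split(), is_tagged=True, is_parsed=True)


class ToyPhraseMatcher:
    def __init__(self, validate: bool = False) -> None:
        self.validate = validate
        self.patterns: Dict[str, List[Tuple[str, ...]]] = {}

    def add(self, key: str, *docs: ToyDoc) -> None:
        for doc in docs:
            if self.validate and (doc.is_tagged or doc.is_parsed):
                warnings.warn(
                    f"Pattern for '{key}' was processed more than needed; "
                    "tokenizing alone would be faster."
                )
            if doc.words:  # skip empty patterns
                self.patterns.setdefault(key, []).append(tuple(doc.words))

    def __call__(self, doc: ToyDoc) -> List[Tuple[str, int, int]]:
        # Compare verbatim token texts only; no other attribute is consulted.
        matches = []
        for key, phrases in self.patterns.items():
            for phrase in phrases:
                n = len(phrase)
                for start in range(len(doc.words) - n + 1):
                    if tuple(doc.words[start:start + n]) == phrase:
                        matches.append((key, start, start + n))
        return matches
```

In this toy version, adding `make_doc("New York")` is silent, while adding `full_pipeline("San Francisco")` to a matcher created with `validate=True` emits a warning; both patterns still match.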
Matthew Honnibal 6ec834dc72 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-02-13 01:14:44 +11:00
Matthew Honnibal 43fa039d96 xfail regression test for model labels 2019-02-13 01:14:26 +11:00
Matthew Honnibal bc300d4e31 Add test for issue 3209 2019-02-13 01:13:01 +11:00
Ines Montani 34a3cc26a9 Add xfailing test for reverse pattern (see #1971) 2019-02-12 14:49:59 +01:00
Ines Montani d86dc9868b Remove black from dev requirements 2019-02-10 14:20:15 +01:00
Ines Montani fe39fd4d13 Make warning tests more explicit 2019-02-10 14:02:19 +01:00
Ines Montani 0d206cf47c Add black to dev requirements 2019-02-10 13:28:45 +01:00
Ines Montani a9f8d17632 💫 Break up large pipeline.pyx (#3246)
* Break up large pipeline.pyx

* Merge some components back together

* Fix typo
2019-02-10 12:14:51 +01:00
Ines Montani e7593b791e Fix import 2019-02-08 20:50:52 +01:00
Ines Montani 0754b848fe Actually xfail test for #1971 2019-02-08 20:50:35 +01:00
Ines Montani 414a69b736 Add xfailing test (see #1971, #2675, #2671) 2019-02-08 20:50:01 +01:00
Ines Montani ea07f3022e Only run noun chunks iterator in Span if available (closes #3199) 2019-02-08 18:33:16 +01:00
Ines Montani ff36b14cb2 Fix whitespace 2019-02-08 18:31:31 +01:00
Ines Montani f4ce7bb7e9 Fix typo and deprecation message (resolves #3195) [ci skip] 2019-02-08 18:09:23 +01:00
Ines Montani 8ad15a2377 Fix typo [ci skip] 2019-02-08 17:29:53 +01:00
Ines Montani 7a985cba24 Fix typo (closes #3232) [ci skip] 2019-02-08 17:29:18 +01:00
Ines Montani 694139aad3 Fix formatting [ci skip] 2019-02-08 16:32:36 +01:00
Ines Montani 2898768757 Remove unused attribute [ci skip] 2019-02-08 16:31:30 +01:00
Ines Montani 586c56fc6c Tidy up regression tests 2019-02-08 15:51:13 +01:00
Ines Montani 25602c794c Tidy up and fix small bugs and typos 2019-02-08 14:14:49 +01:00
Ines Montani 9e652afa4b Merge branch 'master' into develop 2019-02-08 13:28:09 +01:00
Björn Lennartsson 647f0140c7 Fixed tag map for Swedish Talbanken (#3186) 2019-02-08 14:28:59 +11:00
Stanisław Giziński 1448ad100c Improved Polish tokenizer and stop words. (#2974)
* Improved stop words list

* Removed some wrong stop words from list

* Improved stop words list

* Removed some wrong stop words from list

* Improved Polish Tokenizer (#38)

* Add tests for Polish tokenizer

* Add Polish tokenizer exceptions

* Don't split any words containing hyphens

* Fix test case with wrong model answer

* Remove commented out line of code until better solution is found

* Add source srx' license

* Rename exception_list.py to match spaCy conventions

* Add a brief explanation of where the exception list comes from

* Add newline after each exception

* Rename COPYING.txt to LICENSE

* Delete old files

* Add header to the license

* Agreements signed

* Stanisław Giziński agreement

* Krzysztof Kowalczyk - signed agreement

* Mateusz Olko agreement

* Add DoomCoder's contributor agreement

* Improve like-number checking for Polish


* like num tests added

* all from SI system added

* Final licence and removed splitting exceptions

* Added polish stop words to LEX_ATTRA

* Add encoding info to pl tokenizer exceptions
2019-02-08 14:27:21 +11:00
Ines Montani 402d133c90 Add Ukrainian unicode 2019-02-07 21:11:58 +01:00
Ines Montani e2d93e4852 Merge branch 'master' into develop 2019-02-07 21:10:08 +01:00
Ines Montani 2499da97e8 Format 2019-02-07 21:07:02 +01:00
Ines Montani 18205c6c48 Update company name 2019-02-07 21:06:55 +01:00
Julia Makogon b41d64825a Ukrainian language added. Small fixes in Russian (#3241)
* Classes for Ukrainian; small fix in Russian.

* Contributor agreement
2019-02-07 21:05:11 +01:00
Ines Montani 77efee0295 Auto-format 2019-02-07 21:00:04 +01:00
Ines Montani be1ff09403 Update dependencies 2019-02-07 20:57:55 +01:00
Ines Montani f7e4674423 Fix contributor agreement 2019-02-07 20:56:13 +01:00
Ines Montani 4684195822 Rename contributer_agreement.md to .github/contributors/lauraBaakman.md 2019-02-07 20:55:53 +01:00
Ines Montani 5d0b60999d Merge branch 'master' into develop 2019-02-07 20:54:07 +01:00
Laura Baakman 04aa041c9e Update Example input JSON file to adhere to specification. (#3243)
* Example file does not adhere to json input spec.

According to the [JSON input spec](https://spacy.io/api/annotation#json-input), the `id` needs to be an `int`, not a string. Using a string as `id` results in a `TypeError` when calling `spacy.gold.read_json_file()`.

* Add spaCy Contributor Agreement.
2019-02-07 16:18:01 +01:00
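
A quick way to catch the problem this commit fixes is to check the `id` fields before handing a file to the reader. The sketch below uses only the standard library; `non_int_ids` is a hypothetical helper, and the assumed file shape (a top-level list of document objects, each with an `id` and a `paragraphs` key) is a simplification made for illustration.

```python
import json
from typing import List


def non_int_ids(json_text: str) -> List[int]:
    """Return the positions of documents whose `id` is not an integer."""
    docs = json.loads(json_text)
    bad = []
    for index, doc in enumerate(docs):
        doc_id = doc.get("id")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(doc_id, bool) or not isinstance(doc_id, int):
            bad.append(index)
    return bad


# Assumed minimal shape: the second document uses a string id by mistake.
example = '[{"id": 0, "paragraphs": []}, {"id": "1", "paragraphs": []}]'
print(non_int_ids(example))
```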
Matthew Honnibal dbeebfa3a2 Set version to v2.1.0a7.dev1 2019-02-08 01:54:01 +11:00
Ines Montani 338d659bd0 Store JSON schemas in Python and tidy up (#3235) 2019-02-07 19:44:31 +11:00
Ines Montani 1ea4df459d 💫 Break up large matcher.pyx (#3236)
* Break up large matcher.pyx

* Remove unused function
2019-02-07 19:42:25 +11:00
Ines Montani a9bf5d9fd8 Add xfailing test for set value with operator [ci skip] 2019-02-06 13:40:11 +01:00
Ines Montani e51a238b3f Auto-format 2019-02-06 13:32:18 +01:00
Ines Montani f25bd9f5e4 Add gold.spans_from_biluo_tags helper (#3227) 2019-02-06 21:50:26 +11:00
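
For readers unfamiliar with the BILUO scheme, the conversion that a helper like this performs can be sketched in plain Python. This is an independent illustration, not the code added in the commit: `biluo_to_token_spans` is a hypothetical function that returns `(start, end, label)` token offsets rather than span objects, and it assumes well-formed tags of the form `B-LABEL`, `I-LABEL`, `L-LABEL`, `U-LABEL` or `O`.

```python
from typing import List, Optional, Tuple


def biluo_to_token_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BILUO tags to (start, end, label) token offsets, end exclusive."""
    spans = []
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        if tag == "O":  # outside any entity
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "U":  # single-token entity
            spans.append((i, i + 1, label))
        elif prefix == "B":  # first token of a multi-token entity
            start = i
        elif prefix == "I":  # inside a multi-token entity
            continue
        elif prefix == "L":  # last token of a multi-token entity
            if start is None:
                raise ValueError(f"'{tag}' at position {i} has no opening B- tag")
            spans.append((start, i + 1, label))
            start = None
        else:
            raise ValueError(f"Unrecognized tag '{tag}' at position {i}")
    return spans


tags = ["O", "B-PER", "L-PER", "O", "U-LOC"]
print(biluo_to_token_spans(tags))
```

The sketch does only light validation: an `L-` tag without an opening `B-` raises an error, but it does not check that the labels inside one entity agree.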
Ines Montani 5e16490d9d Fix default argument in TextCategorizer.Model (resolves #3221) 2019-02-05 12:33:47 +01:00