Commit Graph

10800 Commits

Author SHA1 Message Date
svlandeg c54aabc3cd fix loading custom tokenizer rules/exceptions from file 2019-08-28 14:17:44 +02:00
svlandeg 7bec0ebbcb failing unit test for Issue 4190 2019-08-28 14:16:34 +02:00
Ines Montani b91425f803 Update universe.json [ci skip] 2019-08-28 13:45:06 +02:00
Adriane Boyd 0a26e94d02 Modify raw to match orth variant annotation tuples
If raw is available, attempt to modify raw to match the orth variants.
If raw/words can't be aligned, abort and return unmodified
raw/annotation.
2019-08-28 13:38:54 +02:00
Ines Montani aedae8b4c5 Update universe.json [ci skip] 2019-08-28 11:59:06 +02:00
Adriane Boyd 47af3f676e Single and paired orth variants for German 2019-08-28 09:19:18 +02:00
Adriane Boyd 56c38484a1 Single and paired orth variants for English 2019-08-28 09:19:18 +02:00
Adriane Boyd aae05ff16b Add train_docs() option to add orth variants
Filtering by orth and tag, create variants of training docs with
alternate orth variants, e.g., unicode quotes, dashes, and ellipses.

The variants can be single tokens (dashes) or paired tokens (quotes)
with left and right versions.

Currently restricted to only add variants to training documents without
raw text provided, where only gold.words needs to be modified.
2019-08-28 09:18:36 +02:00
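The orth-variant idea described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the variant tables, the tag names, and the function `add_orth_variants` are all hypothetical, and it only rewrites a list of words, mirroring the stated restriction to documents without raw text.

```python
import random

# Hypothetical variant tables (assumed for illustration only).
# Single variants: any token in the group may be swapped for any other.
SINGLE_VARIANTS = [
    {"tags": [":"], "variants": ["-", "–", "—"]},
    {"tags": [":"], "variants": ["...", "…"]},
]
# Paired variants: left and right tokens must be swapped together.
PAIRED_VARIANTS = [
    {"tags": ["``", "''"], "variants": [("``", "''"), ('"', '"'), ("“", "”")]},
]


def add_orth_variants(words, tags, rng):
    """Return a copy of `words` with punctuation replaced by a random variant.

    Tokens are filtered by both orth and tag, so only tokens that already
    belong to a variant group and carry the expected tag are touched.
    """
    words = list(words)
    for group in SINGLE_VARIANTS:
        choice = rng.choice(group["variants"])
        for i, (word, tag) in enumerate(zip(words, tags)):
            if tag in group["tags"] and word in group["variants"]:
                words[i] = choice
    for group in PAIRED_VARIANTS:
        left_tag, right_tag = group["tags"]
        lefts = {pair[0] for pair in group["variants"]}
        rights = {pair[1] for pair in group["variants"]}
        # Pick one pair per document so left and right quotes stay consistent.
        left, right = rng.choice(group["variants"])
        for i, (word, tag) in enumerate(zip(words, tags)):
            if tag == left_tag and word in lefts:
                words[i] = left
            elif tag == right_tag and word in rights:
                words[i] = right
    return words


example = add_orth_variants(
    ['"', "Hello", "-", "world", '"'],
    ["``", "UH", ":", "NN", "''"],
    random.Random(0),
)
```

Choosing the paired variant once per document, rather than once per token, is what keeps opening and closing quotes matched.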
Björn Böing bae0455f91 Fix visualizer options linking for displaCy. (#4202) 2019-08-27 14:04:28 +02:00
Ines Montani 8114933f01 Fix universe.json [ci skip] 2019-08-27 12:13:42 +02:00
Ines Montani 48385552c6 Update languages.json [ci skip] 2019-08-27 11:52:51 +02:00
Ines Montani f4012ba054 Update README.md [ci skip] 2019-08-26 12:32:52 +02:00
Matthew Honnibal af7fad2c6d Set version to v2.2.0.dev1 2019-08-25 22:05:47 +02:00
Matthew Honnibal 71c0321ecf Fix test 2019-08-25 22:03:37 +02:00
Matthew Honnibal 188a1cf297 Fix morphology for | features 2019-08-25 21:57:02 +02:00
Matthew Honnibal 095c63c6b8 Avoid making prepositions get the tag SCONJ 2019-08-25 21:56:47 +02:00
Matthew Honnibal 22250cf6b7 Make regression test less sensitive to tag-map stuff 2019-08-25 21:54:26 +02:00
Matthew Honnibal 4e2f07a655 Merge branch 'develop' into feature/lemmatizer 2019-08-25 21:03:25 +02:00
yanaiela 5d7bc26735 new universe project - the numeric fused-head (#4192)
* new universe project

* Update website/meta/universe.json

Co-Authored-By: Ines Montani <ines@ines.io>

* Update website/meta/universe.json

Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-25 17:25:28 +02:00
Matthew Honnibal 9b5c94fed9 Add get-version script 2019-08-25 15:12:36 +02:00
Matthew Honnibal 7bc68913e3 Improve pex building in Makefile 2019-08-25 14:54:19 +02:00
Matthew Honnibal b8edc8dffb Require thinc 7.1 2019-08-25 14:54:09 +02:00
Matthew Honnibal c308cf3e3e Merge branch 'master' into feature/lemmatizer 2019-08-25 13:52:27 +02:00
Matthew Honnibal f9075a6fd1 Update to blis 0.4 and thinc 7.1 2019-08-25 13:50:47 +02:00
Matthew Honnibal 08e8267a59 Set version to 2.2.0.dev0 2019-08-25 13:50:00 +02:00
Wannaphong Phatthiyaphaibun d53c3fcbc1 Add Thai Language tokenizers (#4191)
Add th (pythainlp)
2019-08-25 11:35:21 +02:00
Christos Aridas 61f5c007a0 DOC Fix pipeline functions examples (#4189) 2019-08-23 19:15:32 +02:00
Matthew Honnibal bb911e5f4e Fix #3830: 'subtok' label being added even if learn_tokens=False (#4188)
* Prevent subtok label if not learning tokens

The parser introduces the subtok label to mark tokens that should be
merged during post-processing. Previously this happened even if we did
not have the --learn-tokens flag set. This patch passes the config
through to the parser, to prevent the problem.

* Make merge_subtokens a parser post-process if learn_subtokens

* Fix train script

* Add test for 3830: subtok problem

* Fix handling of non-subtok in parser training
2019-08-23 17:54:00 +02:00
Sofie Van Landeghem c417c380e3 Matcher ID fixes (#4179)
* allow phrasematcher to link one match to multiple original patterns

* small fix for defining ent_id in the matcher (anti-ghost prevention)

* cleanup

* formatting
2019-08-22 17:17:07 +02:00
Ines Montani f5d3afb1a3 Fix typo in docstrings [ci skip] 2019-08-22 16:24:15 +02:00
Ines Montani 5ca7dd0f94 💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance
2019-08-22 14:21:32 +02:00
Sofie Van Landeghem 73b38c33e4 Small retokenizer fix (#4174) 2019-08-22 12:23:54 +02:00
Ines Montani a8752a569d Auto-format [ci skip] 2019-08-22 11:44:39 +02:00
Pavle Vidanović 60e10a9f93 Serbian language improvement (#4169)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix

* Tokenizer exceptions added. Init file updated.

* Norm exceptions and lexical attributes added.

* Examples added.

* Tests added.

* sr_lang examples update.

* Tokenizer exceptions updated. (Serbian)
2019-08-22 11:43:07 +02:00
Sofie Van Landeghem de272f8b82 adding double match for optional operator at the end (#4166) 2019-08-21 22:46:56 +02:00
Sofie Van Landeghem 01c5980187 Serialize POS attribute when doc.is_tagged (#4092)
* fix and unit test for issue 3959

* additional unit test for manifestation of the same (resolved) bug
2019-08-21 21:59:30 +02:00
Sofie Van Landeghem 7539a4f3a8 use states[q] in while retry loop (#4162) 2019-08-21 21:58:04 +02:00
Ines Montani b072c13017 Update universe with videos [ci skip] 2019-08-21 21:35:37 +02:00
adrianeboyd 2d17b047e2 Check for is_tagged/is_parsed for Matcher attrs (#4163)
Check for relevant components in the pipeline when Matcher is called,
similar to the checks for PhraseMatcher in #4105.

* keep track of attributes seen in patterns

* when Matcher is called on a Doc, check for is_tagged for LEMMA, TAG,
POS and for is_parsed for DEP
2019-08-21 20:52:36 +02:00
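The check described in the commit above (track attributes seen in patterns, then verify the Doc's flags when the matcher is called) could look roughly like this. It is a standalone sketch under stated assumptions: `DocFlags`, `collect_seen_attrs`, and `check_doc` are hypothetical stand-ins and do not reproduce spaCy's Matcher internals or its error messages.

```python
from dataclasses import dataclass


@dataclass
class DocFlags:
    """Hypothetical stand-in for the two Doc flags named in the commit."""
    is_tagged: bool = False
    is_parsed: bool = False


# Which flag each pattern attribute depends on, per the commit message:
# LEMMA, TAG and POS need is_tagged; DEP needs is_parsed.
REQUIRED_FLAG = {
    "LEMMA": "is_tagged",
    "TAG": "is_tagged",
    "POS": "is_tagged",
    "DEP": "is_parsed",
}


def collect_seen_attrs(patterns):
    """Collect every top-level attribute used in any token pattern."""
    return {attr for pattern in patterns for token in pattern for attr in token}


def check_doc(doc, seen_attrs):
    """Raise ValueError if a seen attribute needs a flag the doc lacks."""
    for attr in sorted(seen_attrs):
        flag = REQUIRED_FLAG.get(attr)
        if flag is not None and not getattr(doc, flag):
            raise ValueError(
                f"Pattern uses {attr}, but the doc has {flag}=False"
            )


patterns = [[{"LOWER": "a"}, {"POS": "NOUN"}]]
seen = collect_seen_attrs(patterns)
check_doc(DocFlags(is_tagged=True), seen)  # passes: POS only needs is_tagged
```

Recording the attributes when patterns are added means the per-call check only has to look at a small set, not re-walk every pattern.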
Pavle Vidanović 4fe9329bfb Serbian language code update "rs" -> "sr" (#4159)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix
2019-08-21 19:57:37 +02:00
Matthew Honnibal bcd08f20af Merge changes from master 2019-08-21 14:18:52 +02:00
adrianeboyd 8fe7bdd0fa Improve token pattern checking without validation (#4105)
* Fix typo in rule-based matching docs

* Improve token pattern checking without validation

Add more detailed token pattern checks without full JSON pattern validation and
provide more detailed error messages.

Addresses #4070 (also related: #4063, #4100).

* Check whether top-level attributes in patterns and attr for PhraseMatcher are
  in token pattern schema

* Check whether attribute value types are supported in general (as opposed to
  per attribute with full validation)

* Report various internal error types (OverflowError, AttributeError, KeyError)
  as ValueError with standard error messages

* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS,
  LEMMA, and DEP

* Add error messages with relevant details on how to use validate=True or nlp()
  instead of nlp.make_doc()

* Support attr=TEXT for PhraseMatcher

* Add NORM to schema

* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler

* Remove unnecessary .keys()

* Rephrase error messages

* Add another type check to Matcher

Add another type check to Matcher for more understandable error messages
in some rare cases.

* Support phrase_matcher_attr=TEXT for EntityRuler

* Don't use spacy.errors in examples and bin scripts

* Fix error code

* Auto-format

Also try to get Azure pipelines to finally start a build :(

* Update errors.py


Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2019-08-21 14:00:37 +02:00
Ines Montani 3134a9b6e0 Add section on expanding regex match to token boundaries (see #4158) [ci skip] 2019-08-21 12:53:31 +02:00
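Expanding a regex match to token boundaries, the topic of the docs section added in the commit above, can be illustrated without any library. The sketch below assumes a plain whitespace tokenizer as a stand-in for a real one; `token_offsets` and `expand_to_tokens` are hypothetical helper names, not functions from the documentation.

```python
import re


def token_offsets(text):
    """Character (start, end) offsets of whitespace-delimited tokens.

    Whitespace splitting is an assumption made for this sketch; a real
    tokenizer would produce different boundaries.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def expand_to_tokens(span, offsets):
    """Widen a character span so it starts and ends on token boundaries.

    Returns None if the span does not overlap any token.
    """
    start, end = span
    covered = [(s, e) for s, e in offsets if s < end and e > start]
    if not covered:
        return None
    return covered[0][0], covered[-1][1]


text = "The unbelievable story"
match = re.search(r"believ", text)
expanded = expand_to_tokens(match.span(), token_offsets(text))
```

Here the regex matches only part of a word, and the expanded span covers the whole token that contains it.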
Ines Montani f580302673 Tidy up and auto-format 2019-08-20 17:36:34 +02:00
Ines Montani 364aaf5bc2 Simplify test 2019-08-20 16:41:58 +02:00
Sofie Van Landeghem 68ee0384fd Unit test for Issue 3879 (#4153)
* failing unit test for Issue #3879

* mark test as failing
2019-08-20 16:40:25 +02:00
Ines Montani 86cd7f0efd Add regression test for #4120 2019-08-20 16:33:09 +02:00
Ines Montani 104125edd2 Tidy up errors 2019-08-20 16:03:45 +02:00
Ines Montani cc76a26fe8 Raise error for negative arc indices (closes #3917) 2019-08-20 15:51:37 +02:00
Ines Montani 69e70ffae1 Merge branch 'master' of https://github.com/explosion/spaCy 2019-08-20 15:09:52 +02:00