Commit Graph

2061 Commits

Author SHA1 Message Date
Ines Montani febf71af28 Fix test 2020-12-09 11:23:07 +11:00
Sofie Van Landeghem cfc72c2995
Bugfix multi-label textcat reproducibility (#6481)
* add test for multi-label textcat reproducibility

* remove positive_label

* fix lengths dtype

* fix comments

* remove comment that we should not have forgotten :-)
2020-12-09 06:29:15 +08:00
Sofie Van Landeghem 2c27093c5f
require_cpu functionality (#6336)
* add require_cpu from Thinc 8.0.0rc2

* add docs

* fix test if cupy is not installed
2020-12-08 14:42:40 +08:00
Sofie Van Landeghem f98a04434a
pretrain architectures (#6451)
* define new architectures for the pretraining objective

* add loss function as attr of the omdel

* cleanup

* cleanup

* shorten name

* fix typo

* remove unused error
2020-12-08 14:41:03 +08:00
Adriane Boyd 29b058ebdc
Fix spacy when retokenizing cases with affixes (#6475)
Preserve `token.spacy` corresponding to the span end token in the
original doc rather than adjusting for the current offset.

* If not modifying in place, this checks in the original document
(`doc.c` rather than `tokens`).
* If modifying in place, the document has not been modified past the
current span start position so the value at the current span end
position is valid.
2020-12-08 14:25:56 +08:00
Adriane Boyd 4448680750
Fix alignment for 1-to-1 tokens and lowercasing (#6476)
* When checking for token alignments, check not only that the tokens are
identical but that the character positions are both at the start of a
token.

  It's possible for the tokens to be identical even though the two
tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs.
`["a", "''", "'"]`, where the middle tokens are identical but should not
be aligned on the token level at character position 2 since it's the
start of one token but the middle of another.

* Use the lowercased version of the token texts to create the
character-to-token alignment because lowercasing can change the string
length (e.g., for `İ`, see the not-a-bug bug report:
https://bugs.python.org/issue34723)
2020-12-08 14:25:16 +08:00
Sofie Van Landeghem d6c616a125
Fixes in test suite (#6457)
* fix slow test for textcat readers

* cleanup test_issue5551

* add explicit score weight

* cleanup
2020-12-02 12:57:08 +01:00
Adriane Boyd a7e7d6c6c9
Ignore misaligned in Morphologizer.get_loss (#6363)
Fix bug where `Morphologizer.get_loss` treated misaligned annotation as
`EMPTY_MORPH` rather than ignoring it. Remove unneeded default `EMPTY_MORPH`
mappings.
2020-11-10 20:15:09 +08:00
Ines Montani 363ac73c72 Update docs [ci skip] 2020-11-09 12:43:26 +08:00
Adriane Boyd 1c4df8fd09
Replace pytokenizations with internal alignment (#6293)
* Replace pytokenizations with internal alignment

Replace pytokenizations with internal alignment algorithm that is
restricted to only allow differences in whitespace and capitalization.

* Rename `spacy.training.align` to `spacy.training.alignment` to contain
the `Alignment` dataclass
* Implement `get_alignments` in `spacy.training.align`

* Refactor trailing whitespace handling

* Remove unnecessary exception for empty docs

Allow a non-empty whitespace-only doc to be aligned with an empty doc

* Remove empty docs exceptions completely
2020-11-03 16:24:38 +01:00
Adriane Boyd a4b32b9552
Handle missing reference values in scorer (#6286)
* Handle missing reference values in scorer

Handle missing values in reference doc during scoring where it is
possible to detect an unset state for the attribute. If no reference
docs contain annotation, `None` is returned instead of a score. `spacy
evaluate` displays `-` for missing scores and the missing scores are
saved as `None`/`null` in the metrics.

Attributes without unset states:

* `token.head`: relies on `token.dep` to recognize unset values
* `doc.cats`: unable to handle missing annotation

Additional changes:

* add optional `has_annotation` check to `score_scans` to replace
`doc.sents` hack
* update `score_token_attr_per_feat` to handle missing and empty morph
representations
* fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START`
vs. `SENT_START`

* Fix import

* Update return types
2020-11-03 15:47:18 +01:00
Adriane Boyd 5d2cb86c34
Fix on_match callback for DependencyMatcher (#6313)
Fix `DependencyMatcher` so that the callback is called only once per
match.
2020-10-31 12:20:27 +01:00
Sofie Van Landeghem 75a202ce65
TextCat updates and fixes (#6263)
* small fix in example imports

* throw error when train_corpus or dev_corpus is not a string

* small fix in custom logger example

* limit macro_auc to labels with 2 annotations

* fix typo

* also create parents of output_dir if need be

* update documentation of textcat scores

* refactor TextCatEnsemble

* fix tests for new AUC definition

* bump to 3.0.0a42

* update docs

* rename to spacy.TextCatEnsemble.v2

* spacy.TextCatEnsemble.v1 in legacy

* cleanup

* small fix

* update to 3.0.0rc2

* fix import that got lost in merge

* cursed IDE

* fix two typos
2020-10-18 14:50:41 +02:00
Jan Margeta 1ad2213349 Fix TokenPatternSchema pattern field validation
Empty pattern field should be considered invalid

This is fixed by replacing minItems with min_items
as described in Pydantic docs:
https://pydantic-docs.helpmanual.io/usage/schema/
2020-10-16 00:41:21 +02:00
Ines Montani b1d568a4df Tidy up tests 2020-10-15 10:20:21 +02:00
Ines Montani d165af26be Auto-format [ci skip] 2020-10-15 10:08:53 +02:00
Ines Montani 5d62499266 Fix tests 2020-10-15 09:29:15 +02:00
Ines Montani 178760855f Merge branch 'develop' into master-tmp 2020-10-15 09:06:03 +02:00
svlandeg 0796401c19 call NumpyOps instead of get_current_ops() 2020-10-14 16:55:00 +02:00
svlandeg e94a21638e adding tests for trained models to ensure predict reproducibility 2020-10-13 21:07:13 +02:00
svlandeg ede979d42f formattting 2020-10-13 18:53:17 +02:00
svlandeg ff83bfae3f naming 2020-10-13 18:52:37 +02:00
svlandeg 6ccacff54e add tests for individual spacy layers 2020-10-13 18:50:07 +02:00
svlandeg c23041ae60 component tests single or multiple prediction 2020-10-13 16:26:53 +02:00
svlandeg 3a505e7e14 small edit to ensure the new word was indeed new 2020-10-10 21:05:28 +02:00
svlandeg 68d79796c6 add test for vocab after serializing KB 2020-10-10 20:59:48 +02:00
Ines Montani 539b0c10da Tidy up and auto-format 2020-10-10 19:14:48 +02:00
Ines Montani bfa3931c9d
Revert added_strings change (#6236) 2020-10-10 18:55:07 +02:00
Ines Montani 525f798841 Fix typo in test 2020-10-09 18:00:21 +02:00
Ines Montani b7cb9d95e4
Merge pull request #6229 from svlandeg/bugfix/disabled 2020-10-09 16:05:11 +02:00
Ines Montani 9fb3244672
Merge pull request #6231 from adrianeboyd/feature/include-static-vectors 2020-10-09 15:54:52 +02:00
Adriane Boyd 727370c633 Remove Span._recalculate_indices
Remove `Span._recalculate_indices`, which is a remnant from the
deprecated `Span.merge`.
2020-10-09 14:42:51 +02:00
svlandeg 06b9d213fd formatting 2020-10-09 12:19:47 +02:00
Ines Montani 4771a10503 Make test more explicit [ci skip] 2020-10-09 12:15:26 +02:00
Ines Montani cc3646b06c Add xfailing test for peculiar spans failure [ci skip] 2020-10-09 12:10:25 +02:00
svlandeg 8316bc7d4a bugfix DisabledPipes 2020-10-09 12:06:20 +02:00
Adriane Boyd 39aabf50ab Also rename to include_static_vectors in CharEmbed 2020-10-09 11:54:48 +02:00
Florijan Stamenković 18f5c309dc Fix Issue 6207 (#6208)
* Regression test for issue 6207

* Fix issue 6207

* Sign contributor agreement

* Minor adjustments to test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-09 10:14:40 +02:00
Duygu Altinok 80fb1bffc9 Ordinal numbers for Turkish (#6142)
* minor ordinal number addition

* fixed typo

* added corresponding lexical test
2020-10-09 10:13:15 +02:00
Duygu Altinok 2fad279a44 Turkish language syntax iterators (#6191)
* added tr_vocab to config

* basic test

* added syntax iterator to Turkish lang class

* first version for Turkish syntax iter, without flat

* added simple tests with nmod, amod, det

* more tests to amod and nmod

* separated noun chunks and parser test

* rearrangement after nchunk parser separation

* added recursive NPs

* tests with complicated recursive NPs

* tests with conjed NPs

* additional tests for conj NP

* small modification for shaving off conj from NP

* added tests with flat

* more tests with flat

* added examples with flats conjed

* added inner func for flat trick

* corrected parse

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-09 10:10:22 +02:00
Sofie Van Landeghem d093d6343b
TrainablePipe (#6213)
* rename Pipe to TrainablePipe

* split functionality between Pipe and TrainablePipe

* remove unnecessary methods from certain components

* cleanup

* hasattr(component, "pipe") should be sufficient again

* remove serialization and vocab/cfg from Pipe

* unify _ensure_examples and validate_examples

* small fixes

* hasattr checks for self.cfg and self.vocab

* make is_resizable and is_trainable properties

* serialize strings.json instead of vocab

* fix KB IO + tests

* fix typos

* more typos

* _added_strings as a set

* few more tests specifically for _added_strings field

* bump to 3.0.0a36
2020-10-08 21:33:49 +02:00
Ines Montani 8ff73f04db Fix morph in Doc.to_json 2020-10-08 14:44:35 +02:00
Ines Montani 064575d79d
Merge pull request #6216 from svlandeg/feature/nel-initialize 2020-10-08 11:14:12 +02:00
svlandeg eaf5c265cb set_kb method for entity_linker 2020-10-08 10:34:01 +02:00
Ines Montani 010956d493 Clear rule-based components on initialize 2020-10-08 09:51:31 +02:00
Sofie Van Landeghem 2998131416
Reproducibility for TextCat and Tok2Vec (#6218)
* ensure fixed seed in HashEmbed layers

* forgot about the joys of python 2
2020-10-08 00:43:46 +02:00
svlandeg efedccea8d fix tests 2020-10-07 15:29:52 +02:00
svlandeg 6b8bdb2d39 add init_config to nlp.create_pipe 2020-10-07 14:58:16 +02:00
Duygu Altinok 7e821c2776
Turkish language syntax iterators (#6191)
* added tr_vocab to config

* basic test

* added syntax iterator to Turkish lang class

* first version for Turkish syntax iter, without flat

* added simple tests with nmod, amod, det

* more tests to amod and nmod

* separated noun chunks and parser test

* rearrangement after nchunk parser separation

* added recursive NPs

* tests with complicated recursive NPs

* tests with conjed NPs

* additional tests for conj NP

* small modification for shaving off conj from NP

* added tests with flat

* more tests with flat

* added examples with flats conjed

* added inner func for flat trick

* corrected parse

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-07 11:07:52 +02:00
Duygu Altinok b95a11dd95
Ordinal numbers for Turkish (#6142)
* minor ordinal number addition

* fixed typo

* added corresponding lexical test
2020-10-07 10:25:37 +02:00