Commit Graph

265 Commits

Author SHA1 Message Date
Ines Montani be99f1e4de
Remove output dirs before training (#6204)
* Remove output dirs before training

* Re-raise error if cleaning fails
2020-10-05 20:11:16 +02:00
svlandeg fd2d48556c fix E902 and E903 numbering 2020-10-05 13:43:32 +02:00
Ines Montani d38dc466c5 Adjust error [ci skip] 2020-10-04 15:26:01 +02:00
Ines Montani bcd52e5486 Tidy up errors and warnings 2020-10-04 11:16:31 +02:00
Ines Montani d3b3663942 Adjust error message and add test 2020-10-04 10:11:27 +02:00
Ines Montani cc08c88a89
Merge pull request #6187 from svlandeg/fix/begin_training_pipe 2020-10-04 10:01:02 +02:00
svlandeg 3f657ed3a1 implement warning in __init_subclass__ instead 2020-10-03 22:34:10 +02:00
Ines Montani dd542ec6a4
Fix label initialization of textcat component (#6190) 2020-10-03 17:07:38 +02:00
svlandeg fb48de349c bwd compat for pipe.begin_training 2020-10-02 20:31:14 +02:00
Sofie Van Landeghem 09dcb75076
small UX fix for DocBin (#6167)
* add informative warning when messing up store_user_data DocBin flags

* add informative warning when messing up store_user_data DocBin flags

* cleanup test

* rename to patterns_path
2020-10-02 15:43:32 +02:00
Ines Montani f0b30aedad
Make lemmatizers use initialize logic (#6182)
* Make lemmatizer use initialize logic and tidy up

* Fix typo

* Raise for uninitialized tables
2020-10-02 15:42:36 +02:00
Ines Montani 01c1538c72 Integrate file readers 2020-10-02 01:36:06 +02:00
Adriane Boyd 86c3ec9c2b
Refactor Token morph setting (#6175)
* Refactor Token morph setting

* Remove `Token.morph_`
* Add `Token.set_morph()`
  * `0` resets `token.c.morph` to unset
  * Any other values are passed to `Morphology.add`

* Add token.morph setter to set from MorphAnalysis
2020-10-01 22:21:46 +02:00
Ines Montani 381258b75b
Merge pull request #6165 from explosion/feature/update-tokenizers-initialize 2020-10-01 09:49:47 +02:00
Ines Montani 6f29f68f69 Update errors and make Tokenizer.initialize args less strict 2020-09-30 23:48:47 +02:00
Ines Montani a103ab5f1a Update augmenter lookups and docs 2020-09-30 23:03:47 +02:00
Adriane Boyd 6b7bb32834 Refactor Chinese initialization 2020-09-30 11:46:45 +02:00
Ines Montani 1aeef3bfbb Make corpus paths default to None and improve errors 2020-09-29 22:33:46 +02:00
Ines Montani 78021089f9
Merge pull request #6160 from explosion/feature/prepare 2020-09-29 20:55:13 +02:00
Ines Montani ff9a63bfbd begin_training -> initialize 2020-09-28 21:35:09 +02:00
Adriane Boyd 11e195d3ed Update ChineseTokenizer
* Allow `pkuseg_model` to be set to `None` on initialization
* Don't save config within tokenizer
* Force convert pkuseg_model to use pickle protocol 4 by reencoding with
`pickle5` on serialization
* Update pkuseg serialization test
2020-09-27 14:00:18 +02:00
Sofie Van Landeghem 009ba14aaf
Fix pretraining in train script (#6143)
* update pretraining API in train CLI

* bump thinc to 8.0.0a35

* bump to 3.0.0a26

* doc fixes

* small doc fix
2020-09-25 15:47:10 +02:00
Adriane Boyd 59340606b7
Add option to disable Matcher errors (#6125)
* Add option to disable Matcher errors

* Add option to disable Matcher errors when a doc doesn't contain a
particular type of annotation

Minor additional change:

* Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH`
values

* Rename suppress_errors to allow_missing

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

* Refactor annotation checks in Matcher and PhraseMatcher

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-24 16:54:39 +02:00
Sofie Van Landeghem c7eedd3534
updates to NEL functionality (#6132)
* NEL: read sentences and ents from reference

* fiddling with sent_start annotations

* add KB serialization test

* KB write additional file with strings.json

* score_links function to calculate NEL P/R/F

* formatting

* documentation
2020-09-24 16:53:59 +02:00
Ines Montani 58dde293ce
Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2 2020-09-24 14:44:42 +02:00
Ines Montani 92f8b6959a Fix typo 2020-09-24 13:48:41 +02:00
Adriane Boyd 5c13e0cf1b Remove unused error 2020-09-24 13:41:55 +02:00
Ines Montani be56c0994b Add [training.before_to_disk] callback 2020-09-24 12:40:25 +02:00
Ines Montani f69fea8b25 Improve error handling around non-number scores 2020-09-24 11:29:07 +02:00
Ines Montani 4eb39b5c43 Fix logging 2020-09-24 11:04:35 +02:00
svlandeg 25b34bba94 throw custom error when state_type is invalid 2020-09-23 16:57:14 +02:00
Adriane Boyd b1a7d6c528 Refactor seen token detection 2020-09-22 14:42:51 +02:00
Adriane Boyd 535842e483
Merge branch 'develop' into feature/doc-ents-v3-2 2020-09-22 13:45:50 +02:00
svlandeg b556a10808 rename converts in_to_out 2020-09-22 11:50:19 +02:00
Ines Montani 49e80dbcac
Merge pull request #6103 from explosion/chore/tidy-up-tests-docs-get-doc 2020-09-22 09:45:04 +02:00
Ines Montani 81606b29bd
Merge pull request #6104 from svlandeg/fix/debug_model [ci skip] 2020-09-22 09:31:23 +02:00
Ines Montani 67fbcb3da5 Tidy up tests and docs 2020-09-21 20:43:54 +02:00
Adriane Boyd 177df15d89 Implement Doc.set_ents 2020-09-21 15:54:05 +02:00
svlandeg eb9b447960 Merge remote-tracking branch 'upstream/develop' into fix/debug_model
# Conflicts:
#	spacy/cli/debug_model.py
2020-09-21 14:05:16 +02:00
Adriane Boyd bc02e86494 Extend Doc.__init__ with additional annotation
Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to
`Doc.__init__` to initialize the most common doc/token values.
2020-09-21 13:36:24 +02:00
svlandeg 73ff52b9ec hack for tok2vec listener 2020-09-18 16:43:15 +02:00
Adriane Boyd a88106e852
Remove W106: HEAD and SENT_START in doc.from_array (#6086)
* Remove W106: HEAD and SENT_START in doc.from_array

This warning was hacky and being triggered too often.

* Fix test
2020-09-18 03:01:29 +02:00
Adriane Boyd 7e4cd7575c
Refactor Docs.is_ flags (#6044)
* Refactor Docs.is_ flags

* Add derived `Doc.has_annotation` method

  * `Doc.has_annotation(attr)` returns `True` for partial annotation

  * `Doc.has_annotation(attr, require_complete=True)` returns `True` for
    complete annotation

* Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced`
and `is_nered`

* Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs
for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The
list is the `DocBin` attributes list plus `SPACY` and `LENGTH`.

Notes on `Doc.has_annotation`:

* `HEAD` is converted to `DEP` because heads don't have an unset state

* Accept `IS_SENT_START` as a synonym of `SENT_START`

Additional changes:

* Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for
`DocBin`

* In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override
`SENT_START`

* In `Doc.from_array()` using `attrs` other than
`Doc._get_array_attrs()` (i.e., a user's custom list rather than our
default internal list) with both `HEAD` and `SENT_START` shows a warning
that `HEAD` will override `SENT_START`

* `set_children_from_heads` does not require dependency labels to set
sentence boundaries and sets `sent_start` for all non-sentence starts to
`-1`

* Fix call to set_children_form_heads

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-17 00:14:01 +02:00
Ines Montani aaf01689a1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-15 14:24:42 +02:00
Ines Montani d3d7f92f05 Fix lang check and error handling in Language.from_config 2020-09-15 14:24:06 +02:00
Ines Montani 253ba5ef14 Raise for bad Vocab values 2020-09-15 13:25:34 +02:00
Sofie Van Landeghem 3216a33149
positive_label config for textcat (#6062)
* hook up positive_label in textcat

* unit tests

* documentation

* formatting

* tests

* fix typo

* move verify_config to after begin_training

* revert accidential commit
2020-09-14 17:08:00 +02:00
Adriane Boyd ab270364f1
Modify Token.morph to enable unsetting (#6043)
Modify `Token.morph` property so that `Token.c.morph` can be reset back
to an internal value of `0`. Allow setting `Token.morph` from a hash as
long as the morph string is already in the `StringStore`, setting it
indirectly through `Token.morph_` so that the value is added to the
morphology. If the hash is not in the `StringStore`, raise an error.
2020-09-13 14:06:07 +02:00
Sofie Van Landeghem 8e7557656f
Renaming gold & annotation_setter (#6042)
* version bump to 3.0.0a16

* rename "gold" folder to "training"

* rename 'annotation_setter' to 'set_extra_annotations'

* formatting
2020-09-09 10:31:03 +02:00
Sofie Van Landeghem 60f22e1800
Pipe API (#6034)
* ensure Language passes on valid examples for initialization

* fix tagger model initialization

* check for valid get_examples across components

* assume labels were added before begin_training

* fix senter initialization

* fix morphologizer initialization

* use methods to check arguments

* test textcat init, requires thinc>=8.0.0a31

* fix tok2vec init

* fix entity linker init

* use islice

* fix simple NER

* cleanup debug model

* fix assert statements

* fix tests

* throw error when adding a label if the output layer can't be resized anymore

* fix test

* add failing test for simple_ner

* UX improvements

* morphologizer UX

* assume begin_training gets a representative set and processes the labels

* remove assumptions for output of untrained NER model

* restore test for original purpose
2020-09-08 22:44:25 +02:00