8105 Commits

Author SHA1 Message Date
Ines Montani d5155376fd Update vocab init 2020-09-28 11:30:18 +02:00
Ines Montani 8b74fd19df init pipeline -> init nlp 2020-09-28 11:13:38 +02:00
Ines Montani 2fdb7285a0 Update CLI 2020-09-28 11:06:07 +02:00
Ines Montani 553bfea641 Fix commands 2020-09-28 10:53:17 +02:00
Matthew Honnibal 44bad1474c Add init_pipeline file 2020-09-28 09:47:34 +02:00
Matthew Honnibal 65448b2e34 Remove schema=None until Optional 2020-09-28 03:42:58 +02:00
Matthew Honnibal b886f53c31 init-pipeline runs (maybe doesn't work) 2020-09-28 03:42:47 +02:00
Matthew Honnibal ed2aff2db3 Remove unused train code 2020-09-28 03:12:31 +02:00
Matthew Honnibal 3a0a3b8db6 Don't hard-code for 'corpora' name 2020-09-28 03:06:33 +02:00
Matthew Honnibal a023cf3ecc Add (untested) resolve_dot_names util 2020-09-28 03:06:12 +02:00
Matthew Honnibal a976da168c Support data augmentation in Corpus (#6155)
* Support data augmentation in Corpus

* Note initial docs for data augmentation

* Add augmenter to quickstart

* Fix flake8

* Format

* Fix test

* Update spacy/tests/training/test_training.py

* Improve data augmentation arguments

* Update templates

* Move randomization out into caller

* Refactor

* Update spacy/training/augment.py

* Update spacy/tests/training/test_training.py

* Fix augment

* Fix test
2020-09-28 03:03:27 +02:00
Matthew Honnibal 13b1605ee6 Add init script 2020-09-28 01:08:49 +02:00
Matthew Honnibal a3e1791c9c Upd train 2020-09-28 01:08:30 +02:00
Matthew Honnibal b5556093e2 Start updating train script 2020-09-27 23:59:44 +02:00
Ines Montani 9016d23cc5 Fix exclude and add test 2020-09-27 23:34:03 +02:00
Ines Montani 658fad428a Fix base schema integration 2020-09-27 22:50:36 +02:00
Ines Montani e04bd16f7f Merge branch 'develop' into feature/new-thinc-config-resolution 2020-09-27 22:34:46 +02:00
Ines Montani d7ad65a9bb Fix handling of error description [ci skip] 2020-09-27 22:31:57 +02:00
Ines Montani 7e938ed63e Update config resolution to use new Thinc 2020-09-27 22:21:31 +02:00
Adriane Boyd 013b66de05 Add tokenizer scoring to ja / ko / zh (#6152) 2020-09-27 22:20:45 +02:00
Adriane Boyd a6548ead17 Add _ as a symbol (#6153)
* Add _ to StringStore in Morphology

* Add _ as a symbol

Add `_` as a symbol instead of adding to the `StringStore`.
2020-09-27 22:20:14 +02:00
Matthew Honnibal 39b178999c Tmp notes 2020-09-27 20:13:38 +02:00
Adriane Boyd 8393dbedad Minor fixes
* Put `cfg` back in serialization
* Add `pickle5` to pytest conf
2020-09-27 15:15:53 +02:00
Adriane Boyd 54fe871935 Fix formatting, refactor pickle5 exceptions 2020-09-27 14:37:28 +02:00
Adriane Boyd 11e195d3ed Update ChineseTokenizer
* Allow `pkuseg_model` to be set to `None` on initialization
* Don't save config within tokenizer
* Force convert pkuseg_model to use pickle protocol 4 by reencoding with
`pickle5` on serialization
* Update pkuseg serialization test
2020-09-27 14:00:18 +02:00
Ines Montani b4486d747d Merge branch 'develop' into fix/train-config-interpolation 2020-09-26 15:32:14 +02:00
Ines Montani 8fea06d55e Merge pull request #6149 from adrianeboyd/feature/attributeruler-match-ids
Simplify string match IDs for AttributeRuler
2020-09-26 15:31:30 +02:00
Ines Montani b2d07de786 Construct nlp from uninterpolated config before training 2020-09-26 15:16:59 +02:00
Ines Montani ca3c997062 Improve CLI config validation with latest Thinc 2020-09-26 13:13:57 +02:00
Adriane Boyd 6c25e60089 Simplify string match IDs for AttributeRuler 2020-09-26 11:12:39 +02:00
Matthew Honnibal 702edf52a0 Fix attributeruler 2020-09-26 00:30:48 +02:00
Matthew Honnibal 821f37254c Fix attributeruler 2020-09-26 00:19:53 +02:00
Matthew Honnibal 98327f66a9 Fix attributeruler key 2020-09-25 23:20:50 +02:00
Matthew Honnibal 092ce4648e Make DocBin output stable data (set iteration) 2020-09-25 22:20:44 +02:00
Matthew Honnibal 26afd3bd90 Fix iteration order 2020-09-25 21:47:22 +02:00
Matthew Honnibal 3d8388969e Sort paths for cache consistency 2020-09-25 19:07:26 +02:00
Adriane Boyd c3b5a3cfff Clean up MorphAnalysisC struct (#6146) 2020-09-25 15:56:48 +02:00
Sofie Van Landeghem 009ba14aaf Fix pretraining in train script (#6143)
* update pretraining API in train CLI

* bump thinc to 8.0.0a35

* bump to 3.0.0a26

* doc fixes

* small doc fix
2020-09-25 15:47:10 +02:00
Adriane Boyd 50f20cf722 Revert changes to Scorer.score_spans 2020-09-25 08:21:47 +02:00
Matthew Honnibal 93d7ff309f Remove print 2020-09-24 21:05:27 +02:00
Matthew Honnibal 16475528f7 Fix skipped documents in entity scorer (#6137)
* Fix skipped documents in entity scorer

* Add back the skipping of unannotated entities

* Update spacy/scorer.py

* Use more specific NER scorer

* Fix import

* Fix get_ner_prf

* Add scorer

* Fix scorer

Co-authored-by: Ines Montani <ines@ines.io>
2020-09-24 20:38:57 +02:00
Matthew Honnibal 2abb4ba9db Make a pre-check to speed up alignment cache (#6139)
* Dirty trick to fast-track alignment cache

* Improve alignment cache check

* Fix header

* Fix align cache

* Fix align logic
2020-09-24 18:13:39 +02:00
Ines Montani 26e28ed413 Fix combined scores if multiple components report it 2020-09-24 17:11:13 +02:00
Ines Montani 0b52b6904c Update entity_linker.py 2020-09-24 17:10:35 +02:00
Ines Montani 20b89a9717 Increment version [ci skip] 2020-09-24 16:57:02 +02:00
Adriane Boyd 3c062b3911 Add MORPH handling to Matcher (#6107)
* Add MORPH handling to Matcher

* Add `MORPH` to `Matcher` schema
* Rename `_SetMemberPredicate` to `_SetPredicate`
* Add `ISSUBSET` and `ISSUPERSET` operators to `_SetPredicate`
  * Add special handling for normalization and conversion of morph
    values into sets
  * For other attrs, `ISSUBSET` acts like `IN` and `ISSUPERSET` only
    matches for 0 or 1 values

* Update test

* Rename to IS_SUBSET and IS_SUPERSET
2020-09-24 16:55:09 +02:00
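The set-operator semantics described in the commit above can be illustrated in plain Python. This is a minimal sketch, not spaCy's `_SetPredicate` implementation: the helper names `to_set`, `is_subset` and `is_superset`, and the way morph strings are split on `|`, are assumptions made for illustration.

```python
def to_set(value):
    """Normalize a token value into a set (hypothetical helper).

    Morph-style strings such as "Number=Sing|Case=Nom" become a set of
    features; any other single value becomes a one-element set.
    """
    if isinstance(value, str) and "=" in value:
        return set(value.split("|"))
    return {value}


def is_subset(token_value, pattern_values):
    # True when every feature on the token appears in the pattern.
    return to_set(token_value) <= set(pattern_values)


def is_superset(token_value, pattern_values):
    # True when the token carries every feature listed in the pattern.
    return to_set(token_value) >= set(pattern_values)


morph = "Number=Sing|Case=Nom"
print(is_superset(morph, ["Number=Sing"]))  # token has at least Number=Sing
print(is_subset(morph, ["Number=Sing"]))    # token also carries Case=Nom

# For a single-valued attribute, the subset test behaves like IN ...
print(is_subset("NOUN", ["NOUN", "PROPN"]))
# ... and the superset test can only succeed for patterns of 0 or 1 values.
print(is_superset("NOUN", ["NOUN", "PROPN"]))
```

Treating a single-valued attribute as a one-element set is what makes the subset test collapse to a membership check, which matches the behavior the commit message describes for non-morph attributes.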
Adriane Boyd 59340606b7 Add option to disable Matcher errors (#6125)
* Add option to disable Matcher errors

* Add option to disable Matcher errors when a doc doesn't contain a
particular type of annotation

Minor additional change:

* Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH`
values

* Rename suppress_errors to allow_missing

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

* Refactor annotation checks in Matcher and PhraseMatcher

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-24 16:54:39 +02:00
Sofie Van Landeghem c7eedd3534 updates to NEL functionality (#6132)
* NEL: read sentences and ents from reference

* fiddling with sent_start annotations

* add KB serialization test

* KB write additional file with strings.json

* score_links function to calculate NEL P/R/F

* formatting

* documentation
2020-09-24 16:53:59 +02:00
Ines Montani d0ef4a4cf5 Prevent division by zero in score weights 2020-09-24 16:42:13 +02:00
Matthew Honnibal 74ee456374 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-24 16:11:47 +02:00
Matthew Honnibal 0bc214c102 Fix pull 2020-09-24 16:11:33 +02:00
Ines Montani 3f751e68f5 Increment version [ci skip] 2020-09-24 14:45:41 +02:00
Ines Montani 58dde293ce Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2 2020-09-24 14:44:42 +02:00
Ines Montani 74e1f192b4 Merge pull request #6134 from explosion/feature/training_before_to_disk 2020-09-24 14:44:11 +02:00
Ines Montani 24e7ac3f2b Fix download CLI [ci skip] 2020-09-24 14:43:56 +02:00
Ines Montani 88e54caa12 accuracy -> performance 2020-09-24 14:32:35 +02:00
Ines Montani 92f8b6959a Fix typo 2020-09-24 13:48:41 +02:00
Adriane Boyd 5c13e0cf1b Remove unused error 2020-09-24 13:41:55 +02:00
Ines Montani be56c0994b Add [training.before_to_disk] callback 2020-09-24 12:40:25 +02:00
Adriane Boyd 8eaacaae97 Refactor Doc.ents setter to use Doc.set_ents
Additional changes:

* Entity spans with missing labels are ignored
* Fix ent_kb_id setting in `Doc.set_ents`
2020-09-24 12:36:51 +02:00
Ines Montani c6c67b606e Merge pull request #6133 from explosion/fix/score_weights 2020-09-24 12:00:57 +02:00
Ines Montani f69fea8b25 Improve error handling around non-number scores 2020-09-24 11:29:07 +02:00
Ines Montani 4eb39b5c43 Fix logging 2020-09-24 11:04:35 +02:00
Ines Montani 4bbe41f017 Fix combined scores and update test 2020-09-24 10:42:47 +02:00
Sofie Van Landeghem c645c4e7ce fix micro PRF for textcat (#6130)
* fix micro PRF for textcat

* small fix
2020-09-24 10:31:17 +02:00
Matthew Honnibal 17a6b0a173 Make project pull order insensitive (#6131) 2020-09-24 10:30:42 +02:00
Ines Montani ae51f580c1 Fix handling of score_weights 2020-09-24 10:27:33 +02:00
Ines Montani f25f05c503 Adjust sort order [ci skip] 2020-09-23 20:03:04 +02:00
Ines Montani 3f77eb749c Increment version [ci skip] 2020-09-23 19:50:15 +02:00
svlandeg b816ace4bb format 2020-09-23 17:33:13 +02:00
svlandeg 5a9fdbc8ad state_type as Literal 2020-09-23 17:32:14 +02:00
svlandeg 35dbc63578 Merge remote-tracking branch 'upstream/develop' into fix/nr_features
# Conflicts:
#	spacy/ml/models/parser.py
#	spacy/tests/serialize/test_serialize_config.py
#	website/docs/api/architectures.md
2020-09-23 17:01:13 +02:00
svlandeg 25b34bba94 throw custom error when state_type is invalid 2020-09-23 16:57:14 +02:00
Ines Montani 916050bf2f Merge pull request #6127 from explosion/feature/literal-nr_feature_tokens 2020-09-23 16:56:08 +02:00
Ines Montani 3c3863654e Increment version [ci skip] 2020-09-23 16:54:43 +02:00
svlandeg dd2292793f 'parser' instead of 'deps' for state_type 2020-09-23 16:53:49 +02:00
Ines Montani 50a4425cda Adjust docs 2020-09-23 16:03:32 +02:00
Ines Montani 76bbed3466 Use Literal type for nr_feature_tokens 2020-09-23 16:00:03 +02:00
Muhammad Fahmi Rasyid 7489d02dea Update Indonesian Example Phrases (#6124)
* create contributor agreement

* Update Indonesian example. (see #1107)

Update the Indonesian examples with more appropriate phrases. The current phrases contain sensitive and violent words.
2020-09-23 14:02:26 +02:00
svlandeg 6c85fab316 state_type and extra_state_tokens instead of nr_feature_tokens 2020-09-23 13:35:09 +02:00
Ines Montani 7745d77a38 Fix whitespace in template [ci skip] 2020-09-23 13:21:42 +02:00
svlandeg 6435458d51 simplify expression 2020-09-23 12:12:38 +02:00
svlandeg 20b0ec5dcf avoid logging performance of frozen components 2020-09-23 10:37:12 +02:00
Ines Montani ae5dacf75f Tidy up and add types 2020-09-23 10:14:34 +02:00
Ines Montani 6ca06cb62c Update docs and formatting [ci skip] 2020-09-23 10:14:27 +02:00
Ines Montani 888f936a73 Merge pull request #6106 from svlandeg/feature/textcat-quickstart 2020-09-23 10:11:45 +02:00
Ines Montani 60a317520a Merge pull request #6109 from svlandeg/feature/2rename 2020-09-23 09:47:12 +02:00
Ines Montani f976bab710 Remove empty file [ci skip] 2020-09-23 09:30:09 +02:00
svlandeg 556f3e4652 add pooling to NEL's TransformerListener 2020-09-23 09:24:28 +02:00
svlandeg 4a56ea72b5 fallbacks for old names 2020-09-23 09:15:07 +02:00
Sofie Van Landeghem 86a08f819d tok2vec.update instead of predict (#6113) 2020-09-22 21:54:52 +02:00
Adriane Boyd e4acb28658 Fix norm in retokenizer split (#6111)
Parallel to behavior in merge, reset norm on original token in
retokenizer split.
2020-09-22 21:53:33 +02:00
Sofie Van Landeghem e0e793be4d fix KB IO (#6118) 2020-09-22 21:53:06 +02:00
Adriane Boyd 9b4979407d Fix overlapping German noun chunks (#6112)
Add a similar fix as in #5470 to prevent the German noun chunks iterator
from producing overlapping spans.
2020-09-22 21:52:42 +02:00
Adriane Boyd b1a7d6c528 Refactor seen token detection 2020-09-22 14:42:51 +02:00
Sofie Van Landeghem d53c84b6d6 avoid None callback (#6100) 2020-09-22 13:54:44 +02:00
Adriane Boyd 535842e483 Merge branch 'develop' into feature/doc-ents-v3-2 2020-09-22 13:45:50 +02:00
Ines Montani 5e3b796b12 Validate section refs in debug config 2020-09-22 12:24:39 +02:00
svlandeg 085a1c8e2b add no_output_layer to TextCatBOW config 2020-09-22 12:06:40 +02:00
svlandeg e1b8090b9b few more fixes 2020-09-22 12:01:06 +02:00
svlandeg b556a10808 rename converts in_to_out 2020-09-22 11:50:19 +02:00
svlandeg e931f4d757 add textcat score 2020-09-22 10:56:43 +02:00
svlandeg 396b33257f add entity_linker to jinja template 2020-09-22 10:40:05 +02:00
Ines Montani db7126ead9 Increment version 2020-09-22 10:31:26 +02:00
svlandeg 135de82a2d add textcat to quickstart 2020-09-22 10:22:06 +02:00
Ines Montani 6316d5f398 Improve messages in project CLI [ci skip] 2020-09-22 09:45:34 +02:00
Ines Montani 49e80dbcac Merge pull request #6103 from explosion/chore/tidy-up-tests-docs-get-doc 2020-09-22 09:45:04 +02:00
Ines Montani 81606b29bd Merge pull request #6104 from svlandeg/fix/debug_model [ci skip] 2020-09-22 09:31:23 +02:00
Ines Montani beb766d0a0 Add test 2020-09-22 09:15:57 +02:00
Ines Montani 285fa934d8 Merge branch 'chore/tidy-up-tests-docs-get-doc' of https://github.com/explosion/spaCy into chore/tidy-up-tests-docs-get-doc 2020-09-22 09:10:14 +02:00
Ines Montani 69f7e52c26 Update README.md 2020-09-22 09:10:06 +02:00
svlandeg 45b29c4a5b cleanup 2020-09-21 23:17:23 +02:00
svlandeg fa5c416db6 initialize through nlp object and with train_corpus 2020-09-21 23:09:22 +02:00
Matthew Honnibal 3abc4a5adb Slightly tidy doc.ents.__set__ 2020-09-21 22:58:03 +02:00
Ines Montani 67fbcb3da5 Tidy up tests and docs 2020-09-21 20:43:54 +02:00
Ines Montani a5f6ab4943 Merge pull request #6098 from adrianeboyd/feature/doc-init 2020-09-21 18:35:20 +02:00
Adriane Boyd f212303729 Add sent_starts to Doc.__init__
Add sent_starts to `Doc.__init__`. Officially specify `is_sent_start`
values but also convert to and accept `sent_start` internally.
2020-09-21 17:59:09 +02:00
svlandeg 447b3e5787 Merge remote-tracking branch 'upstream/develop' into fix/debug_model
# Conflicts:
#	spacy/cli/debug_model.py
2020-09-21 16:58:40 +02:00
Ines Montani b3327c1e45 Increment version [ci skip] 2020-09-21 16:04:30 +02:00
Ines Montani e8bcaa44f1 Don't auto-decompress archives with smart_open [ci skip] 2020-09-21 16:01:46 +02:00
Adriane Boyd 6aa91c7ca0 Make user_data keyword-only 2020-09-21 16:00:06 +02:00
Adriane Boyd 177df15d89 Implement Doc.set_ents 2020-09-21 15:54:05 +02:00
Adriane Boyd 13fbf6556a Merge remote-tracking branch 'upstream/develop' into feature/doc-ents-v3-2 2020-09-21 14:42:04 +02:00
svlandeg eb9b447960 Merge remote-tracking branch 'upstream/develop' into fix/debug_model
# Conflicts:
#	spacy/cli/debug_model.py
2020-09-21 14:05:16 +02:00
Adriane Boyd ce455f30ca Fix formatting 2020-09-21 13:53:29 +02:00
Adriane Boyd bc02e86494 Extend Doc.__init__ with additional annotation
Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to
`Doc.__init__` to initialize the most common doc/token values.
2020-09-21 13:36:24 +02:00
Ines Montani 758ead8a47 Sync overrides with CLI overrides 2020-09-21 12:50:13 +02:00
Ines Montani 5497acf49a Support config overrides via environment variables 2020-09-21 11:25:10 +02:00
Ines Montani 1114219ae3 Tidy up and auto-format 2020-09-21 10:59:07 +02:00
Ines Montani b2302c0a1c Improve error for missing dependency 2020-09-20 17:44:51 +02:00
Matthew Honnibal 8fb59d958c Format 2020-09-20 16:31:48 +02:00
Matthew Honnibal dc22771f87 Fix sparse checkout 2020-09-20 16:30:05 +02:00
Matthew Honnibal a0fb5e50db Use simple git clone call if not sparse 2020-09-20 16:22:04 +02:00
Matthew Honnibal 2c24d633d0 Use updated run_command 2020-09-20 16:21:43 +02:00
Matthew Honnibal 889128e5c5 Improve error handling in run_command 2020-09-20 16:20:57 +02:00
Ines Montani 554c9a2497 Update docs [ci skip] 2020-09-20 12:30:53 +02:00
svlandeg 6db1d5dc0d trying some stuff 2020-09-19 19:11:30 +02:00
Ines Montani e863b3dc14 Merge pull request #6092 from adrianeboyd/bugfix/load-vocab-lookups-2 2020-09-19 12:33:38 +02:00
Sofie Van Landeghem 39872de1f6 Introducing the gpu_allocator (#6091)
* rename 'use_pytorch_for_gpu_memory' to 'gpu_allocator'

* --code instead of --code-path

* update documentation

* avoid querying the "system" section directly

* add explanation of gpu_allocator to TF/PyTorch section in docs

* fix typo

* fix typo 2

* use set_gpu_allocator from thinc 8.0.0a34

* default null instead of empty string
2020-09-19 01:17:02 +02:00
Adriane Boyd 47080fba98 Minor renaming / refactoring
* Rename loader to `spacy.LookupsDataLoader.v1`, add debugging message
* Make `Vocab.lookups` a property
2020-09-18 19:43:19 +02:00
svlandeg 73ff52b9ec hack for tok2vec listener 2020-09-18 16:43:15 +02:00
Adriane Boyd eed4b785f5 Load vocab lookups tables at beginning of training
Similar to how vectors are handled, move the vocab lookups to be loaded
at the start of training rather than when the vocab is initialized,
since the vocab doesn't have access to the full config when it's
created.

The option moves from `nlp.load_vocab_data` to `training.lookups`.

Typically these tables will come from `spacy-lookups-data`, but any
`Lookups` object can be provided.

The loading from `spacy-lookups-data` is now strict, so configs for each
language should specify the exact tables required. This also makes it
easier to control whether the larger clusters and probs tables are
included.

To load `lexeme_norm` from `spacy-lookups-data`:

```
[training.lookups]
@misc = "spacy.LoadLookupsData.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
```
2020-09-18 15:59:16 +02:00
Ines Montani a127fa475e Merge pull request #6078 from svlandeg/fix/corpus 2020-09-18 14:44:21 +02:00
Matthew Honnibal bbdb5f62b7 Temporary work-around for scoring a subset of components (#6090)
* Try hacking the scorer to work around sentence boundaries

* Upd scorer

* Set dev version

* Upd scorer hack

* Fix version

* Improve comment on hack
2020-09-18 14:26:42 +02:00
Adriane Boyd a88106e852 Remove W106: HEAD and SENT_START in doc.from_array (#6086)
* Remove W106: HEAD and SENT_START in doc.from_array

This warning was hacky and being triggered too often.

* Fix test
2020-09-18 03:01:29 +02:00
svlandeg e4fc7e0222 fixing output sample to proper 2D array 2020-09-17 22:34:36 +02:00
Adriane Boyd 8b650f3a78 Modify setting missing and blocked entity tokens
In order to make it easier to construct `Doc` objects as training data,
modify how missing and blocked entity tokens are set to prioritize
setting `O` and missing entity tokens for training purposes over setting
blocked entity tokens.

* `Doc.ents` setter sets tokens outside entity spans to `O` regardless
of the current state of each token

* For `Doc.ents`, setting a span with a missing label sets the `ent_iob`
to missing instead of blocked

* `Doc.block_ents(spans)` marks spans as hard `O` for use with the
`EntityRecognizer`
2020-09-17 21:27:42 +02:00
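The first two rules in the commit above (tokens outside entity spans become `O`, and a span with a missing label is marked as missing rather than blocked) can be sketched in plain Python. This is only an illustration under stated assumptions, not spaCy's `Doc.ents` setter: the function name `iob_from_spans`, the `(start, end, label)` tuple format with an exclusive end, and the use of an empty string for "missing" are all hypothetical.

```python
def iob_from_spans(n_tokens, spans):
    """Build per-token IOB tags from entity spans (hypothetical helper).

    spans: list of (start, end, label) tuples, end exclusive.
    An empty label stands for a span whose annotation is missing.
    """
    # Every token starts as "O", regardless of any earlier state.
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            if not label:
                tags[i] = ""  # missing annotation, not a blocked token
            elif i == start:
                tags[i] = "B-" + label
            else:
                tags[i] = "I-" + label
    return tags


tags = iob_from_spans(5, [(0, 2, "ORG"), (3, 4, "")])
print(tags)
```

Keeping "missing" distinct from `O` matters for training data: a missing tag says nothing about the token, whereas `O` asserts that the token is outside any entity.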
Ines Montani 3865214343 Use consistent shortcut 2020-09-17 16:57:02 +02:00
svlandeg 35a3931064 fix typo 2020-09-17 16:36:27 +02:00
svlandeg ddfc1fc146 add pretraining option to init config 2020-09-17 16:05:40 +02:00
svlandeg 427dbecdd6 cleanup and formatting 2020-09-17 11:48:04 +02:00
svlandeg 0c35885751 generalize corpora, dot notation for dev and train corpus 2020-09-17 11:38:59 +02:00
svlandeg 781fae678b Merge remote-tracking branch 'upstream/develop' into fix/corpus 2020-09-17 09:24:36 +02:00
Matthew Honnibal 8303d101a5 Set version to v3.0.0a19 2020-09-17 00:18:49 +02:00
Adriane Boyd 7e4cd7575c Refactor Doc.is_ flags (#6044)
* Refactor Doc.is_ flags

* Add derived `Doc.has_annotation` method

  * `Doc.has_annotation(attr)` returns `True` for partial annotation

  * `Doc.has_annotation(attr, require_complete=True)` returns `True` for
    complete annotation

* Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced`
and `is_nered`

* Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs
for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The
list is the `DocBin` attributes list plus `SPACY` and `LENGTH`.

Notes on `Doc.has_annotation`:

* `HEAD` is converted to `DEP` because heads don't have an unset state

* Accept `IS_SENT_START` as a synonym of `SENT_START`

Additional changes:

* Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for
`DocBin`

* In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override
`SENT_START`

* In `Doc.from_array()` using `attrs` other than
`Doc._get_array_attrs()` (i.e., a user's custom list rather than our
default internal list) with both `HEAD` and `SENT_START` shows a warning
that `HEAD` will override `SENT_START`

* `set_children_from_heads` does not require dependency labels to set
sentence boundaries and sets `sent_start` for all non-sentence starts to
`-1`

* Fix call to set_children_from_heads

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-17 00:14:01 +02:00
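The difference between partial and complete annotation described for `Doc.has_annotation` in the commit above can be shown with a small stand-in. This is a sketch only: the free function below works on a plain list and uses `None` to mark an unset value, both of which are assumptions for illustration rather than how spaCy stores or checks attributes.

```python
def has_annotation(values, require_complete=False):
    """Hypothetical stand-in: values holds one entry per token, None = unset."""
    annotated = [v is not None for v in values]
    # Partial annotation: at least one token is set.
    # Complete annotation: every token is set.
    return all(annotated) if require_complete else any(annotated)


tags = ["NOUN", None, "VERB"]
print(has_annotation(tags))                         # partially annotated
print(has_annotation(tags, require_complete=True))  # one token is unset
```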
Adriane Boyd a119667a36 Clean up spacy.tokens (#6046)
* Clean up spacy.tokens

* Update `set_children_from_heads`:
  * Don't check `dep` when setting lr_* or sentence starts
  * Set all non-sentence starts to `False`

* Use `set_children_from_heads` in `Token.head` setter
  * Reduce similar/duplicate code (admittedly adds a bit of overhead)
  * Update sentence starts consistently

* Remove unused `Doc.set_parse`

* Minor changes:
  * Declare cython variables (to avoid cython warnings)
  * Clean up imports

* Modify set_children_from_heads to set token range

Modify `set_children_from_heads` so that it adjusts tokens within a
specified range rather than the whole document.

Modify the `Token.head` setter to adjust only the tokens affected by the
new head assignment.
2020-09-16 20:32:38 +02:00
Matthew Honnibal c776594ab1 Fix 2020-09-16 18:15:14 +02:00
Matthew Honnibal 4a573d18b3 Add comment 2020-09-16 17:51:29 +02:00
Matthew Honnibal d31afc8334 Fix Language.link_components when model is None 2020-09-16 17:49:48 +02:00
Adriane Boyd f3db3f6fe0 Add vectors option to CharacterEmbed (#6069)
* Add vectors option to CharacterEmbed

* Update spacy/pipeline/morphologizer.pyx

* Adjust default morphologizer config

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-16 17:45:04 +02:00
Adriane Boyd d722a439aa Remove unneeded methods in senter and morphologizer (#6074)
Now that the tagger doesn't manage the tag map, the child classes senter
and morphologizer don't need to override the serialization methods.
2020-09-16 17:39:41 +02:00
Adriane Boyd 87c329c711 Set rule-based lemmatizers as default (#6076)
For languages without provided models and with lemmatizer rules in
`spacy-lookups-data`, make the rule-based lemmatizer the default:
Bengali, Persian, Norwegian, Swedish
2020-09-16 17:37:29 +02:00
svlandeg 1040e250d8 actual commit with test for custom readers with ml_datasets >= 0.2 2020-09-16 16:41:28 +02:00
svlandeg 714a5a05c6 test for custom readers with ml_datasets >= 0.2 2020-09-16 16:39:55 +02:00
svlandeg 0d1392340f Merge remote-tracking branch 'upstream/develop' into fix/corpus 2020-09-15 23:17:08 +02:00
svlandeg f420aa1138 use e.value to get to the ExceptionInfo value 2020-09-15 22:30:09 +02:00
svlandeg 7336657662 corpus is a Dict 2020-09-15 22:07:16 +02:00
svlandeg 51fa929f47 rewrite train_corpus to corpus.train in config 2020-09-15 21:58:04 +02:00
svlandeg bd87e8686e move tests to correct subdir 2020-09-15 21:40:38 +02:00
Ines Montani aaf01689a1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-15 14:24:42 +02:00
Ines Montani 91a6637f74 Remove extra pipe config values before merging 2020-09-15 14:24:17 +02:00
Ines Montani d3d7f92f05 Fix lang check and error handling in Language.from_config 2020-09-15 14:24:06 +02:00
Ines Montani 2ed6e2a218 Auto-format 2020-09-15 14:20:04 +02:00
Ines Montani 2214d1bb7b Merge pull request #6067 from explosion/feature/spacy-blank-from-config 2020-09-15 14:18:33 +02:00
Ines Montani 253ba5ef14 Raise for bad Vocab values 2020-09-15 13:25:34 +02:00
svlandeg 7677e5c0e2 fix wandb logger when calling multiple times from same script 2020-09-15 12:56:33 +02:00
Ines Montani eff9406718 Support vocab arg in spacy.blank 2020-09-15 11:39:36 +02:00
Ines Montani 99549a5ace Fix consistency and update docs 2020-09-15 11:37:37 +02:00
Ines Montani 7dfc4bc062 Allow overriding meta from spacy.blank 2020-09-15 11:12:12 +02:00
Ines Montani 0f943157af Delegate to Language.from_config in spacy.blank 2020-09-15 11:07:55 +02:00
Ines Montani e977086a9a Update default pretraining config [ci skip] 2020-09-15 01:12:02 +02:00
Ines Montani 154752f9c2 Update docs and consistency [ci skip] 2020-09-15 00:32:49 +02:00
Ines Montani 9cc304c194 Merge pull request #6064 from explosion/fix/sparse-checkout-ux
Fix sparse checkout and error handling
2020-09-15 00:32:20 +02:00
Matthew Honnibal 475323cd36 Set version to v3.0.0a18 2020-09-14 22:05:43 +02:00
Matthew Honnibal e8378b57bc Fix test 2020-09-14 21:21:13 +02:00
Matthew Honnibal adf0bab23a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-14 21:04:49 +02:00
Matthew Honnibal ae15fa9688 Fix iob converter 2020-09-14 21:02:18 +02:00
Sofie Van Landeghem 3216a33149 positive_label config for textcat (#6062)
* hook up positive_label in textcat

* unit tests

* documentation

* formatting

* tests

* fix typo

* move verify_config to after begin_training

* revert accidental commit
2020-09-14 17:08:00 +02:00
Ines Montani c052017025 Fix sparse checkout and error handling 2020-09-14 14:12:58 +02:00
Matthew Honnibal fdd2340f6c Set version to v3.0.0a17 2020-09-13 23:52:03 +02:00
Ines Montani 416deb412f Prevent duplicate traceback on CalledProcessError [ci skip] 2020-09-13 19:28:54 +02:00
Ines Montani 61a4ef0b46 Fix syntax error 2020-09-13 19:23:09 +02:00
Matthew Honnibal b693d2d224 Fix speed report in table 2020-09-13 17:39:31 +02:00
Sofie Van Landeghem 744df9814a define threshold for scoring textcat in TextCat config (#6055)
* define threshold for scoring textcat in TextCat config

* fix unit test and documentation
2020-09-13 14:15:52 +02:00
Adriane Boyd ab270364f1 Modify Token.morph to enable unsetting (#6043)
Modify `Token.morph` property so that `Token.c.morph` can be reset back
to an internal value of `0`. Allow setting `Token.morph` from a hash as
long as the morph string is already in the `StringStore`, setting it
indirectly through `Token.morph_` so that the value is added to the
morphology. If the hash is not in the `StringStore`, raise an error.
2020-09-13 14:06:07 +02:00
Adriane Boyd c7bd631b5f Fix token.idx for special cases with affixes (#6035) 2020-09-13 14:05:36 +02:00
Matthew Honnibal 54c40223a1 Improve v3 pretrain command (#6040)
* Starts to run

* Update pretrain script

* Update corpus

* Update pretrain schema

* Remove outdated test

* Make JsonlTexts produce Example objects.
2020-09-13 14:05:05 +02:00
Ines Montani febb99916d Tidy up and auto-format [ci skip] 2020-09-13 10:55:36 +02:00
Ines Montani a5633b205f Fix handling of errors around git [ci skip] 2020-09-13 10:52:28 +02:00
Ines Montani f8846c198d Update types and docstrings 2020-09-13 10:52:02 +02:00
Sofie Van Landeghem e92e850c72 Raise if empty examples (#6052)
* raise error if no valid Example objects were found during initialization

* fix max_length parameter

* remove commit from other branch

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-12 21:01:53 +02:00
Matthew Honnibal 37347830d4 Fix reading in GloVe vectors 2020-09-12 17:31:18 +02:00
Ines Montani b41be87213 Merge pull request #6051 from svlandeg/feature/cli-config 2020-09-12 17:12:35 +02:00
Ines Montani eedaaaec75 Fix handling of existing asset without checksum [ci skip] 2020-09-12 17:02:53 +02:00
svlandeg a75cfe0da6 Merge remote-tracking branch 'upstream/develop' into feature/cli-config 2020-09-12 14:44:40 +02:00
svlandeg 115147804a string_to_list to parse comma-separated string into a list 2020-09-12 14:43:22 +02:00
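A helper of the kind named in the commit above might look like the following. This is a minimal sketch, not necessarily how spaCy implements `string_to_list`: stripping whitespace and dropping empty items are assumptions made for illustration.

```python
def string_to_list(value):
    """Split a comma-separated string into a list of stripped items."""
    if not value:
        return []
    # Strip surrounding whitespace and skip empty items such as "a,,b".
    return [item.strip() for item in value.split(",") if item.strip()]


print(string_to_list("tagger, parser,ner"))
```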
Ines Montani f886f5bbc8 Merge pull request #6048 from explosion/fix/clone-compat 2020-09-12 10:30:49 +02:00
svlandeg 711166a75a prevent overwriting score_weights 2020-09-11 15:12:05 +02:00
Ines Montani 62eec33bc4 Fix meta.json validation 2020-09-11 11:38:33 +02:00
Ines Montani 0b2e07215d Support overwriting name on spacy package 2020-09-11 11:38:28 +02:00
svlandeg 5b94aeece9 support pipeline as "list in string" 2020-09-11 11:08:46 +02:00
Ines Montani 1bce432b4a Adjust message [ci skip] 2020-09-11 10:00:49 +02:00
Ines Montani 5acd4fbcd8 Merge branch 'develop' into fix/clone-compat 2020-09-11 09:58:30 +02:00
Ines Montani 761bd60d43 Adjust info message 2020-09-11 09:57:00 +02:00
Ines Montani 6831161bfa Resolve path to be extra sure 2020-09-11 09:56:49 +02:00
svlandeg 1723fb73c4 remove brol 2020-09-10 17:44:59 +02:00
svlandeg 08a831ce83 process trailing slash if any 2020-09-10 17:39:52 +02:00
Ines Montani 3e83a509bb WIP: fix project clone compatibility 2020-09-10 15:49:13 +02:00
svlandeg f1bc09c1e9 restore partly 2020-09-10 14:53:02 +02:00
svlandeg 3889747119 asset fix & UX 2020-09-10 14:36:53 +02:00
svlandeg a36766d153 hookup branch 2020-09-10 12:00:34 +02:00
svlandeg 97d99f7efa Merge remote-tracking branch 'upstream/develop' into feature/doc-fixes 2020-09-10 11:51:34 +02:00
Ines Montani 908f3a4494 Update default projects repo [ci skip] 2020-09-10 11:42:14 +02:00
svlandeg 92f9d2f406 small UX fixes 2020-09-10 11:35:50 +02:00
svlandeg 1fc5486792 more fine-grained errors for git_sparse_checkout 2020-09-10 11:31:32 +02:00
Ines Montani 15bc3a37b4 Add --branch to project clone 2020-09-10 11:08:15 +02:00
Ines Montani 1955aaaa20 Merge pull request #6045 from svlandeg/feature/more-layers-docs [ci skip] 2020-09-09 21:46:40 +02:00
Sofie Van Landeghem cb66ea7400 Remove simple_ner code (#6041)
* remove simple_ner code

* remove unused _biluo and _iob files
2020-09-09 16:11:27 +02:00
svlandeg 39aa740777 Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs 2020-09-09 11:59:34 +02:00
Sofie Van Landeghem 8e7557656f Renaming gold & annotation_setter (#6042)
* version bump to 3.0.0a16

* rename "gold" folder to "training"

* rename 'annotation_setter' to 'set_extra_annotations'

* formatting
2020-09-09 10:31:03 +02:00
Sofie Van Landeghem 60f22e1800 Pipe API (#6034)
* ensure Language passes on valid examples for initialization

* fix tagger model initialization

* check for valid get_examples across components

* assume labels were added before begin_training

* fix senter initialization

* fix morphologizer initialization

* use methods to check arguments

* test textcat init, requires thinc>=8.0.0a31

* fix tok2vec init

* fix entity linker init

* use islice

* fix simple NER

* cleanup debug model

* fix assert statements

* fix tests

* throw error when adding a label if the output layer can't be resized anymore

* fix test

* add failing test for simple_ner

* UX improvements

* morphologizer UX

* assume begin_training gets a representative set and processes the labels

* remove assumptions for output of untrained NER model

* restore test for original purpose
2020-09-08 22:44:25 +02:00
svlandeg d0a8849e4d fix typo 2020-09-08 18:32:12 +02:00
svlandeg bd8f9b188b small fixes 2020-09-08 17:24:36 +02:00
Matthew Honnibal 4b82882767 Fix defaults 2020-09-08 15:31:21 +02:00
Matthew Honnibal 5d09e3e154 Set version to v3.0.0a15 2020-09-08 15:25:10 +02:00
Matthew Honnibal ba5f4c9b32 Add words and seconds to train info 2020-09-08 15:24:47 +02:00
Matthew Honnibal b470062153 Add CLI registry (#6037) 2020-09-08 15:23:34 +02:00
svlandeg 06ef66fd73 Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs 2020-09-08 10:28:42 +02:00
Matthew Honnibal dae22f3dfa Fix ignoring of punct labels 2020-09-05 14:11:59 +02:00
Matthew Honnibal 12e1279f6b Set version to v3.0.0a14 2020-09-05 04:13:53 +02:00
Matthew Honnibal 4b7abaafdb Fix learn rate for non-transformer 2020-09-04 21:22:50 +02:00
Matthew Honnibal 465785a672 Fix project pull and push 2020-09-04 21:15:55 +02:00
Ines Montani f174c7b1f3 Merge branch 'develop' into pr/6018 2020-09-04 15:54:49 +02:00
Ines Montani f06eed800e Merge pull request #6029 from explosion/master-tmp 2020-09-04 15:11:55 +02:00
Ines Montani f9550b4493 Fix components in meta.json and website [ci skip] 2020-09-04 14:42:12 +02:00
Ines Montani d7cc2ee72d Fix tests 2020-09-04 14:05:55 +02:00
Ines Montani 90043a6f9b Tidy up and auto-format 2020-09-04 13:42:33 +02:00
Ines Montani df0b68f60e Remove unicode declarations and update language data 2020-09-04 13:19:16 +02:00
Ines Montani ba600f91c5 Tidy up imports 2020-09-04 13:15:44 +02:00
Ines Montani 864a697e63 Merge branch 'develop' into master-tmp 2020-09-04 13:15:36 +02:00