Ines Montani
d5155376fd
Update vocab init
2020-09-28 11:30:18 +02:00
Ines Montani
8b74fd19df
init pipeline -> init nlp
2020-09-28 11:13:38 +02:00
Ines Montani
2fdb7285a0
Update CLI
2020-09-28 11:06:07 +02:00
Ines Montani
553bfea641
Fix commands
2020-09-28 10:53:17 +02:00
Matthew Honnibal
44bad1474c
Add init_pipeline file
2020-09-28 09:47:34 +02:00
Matthew Honnibal
65448b2e34
Remove schema=None until Optional
2020-09-28 03:42:58 +02:00
Matthew Honnibal
b886f53c31
init-pipeline runs (maybe doesn't work)
2020-09-28 03:42:47 +02:00
Matthew Honnibal
ed2aff2db3
Remove unused train code
2020-09-28 03:12:31 +02:00
Matthew Honnibal
3a0a3b8db6
Don't hard-code for 'corpora' name
2020-09-28 03:06:33 +02:00
Matthew Honnibal
a023cf3ecc
Add (untested) resolve_dot_names util
2020-09-28 03:06:12 +02:00
Matthew Honnibal
a976da168c
Support data augmentation in Corpus ( #6155 )
...
* Support data augmentation in Corpus
* Note initial docs for data augmentation
* Add augmenter to quickstart
* Fix flake8
* Format
* Fix test
* Update spacy/tests/training/test_training.py
* Improve data augmentation arguments
* Update templates
* Move randomization out into caller
* Refactor
* Update spacy/training/augment.py
* Update spacy/tests/training/test_training.py
* Fix augment
* Fix test
2020-09-28 03:03:27 +02:00
Matthew Honnibal
13b1605ee6
Add init script
2020-09-28 01:08:49 +02:00
Matthew Honnibal
a3e1791c9c
Upd train
2020-09-28 01:08:30 +02:00
Matthew Honnibal
b5556093e2
Start updating train script
2020-09-27 23:59:44 +02:00
Ines Montani
9016d23cc5
Fix exclude and add test
2020-09-27 23:34:03 +02:00
Ines Montani
658fad428a
Fix base schema integration
2020-09-27 22:50:36 +02:00
Ines Montani
e04bd16f7f
Merge branch 'develop' into feature/new-thinc-config-resolution
2020-09-27 22:34:46 +02:00
Ines Montani
d7ad65a9bb
Fix handling of error description [ci skip]
2020-09-27 22:31:57 +02:00
Ines Montani
7e938ed63e
Update config resolution to use new Thinc
2020-09-27 22:21:31 +02:00
Adriane Boyd
013b66de05
Add tokenizer scoring to ja / ko / zh ( #6152 )
2020-09-27 22:20:45 +02:00
Adriane Boyd
a6548ead17
Add _ as a symbol ( #6153 )
...
* Add _ to StringStore in Morphology
* Add _ as a symbol
Add `_` as a symbol instead of adding to the `StringStore`.
2020-09-27 22:20:14 +02:00
Matthew Honnibal
39b178999c
Tmp notes
2020-09-27 20:13:38 +02:00
Adriane Boyd
8393dbedad
Minor fixes
...
* Put `cfg` back in serialization
* Add `pickle5` to pytest conf
2020-09-27 15:15:53 +02:00
Adriane Boyd
54fe871935
Fix formatting, refactor pickle5 exceptions
2020-09-27 14:37:28 +02:00
Adriane Boyd
11e195d3ed
Update ChineseTokenizer
...
* Allow `pkuseg_model` to be set to `None` on initialization
* Don't save config within tokenizer
* Force convert pkuseg_model to use pickle protocol 4 by reencoding with
`pickle5` on serialization
* Update pkuseg serialization test
2020-09-27 14:00:18 +02:00
Ines Montani
b4486d747d
Merge branch 'develop' into fix/train-config-interpolation
2020-09-26 15:32:14 +02:00
Ines Montani
8fea06d55e
Merge pull request #6149 from adrianeboyd/feature/attributeruler-match-ids
...
Simplify string match IDs for AttributeRuler
2020-09-26 15:31:30 +02:00
Ines Montani
b2d07de786
Construct nlp from uninterpolated config before training
2020-09-26 15:16:59 +02:00
Ines Montani
ca3c997062
Improve CLI config validation with latest Thinc
2020-09-26 13:13:57 +02:00
Adriane Boyd
6c25e60089
Simplify string match IDs for AttributeRuler
2020-09-26 11:12:39 +02:00
Matthew Honnibal
702edf52a0
Fix attributeruler
2020-09-26 00:30:48 +02:00
Matthew Honnibal
821f37254c
Fix attributeruler
2020-09-26 00:19:53 +02:00
Matthew Honnibal
98327f66a9
Fix attributeruler key
2020-09-25 23:20:50 +02:00
Matthew Honnibal
092ce4648e
Make DocBin output stable data (set iteration)
2020-09-25 22:20:44 +02:00
Matthew Honnibal
26afd3bd90
Fix iteration order
2020-09-25 21:47:22 +02:00
Matthew Honnibal
3d8388969e
Sort paths for cache consistency
2020-09-25 19:07:26 +02:00
Adriane Boyd
c3b5a3cfff
Clean up MorphAnalysisC struct ( #6146 )
2020-09-25 15:56:48 +02:00
Sofie Van Landeghem
009ba14aaf
Fix pretraining in train script ( #6143 )
...
* update pretraining API in train CLI
* bump thinc to 8.0.0a35
* bump to 3.0.0a26
* doc fixes
* small doc fix
2020-09-25 15:47:10 +02:00
Adriane Boyd
50f20cf722
Revert changes to Scorer.score_spans
2020-09-25 08:21:47 +02:00
Matthew Honnibal
93d7ff309f
Remove print
2020-09-24 21:05:27 +02:00
Matthew Honnibal
16475528f7
Fix skipped documents in entity scorer ( #6137 )
...
* Fix skipped documents in entity scorer
* Add back the skipping of unannotated entities
* Update spacy/scorer.py
* Use more specific NER scorer
* Fix import
* Fix get_ner_prf
* Add scorer
* Fix scorer
Co-authored-by: Ines Montani <ines@ines.io>
2020-09-24 20:38:57 +02:00
Matthew Honnibal
2abb4ba9db
Make a pre-check to speed up alignment cache ( #6139 )
...
* Dirty trick to fast-track alignment cache
* Improve alignment cache check
* Fix header
* Fix align cache
* Fix align logic
2020-09-24 18:13:39 +02:00
Ines Montani
26e28ed413
Fix combined scores if multiple components report it
2020-09-24 17:11:13 +02:00
Ines Montani
0b52b6904c
Update entity_linker.py
2020-09-24 17:10:35 +02:00
Ines Montani
20b89a9717
Increment version [ci skip]
2020-09-24 16:57:02 +02:00
Adriane Boyd
3c062b3911
Add MORPH handling to Matcher ( #6107 )
...
* Add MORPH handling to Matcher
* Add `MORPH` to `Matcher` schema
* Rename `_SetMemberPredicate` to `_SetPredicate`
* Add `ISSUBSET` and `ISSUPERSET` operators to `_SetPredicate`
* Add special handling for normalization and conversion of morph
values into sets
* For other attrs, `ISSUBSET` acts like `IN` and `ISSUPERSET` only
matches for 0 or 1 values
* Update test
* Rename to IS_SUBSET and IS_SUPERSET
2020-09-24 16:55:09 +02:00
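The set semantics described in this commit message can be sketched in plain Python. This is a simplified model and not spaCy's `Matcher` implementation: the helpers `morph_to_set` and `set_predicate`, and the assumption that morph values are `|`-separated feature strings, are hypothetical and only serve to illustrate the behaviour.

```python
def morph_to_set(value):
    """Split a morph string such as "Number=Sing|Case=Nom" into a set of features."""
    return set(value.split("|")) if value else set()


def set_predicate(op, token_value, pattern_values, is_morph=True):
    """Hypothetical stand-in for the IS_SUBSET / IS_SUPERSET checks."""
    # MORPH values are compared as sets of features; other attrs hold one value.
    token_set = morph_to_set(token_value) if is_morph else {token_value}
    pattern_set = set(pattern_values)
    if op == "IS_SUBSET":
        # For a single-valued attr this behaves like IN.
        return token_set <= pattern_set
    if op == "IS_SUPERSET":
        # For a single-valued attr this can only match 0 or 1 pattern values.
        return token_set >= pattern_set
    raise ValueError(f"Unknown operator: {op}")
```

Under these assumptions, a single-valued attribute wrapped in a one-element set reproduces the two special cases listed above without any extra branching.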
Adriane Boyd
59340606b7
Add option to disable Matcher errors ( #6125 )
...
* Add option to disable Matcher errors
* Add option to disable Matcher errors when a doc doesn't contain a
particular type of annotation
Minor additional change:
* Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH`
values
* Rename suppress_errors to allow_missing
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Refactor annotation checks in Matcher and PhraseMatcher
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-24 16:54:39 +02:00
Sofie Van Landeghem
c7eedd3534
updates to NEL functionality ( #6132 )
...
* NEL: read sentences and ents from reference
* fiddling with sent_start annotations
* add KB serialization test
* KB write additional file with strings.json
* score_links function to calculate NEL P/R/F
* formatting
* documentation
2020-09-24 16:53:59 +02:00
Ines Montani
d0ef4a4cf5
Prevent division by zero in score weights
2020-09-24 16:42:13 +02:00
Matthew Honnibal
74ee456374
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2020-09-24 16:11:47 +02:00
Matthew Honnibal
0bc214c102
Fix pull
2020-09-24 16:11:33 +02:00
Ines Montani
3f751e68f5
Increment version [ci skip]
2020-09-24 14:45:41 +02:00
Ines Montani
58dde293ce
Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2
2020-09-24 14:44:42 +02:00
Ines Montani
74e1f192b4
Merge pull request #6134 from explosion/feature/training_before_to_disk
2020-09-24 14:44:11 +02:00
Ines Montani
24e7ac3f2b
Fix download CLI [ci skip]
2020-09-24 14:43:56 +02:00
Ines Montani
88e54caa12
accuracy -> performance
2020-09-24 14:32:35 +02:00
Ines Montani
92f8b6959a
Fix typo
2020-09-24 13:48:41 +02:00
Adriane Boyd
5c13e0cf1b
Remove unused error
2020-09-24 13:41:55 +02:00
Ines Montani
be56c0994b
Add [training.before_to_disk] callback
2020-09-24 12:40:25 +02:00
Adriane Boyd
8eaacaae97
Refactor Doc.ents setter to use Doc.set_ents
...
Additional changes:
* Entity spans with missing labels are ignored
* Fix ent_kb_id setting in `Doc.set_ents`
2020-09-24 12:36:51 +02:00
Ines Montani
c6c67b606e
Merge pull request #6133 from explosion/fix/score_weights
2020-09-24 12:00:57 +02:00
Ines Montani
f69fea8b25
Improve error handling around non-number scores
2020-09-24 11:29:07 +02:00
Ines Montani
4eb39b5c43
Fix logging
2020-09-24 11:04:35 +02:00
Ines Montani
4bbe41f017
Fix combined scores and update test
2020-09-24 10:42:47 +02:00
Sofie Van Landeghem
c645c4e7ce
fix micro PRF for textcat ( #6130 )
...
* fix micro PRF for textcat
* small fix
2020-09-24 10:31:17 +02:00
Matthew Honnibal
17a6b0a173
Make project pull order insensitive ( #6131 )
2020-09-24 10:30:42 +02:00
Ines Montani
ae51f580c1
Fix handling of score_weights
2020-09-24 10:27:33 +02:00
Ines Montani
f25f05c503
Adjust sort order [ci skip]
2020-09-23 20:03:04 +02:00
Ines Montani
3f77eb749c
Increment version [ci skip]
2020-09-23 19:50:15 +02:00
svlandeg
b816ace4bb
format
2020-09-23 17:33:13 +02:00
svlandeg
5a9fdbc8ad
state_type as Literal
2020-09-23 17:32:14 +02:00
svlandeg
35dbc63578
Merge remote-tracking branch 'upstream/develop' into fix/nr_features
...
# Conflicts:
# spacy/ml/models/parser.py
# spacy/tests/serialize/test_serialize_config.py
# website/docs/api/architectures.md
2020-09-23 17:01:13 +02:00
svlandeg
25b34bba94
throw custom error when state_type is invalid
2020-09-23 16:57:14 +02:00
Ines Montani
916050bf2f
Merge pull request #6127 from explosion/feature/literal-nr_feature_tokens
2020-09-23 16:56:08 +02:00
Ines Montani
3c3863654e
Increment version [ci skip]
2020-09-23 16:54:43 +02:00
svlandeg
dd2292793f
'parser' instead of 'deps' for state_type
2020-09-23 16:53:49 +02:00
Ines Montani
50a4425cda
Adjust docs
2020-09-23 16:03:32 +02:00
Ines Montani
76bbed3466
Use Literal type for nr_feature_tokens
2020-09-23 16:00:03 +02:00
Muhammad Fahmi Rasyid
7489d02dea
Update Indonesian Example Phrases ( #6124 )
...
* create contributor agreement
* Update Indonesian example. (see #1107 )
Update Indonesian examples with more appropriate phrases. The current phrases contain sensitive and violent words.
2020-09-23 14:02:26 +02:00
svlandeg
6c85fab316
state_type and extra_state_tokens instead of nr_feature_tokens
2020-09-23 13:35:09 +02:00
Ines Montani
7745d77a38
Fix whitespace in template [ci skip]
2020-09-23 13:21:42 +02:00
svlandeg
6435458d51
simplify expression
2020-09-23 12:12:38 +02:00
svlandeg
20b0ec5dcf
avoid logging performance of frozen components
2020-09-23 10:37:12 +02:00
Ines Montani
ae5dacf75f
Tidy up and add types
2020-09-23 10:14:34 +02:00
Ines Montani
6ca06cb62c
Update docs and formatting [ci skip]
2020-09-23 10:14:27 +02:00
Ines Montani
888f936a73
Merge pull request #6106 from svlandeg/feature/textcat-quickstart
2020-09-23 10:11:45 +02:00
Ines Montani
60a317520a
Merge pull request #6109 from svlandeg/feature/2rename
2020-09-23 09:47:12 +02:00
Ines Montani
f976bab710
Remove empty file [ci skip]
2020-09-23 09:30:09 +02:00
svlandeg
556f3e4652
add pooling to NEL's TransformerListener
2020-09-23 09:24:28 +02:00
svlandeg
4a56ea72b5
fallbacks for old names
2020-09-23 09:15:07 +02:00
Sofie Van Landeghem
86a08f819d
tok2vec.update instead of predict ( #6113 )
2020-09-22 21:54:52 +02:00
Adriane Boyd
e4acb28658
Fix norm in retokenizer split ( #6111 )
...
Parallel to behavior in merge, reset norm on original token in
retokenizer split.
2020-09-22 21:53:33 +02:00
Sofie Van Landeghem
e0e793be4d
fix KB IO ( #6118 )
2020-09-22 21:53:06 +02:00
Adriane Boyd
9b4979407d
Fix overlapping German noun chunks ( #6112 )
...
Add a similar fix as in #5470 to prevent the German noun chunks iterator
from producing overlapping spans.
2020-09-22 21:52:42 +02:00
Adriane Boyd
b1a7d6c528
Refactor seen token detection
2020-09-22 14:42:51 +02:00
Sofie Van Landeghem
d53c84b6d6
avoid None callback ( #6100 )
2020-09-22 13:54:44 +02:00
Adriane Boyd
535842e483
Merge branch 'develop' into feature/doc-ents-v3-2
2020-09-22 13:45:50 +02:00
Ines Montani
5e3b796b12
Validate section refs in debug config
2020-09-22 12:24:39 +02:00
svlandeg
085a1c8e2b
add no_output_layer to TextCatBOW config
2020-09-22 12:06:40 +02:00
svlandeg
e1b8090b9b
few more fixes
2020-09-22 12:01:06 +02:00
svlandeg
b556a10808
rename converts in_to_out
2020-09-22 11:50:19 +02:00
svlandeg
e931f4d757
add textcat score
2020-09-22 10:56:43 +02:00
svlandeg
396b33257f
add entity_linker to jinja template
2020-09-22 10:40:05 +02:00
Ines Montani
db7126ead9
Increment version
2020-09-22 10:31:26 +02:00
svlandeg
135de82a2d
add textcat to quickstart
2020-09-22 10:22:06 +02:00
Ines Montani
6316d5f398
Improve messages in project CLI [ci skip]
2020-09-22 09:45:34 +02:00
Ines Montani
49e80dbcac
Merge pull request #6103 from explosion/chore/tidy-up-tests-docs-get-doc
2020-09-22 09:45:04 +02:00
Ines Montani
81606b29bd
Merge pull request #6104 from svlandeg/fix/debug_model [ci skip]
2020-09-22 09:31:23 +02:00
Ines Montani
beb766d0a0
Add test
2020-09-22 09:15:57 +02:00
Ines Montani
285fa934d8
Merge branch 'chore/tidy-up-tests-docs-get-doc' of https://github.com/explosion/spaCy into chore/tidy-up-tests-docs-get-doc
2020-09-22 09:10:14 +02:00
Ines Montani
69f7e52c26
Update README.md
2020-09-22 09:10:06 +02:00
svlandeg
45b29c4a5b
cleanup
2020-09-21 23:17:23 +02:00
svlandeg
fa5c416db6
initialize through nlp object and with train_corpus
2020-09-21 23:09:22 +02:00
Matthew Honnibal
3abc4a5adb
Slightly tidy doc.ents.__set__
2020-09-21 22:58:03 +02:00
Ines Montani
67fbcb3da5
Tidy up tests and docs
2020-09-21 20:43:54 +02:00
Ines Montani
a5f6ab4943
Merge pull request #6098 from adrianeboyd/feature/doc-init
2020-09-21 18:35:20 +02:00
Adriane Boyd
f212303729
Add sent_starts to Doc.__init__
...
Add sent_starts to `Doc.__init__`. Officially specify `is_sent_start`
values but also convert to and accept `sent_start` internally.
2020-09-21 17:59:09 +02:00
svlandeg
447b3e5787
Merge remote-tracking branch 'upstream/develop' into fix/debug_model
...
# Conflicts:
# spacy/cli/debug_model.py
2020-09-21 16:58:40 +02:00
Ines Montani
b3327c1e45
Increment version [ci skip]
2020-09-21 16:04:30 +02:00
Ines Montani
e8bcaa44f1
Don't auto-decompress archives with smart_open [ci skip]
2020-09-21 16:01:46 +02:00
Adriane Boyd
6aa91c7ca0
Make user_data keyword-only
2020-09-21 16:00:06 +02:00
Adriane Boyd
177df15d89
Implement Doc.set_ents
2020-09-21 15:54:05 +02:00
Adriane Boyd
13fbf6556a
Merge remote-tracking branch 'upstream/develop' into feature/doc-ents-v3-2
2020-09-21 14:42:04 +02:00
svlandeg
eb9b447960
Merge remote-tracking branch 'upstream/develop' into fix/debug_model
...
# Conflicts:
# spacy/cli/debug_model.py
2020-09-21 14:05:16 +02:00
Adriane Boyd
ce455f30ca
Fix formatting
2020-09-21 13:53:29 +02:00
Adriane Boyd
bc02e86494
Extend Doc.__init__ with additional annotation
...
Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to
`Doc.__init__` to initialize the most common doc/token values.
2020-09-21 13:36:24 +02:00
Ines Montani
758ead8a47
Sync overrides with CLI overrides
2020-09-21 12:50:13 +02:00
Ines Montani
5497acf49a
Support config overrides via environment variables
2020-09-21 11:25:10 +02:00
Ines Montani
1114219ae3
Tidy up and auto-format
2020-09-21 10:59:07 +02:00
Ines Montani
b2302c0a1c
Improve error for missing dependency
2020-09-20 17:44:51 +02:00
Matthew Honnibal
8fb59d958c
Format
2020-09-20 16:31:48 +02:00
Matthew Honnibal
dc22771f87
Fix sparse checkout
2020-09-20 16:30:05 +02:00
Matthew Honnibal
a0fb5e50db
Use simple git clone call if not sparse
2020-09-20 16:22:04 +02:00
Matthew Honnibal
2c24d633d0
Use updated run_command
2020-09-20 16:21:43 +02:00
Matthew Honnibal
889128e5c5
Improve error handling in run_command
2020-09-20 16:20:57 +02:00
Ines Montani
554c9a2497
Update docs [ci skip]
2020-09-20 12:30:53 +02:00
svlandeg
6db1d5dc0d
trying some stuff
2020-09-19 19:11:30 +02:00
Ines Montani
e863b3dc14
Merge pull request #6092 from adrianeboyd/bugfix/load-vocab-lookups-2
2020-09-19 12:33:38 +02:00
Sofie Van Landeghem
39872de1f6
Introducing the gpu_allocator ( #6091 )
...
* rename 'use_pytorch_for_gpu_memory' to 'gpu_allocator'
* --code instead of --code-path
* update documentation
* avoid querying the "system" section directly
* add explanation of gpu_allocator to TF/PyTorch section in docs
* fix typo
* fix typo 2
* use set_gpu_allocator from thinc 8.0.0a34
* default null instead of empty string
2020-09-19 01:17:02 +02:00
Adriane Boyd
47080fba98
Minor renaming / refactoring
...
* Rename loader to `spacy.LookupsDataLoader.v1`, add debugging message
* Make `Vocab.lookups` a property
2020-09-18 19:43:19 +02:00
svlandeg
73ff52b9ec
hack for tok2vec listener
2020-09-18 16:43:15 +02:00
Adriane Boyd
eed4b785f5
Load vocab lookups tables at beginning of training
...
Similar to how vectors are handled, move the vocab lookups to be loaded
at the start of training rather than when the vocab is initialized,
since the vocab doesn't have access to the full config when it's
created.
The option moves from `nlp.load_vocab_data` to `training.lookups`.
Typically these tables will come from `spacy-lookups-data`, but any
`Lookups` object can be provided.
The loading from `spacy-lookups-data` is now strict, so configs for each
language should specify the exact tables required. This also makes it
easier to control whether the larger clusters and probs tables are
included.
To load `lexeme_norm` from `spacy-lookups-data`:
```
[training.lookups]
@misc = "spacy.LoadLookupsData.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
```
2020-09-18 15:59:16 +02:00
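The "strict" loading described in this commit message can be illustrated with a small sketch. It is not the code behind `spacy.LoadLookupsData.v1`: the function `load_lookups_strict` and the plain-dict representation of tables are assumptions made for illustration.

```python
def load_lookups_strict(available, tables):
    """Return only the requested tables and fail if any of them is missing.

    available: dict mapping table names to table data (hypothetical stand-in
    for the tables shipped in a lookups package).
    tables: list of table names requested in the config.
    """
    missing = [name for name in tables if name not in available]
    if missing:
        raise ValueError(f"Missing lookups tables: {missing}")
    # Tables that were not requested (e.g. large clusters/probs) are left out.
    return {name: available[name] for name in tables}
```

Requesting the tables by exact name is what makes it easy to leave the larger tables out: anything not listed is simply never loaded.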
Ines Montani
a127fa475e
Merge pull request #6078 from svlandeg/fix/corpus
2020-09-18 14:44:21 +02:00
Matthew Honnibal
bbdb5f62b7
Temporary work-around for scoring a subset of components ( #6090 )
...
* Try hacking the scorer to work around sentence boundaries
* Upd scorer
* Set dev version
* Upd scorer hack
* Fix version
* Improve comment on hack
2020-09-18 14:26:42 +02:00
Adriane Boyd
a88106e852
Remove W106: HEAD and SENT_START in doc.from_array ( #6086 )
...
* Remove W106: HEAD and SENT_START in doc.from_array
This warning was hacky and being triggered too often.
* Fix test
2020-09-18 03:01:29 +02:00
svlandeg
e4fc7e0222
fixing output sample to proper 2D array
2020-09-17 22:34:36 +02:00
Adriane Boyd
8b650f3a78
Modify setting missing and blocked entity tokens
...
In order to make it easier to construct `Doc` objects as training data,
modify how missing and blocked entity tokens are set to prioritize
setting `O` and missing entity tokens for training purposes over setting
blocked entity tokens.
* `Doc.ents` setter sets tokens outside entity spans to `O` regardless
of the current state of each token
* For `Doc.ents`, setting a span with a missing label sets the `ent_iob`
to missing instead of blocked
* `Doc.block_ents(spans)` marks spans as hard `O` for use with the
`EntityRecognizer`
2020-09-17 21:27:42 +02:00
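A rough model of the setter behaviour listed in this commit message, written in plain Python rather than against the real `Doc` class. The function `set_ents_sketch`, the `(start, end, label)` tuples and the use of an empty string for a missing IOB value are all assumptions for illustration.

```python
def set_ents_sketch(n_tokens, spans):
    """Return one (iob, label) pair per token.

    spans: (start, end, label) tuples with an exclusive end index;
    an empty label stands for a span with a missing label.
    """
    # Tokens outside entity spans become "O" regardless of any earlier state.
    iob = ["O"] * n_tokens
    labels = [""] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            if not label:
                iob[i] = ""  # missing rather than blocked
            else:
                iob[i] = "B" if i == start else "I"
                labels[i] = label
    return list(zip(iob, labels))
```

Blocking spans as hard `O` is deliberately left out of the sketch, since the commit message assigns that to a separate method.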
Ines Montani
3865214343
Use consistent shortcut
2020-09-17 16:57:02 +02:00
svlandeg
35a3931064
fix typo
2020-09-17 16:36:27 +02:00
svlandeg
ddfc1fc146
add pretraining option to init config
2020-09-17 16:05:40 +02:00
svlandeg
427dbecdd6
cleanup and formatting
2020-09-17 11:48:04 +02:00
svlandeg
0c35885751
generalize corpora, dot notation for dev and train corpus
2020-09-17 11:38:59 +02:00
svlandeg
781fae678b
Merge remote-tracking branch 'upstream/develop' into fix/corpus
2020-09-17 09:24:36 +02:00
Matthew Honnibal
8303d101a5
Set version to v3.0.0a19
2020-09-17 00:18:49 +02:00
Adriane Boyd
7e4cd7575c
Refactor Doc.is_ flags ( #6044 )
...
* Refactor Doc.is_ flags
* Add derived `Doc.has_annotation` method
* `Doc.has_annotation(attr)` returns `True` for partial annotation
* `Doc.has_annotation(attr, require_complete=True)` returns `True` for
complete annotation
* Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced`
and `is_nered`
* Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs
for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The
list is the `DocBin` attributes list plus `SPACY` and `LENGTH`.
Notes on `Doc.has_annotation`:
* `HEAD` is converted to `DEP` because heads don't have an unset state
* Accept `IS_SENT_START` as a synonym of `SENT_START`
Additional changes:
* Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for
`DocBin`
* In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override
`SENT_START`
* In `Doc.from_array()` using `attrs` other than
`Doc._get_array_attrs()` (i.e., a user's custom list rather than our
default internal list) with both `HEAD` and `SENT_START` shows a warning
that `HEAD` will override `SENT_START`
* `set_children_from_heads` does not require dependency labels to set
sentence boundaries and sets `sent_start` for all non-sentence starts to
`-1`
* Fix call to set_children_from_heads
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-17 00:14:01 +02:00
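The difference between partial and complete annotation in `Doc.has_annotation` can be sketched as follows. This is a simplified model, not the method itself: the function below takes a plain list of per-token values, and using `None` to mean "unset" is an assumption for illustration.

```python
def has_annotation(values, require_complete=False):
    """values: one entry per token, with None standing in for an unset value."""
    set_flags = [value is not None for value in values]
    if require_complete:
        # Every token must carry a value.
        return all(set_flags)
    # A single annotated token is enough.
    return any(set_flags)
```

In this model a partially tagged document passes the default check but fails with `require_complete=True`, which matches the two bullet points in the commit message.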
Adriane Boyd
a119667a36
Clean up spacy.tokens ( #6046 )
...
* Clean up spacy.tokens
* Update `set_children_from_heads`:
* Don't check `dep` when setting lr_* or sentence starts
* Set all non-sentence starts to `False`
* Use `set_children_from_heads` in `Token.head` setter
* Reduce similar/duplicate code (admittedly adds a bit of overhead)
* Update sentence starts consistently
* Remove unused `Doc.set_parse`
* Minor changes:
* Declare cython variables (to avoid cython warnings)
* Clean up imports
* Modify set_children_from_heads to set token range
Modify `set_children_from_heads` so that it adjusts tokens within a
specified range rather than the whole document.
Modify the `Token.head` setter to adjust only the tokens affected by the
new head assignment.
2020-09-16 20:32:38 +02:00
Matthew Honnibal
c776594ab1
Fix
2020-09-16 18:15:14 +02:00
Matthew Honnibal
4a573d18b3
Add comment
2020-09-16 17:51:29 +02:00
Matthew Honnibal
d31afc8334
Fix Language.link_components when model is None
2020-09-16 17:49:48 +02:00
Adriane Boyd
f3db3f6fe0
Add vectors option to CharacterEmbed ( #6069 )
...
* Add vectors option to CharacterEmbed
* Update spacy/pipeline/morphologizer.pyx
* Adjust default morphologizer config
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-16 17:45:04 +02:00
Adriane Boyd
d722a439aa
Remove unneeded methods in senter and morphologizer ( #6074 )
...
Now that the tagger doesn't manage the tag map, the child classes senter
and morphologizer don't need to override the serialization methods.
2020-09-16 17:39:41 +02:00
Adriane Boyd
87c329c711
Set rule-based lemmatizers as default ( #6076 )
...
For languages without provided models and with lemmatizer rules in
`spacy-lookups-data`, make the rule-based lemmatizer the default:
Bengali, Persian, Norwegian, Swedish
2020-09-16 17:37:29 +02:00
svlandeg
1040e250d8
actual commit with test for custom readers with ml_datasets >= 0.2
2020-09-16 16:41:28 +02:00
svlandeg
714a5a05c6
test for custom readers with ml_datasets >= 0.2
2020-09-16 16:39:55 +02:00
svlandeg
0d1392340f
Merge remote-tracking branch 'upstream/develop' into fix/corpus
2020-09-15 23:17:08 +02:00
svlandeg
f420aa1138
use e.value to get to the ExceptionInfo value
2020-09-15 22:30:09 +02:00
svlandeg
7336657662
corpus is a Dict
2020-09-15 22:07:16 +02:00
svlandeg
51fa929f47
rewrite train_corpus to corpus.train in config
2020-09-15 21:58:04 +02:00
svlandeg
bd87e8686e
move tests to correct subdir
2020-09-15 21:40:38 +02:00
Ines Montani
aaf01689a1
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2020-09-15 14:24:42 +02:00
Ines Montani
91a6637f74
Remove extra pipe config values before merging
2020-09-15 14:24:17 +02:00
Ines Montani
d3d7f92f05
Fix lang check and error handling in Language.from_config
2020-09-15 14:24:06 +02:00
Ines Montani
2ed6e2a218
Auto-format
2020-09-15 14:20:04 +02:00
Ines Montani
2214d1bb7b
Merge pull request #6067 from explosion/feature/spacy-blank-from-config
2020-09-15 14:18:33 +02:00
Ines Montani
253ba5ef14
Raise for bad Vocab values
2020-09-15 13:25:34 +02:00
svlandeg
7677e5c0e2
fix wandb logger when calling multiple times from same script
2020-09-15 12:56:33 +02:00
Ines Montani
eff9406718
Support vocab arg in spacy.blank
2020-09-15 11:39:36 +02:00
Ines Montani
99549a5ace
Fix consistency and update docs
2020-09-15 11:37:37 +02:00
Ines Montani
7dfc4bc062
Allow overriding meta from spacy.blank
2020-09-15 11:12:12 +02:00
Ines Montani
0f943157af
Delegate to Language.from_config in spacy.blank
2020-09-15 11:07:55 +02:00
Ines Montani
e977086a9a
Update default pretraining config [ci skip]
2020-09-15 01:12:02 +02:00
Ines Montani
154752f9c2
Update docs and consistency [ci skip]
2020-09-15 00:32:49 +02:00
Ines Montani
9cc304c194
Merge pull request #6064 from explosion/fix/sparse-checkout-ux
...
Fix sparse checkout and error handling
2020-09-15 00:32:20 +02:00
Matthew Honnibal
475323cd36
Set version to v3.0.0a18
2020-09-14 22:05:43 +02:00
Matthew Honnibal
e8378b57bc
Fix test
2020-09-14 21:21:13 +02:00
Matthew Honnibal
adf0bab23a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2020-09-14 21:04:49 +02:00
Matthew Honnibal
ae15fa9688
Fix iob converter
2020-09-14 21:02:18 +02:00
Sofie Van Landeghem
3216a33149
positive_label config for textcat ( #6062 )
...
* hook up positive_label in textcat
* unit tests
* documentation
* formatting
* tests
* fix typo
* move verify_config to after begin_training
* revert accidental commit
2020-09-14 17:08:00 +02:00
Ines Montani
c052017025
Fix sparse checkout and error handling
2020-09-14 14:12:58 +02:00
Matthew Honnibal
fdd2340f6c
Set version to v3.0.0a17
2020-09-13 23:52:03 +02:00
Ines Montani
416deb412f
Prevent duplicate traceback on CalledProcessError [ci skip]
2020-09-13 19:28:54 +02:00
Ines Montani
61a4ef0b46
Fix syntax error
2020-09-13 19:23:09 +02:00
Matthew Honnibal
b693d2d224
Fix speed report in table
2020-09-13 17:39:31 +02:00
Sofie Van Landeghem
744df9814a
define threshold for scoring textcat in TextCat config ( #6055 )
...
* define threshold for scoring textcat in TextCat config
* fix unit test and documentation
2020-09-13 14:15:52 +02:00
Adriane Boyd
ab270364f1
Modify Token.morph to enable unsetting ( #6043 )
...
Modify `Token.morph` property so that `Token.c.morph` can be reset back
to an internal value of `0`. Allow setting `Token.morph` from a hash as
long as the morph string is already in the `StringStore`, setting it
indirectly through `Token.morph_` so that the value is added to the
morphology. If the hash is not in the `StringStore`, raise an error.
2020-09-13 14:06:07 +02:00
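The rule described in this commit message (a value of `0` unsets the morph, a known hash is accepted, an unknown hash raises) can be modelled in a few lines. Everything here is hypothetical: `TinyStringStore`, `TinyToken`, `set_morph` and the CRC-based hash are stand-ins for illustration and do not reflect how spaCy hashes strings or stores morphology.

```python
import zlib


class TinyStringStore:
    """Hypothetical hash-to-string table."""

    def __init__(self):
        self._strings = {}

    def add(self, text):
        # Offset by 1 so that 0 stays free to mean "unset".
        key = zlib.crc32(text.encode("utf8")) + 1
        self._strings[key] = text
        return key

    def __contains__(self, key):
        return key in self._strings


class TinyToken:
    """Hypothetical token holding a morph hash, with 0 meaning unset."""

    def __init__(self, strings):
        self.strings = strings
        self.morph = 0

    def set_morph(self, key):
        if key == 0:
            self.morph = 0  # reset to the internal "unset" value
        elif key in self.strings:
            self.morph = key
        else:
            raise ValueError(f"Hash {key} is not in the string store")
```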
Adriane Boyd
c7bd631b5f
Fix token.idx for special cases with affixes ( #6035 )
2020-09-13 14:05:36 +02:00
Matthew Honnibal
54c40223a1
Improve v3 pretrain command ( #6040 )
...
* Starts to run
* Update pretrain script
* Update corpus
* Update pretrain schema
* Remove outdated test
* Make JsonlTexts produce Example objects.
2020-09-13 14:05:05 +02:00
Ines Montani
febb99916d
Tidy up and auto-format [ci skip]
2020-09-13 10:55:36 +02:00
Ines Montani
a5633b205f
Fix handling of errors around git [ci skip]
2020-09-13 10:52:28 +02:00
Ines Montani
f8846c198d
Update types and docstrings
2020-09-13 10:52:02 +02:00
Sofie Van Landeghem
e92e850c72
Raise if empty examples ( #6052 )
...
* raise error if no valid Example objects were found during initialization
* fix max_length parameter
* remove commit from other branch
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-12 21:01:53 +02:00
Matthew Honnibal
37347830d4
Fix reading in GloVe vectors
2020-09-12 17:31:18 +02:00
Ines Montani
b41be87213
Merge pull request #6051 from svlandeg/feature/cli-config
2020-09-12 17:12:35 +02:00
Ines Montani
eedaaaec75
Fix handling of existing asset without checksum [ci skip]
2020-09-12 17:02:53 +02:00
svlandeg
a75cfe0da6
Merge remote-tracking branch 'upstream/develop' into feature/cli-config
2020-09-12 14:44:40 +02:00
svlandeg
115147804a
string_to_list to parse comma-separated string into a list
2020-09-12 14:43:22 +02:00
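A minimal sketch of what parsing a comma-separated string into a list might look like. It shares its name with the helper mentioned in the commit message, but the body is an assumption and not the code that was committed; in particular, stripping whitespace and dropping empty items are choices made here for illustration.

```python
def string_to_list(value):
    """Parse a comma-separated string such as "tagger, parser,ner" into a list."""
    if not value:
        return []
    # Strip surrounding whitespace and skip empty items (e.g. from "a,,b").
    return [item.strip() for item in value.split(",") if item.strip()]
```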
Ines Montani
f886f5bbc8
Merge pull request #6048 from explosion/fix/clone-compat
2020-09-12 10:30:49 +02:00
svlandeg
711166a75a
prevent overwriting score_weights
2020-09-11 15:12:05 +02:00
Ines Montani
62eec33bc4
Fix meta.json validation
2020-09-11 11:38:33 +02:00
Ines Montani
0b2e07215d
Support overwriting name on spacy package
2020-09-11 11:38:28 +02:00
svlandeg
5b94aeece9
support pipeline as "list in string"
2020-09-11 11:08:46 +02:00
Ines Montani
1bce432b4a
Adjust message [ci skip]
2020-09-11 10:00:49 +02:00
Ines Montani
5acd4fbcd8
Merge branch 'develop' into fix/clone-compat
2020-09-11 09:58:30 +02:00
Ines Montani
761bd60d43
Adjust info message
2020-09-11 09:57:00 +02:00
Ines Montani
6831161bfa
Resolve path to be extra sure
2020-09-11 09:56:49 +02:00
svlandeg
1723fb73c4
remove clutter
2020-09-10 17:44:59 +02:00
svlandeg
08a831ce83
process trailing slash if any
2020-09-10 17:39:52 +02:00
Ines Montani
3e83a509bb
WIP: fix project clone compatibility
2020-09-10 15:49:13 +02:00
svlandeg
f1bc09c1e9
restore partly
2020-09-10 14:53:02 +02:00
svlandeg
3889747119
asset fix & UX
2020-09-10 14:36:53 +02:00
svlandeg
a36766d153
hookup branch
2020-09-10 12:00:34 +02:00
svlandeg
97d99f7efa
Merge remote-tracking branch 'upstream/develop' into feature/doc-fixes
2020-09-10 11:51:34 +02:00
Ines Montani
908f3a4494
Update default projects repo [ci skip]
2020-09-10 11:42:14 +02:00
svlandeg
92f9d2f406
small UX fixes
2020-09-10 11:35:50 +02:00
svlandeg
1fc5486792
more fine-grained errors for git_sparse_checkout
2020-09-10 11:31:32 +02:00
Ines Montani
15bc3a37b4
Add --branch to project clone
2020-09-10 11:08:15 +02:00
Ines Montani
1955aaaa20
Merge pull request #6045 from svlandeg/feature/more-layers-docs [ci skip]
2020-09-09 21:46:40 +02:00
Sofie Van Landeghem
cb66ea7400
Remove simple_ner code ( #6041 )
...
* remove simple_ner code
* remove unused _biluo and _iob files
2020-09-09 16:11:27 +02:00
svlandeg
39aa740777
Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs
2020-09-09 11:59:34 +02:00
Sofie Van Landeghem
8e7557656f
Renaming gold & annotation_setter ( #6042 )
...
* version bump to 3.0.0a16
* rename "gold" folder to "training"
* rename 'annotation_setter' to 'set_extra_annotations'
* formatting
2020-09-09 10:31:03 +02:00
Sofie Van Landeghem
60f22e1800
Pipe API ( #6034 )
...
* ensure Language passes on valid examples for initialization
* fix tagger model initialization
* check for valid get_examples across components
* assume labels were added before begin_training
* fix senter initialization
* fix morphologizer initialization
* use methods to check arguments
* test textcat init, requires thinc>=8.0.0a31
* fix tok2vec init
* fix entity linker init
* use islice
* fix simple NER
* cleanup debug model
* fix assert statements
* fix tests
* throw error when adding a label if the output layer can't be resized anymore
* fix test
* add failing test for simple_ner
* UX improvements
* morphologizer UX
* assume begin_training gets a representative set and processes the labels
* remove assumptions for output of untrained NER model
* restore test for original purpose
2020-09-08 22:44:25 +02:00
svlandeg
d0a8849e4d
fix typo
2020-09-08 18:32:12 +02:00
svlandeg
bd8f9b188b
small fixes
2020-09-08 17:24:36 +02:00
Matthew Honnibal
4b82882767
Fix defaults
2020-09-08 15:31:21 +02:00
Matthew Honnibal
5d09e3e154
Set version to v3.0.0a15
2020-09-08 15:25:10 +02:00
Matthew Honnibal
ba5f4c9b32
Add words and seconds to train info
2020-09-08 15:24:47 +02:00
Matthew Honnibal
b470062153
Add CLI registry ( #6037 )
2020-09-08 15:23:34 +02:00
svlandeg
06ef66fd73
Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs
2020-09-08 10:28:42 +02:00
Matthew Honnibal
dae22f3dfa
Fix ignoring of punct labels
2020-09-05 14:11:59 +02:00
Matthew Honnibal
12e1279f6b
Set version to v3.0.0a14
2020-09-05 04:13:53 +02:00
Matthew Honnibal
4b7abaafdb
Fix learn rate for non-transformer
2020-09-04 21:22:50 +02:00
Matthew Honnibal
465785a672
Fix project pull and push
2020-09-04 21:15:55 +02:00
Ines Montani
f174c7b1f3
Merge branch 'develop' into pr/6018
2020-09-04 15:54:49 +02:00
Ines Montani
f06eed800e
Merge pull request #6029 from explosion/master-tmp
2020-09-04 15:11:55 +02:00
Ines Montani
f9550b4493
Fix components in meta.json and website [ci skip]
2020-09-04 14:42:12 +02:00
Ines Montani
d7cc2ee72d
Fix tests
2020-09-04 14:05:55 +02:00
Ines Montani
90043a6f9b
Tidy up and auto-format
2020-09-04 13:42:33 +02:00
Ines Montani
df0b68f60e
Remove unicode declarations and update language data
2020-09-04 13:19:16 +02:00
Ines Montani
ba600f91c5
Tidy up imports
2020-09-04 13:15:44 +02:00
Ines Montani
864a697e63
Merge branch 'develop' into master-tmp
2020-09-04 13:15:36 +02:00