Commit Graph

8480 Commits

Author SHA1 Message Date
Ines Montani c4993f16d0
Merge pull request #6651 from svlandeg/bugfix/cli_info 2021-01-05 13:44:26 +11:00
Ines Montani 991669c934 Tidy up and auto-format 2021-01-05 13:41:53 +11:00
Adriane Boyd b57be94c78
Fix memory issues in Language.evaluate (#6386)
* Fix memory issues in Language.evaluate

Reset annotation in predicted docs before evaluating and store all data
in `examples`.

* Minor refactor to docs generator init

* Fix generator expression

* Fix final generator check

* Refactor pipeline loop

* Handle examples generator in Language.evaluate

* Add test with generator

* Use make_doc
2020-12-31 10:45:50 +11:00
svlandeg a6a68da673 unskipping tests with python >= 3.6 2020-12-30 18:46:43 +01:00
svlandeg d5ff0fecf8 add docs 2020-12-30 14:01:13 +01:00
svlandeg c74ab6a313 fix imports 2020-12-30 12:40:12 +01:00
svlandeg 712a78b74a add simple unit test 2020-12-30 12:35:26 +01:00
svlandeg 4347e6d39b fixes for CLI info command 2020-12-30 12:05:58 +01:00
svlandeg 62b4fe118f prevent overwriting existing config file 2020-12-29 15:40:22 +01:00
Adam Bittlingmayer f2fe60bacf
Update tokenizer_exceptions.py
See https://github.com/explosion/spaCy/pull/6643
2020-12-29 16:05:11 +04:00
Adriane Boyd 5ca57d8221
Add logger warning when serializing user hooks (#6595)
Add a warning that user hooks are lost on serialization.

Add a `user_hooks` exclude to skip the warning with pickle.
2020-12-29 11:54:32 +01:00
Yosi cf52510631
Add Amharic አማርኛ Language support (#6583)
* Add Amharic to space

* clean up

* Add some PRON_LEMMA

* add Tigrinya support

* remove text_noun_chunks

* Tigrinya Support

* added some more details for ti

* fix unit test

* add amharic char range

* changes from review

* amharic and tigrinya share same unicode block

* get rid of _amharic/_tigrinya in char_classes

Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>
2020-12-22 16:50:34 +01:00
Tim Gates 292c1d6a73
docs: fix simple typo, speficied -> specified (#6611)
There is a small typo in spacy/cli/info.py.

Should read `specified` rather than `speficied`.
2020-12-22 09:14:10 +01:00
Adriane Boyd cabd4ae5b1
Use logger.warning instead of logger.warn (#6596)
Use `logger.warning` instead of deprecated `logger.warn`.
2020-12-21 08:25:10 +08:00
Sofie Van Landeghem 282a3b49ea
Fix parser resizing when there is no upper layer (#6460)
* allow resizing of the parser model even when upper=False

* update from spacy.TransitionBasedParser.v1 to v2

* bugfix
2020-12-18 18:56:57 +08:00
Sofie Van Landeghem 0a923a7915
Tagger robustness (#6580)
* require labels in taggers

* ensure tagger works with incomplete data
2020-12-18 18:51:47 +08:00
Adriane Boyd e10295c9fd
Fix memory leak when adding empty morph (#6581)
Fix lookup of empty morph in the morphology table, which fixes a memory
leak where a new morphology tag was allocated each time the empty morph
tag was added.
2020-12-18 18:51:01 +08:00
Ines Montani e9b0963827
Merge pull request #6333 from adrianeboyd/chore/python39 2020-12-17 22:11:57 +11:00
Ines Montani 47c1ec678b Merge branch 'develop' into pr/6333 2020-12-17 10:19:28 +11:00
Ines Montani 3f90bffa27
Merge pull request #6571 from adrianeboyd/bugfix/debug-data-missing-vectors
Fix alignment and vector checks in debug data
2020-12-17 10:10:47 +11:00
Thomas Bird cbb8c66da3 prevent the root logger from inialising 2020-12-15 19:50:34 +00:00
Adriane Boyd 1ddf2f39c7
Switch converters to generator functions (#6547)
* Switch converters to generator functions

To reduce the memory usage when converting large corpora, refactor the
convert methods to be generator functions.

* Update tests
2020-12-15 16:47:16 +08:00
Adriane Boyd 20e18cc246 Fix alignment and vector checks in debug data
* Update token alignment check to use Example alignment
* Update missing vector check further related to changes in v3
2020-12-15 09:43:14 +01:00
Matthew Honnibal 8656a08777
Add beam_parser and beam_ner components for v3 (#6369)
* Get basic beam tests working

* Get basic beam tests working

* Compile _beam_utils

* Remove prints

* Test beam density

* Beam parser seems to train

* Draft beam NER

* Upd beam

* Add hypothesis as dev dependency

* Implement missing is-gold-parse method

* Implement early update

* Fix state hashing

* Fix test

* Fix test

* Default to non-beam in parser constructor

* Improve oracle for beam

* Start refactoring beam

* Update test

* Refactor beam

* Update nn

* Refactor beam and weight by cost

* Update ner beam settings

* Update test

* Add __init__.pxd

* Upd test

* Fix test

* Upd test

* Fix test

* Remove ring buffer history from StateC

* WIP change arc-eager transitions

* Add state tests

* Support ternary sent start values

* Fix arc eager

* Fix NER

* Pass oracle cut size for beam

* Fix ner test

* Fix beam

* Improve StateC.clone

* Improve StateClass.borrow

* Work directly with StateC, not StateClass

* Remove print statements

* Fix state copy

* Improve state class

* Refactor parser oracles

* Fix arc eager oracle

* Fix arc eager oracle

* Use a vector to implement the stack

* Refactor state data structure

* Fix alignment of sent start

* Add get_aligned_sent_starts method

* Add test for ae oracle when bad sentence starts

* Fix sentence segment handling

* Avoid Reduce that inserts illegal sentence

* Update preset SBD test

* Fix test

* Remove prints

* Fix sent starts in Example

* Improve python API of StateClass

* Tweak comments and debug output of arc eager

* Upd test

* Fix state test

* Fix state test
2020-12-13 09:08:32 +08:00
Ines Montani 513c4e332a
Include custom code via spacy package command (#6531) 2020-12-10 20:36:46 +08:00
Adriane Boyd 7b277661f6 Set version to v2.3.5 2020-12-10 13:32:10 +01:00
Ines Montani 2a6043fabb
Merge pull request #6530 from explosion/feature/init-config-cpu-gpu 2020-12-10 09:38:46 +11:00
Ines Montani 9d32e839d3 Merge branch 'develop' into feature/init-config-cpu-gpu 2020-12-10 08:50:53 +11:00
Adriane Boyd 6ee6e41234 Update docstring for Language.evaluate 2020-12-09 10:21:39 +01:00
Adriane Boyd fa8fa474a3 Add nlp.batch_size setting
Add a default `batch_size` setting for `Language.pipe` and
`Language.evaluate` as `nlp.batch_size`.
2020-12-09 09:13:26 +01:00
Ines Montani f2571b5ec4
Merge pull request #6444 from adrianeboyd/chore/update-develop-from-master 2020-12-09 13:09:58 +11:00
Ines Montani 90171f2031
Merge pull request #6528 from svlandeg/feature/pipe_fill_config 2020-12-09 12:01:22 +11:00
Ines Montani dfaef27f90
Merge pull request #6503 from adrianeboyd/feature/lemmatizer-rule-warning-pos
Warn on empty POS for the rule-based lemmatizer
2020-12-09 11:34:16 +11:00
Ines Montani 271923eaea Fix retokenizer 2020-12-09 11:29:55 +11:00
Ines Montani b85bd63eca Fix test 2020-12-09 11:24:01 +11:00
Ines Montani febf71af28 Fix test 2020-12-09 11:23:07 +11:00
Ines Montani 1da1568110 Remove tag map 2020-12-09 11:13:49 +11:00
Ines Montani 1980203229 Merge branch 'master' into pr/6444 2020-12-09 11:09:40 +11:00
Ines Montani 05a2812ae0 Merge branch 'develop' into pr/6444 2020-12-09 11:04:03 +11:00
Ines Montani 758ad6c3cd Make CPU the default for init config 2020-12-09 11:00:51 +11:00
Ines Montani 5d605d539d Remove output_file from init_config helper 2020-12-09 10:57:55 +11:00
Sofie Van Landeghem cfc72c2995
Bugfix multi-label textcat reproducibility (#6481)
* add test for multi-label textcat reproducibility

* remove positive_label

* fix lengths dtype

* fix comments

* remove comment that we should not have forgotten :-)
2020-12-09 06:29:15 +08:00
Sofie Van Landeghem de108ed3e8
Add specific error when StaticVectors can't read the vectors data (#6450) 2020-12-09 06:16:07 +08:00
Koichi Yasuoka 0afb54ac93
JapaneseTokenizer.pipe added (#6515)
* JapaneseTokenizer.pipe added

For [spacymoji](https://spacy.io/universe/project/spacymoji)  with `Japanese()`.

* DummyTokenizer.pipe added instead
2020-12-08 20:02:23 +01:00
svlandeg 8f8a7f1733 returning config in init_config 2020-12-08 17:37:20 +01:00
Ines Montani 8921364579
Merge pull request #6521 from explosion/feature/config-stdin
Allow reading config from stdin in spacy train
2020-12-08 22:07:43 +11:00
Ines Montani 6c7a930ee8 Fix variable 2020-12-08 20:44:59 +11:00
Ines Montani 94a5a9814f Update argument handling and documentation 2020-12-08 20:41:18 +11:00
Adriane Boyd 6c221d4841 Fix subsequent pipe detection in EntityRuler
Fix subsequent pipe detection to detect the position of the current
object by comparing the component itself rather than from the factory
name.
2020-12-08 10:01:30 +01:00
Adriane Boyd 5ceac425ee Remove non-working --use-chars from train CLI
Remove the non-working `--use-chars` option from the train CLI. The
implementation of the option across component types and the CLI settings
could be fixed, but the `CharacterEmbed` model does not work on GPU in
v2 so it's better to remove it.
2020-12-08 08:30:00 +01:00
Ines Montani d25b1606d6 Allow reading config from sdtin in spacy train 2020-12-08 18:01:40 +11:00
Ines Montani 6cfa66ed1c
Make training.loop return nlp object and path (#6520) 2020-12-08 14:55:55 +08:00
Sofie Van Landeghem 2c27093c5f
require_cpu functionality (#6336)
* add require_cpu from Thinc 8.0.0rc2

* add docs

* fix test if cupy is not installed
2020-12-08 14:42:40 +08:00
Sofie Van Landeghem f98a04434a
pretrain architectures (#6451)
* define new architectures for the pretraining objective

* add loss function as attr of the omdel

* cleanup

* cleanup

* shorten name

* fix typo

* remove unused error
2020-12-08 14:41:03 +08:00
Adriane Boyd 29b058ebdc
Fix spacy when retokenizing cases with affixes (#6475)
Preserve `token.spacy` corresponding to the span end token in the
original doc rather than adjusting for the current offset.

* If not modifying in place, this checks in the original document
(`doc.c` rather than `tokens`).
* If modifying in place, the document has not been modified past the
current span start position so the value at the current span end
position is valid.
2020-12-08 14:25:56 +08:00
Adriane Boyd 4448680750
Fix alignment for 1-to-1 tokens and lowercasing (#6476)
* When checking for token alignments, check not only that the tokens are
identical but that the character positions are both at the start of a
token.

  It's possible for the tokens to be identical even though the two
tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs.
`["a", "''", "'"]`, where the middle tokens are identical but should not
be aligned on the token level at character position 2 since it's the
start of one token but the middle of another.

* Use the lowercased version of the token texts to create the
character-to-token alignment because lowercasing can change the string
length (e.g., for `İ`, see the not-a-bug bug report:
https://bugs.python.org/issue34723)
2020-12-08 14:25:16 +08:00
Adriane Boyd e931d3f72b
Move max_length to nlp.make_doc() (#6512)
Move max_length check to `nlp.make_doc()` so that's it's also checked
for `nlp.pipe()`.
2020-12-08 14:24:02 +08:00
Ines Montani ee2ec52f48
Merge pull request #6409 from svlandeg/feature/trf-docs 2020-12-08 06:32:10 +01:00
Ines Montani 82e88f0e3b
Merge pull request #6379 from svlandeg/fix/labels-constructor 2020-12-08 06:29:56 +01:00
Adriane Boyd d70950605c Warn on empty POS for the rule-based lemmatizer
Add a warning to the rule-based lemmatizer for any tokens without POS
annotation.
2020-12-04 11:46:15 +01:00
Adriane Boyd 78085fab1f
Check for spacy-nightly package in download (#6502)
Also check for spacy-nightly in download so that `--no-deps` isn't set
for normal nightly installs.
2020-12-04 09:40:03 +01:00
Ines Montani 63f83e7034
Merge pull request #6470 from adrianeboyd/feature/license-in-package 2020-12-04 03:55:54 +01:00
Sofie Van Landeghem d6c616a125
Fixes in test suite (#6457)
* fix slow test for textcat readers

* cleanup test_issue5551

* add explicit score weight

* cleanup
2020-12-02 12:57:08 +01:00
Adriane Boyd 31ec9a906e
Clean up 3rd party license info (#6478)
Move scikit-learn license from `Scorer` to
`licenses/3rd_party_licenses.txt`.
2020-12-02 10:15:23 +01:00
Adriane Boyd 591cd48aa8 Remove config.cfg from MANIFEST 2020-12-01 12:58:02 +01:00
Adriane Boyd b0dd13e0ba Support LICENSE in spacy package
If present, include the file `input_dir/LICENSE` at the top level of the
packaged model.
2020-11-30 13:43:58 +01:00
Adriane Boyd 53c0fb7431
Only set NORM on Token in retokenizer (#6464)
* Only set NORM on Token in retokenizer

Instead of setting `NORM` on both the token and lexeme, set `NORM` only
on the token.

The retokenizer tries to set all possible attributes with
`Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate
which attributes are available for each. `NORM` is the only attribute
that's stored on both and for most cases it doesn't make sense to set
the global norms based on a individual retokenization. For lexeme-only
attributes like `IS_STOP` there's no way to avoid the global side
effects, but I think that `NORM` would be better only on the token.

* Fix test
2020-11-30 09:35:42 +08:00
Adriane Boyd 03ae77e603
Add SPACY as a Matcher attribute (#6463) 2020-11-30 09:34:50 +08:00
Sofie Van Landeghem 079f6ea474
avoid resolving the full config (#6465) 2020-11-30 09:34:29 +08:00
Ines Montani 9beba7164f Make jinja2 top-level import
No problem anymore since it's now an official dependency
2020-11-27 15:17:14 +08:00
Adriane Boyd 26296ab223
Add error message if DocBin zlib decompress fails (#6394)
Add a better error message if DocBin zlib decompress fails, indicating
that the data is not in `DocBin` format.
2020-11-27 14:39:49 +08:00
Adriane Boyd 3a5cc5f8b4 Set version to v2.3.4 2020-11-26 08:48:52 +01:00
Adriane Boyd e0f5646a4a
Restore cleanup_beam method (#6446) 2020-11-25 13:21:48 +01:00
Adriane Boyd cf693f0eae Fix token_match in tokenizer 2020-11-25 11:49:34 +01:00
Adriane Boyd 724831b066 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master
* Update Macedonian for v3
* Update Turkish for v3
2020-11-25 11:49:34 +01:00
Adriane Boyd 573f5c863f
Fix tag map clobbering in spacy train (#6437)
Fix bug from #5768 where the tag map is clobbered if a custom tag map
isn't provided.
2020-11-24 13:13:16 +01:00
Adriane Boyd ce18fc6588 Set version to v2.3.3 2020-11-24 10:03:45 +01:00
Adriane Boyd cd61d264ef Set version to v2.3.3.dev0 2020-11-23 13:51:59 +01:00
Sofie Van Landeghem 2af31a8c8d
Bugfix textcat reproducibility on GPU (#6411)
* add seed argument to ParametricAttention layer

* bump thinc to 7.4.3

* set thinc version range

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-11-23 12:29:35 +01:00
Adriane Boyd 3f61f5eb54
Use int8_t instead of char in Matcher (#6413)
* Use signed char instead of char in Matcher

Remove unused char* utf8_t typedef

* Use int8_t instead of signed char
2020-11-23 10:26:47 +01:00
Adriane Boyd 4284605683
Remove Beam cleanup (#6414)
Beam cleanup is handled through the Beam finalization method.
2020-11-23 10:01:46 +01:00
Adriane Boyd a8c2dad466
Add all vectors to vocab before pruning (#6408)
Add all vectors to the vocab before pruning to correct the selection of
vectors to prioritize.
2020-11-23 10:00:59 +01:00
svlandeg 636be3c791 Merge remote-tracking branch 'upstream/develop' into feature/trf-docs 2020-11-19 14:15:35 +01:00
svlandeg 73fc1ed963 remove labels from morphologizer constructor 2020-11-11 21:48:50 +01:00
svlandeg d5a920325f remove labels from constructor 2020-11-11 21:34:12 +01:00
Adriane Boyd 320a8b1481
Add ent_id_ to strings serialized with Doc (#6353) 2020-11-10 20:16:07 +08:00
Adriane Boyd a7e7d6c6c9
Ignore misaligned in Morphologizer.get_loss (#6363)
Fix bug where `Morphologizer.get_loss` treated misaligned annotation as
`EMPTY_MORPH` rather than ignoring it. Remove unneeded default `EMPTY_MORPH`
mappings.
2020-11-10 20:15:09 +08:00
Sofie Van Landeghem a0c899a0ff
Fix textcat + transformer architecture (#6371)
* add pooling to textcat TransformerListener

* maybe_get_dim in case it's null
2020-11-10 20:14:47 +08:00
Ines Montani de6453940e
Merge pull request #6305 from svlandeg/feature/score-docs [ci skip] 2020-11-10 02:52:11 +01:00
Ines Montani d7950c5ada
Merge pull request #6297 from adrianeboyd/docs/nightly-conda-install [ci skip] 2020-11-10 02:45:52 +01:00
svlandeg 789fb3d124 add docs for upstream argument of TransformerListener 2020-11-09 21:42:58 +01:00
Ines Montani 363ac73c72 Update docs [ci skip] 2020-11-09 12:43:26 +08:00
Daniel Vasic 20d72de986
Added Multext-East V5 tagset for Croatian language (#6248)
* Added Multext-East V5 tagset for Croatian language

* Create danielvasic.md

* Update danielvasic.md

* Update danielvasic.md

* Add tag map to CroatianDefaults

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-11-05 12:19:22 +01:00
Robert Šípek 6069efe57d
Add tag map to cs language (#6284) 2020-11-05 10:13:11 +01:00
Vu Ha 6d465ec52c
add oprd to the list of accepted deps for noun chunking (#6302)
* add oprd to the list of accepted deps for noun chunking

* add SCA
2020-11-05 09:17:35 +01:00
Adriane Boyd 31de700b0f
Fix on_match callback and remove empty patterns (#6312)
For the `DependencyMatcher`:

* Fix on_match callback so that it is called once per matched pattern
* Fix results so that patterns with empty match lists are not returned
2020-11-05 09:16:26 +01:00
Sofie Van Landeghem 8ef056cf98
fix embed_size in Entity Linker architecture (#6343) 2020-11-04 22:20:13 +01:00
Adriane Boyd 084fc575aa Set version to v3.0.0rc3 2020-11-03 17:29:57 +01:00
Adriane Boyd 1c4df8fd09
Replace pytokenizations with internal alignment (#6293)
* Replace pytokenizations with internal alignment

Replace pytokenizations with internal alignment algorithm that is
restricted to only allow differences in whitespace and capitalization.

* Rename `spacy.training.align` to `spacy.training.alignment` to contain
the `Alignment` dataclass
* Implement `get_alignments` in `spacy.training.align`

* Refactor trailing whitespace handling

* Remove unnecessary exception for empty docs

Allow a non-empty whitespace-only doc to be aligned with an empty doc

* Remove empty docs exceptions completely
2020-11-03 16:24:38 +01:00
Adriane Boyd a4b32b9552
Handle missing reference values in scorer (#6286)
* Handle missing reference values in scorer

Handle missing values in reference doc during scoring where it is
possible to detect an unset state for the attribute. If no reference
docs contain annotation, `None` is returned instead of a score. `spacy
evaluate` displays `-` for missing scores and the missing scores are
saved as `None`/`null` in the metrics.

Attributes without unset states:

* `token.head`: relies on `token.dep` to recognize unset values
* `doc.cats`: unable to handle missing annotation

Additional changes:

* add optional `has_annotation` check to `score_scans` to replace
`doc.sents` hack
* update `score_token_attr_per_feat` to handle missing and empty morph
representations
* fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START`
vs. `SENT_START`

* Fix import

* Update return types
2020-11-03 15:47:18 +01:00
Adriane Boyd 5d2cb86c34
Fix on_match callback for DependencyMatcher (#6313)
Fix `DependencyMatcher` so that the callback is called only once per
match.
2020-10-31 12:20:27 +01:00
Adriane Boyd 45c9a68828
Identify final Matcher pattern node by quantifier (#6317)
Modify the internal pattern representation in `Matcher` patterns to
identify the final ID state using a unique quantifier rather than a
combination of other attributes.

It was insufficient to identify the final ID node based on an
uninitialized `quantifier` (coincidentally being the same as the `ZERO`)
with `nr_attr` as 0. (In addition, it was potentially bug-prone that
`nr_attr` was set to 0 even though attrs were allocated.)

In the case of `{"OP": "!"}` (a valid, if pointless, pattern), `nr_attr`
is 0 and the quantifier is ZERO, so the previous methods for
incrementing to the ID node at the end of the pattern weren't able to
distinguish the final ID node from the `{"OP": "!"}` pattern.
2020-10-31 12:18:48 +01:00
Sofie Van Landeghem 2918923541
fix resolving of dot notation (#6326) 2020-10-31 12:17:06 +01:00
Duygu Altinok 0e55f806dd
Turkish tokenization improvements (#6268)
* added single and paired orth variants

* added token match

* added long text tokenization test

* inverted init

* normalized lemmas to lowercase

* more abbrevs

* tests for ordinals and abbrevs

* separated period abbvrevs to another list

* fiex typo

* added ordinal and abbrev tests

* added number tests for dates

* minor refinement

* added inflected abbrevs regex

* added percentage and inflection

* cosmetics

* added token match

* added url inflection tests

* excluded url tokens from custom pattern

* removed url match import
2020-10-29 09:43:17 +01:00
svlandeg 080066ae74 remove TODO note 2020-10-26 10:37:25 +01:00
Ines Montani 2c9804038d Fix success message [ci skip] 2020-10-23 16:11:54 +02:00
Adriane Boyd 4299a7f654 Setup / install / quickstart updates
* Add `cuda110` to setup.cfg and quickstart dropdown
* Switch to `pip` for pip-only packages in conda quickstart instructions
* Update zh pkuseg install message with version range and conda
* Remove `zh` from `extras_require` because the default doesn't require
additional packages
2020-10-23 11:27:54 +02:00
Adriane Boyd 563a21834e Save raw scores in evaluate output 2020-10-19 15:49:09 +02:00
Adriane Boyd dd207ca6d0 Add dep_las_per_type and more generic PRF printer 2020-10-19 15:49:02 +02:00
Adriane Boyd 4300858ecb Include per-type/feat scores in evaluate output 2020-10-19 15:48:55 +02:00
Sofie Van Landeghem 75a202ce65
TextCat updates and fixes (#6263)
* small fix in example imports

* throw error when train_corpus or dev_corpus is not a string

* small fix in custom logger example

* limit macro_auc to labels with 2 annotations

* fix typo

* also create parents of output_dir if need be

* update documentation of textcat scores

* refactor TextCatEnsemble

* fix tests for new AUC definition

* bump to 3.0.0a42

* update docs

* rename to spacy.TextCatEnsemble.v2

* spacy.TextCatEnsemble.v1 in legacy

* cleanup

* small fix

* update to 3.0.0rc2

* fix import that got lost in merge

* cursed IDE

* fix two typos
2020-10-18 14:50:41 +02:00
Ines Montani 5a6ed01ce0
Merge pull request #6262 from adrianeboyd/bugfix/template-en-vectors 2020-10-16 15:38:08 +02:00
Adriane Boyd c8d04b79e2 Sort and add vectors for langs without transformers 2020-10-16 08:25:16 +02:00
Adriane Boyd 2fbd43c603 Use core lg models as vectors models in quickstart 2020-10-16 08:17:53 +02:00
Jan Margeta 1ad2213349 Fix TokenPatternSchema pattern field validation
Empty pattern field should be considered invalid

This is fixed by replacing minItems with min_items
as described in Pydantic docs:
https://pydantic-docs.helpmanual.io/usage/schema/
2020-10-16 00:41:21 +02:00
Borijan Georgievski 2311192ba1
Include Macedonian language (#6230)
* Include Macedonian language

* Fix indentation at char_classes.py

* Fix indentation at char_classes.py

* Add Macedonian tests, update lex_attrs and char_classes

* Import unicode literals for python 2
2020-10-15 15:55:01 +02:00
Ines Montani ff4267d181 Fix success message [ci skip] 2020-10-15 14:42:08 +02:00
Ines Montani 10611bf56a Increment version [ci skip] 2020-10-15 13:30:11 +02:00
Ines Montani 4e17ddf75e
Merge pull request #6256 from adrianeboyd/bugfix/docs-to-json-raw 2020-10-15 10:35:01 +02:00
Ines Montani b1d568a4df Tidy up tests 2020-10-15 10:20:21 +02:00
Ines Montani d165af26be Auto-format [ci skip] 2020-10-15 10:08:53 +02:00
Adriane Boyd a93d42861d Use null raw for has_unknown_spaces in docs_to_json 2020-10-15 09:57:54 +02:00
Ines Montani 5665a21517 Tidy up 2020-10-15 09:30:32 +02:00
Ines Montani 5d62499266 Fix tests 2020-10-15 09:29:15 +02:00
Ines Montani 178760855f Merge branch 'develop' into master-tmp 2020-10-15 09:06:03 +02:00
Ines Montani bc85b12e6d
Merge pull request #6249 from svlandeg/feature/batch-tests 2020-10-15 08:57:56 +02:00
svlandeg 0796401c19 call NumpyOps instead of get_current_ops() 2020-10-14 16:55:00 +02:00
svlandeg 44e14ccae8 one more losses fix 2020-10-14 15:11:34 +02:00
svlandeg 0aa8851878 always return losses 2020-10-14 15:00:49 +02:00
svlandeg e94a21638e adding tests for trained models to ensure predict reproducibility 2020-10-13 21:07:13 +02:00
svlandeg ede979d42f formattting 2020-10-13 18:53:17 +02:00
svlandeg ff83bfae3f naming 2020-10-13 18:52:37 +02:00
svlandeg 6ccacff54e add tests for individual spacy layers 2020-10-13 18:50:07 +02:00
svlandeg c23041ae60 component tests single or multiple prediction 2020-10-13 16:26:53 +02:00
Ines Montani 1f49300862 Update transformer recommendations [ci skip] 2020-10-13 15:41:17 +02:00
Sofie Van Landeghem f8a1c1afd6
avoid dropout at runtime (#6247) 2020-10-13 14:39:59 +02:00
Ines Montani 86d648740f Fix morph representation in Doc.to_json 2020-10-13 11:39:03 +02:00
Ines Montani 7f92a5ee6a
Update spacy/lang/ta/examples.py 2020-10-13 11:03:35 +02:00
Ines Montani a0e12c136b Increment version [ci skip] 2020-10-13 10:00:53 +02:00
Ines Montani f090f39f17
Merge pull request #6245 from svlandeg/bugfix/else
bugfix in _pipe
2020-10-13 09:59:06 +02:00
svlandeg 1f465bea18 if-else 2020-10-13 09:27:19 +02:00
svlandeg 40276fd3be update NEL docs after latest refactor 2020-10-12 11:41:27 +02:00
Ines Montani 4fa967ea84 Increment version [ci skip] 2020-10-11 13:10:58 +02:00
Ines Montani ab890a35f9 Make console logger table more compact 2020-10-11 12:55:46 +02:00
Ines Montani 99606e46fe Relax meta.json schema [ci skip] 2020-10-11 12:30:57 +02:00
svlandeg 3a505e7e14 small edit to ensure the new word was indeed new 2020-10-10 21:05:28 +02:00
svlandeg 68d79796c6 add test for vocab after serializing KB 2020-10-10 20:59:48 +02:00
Ines Montani 539b0c10da Tidy up and auto-format 2020-10-10 19:14:48 +02:00
Ines Montani bfa3931c9d
Revert added_strings change (#6236) 2020-10-10 18:55:07 +02:00
Ines Montani 796f8b9424 Increment version 2020-10-09 18:00:27 +02:00
Ines Montani 525f798841 Fix typo in test 2020-10-09 18:00:21 +02:00
Ines Montani 8ac5f22253 Adjust error message 2020-10-09 18:00:16 +02:00
svlandeg 08cb085f6c Merge remote-tracking branch 'upstream/develop' into fix/various 2020-10-09 17:01:27 +02:00
Ines Montani b7cb9d95e4
Merge pull request #6229 from svlandeg/bugfix/disabled 2020-10-09 16:05:11 +02:00
svlandeg e972ecba72 add utf8 encoding for opening file 2020-10-09 16:03:14 +02:00
Ines Montani 9fb3244672
Merge pull request #6231 from adrianeboyd/feature/include-static-vectors 2020-10-09 15:54:52 +02:00
svlandeg 040c7c0541 fix get_dim calls in build_simple_cnn_text_classifier 2020-10-09 15:40:58 +02:00
Adriane Boyd 727370c633 Remove Span._recalculate_indices
Remove `Span._recalculate_indices`, which is a remnant from the
deprecated `Span.merge`.
2020-10-09 14:42:51 +02:00
svlandeg 853edace37 fix MultiHashEmbed example in documentation 2020-10-09 14:11:06 +02:00
svlandeg 06b9d213fd formatting 2020-10-09 12:19:47 +02:00
svlandeg 2cafba5f50 shorten error message for clarity 2020-10-09 12:17:35 +02:00
Ines Montani 4771a10503 Make test more explicit [ci skip] 2020-10-09 12:15:26 +02:00
Ines Montani cc3646b06c Add xfailing test for peculiar spans failure [ci skip] 2020-10-09 12:10:25 +02:00
svlandeg 8316bc7d4a bugfix DisabledPipes 2020-10-09 12:06:20 +02:00
svlandeg 18dfb27985 Add custom error when evaluation throws a KeyError 2020-10-09 12:05:33 +02:00
Adriane Boyd 39aabf50ab Also rename to include_static_vectors in CharEmbed 2020-10-09 11:54:48 +02:00
Florijan Stamenković 18f5c309dc Fix Issue 6207 (#6208)
* Regression test for issue 6207

* Fix issue 6207

* Sign contributor agreement

* Minor adjustments to test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-09 10:14:40 +02:00
Duygu Altinok 80fb1bffc9 Ordinal numbers for Turkish (#6142)
* minor ordinal number addition

* fixed typo

* added corresponding lexical test
2020-10-09 10:13:15 +02:00
Duygu Altinok 2fad279a44 Turkish language syntax iterators (#6191)
* added tr_vocab to config

* basic test

* added syntax iterator to Turkish lang class

* first version for Turkish syntax iter, without flat

* added simple tests with nmod, amod, det

* more tests to amod and nmod

* separated noun chunks and parser test

* rearrangement after nchunk parser separation

* added recursive NPs

* tests with complicated recursive NPs

* tests with conjed NPs

* additional tests for conj NP

* small modification for shaving off conj from NP

* added tests with flat

* more tests with flat

* added examples with flats conjed

* added inner func for flat trick

* corrected parse

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-09 10:10:22 +02:00
Sofie Van Landeghem d093d6343b
TrainablePipe (#6213)
* rename Pipe to TrainablePipe

* split functionality between Pipe and TrainablePipe

* remove unnecessary methods from certain components

* cleanup

* hasattr(component, "pipe") should be sufficient again

* remove serialization and vocab/cfg from Pipe

* unify _ensure_examples and validate_examples

* small fixes

* hasattr checks for self.cfg and self.vocab

* make is_resizable and is_trainable properties

* serialize strings.json instead of vocab

* fix KB IO + tests

* fix typos

* more typos

* _added_strings as a set

* few more tests specifically for _added_strings field

* bump to 3.0.0a36
2020-10-08 21:33:49 +02:00
Ines Montani 8ff73f04db Fix morph in Doc.to_json 2020-10-08 14:44:35 +02:00
Ines Montani 064575d79d
Merge pull request #6216 from svlandeg/feature/nel-initialize 2020-10-08 11:14:12 +02:00
svlandeg 3e2e1fd323 cleanup 2020-10-08 10:37:32 +02:00
svlandeg eaf5c265cb set_kb method for entity_linker 2020-10-08 10:34:01 +02:00
Ines Montani 010956d493 Clear rule-based components on initialize 2020-10-08 09:51:31 +02:00
Baranitharan d6037c1860
added sentence 2020-10-08 08:22:58 +05:30
Baranitharan 81afe9b19d
Update examples.py 2020-10-08 08:17:25 +05:30
Sofie Van Landeghem 241cd112f5
add reenabled pipe names back to the meta before serializing (#6219) 2020-10-08 00:44:16 +02:00
Sofie Van Landeghem 2998131416
Reproducibility for TextCat and Tok2Vec (#6218)
* ensure fixed seed in HashEmbed layers

* forgot about the joys of python 2
2020-10-08 00:43:46 +02:00
svlandeg efedccea8d fix tests 2020-10-07 15:29:52 +02:00
svlandeg 6b8bdb2d39 add init_config to nlp.create_pipe 2020-10-07 14:58:16 +02:00
svlandeg 33c2d4af16 move kb_loader to initialize for NEL instead of constructor 2020-10-07 14:56:00 +02:00
Wannaphong Phatthiyaphaibun 9fc8392b38
Add Thai tag map (LST20 Corpus) (#6163)
* Add Thai tag map (LST20 Corpus)

By @korakot

* Update tag_map.py

* Update tag_map.py

* Update tag_map.py
2020-10-07 11:12:01 +02:00
Duygu Altinok 7e821c2776
Turkish language syntax iterators (#6191)
* added tr_vocab to config

* basic test

* added syntax iterator to Turkish lang class

* first version for Turkish syntax iter, without flat

* added simple tests with nmod, amod, det

* more tests to amod and nmod

* separated noun chunks and parser test

* rearrangement after nchunk parser separation

* added recursive NPs

* tests with complicated recursive NPs

* tests with conjed NPs

* additional tests for conj NP

* small modification for shaving off conj from NP

* added tests with flat

* more tests with flat

* added examples with flats conjed

* added inner func for flat trick

* corrected parse

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-07 11:07:52 +02:00
Duygu Altinok 2ce6fc2611
Turkish tag map and morph rules addition (#6141)
* feat: added turkish tag map

* feat: morph rules cconj and sconj

* feat: more conjuncts

* feat: added popular postpositions

* feat: added adverbs

* feat: added personal pronouns

* feat: added reflexive pronouns

* minor: corrected case capital

* minor: fixed comma typo

* feat: added indef pronouns

* feat: added dict iter

* fixed comma typo

* updated language class with tag map and morph

* use default tag map instead

* removed tag map
2020-10-07 10:27:36 +02:00
Duygu Altinok b95a11dd95
Ordinal numbers for Turkish (#6142)
* minor ordinal number addition

* fixed typo

* added corresponding lexical test
2020-10-07 10:25:37 +02:00
Rahul Gupta 1a00bff06d
Hindi: Adds tests for lexical attributes (norm and like_num) (#5829)
* Hindi: Adds tests for lexical attributes (norm and like_num)

* Signs and sdds the contributor agreement

* Add ordinal numbers to be tagged as like_num

* Adds alternate pronunciation for 31 and 39
2020-10-07 10:23:32 +02:00
Nuccy90 c809b2c8e7
Update morph_rules.py (#6102)
* Update morph_rules.py

Added "dig" and "dej" ("you" in accusative form)

* Create Nuccy90.md

* Update Nuccy90.md
2020-10-06 15:14:47 +02:00
Matthew Honnibal 1a500f9717 Set version to v3.0.0a35 2020-10-06 14:19:07 +02:00
Sofie Van Landeghem fff3f8ccfa
Fix packaging pin (#6212)
* pin packaging to >=20.0

* ignore spacy-pkuseg in requirements unit test
2020-10-06 14:16:05 +02:00
Matthew Honnibal cfb9770a94
Fix empty input into StaticVectors layer (#6211)
* Add test for empty doc(s)

* Fix empty check in staticvectors

* Remove xfail

* Update spacy/ml/staticvectors.py
2020-10-06 14:15:41 +02:00
Florijan Stamenković 9db670b996
Fix Issue 6207 (#6208)
* Regression test for issue 6207

* Fix issue 6207

* Sign contributor agreement

* Minor adjustments to test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-06 11:17:37 +02:00
Ines Montani 568e12215d
Merge pull request #6206 from svlandeg/fix/patterns-init 2020-10-06 10:27:23 +02:00
svlandeg 9b4cf7b0b6 update output of debug config command 2020-10-06 09:47:23 +02:00
svlandeg ff9ac39c88 read entity_ruler patterns with srsly.read_jsonl.v1 2020-10-05 22:50:14 +02:00
Ines Montani 126268ce50 Auto-format [ci skip] 2020-10-05 21:58:18 +02:00
Ines Montani 1a554bdcb1 Update docs and docstring [ci skip] 2020-10-05 21:55:27 +02:00
Ines Montani 9614e53b02 Tidy up and auto-format 2020-10-05 21:55:18 +02:00
Ines Montani 181039bd17
Merge pull request #6205 from explosion/feature/embed-features 2020-10-05 21:49:10 +02:00
Ines Montani 5ba418b08c Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-10-05 21:44:01 +02:00