Commit Graph

7864 Commits

Author SHA1 Message Date
Ines Montani 61a4ef0b46 Fix syntax error 2020-09-13 19:23:09 +02:00
Matthew Honnibal b693d2d224 Fix speed report in table 2020-09-13 17:39:31 +02:00
Sofie Van Landeghem 744df9814a
define threshold for scoring textcat in TextCat config (#6055)
* define threshold for scoring textcat in TextCat config

* fix unit test and documentation
2020-09-13 14:15:52 +02:00
Adriane Boyd ab270364f1
Modify Token.morph to enable unsetting (#6043)
Modify `Token.morph` property so that `Token.c.morph` can be reset back
to an internal value of `0`. Allow setting `Token.morph` from a hash as
long as the morph string is already in the `StringStore`, setting it
indirectly through `Token.morph_` so that the value is added to the
morphology. If the hash is not in the `StringStore`, raise an error.
2020-09-13 14:06:07 +02:00
Adriane Boyd c7bd631b5f
Fix token.idx for special cases with affixes (#6035) 2020-09-13 14:05:36 +02:00
Matthew Honnibal 54c40223a1
Improve v3 pretrain command (#6040)
* Starts to run

* Update pretrain script

* Update corpus

* Update pretrain schema

* Remove outdated test

* Make JsonlTexts produce Example objects.
2020-09-13 14:05:05 +02:00
Ines Montani febb99916d Tidy up and auto-format [ci skip] 2020-09-13 10:55:36 +02:00
Ines Montani a5633b205f Fix handling of errors around git [ci skip] 2020-09-13 10:52:28 +02:00
Ines Montani f8846c198d Update types and docstrings 2020-09-13 10:52:02 +02:00
Sofie Van Landeghem e92e850c72
Raise if empty examples (#6052)
* raise error if no valid Example objects were found during initialization

* fix max_length parameter

* remove commit from other branch

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-12 21:01:53 +02:00
Matthew Honnibal 37347830d4 Fix reading in GloVe vectors 2020-09-12 17:31:18 +02:00
Ines Montani b41be87213
Merge pull request #6051 from svlandeg/feature/cli-config 2020-09-12 17:12:35 +02:00
Ines Montani eedaaaec75 Fix handling of existing asset without checksum [ci skip] 2020-09-12 17:02:53 +02:00
svlandeg a75cfe0da6 Merge remote-tracking branch 'upstream/develop' into feature/cli-config 2020-09-12 14:44:40 +02:00
svlandeg 115147804a string_to_list to parse comma-separated string into a list 2020-09-12 14:43:22 +02:00
Ines Montani f886f5bbc8
Merge pull request #6048 from explosion/fix/clone-compat 2020-09-12 10:30:49 +02:00
svlandeg 711166a75a prevent overwriting score_weights 2020-09-11 15:12:05 +02:00
Ines Montani 62eec33bc4 Fix meta.json validation 2020-09-11 11:38:33 +02:00
Ines Montani 0b2e07215d Support overwriting name on spacy package 2020-09-11 11:38:28 +02:00
svlandeg 5b94aeece9 support pipeline as "list in string" 2020-09-11 11:08:46 +02:00
Ines Montani 1bce432b4a Adjust message [ci skip] 2020-09-11 10:00:49 +02:00
Ines Montani 5acd4fbcd8 Merge branch 'develop' into fix/clone-compat 2020-09-11 09:58:30 +02:00
Ines Montani 761bd60d43 Adjust info message 2020-09-11 09:57:00 +02:00
Ines Montani 6831161bfa Resolve path to be extra sure 2020-09-11 09:56:49 +02:00
svlandeg 1723fb73c4 remove brol 2020-09-10 17:44:59 +02:00
svlandeg 08a831ce83 process trailing slash if any 2020-09-10 17:39:52 +02:00
Ines Montani 3e83a509bb WIP: fix project clone compatibility 2020-09-10 15:49:13 +02:00
svlandeg f1bc09c1e9 restore partly 2020-09-10 14:53:02 +02:00
svlandeg 3889747119 asset fix & UX 2020-09-10 14:36:53 +02:00
svlandeg a36766d153 hookup branch 2020-09-10 12:00:34 +02:00
svlandeg 97d99f7efa Merge remote-tracking branch 'upstream/develop' into feature/doc-fixes 2020-09-10 11:51:34 +02:00
Ines Montani 908f3a4494 Update default projects repo [ci skip] 2020-09-10 11:42:14 +02:00
svlandeg 92f9d2f406 small UX fixes 2020-09-10 11:35:50 +02:00
svlandeg 1fc5486792 more fine-grained errors for git_sparse_checkout 2020-09-10 11:31:32 +02:00
Ines Montani 15bc3a37b4 Add --branch to project clone 2020-09-10 11:08:15 +02:00
Ines Montani 1955aaaa20
Merge pull request #6045 from svlandeg/feature/more-layers-docs [ci skip] 2020-09-09 21:46:40 +02:00
Sofie Van Landeghem cb66ea7400
Remove simple_ner code (#6041)
* remove simple_ner code

* remove unused _biluo and _iob files
2020-09-09 16:11:27 +02:00
svlandeg 39aa740777 Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs 2020-09-09 11:59:34 +02:00
Sofie Van Landeghem 8e7557656f
Renaming gold & annotation_setter (#6042)
* version bump to 3.0.0a16

* rename "gold" folder to "training"

* rename 'annotation_setter' to 'set_extra_annotations'

* formatting
2020-09-09 10:31:03 +02:00
Sofie Van Landeghem 60f22e1800
Pipe API (#6034)
* ensure Language passes on valid examples for initialization

* fix tagger model initialization

* check for valid get_examples across components

* assume labels were added before begin_training

* fix senter initialization

* fix morphologizer initialization

* use methods to check arguments

* test textcat init, requires thinc>=8.0.0a31

* fix tok2vec init

* fix entity linker init

* use islice

* fix simple NER

* cleanup debug model

* fix assert statements

* fix tests

* throw error when adding a label if the output layer can't be resized anymore

* fix test

* add failing test for simple_ner

* UX improvements

* morphologizer UX

* assume begin_training gets a representative set and processes the labels

* remove assumptions for output of untrained NER model

* restore test for original purpose
2020-09-08 22:44:25 +02:00
svlandeg d0a8849e4d fix typo 2020-09-08 18:32:12 +02:00
svlandeg bd8f9b188b small fixes 2020-09-08 17:24:36 +02:00
Matthew Honnibal 4b82882767 Fix defaults 2020-09-08 15:31:21 +02:00
Matthew Honnibal 5d09e3e154 Set version to v3.0.0a15 2020-09-08 15:25:10 +02:00
Matthew Honnibal ba5f4c9b32 Add words and seconds to train info 2020-09-08 15:24:47 +02:00
Matthew Honnibal b470062153
Add CLI registry (#6037) 2020-09-08 15:23:34 +02:00
svlandeg 06ef66fd73 Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs 2020-09-08 10:28:42 +02:00
Matthew Honnibal dae22f3dfa Fix ignoring of punct labels 2020-09-05 14:11:59 +02:00
Matthew Honnibal 12e1279f6b Set version to v3.0.0a14 2020-09-05 04:13:53 +02:00
Matthew Honnibal 4b7abaafdb Fix learn rate for non-transformer 2020-09-04 21:22:50 +02:00
Matthew Honnibal 465785a672 Fix project pull and push 2020-09-04 21:15:55 +02:00
Ines Montani f174c7b1f3 Merge branch 'develop' into pr/6018 2020-09-04 15:54:49 +02:00
Ines Montani f06eed800e
Merge pull request #6029 from explosion/master-tmp 2020-09-04 15:11:55 +02:00
Ines Montani f9550b4493 Fix components in meta.json and website [ci skip] 2020-09-04 14:42:12 +02:00
Ines Montani d7cc2ee72d Fix tests 2020-09-04 14:05:55 +02:00
Ines Montani 90043a6f9b Tidy up and auto-format 2020-09-04 13:42:33 +02:00
Ines Montani df0b68f60e Remove unicode declarations and update language data 2020-09-04 13:19:16 +02:00
Ines Montani ba600f91c5 Tidy up imports 2020-09-04 13:15:44 +02:00
Ines Montani 864a697e63 Merge branch 'develop' into master-tmp 2020-09-04 13:15:36 +02:00
Adriane Boyd b927893309
Merge branch 'develop' into feature/dependency-matcher-v3 2020-09-04 13:03:30 +02:00
Ines Montani ab1bb421ed Update docs links in codebase 2020-09-04 12:58:50 +02:00
holubvl3 0a27fca557
Create examples.py (#5985)
* Create examples.py

* Create tag_map.py

* Delete tag_map.py

* Update examples.py

formatting: add empty line

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-09-04 11:00:14 +02:00
Ines Montani 2189046869
Merge pull request #6024 from explosion/chore/registry-renaming 2020-09-04 10:54:10 +02:00
svlandeg c32fcdf4c9 fix typo 2020-09-04 09:10:21 +02:00
Ines Montani 595f9dc2e4 Make displacy color registry consistent with others
This was the only registry that expected the registered objects to be dictionaries instead of functions that return something. We can still support plain dicts but we should also support functions for consistency
2020-09-03 23:05:41 +02:00
Matthew Honnibal 1c07820681 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-03 18:54:21 +02:00
Matthew Honnibal 7be8a0516a Fix project pull 2020-09-03 18:54:03 +02:00
Ines Montani 23b7d9cfa3 Prefix span getters 2020-09-03 17:37:06 +02:00
Ines Montani 5afe6447cd registry.assets -> registry.misc 2020-09-03 17:31:14 +02:00
Ines Montani c063e55eb7 Add prefix to batchers 2020-09-03 17:30:41 +02:00
Ines Montani 896caf45e3
Merge pull request #6023 from explosion/ux/model-terminology-consistency [ci skip] 2020-09-03 17:13:44 +02:00
Ines Montani c53b1433b9 Adjust more arguments [ci skip] 2020-09-03 17:12:24 +02:00
Ines Montani b5a0657fd6 "model" terminology consistency in docs 2020-09-03 13:13:03 +02:00
Matthew Honnibal f038841798 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-03 12:52:39 +02:00
Matthew Honnibal ef0d0630a4 Let Langugae.use_params work with falsey inputs
The Language.use_params method was failing if you passed in None, which
meant we had to use awkward conditionals for the parameter averaging.
This solves the problem.
2020-09-03 12:51:04 +02:00
Yohei Tamura 5af432e0f2
fix for empty string (#5936) 2020-09-03 10:09:03 +02:00
Adriane Boyd 77ac4a38aa
Simplify specials and cache checks (#6012) 2020-09-03 09:42:49 +02:00
Adriane Boyd 8b5594df86 Remove near-duplicate test 2020-09-02 20:32:01 +02:00
Matthew Honnibal 122cb02001 Fix averages 2020-09-02 19:37:43 +02:00
Adriane Boyd 960d9cfadc Officially support DependencyMatcher
Add official support for the `DependencyMatcher`. Redesign the pattern
specification. Fix and extend operator implementations. Update API docs
and add usage docs.

Patterns
--------

Refactor pattern structure to:

```
{
  "LEFT_ID": str,
  "REL_OP": str,
  "RIGHT_ID": str,
  "RIGHT_ATTRS": dict,
}
```

The first node contains only `RIGHT_ID` and `RIGHT_ATTRS` and all
subsequent nodes contain all four keys.

New operators
-------------

Because of the way patterns are constructed from left to right, it's
helpful to have `follows` operators along with `precedes` operators. Add
operators for simple precedes / follows alongside immediate precedes /
follows.

* `.*`: precedes
* `;`: immediately follows
* `;*`: follows

Operator fixes
--------------

* `<` and `<<` do not include the node itself
* Fix reversed order for all operators involving linear precedence (`.`,
  all sibling operators)
* Linear precedence operators do not match nodes outside the same parse

Additional fixes
----------------

* Use v3 Matcher API
* Support `get` and `remove`
* Support pickling
2020-09-02 17:45:29 +02:00
Marek Grzenkowicz 92d7832a86
Fix off-by-one error for best iteration calculation (closes #6014) (#6016) 2020-09-02 15:15:45 +02:00
Matthew Honnibal 737a1408d9 Improve implementation of fix #6010
Follow-ups to the parser efficiency fix.

* Avoid introducing new counter for number of pushes
* Base cut on number of transitions, keeping it more even
* Reintroduce the randomization we had in v2.
2020-09-02 14:42:32 +02:00
Sofie Van Landeghem eb56377799
Fix overfitting test (#6011)
* remove unused MORPH_RULES

* fix textcat architecture in overfitting test
2020-09-02 13:07:41 +02:00
Adriane Boyd b97d98783a
Fix Hungarian % tokenization (#6013) 2020-09-02 13:06:16 +02:00
Matthew Honnibal c1bf3a5602
Fix significant performance bug in parser training (#6010)
The parser training makes use of a trick for long documents, where we
use the oracle to cut up the document into sections, so that we can have
batch items in the middle of a document. For instance, if we have one
document of 600 words, we might make 6 states, starting at words 0, 100,
200, 300, 400 and 500.

The problem is for v3, I screwed this up and didn't stop parsing! So
instead of a batch of [100, 100, 100, 100, 100, 100], we'd have a batch
of [600, 500, 400, 300, 200, 100]. Oops.

The implementation here could probably be improved, it's annoying to
have this extra variable in the state. But this'll do.

This makes the v3 parser training 5-10 times faster, depending on document
lengths. This problem wasn't in v2.
2020-09-02 12:57:13 +02:00
Sofie Van Landeghem f7a25d69f7
Bugfix in merge_entities (#6005)
* failing test

* bugfix
2020-09-01 21:57:52 +02:00
Sofie Van Landeghem 6bfb1b3a29
Fix sparse checkout for 'spacy project' (#6008)
* exit if cloning fails

* UX

* rewrite http link to git protocol, don't use stdin

* fixes to sparse checkout

* formatting
2020-09-01 19:49:01 +02:00
Matthew Honnibal 4cce32f090 Fix tagger initialization 2020-09-01 16:38:34 +02:00
Matthew Honnibal 046c38bd26
Remove 'cleanup' of strings (#6007)
A long time ago we went to some trouble to try to clean up "unused"
strings, to avoid the `StringStore` growing in long-running processes.

This never really worked reliably, and I think it was a really wrong
approach. It's much better to let the user reload the `nlp` object as
necessary, now that the string encoding is stable (in v1, the string IDs
were sequential integers, making reloading the NLP object really
annoying.)

The extra book-keeping does make some performance difference, and the
feature is unsed, so it's past time we killed it.
2020-09-01 16:12:15 +02:00
Ines Montani 70b226f69d Support ignore marker in project document [ci skip] 2020-09-01 12:49:04 +02:00
Ines Montani a4c51f0f18 Add v3 info to project docs [ci skip] 2020-09-01 12:36:21 +02:00
Ines Montani ef9005273b Update fill-config command and add silent mode [ci skip] 2020-09-01 12:07:04 +02:00
Matthew Honnibal ec660e3131 Fix use_pytorch_for_gpu_memory 2020-09-01 00:41:38 +02:00
Adriane Boyd 9130094199
Prevent Tagger model init with 0 labels (#5984)
* Prevent Tagger model init with 0 labels

Raise an error before trying to initialize a tagger model with 0 labels.

* Add dummy tagger label for test

* Remove tagless tagger model initializiation

* Fix error number after merge

* Add dummy tagger label to test

* Fix formatting

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-31 21:24:33 +02:00
Matthw Honnibal c38298b8fa Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-31 19:55:55 +02:00
Matthw Honnibal fe298fa50a Shuffle on first epoch of train 2020-08-31 19:55:22 +02:00
Ines Montani 9af82f3f11
Merge pull request #6003 from explosion/feature/matcher-as-spans 2020-08-31 17:50:56 +02:00
Ines Montani add9de5487 Deprecate (Phrase)Matcher.pipe 2020-08-31 17:01:24 +02:00
Ines Montani 83aff38c59
Make argument keyword-only
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-31 15:39:03 +02:00
Ines Montani 6340d1c63d Add as_spans to Matcher/PhraseMatcher 2020-08-31 14:53:22 +02:00
svlandeg 13ee742fb4 example of custom logger 2020-08-31 14:24:41 +02:00
svlandeg c18eb63483 Merge remote-tracking branch 'upstream/develop' into feature/vectors-docs
# Conflicts:
#	website/docs/usage/embeddings-transformers.md
2020-08-31 13:21:36 +02:00
Sofie Van Landeghem ec14744ee4
Rename Transformer listener (#6001)
* rename to spacy-transformers.TransformerListener

* add some more tok2vec tests

* use select_pipes

* fix docs - annotation setter was not changed in the end
2020-08-31 12:41:39 +02:00
Adriane Boyd 216efaf5f5 Restrict tokenizer exceptions to ORTH and NORM 2020-08-31 09:55:01 +02:00
Matthew Honnibal 9341cbc013 Set version to v3.0.0a13 2020-08-30 23:10:43 +02:00
Ines Montani 45f46a5c85
Merge pull request #5993 from explosion/feature/disabled-components 2020-08-29 15:58:41 +02:00
Ines Montani 34146750d4 Use frozen list with custom errors
We don't want to break backwards compatibility too much but we also want to provide the best possible UX
2020-08-29 15:20:11 +02:00
Ines Montani 744f432420
Merge pull request #5994 from explosion/feature/idempotent-component-decorator 2020-08-29 13:17:13 +02:00
Ines Montani 5de3f8604d
Update spacy/util.py
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-29 13:17:06 +02:00
Ines Montani 091a9b522a Remove unused variable [ci skip] 2020-08-29 13:11:26 +02:00
Ines Montani 2bc31e15c9 Tidy up and auto-format [ci skip] 2020-08-29 13:01:10 +02:00
Ines Montani 6520d1a1df Work around set order in Language.disabled 2020-08-29 12:58:22 +02:00
Ines Montani f45095a666
Merge pull request #5995 from adrianeboyd/bugfix/attribute-ruler-bugfixes 2020-08-29 12:38:30 +02:00
Ines Montani e0b4984aa4 Make deprecated disable_pipes call into select_pipes 2020-08-29 12:08:46 +02:00
Ines Montani 15d73f4dc3 Make user-facing Language.disabled return list
More consistent with all the other properties
2020-08-29 12:08:33 +02:00
Matthew Honnibal 58f19421b1 Return empty batch from tok2vec listener if no doc.tensor 2020-08-29 03:46:50 +02:00
svlandeg 5230529de2 add loggers registry & logger docs sections 2020-08-28 21:44:04 +02:00
Ines Montani 0687d7148e Rename user-facing API 2020-08-28 21:04:02 +02:00
Adriane Boyd 0104bd1600 Sort the AttributeRuler matches by rule order
Sort the returned matches by rule order (the `match_id`) so that the
rules are applied in the order they were added. This is necessary, for
instance, if the `AttributeRuler` is used for the tag map and later
rules require POS tags.
2020-08-28 21:01:06 +02:00
Ines Montani 6a999c9303 Remove outdated component attr check 2020-08-28 20:59:19 +02:00
Adriane Boyd 8674b17651 Serialize AttributeRuler.patterns
Serialize `AttributeRuler.patterns` instead of the individual lists to
simplify the serialized and so that patterns are reloaded exactly as
they were originally provided (preserving `_attrs_unnormed`).
2020-08-28 20:44:45 +02:00
Ines Montani 10da74382f Raise if disabled components are removed before DisabledPipes.restore 2020-08-28 20:35:26 +02:00
Ines Montani 1e0363290e Remove todos and update docstrings 2020-08-28 20:34:46 +02:00
Ines Montani cad988da7f Allow component decorators to re-run with same function 2020-08-28 16:27:22 +02:00
Ines Montani 3ce5be4b76 Allow loaded but disabled components 2020-08-28 15:20:14 +02:00
Ines Montani 89f692bc8a
Merge pull request #5992 from svlandeg/feature/wandb-restrict-config 2020-08-28 15:05:29 +02:00
Ines Montani 9c4049b57f
Merge pull request #5986 from explosion/fix/language-config-interpolate-disk-bytes 2020-08-28 15:03:52 +02:00
Ines Montani adc050cdc5 Fix code style in test [ci skip] 2020-08-28 15:03:21 +02:00
svlandeg 05a1bafa15 fix type 2020-08-28 14:08:33 +02:00
svlandeg 33883aa764 rename field 2020-08-28 14:06:23 +02:00
svlandeg 1d8c4070aa add disable_fields to wandb_logger 2020-08-28 13:55:32 +02:00
Ines Montani a51b4f3a19 Merge branch 'develop' into fix/language-config-interpolate-disk-bytes 2020-08-28 13:21:17 +02:00
Ines Montani 03dde511b4
Merge pull request #5987 from explosion/feature/debug-config [ci skip] 2020-08-28 11:30:18 +02:00
Ines Montani 62e9967228 Merge branch 'develop' into fix/language-config-interpolate-disk-bytes 2020-08-28 11:19:36 +02:00
Ines Montani 4ca2698f85 Merge branch 'develop' into feature/debug-config 2020-08-28 11:19:17 +02:00
svlandeg 9a8255ffd5 two tests because of different exit type 2020-08-28 10:50:26 +02:00
svlandeg 73baaf330a update error type 2020-08-28 10:46:21 +02:00
Matthew Honnibal c558ca4485 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-27 19:47:26 +02:00
Matthew Honnibal d3ffe4ca63 Fix error when tagger was initialized with no labels 2020-08-27 18:56:58 +02:00
Ines Montani d1780db6a4 Tidy up and use different error [ci skip] 2020-08-27 18:56:55 +02:00
Ines Montani ff4175e839 Add more info to debug config 2020-08-27 18:17:58 +02:00
Ines Montani daac8ebacd Don't interpolate config on Language deserialization 2020-08-27 16:44:36 +02:00
Matthew Honnibal e1e1760fd6 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-27 03:22:11 +02:00
Matthew Honnibal 95adb58f15 Force tagger to pass batch of docs into model in begin_training 2020-08-27 03:21:03 +02:00
Ines Montani cdc114e212
Merge pull request #5977 from explosion/refactor/vector-names 2020-08-26 19:03:16 +02:00
Ines Montani 8692d176f6
Merge pull request #5978 from explosion/feature/update-wasabi
Update wasabi: new diff_strings and MarkdownRenderer
2020-08-26 19:02:52 +02:00
Matthew Honnibal 9b22714a4e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-26 15:48:45 +02:00
Matthew Honnibal 172af24f95 Fix upload and download 2020-08-26 15:48:23 +02:00
Ines Montani a5fff1df51 Remove outdated non-empty output dir warning [ci skip] 2020-08-26 15:45:51 +02:00
Matthew Honnibal 2d520d3b45 Remove unused error 2020-08-26 15:41:14 +02:00
Adriane Boyd 90d88729e0
Add AttributeRuler.score (#5963)
* Add AttributeRuler.score

Add scoring for TAG / POS / MORPH / LEMMA if these are present in the
assigned token attributes.

Add default score weights (that don't really make a lot of sense) so
that the scores are in the default config in some form.

* Update docs
2020-08-26 15:39:30 +02:00
Ines Montani 3aec98ca38 Update wasabi: new diff_strings and MarkdownRenderer 2020-08-26 15:33:11 +02:00
Sofie Van Landeghem 79d460e3a2
Weights & Biases logger for train CLI (#5971)
* quick test as part of train script

* train_logger in config, default ConsoleLogger in loggers catalogue

* entitiy typo

* add wandb_logger

* cleanup

* Update spacy/cli/train_logger.py

Co-authored-by: Ines Montani <ines@ines.io>

* move loggers to gold.loggers

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-26 15:24:33 +02:00
Ines Montani 0997c30b9e
Merge pull request #5974 from explosion/feature/project-document 2020-08-26 15:14:13 +02:00
Matthew Honnibal 191fb4144f Merge branch 'develop' into refactor/vector-names 2020-08-26 14:26:45 +02:00
Ines Montani 627617a079 Tidy up and add docs [ci skip] 2020-08-26 13:24:55 +02:00
Adriane Boyd 43c61da209 Set macro AUC score in Scorer.score_cats 2020-08-26 10:49:30 +02:00
Ines Montani aeebc6678d Small cleanup and adjustments 2020-08-26 10:26:57 +02:00
Ines Montani 31567d1e42 Link project.yml 2020-08-26 10:26:32 +02:00
Ines Montani 6c2a5ff53b Auto-link local sources 2020-08-26 10:26:06 +02:00
Matthew Honnibal 77852d2428 Fix run_command for python 3.6 2020-08-26 05:02:43 +02:00
Matthew Honnibal 884cac5fb5 Make run_command backwards compatible 2020-08-26 04:33:42 +02:00
Matthew Honnibal 6547472347 Set version to v3.0.0a12 2020-08-26 04:02:34 +02:00
Adriane Boyd 7d7b65ffd4
Fix raw strings in URL pattern (#5972)
Add missing raw string specifiers.
2020-08-26 04:00:49 +02:00
Matthew Honnibal 2771e4f2b3
Fix the git "sparse checkout" functionality (#5973)
* Fix the git sparse checkout functionality

* Format
2020-08-26 04:00:14 +02:00
Ines Montani 1c958a76c1 Add comment markers to only replace auto-generated docs 2020-08-26 00:03:06 +02:00
Ines Montani f10989e8c4 Add "project document" and more project.yml meta fields 2020-08-25 17:14:27 +02:00
Ines Montani fdcaf86c54 Adjust docstring
End sentence earlier so it's shown as a full sentence in --help
2020-08-25 17:13:50 +02:00
Ines Montani b89f6fa011 Fix meta defaults and error in package command 2020-08-25 17:13:33 +02:00
Ines Montani 94705c21c8 Allow reuse on validators to prevent reload error
Otherwise this will cause an error if spaCy is live reloaded, e.g. in Streamlit
2020-08-25 17:13:11 +02:00
Matthew Honnibal 4f82a02b70 Remove 'fix_pretrained_vectors_name' hack 2020-08-25 14:37:45 +02:00
Adriane Boyd 0bab7c8b91
Remove PRON_LEMMA symbol (#5968) 2020-08-25 14:21:29 +02:00
Hiroshi Matsuda 332803eda9
fix ja leading spaces (#5969)
* change condition for space after

* add NAUGHTY_STRINGS test example
2020-08-25 14:16:24 +02:00
Ines Montani dd84577a98 Update CLI utils, project.yml schema and add test 2020-08-25 11:54:53 +02:00
Shashank 450720aca2
Added support for Sanskrit language (#5956)
* Added support for Sanskrit language

* Added tests for lexical attribute like_num
2020-08-25 10:56:29 +02:00
Matthew Honnibal ef43152af4 Update scorer 2020-08-25 02:42:47 +02:00
Matthew Honnibal 8d6e1ce306 Update v3.0.0a11 2020-08-25 00:32:08 +02:00
Matthew Honnibal 8038b87f04
Various small tweaks to project CLI (#5965)
* Fix up/download of http and local paths

* Support git_sparse_checkout for assets

* Fix scorer

* Handle already-present directories for git assets

* Improve convert command

* Fix support for existant files in git assets

* Support branches in git sparse checkout

* Format

* Fix git assets

* Document git block in assets

* Fix test

* Fix test

* Revert "Fix test"

This reverts commit cf3097260f.

* Revert "Fix test"

This reverts commit 964d636e27.

* Dont multiply p/r/f by 100

* Display scores * 100 during training
2020-08-25 00:30:52 +02:00
Adriane Boyd abd3f2b65a
Rename Polish lemmatizer method (#5960)
Rename Polish lemmatizer method to `pos_lookup` to distinguish it from
pure token-based lookup methods.
2020-08-25 00:22:27 +02:00
Ines Montani e12b03358b
Support removing extra values in fill-config (#5966)
* Support removing extra values in fill-config

* Fix test
2020-08-24 22:53:47 +02:00
Matthew Honnibal f232d8db96 Report p/r/f out of 100 2020-08-24 17:17:23 +02:00
Ines Montani 0e7f99da58
Fix handling of optional [pretraining] block (#5954)
* Fix handling of optional [pretraining] block

* Remote pretraining from default config

* Fix test

* Add schema option for empty pretrain block
2020-08-24 15:56:03 +02:00
idoshr b10c7bc56e
Hebrew like num (#5952)
* Update stop_words.py

Hebrew STOP WORDS

* Update stop_words.py

* contributor

* contributor

* add some common domain extentions
support human number 1K/1M....

* support human number 1K/1M....

* hebrew number tokenize
1K/1M implement in EN

* test human tokenize fix

* test

* heb like num
revert human number change

* heb like num
2020-08-24 14:30:05 +02:00
Matthew Honnibal 64df37643f Update lockfile after project pull 2020-08-24 03:27:09 +02:00
Matthew Honnibal 588c28fe45 Fix project pull when deps missing 2020-08-24 01:23:36 +02:00
Matthew Honnibal 001546c19e Set version to v3.0.0a10 2020-08-23 21:15:38 +02:00
Matthew Honnibal 160a855246 Format 2020-08-23 21:15:12 +02:00
Matthew Honnibal 89f5b8abb3 Fix project push 2020-08-23 21:14:44 +02:00
Matthew Honnibal 3828bc3ed0 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-23 18:32:24 +02:00
Matthew Honnibal e559867605
Allow spacy project to push and pull to/from remote storage (#5949)
* Add utils for working with remote storage

* WIP add remote_cache for project

* WIP add push and pull commands

* Use pathy in remote_cache

* Updarte util

* Update remote_cache

* Update util

* Update project assets

* Update pull script

* Update push script

* Fix type annotation in util

* Work on remote storage

* Remove site and env hash

* Fix imports

* Fix type annotation

* Require pathy

* Require pathy

* Fix import

* Add a util to handle project variable substitution

* Import push and pull commands

* Fix pull command

* Fix push command

* Fix tarfile in remote_storage

* Improve printing

* Fiddle with status messages

* Set version to v3.0.0a9

* Draft docs for spacy project remote storages

* Update docs [ci skip]

* Use Thinc config to simplify and unify template variables

* Auto-format

* Don't import Pathy globally for now

Causes slow and annoying Google Cloud warning

* Tidy up test

* Tidy up and update tests

* Update to latest Thinc

* Update docs

* variables -> vars

* Update docs [ci skip]

* Update docs [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-23 18:32:09 +02:00
Matthew Honnibal fe1cf7e124 Allow score_weights to list extra scores 2020-08-23 18:31:30 +02:00
Ines Montani 9bdc9e81f5 Fix error message [ci skip] 2020-08-23 12:14:02 +02:00
Sofie Van Landeghem 56eabcb2f2
Adding num_like test for Czech (#5946)
* Create lex_attrs.py

Hello,

I am missing a CZECH language in SpaCy. So I would like to help to push it a little. This file is base on others lex_attrs.py files just with translation to Czech.

* Update __init__.py

Updated for use with new Czech Lex_attrs file

* Update stop_words.py

* Create test_text.py

* add like_num testing for czech

Co-authored-by: holubvl3 <47881982+holubvl3@users.noreply.github.com>
Co-authored-by: holubvl3 <vilemrousi@gmail.com>
Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>
2020-08-21 17:06:33 +02:00
holubvl3 a341b4ef09
Adding support for Czech language (#5826)
* Create lex_attrs.py

Hello,

I am missing a CZECH language in SpaCy. So I would like to help to push it a little. This file is base on others lex_attrs.py files just with translation to Czech.

* Update __init__.py

Updated for use with new Czech Lex_attrs file

* Update stop_words.py

* Create test_text.py

Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>
2020-08-21 16:17:53 +02:00
svlandeg af36d77d01 fix typo in docstring 2020-08-21 15:56:03 +02:00
svlandeg 3060e4ae65 Merge remote-tracking branch 'upstream/develop' into feature/docs-docs-docs
# Conflicts:
#	website/src/widgets/quickstart-training-generator.js
2020-08-21 15:16:30 +02:00
svlandeg cc926267f8 small fixes 2020-08-21 15:05:40 +02:00
Ines Montani aa6a7cd6e7 Update docs and consistency [ci skip] 2020-08-21 13:49:18 +02:00
Ines Montani 3826cfb8fe
Merge pull request #5930 from svlandeg/feature/init-config-fix
UX for init config
2020-08-21 12:06:33 +02:00
Ines Montani 79af7dcd6d Small wording adjustments [ci skip] 2020-08-21 12:06:19 +02:00