Commit Graph

7670 Commits

Author SHA1 Message Date
svlandeg af36d77d01 fix typo in docstring 2020-08-21 15:56:03 +02:00
svlandeg 3060e4ae65 Merge remote-tracking branch 'upstream/develop' into feature/docs-docs-docs
# Conflicts:
#	website/src/widgets/quickstart-training-generator.js
2020-08-21 15:16:30 +02:00
svlandeg cc926267f8 small fixes 2020-08-21 15:05:40 +02:00
Ines Montani aa6a7cd6e7 Update docs and consistency [ci skip] 2020-08-21 13:49:18 +02:00
Ines Montani 3826cfb8fe
Merge pull request #5930 from svlandeg/feature/init-config-fix
UX for init config
2020-08-21 12:06:33 +02:00
Ines Montani 79af7dcd6d Small wording adjustments [ci skip] 2020-08-21 12:06:19 +02:00
Ines Montani e60442d83a Adjust label casing in displaCy NER visualizer (resolves #4866)
- Accept any case for label names in ents and colors option, even if actual predicted label uses different casing
- Don't text-transform: uppercase visually, if it's important to users that the label is represented as-is in the UI
2020-08-21 11:51:31 +02:00
Matthew Honnibal c356e62908 Minor adjustments to quickstart template 2020-08-21 00:10:21 +02:00
Ines Montani 6ad59d59fe Merge branch 'develop' of https://github.com/explosion/spaCy into develop [ci skip] 2020-08-20 11:20:58 +02:00
Sofie Van Landeghem 071c09ff35
add coding (#5942) 2020-08-20 11:08:38 +02:00
Ines Montani ea6640ea72
Merge pull request #5939 from explosion/feature/thinc-v8.0.0a28
Update Thinc and config variables
2020-08-19 21:14:36 +02:00
Ines Montani 3dd390b1a1 Update Thinc and config variables 2020-08-19 19:46:12 +02:00
svlandeg b96cd9fa5e fix typo 2020-08-19 18:46:08 +02:00
Ines Montani e2f2ef3a5a Update init config and recommendations
- As much as I dislike YAML, it seemed like a better format here because it allows us to add comments if we want to explain the different recommendations
- Don't include the generated JS in the repo by default and build it on the fly when running or deploying the site. This ensures it's always up to date.
- Simplify jinja_to_js script and use fewer dependencies
2020-08-19 13:33:15 +02:00
Ines Montani 2285e59765
Merge pull request #5933 from svlandeg/feature/more-v3-docs [ci skip] 2020-08-19 11:29:02 +02:00
Matthew Honnibal c0f6e77a41 Set version to v3.0.0a8 2020-08-18 23:29:00 +02:00
svlandeg a8acedd4ba example of custom reader and batcher 2020-08-18 19:15:16 +02:00
Sofie Van Landeghem 358cbb21e3
Define candidate generator in EL config (#5876)
* candidate generator as separate part of EL config

* update comment

* ent instead of str as input for candidate generation

* Span instead of str: correct type indication

* fix types

* unit test to create new candidate generator

* fix replace_pipe argument passing

* move error message, general cleanup

* add vocab back to KB constructor

* provide KB as callable from Vocab arg

* rename to kb_loader, fix KB serialization as part of the EL pipe

* fix typo

* reformatting

* cleanup

* fix comment

* fix wrongly duplicated code from merge conflict

* rename dump to to_disk

* from_disk instead of load_bulk

* update test after recent removal of set_morphology in tagger

* remove old doc
2020-08-18 16:10:36 +02:00
Sofie Van Landeghem 688e77562b
Train CLI script fixes (#5931)
* fix dash replacement in overrides arguments

* perform interpolation on training config

* make sure only .spacy files are read
2020-08-18 16:06:37 +02:00
Ines Montani 82f0e20318 Update docs and consistency [ci skip] 2020-08-18 14:39:40 +02:00
svlandeg 10e67b400c output_file required, spacy-transformers prefered instead of required 2020-08-18 13:38:43 +02:00
Ines Montani 1c3bcfb488 Update docs and util consistency 2020-08-18 01:22:59 +02:00
Ines Montani 990c6b4c32 Update docs and CLI [ci skip] 2020-08-17 21:38:20 +02:00
Ines Montani 3ae5e02f4f Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
Matthew Honnibal a95a36ce2a Set version to v3.0.0a7 2020-08-16 15:51:05 +02:00
Ines Montani 6ae83bde0c Fix CLI consistency [ci skip] 2020-08-16 15:46:29 +02:00
Ines Montani 45f13cbf64
Merge pull request #5916 from explosion/feature/new-thinc-config 2020-08-16 15:24:12 +02:00
Ines Montani 34bda91695 Show warnings if there's nothing to auto-fill 2020-08-16 14:19:43 +02:00
Ines Montani dd5804d499 Update type hints 2020-08-16 14:19:33 +02:00
Ines Montani a570c304df Update quickstart, template and docs 2020-08-15 14:50:29 +02:00
Ines Montani 3272a63430
Merge pull request #5920 from explosion/fix/logging-warning-various 2020-08-15 14:41:15 +02:00
Ines Montani fdcde9b0bf Add init fill-config 2020-08-14 16:49:26 +02:00
Matthew Honnibal 9ebf39fb5f Relax test 2020-08-14 16:31:09 +02:00
Ines Montani 8128e5eb35 Replace lexeme_norm warning with logging 2020-08-14 15:00:52 +02:00
Ines Montani 37814b608d Remove env_opt and simplfy default Optimizer 2020-08-14 14:59:54 +02:00
Ines Montani ab1d165bba Pass optimizer defined in config to resume/begin_training
Otherwise, this would create a default optimizer, which isn't what we want?
2020-08-14 14:59:22 +02:00
Ines Montani e4d0990857 Only receive from listener if listener exists 2020-08-14 14:58:48 +02:00
Ines Montani cef97e4b63 Fix path check 2020-08-14 14:58:18 +02:00
Ines Montani db2dbc8e59 Remove unused warning 2020-08-14 14:58:03 +02:00
Ines Montani 67cc39af7f Update Thinc and include section order 2020-08-14 14:06:22 +02:00
Ines Montani 88b0a96801 Update for new Thinc and adjust config 2020-08-13 17:38:30 +02:00
Adam Bittlingmayer 7b33b2854f
Add Armenian sentence-final verchaket, Greek question mark and Arabic question mark to default punct (#5910)
* Add Armenian sentence-final verchaket

* Add Greek and Arabic question marks, and contributor agreement

* Check box
2020-08-12 15:36:14 +02:00
graue70 49e690bde1
Fix typos in comments (#5904)
* Fix typo in comment

* Fix typo

* Add spaCy Contributor Agreement
2020-08-12 15:35:25 +02:00
graue70 ba84371ab0
Use init parameter (#5909) 2020-08-11 23:41:58 +02:00
Ines Montani 950832f087
Tidy up pipes (#5906)
* Tidy up pipes

* Fix init, defaults and raise custom errors

* Update docs

* Update docs [ci skip]

* Apply suggestions from code review

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

* Tidy up error handling and validation, fix consistency

* Simplify get_examples check

* Remove unused import [ci skip]

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-11 23:29:31 +02:00
Ines Montani f79e4c094d Remove generic type
Seems to cause error on Python 3.8 with Cython?
2020-08-10 17:24:30 +02:00
Ines Montani c099f6eece Add Token.lex 2020-08-10 16:43:52 +02:00
Ines Montani 933a7cf8d1 Fix Lexeme.from_ptr 2020-08-10 16:43:37 +02:00
Ines Montani 64f2f84098 Update docstrings and docs [ci skip] 2020-08-10 13:45:22 +02:00
Ines Montani a4b448eec4 Remove unused compiler flag 2020-08-10 13:13:18 +02:00
Ines Montani 3eaeb73342 Tidy up and auto-format 2020-08-09 22:36:23 +02:00
Ines Montani d5c78c7a34 Update docs and fix consistency 2020-08-09 22:31:52 +02:00
Ines Montani 7c6854d8d4 Fix missing imports 2020-08-09 22:28:29 +02:00
Matthew Honnibal 0fc13b2f14 Set version to v3.0.0a6 2020-08-09 21:53:32 +02:00
Ines Montani a15c5fb191 Update docstrings and docs 2020-08-09 16:10:48 +02:00
Ines Montani 8d2baa153d Update tokenizer docs and add test 2020-08-09 15:24:01 +02:00
Matthew Honnibal 134d933d67 Add docstring for entity linker factory 2020-08-09 15:19:28 +02:00
Matthew Honnibal 992ee1c02f Update tagger docstring 2020-08-09 15:09:31 +02:00
Matthew Honnibal ebf9a7acbf Add textcat docstring 2020-08-09 15:07:09 +02:00
Matthew Honnibal 8a13f510d6 Update tests 2020-08-09 15:01:16 +02:00
Matthew Honnibal bbd8acd4bf Add docstrings for parser and NER. Simplify some arguments 2020-08-09 14:46:13 +02:00
Matthew Honnibal 39a3d64c01 Add docstrings for Tok2Vec component 2020-08-09 00:48:03 +02:00
Ines Montani fd20f84927
Merge pull request #5895 from explosion/docs/batchers
Draft docstrings for batchers
2020-08-07 20:07:10 +02:00
Matthew Honnibal f5c4e0b751 Add docstrings for batchers 2020-08-07 18:51:02 +02:00
Ines Montani fe29ceec9e Merge branch 'develop' into docs/model-docstrings 2020-08-07 18:42:01 +02:00
Ines Montani 3a193eb8f1 Fix imports, types and default configs 2020-08-07 18:40:54 +02:00
Matthew Honnibal b1d83fc13e Fix imports 2020-08-07 16:55:54 +02:00
Matthew Honnibal 473504d837 Format 2020-08-07 16:49:00 +02:00
Matthew Honnibal 234c52a91e Add tok2vec docstrings 2020-08-07 16:48:48 +02:00
Matthew Honnibal 547bc8a82b Add docstring notes 2020-08-07 16:17:34 +02:00
Ines Montani 6f3649923c
Merge pull request #5893 from explosion/feature/validate-arg 2020-08-07 15:47:20 +02:00
Adriane Boyd e962784531
Add Lemmatizer and simplify related components (#5848)
* Add Lemmatizer and simplify related components

* Add `Lemmatizer` pipe with `lookup` and `rule` modes using the
`Lookups` tables.
* Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma)
* Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer,
or morph rules)
* Remove lemmatizer from `Vocab`
* Adjust many many tests

Differences:

* No default lookup lemmas
* No special treatment of TAG in `from_array` and similar required
* Easier to modify labels in a `Tagger`
* No extra strings added from morphology / tag map

* Fix test

* Initial fix for Lemmatizer config/serialization

* Adjust init test to be more generic

* Adjust init test to force empty Lookups

* Add simple cache to rule-based lemmatizer

* Convert language-specific lemmatizers

Convert language-specific lemmatizers to component lemmatizers. Remove
previous lemmatizer class.

* Fix French and Polish lemmatizers

* Remove outdated UPOS conversions

* Update Russian lemmatizer init in tests

* Add minimal init/run tests for custom lemmatizers

* Add option to overwrite existing lemmas

* Update mode setting, lookup loading, and caching

* Make `mode` an immutable property
* Only enforce strict `load_lookups` for known supported modes
* Move caching into individual `_lemmatize` methods

* Implement strict when lang is not found in lookups

* Fix tables/lookups in make_lemmatizer

* Reallow provided lookups and allow for stricter checks

* Add lookups asset to all Lemmatizer pipe tests

* Rename lookups in lemmatizer init test

* Clean up merge

* Refactor lookup table loading

* Add helper from `load_lemmatizer_lookups` that loads required and
optional lookups tables based on settings provided by a config.

Additional slight refactor of lookups:

* Add `Lookups.set_table` to set a table from a provided `Table`
* Reorder class definitions to be able to specify type as `Table`

* Move registry assets into test methods

* Refactor lookups tables config

Use class methods within `Lemmatizer` to provide the config for
particular modes and to load the lookups from a config.

* Add pipe and score to lemmatizer

* Simplify Tagger.score

* Add missing import

* Clean up imports and auto-format

* Remove unused kwarg

* Tidy up and auto-format

* Update docstrings for Lemmatizer

Update docstrings for Lemmatizer.

Additionally modify `is_base_form` API to take `Token` instead of
individual features.

* Update docstrings

* Remove tag map values from Tagger.add_label

* Update API docs

* Fix relative link in Lemmatizer API docs
2020-08-07 15:27:13 +02:00
Matthew Honnibal da6e59519e Add docstrings for simple_ner 2020-08-07 15:09:49 +02:00
Matthew Honnibal 7ef8a64df9 Add docstring for parser 2020-08-07 14:59:34 +02:00
Ines Montani fc9a4fe827 Update attribute ruler 2020-08-07 14:43:55 +02:00
Ines Montani a8404c3517 validation -> validate 2020-08-07 14:43:47 +02:00
Ines Montani 1d01d89b79 Update CLI docs and evaluate command [ci skip] 2020-08-07 14:40:58 +02:00
Ines Montani ef2c67cca5
Add DocBin to/from_disk methods and update docs (#5892)
* Add DocBin to/from_disk methods and update docs

* Use DocBin.from_disk in Corpus
2020-08-07 14:30:59 +02:00
Ines Montani 4ca08c6d5d
Merge pull request #5891 from adrianeboyd/docs/attribute-ruler-api
Add AttributeRuler API docs
2020-08-07 13:55:12 +02:00
Adriane Boyd b8d0c23857 Add AttributeRuler API docs
With additional minor updates to AttributeRuler docstrings.
2020-08-07 12:43:23 +02:00
svlandeg b17db0e994 Merge remote-tracking branch 'upstream/develop' into feature/el-docs
# Conflicts:
#	website/docs/usage/training.md
2020-08-06 19:48:52 +02:00
Adriane Boyd 06c3a5e048
Add pipe to AttributeRuler (#5889) 2020-08-06 19:43:09 +02:00
Ines Montani 9b7f198390 Fix format 2020-08-06 19:30:53 +02:00
Ines Montani 3c4389110d Remove unused imports 2020-08-06 19:30:47 +02:00
Matthew Honnibal d4525816ef
Be less choosy about reporting textcat scores (#5879)
* Set textcat scores more consistently

* Refactor textcat scores

* Fixes to scorer

* Add comments

* Add threshold

* Rename just 'f' to micro_f in textcat scorer

* Fix textcat score for two-class

* Fix syntax

* Fix textcat score

* Fix docstring
2020-08-06 16:24:13 +02:00
svlandeg 0b4d1e1bc4 'debug data' instead of 'debug-data' 2020-08-06 15:47:31 +02:00
svlandeg 881e3f8fd0 add docbin explanation and example 2020-08-06 15:29:44 +02:00
Adriane Boyd 5e683a6e46
Fix return values for per feat score (#5885)
* Fix return values for per feat score

Convert `PRFScore` to dict as other per type scores.

* Update tests accordingly
2020-08-06 15:14:47 +02:00
Ines Montani 913d21f0a3
Merge pull request #5882 from explosion/feature/raise-from
Use "raise ... from" in custom errors for better tracebacks
2020-08-06 00:35:26 +02:00
Ines Montani 06e80d95cd
Sync develop with nightly docs state (#5883)
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2020-08-06 00:28:14 +02:00
Ines Montani d92954ac1d
Merge pull request #5881 from explosion/feature/better-error-model-shortcuts 2020-08-06 00:13:35 +02:00
Ines Montani 56c17973aa Use "raise ... from" in custom errors for better tracebacks 2020-08-05 23:53:21 +02:00
Ines Montani 5cc0d89fad
Simplify config overrides in CLI and deserialization (#5880) 2020-08-05 23:35:09 +02:00
Ines Montani 0881455a5d Update error message 2020-08-05 23:15:05 +02:00
Ines Montani 2a1fa86a0d Add better error for failed model shortcut loading 2020-08-05 23:10:29 +02:00
Ines Montani c675746ca2 Update docstrings and types 2020-08-05 20:29:46 +02:00
Ines Montani 823e533dc1
Add config callbacks for modifying nlp object before and after init (#5866)
* WIP: Concept for modifying nlp object before and after init

* Make callbacks return nlp object

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

* Raise if callbacks don't return correct type

* Rename, update types, add after_pipeline_creation

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-05 19:47:54 +02:00
Ines Montani 586d695775 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-05 16:01:11 +02:00
Ines Montani e68459296d Tidy up and auto-format 2020-08-05 16:00:59 +02:00
Matthew Honnibal 50c0e49741 Fix train CLI 2020-08-05 15:40:47 +02:00
Matthew Honnibal b9df4d6116 Fix textcat.begin_training if vectors set 2020-08-05 15:40:36 +02:00
Adriane Boyd 4193402c47
Add warning when Matcher subpattern is discarded (#5873)
* Add a warning when a subpattern is not processed and discarded

* Normalize subpattern attribute/operator keys to upper case like
top-level attributes
2020-08-05 14:56:14 +02:00
Adriane Boyd af125875cf
Update SimpleNER (#5878)
* Fix `get_loss` to use NER annotation
* Add labels as part of cfg
* Add simple overfitting test
2020-08-05 14:43:29 +02:00
Sofie Van Landeghem b88c5c701a
Bugfix in nlp.replace_pipe (#5875)
* bugfix and unit test

* merge two conditions
2020-08-05 09:30:58 +02:00
Ines Montani b795f02fbd
Allow adding pipeline components from source model (#5857)
* Allow adding pipeline components from source model

* Config: name -> component

* Improve error messages

* Fix error and test

* Add frozen components and exclude logic

* Remove exclude from Language.evaluate

* Init sourced components with current vocab

* Fix error codes
2020-08-04 23:39:19 +02:00
Sofie Van Landeghem 34873c4911
Example Dict format consistency (#5858)
* consistently use upper-case IDS in token_annotation format and for get_aligned

* remove ID from to_dict (not used in from_dict either)

* fix test

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-04 22:22:26 +02:00
Adriane Boyd fa79a0db9f
Add AttributeRuler for token attribute exceptions (#5842)
* Add AttributeRuler for token attribute exceptions

Add the `AttributeRuler` to handle exceptions for token-level
attributes. The `AttributeRuler` uses `Matcher` patterns to identify
target spans and applies the specified attributes to the token at the
provided index in the matched span. A negative index can be used to
index from the end of the matched span. The retokenizer is used to
"merge" the individual tokens and assign them the provided attributes.

Helper functions can import existing tag maps and morph rules to the
corresponding `Matcher` patterns.

There is an additional minor bug fix for `MORPH` attributes in the
retokenizer to correctly normalize the values and to handle `MORPH`
alongside `_` in an attrs dict.

* Fix default name

* Update name in error message

* Extend AttributeRuler functionality

* Add option to initialize with a dict of AttributeRuler patterns

* Instead of silently discarding overlapping matches (the default
behavior for the retokenizer if only the attrs differ), split the
matches into disjoint sets and retokenize each set separately. This
allows, for instance, one pattern to set the POS and another pattern to
set the lemma. (If two matches modify the same attribute, it looks like
the attrs are applied in the order they were added, but it may not be
deterministic?)

* Improve types

* Sort spans before processing

* Fix index boundaries in Span

* Refactor retokenizer to separate attrs methods

Add top-level `normalize_token_attrs` and `set_token_attrs` methods.

* Update AttributeRuler to use refactored methods

Update `AttributeRuler` to replace use of full retokenizer with only the
relevant methods for normalizing and setting attributes for a single
token.

* Update spacy/pipeline/attributeruler.py

Co-authored-by: Ines Montani <ines@ines.io>

* Make API more similar to EntityRuler

* Add `AttributeRuler.add_patterns` to add patterns from a list of dicts
* Return list of dicts as property `AttributeRuler.patterns`

* Make attrs_unnormed private

* Add test loading patterns from assets

* Revert "Fix index boundaries in Span"

This reverts commit 8f8a5c3386.

* Add Span index boundary checks (#5861)

* Add Span index boundary checks

* Return Span-specific IndexError in all cases

* Simplify and fix if/else

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 17:02:39 +02:00
Sofie Van Landeghem 492d1ec5de
Prevent alignment when texts don't match (#5867)
* remove empty gold.pyx

* add alignment unit test (to be used in docs)

* ensure that Alignment is only used on equal texts

* additional test using example.alignment

* formatting

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-04 16:29:18 +02:00
Matthew Honnibal ecb3c4e8f4
Create corpus iterator and batcher from registry during training (#5865)
* Move batchers into their own module (and registry)

* Update CLI

* Update Corpus and batcher

* Update tests

* Update one config

* Merge 'evaluation' block back under [training]

* Import batchers in gold __init__

* Fix batchers

* Update config

* Update schema

* Update util

* Don't assume train and dev are actually paths

* Update onto-joint config

* Fix missing import

* Format

* Format

* Update spacy/gold/corpus.py

Co-authored-by: Ines Montani <ines@ines.io>

* Fix name

* Update default config

* Fix get_length option in batchers

* Update test

* Add comment

* Pass path into Corpus

* Update docstring

* Update schema and configs

* Update config

* Fix test

* Fix paths

* Fix print

* Fix create_train_batches

* [training.read_train] -> [training.train_corpus]

* Update onto-joint config

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 15:09:37 +02:00
Sofie Van Landeghem 82347110f5
Default empty KB in EL component (#5872)
* EL field documentation

* documentation consistent with docs

* default empty KB, initialize vocab separately

* formatting

* add test for changing the default entity vector length

* update comment
2020-08-04 14:34:09 +02:00
Adriane Boyd b7e3018d97
Recalculate alignment if tokenization differs (#5868)
* Recalculate alignment if tokenization differs

* Refactor cached alignment data
2020-08-04 14:31:32 +02:00
Adriane Boyd c62fd878a3
Allow Doc.char_span to snap to token boundaries (#5849)
* Allow Doc.char_span to snap to token boundaries

Add a `mode` option to allow `Doc.char_span` to snap to token
boundaries. The `mode` options:

* `strict`: character offsets must match token boundaries (default, same as
before)
* `inside`: all tokens completely within the character span
* `outside`: all tokens at least partially covered by the character span

Add a new helper function `token_by_char` that returns the token
corresponding to a character position in the text. Update
`token_by_start` and `token_by_end` to use `token_by_char` for more
efficient searching.

* Remove unused import

* Rename mode to alignment_mode

Rename `mode` to `alignment_mode` with the options
`strict`/`contract`/`expand`. Any unrecognized modes are silently
converted to `strict`.
2020-08-04 13:36:32 +02:00
Adriane Boyd b841248589
Add Span index boundary checks (#5861)
* Add Span index boundary checks

* Return Span-specific IndexError in all cases

* Simplify and fix if/else
2020-08-04 13:35:25 +02:00
Adriane Boyd cd59979ab4
Fix span boundary handling in Spanish noun_chunks (#5860) 2020-08-03 13:53:15 +02:00
Ines Montani 934447a611
Merge pull request #5855 from svlandeg/fix/cli-debug 2020-08-03 13:09:20 +02:00
Ines Montani 4c055f0aa7
Add init CLI and init config (#5854)
* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
2020-08-02 15:18:30 +02:00
svlandeg 6f4e46ee93 Merge remote-tracking branch 'upstream/develop' into fix/cli-debug
# Conflicts:
#	pyproject.toml
#	requirements.txt
#	setup.cfg
2020-08-01 18:38:59 +02:00
Ines Montani b40f44419b Simplify pipe analysis
- remove unused code
- don't print by default
- integrate attrs info into analysis output
2020-08-01 13:40:06 +02:00
Ines Montani b68c53858c Remove global 2020-07-31 18:37:58 +02:00
Ines Montani 30a76fcf6f Integrate and simplify pipe analysis 2020-07-31 18:34:35 +02:00
svlandeg 9b719dfb1a use divider inbetween steps 2020-07-31 18:06:48 +02:00
svlandeg 51ffc4a166 rename pipe_name to component 2020-07-31 17:58:55 +02:00
svlandeg 878327d38e printing final predictions by default to False 2020-07-31 17:36:32 +02:00
Ines Montani 2d955fbf98 Fix linting [ci skip] 2020-07-31 17:05:28 +02:00
Ines Montani e9e8fa2466 Update docs and types 2020-07-31 17:02:54 +02:00
svlandeg cc2f58a1b0 use data_validation context manager 2020-07-31 16:49:42 +02:00
Adriane Boyd ac14ce7c30
Prefer earlier spans in EntityRuler (#5843)
Similar to #4414, update the sorting in EntityRuler to prefer the first
span in overlapping spans.
2020-07-31 16:09:32 +02:00
svlandeg 5fa3235d06 set DATA_VALIDATION to False for debug_model (upgrade thinc) 2020-07-31 15:21:01 +02:00
svlandeg 08d3c36c20 bugfix in train CLI 2020-07-31 15:03:43 +02:00
Adriane Boyd 9b509aa87f Move Language.evaluate scorer config to new arg
Move `Language.evaluate` scorer config from `component_cfg` to separate
argument `scorer_cfg`.
2020-07-31 11:05:16 +02:00
Adriane Boyd 901801b33b Fix default arguments in DependencyParser.score 2020-07-31 10:55:44 +02:00
Adriane Boyd 9d79916792 Merge branch 'develop' into feature/scorer-adjustments 2020-07-31 10:48:14 +02:00
Sofie Van Landeghem ca491722ad
The Parser is now a Pipe (2) (#5844)
* moving syntax folder to _parser_internals

* moving nn_parser and transition_system

* move nn_parser and transition_system out of internals folder

* moving nn_parser code into transition_system file

* rename transition_system to transition_parser

* moving parser_model and _state to ml

* move _state back to internals

* The Parser now inherits from Pipe!

* small code fixes

* removing unnecessary imports

* remove link_vectors_to_models

* transition_system to internals folder

* little bit more cleanup

* newlines
2020-07-30 23:30:54 +02:00
svlandeg 0b23594953 pipe_name instead of section in debug_model 2020-07-30 20:06:28 +02:00
Rahul Gupta f76fae0e8d
English: adds ordinal numbers (#5830) 2020-07-29 20:22:47 +02:00
Ines Montani 7a21775cd0
Merge pull request #5834 from explosion/feature/vectors 2020-07-29 18:49:26 +02:00
Gustavo Zadrozny Leyendecker 90b958fd01
Fix on EntityRendered to support break lines (after last entity) (closes #5838) 2020-07-29 18:48:39 +02:00
Ines Montani b0f57a0cac Update docs and consistency 2020-07-29 15:14:07 +02:00
Matthew Honnibal a2d573c039 Merge branch 'feature/vectors' of https://github.com/explosion/spaCy into feature/vectors 2020-07-29 14:56:27 +02:00
Matthew Honnibal 2af741d7e3 Fix train arg 2020-07-29 14:56:01 +02:00
Matthew Honnibal c27309f839
Merge branch 'develop' into feature/vectors 2020-07-29 14:54:10 +02:00
Ines Montani 62266fb828 Fix broken type annotation 2020-07-29 14:49:49 +02:00
Matthew Honnibal 142b58be92 Fix import 2020-07-29 14:45:09 +02:00
Matthew Honnibal c99a653070 Adjust textcat model 2020-07-29 14:38:15 +02:00
Matthew Honnibal 9e1b11dd81 Update vectors in textcat 2020-07-29 14:35:36 +02:00
Matthew Honnibal 105cf29967 Fix DocBin 2020-07-29 14:23:13 +02:00
Ines Montani ff0bc05da8 Fix docstrings [ci skip] 2020-07-29 14:09:37 +02:00
Ines Montani 6e2623d3f8 Fix docstring [ci skip] 2020-07-29 14:08:05 +02:00
Ines Montani 8d56260d92 Fix docstrings [ci skip] 2020-07-29 14:07:13 +02:00
Ines Montani 80b18124d2 Fix docstring [ci skip] 2020-07-29 14:03:35 +02:00
Matthew Honnibal f0cf4a2dca Update tests 2020-07-29 14:01:14 +02:00
Matthew Honnibal 07b47eaac8 Update tok2vec layer 2020-07-29 14:01:13 +02:00
Matthew Honnibal 5ae8628571 Fix CharacterEmbed layer 2020-07-29 14:01:13 +02:00
Matthew Honnibal 97d3651574 Fix stray link_vectors_to_models call 2020-07-29 14:01:13 +02:00
Matthew Honnibal c7d1ece3eb Update tests 2020-07-29 14:01:13 +02:00
Matthew Honnibal 00de30bcc2 Update CharacterEmbed function 2020-07-29 14:01:12 +02:00
Matthew Honnibal 6a6b09bd32 Update morphologizer model 2020-07-29 14:01:12 +02:00
Matthew Honnibal 20e9098e3f Update tests 2020-07-29 14:01:12 +02:00
Matthew Honnibal c35d6282fc Add previous HashEmbedCNN tok2vec to make transition easier 2020-07-29 14:01:12 +02:00
Matthew Honnibal 1784c95827 Clean up link_vectors_to_models unused stuff 2020-07-29 14:01:11 +02:00
Matthew Honnibal 0c17ea4c85 Format 2020-07-29 14:00:13 +02:00
Matthew Honnibal 2aff3c4b5a Load vectors in 'spacy train' 2020-07-29 14:00:13 +02:00
Matthew Honnibal 7852a68a75 Fix load_vectors_into_model function 2020-07-29 14:00:13 +02:00
Matthew Honnibal 7299419fe4 Dont load vectors in Language.from_config 2020-07-29 14:00:12 +02:00
Matthew Honnibal 30dd96c540 Load vectors in Language.from_config 2020-07-29 14:00:12 +02:00
Matthew Honnibal df95e2af64 Add load_vectors_into_model util 2020-07-29 14:00:12 +02:00
Matthew Honnibal 475d7c1c7c Fix StaticVectors class 2020-07-29 14:00:11 +02:00
Matthew Honnibal 44d350dc94 Use spaCy's StaticVectors 2020-07-29 14:00:11 +02:00
Matthew Honnibal acc64e138a Add import 2020-07-29 14:00:11 +02:00
Matthew Honnibal 9987ea9e4d Fix Tok2Vec begin_training 2020-07-29 14:00:10 +02:00
Matthew Honnibal 099e9331c5 Fix tok2vec 2020-07-29 14:00:10 +02:00
Matthew Honnibal fe0cdcd461 Fixes 2020-07-29 14:00:09 +02:00
Matthew Honnibal 123f8b832d Refactor Tok2Vec model 2020-07-29 14:00:09 +02:00
Matthew Honnibal c6b4f63c7c Remove obsolete function 2020-07-29 14:00:09 +02:00
Matthew Honnibal 9cc7262224 Draft StaticVectors layer 2020-07-29 14:00:09 +02:00
Matthew Honnibal cb9654e98c WIP on new StaticVectors 2020-07-29 14:00:09 +02:00
Ines Montani e257e66ab9 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-29 11:36:45 +02:00
Ines Montani e0ffe36e79 Update docstrings, docs and types 2020-07-29 11:36:42 +02:00
Sofie Van Landeghem 40c995b1be
Option for returning only greedy matches (#5771)
* add "greedy" option for match pattern

* distinction between greedy FIRST or LONGEST

* check for proper values, throw custom warning otherwise

* unxfail one more test

* add comment in docstring

* add test that LONGEST also prefers first match if equal length

* use c arrays for more efficient processing

* rename 'greediness' to 'greedy'
2020-07-29 11:04:43 +02:00
Adriane Boyd 191a12d75f
Fix score_weights typo in train CLI (#5835) 2020-07-29 11:04:12 +02:00
Adriane Boyd 0cddb0dbe9
Move timing into Language.evaluate (#5836)
Move timing into `Language.evaluate` so that only the processing is
timing, not processing + scoring. `Language.evaluate` returns
`scores["speed"]` as words per second, which should be identical to how
the speed was added to the scores previously. Also add the speed to the
evaluate CLI output.
2020-07-29 11:02:31 +02:00
Adriane Boyd c689ae8f0a Fix types in Scorer 2020-07-29 10:40:30 +02:00
oculusrepairo 03ab518f28
Update examples.py (#5820)
* Update examples.py

adding factual sentences to the list

* Add missing comma separators

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-07-29 10:28:56 +02:00
Ines Montani 7adffc5361 Remove unused schema 2020-07-28 23:12:47 +02:00
Ines Montani e5d9eaf79c Tidy up docstrings and arguments 2020-07-28 23:12:42 +02:00
Ines Montani ac24adec73 Small adjustments to Scorer and docs 2020-07-28 21:39:42 +02:00
Ines Montani 2c7a32cf12 Remove unused methods 2020-07-28 16:50:02 +02:00
Ines Montani ba22111ff4 Move error to Errors 2020-07-28 16:24:14 +02:00
Ines Montani 2748249217 Re-add meta["pipeline"] for now 2020-07-28 16:14:23 +02:00
Ines Montani b83ead5bf5
Merge pull request #5824 from svlandeg/fix/textcat-v3 2020-07-28 15:04:25 +02:00
Ines Montani 06a97a8766 Support --opt=value format in CLI config overrides 2020-07-28 13:43:15 +02:00
Ines Montani ae4d8a6ffd Update docstrings, docs and pipe consistency 2020-07-28 13:37:31 +02:00
Ines Montani 0094cb0d04 Remove scores list from config and document 2020-07-28 11:22:24 +02:00
graue70 b97dbab998
Fix typo in unit tests (#5823) 2020-07-27 20:18:48 +02:00
Ines Montani 894e20c466 Merge branch 'develop' into feature/component-scores 2020-07-27 18:14:39 +02:00
Ines Montani d8b519c23c API docs, docstrings and argument consistency 2020-07-27 18:11:45 +02:00
svlandeg 85b2dcfd67 cleanup 2020-07-27 17:54:44 +02:00
svlandeg 61068e0fb1 util function dot_to_object and corresponding unit test 2020-07-27 17:50:12 +02:00
Ines Montani 10b84e1e27 Add flag to toggle sdist creation on package [ci skip] 2020-07-27 16:52:23 +02:00
Adriane Boyd 34c92dfe63 Add missing Scorer imports 2020-07-27 15:08:51 +02:00