Commit Graph

795 Commits

Author SHA1 Message Date
graue70 0fddc0447c
Fix copy & paste error in API docs 2021-03-02 14:00:14 +01:00
Ines Montani 8f7c7b2658
Merge pull request #7211 from svlandeg/docs/el_update [ci skip]
kb.get_candidates renamed to get_alias_candidates
2021-02-27 11:51:22 +11:00
Ines Montani 408b94887a
Merge pull request #7207 from adrianeboyd/docs/get-noun-chunks [ci skip]
Extend docs related to Vocab.get_noun_chunks
2021-02-27 11:51:08 +11:00
svlandeg 248339039e fix type in docs 2021-02-26 14:27:10 +01:00
svlandeg 08fd901a1b kb.get_candidates renamed to get_alias_candidates 2021-02-25 20:09:36 +01:00
Adriane Boyd 6a37f343d5 Extend docs related to Vocab.get_noun_chunks 2021-02-25 16:38:21 +01:00
Ken fa7ddc7f88
Update sentencizer documentation example with sentencizer pipe name (#7185) 2021-02-24 08:06:54 +01:00
Sofie Van Landeghem b92f81d5da
fix NEL config and IO, and n_sents functionality (#7100)
* fix NEL config and IO, and n_sents functionality

* add docs

* fix test
2021-02-22 14:49:52 +11:00
Sofie Van Landeghem ba5a50f62b
NEL docs & UX (#7129)
* EL set_kb docs fix

* custom warning for set_kb mistake
2021-02-22 11:04:22 +11:00
Sofie Van Landeghem 709c9e75af
span.ent only returns first sentence (#7084)
* return first sentence when span contains sentence boundary

* docs fix

* small fixes

* cleanup
2021-02-19 23:02:38 +11:00
Peter Baumann 61b04a70d5
Run PhraseMatcher on Spans (#6918)
* Add regression test

* Run PhraseMatcher on Spans

* Add test for PhraseMatcher on Spans and Docs

* Add SCA

* Add test with 3 matches in Doc, 1 match in Span

* Update docs

* Use doc.length for find_matches in tokenizer

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-02-10 23:43:32 +11:00
tarskiandhutch e897e7aaad
Line 70: syntax error
Original config definition treated dictionary key as a function argument.
2021-02-08 15:24:57 -05:00
Sofie Van Landeghem 6ed423c16c
reduce memory load when reading all vectors from file (#6945)
* reduce memory load when reading all vectors from file

* one more small typo fix
2021-02-07 08:05:43 +08:00
svlandeg 7cda5605a0 add type 2021-02-03 13:13:58 +01:00
svlandeg 94929c2b98 small doc fixes 2021-02-03 13:10:22 +01:00
Ines Montani a59f3fcf5d Make wheel the default format and update docs [ci skip] 2021-02-01 23:18:43 +11:00
Ines Montani 7752f80f39 Update docs [ci skip] 2021-01-31 16:11:24 +11:00
Ines Montani 45c551037d Update CLI docs [ci skip] 2021-01-30 21:50:23 +11:00
Ines Montani 2332c4280b Update and use unified --build option 2021-01-30 13:11:36 +11:00
Ines Montani 2609ba4e89 Support building wheel in spacy package 2021-01-30 11:54:02 +11:00
Ines Montani 95e958a229
Merge pull request #6852 from explosion/feature/replace-listeners 2021-01-30 00:58:08 +11:00
Ines Montani e766e8c56d
Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-01-29 21:41:17 +11:00
svlandeg d7d838281c adding new="3" mentions in the doc 2021-01-29 11:26:37 +01:00
Ines Montani 99af9e7125 Update documentation 2021-01-29 18:45:48 +11:00
Sofie Van Landeghem 24a697abb8
avoid empty aliases and improve UX and docs (#6840) 2021-01-29 08:51:40 +08:00
Sofie Van Landeghem 837a4f53c2
Error handling in nlp.pipe (#6817)
* add error handler for pipe methods

* add unit tests

* remove pipe method that are the same as their base class

* have Language keep track of a default error handler

* cleanup

* formatting

* small refactor

* add documentation
2021-01-29 08:51:21 +08:00
Ines Montani 230e651ad6 Merge branch 'develop' into master-tmp 2021-01-27 13:26:29 +11:00
Adriane Boyd c447aa2b98 Update --code arg in evaluate CLI docs 2021-01-26 15:30:46 +01:00
jganseman 907bce7a78
Merge pull request #1 from jganseman/patch-1
Patch 1
2021-01-26 11:12:30 +01:00
jganseman 8bc57ec372
also update is_oov in lexeme docs 2021-01-26 11:09:16 +01:00
jganseman 1f2b0ec168
proposing a more concise explanation for is_oov
proposing a more concise explanation for is_oov
2021-01-26 10:53:39 +01:00
Matthew Honnibal f049df1715
Revert "Set annotations in update" (#6810)
* Revert "Set annotations in update (#6767)"

This reverts commit e680efc7cc.

* Fix version

* Update spacy/pipeline/entity_linker.py

* Update spacy/pipeline/entity_linker.py

* Update spacy/pipeline/tagger.pyx

* Update spacy/pipeline/tok2vec.py

* Update spacy/pipeline/tok2vec.py

* Update spacy/pipeline/transition_parser.pyx

* Update spacy/pipeline/transition_parser.pyx

* Update website/docs/api/multilabel_textcategorizer.md

* Update website/docs/api/tok2vec.md

* Update website/docs/usage/layers-architectures.md

* Update website/docs/usage/layers-architectures.md

* Update website/docs/api/transformer.md

* Update website/docs/api/textcategorizer.md

* Update website/docs/api/tagger.md

* Update spacy/pipeline/entity_linker.py

* Update website/docs/api/sentencerecognizer.md

* Update website/docs/api/pipe.md

* Update website/docs/api/morphologizer.md

* Update website/docs/api/entityrecognizer.md

* Update spacy/pipeline/entity_linker.py

* Update spacy/pipeline/multitask.pyx

* Update spacy/pipeline/tagger.pyx

* Update spacy/pipeline/tagger.pyx

* Update spacy/pipeline/textcat.py

* Update spacy/pipeline/textcat.py

* Update spacy/pipeline/textcat.py

* Update spacy/pipeline/tok2vec.py

* Update spacy/pipeline/trainable_pipe.pyx

* Update spacy/pipeline/trainable_pipe.pyx

* Update spacy/pipeline/transition_parser.pyx

* Update spacy/pipeline/transition_parser.pyx

* Update website/docs/api/entitylinker.md

* Update website/docs/api/dependencyparser.md

* Update spacy/pipeline/trainable_pipe.pyx
2021-01-25 22:18:45 +08:00
Adriane Boyd 61c9f8bf24
Remove transformers model max length section (#6807) 2021-01-25 19:59:34 +08:00
Adriane Boyd d0236136a2
Fix default config init in Transformer API docs (#6781) 2021-01-21 23:18:03 +08:00
Sofie Van Landeghem e680efc7cc
Set annotations in update (#6767)
* bump to 3.0.0rc4

* do set_annotations in component update calls

* update docs and remove set_annotations flag

* fix EL test
2021-01-20 11:49:25 +11:00
Ines Montani f50502dad7 Update docs [ci skip] 2021-01-19 00:22:47 +11:00
Sofie Van Landeghem fed8f48965
raise NotImplementedError when noun_chunks iterator is not implemented (#6711)
* raise NotImplementedError when noun_chunks iterator is not implemented

* bring back, fix and document span.noun_chunks

* formatting

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2021-01-17 19:56:05 +08:00
Adriane Boyd bf0cdae8d4
Add token_splitter component (#6726)
* Add long_token_splitter component

Add a `long_token_splitter` component for use with transformer
pipelines. This component splits up long tokens like URLs into smaller
tokens. This is particularly relevant for pretrained pipelines with
`strided_spans`, since the user can't change the length of the span
`window` and may not wish to preprocess the input texts.

The `long_token_splitter` splits tokens that are at least
`long_token_length` tokens long into smaller tokens of `split_length`
size.

Notes:

* Since this is intended for use as the first component in a pipeline,
the token splitter does not try to preserve any token annotation.
* API docs to come when the API is stable.

* Adjust API, add test

* Fix name in factory
2021-01-17 19:54:41 +08:00
Adriane Boyd 9328dd5625
Handle unset token.morph in Morphologizer (#6704)
* Handle unset token.morph in Morphologizer

Handle unset `token.morph` in `Morphologizer.initialize` and
`Morphologizer.get_loss`. If both `token.morph` and `token.pos` are
unset, treat the annotation as missing rather than empty.

* Add token.has_morph()
2021-01-15 17:20:10 +01:00
Adriane Boyd 0c936004d1 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3 2021-01-14 11:49:58 +01:00
Matthew Honnibal f277bfdf0f
Add SpanGroup and Graph container types to represent arbitrary annotations (#6696)
* Draft out initial Spans data structure

* Initial span group commit

* Basic span group support on Doc

* Basic test for span group

* Compile span_group.pyx

* Draft addition of SpanGroup to DocBin

* Add deserialization for SpanGroup

* Add tests for serializing SpanGroup

* Fix serialization of SpanGroup

* Add EdgeC and GraphC structs

* Add draft Graph data structure

* Compile graph

* More work on Graph

* Update GraphC

* Upd graph

* Fix walk functions

* Let Graph take nodes and edges on construction

* Fix walking and getting

* Add graph tests

* Fix import

* Add module with the SpanGroups dict thingy

* Update test

* Rename 'span_groups' attribute

* Try to fix c++11 compilation

* Fix test

* Update DocBin

* Try to fix compilation

* Try to fix graph

* Improve SpanGroup docstrings

* Add doc.spans to documentation

* Fix serialization

* Tidy up and add docs

* Update docs [ci skip]

* Add SpanGroup.has_overlap

* WIP updated Graph API

* Start testing new Graph API

* Update Graph tests

* Update Graph

* Add docstring

Co-authored-by: Ines Montani <ines@ines.io>
2021-01-14 17:30:41 +11:00
Sofie Van Landeghem 75d9019343
Fix types of Tok2Vec encoding architectures (#6442)
* fix TorchBiLSTMEncoder documentation

* ensure the types of the encoding Tok2vec layers are correct

* update references from v1 to v2 for the new architectures
2021-01-07 16:39:27 +11:00
Sofie Van Landeghem 82ae95267a
Docs for pretrain architectures (#6605)
* document pretraining architectures

* formatting

* bit more info

* small fixes
2021-01-06 16:12:30 +11:00
Sofie Van Landeghem afc5714d32
multi-label textcat component (#6474)
* multi-label textcat component

* formatting

* fix comment

* cleanup

* fix from #6481

* random edit to push the tests

* add explicit error when textcat is called with multi-label gold data

* fix error nr

* small fix
2021-01-06 13:07:14 +11:00
Ines Montani 6f83abb971
Merge pull request #6647 from svlandeg/feature/init_config_overwrite 2021-01-05 14:59:04 +11:00
Ines Montani 3614472e29
Merge pull request #6646 from svlandeg/feature/cli-docs [ci skip] 2021-01-05 13:52:49 +11:00
Ines Montani 9c078a5885
Update formatting for consistency [ci skip] 2021-01-05 13:52:28 +11:00
Ines Montani a9e845426f Use --force for consistency and add docs 2021-01-05 13:49:59 +11:00
svlandeg d5ff0fecf8 add docs 2020-12-30 14:01:13 +01:00
svlandeg 2fa23b0304 fix capitalization for link 2020-12-29 15:01:22 +01:00
svlandeg 43cc6aea93 remove non-existing link 2020-12-29 14:59:39 +01:00
svlandeg 543073bf9d add pretrain example 2020-12-29 14:51:23 +01:00
svlandeg 1d0ef98873 move example 2020-12-29 14:46:03 +01:00
svlandeg 20113b8063 add train CLI example 2020-12-29 14:44:56 +01:00
Sofie Van Landeghem 87562e470d
fix backticks in docs (#6635) 2020-12-27 22:12:37 +01:00
Sofie Van Landeghem 8df5b7f513
fix documentation of 'path' in tokenizer.to_disk (#6634) 2020-12-27 22:01:06 +01:00
Sofie Van Landeghem 282a3b49ea
Fix parser resizing when there is no upper layer (#6460)
* allow resizing of the parser model even when upper=False

* update from spacy.TransitionBasedParser.v1 to v2

* bugfix
2020-12-18 18:56:57 +08:00
Gareth Sparks efc229c3f4
Doc.char_span arg: alignment_mode (#6591)
Currently labeled "mode", actually "alignment_mode"
2020-12-18 09:54:56 +01:00
Ines Montani 513c4e332a
Include custom code via spacy package command (#6531) 2020-12-10 20:36:46 +08:00
Ines Montani 2a6043fabb
Merge pull request #6530 from explosion/feature/init-config-cpu-gpu 2020-12-10 09:38:46 +11:00
Ines Montani 9d32e839d3 Merge branch 'develop' into feature/init-config-cpu-gpu 2020-12-10 08:50:53 +11:00
Adriane Boyd 972820e2b3 Add batch_size to data formats docs 2020-12-09 12:44:04 +01:00
Adriane Boyd 80ac8af1bf Format 2020-12-09 12:44:01 +01:00
Adriane Boyd 795b5bd049
Update website/docs/api/language.md
Co-authored-by: Ines Montani <ines@ines.io>
2020-12-09 12:23:32 +01:00
Adriane Boyd fa8fa474a3 Add nlp.batch_size setting
Add a default `batch_size` setting for `Language.pipe` and
`Language.evaluate` as `nlp.batch_size`.
2020-12-09 09:13:26 +01:00
Ines Montani 34449b66fd Update matcher.md 2020-12-09 11:09:45 +11:00
Ines Montani 758ad6c3cd Make CPU the default for init config 2020-12-09 11:00:51 +11:00
Ines Montani 94a5a9814f Update argument handling and documentation 2020-12-08 20:41:18 +11:00
Adriane Boyd 5ceac425ee Remove non-working --use-chars from train CLI
Remove the non-working `--use-chars` option from the train CLI. The
implementation of the option across component types and the CLI settings
could be fixed, but the `CharacterEmbed` model does not work on GPU in
v2 so it's better to remove it.
2020-12-08 08:30:00 +01:00
Sofie Van Landeghem 2c27093c5f
require_cpu functionality (#6336)
* add require_cpu from Thinc 8.0.0rc2

* add docs

* fix test if cupy is not installed
2020-12-08 14:42:40 +08:00
Ines Montani ee2ec52f48
Merge pull request #6409 from svlandeg/feature/trf-docs 2020-12-08 06:32:10 +01:00
Ines Montani 82e88f0e3b
Merge pull request #6379 from svlandeg/fix/labels-constructor 2020-12-08 06:29:56 +01:00
svlandeg 636be3c791 Merge remote-tracking branch 'upstream/develop' into feature/trf-docs 2020-11-19 14:15:35 +01:00
Sofie Van Landeghem 165993d8e5
fix typo in transformer docs (#6404) 2020-11-19 14:11:38 +01:00
svlandeg 73fc1ed963 remove labels from morphologizer constructor 2020-11-11 21:48:50 +01:00
svlandeg fcd79e0655 remove set_morphology from docs 2020-11-11 21:32:34 +01:00
svlandeg 789fb3d124 add docs for upstream argument of TransformerListener 2020-11-09 21:42:58 +01:00
Ines Montani 363ac73c72 Update docs [ci skip] 2020-11-09 12:43:26 +08:00
Adriane Boyd 8644ee3e3f
Update TIGER link and tag description (#6344) 2020-11-05 09:33:00 +01:00
Sofie Van Landeghem 8ef056cf98
fix embed_size in Entity Linker architecture (#6343) 2020-11-04 22:20:13 +01:00
Adriane Boyd a4b32b9552
Handle missing reference values in scorer (#6286)
* Handle missing reference values in scorer

Handle missing values in reference doc during scoring where it is
possible to detect an unset state for the attribute. If no reference
docs contain annotation, `None` is returned instead of a score. `spacy
evaluate` displays `-` for missing scores and the missing scores are
saved as `None`/`null` in the metrics.

Attributes without unset states:

* `token.head`: relies on `token.dep` to recognize unset values
* `doc.cats`: unable to handle missing annotation

Additional changes:

* add optional `has_annotation` check to `score_scans` to replace
`doc.sents` hack
* update `score_token_attr_per_feat` to handle missing and empty morph
representations
* fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START`
vs. `SENT_START`

* Fix import

* Update return types
2020-11-03 15:47:18 +01:00
Sofie Van Landeghem 75a202ce65
TextCat updates and fixes (#6263)
* small fix in example imports

* throw error when train_corpus or dev_corpus is not a string

* small fix in custom logger example

* limit macro_auc to labels with 2 annotations

* fix typo

* also create parents of output_dir if need be

* update documentation of textcat scores

* refactor TextCatEnsemble

* fix tests for new AUC definition

* bump to 3.0.0a42

* update docs

* rename to spacy.TextCatEnsemble.v2

* spacy.TextCatEnsemble.v1 in legacy

* cleanup

* small fix

* update to 3.0.0rc2

* fix import that got lost in merge

* cursed IDE

* fix two typos
2020-10-18 14:50:41 +02:00
Ines Montani 4d99d2b94a Update docs [ci skip] 2020-10-13 11:38:52 +02:00
svlandeg 40276fd3be update NEL docs after latest refactor 2020-10-12 11:41:27 +02:00
Ines Montani e50dc2c1c9 Update docs [ci skip] 2020-10-09 12:04:52 +02:00
Ines Montani 329b61ee7b Update docs [ci skip] 2020-10-09 10:36:06 +02:00
Sofie Van Landeghem d093d6343b
TrainablePipe (#6213)
* rename Pipe to TrainablePipe

* split functionality between Pipe and TrainablePipe

* remove unnecessary methods from certain components

* cleanup

* hasattr(component, "pipe") should be sufficient again

* remove serialization and vocab/cfg from Pipe

* unify _ensure_examples and validate_examples

* small fixes

* hasattr checks for self.cfg and self.vocab

* make is_resizable and is_trainable properties

* serialize strings.json instead of vocab

* fix KB IO + tests

* fix typos

* more typos

* _added_strings as a set

* few more tests specifically for _added_strings field

* bump to 3.0.0a36
2020-10-08 21:33:49 +02:00
Ines Montani 064575d79d
Merge pull request #6216 from svlandeg/feature/nel-initialize 2020-10-08 11:14:12 +02:00
Ines Montani 43e59bb22a Update docs and install extras [ci skip] 2020-10-08 10:58:50 +02:00
svlandeg eaf5c265cb set_kb method for entity_linker 2020-10-08 10:34:01 +02:00
Ines Montani 2fd7122074 Update docs [ci skip] 2020-10-06 10:31:48 +02:00
Ines Montani 568e12215d
Merge pull request #6206 from svlandeg/fix/patterns-init 2020-10-06 10:27:23 +02:00
svlandeg 9b4cf7b0b6 update output of debug config command 2020-10-06 09:47:23 +02:00
svlandeg fd0f60e2bc updates to data format for training and pretraining 2020-10-06 09:28:53 +02:00
svlandeg ff9ac39c88 read entity_ruler patterns with srsly.read_jsonl.v1 2020-10-05 22:50:14 +02:00
Ines Montani 1a554bdcb1 Update docs and docstring [ci skip] 2020-10-05 21:55:27 +02:00
Matthew Honnibal 919790cb47 Upd MultiHashEmbed docs 2020-10-05 20:28:21 +02:00
svlandeg 193e0d5a98 add docs for entity_ruler.initialize 2020-10-05 18:04:08 +02:00
svlandeg 65abd77779 add finish_update to Pipe 2020-10-05 16:23:33 +02:00
Ines Montani 0f64556c04
Merge pull request #6197 from svlandeg/feature/pipe-docs [ci skip] 2020-10-05 11:55:40 +02:00