Commit Graph

987 Commits

Author SHA1 Message Date
Adriane Boyd 14df00ae98 Add Morphology and MorphAnalysis API docs
Add initial draft of `Morphology` and `MorphAnalysis` API docs.
2020-07-21 10:33:46 +02:00
Ines Montani 644074b954 Merge branch 'develop' into master-tmp 2020-07-20 14:58:04 +02:00
Adriane Boyd 986f7e4d69 Initial draft of Morphologizer API docs 2020-07-20 12:53:02 +02:00
Adriane Boyd 39ebcd9ec9
Refactor Chinese tokenizer configuration (#5736)
* Refactor Chinese tokenizer configuration

Refactor `ChineseTokenizer` configuration so that it uses a single
`segmenter` setting to choose between character segmentation, jieba, and
pkuseg.

* replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting
`segmenter` with the supported values: `char`, `jieba`, `pkuseg`
* make the default segmenter plain character segmentation `char` (no
additional libraries required)

* Fix Chinese serialization test to use char default

* Warn if attempting to customize other segmenter

Add a warning if `Chinese.pkuseg_update_user_dict` is called when
another segmenter is selected.
2020-07-19 13:34:37 +02:00
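The `segmenter` refactor in #5736 above replaces several boolean flags with one setting. A minimal sketch of that idea follows. It is an illustration only and not spaCy's code: `segment`, `segmenter_from_legacy_flags`, and the precedence between the old flags are all hypothetical, and only the `char` mode is implemented because it needs no extra library.

```python
from typing import List

# Values named in the commit message for the single `segmenter` setting.
SUPPORTED_SEGMENTERS = ("char", "jieba", "pkuseg")


def segmenter_from_legacy_flags(use_jieba: bool = False,
                                use_pkuseg: bool = False) -> str:
    """Map the old boolean flags onto one setting (hypothetical helper).

    Assumption: pkuseg wins if both flags are set. The commit message does
    not state a precedence.
    """
    if use_pkuseg:
        return "pkuseg"
    if use_jieba:
        return "jieba"
    return "char"


def segment(text: str, segmenter: str = "char") -> List[str]:
    """Split text according to the chosen segmenter (toy version)."""
    if segmenter not in SUPPORTED_SEGMENTERS:
        raise ValueError(
            f"Unknown segmenter {segmenter!r}; expected one of {SUPPORTED_SEGMENTERS}"
        )
    if segmenter == "char":
        # Plain character segmentation: one token per character.
        return list(text)
    # The other modes would delegate to external libraries; omitted here.
    raise NotImplementedError(f"{segmenter!r} requires an external library")


print(segment("中文"))  # → ['中', '文']
```

A single string setting means one place to validate the value, and an unsupported combination of flags can no longer be expressed.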
Adriane Boyd cd5af72c9a
Update pkuseg version (#5774)
* Update pkuseg version in Chinese tokenizer warnings
* Update pkuseg version in `Makefile`
* Remove warning about python3.8 wheels in docs
2020-07-19 11:09:49 +02:00
Ines Montani 872938ec76
Merge pull request #5747 from explosion/feature/refactor-config-args 2020-07-14 00:00:22 +02:00
Ines Montani 5f6f4ff594 Remove object subclassing 2020-07-12 14:03:23 +02:00
Ines Montani c96535e338 Update command docstrings and docs 2020-07-12 13:53:49 +02:00
Ines Montani 3f948b9c74 Update docs 2020-07-12 12:32:28 +02:00
Ines Montani 11bbc82c24 Update cli.md [ci skip] 2020-07-10 23:37:52 +02:00
Ines Montani 9455b060d2 Update cli.md 2020-07-10 22:57:22 +02:00
Ines Montani 7b5717cac3 Merge branch 'develop' into feature/refactor-config-args 2020-07-10 22:50:07 +02:00
Ines Montani e6a6587a9a Update projects.md [ci skip] 2020-07-10 22:41:27 +02:00
Ines Montani f2cd982e7b Update training.md 2020-07-10 22:34:27 +02:00
Ines Montani 52e9b5b472 Fix formatting 2020-07-09 23:25:58 +02:00
Ines Montani 28cdae898a Update projects.md 2020-07-09 22:35:54 +02:00
Ines Montani 7bcf9f7cfb Document new features 2020-07-09 21:10:36 +02:00
Ines Montani ea01831f6a Update projects docs etc. 2020-07-09 19:43:25 +02:00
Ines Montani 175d34d8f9 Update sidebar menu 2020-07-09 11:44:09 +02:00
Ines Montani 9ee5b71412 Update cli.md 2020-07-09 11:44:00 +02:00
Ines Montani 9ae4040183 Update API docs 2020-07-08 13:34:35 +02:00
svlandeg c94279ac1b remove tensors, fix predict, get_loss and set_annotations 2020-07-08 13:11:54 +02:00
svlandeg 90b100c39f remove component.Model, update constructor, losses is return value of update 2020-07-08 12:14:30 +02:00
Ines Montani 2298e129e6 Update example and training docs 2020-07-07 20:30:12 +02:00
svlandeg 2b60e894cb fix component constructors, update, begin_training, reference to GoldParse 2020-07-07 19:17:19 +02:00
svlandeg 14a796e3f9 add Example API with examples of Example usage 2020-07-07 14:46:41 +02:00
Ines Montani bb3ee38cf9 Update WIP 2020-07-06 22:22:37 +02:00
Ines Montani 44da24ddd0 Update doc.md 2020-07-06 18:17:00 +02:00
Ines Montani 44790c1c32 Update docs and add keyword-only tag 2020-07-06 18:14:57 +02:00
Ines Montani a35236e5f0 Update v3 docs WIP [ci skip] 2020-07-06 15:57:44 +02:00
Ines Montani 63247cbe87 Update v3 docs [ci skip] 2020-07-05 16:11:16 +02:00
Matthew Honnibal 3e78e82a83
Experimental character-based pretraining (#5700)
* Use cosine loss in Cloze multitask

* Fix char_embed for gpu

* Call resume_training for base model in train CLI

* Fix bilstm_depth default in pretrain command

* Implement character-based pretraining objective

* Use chars loss in ClozeMultitask

* Add method to decode predicted characters

* Fix number characters

* Rescale gradients for mlm

* Fix char embed+vectors in ml

* Fix pipes

* Fix pretrain args

* Move get_characters_loss

* Fix import

* Fix import

* Mention characters loss option in pretrain

* Remove broken 'self attention' option in pretrain

* Revert "Remove broken 'self attention' option in pretrain"

This reverts commit 56b820f6af.

* Document 'characters' objective of pretrain
2020-07-05 15:48:39 +02:00
Ines Montani dc8c9d912f Update docs [ci skip] 2020-07-04 16:47:24 +02:00
Ines Montani 4498dfe99d Update docs 2020-07-04 16:25:30 +02:00
Ines Montani 1e0d54edd1 Update docs 2020-07-04 14:23:10 +02:00
Ines Montani fe224dc2dd Merge branch 'develop' into nightly.spacy.io 2020-07-03 16:48:27 +02:00
Ines Montani 06f1ecb308 Update v3 docs 2020-07-03 16:48:21 +02:00
Ines Montani cdf9ee1716 Add stub for Example API docs [ci skip] 2020-07-03 15:46:10 +02:00
Ines Montani fa8e097c04 Update convert docs [ci skip] 2020-07-03 15:42:04 +02:00
Jan Jessewitsch e4dcac4a4b
Merging multiple docs into one (#5032)
* Add static method to Doc to allow merging of multiple docs.

* Add error description for the error that occurs if docs with different
vocabs (from different languages) are merged in Doc.from_docs().

* Add test for Doc.from_docs() implementation.

* Fix using numpy's concatenate in Doc.from_docs.

* Replace typing's type annotations in from_docs.

* Simply remove type annotations in from_docs.

* Add documentation for Doc.from_docs to api.

* Simplify from_docs, its test and the api doc for codebase consistency.

* Fix merging of Doc objects that end with whitespaces (Achieved by simply not setting the SPACY attribute on whitespace tokens). Remove two unnecessary imports of attributes.

* Add merging of user data from Doc objects in from_docs. Add user data test case to corresponding test. Add applicable warning messages.

* Fix incorrect setting of tokens idx by using concatenated spaces (again). Add test case to corresponding test.

* Add MORPH to attrs

* Update warnings calls

* Remove out-dated error from merge

* Rename space_delimiter to ensure_whitespace

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-07-03 11:32:42 +02:00
Adriane Boyd a723fa02a1
DocBin: add version number, missing attributes and strings (#5685)
* Add version number to DocBin

Add a version number to DocBin for future use.

* Add POS to all attributes in DocBin

* Add morph string to strings in DocBin

* Update DocBin API

* Add string for ENT_KB_ID in DocBin
2020-07-02 17:41:50 +02:00
Ines Montani b5268955d7 Update matcher usage examples [ci skip] 2020-07-02 15:39:45 +02:00
Ines Montani a4cfe9fc33 Remove inline notes on v2 changes [ci skip] 2020-07-01 22:29:22 +02:00
Ines Montani fe4cfd0632 Start updating website for v3 [ci skip] 2020-07-01 21:26:39 +02:00
Ines Montani 26df4efa94 Add new in v3.0 2020-07-01 13:02:17 +02:00
Ines Montani 18a900abc2 Fix markup 2020-07-01 13:02:07 +02:00
Ines Montani 414dc7ace1 Merge branch 'spacy.io' into spacy.io-develop 2020-07-01 11:47:47 +02:00
Álvaro Abella Bascarán 7111b9de2e Fix in docs: pipe(docs) instead of pipe(texts) (#5680)
Very minor fix in docs, specifically in this part:

```
matcher = PhraseMatcher(nlp.vocab)
for doc in matcher.pipe(texts, batch_size=50):
    pass
```

`texts` suggests the input is an iterable of strings. I replaced it with `docs`.
2020-06-30 20:01:12 +02:00
Álvaro Abella Bascarán ff0dbe5c64
Fix in docs: pipe(docs) instead of pipe(texts) (#5680)
Very minor fix in docs, specifically in this part:

```
matcher = PhraseMatcher(nlp.vocab)
for doc in matcher.pipe(texts, batch_size=50):
    pass
```

`texts` suggests the input is an iterable of strings. I replaced it with `docs`.
2020-06-30 20:00:50 +02:00
Matthias Hertel 305221f3e5 Website: fixed the token span in the text about the rule-based matching example (#5669)
* fixed token span in pattern matcher example

* contributor agreement
2020-06-30 19:58:55 +02:00
Matthias Hertel 8b0f749606
Website: fixed the token span in the text about the rule-based matching example (#5669)
* fixed token span in pattern matcher example

* contributor agreement
2020-06-30 19:58:23 +02:00
Adriane Boyd d777d9cc38 Extend v2.3 migration guide (#5653)
* Extend preloaded vocab section

* Add section on tag maps
2020-06-26 14:13:01 +02:00
Adriane Boyd c4d0209472
Extend v2.3 migration guide (#5653)
* Extend preloaded vocab section

* Add section on tag maps
2020-06-26 14:12:29 +02:00
Adriane Boyd a2660bd9c6 Fix backslashes in warnings config diff (#5640)
Fix backslashes in warnings config diff in v2.3 migration section.
2020-06-24 10:26:57 +02:00
Adriane Boyd fd4287c178
Fix backslashes in warnings config diff (#5640)
Fix backslashes in warnings config diff in v2.3 migration section.
2020-06-24 10:26:12 +02:00
Adriane Boyd 4f73ced914 Extend what's new in v2.3 with vocab / is_oov (#5635) 2020-06-23 16:50:43 +02:00
Adriane Boyd 7ce451c211
Extend what's new in v2.3 with vocab / is_oov (#5635) 2020-06-23 16:48:59 +02:00
Adriane Boyd fcdecefacf Add warnings example in v2.3 migration guide (#5627) 2020-06-22 14:38:06 +02:00
Adriane Boyd bc1cb30b21
Add warnings example in v2.3 migration guide (#5627) 2020-06-22 14:37:24 +02:00
Ines Montani 52728d8fa3 Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
Adriane Boyd 66889de166 Warning for sudachipy 0.4.5 (#5611) 2020-06-19 13:45:23 +02:00
Adriane Boyd 931d80de72
Warning for sudachipy 0.4.5 (#5611) 2020-06-19 12:43:41 +02:00
Ines Montani 6d712f3e06
Merge pull request #5599 from adrianeboyd/docs/v2.3.0-minor 2020-06-16 13:49:25 -07:00
Adriane Boyd 02369f91d3 Fix spacy convert argument 2020-06-16 20:41:17 +02:00
Adriane Boyd f0fd77648f Change example title to Dr.
Change example title to Dr. so the current model does exclude the title
in the initial example.
2020-06-16 20:36:21 +02:00
Adriane Boyd a6abdfbc3c Fix numpy.zeros() dtype for Doc.from_array 2020-06-16 20:35:45 +02:00
Adriane Boyd 9aff317ca7 Update POS in tagging example 2020-06-16 20:26:57 +02:00
Adriane Boyd 457babfa0c Update alignment example for new gold.align 2020-06-16 20:22:03 +02:00
Ines Montani 44af53bdd9 Add pkuseg warnings and auto-format [ci skip] 2020-06-16 17:13:35 +02:00
Ines Montani a9e5b840ee Fix typos and auto-format [ci skip] 2020-06-16 16:38:45 +02:00
Adriane Boyd d5110ffbf2
Documentation updates for v2.3.0 (#5593)
* Update website models for v2.3.0

* Add docs for Chinese word segmentation

* Tighten up Chinese docs section

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Merge branch 'master' into docs/v2.3.0 [ci skip]

* Auto-format and update version

* Update matcher.md

* Update languages and sorting

* Typo in landing page

* Infobox about token_match behavior

* Add meta and basic docs for Japanese

* POS -> TAG in models table

* Add info about lookups for normalization

* Updates to API docs for v2.3

* Update adding norm exceptions for adding languages

* Add --omit-extra-lookups to CLI API docs

* Add initial draft of "What's New in v2.3"

* Add new in v2.3 tags to Chinese and Japanese sections

* Add tokenizer to migration section

* Add new in v2.3 flags to init-model

* Typo

* More what's new in v2.3

Co-authored-by: Ines Montani <ines@ines.io>
2020-06-16 15:37:35 +02:00
Sofie Van Landeghem c0f4a1e43b
train is from-config by default (#5575)
* verbose and tag_map options

* adding init_tok2vec option and only changing the tok2vec that is specified

* adding omit_extra_lookups and verifying textcat config

* wip

* pretrain bugfix

* add replace and resume options

* train_textcat fix

* raw text functionality

* improve UX when KeyError or when input data can't be parsed

* avoid unnecessary access to goldparse in TextCat pipe

* save performance information in nlp.meta

* add noise_level to config

* move nn_parser's defaults to config file

* multitask in config - doesn't work yet

* scorer offering both F and AUC options, need to be specified in config

* add textcat verification code from old train script

* small fixes to config files

* clean up

* set default config for ner/parser to allow create_pipe to work as before

* two more test fixes

* small fixes

* cleanup

* fix NER pickling + additional unit test

* create_pipe as before
2020-06-12 02:02:07 +02:00
Sofie Van Landeghem 4d1ba6feb4
add tag variant for 2.3 (#5542) 2020-06-04 19:16:33 +02:00
Ines Montani 810fce3bb1 Merge branch 'develop' into master-tmp 2020-06-03 14:36:59 +02:00
svlandeg 5f0a91cf37 fix conv-depth parameter 2020-05-29 09:56:29 +02:00
Ines Montani 262d306eaa unicode -> str consistency 2020-05-24 17:23:00 +02:00
Ines Montani 5d3806e059 unicode -> str consistency 2020-05-24 17:20:58 +02:00
Jannis aa53ce6996
Documentation Typo Fix (#5492)
* Fix typo

Change 'realize' to 'realise'

* Add contributor agreement
2020-05-22 19:50:26 +02:00
Matthew Honnibal f6078d866a
Merge pull request #5121 from adrianeboyd/bugfix/revert-token-match
Revert token_match priority changes from #4374 and extend token match options
2020-05-22 14:42:51 +02:00
Ines Montani 65c7e82de2 Auto-format and remove 2.3 feature [ci skip] 2020-05-22 13:50:30 +02:00
Adriane Boyd e4a1b5dab1 Rename to url_match
Rename to `url_match` and update docs.
2020-05-22 12:41:03 +02:00
Adriane Boyd 730fa493a4 Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-22 12:18:00 +02:00
Ines Montani 24f72c669c Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
Sofie Van Landeghem 0d94737857
Feature toggle_pipes (#5378)
* make disable_pipes deprecated in favour of the new toggle_pipes

* rewrite disable_pipes statements

* update documentation

* remove bin/wiki_entity_linking folder

* one more fix

* remove deprecated link to documentation

* few more doc fixes

* add note about name change to the docs

* restore original disable_pipes

* small fixes

* fix typo

* fix error number to W096

* rename to select_pipes

* also make changes to the documentation

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-18 22:27:10 +02:00
Ines Montani f333c2a011
Merge pull request #5386 from svlandeg/fix/nel-docs 2020-05-10 12:00:09 +02:00
adrianeboyd 4a15b559ba
Clarify Token.pos as UPOS (#5419) 2020-05-08 10:36:25 +02:00
adrianeboyd a2345618f1
Fix Token API docs from #5375 (#5418) 2020-05-08 10:25:02 +02:00
Adriane Boyd 565e0eef73 Add tokenizer option for token match with affixes
To fix the slow tokenizer URL (#4374) and allow `token_match` to take
priority over prefixes and suffixes by default, introduce a new
tokenizer option for a token match pattern that's applied after prefixes
and suffixes but before infixes.
2020-05-05 10:35:33 +02:00
Adriane Boyd 792c8af8cf Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match 2020-05-05 09:25:57 +02:00
svlandeg ebaed7dcfa Few more updates to the EL documentation 2020-04-30 10:17:06 +02:00
adrianeboyd bdff76dede
Various updates/additions to CLI scripts (#5362)
* `debug-data`: determine coverage of provided vectors

* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization

* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file

* `train`:

  * if training on GPU, only run evaluation/timing on CPU in the first
    iteration

  * if training is aborted, exit with a non-0 exit status
2020-04-29 12:56:46 +02:00
Sofie Van Landeghem cfdaf99b80
Fix passing of component configuration (#5374)
* add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument

* add fix and test for Issue 5137
2020-04-29 12:56:17 +02:00
Sofie Van Landeghem f67343295d
Update NEL examples and documentation (#5370)
* simplify creation of KB by skipping dim reduction

* small fixes to train EL example script

* add KB creation and NEL training example scripts to example section

* update descriptions of example scripts in the documentation

* moving wiki_entity_linking folder from bin to projects

* remove test for wiki NEL functionality that is being moved
2020-04-29 12:53:53 +02:00
adrianeboyd a6e521cd79
Add is_sent_end token property (#5375)
Reconstruction of the original PR #4697 by @MiniLau.

Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
2020-04-29 12:53:16 +02:00
adrianeboyd 90ce34db42
Add cuda101 and cuda102 options to setup (#5377)
* Add cuda101 and cuda102 options to setup

* Update cudaNNN options in docs
2020-04-29 12:51:12 +02:00
adrianeboyd 792aa7b6ab
Remove references to textcat spans (#5360)
Remove references to unimplemented `TextCategorizer` span labels in
`GoldParse` and `Doc`.
2020-04-27 18:01:12 +02:00
adrianeboyd 90c754024f
Update nlp.vectors to nlp.vocab.vectors (#5357) 2020-04-27 10:53:05 +02:00
Mike 481574cbc8
[minor doc change] embedding vis. link is broken in `website/docs/usage/examples.md` (#5325)
* The embedding vis. link is broken

The first link seems to be reasonable for now unless someone has an updated embedding vis they want to share?

* contributor agreement

* Update Mlawrence95.md

* Update website/docs/usage/examples.md

Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-21 20:35:12 +02:00
laszabine fb73d4943a
Amend documentation to Language.evaluate (#5319)
* Specified usage of arguments to Language.evaluate

* Created contributor agreement
2020-04-16 20:00:18 +02:00
Sofie Van Landeghem a3965ec13d
tag-map-path since 2.2.4 instead of 2.2.3 (#5289) 2020-04-14 14:53:47 +02:00
Marek Grzenkowicz 6a8a52650f
[Closes #5292] Fix typo in option name "--n-save_every" (#5293)
* Sign contributor agreement for chopeen

* Fix typo in option name and close #5292
2020-04-11 23:35:01 +02:00
Sofie Van Landeghem 1137420840
Small doc fixes (#5250)
* fix link

* torchtext instead of tochtext
2020-04-03 13:01:43 +02:00
Nikhil Saldanha d1ddfa1cb7 update docs for EntityRecognizer.predict
return type was wrongly written as a tuple, changed to syntax.StateClass
2020-03-28 18:13:02 +01:00
Sofie Van Landeghem 9b412516e7
Fixing pickling of the parser (#5218)
* fix __reduce__ for pickling parser

* setting the move object as 'state' during pickling

* unskip test_issue4725 - works again
2020-03-27 19:35:26 +01:00
Ines Montani 46568f40a7 Merge branch 'master' into tmp/sync 2020-03-26 13:38:14 +01:00
Tiljander e53232533b
Describing priority rules for overlapping matches (#5197)
* Describing priority rules for overlapping matches

* Create Tiljander.md

* Describing priority rules for overlapping matches

* Update website/docs/api/entityruler.md

Co-Authored-By: Ines Montani <ines@ines.io>

Co-authored-by: Ines Montani <ines@ines.io>
2020-03-26 13:13:22 +01:00
adrianeboyd d88a377bed
Remove Vectors.from_glove (#5209) 2020-03-26 10:45:47 +01:00
Ines Montani 17bd9ed84f
Merge pull request #5153 from pinealan/fix/website-docs
Fix website typos and weird sentences
2020-03-16 15:03:01 +01:00
Alan Chan 2124be100d Tweak run-on sentence 2020-03-15 03:45:20 +08:00
Alan Chan 7c3a4ce933 Missing word in api/cli doc 2020-03-15 03:45:20 +08:00
Alan Chan 36e3532475 Remove unfinished sentence 2020-03-15 03:45:17 +08:00
Mark Abraham a0ffa346c0 Fix broken link in docs 2020-03-13 14:07:26 +01:00
Ines Montani c669435c62
Merge pull request #5125 from renaud/patch-1
small typo in code sample
2020-03-12 11:19:12 +01:00
svlandeg 1724a4f75b additional information if doc is empty 2020-03-09 18:08:18 +01:00
Renaud Richardet eccf6b1686
small typo in code sample 2020-03-09 14:49:11 +01:00
Adriane Boyd 0c31f03ec5 Update docs [ci skip] 2020-03-09 13:41:17 +01:00
Adriane Boyd 1139247532 Revert changes to token_match priority from #4374
* Revert changes to priority of `token_match` so that it has priority
over all other tokenizer patterns

* Add lookahead and potentially slow lookbehind back to the default URL
pattern

* Expand character classes in URL pattern to improve matching around
lookaheads and lookbehinds related to #4882

* Revert changes to Hungarian tokenizer

* Revert (xfail) several URL tests to their status before #4374

* Update `tokenizer.explain()` and docs accordingly
2020-03-09 12:09:41 +01:00
Ines Montani 1d6aec805d Fix formatting and update docs for v2.2.4 2020-03-09 11:17:20 +01:00
Ines Montani acb4e3c7ba
Merge pull request #5039 from adrianeboyd/typo/website-token-api-shape
Fix formatting in Token API
2020-02-25 14:57:25 +01:00
Sofie Van Landeghem 479bd8d09f
add lemma option to displacy 'dep' visualiser (#5041)
* add lemma option to displacy 'dep' visualiser

* more compact list comprehension

* add option to doc

* fix test and add lemmas to util.get_doc

* fix capital

* remove lemma from get_doc

* cleanup
2020-02-22 14:11:51 +01:00
Adriane Boyd 3853d385fa Fix formatting in Token API 2020-02-20 13:41:24 +01:00
Ines Montani de11ea753a Merge branch 'master' into develop 2020-02-18 14:47:23 +01:00
Kabir Khan f6ed07b85c
Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931)
* Fix ent_ids and labels properties when id attribute used in patterns

* use set for labels

* sort end_ids for comparison in entity_ruler tests

* fixing entity_ruler ent_ids test

* add to set

* Run make_doc optimistically if using phrase matcher patterns.

* remove unused coveragerc I was testing with

* format

* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.

* Removing old add_patterns function

* Fixing spacing

* Make sure token_patterns loaded as well, before generator was being emptied in from_disk
2020-02-16 18:17:47 +01:00
Julin S 479e81bafc
fix link (#4977) 2020-02-10 20:31:26 -05:00
Ines Montani 9c08d9baa3 Remove old sections [ci skip] (closes #4961) 2020-02-03 13:10:46 +01:00
Ines Montani abd5c06374 Adjust formatting [ci skip] 2020-02-03 13:00:02 +01:00
Martin A. Kayser 02a44c5be2
Adding a note on retrieving the string rep of the match_id (#4904)
Stolen from here: https://stackoverflow.com/questions/47638877/using-phrasematcher-in-spacy-to-find-multiple-match-types
2020-02-03 12:58:58 +01:00
adrianeboyd 7ad000fce7 Update docs for train CLI --use_gpu option (#4927) 2020-01-20 17:02:47 +01:00
Preston Badeer b216ff43c9 Update vectors-similarity.md (#4889)
These links are broken on the website, due to quotes around the URLs.
2020-01-08 16:49:40 +01:00
Geoffrey Gordon Ashbrook 53929138d7 remove extra word typo (#4875)
"let you find you"
2020-01-06 12:37:42 +01:00
Ines Montani 400257a802 Update index.md [ci skip] 2020-01-04 01:52:18 +01:00
Ivan Echevarria ef13e0c038 Add n_process to Language.pipe documentation (#4842) [ci skip]
* Add n_process to documentation

* Auto-format and add default [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-29 14:23:33 +01:00
Ines Montani db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani 158b98a3ef Merge branch 'master' into develop 2019-12-21 18:55:03 +01:00
Ines Montani 1b838d1313 Divide models into core and starters [ci skip] 2019-12-21 14:10:22 +01:00
Sofie Van Landeghem 8ebbb85117 Documentation for PhraseMatcher constructor (#4826)
* add max_length as argument for init PhraseMatcher

* improve error message too
2019-12-20 23:00:04 +01:00
Thiago Lages de Alencar a067ded495 Update doc.md (#4796) 2019-12-11 18:21:40 +01:00
Tclack88 ab8dc2732c Update token.md (#4767)
* Update token.md

documentation is confusing: A '?' is a right punct, but '¿' is a left punct

* Update token.md

add quotations around parentheses in `is_left_punct` and `is_right_punct` for clarification, ensuring the question mark that follows is not perceived as an example of left and right punctuation

* Move quotes into code block [ci skip]
2019-12-06 19:22:02 +01:00
Ines Montani bf611ebca7 Document jsonl option on converter [ci skip] 2019-12-06 19:17:45 +01:00
Nicolai Bjerre Pedersen de5453cdcb Fix link to user hooks in docs (#4778)
* Fix link to user hooks in docs

* Update mr_bjerre.md

Mistake in contributor agreement

* Apparently hard to get it right (wrong name of sca)
2019-12-06 19:17:12 +01:00
Ines Montani cbacb0f1a4 Update shape docs and examples (resolves #4615) [ci skip] 2019-11-23 17:16:55 +01:00
Ines Montani a6200bc424 Update scorer.md [ci skip] 2019-11-21 17:02:43 +01:00
Ines Montani 235fe6fe3b Auto-format [ci skip] 2019-11-20 13:14:58 +01:00
adrianeboyd 2c876eb672 Add tokenizer explain() debugging method (#4596)
* Expose tokenizer rules as a property

Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)

Add tests and update Tokenizer API docs.

* Update Hungarian punctuation to remove empty string

Update Hungarian punctuation definitions so that `_units` does not match
an empty string.

* Use _load_special_tokenization consistently

Use `_load_special_tokenization()` and have it handle `None` checks.

* Fix precedence of `token_match` vs. special cases

Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.

* Add `make_debug_doc()` to the Tokenizer

Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.

Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.

* Update tokenization usage docs

Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.

* Revert "Update Hungarian punctuation to remove empty string"

This reverts commit f0a577f7a5.

* Rework `make_debug_doc()` as `explain()`

Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.

* Handle cases with bad tokenizer patterns

Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.

* Remove unused displacy image

* Add tokenizer.explain() to usage docs
2019-11-20 13:07:25 +01:00
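The `(pattern_string, token_string)` return shape and the empty-match guard described in #4596 above can be illustrated with a toy tokenizer. This is a simplified sketch under stated assumptions, not spaCy's implementation: `PREFIX_RE`, `SUFFIX_RE`, `SPECIAL_CASES`, and `explain_toy` are hypothetical names, the rule labels are invented, and infixes and `token_match` are left out.

```python
import re
from typing import List, Tuple

# Hypothetical toy rules; real tokenizer rules are far larger.
PREFIX_RE = re.compile(r"""^[(\["']""")
SUFFIX_RE = re.compile(r"""[)\]"'.,!?]$""")
SPECIAL_CASES = {"don't": ["do", "n't"]}


def explain_toy(text: str) -> List[Tuple[str, str]]:
    """Return (rule_label, token_string) pairs for a whitespace-split text."""
    out: List[Tuple[str, str]] = []
    for chunk in text.split():
        suffixes: List[Tuple[str, str]] = []
        while chunk:
            # Special cases are checked before any affix is split off.
            if chunk in SPECIAL_CASES:
                out.extend(("SPECIAL", tok) for tok in SPECIAL_CASES[chunk])
                chunk = ""
                break
            match = PREFIX_RE.search(chunk)
            # Guard: skip empty matches so a bad pattern cannot loop forever.
            if match and match.end() > 0:
                out.append(("PREFIX", match.group()))
                chunk = chunk[match.end():]
                continue
            match = SUFFIX_RE.search(chunk)
            if match and match.start() < len(chunk):
                suffixes.append(("SUFFIX", match.group()))
                chunk = chunk[:match.start()]
                continue
            out.append(("TOKEN", chunk))
            chunk = ""
        # Suffixes were collected outermost first; emit them in text order.
        out.extend(reversed(suffixes))
    return out


print(explain_toy("(don't!)"))
# → [('PREFIX', '('), ('SPECIAL', 'do'), ('SPECIAL', "n't"), ('SUFFIX', '!'), ('SUFFIX', ')')]
```

Returning plain tuples rather than a document object keeps the debugging output easy to print and compare in tests.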
Ines Montani e8b9cee6fd Make example consistent with model (closes #4587) [ci skip] 2019-11-18 12:41:48 +01:00
Ines Montani e01a1a237f Auto-format [ci skip] 2019-11-18 12:41:31 +01:00
adrianeboyd 62e00fd9da Update tokenization usage docs (#4666)
Update pseudo-code and algorithm description to correspond to current
tokenizer behavior.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.
2019-11-18 12:35:13 +01:00
Ines Montani 5adcb352e9 Adjust order of docs sections [ci skip] 2019-11-17 16:08:56 +01:00
Ines Montani e30d08410a
Add CI for Python 3.8 (#4479)
* Add 3.8 classifier

* Update azure-pipelines.yml

* Remove 3.8 warning from docs [ci skip]
2019-11-15 01:13:48 +01:00
adrianeboyd faaa832518 Generalize handling of tokenizer special cases (#4259)
* Generalize handling of tokenizer special cases

Handle tokenizer special cases more generally by using the Matcher
internally to match special cases after the affix/token_match
tokenization is complete.

Instead of only matching special cases while processing balanced or
nearly balanced prefixes and suffixes, this recognizes special cases in
a wider range of contexts:

* Allows arbitrary numbers of prefixes/affixes around special cases
* Allows special cases separated by infixes

Existing tests/settings that couldn't be preserved as before:

* The emoticon '")' is no longer a supported special case
* The emoticon ':)' in "example:)" is a false positive again

When merged with #4258 (or the relevant cache bugfix), the affix and
token_match properties should be modified to flush and reload all
special cases to use the updated internal tokenization with the Matcher.

* Remove accidentally added test case

* Really remove accidentally added test

* Reload special cases when necessary

Reload special cases when affixes or token_match are modified. Skip
reloading during initialization.

* Update error code number

* Fix offset and whitespace in Matcher special cases

* Fix offset bugs when merging and splitting tokens
* Set final whitespace on final token in inserted special case

* Improve cache flushing in tokenizer

* Separate cache and specials memory (temporarily)
* Flush cache when adding special cases
* Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()`
are necessary due to this bug:
https://github.com/explosion/preshed/issues/21

* Remove reinitialized PreshMaps on cache flush

* Update UD bin scripts

* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)

* Use special Matcher only for cases with affixes

* Reinsert specials cache checks during normal tokenization for special
cases as much as possible
  * Additionally include specials cache checks while splitting on infixes
  * Since the special Matcher needs consistent affix-only tokenization
    for the special cases themselves, introduce the argument
    `with_special_cases` in order to do tokenization with or without
    specials cache checks
* After normal tokenization, postprocess with special cases Matcher for
special cases containing affixes

* Replace PhraseMatcher with Aho-Corasick

Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.

Fixes #4308.

* Restore support for pickling

* Fix internal keyword add/remove for numpy arrays

* Add test for #4248, clean up test

* Improve efficiency of special cases handling

* Use PhraseMatcher instead of Matcher
* Improve efficiency of merging/splitting special cases in document
  * Process merge/splits in one pass without repeated token shifting
  * Merge in place if no splits

* Update error message number

* Remove UD script modifications

Only used for timing/testing, should be a separate PR

* Remove final traces of UD script modifications

* Update UD bin scripts

* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)

* Add missing loop for match ID set in search loop

* Remove cruft in matching loop for partial matches

There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.

* Replace dict trie with MapStruct trie

* Fix how match ID hash is stored/added

* Update fix for match ID vocab

* Switch from map_get_unless_missing to map_get

* Switch from numpy array to Token.get_struct_attr

Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.

Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches

* Implement full remove()

Remove unnecessary trie paths and free unused maps.

Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.

* Switch to PhraseMatcher.find_matches

* Switch to local cdef functions for span filtering

* Switch special case reload threshold to variable

Refer to variable instead of hard-coded threshold

* Move more of special case retokenize to cdef nogil

Move as much of the special case retokenization to nogil as possible.

* Rewrap sort as stdsort for OS X

* Rewrap stdsort with specific types

* Switch to qsort

* Fix merge

* Improve cmp functions

* Fix realloc

* Fix realloc again

* Initialize span struct while retokenizing

* Temporarily skip retokenizing

* Revert "Move more of special case retokenize to cdef nogil"

This reverts commit 0b7e52c797.

* Revert "Switch to qsort"

This reverts commit a98d71a942.

* Fix specials check while caching

* Modify URL test with emoticons

The multiple suffix tests result in the emoticon `:>`, which is now
retokenized into one token as a special case after the suffixes are
split off.

* Refactor _apply_special_cases()

* Use cdef ints for span info used in multiple spots

* Modify _filter_special_spans() to prefer earlier

Parallel to #4414, modify _filter_special_spans() so that the earlier
span is preferred for overlapping spans of the same length.

* Replace MatchStruct with Entity

Replace MatchStruct with Entity since the existing Entity struct is
nearly identical.

* Replace Entity with more general SpanC

* Replace MatchStruct with SpanC

* Add error in debug-data if no dev docs are available (see #4575)

* Update azure-pipelines.yml

* Revert "Update azure-pipelines.yml"

This reverts commit ed1060cf59.

* Use latest wasabi

* Reorganise install_requires

* add dframcy to universe.json (#4580)

* Update universe.json [ci skip]

* Fix multiprocessing for as_tuples=True (#4582)

* Fix conllu script (#4579)

* force extensions to avoid clash between example scripts

* fix arg order and default file encoding

* add example config for conllu script

* newline

* move extension definitions to main function

* few more encodings fixes

* Add load_from_docbin example [ci skip]

TODO: upload the file somewhere

* Update README.md

* Add warnings about 3.8 (resolves #4593) [ci skip]

* Fixed typo: Added space between "recognize" and "various" (#4600)

* Fix DocBin.merge() example (#4599)

* Replace function registries with catalogue (#4584)

* Replace functions registries with catalogue

* Update __init__.py

* Fix test

* Revert unrelated flag [ci skip]

* Bugfix/dep matcher issue 4590 (#4601)

* add contributor agreement for prilopes

* add test for issue #4590

* fix on_match params for DependencyMatcher (#4590)

* Minor updates to language example sentences (#4608)

* Add punctuation to Spanish example sentences

* Combine multilanguage examples for lang xx

* Add punctuation to nb examples

* Always realloc to a larger size

Avoid potential (unlikely) edge case and cymem error seen in #4604.

* Add error in debug-data if no dev docs are available (see #4575)

* Update debug-data for GoldCorpus / Example

* Ignore None label in misaligned NER data
2019-11-13 21:24:35 +01:00
f11r 877971860e Fix assert in sentencizer documentation. (#4639) 2019-11-13 15:24:14 +01:00
Ines Montani 9d5ff177c4 Work around Markdown rendering issue surfaced in #4600 [ci skip] 2019-11-11 17:12:08 +01:00
adrianeboyd 0f8678c0b1 Fix DocBin.merge() example (#4599) 2019-11-07 11:26:48 +01:00
walterhenry 5563c42ef5 Fixed typo: Added space between "recognize" and "various" (#4600) 2019-11-06 23:06:36 +01:00
Ines Montani 828ef27a32 Add warnings about 3.8 (resolves #4593) [ci skip] 2019-11-05 18:30:11 +01:00
Ines Montani 59358d9b71 Remove box-decoration-break from entities in displacy (#4564) 2019-10-31 15:09:43 +01:00
Ines Montani 4e1de85e43 Update syntax iterators [ci skip] 2019-10-30 14:31:40 +01:00
Matthew Honnibal d5509e0989 Support Mish activation (requires Thinc 7.3) (#4536)
* Add arch for MishWindowEncoder

* Support mish in tok2vec and conv window >=2

* Pass new tok2vec settings from parser

* Syntax error

* Fix tok2vec setting

* Fix registration of MishWindowEncoder

* Fix receptive field setting

* Fix mish arch

* Pass more options from parser

* Support more tok2vec options in pretrain

* Require thinc 7.3

* Add docs [ci skip]

* Require thinc 7.3.0.dev0 to run CI

* Run black

* Fix typo

* Update Thinc version


Co-authored-by: Ines Montani <ines@ines.io>
2019-10-28 15:16:33 +01:00
Ines Montani cfffdba7b1 Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522)
* Implement new API for {Phrase}Matcher.add (backwards-compatible)

* Update docs

* Also update DependencyMatcher.add

* Update internals

* Rewrite tests to use new API

* Add basic check for common mistake

Raise error with suggestion if user likely passed in a pattern instead of a list of patterns

* Fix typo [ci skip]
2019-10-25 22:21:08 +02:00
Ines Montani d2da117114 Also support passing list to Language.disable_pipes (#4521)
* Also support passing list to Language.disable_pipes

* Adjust internals
2019-10-25 16:19:08 +02:00
Ines Montani 493be8e9db Update new version identifier [ci skip] 2019-10-25 11:42:49 +02:00
Ines Montani 2abf1028cb Update docs [ci skip] 2019-10-25 11:27:00 +02:00
Ines Montani f31876154d Adjust formatting [ci skip] 2019-10-25 11:19:46 +02:00
Kabir Khan 93640373c7 Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513)
* Update entityruler.py

* Making ent_id resolution 2x faster and adding docs

* Fixing newlines in docstrings

* Fixing newlines in docstrings
2019-10-25 11:16:42 +02:00
adrianeboyd 1b0bbe4b76 Update tag maps and docs for English and German (#4501)
* Update English tag_map

Update English tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/en-penn-uposf.html

* Update German tag_map

Update German tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/de-stts-uposf.html

* Add missing Tiger dependencies to glossary

* Add quotes to definition of TO

* Update POS/TAG tables in docs

Update POS/TAG tables for English and German docs using current
information generated from the tag_maps and GLOSSARY.

* Update warning that -PRON- is specific to English

* Revert docs to default JSON output with convert

* Revert "Revert docs to default JSON output with convert"

This reverts commit 6b78c048f1.
2019-10-24 12:56:05 +02:00
adrianeboyd 8516e9d53b Support train dict format as JSONL (#4471)
* Support train dict format as JSONL

* Add (overly simple) check for dict vs. tuple to read JSONL lines as
either train dicts or train tuples

* Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()`
and `GoldCorpus.train_tuples`

* Revert docs to default JSON output with convert
2019-10-23 16:01:44 +02:00
adrianeboyd 7fc39f124c Fix logic in rules+model entity example [ci skip] (#4510) 2019-10-23 14:41:21 +02:00
Ines Montani 4659435573 Fix argument type in PhraseMatcher.add docs (closes #4496) [ci skip] 2019-10-22 14:37:30 +02:00
Ines Montani b2f88e2060 Fix formatting [ci skip] 2019-10-21 12:26:07 +02:00
adrianeboyd 3195a8f170 Add Entity Linking to menu (#4489) 2019-10-21 12:17:30 +02:00
Pepe Berba 7772d5d3c5 Update `vocab.get_vector` docs to include features on Fasttext ngram (#4464)
* Update `vocab.get_vector`

* Added contrib agreement
2019-10-20 01:28:18 +02:00
Ghola 258eb9e064 Misspelling on Lemmatizer Example #4406 (#4449)
Removing extra o in the lookups = Loookups()
2019-10-16 23:23:15 +02:00
Anastassia 4a77d03ff7 Fix documentation for the docs_to_json function (#4456) 2019-10-16 23:17:58 +02:00
Ines Montani 573e543e4a Alphanumeric -> alphabetic [ci skip]
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
Ines Montani e65dffd80b Clarify serialization of extension attributes (closes #4377) [ci skip] 2019-10-05 11:58:00 +02:00
Sofie Van Landeghem 4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ines Montani ce1d441de5 Add docs for Vectors.most_similar [ci skip] 2019-10-03 14:29:47 +02:00
Ines Montani 80cf385f65 Update v2-2.md [ci skip] 2019-10-02 16:58:21 +02:00
Ines Montani b6670bf0c2 Use consistent spelling 2019-10-02 10:37:39 +02:00
Ines Montani 475e3188ce Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip] 2019-10-01 21:59:50 +02:00
Ines Montani 0dd127bb00 Update v2-2.md [ci skip] 2019-10-01 21:37:06 +02:00
Ines Montani cf65a80f36 Refactor lemmatizer and data table integration (#4353)
* Move test

* Allow default in Lookups.get_table

* Start with blank tables in Lookups.from_bytes

* Refactor lemmatizer to hold instance of Lookups

* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need

* Update tests and docs

* Fix more tests

* Fix lemmatizer

* Upgrade pytest to try and fix weird CI errors

* Try pytest 4.6.5
2019-10-01 21:36:03 +02:00
Ines Montani bc7e7db208 Fix wording [ci skip] 2019-10-01 14:20:44 +02:00
Ines Montani 2a3a4565cd Update infobox [ci skip] 2019-10-01 14:19:34 +02:00
Ines Montani 66aa0d479f Update v2.2 page [ci skip] 2019-10-01 14:11:05 +02:00
Ines Montani a8a1800f2a Update lemma data documentation [ci skip] 2019-10-01 13:22:13 +02:00
Ines Montani 932ad9cb91 Fix typos and formatting [ci skip] 2019-10-01 12:30:04 +02:00
Ines Montani 3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
Ines Montani 3bd4da068e Fix link [ci skip] 2019-09-29 17:30:38 +02:00
Ines Montani 089f44cc56 Update serialization docs [ci skip] 2019-09-29 17:11:13 +02:00
Ines Montani c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Ines Montani 10742d3219 Update v2 docs [ci skip] 2019-09-28 15:57:22 +02:00
Ines Montani f8d1e2f214 Update CLI docs [ci skip] 2019-09-28 13:12:30 +02:00
Ines Montani 59beab8405 Update v2-2.md [ci skip] 2019-09-27 18:10:43 +02:00
Ines Montani 685e4b2554 Update v2-2.md [ci skip] 2019-09-27 16:35:01 +02:00
Ines Montani aad66d9bb9 Document PhraseMatcher.remove [ci skip] 2019-09-27 16:34:53 +02:00
Ines Montani eb0649e38e Fix tag [ci skip] 2019-09-26 16:22:33 +02:00
Ines Montani da9a869d3f Update vectors name docs [ci skip] 2019-09-26 16:21:32 +02:00
Em Zhan aafa091541 Fix typo in documentation (#4322)
* Fix typo 'probj' instead of 'pobj'

* Add spaCy contributor agreement for zqianem
2019-09-25 19:42:18 +02:00
Matthew Honnibal 92ed4dc5e0 Allow vectors name to be set in init-model (#4321)
* Allow vectors name to be specified in init-model

* Document --vectors-name argument to init-model

* Update website/docs/api/cli.md

Co-Authored-By: Ines Montani <ines@ines.io>
2019-09-25 13:11:00 +02:00