Commit Graph

1649 Commits

Author SHA1 Message Date
Ines Montani e30d08410a
Add CI for Python 3.8 (#4479)
* Add 3.8 classifier

* Update azure-pipelines.yml

* Remove 3.8 warning from docs [ci skip]
2019-11-15 01:13:48 +01:00
f11r 877971860e Fix assert in sentencizer documentation. (#4639) 2019-11-13 15:24:14 +01:00
Ines Montani 9d5ff177c4 Work around Markdown rendering issue surfaced in #4600 [ci skip] 2019-11-11 17:12:08 +01:00
adrianeboyd 0f8678c0b1 Fix DocBin.merge() example (#4599) 2019-11-07 11:26:48 +01:00
walterhenry 5563c42ef5 Fixed typo: Added space between "recognize" and "various" (#4600) 2019-11-06 23:06:36 +01:00
Ines Montani 828ef27a32 Add warnings about 3.8 (resolves #4593) [ci skip] 2019-11-05 18:30:11 +01:00
Ines Montani 4b95587ad4 Update universe.json [ci skip] 2019-11-04 13:55:55 +01:00
Yash Patadia 0c396aeed4 add dframcy to universe.json (#4580) 2019-11-04 13:53:23 +01:00
Ines Montani 59358d9b71
Remove box-decoration-break from entities in displacy (#4564) 2019-10-31 15:09:43 +01:00
Ines Montani 4e1de85e43 Update syntax iterators [ci skip] 2019-10-30 14:31:40 +01:00
Ines Montani 726c5dd306 Update universe.json [ci skip] 2019-10-30 13:29:00 +01:00
Neel Kamath 6c036ab57d Add "spaCy Server" to spaCy Universe (#4553)
* Add "spaCy Server" to spaCy Universe

* Accept the spaCy Contributor Agreement
2019-10-30 13:20:46 +01:00
Nipun Sadvilkar 2a5e71232b project: pySBD - Python Sentence Boundary Disambiguation (#4455)
*   project: pySBD - Python Sentence Boundary Disambiguation

* 📝  Update links and description

* 🐛  Fix missing comma

* Update universe.json

pysbd as a spacy component through entrypoints

* 🚨  Fix universe.json

* 📝  Update code_example
2019-10-30 12:13:29 +01:00
Matthew Honnibal d5509e0989 Support Mish activation (requires Thinc 7.3) (#4536)
* Add arch for MishWindowEncoder

* Support mish in tok2vec and conv window >=2

* Pass new tok2vec settings from parser

* Syntax error

* Fix tok2vec setting

* Fix registration of MishWindowEncoder

* Fix receptive field setting

* Fix mish arch

* Pass more options from parser

* Support more tok2vec options in pretrain

* Require thinc 7.3

* Add docs [ci skip]

* Require thinc 7.3.0.dev0 to run CI

* Run black

* Fix typo

* Update Thinc version


Co-authored-by: Ines Montani <ines@ines.io>
2019-10-28 15:16:33 +01:00
Ines Montani 1180304449 Update languages.json [ci skip] 2019-10-26 13:51:42 +02:00
Ines Montani cfffdba7b1 Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522)
* Implement new API for {Phrase}Matcher.add (backwards-compatible)

* Update docs

* Also update DependencyMatcher.add

* Update internals

* Rewrite tests to use new API

* Add basic check for common mistake

Raise error with suggestion if user likely passed in a pattern instead of a list of patterns

* Fix typo [ci skip]
2019-10-25 22:21:08 +02:00
Ines Montani d2da117114 Also support passing list to Language.disable_pipes (#4521)
* Also support passing list to Language.disable_pipes

* Adjust internals
2019-10-25 16:19:08 +02:00
Ines Montani 493be8e9db Update new version identifier [ci skip] 2019-10-25 11:42:49 +02:00
Ines Montani 2abf1028cb Update docs [ci skip] 2019-10-25 11:27:00 +02:00
Ines Montani f31876154d Adjust formatting [ci skip] 2019-10-25 11:19:46 +02:00
Kabir Khan 93640373c7 Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513)
* Update entityruler.py

* Making ent_id resolution 2x faster and adding docs

* Fixing newlines in docstrings

* Fixing newlines in docstrings
2019-10-25 11:16:42 +02:00
adrianeboyd 1b0bbe4b76 Update tag maps and docs for English and German (#4501)
* Update English tag_map

Update English tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/en-penn-uposf.html

* Update German tag_map

Update German tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/de-stts-uposf.html

* Add missing Tiger dependencies to glossary

* Add quotes to definition of TO

* Update POS/TAG tables in docs

Update POS/TAG tables for English and German docs using current
information generated from the tag_maps and GLOSSARY.

* Update warning that -PRON- is specific to English

* Revert docs to default JSON output with convert

* Revert "Revert docs to default JSON output with convert"

This reverts commit 6b78c048f1.
2019-10-24 12:56:05 +02:00
adrianeboyd 8516e9d53b Support train dict format as JSONL (#4471)
* Support train dict format as JSONL

* Add (overly simple) check for dict vs. tuple to read JSONL lines as
either train dicts or train tuples

* Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()`
and `GoldCorpus.train_tuples`

* Revert docs to default JSON output with convert
2019-10-23 16:01:44 +02:00
adrianeboyd 7fc39f124c Fix logic in rules+model entity example [ci skip] (#4510) 2019-10-23 14:41:21 +02:00
Ines Montani 388ea03065 Update universe.json [ci skip] 2019-10-22 14:54:47 +02:00
Kabir Khan 8a7a30ea1d Add cookiecutter-spacy-fastapi to spacy universe (#4498) 2019-10-22 14:50:40 +02:00
Ines Montani 4659435573 Fix argument type in PhraseMatcher.add docs (closes #4496) [ci skip] 2019-10-22 14:37:30 +02:00
Julin S 3ee15fce0d Update information about Rasa (#4492)
Rasa has been updated and rasa core and rasa nlu have been merged.
2019-10-22 14:32:31 +02:00
Ines Montani b2f88e2060 Fix formatting [ci skip] 2019-10-21 12:26:07 +02:00
adrianeboyd 3195a8f170 Add Entity Linking to menu (#4489) 2019-10-21 12:17:30 +02:00
Pepe Berba 7772d5d3c5 Update `vocab.get_vector` docs to include features on Fasttext ngram (#4464)
* Update `vocab.get_vector`

* Added contrib agreement
2019-10-20 01:28:18 +02:00
Ghola 258eb9e064 Misspelling on Lemmatizer Example #4406 (#4449)
Removing extra o in the lookups = Loookups()
2019-10-16 23:23:15 +02:00
Anastassia 4a77d03ff7 Fix documentation for the docs_to_json function (#4456) 2019-10-16 23:17:58 +02:00
Ines Montani 5cbe21700b Only show label scheme if not empty [ci skip] 2019-10-08 15:52:59 +02:00
Ines Montani 8f76d6c9ef Update transformer model details [ci skip] 2019-10-08 15:39:38 +02:00
Ines Montani 573e543e4a Alphanumeric -> alphabetic [ci skip]
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
Ines Montani e65dffd80b Clarify serialization of extension attributes (closes #4377) [ci skip] 2019-10-05 11:58:00 +02:00
Ines Montani e7ddc6f662 Add conda install for lookups [ci skip] 2019-10-03 17:52:53 +02:00
Sofie Van Landeghem 4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ines Montani ce1d441de5 Add docs for Vectors.most_similar [ci skip] 2019-10-03 14:29:47 +02:00
Ines Montani 80cf385f65 Update v2-2.md [ci skip] 2019-10-02 16:58:21 +02:00
Ines Montani 12a941d841 Update binder version [ci skip] 2019-10-02 16:47:01 +02:00
Ines Montani b6670bf0c2 Use consistent spelling 2019-10-02 10:37:39 +02:00
Ines Montani 475e3188ce Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip] 2019-10-01 21:59:50 +02:00
Ines Montani 0dd127bb00 Update v2-2.md [ci skip] 2019-10-01 21:37:06 +02:00
Ines Montani cf65a80f36 Refactor lemmatizer and data table integration (#4353)
* Move test

* Allow default in Lookups.get_table

* Start with blank tables in Lookups.from_bytes

* Refactor lemmatizer to hold instance of Lookups

* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need

* Update tests and docs

* Fix more tests

* Fix lemmatizer

* Upgrade pytest to try and fix weird CI errors

* Try pytest 4.6.5
2019-10-01 21:36:03 +02:00
Ines Montani bc7e7db208 Fix wording [ci skip] 2019-10-01 14:20:44 +02:00
Ines Montani 2a3a4565cd Update infobox [ci skip] 2019-10-01 14:19:34 +02:00
Ines Montani 66aa0d479f Update v2.2 page [ci skip] 2019-10-01 14:11:05 +02:00
Ines Montani a8a1800f2a Update lemma data documentation [ci skip] 2019-10-01 13:22:13 +02:00
Ines Montani 932ad9cb91 Fix typos and formatting [ci skip] 2019-10-01 12:30:04 +02:00
Ines Montani ca0b20ae8b Make prereleases less verbose [ci skip] 2019-10-01 12:29:14 +02:00
Ines Montani 61263e2fbc Update universe.json [ci skip] 2019-09-30 13:49:44 +02:00
Ines Montani 71bd040834 Update models.js [ci skip] 2019-09-30 12:01:09 +02:00
Ines Montani 3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
Ines Montani 3bd4da068e Fix link [ci skip] 2019-09-29 17:30:38 +02:00
Ines Montani 089f44cc56 Update serialization docs [ci skip] 2019-09-29 17:11:13 +02:00
Ines Montani c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Ines Montani 10742d3219 Update v2 docs [ci skip] 2019-09-28 15:57:22 +02:00
Ines Montani a2815f6643 Fix model table display [ci skip] 2019-09-28 14:23:03 +02:00
Ines Montani 129670283e Pass meta labels through correctly [ci skip] 2019-09-28 14:08:33 +02:00
Ines Montani f8d1e2f214 Update CLI docs [ci skip] 2019-09-28 13:12:30 +02:00
Ines Montani 59beab8405 Update v2-2.md [ci skip] 2019-09-27 18:10:43 +02:00
Ines Montani 685e4b2554 Update v2-2.md [ci skip] 2019-09-27 16:35:01 +02:00
Ines Montani aad66d9bb9 Document PhraseMatcher.remove [ci skip] 2019-09-27 16:34:53 +02:00
Ines Montani 3624153591 Update languages.json [ci skip] 2019-09-27 15:15:41 +02:00
Ajinkya Kale 975aebd7e4 typo fix for wordnet_annotator (#4326) 2019-09-27 11:52:53 +02:00
Ines Montani eb0649e38e Fix tag [ci skip] 2019-09-26 16:22:33 +02:00
Ines Montani da9a869d3f Update vectors name docs [ci skip] 2019-09-26 16:21:32 +02:00
Em Zhan aafa091541 Fix typo in documentation (#4322)
* Fix typo 'probj' instead of 'pobj'

* Add spaCy contributor agreement for zqianem
2019-09-25 19:42:18 +02:00
Matthew Honnibal 92ed4dc5e0
Allow vectors name to be set in init-model (#4321)
* Allow vectors name to be specified in init-model

* Document --vectors-name argument to init-model

* Update website/docs/api/cli.md

Co-Authored-By: Ines Montani <ines@ines.io>
2019-09-25 13:11:00 +02:00
Eric Semeniuc 09816f8323 update sense2vec version (#4320) 2019-09-25 12:17:54 +02:00
Sofie Van Landeghem 42340740e3 update neuralcoref example (#4317) 2019-09-24 10:47:17 +02:00
Ines Montani 197406de1d Update v2-2.md [ci skip] 2019-09-19 14:33:58 +02:00
Ines Montani ddc09b08ed Update v2-2.md [ci skip] 2019-09-19 00:58:30 +02:00
Matthew Honnibal e2047576c4 Fix merge conflict 2019-09-18 21:42:11 +02:00
Matthew Honnibal 46c02d25b1 Merge changes to test_ner 2019-09-18 21:41:24 +02:00
Ines Montani d84763727c Remove unused setting [ci skip] 2019-09-18 21:24:14 +02:00
Ines Montani 9c940eab94 Update version in examples [ci skip] 2019-09-18 21:23:26 +02:00
Ines Montani f873548f6c Add backwards incompatibility [ci skip] 2019-09-18 21:21:48 +02:00
Ines Montani 6ebdc5f7d2 Update download docs [ci skip] 2019-09-18 21:21:39 +02:00
Ines Montani dd1810f05a Update DocBin and add docs 2019-09-18 20:23:21 +02:00
Ines Montani d62690b3ba Update examples 2019-09-18 19:57:36 +02:00
Ines Montani bd435faddd Add note about usage docs [ci skip] 2019-09-18 19:56:43 +02:00
Matthew Honnibal 931e96b6c7 DocPallet->DocBin in docs 2019-09-18 15:17:26 +02:00
Matthew Honnibal f537cbeacc Update v2-2 docs 2019-09-18 14:07:55 +02:00
Ines Montani c922f8e8b0 Fix sources rendering [ci skip] 2019-09-18 12:09:21 +02:00
Ines Montani ea2a686cf7 Support new model sources format [ci skip] 2019-09-18 11:42:45 +02:00
Ines Montani ee15fdfe88 Fix wording [ci skip] 2019-09-17 14:59:42 +02:00
Ines Montani f566e69f38 Fix --vectors-loc docs (closes #4270) 2019-09-17 14:59:12 +02:00
Ines Montani 25c2b4b9a5 Improve init-model docs (see #4137) 2019-09-17 14:51:44 +02:00
Ines Montani 198b7e9789 Auto-format [ci skip] 2019-09-17 14:48:35 +02:00
adrianeboyd b5d999e510 Add textcat to train CLI (#4226)
* Add doc.cats to spacy.gold at the paragraph level

Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.

* `spacy.gold.docs_to_json()` writes `docs.cats`

* `GoldCorpus` reads in cats in each `GoldParse`

* Update instances of gold_tuples to handle cats

Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.

* Add textcat to train CLI

* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
  * For binary exclusive classes with provided label: F1 for label
  * For 2+ exclusive classes: F1 macro average
  * For multilabel (not exclusive): ROC AUC macro average (currently
relying on sklearn)
* Provide user info on textcat evaluation settings, potential
incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
pipeline
* Add textcat evaluation to evaluate CLI

* Fix handling of unset arguments and config params

Fix handling of unset arguments and model confiug parameters in Scorer
initialization.

* Temporarily add sklearn requirement

* Remove sklearn version number

* Improve Scorer handling of models without textcats

* Fixing Scorer handling of models without textcats

* Update Scorer output for python 2.7

* Modify inf in Scorer for python 2.7

* Auto-format

Also make small adjustments to make auto-formatting with black easier and produce nicer results

* Move error message to Errors

* Update documentation

* Add cats to annotation JSON format [ci skip]

* Fix tpl flag and docs [ci skip]

* Switch to internal roc_auc_score

Switch to internal `roc_auc_score()` adapted from scikit-learn.

* Add AUCROCScore tests and improve errors/warnings

* Add tests for AUCROCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors

* Make reduced roc_auc_score functions private

Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.

* Check that data corresponds with multilabel flag

Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.

* Add textcat score to early stopping check

* Add more checks to debug-data for textcat

* Add example training data for textcat

* Add more checks to textcat train CLI

* Check configuration when extending base model
* Fix typos

* Update textcat example data

* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
data.


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-15 22:31:31 +02:00
Ines Montani bab9976d9a
💫 Adjust Table API and add docs (#4289)
* Adjust Table API and add docs

* Add attributes and update description [ci skip]

* Use strings.get_string_id instead of hash_string

* Fix table method calls

* Make orth arg in Lemmatizer.lookup optional

Fall back to string, which is now handled by Table.__contains__ out-of-the-box

* Fix method name

* Auto-format
2019-09-15 22:08:13 +02:00
Ines Montani 23e28e2844 Merge branch 'master' into develop 2019-09-15 17:57:09 +02:00
Ines Montani 57f4c088be Use full model name in quickstart install [ci skip] 2019-09-15 17:56:54 +02:00
Ines Montani c7e4ea7154 Update examples and languages.json [ci skip] 2019-09-15 17:56:40 +02:00
Ines Montani 16c2522791 Merge branch 'master' into develop 2019-09-14 16:42:01 +02:00
Ines Montani 86befc80bf WIP: Add v2.2 page [ci skip] 2019-09-14 16:41:48 +02:00
Ines Montani 04d36d2471 Remove unused link [ci skip] 2019-09-14 16:41:19 +02:00
Ines Montani 76d26a3d5e Update site.json [ci skip] 2019-09-14 16:32:24 +02:00
Ines Montani fe87ccc8d1 Update languages.json [ci skip] 2019-09-14 16:23:50 +02:00
Ines Montani 5c8b5e68ec Fix docs consistency [ci skip] 2019-09-14 16:23:37 +02:00
Ines Montani bbf7337eaf Update adding languages docs [ci skip] 2019-09-14 15:32:15 +02:00
Ines Montani 3126dd0904 Tidy up and auto-format [ci skip] 2019-09-14 12:58:06 +02:00
Ines Montani 3c3658ef9f Merge branch 'master' into develop 2019-09-12 18:03:01 +02:00
Ines Montani 03809b82b7 Support label schemes in model directory 2019-09-12 18:01:46 +02:00
Sofie Van Landeghem 9be4d1c105 Allow copying of user_data in as_doc (#4282)
* Allow copying the user_data with as_doc + unit test

* add option to docs

* add typing

* import fix

* workaround to avoid bool clashing ...

* bint instead of bool
2019-09-12 17:08:14 +02:00
Ines Montani ff51fba96a Update lemmaitzer docs [ci skip] 2019-09-12 16:26:33 +02:00
Ines Montani 25b2b3ff45 Remove LEMMA from exception examples [ci skip] 2019-09-12 16:26:27 +02:00
Ines Montani 82c16b7943 Remove u-strings and fix formatting [ci skip] 2019-09-12 16:11:15 +02:00
Ines Montani 38037d6816 Update landing [ci skip] 2019-09-12 15:33:39 +02:00
Ines Montani a31e9e1cd5 Update training docs [ci skip] 2019-09-12 15:32:39 +02:00
Ines Montani b544dcb3c5 Document debug-data [ci skip] 2019-09-12 15:26:20 +02:00
Ines Montani 72274e83f2 Ensure accordion label is left-aligned [ci skip] 2019-09-12 15:24:17 +02:00
Ines Montani c0a4cab178 Update "Adding languages" docs [ci skip] 2019-09-12 14:53:06 +02:00
Ines Montani 10257f3131 Document Lookups [ci skip] 2019-09-12 14:00:14 +02:00
Ines Montani aa4ff0baa1 Auto-format [ci skip] 2019-09-12 13:05:53 +02:00
Ines Montani 625ce2db8e Update Language docs [ci skip] 2019-09-12 13:03:38 +02:00
Ines Montani cb41a33d14 Update displaCy API docs [ci skip] 2019-09-12 12:59:20 +02:00
Ines Montani e7c20ad1d2 Update colors entry points docs [ci skip] 2019-09-12 12:59:10 +02:00
Ines Montani 7b59a919e6 Update entry points docs [ci skip] 2019-09-12 12:52:06 +02:00
Sofie Van Landeghem 0b4b4f1819 Documentation for Entity Linking (#4065)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bool's instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustements to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* typo fix

* add candidate API to kb documentation

* update API sidebar with EntityLinker and KnowledgeBase

* remove EL from 101 docs

* remove entity linker from 101 pipelines / rephrase

* custom el model instead of existing model

* set version to 2.2 for EL functionality

* update documentation for 2 CLI scripts
2019-09-12 11:38:34 +02:00
Sofie Van Landeghem 53a9ca45c9 Docs: bufsize instead of buffsize (#4247) 2019-09-06 11:11:54 +02:00
Sofie Van Landeghem 6b012cebff Make pos/tag distinction more clear in docs (#4246)
* make distinction between tag and pos more prominent in docs

* out of the 101
2019-09-06 10:31:21 +02:00
Ines Montani 232a029de6 Send referrer for internal links [ci skip] 2019-09-05 10:41:46 +02:00
Ines Montani 2f31f96fce Update languages.json [ci skip] 2019-09-04 18:15:42 +02:00
Ines Montani 2245e95e2d Update languages.json [ci skip] 2019-09-04 17:11:40 +02:00
adrianeboyd 82159b5c19 Updates/bugfixes for NER/IOB converters (#4186)
* Updates/bugfixes for NER/IOB converters

* Converter formats `ner` and `iob` use autodetect to choose a converter if
  possible

* `iob2json` is reverted to handle sentence-per-line data like
  `word1|pos1|ent1 word2|pos2|ent2`

  * Fix bug in `merge_sentences()` so the second sentence in each batch isn't
    skipped

* `conll_ner2json` is made more general so it can handle more formats with
  whitespace-separated columns

  * Supports all formats where the first column is the token and the final
    column is the IOB tag; if present, the second column is the POS tag

  * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O`
    separates documents

  * Add option for segmenting sentences (new flag `-s`)

  * Parser-based sentence segmentation with a provided model, otherwise with
    sentencizer (new option `-b` to specify model)

  * Can group sentences into documents with `n_sents` as long as sentence
    segmentation is available

  * Only applies automatic segmentation when there are no existing delimiters
    in the data

* Provide info about settings applied during conversion with warnings and
  suggestions if settings conflict or might not be not optimal.

* Add tests for common formats

* Add '(default)' back to docs for -c auto

* Add document count back to output

* Revert changes to converter output message

* Use explicit tabs in convert CLI test data

* Adjust/add messages for n_sents=1 default

* Add sample NER data to training examples

* Update README

* Add links in docs to example NER data

* Define msg within converters
2019-08-29 12:04:01 +02:00
Ines Montani b91425f803 Update universe.json [ci skip] 2019-08-28 13:45:06 +02:00
Ines Montani aedae8b4c5 Update universe.json [ci skip] 2019-08-28 11:59:06 +02:00
Björn Böing bae0455f91 Fix visualizer options linking for displaCy. (#4202) 2019-08-27 14:04:28 +02:00
Ines Montani 8114933f01 Fix universe.json [ci skip] 2019-08-27 12:13:42 +02:00
Ines Montani 48385552c6 Update languages.json [ci skip] 2019-08-27 11:52:51 +02:00
yanaiela 5d7bc26735 new universe project - the numeric fused-head (#4192)
* new universe project

* Update website/meta/universe.json

Co-Authored-By: Ines Montani <ines@ines.io>

* Update website/meta/universe.json

Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-25 17:25:28 +02:00
Christos Aridas 61f5c007a0 DOC Fix pipeline functions examples (#4189) 2019-08-23 19:15:32 +02:00
Ines Montani b072c13017 Update universe with videos [ci skip] 2019-08-21 21:35:37 +02:00
Pavle Vidanović 4fe9329bfb Serbian language code update "rs" -> "sr" (#4159)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix
2019-08-21 19:57:37 +02:00
adrianeboyd 8fe7bdd0fa Improve token pattern checking without validation (#4105)
* Fix typo in rule-based matching docs

* Improve token pattern checking without validation

Add more detailed token pattern checks without full JSON pattern validation and
provide more detailed error messages.

Addresses #4070 (also related: #4063, #4100).

* Check whether top-level attributes in patterns and attr for PhraseMatcher are
  in token pattern schema

* Check whether attribute value types are supported in general (as opposed to
  per attribute with full validation)

* Report various internal error types (OverflowError, AttributeError, KeyError)
  as ValueError with standard error messages

* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS,
  LEMMA, and DEP

* Add error messages with relevant details on how to use validate=True or nlp()
  instead of nlp.make_doc()

* Support attr=TEXT for PhraseMatcher

* Add NORM to schema

* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler

* Remove unnecessary .keys()

* Rephrase error messages

* Add another type check to Matcher

Add another type check to Matcher for more understandable error messages
in some rare cases.

* Support phrase_matcher_attr=TEXT for EntityRuler

* Don't use spacy.errors in examples and bin scripts

* Fix error code

* Auto-format

Also try get Azure pipelines to finally start a build :(

* Update errors.py


Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2019-08-21 14:00:37 +02:00
Ines Montani 3134a9b6e0 Add section on expanding regex match to token boundaries (see #4158) [ci skip] 2019-08-21 12:53:31 +02:00
Ines Montani 072860fcd0 Auto-format [ci skip] 2019-08-20 14:46:41 +02:00
Andrei-Marius Avram 199589228e Added RONEC to spaCy Universe (#4151)
* Added RONEC to spaCy Universe

* Added contributor file

* Corrected date from .github/contributors/avramandrei.md

* Convert tabs to spaces

* Remove duplicate keys

Can only have one GitHub link unfortunately

* Also add models category

* Adjust ID

This is used to generate the URL, so a simpler string is better
2019-08-20 14:46:07 +02:00
Ines Montani fe230c8776 Fix typo [ci skip] 2019-08-20 13:02:05 +02:00
Daniel Bourke b0a28fd0de fix PhraseMatcher link typo (#4150)
/api/phtasematcher -> /api/phrasematcher
2019-08-20 13:01:43 +02:00
Ines Montani ce4c3e5204 Document force flag on set_extension (closes #4148) 2019-08-19 19:22:07 +02:00
Ines Montani 66aba2d676 Improve regex matching docs [ci skip] 2019-08-19 13:59:41 +02:00
Sofie Van Landeghem cc66f47893 Make enabling/disabling jupyter mode more explicit (#4144)
* make enabling/disabling jupyter mode more explicit

* markup fix
2019-08-19 11:53:34 +02:00
Ines Montani e520eb3f6c Make visualized NER examples more clear (closes #4104) [ci skip] 2019-08-18 16:29:29 +02:00
Jeno 91441f169c Update universe.json to include negspacy (#4132) 2019-08-16 17:48:17 +02:00
Ines Montani 1362f793cf Improve docs on phrase pattern attributes (closes #4100) [ci skip] 2019-08-11 11:13:49 +02:00