Commit Graph

1558 Commits

Author SHA1 Message Date
Paul O'Leary McCann f0e3e606a6 Replace python-mecab3 with fugashi for Japanese (#4621)
* Switch from mecab-python3 to fugashi

mecab-python3 has been the best MeCab binding for a long time but it's
not very actively maintained, and since it's based on old SWIG code
distributed with MeCab there's a limit to how effectively it can be
maintained.

Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not
based on the old SWIG code it's easier to keep it current and make small
deviations from the MeCab C/C++ API where that makes sense.

* Change mecab-python3 to fugashi in setup.cfg

* Change "mecab tags" to "unidic tags"

The tags come from MeCab, but the tag schema is specified by Unidic, so
it's more proper to refer to it that way.

* Update conftest

* Add fugashi link to external deps list for Japanese
2019-11-23 14:31:04 +01:00
Ines Montani a6200bc424 Update scorer.md [ci skip] 2019-11-21 17:02:43 +01:00
richardpaulhudson 8d06386e1e Update to Holmes Universe entry (#4679)
* Updated Universe entry for Holmes

* Correction

* Updated model name

* Updated wording
2019-11-21 16:23:24 +01:00
Ines Montani 235fe6fe3b Auto-format [ci skip] 2019-11-20 13:14:58 +01:00
adrianeboyd 2c876eb672 Add tokenizer explain() debugging method (#4596)
* Expose tokenizer rules as a property

Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)

Add tests and update Tokenizer API docs.

* Update Hungarian punctuation to remove empty string

Update Hungarian punctuation definitions so that `_units` does not match
an empty string.

* Use _load_special_tokenization consistently

Use `_load_special_tokenization()` and have it to handle `None` checks.

* Fix precedence of `token_match` vs. special cases

Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.

* Add `make_debug_doc()` to the Tokenizer

Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.

Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.

* Update tokenization usage docs

Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.

* Revert "Update Hungarian punctuation to remove empty string"

This reverts commit f0a577f7a5.

* Rework `make_debug_doc()` as `explain()`

Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.

* Handle cases with bad tokenizer patterns

Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.

* Remove unused displacy image

* Add tokenizer.explain() to usage docs
2019-11-20 13:07:25 +01:00
Ines Montani e8b9cee6fd Make example consistent with model (closes #4587) [ci skip] 2019-11-18 12:41:48 +01:00
Ines Montani e01a1a237f Auto-format [ci skip] 2019-11-18 12:41:31 +01:00
adrianeboyd 62e00fd9da Update tokenization usage docs (#4666)
Update pseudo-code and algorithm description to correspond to current
tokenizer behavior.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.
2019-11-18 12:35:13 +01:00
Ines Montani 5adcb352e9 Adjust order of docs sections [ci skip] 2019-11-17 16:08:56 +01:00
Ines Montani e30d08410a
Add CI for Python 3.8 (#4479)
* Add 3.8 classifier

* Update azure-pipelines.yml

* Remove 3.8 warning from docs [ci skip]
2019-11-15 01:13:48 +01:00
f11r 877971860e Fix assert in sentencizer documentation. (#4639) 2019-11-13 15:24:14 +01:00
Ines Montani 9d5ff177c4 Work around Markdown rendering issue surfaced in #4600 [ci skip] 2019-11-11 17:12:08 +01:00
adrianeboyd 0f8678c0b1 Fix DocBin.merge() example (#4599) 2019-11-07 11:26:48 +01:00
walterhenry 5563c42ef5 Fixed typo: Added space between "recognize" and "various" (#4600) 2019-11-06 23:06:36 +01:00
Ines Montani 828ef27a32 Add warnings about 3.8 (resolves #4593) [ci skip] 2019-11-05 18:30:11 +01:00
Ines Montani 4b95587ad4 Update universe.json [ci skip] 2019-11-04 13:55:55 +01:00
Yash Patadia 0c396aeed4 add dframcy to universe.json (#4580) 2019-11-04 13:53:23 +01:00
Ines Montani 59358d9b71
Remove box-decoration-break from entities in displacy (#4564) 2019-10-31 15:09:43 +01:00
Ines Montani 4e1de85e43 Update syntax iterators [ci skip] 2019-10-30 14:31:40 +01:00
Ines Montani 726c5dd306 Update universe.json [ci skip] 2019-10-30 13:29:00 +01:00
Neel Kamath 6c036ab57d Add "spaCy Server" to spaCy Universe (#4553)
* Add "spaCy Server" to spaCy Universe

* Accept the spaCy Contributor Agreement
2019-10-30 13:20:46 +01:00
Nipun Sadvilkar 2a5e71232b project: pySBD - Python Sentence Boundary Disambiguation (#4455)
*   project: pySBD - Python Sentence Boundary Disambiguation

* 📝  Update links and description

* 🐛  Fix missing comma

* Update universe.json

pysbd as a spacy component through entrypoints

* 🚨  Fix universe.json

* 📝  Update code_example
2019-10-30 12:13:29 +01:00
Matthew Honnibal d5509e0989 Support Mish activation (requires Thinc 7.3) (#4536)
* Add arch for MishWindowEncoder

* Support mish in tok2vec and conv window >=2

* Pass new tok2vec settings from parser

* Syntax error

* Fix tok2vec setting

* Fix registration of MishWindowEncoder

* Fix receptive field setting

* Fix mish arch

* Pass more options from parser

* Support more tok2vec options in pretrain

* Require thinc 7.3

* Add docs [ci skip]

* Require thinc 7.3.0.dev0 to run CI

* Run black

* Fix typo

* Update Thinc version


Co-authored-by: Ines Montani <ines@ines.io>
2019-10-28 15:16:33 +01:00
Ines Montani 1180304449 Update languages.json [ci skip] 2019-10-26 13:51:42 +02:00
Ines Montani cfffdba7b1 Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522)
* Implement new API for {Phrase}Matcher.add (backwards-compatible)

* Update docs

* Also update DependencyMatcher.add

* Update internals

* Rewrite tests to use new API

* Add basic check for common mistake

Raise error with suggestion if user likely passed in a pattern instead of a list of patterns

* Fix typo [ci skip]
2019-10-25 22:21:08 +02:00
Ines Montani d2da117114 Also support passing list to Language.disable_pipes (#4521)
* Also support passing list to Language.disable_pipes

* Adjust internals
2019-10-25 16:19:08 +02:00
Ines Montani 493be8e9db Update new version identifier [ci skip] 2019-10-25 11:42:49 +02:00
Ines Montani 2abf1028cb Update docs [ci skip] 2019-10-25 11:27:00 +02:00
Ines Montani f31876154d Adjust formatting [ci skip] 2019-10-25 11:19:46 +02:00
Kabir Khan 93640373c7 Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513)
* Update entityruler.py

* Making ent_id resolution 2x faster and adding docs

* Fixing newlines in docstrings

* Fixing newlines in docstrings
2019-10-25 11:16:42 +02:00
adrianeboyd 1b0bbe4b76 Update tag maps and docs for English and German (#4501)
* Update English tag_map

Update English tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/en-penn-uposf.html

* Update German tag_map

Update German tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/de-stts-uposf.html

* Add missing Tiger dependencies to glossary

* Add quotes to definition of TO

* Update POS/TAG tables in docs

Update POS/TAG tables for English and German docs using current
information generated from the tag_maps and GLOSSARY.

* Update warning that -PRON- is specific to English

* Revert docs to default JSON output with convert

* Revert "Revert docs to default JSON output with convert"

This reverts commit 6b78c048f1.
2019-10-24 12:56:05 +02:00
adrianeboyd 8516e9d53b Support train dict format as JSONL (#4471)
* Support train dict format as JSONL

* Add (overly simple) check for dict vs. tuple to read JSONL lines as
either train dicts or train tuples

* Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()`
and `GoldCorpus.train_tuples`

* Revert docs to default JSON output with convert
2019-10-23 16:01:44 +02:00
adrianeboyd 7fc39f124c Fix logic in rules+model entity example [ci skip] (#4510) 2019-10-23 14:41:21 +02:00
Ines Montani 388ea03065 Update universe.json [ci skip] 2019-10-22 14:54:47 +02:00
Kabir Khan 8a7a30ea1d Add cookiecutter-spacy-fastapi to spacy universe (#4498) 2019-10-22 14:50:40 +02:00
Ines Montani 4659435573 Fix argument type in PhraseMatcher.add docs (closes #4496) [ci skip] 2019-10-22 14:37:30 +02:00
Julin S 3ee15fce0d Update information about Rasa (#4492)
Rasa has been updated and rasa core and rasa nlu have been merged.
2019-10-22 14:32:31 +02:00
Ines Montani b2f88e2060 Fix formatting [ci skip] 2019-10-21 12:26:07 +02:00
adrianeboyd 3195a8f170 Add Entity Linking to menu (#4489) 2019-10-21 12:17:30 +02:00
Pepe Berba 7772d5d3c5 Update `vocab.get_vector` docs to include features on Fasttext ngram (#4464)
* Update `vocab.get_vector`

* Added contrib agreement
2019-10-20 01:28:18 +02:00
Ghola 258eb9e064 Misspelling on Lemmatizer Example #4406 (#4449)
Removing extra o in the lookups = Loookups()
2019-10-16 23:23:15 +02:00
Anastassia 4a77d03ff7 Fix documentation for the docs_to_json function (#4456) 2019-10-16 23:17:58 +02:00
Ines Montani 5cbe21700b Only show label scheme if not empty [ci skip] 2019-10-08 15:52:59 +02:00
Ines Montani 8f76d6c9ef Update transformer model details [ci skip] 2019-10-08 15:39:38 +02:00
Ines Montani 573e543e4a Alphanumeric -> alphabetic [ci skip]
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
Ines Montani e65dffd80b Clarify serialization of extension attributes (closes #4377) [ci skip] 2019-10-05 11:58:00 +02:00
Ines Montani e7ddc6f662 Add conda install for lookups [ci skip] 2019-10-03 17:52:53 +02:00
Sofie Van Landeghem 4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ines Montani ce1d441de5 Add docs for Vectors.most_similar [ci skip] 2019-10-03 14:29:47 +02:00
Ines Montani 80cf385f65 Update v2-2.md [ci skip] 2019-10-02 16:58:21 +02:00