Commit Graph

11076 Commits

Author SHA1 Message Date
Kabir Khan 93640373c7 Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513)
* Update entityruler.py

* Making ent_id resolution 2x faster and adding docs

* Fixing newlines in docstrings

* Fixing newlines in docstrings
2019-10-25 11:16:42 +02:00
Ines Montani cc05d9dad6 Auto-format [ci skip] 2019-10-24 16:21:08 +02:00
Ines Montani 73dc63d3bf Tidy up and auto-format [ci skip] 2019-10-24 16:20:48 +02:00
Ines Montani 6dd2832438 Use numpy.frombuffer instead of fromstring
Deprecation warning says we should do this
2019-10-24 16:18:41 +02:00
Ines Montani 9a849fe54e Explicitly catch warning in test 2019-10-24 16:16:27 +02:00
adrianeboyd 1b0bbe4b76 Update tag maps and docs for English and German (#4501)
* Update English tag_map

Update English tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/en-penn-uposf.html

* Update German tag_map

Update German tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/de-stts-uposf.html

* Add missing Tiger dependencies to glossary

* Add quotes to definition of TO

* Update POS/TAG tables in docs

Update POS/TAG tables for English and German docs using current
information generated from the tag_maps and GLOSSARY.

* Update warning that -PRON- is specific to English

* Revert docs to default JSON output with convert

* Revert "Revert docs to default JSON output with convert"

This reverts commit 6b78c048f1.
2019-10-24 12:56:05 +02:00
Zhuoru Lin 10d88b09bb Bugfix/fix wikidata train entity linker (#4509)
* Fix labels_discard Nonetype iteration error

* Contributor agreement for Zhuoru Lin

* Enhance EntityLinker.predict() to handle labels_discard is None case.
2019-10-24 12:52:59 +02:00
adrianeboyd 8516e9d53b Support train dict format as JSONL (#4471)
* Support train dict format as JSONL

* Add (overly simple) check for dict vs. tuple to read JSONL lines as
either train dicts or train tuples

* Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()`
and `GoldCorpus.train_tuples`

* Revert docs to default JSON output with convert
2019-10-23 16:01:44 +02:00
adrianeboyd 7fc39f124c Fix logic in rules+model entity example [ci skip] (#4510) 2019-10-23 14:41:21 +02:00
Ines Montani 835498d24f Update azure-pipelines.yml 2019-10-23 14:31:09 +02:00
Matthew Honnibal ca7f0e669e Set version to v2.2.2.dev1 2019-10-22 20:11:25 +02:00
Matthew Honnibal 9489c5f6b2 Clip most_similar to range [-1, 1] (fixes #4506) (#4507)
* Clip most_similar to range [-1, 1]

* Add/fix vectors tests

* Fix test
2019-10-22 20:10:42 +02:00
Ines Montani 74a19aeb1c Add xfailing test [ci skip] 2019-10-22 18:18:43 +02:00
Matthew Honnibal 3f6cb618a9 Set version to v2.2.2.dev0 2019-10-22 17:47:36 +02:00
Sofie Van Landeghem 48886afc78 prevent zero-length mem alloc (#4429)
* raise specific error when removing a matcher rule that doesn't exist

* rephrasing

* goldparse init: allocate fields only if doc is not empty

* avoid zero length alloc in saving tokenizer cache

* avoid allocating zero length mem in matcher

* asserts to avoid allocating zero length mem

* fix zero-length allocation in matcher

* bump cymem version

* revert cymem version bump
2019-10-22 16:54:33 +02:00
adrianeboyd 3dfc764577 Free pointers in parser activations (#4486)
* Free pointers in ActivationsC

* Restructure alloc/free for parser activations

* Rewrite/restructure to have allocation and free in parallel functions
in `_parser_model` rather than partially in `_parseC()` in `Parser`.

* Remove `resize_activations` from `_parser_model.pxd`.
2019-10-22 15:06:44 +02:00
Ines Montani 388ea03065 Update universe.json [ci skip] 2019-10-22 14:54:47 +02:00
Kabir Khan 8a7a30ea1d Add cookiecutter-spacy-fastapi to spacy universe (#4498) 2019-10-22 14:50:40 +02:00
tamuhey fb89f6792b refactor: remove unused variable (#4499) 2019-10-22 14:38:17 +02:00
Ines Montani 4659435573 Fix argument type in PhraseMatcher.add docs (closes #4496) [ci skip] 2019-10-22 14:37:30 +02:00
Julin S 3ee15fce0d Update information about Rasa (#4492)
Rasa has been updated and rasa core and rasa nlu have been merged.
2019-10-22 14:32:31 +02:00
Ines Montani b0bdb18b4d Fix typo 2019-10-21 18:36:22 +02:00
Ines Montani 945b8ecfba Relax plac version requirement 2019-10-21 18:08:33 +02:00
gustavengstrom 050e2445a8 Adding noun_chunks to the Swedish language model (sv) (#4422)
* Create syntax_iterators.py

Replica of spacy/lang/fr/syntax_iterators.py

* Added import statements for SYNTAX_ITERATORS

* Create gustavengstrom.md

* Added "dobj" to list of labels in noun_chunks method and a test_noun_chunks method to the  Swedish language model.

* Delete README-checkpoint.md


Co-authored-by: Gustav <gustav@davcon.se>
Co-authored-by: Ines Montani <ines@ines.io>
2019-10-21 12:57:06 +02:00
Ines Montani b2f88e2060 Fix formatting [ci skip] 2019-10-21 12:26:07 +02:00
adrianeboyd f5c551a43a Checks/errors related to ill-formed IOB input in CLI convert and debug-data (#4487)
* Error for ill-formed input to iob_to_biluo()

Check for empty label in iob_to_biluo(), which can result from
ill-formed input.

* Check for empty NER label in debug-data
2019-10-21 12:20:28 +02:00
adrianeboyd 3195a8f170 Add Entity Linking to menu (#4489) 2019-10-21 12:17:30 +02:00
Sofie Van Landeghem d5d55312b2 prevent division by zero in most_similar method (#4488) 2019-10-21 12:04:46 +02:00
Ines Montani a98d1cd58e Update Thinc version and remove GPU ops 2019-10-20 19:01:45 +02:00
Pepe Berba 7772d5d3c5 Update `vocab.get_vector` docs to include features on Fasttext ngram (#4464)
* Update `vocab.get_vector`

* Added contrib agreement
2019-10-20 01:28:18 +02:00
Ines Montani 2c96a5e5b0
Remove lemma attrs on BaseDefaults (#4468) 2019-10-19 23:18:09 +02:00
Ines Montani f6af3cf8d9 Add 3.8 classifier [ci skip] 2019-10-19 18:13:25 +02:00
Ines Montani 5e59c9b3ee Fix unicode strings in examples [ci skip] 2019-10-18 18:47:59 +02:00
adrianeboyd 8d3de90bc4 Suppress convert output if writing to stdout (#4472) 2019-10-18 18:12:59 +02:00
Ines Montani 692d7f4291 Fix formatting [ci skip] 2019-10-18 11:33:38 +02:00
Ines Montani 181c01f629 Tidy up and auto-format 2019-10-18 11:27:38 +02:00
Ines Montani fb11852750 Remove unused imports 2019-10-18 11:06:41 +02:00
adrianeboyd d359da9687 Replace Entity/MatchStruct with SpanC (#4459)
* Replace MatchStruct with Entity

Replace MatchStruct with Entity since the existing Entity struct is
nearly identical.

* Replace Entity with more general SpanC
2019-10-18 11:01:47 +02:00
adrianeboyd 29e3da6493 Add missing cats to gold annot_tuples in Scorer (#4466)
Add missing `cats` in `Scorer` call to `GoldParse.from_annot_tuples()`
when the `doc` and `gold` have differing lengths.
2019-10-18 11:00:02 +02:00
adrianeboyd 135e3de531 Check for docs with 2+ sentences in debug-data (#4467) 2019-10-18 10:59:16 +02:00
Ghola 258eb9e064 Misspelling on Lemmatizer Example #4406 (#4449)
Removing extra o in the lookups = Loookups()
2019-10-16 23:23:15 +02:00
Daniel King e646956176 Most similar bug (#4446)
* Add batch size indexing

* Don't sort if n == 1

* Add test for most similar vectors issue

* Change > to >=
2019-10-16 23:18:55 +02:00
Anastassia 4a77d03ff7 Fix documentation for the docs_to_json function (#4456) 2019-10-16 23:17:58 +02:00
adrianeboyd 275c9ad872 Allow int values in token patterns (#4444)
* Add missing int value option to top-level pattern validation in Matcher

* Adjust existing tests accordingly

* Add new test for valid pattern `{"LENGTH": int}`
2019-10-16 13:40:18 +02:00
Sofie Van Landeghem 7d1efac4eb Fix remove pattern from matcher (#4454)
* raise specific error when removing a matcher rule that doesn't exist

* rephrasing

* bugfix in remove matcher + extended unit test
2019-10-16 13:34:58 +02:00
Sofie Van Landeghem 2d249a9502 KB extensions and better parsing of WikiData (#4375)
* fix overflow error on windows

* more documentation & logging fixes

* md fix

* 3 different limit parameters to play with execution time

* bug fixes directory locations

* small fixes

* exclude dev test articles from prior probabilities stats

* small fixes

* filtering wikidata entities, removing numeric and meta items

* adding aliases from wikidata also to the KB

* fix adding WD aliases

* adding also new aliases to previously added entities

* fixing comma's

* small doc fixes

* adding subclassof filtering

* append alias functionality in KB

* prevent appending the same entity-alias pair

* fix for appending WD aliases

* remove date filter

* remove unnecessary import

* small corrections and reformatting

* remove WD aliases for now (too slow)

* removing numeric entities from training and evaluation

* small fixes

* shortcut during prediction if there is only one candidate

* add counts and fscore logging, remove FP NER from evaluation

* fix entity_linker.predict to take docs instead of single sentences

* remove enumeration sentences from the WP dataset

* entity_linker.update to process full doc instead of single sentence

* spelling corrections and dump locations in readme

* NLP IO fix

* reading KB is unnecessary at the end of the pipeline

* small logging fix

* remove empty files
2019-10-14 12:28:53 +02:00
Peter Gilles 428887b8f2 Initial commit: New language Luxembourgish (lb) (#4424)
* new language: Luxembourgish (lb)

* update

* update

* Update and rename .github/CONTRIBUTOR_AGREEMENT.md to .github/contributors/PeterGilles.md

* Update and rename .github/contributors/PeterGilles.md to .github/CONTRIBUTOR_AGREEMENT.md

* Update norm_exceptions.py

* Delete README.md

* moved test_lemma.py

* deactivated 'lemma_lookup = LOOKUP'

* update

* Update conftest.py

* update

* tests updated

* import unicode_literals

* Update spacy/tests/lang/lb/test_text.py

Co-Authored-By: Ines Montani <ines@ines.io>

* Create PeterGilles.md
2019-10-14 12:27:50 +02:00
adrianeboyd 98a961a60e Fix PhraseMatcher.remove for overlapping patterns (#4437) 2019-10-14 12:19:51 +02:00
Ines Montani f8f68bb062 Auto-format [ci skip] 2019-10-10 17:08:39 +02:00
adrianeboyd d2d2baaf76 Revert training example edit from #4327 (#4403)
I think the original annotation was correct and this change also
unfortunately introduced a cycle into the dependency tree.
2019-10-10 17:00:26 +02:00