Commit Graph

18 Commits

Author SHA1 Message Date
adrianeboyd 8fe7bdd0fa Improve token pattern checking without validation (#4105)
* Fix typo in rule-based matching docs

* Improve token pattern checking without validation

Add more detailed token pattern checks without full JSON pattern validation and
provide more detailed error messages.

Addresses #4070 (also related: #4063, #4100).

* Check whether top-level attributes in patterns and attr for PhraseMatcher are
  in token pattern schema

* Check whether attribute value types are supported in general (as opposed to
  per attribute with full validation)

* Report various internal error types (OverflowError, AttributeError, KeyError)
  as ValueError with standard error messages

* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS,
  LEMMA, and DEP

* Add error messages with relevant details on how to use validate=True or nlp()
  instead of nlp.make_doc()

* Support attr=TEXT for PhraseMatcher

* Add NORM to schema

* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler

* Remove unnecessary .keys()

* Rephrase error messages

* Add another type check to Matcher

Add another type check to Matcher for more understandable error messages
in some rare cases.

* Support phrase_matcher_attr=TEXT for EntityRuler

* Don't use spacy.errors in examples and bin scripts

* Fix error code

* Auto-format

Also try get Azure pipelines to finally start a build :(

* Update errors.py


Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2019-08-21 14:00:37 +02:00
Sofie Van Landeghem 0ba1b5eebc CLI scripts for entity linking (wikipedia & generic) (#4091)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bool's instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustements to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* turn kb_creator into CLI script (wip)

* proper parameters for training entity vectors

* wikidata pipeline split up into two executable scripts

* remove context_width

* move wikidata scripts in bin directory, remove old dummy script

* refine KB script with logs and preprocessing options

* small edits

* small improvements to logging of EL CLI script
2019-08-13 15:38:59 +02:00
svlandeg a037206f0a use pathlib instead 2019-07-23 12:17:19 +02:00
svlandeg cd6c263fe4 format offsets 2019-07-23 11:31:29 +02:00
svlandeg 3e140534d9 format 2019-07-22 15:04:57 +02:00
svlandeg dae8a21282 rename entity frequency 2019-07-19 17:40:28 +02:00
svlandeg f75d1299a7 formatting 2019-07-19 14:52:45 +02:00
svlandeg 21176517a7 have gold.links correspond exactly to doc.ents 2019-07-19 12:36:15 +02:00
svlandeg ec55d2fccd filter training data beforehand (+black formatting) 2019-07-18 10:22:24 +02:00
svlandeg d833d4c358 fixes in kb and gold 2019-07-17 17:18:26 +02:00
svlandeg 4086c6ff60 get vector functionality + unit test 2019-07-17 12:17:02 +02:00
svlandeg 3420cbe496 small fixes 2019-07-03 10:25:51 +02:00
svlandeg 68a0662019 context encoder with Tok2Vec + linking model instead of cosine 2019-06-28 08:29:31 +02:00
svlandeg b58bace84b small fixes 2019-06-24 10:55:04 +02:00
svlandeg cc9ae28a52 custom error and warning messages 2019-06-19 12:35:26 +02:00
svlandeg a31648d28b further code cleanup 2019-06-19 09:15:43 +02:00
svlandeg 478305cd3f small tweaks and documentation 2019-06-18 18:38:09 +02:00
svlandeg 0d177c1146 clean up code, remove old code, move to bin 2019-06-18 13:20:40 +02:00