Commit Graph

47 Commits

Author SHA1 Message Date
Sofie Van Landeghem 0b4b4f1819 Documentation for Entity Linking (#4065)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bool's instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustements to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* typo fix

* add candidate API to kb documentation

* update API sidebar with EntityLinker and KnowledgeBase

* remove EL from 101 docs

* remove entity linker from 101 pipelines / rephrase

* custom el model instead of existing model

* set version to 2.2 for EL functionality

* update documentation for 2 CLI scripts
2019-09-12 11:38:34 +02:00
Sofie Van Landeghem 0ba1b5eebc CLI scripts for entity linking (wikipedia & generic) (#4091)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bool's instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustements to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* turn kb_creator into CLI script (wip)

* proper parameters for training entity vectors

* wikidata pipeline split up into two executable scripts

* remove context_width

* move wikidata scripts in bin directory, remove old dummy script

* refine KB script with logs and preprocessing options

* small edits

* small improvements to logging of EL CLI script
2019-08-13 15:38:59 +02:00
svlandeg cd6c263fe4 format offsets 2019-07-23 11:31:29 +02:00
svlandeg 9f8c1e71a2 fix for Issue #4000 2019-07-22 13:34:12 +02:00
svlandeg 21176517a7 have gold.links correspond exactly to doc.ents 2019-07-19 12:36:15 +02:00
svlandeg e1213eaf6a use original gold object in get_loss function 2019-07-18 13:35:10 +02:00
svlandeg ec55d2fccd filter training data beforehand (+black formatting) 2019-07-18 10:22:24 +02:00
svlandeg b7a0c9bf60 fixing the context/prior weight settings 2019-07-03 17:48:09 +02:00
svlandeg 8840d4b1b3 fix for context encoder optimizer 2019-07-03 13:35:36 +02:00
svlandeg 3420cbe496 small fixes 2019-07-03 10:25:51 +02:00
svlandeg 2d2dea9924 experiment with adding NER types to the feature vector 2019-06-29 14:52:36 +02:00
svlandeg c664f58246 adding prior probability as feature in the model 2019-06-28 16:22:58 +02:00
svlandeg 68a0662019 context encoder with Tok2Vec + linking model instead of cosine 2019-06-28 08:29:31 +02:00
svlandeg dbc53b9870 rename to KBEntryC 2019-06-26 15:55:26 +02:00
svlandeg 1de61f68d6 improve speed of prediction loop 2019-06-26 13:53:10 +02:00
svlandeg bee23cd8af try Tok2Vec instead of SpacyVectors 2019-06-25 16:09:22 +02:00
svlandeg b58bace84b small fixes 2019-06-24 10:55:04 +02:00
svlandeg a31648d28b further code cleanup 2019-06-19 09:15:43 +02:00
svlandeg 478305cd3f small tweaks and documentation 2019-06-18 18:38:09 +02:00
svlandeg 0d177c1146 clean up code, remove old code, move to bin 2019-06-18 13:20:40 +02:00
svlandeg 6961215578 refactor code to separate functionality into different files 2019-05-06 10:56:56 +02:00
svlandeg f5190267e7 run only 100M of WP data as training dataset (9%) 2019-05-03 18:09:09 +02:00
svlandeg 4e929600e5 fix WP id parsing, speed up processing and remove ambiguous strings in one doc (for now) 2019-05-03 17:37:47 +02:00
svlandeg 34600c92bd try catch per article to ensure the pipeline goes on 2019-05-03 15:10:09 +02:00
svlandeg bbcb9da466 creating training data with clean WP texts and QID entities true/false 2019-05-03 10:44:29 +02:00
svlandeg cba9680d13 run NER on clean WP text and link to gold-standard entity IDs 2019-05-02 17:24:52 +02:00
svlandeg 581dc9742d parsing clean text from WP articles to use as input data for NER and NEL 2019-05-02 17:09:56 +02:00
svlandeg 8353552191 cleanup 2019-05-01 23:26:16 +02:00
svlandeg 1ae41daaa9 allow small rounding errors 2019-05-01 23:05:40 +02:00
svlandeg 3629a52ede reading all persons in wikidata 2019-05-01 01:00:59 +02:00
svlandeg 60b54ae8ce bulk entity writing and experiment with regex wikidata reader to speed up processing 2019-05-01 00:00:38 +02:00
svlandeg 653b7d9c87 calculate entity raw counts offline to speed up KB construction 2019-04-30 11:39:42 +02:00
svlandeg 19e8f339cb deduce entity freq from WP corpus and serialize vocab in WP test 2019-04-29 17:37:29 +02:00
svlandeg 54d0cea062 unit test for KB serialization 2019-04-24 23:52:34 +02:00
svlandeg 3e0cb69065 KB aliases to and from file 2019-04-24 20:24:24 +02:00
svlandeg ad6c5e581c writing and reading number of entries to/from header 2019-04-24 15:31:44 +02:00
svlandeg 6e3223f234 bulk loading in proper order of entity indices 2019-04-24 11:26:38 +02:00
svlandeg 694fea597a dumping all entryC entries + (inefficient) reading back in 2019-04-23 18:36:50 +02:00
svlandeg 8e70a564f1 custom reader and writer for _EntryC fields (first stab at it - not complete) 2019-04-23 16:33:40 +02:00
svlandeg 004e5e7d1c little fixes 2019-04-19 14:24:02 +02:00
svlandeg 9a8197185b fix alias capitalization 2019-04-18 22:37:50 +02:00
svlandeg 9f308eb5dc fixes for prior prob and linking wikidata IDs with wikipedia titles 2019-04-18 16:14:25 +02:00
svlandeg 10ee8dfea2 poc with few entities and collecting aliases from the WP links 2019-04-18 14:12:17 +02:00
svlandeg 6763e025e1 parse wp dump for links to determine prior probabilities 2019-04-15 11:41:57 +02:00
svlandeg 3163331b1e wikipedia dump parser and mediawiki format regex cleanup 2019-04-14 21:52:01 +02:00
svlandeg b31a390a9a reading types, claims and sitelinks 2019-04-11 21:42:44 +02:00
svlandeg 6e997be4b4 reading wikidata descriptions and aliases 2019-04-11 21:08:22 +02:00