Commit Graph

126 Commits

Author SHA1 Message Date
Sofie Van Landeghem 2d249a9502 KB extensions and better parsing of WikiData (#4375)
* fix overflow error on windows

* more documentation & logging fixes

* md fix

* 3 different limit parameters to play with execution time

* bug fixes directory locations

* small fixes

* exclude dev test articles from prior probabilities stats

* small fixes

* filtering wikidata entities, removing numeric and meta items

* adding aliases from wikidata also to the KB

* fix adding WD aliases

* adding also new aliases to previously added entities

* fixing commas

* small doc fixes

* adding subclassof filtering

* append alias functionality in KB

* prevent appending the same entity-alias pair

* fix for appending WD aliases

* remove date filter

* remove unnecessary import

* small corrections and reformatting

* remove WD aliases for now (too slow)

* removing numeric entities from training and evaluation

* small fixes

* shortcut during prediction if there is only one candidate

* add counts and fscore logging, remove FP NER from evaluation

* fix entity_linker.predict to take docs instead of single sentences

* remove enumeration sentences from the WP dataset

* entity_linker.update to process full doc instead of single sentence

* spelling corrections and dump locations in readme

* NLP IO fix

* reading KB is unnecessary at the end of the pipeline

* small logging fix

* remove empty files
2019-10-14 12:28:53 +02:00
Sofie Van Landeghem 0b4b4f1819 Documentation for Entity Linking (#4065)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bools instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustments to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* typo fix

* add candidate API to kb documentation

* update API sidebar with EntityLinker and KnowledgeBase

* remove EL from 101 docs

* remove entity linker from 101 pipelines / rephrase

* custom el model instead of existing model

* set version to 2.2 for EL functionality

* update documentation for 2 CLI scripts
2019-09-12 11:38:34 +02:00
Sofie Van Landeghem 0ba1b5eebc CLI scripts for entity linking (wikipedia & generic) (#4091)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bools instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustments to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* turn kb_creator into CLI script (wip)

* proper parameters for training entity vectors

* wikidata pipeline split up into two executable scripts

* remove context_width

* move wikidata scripts in bin directory, remove old dummy script

* refine KB script with logs and preprocessing options

* small edits

* small improvements to logging of EL CLI script
2019-08-13 15:38:59 +02:00
svlandeg cd6c263fe4 format offsets 2019-07-23 11:31:29 +02:00
svlandeg 9f8c1e71a2 fix for Issue #4000 2019-07-22 13:34:12 +02:00
svlandeg dae8a21282 rename entity frequency 2019-07-19 17:40:28 +02:00
svlandeg 21176517a7 have gold.links correspond exactly to doc.ents 2019-07-19 12:36:15 +02:00
svlandeg e1213eaf6a use original gold object in get_loss function 2019-07-18 13:35:10 +02:00
svlandeg ec55d2fccd filter training data beforehand (+black formatting) 2019-07-18 10:22:24 +02:00
svlandeg b7a0c9bf60 fixing the context/prior weight settings 2019-07-03 17:48:09 +02:00
svlandeg 8840d4b1b3 fix for context encoder optimizer 2019-07-03 13:35:36 +02:00
svlandeg 3420cbe496 small fixes 2019-07-03 10:25:51 +02:00
svlandeg 2d2dea9924 experiment with adding NER types to the feature vector 2019-06-29 14:52:36 +02:00
svlandeg c664f58246 adding prior probability as feature in the model 2019-06-28 16:22:58 +02:00
svlandeg 1c80b85241 fix tests 2019-06-28 08:59:23 +02:00
svlandeg 68a0662019 context encoder with Tok2Vec + linking model instead of cosine 2019-06-28 08:29:31 +02:00
svlandeg dbc53b9870 rename to KBEntryC 2019-06-26 15:55:26 +02:00
svlandeg 1de61f68d6 improve speed of prediction loop 2019-06-26 13:53:10 +02:00
svlandeg bee23cd8af try Tok2Vec instead of SpacyVectors 2019-06-25 16:09:22 +02:00
svlandeg b58bace84b small fixes 2019-06-24 10:55:04 +02:00
svlandeg a31648d28b further code cleanup 2019-06-19 09:15:43 +02:00
svlandeg 478305cd3f small tweaks and documentation 2019-06-18 18:38:09 +02:00
svlandeg 0d177c1146 clean up code, remove old code, move to bin 2019-06-18 13:20:40 +02:00
svlandeg ffae7d3555 sentence encoder only (removing article/mention encoder) 2019-06-18 00:05:47 +02:00
svlandeg 6332af40de baseline performances: oracle KB, random and prior prob 2019-06-17 14:39:40 +02:00
svlandeg 24db1392b9 reprocessing all of wikipedia for training data 2019-06-16 21:14:45 +02:00
svlandeg 81731907ba performance per entity type 2019-06-14 19:55:46 +02:00
svlandeg b312f2d0e7 redo training data to be independent of KB and entity-level instead of doc-level 2019-06-14 15:55:26 +02:00
svlandeg 0b04d142de regenerating KB 2019-06-13 22:32:56 +02:00
svlandeg 78dd3e11da write entity linking pipe to file and keep vocab consistent between kb and nlp 2019-06-13 16:25:39 +02:00
svlandeg b12001f368 small fixes 2019-06-12 22:05:53 +02:00
svlandeg 6521cfa132 speeding up training 2019-06-12 13:37:05 +02:00
svlandeg 66813a1fdc speed up predictions 2019-06-11 14:18:20 +02:00
svlandeg fe1ed432ef eval on dev set, varying combos of prior and context scores 2019-06-11 11:40:58 +02:00
svlandeg 83dc7b46fd first tests with EL pipe 2019-06-10 21:25:26 +02:00
svlandeg 7de1ee69b8 training loop in proper pipe format 2019-06-07 15:55:10 +02:00
svlandeg 0486ccabfd introduce goldparse.links 2019-06-07 13:54:45 +02:00
svlandeg a5c061f506 storing NEL training data in GoldParse objects 2019-06-07 12:58:42 +02:00
svlandeg 61f0e2af65 code cleanup 2019-06-06 20:22:14 +02:00
svlandeg d8b435ceff pretraining description vectors and storing them in the KB 2019-06-06 19:51:27 +02:00
svlandeg 5c723c32c3 entity vectors in the KB + serialization of them 2019-06-05 18:29:18 +02:00
svlandeg 9abbd0899f separate entity encoder to get 64D descriptions 2019-06-05 00:09:46 +02:00
svlandeg fb37cdb2d3 implementing el pipe in pipes.pyx (not tested yet) 2019-06-03 21:32:54 +02:00
svlandeg 9e88763dab 60% acc run 2019-06-03 08:04:49 +02:00
svlandeg 268a52ead7 experimenting with cosine sim for negative examples (not OK yet) 2019-05-29 16:07:53 +02:00
svlandeg a761929fa5 context encoder combining sentence and article 2019-05-28 18:14:49 +02:00
svlandeg 992fa92b66 refactor again to clusters of entities and cosine similarity 2019-05-28 00:05:22 +02:00
svlandeg 8c4aa076bc small fixes 2019-05-27 14:29:38 +02:00
svlandeg cfc27d7ff9 using Tok2Vec instead 2019-05-26 23:39:46 +02:00
svlandeg abf9af81c9 learn rate and epochs 2019-05-24 22:04:25 +02:00