Commit Graph

55 Commits

Author SHA1 Message Date
Sofie Van Landeghem f9b541f9ef More robust set entities method in KB (#4794)
* add unit test for setting entities with duplicate identifiers

* count the number of actual unique identifiers and throw duplicate warning
2019-12-13 10:45:29 +01:00
Sofie Van Landeghem 2d249a9502 KB extensions and better parsing of WikiData (#4375)
* fix overflow error on windows

* more documentation & logging fixes

* md fix

* 3 different limit parameters to play with execution time

* bug fixes directory locations

* small fixes

* exclude dev test articles from prior probabilities stats

* small fixes

* filtering wikidata entities, removing numeric and meta items

* adding aliases from wikidata also to the KB

* fix adding WD aliases

* adding also new aliases to previously added entities

* fixing comma's

* small doc fixes

* adding subclassof filtering

* append alias functionality in KB

* prevent appending the same entity-alias pair

* fix for appending WD aliases

* remove date filter

* remove unnecessary import

* small corrections and reformatting

* remove WD aliases for now (too slow)

* removing numeric entities from training and evaluation

* small fixes

* shortcut during prediction if there is only one candidate

* add counts and fscore logging, remove FP NER from evaluation

* fix entity_linker.predict to take docs instead of single sentences

* remove enumeration sentences from the WP dataset

* entity_linker.update to process full doc instead of single sentence

* spelling corrections and dump locations in readme

* NLP IO fix

* reading KB is unnecessary at the end of the pipeline

* small logging fix

* remove empty files
2019-10-14 12:28:53 +02:00
Sofie Van Landeghem 0b4b4f1819 Documentation for Entity Linking (#4065)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bool's instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustements to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* typo fix

* add candidate API to kb documentation

* update API sidebar with EntityLinker and KnowledgeBase

* remove EL from 101 docs

* remove entity linker from 101 pipelines / rephrase

* custom el model instead of existing model

* set version to 2.2 for EL functionality

* update documentation for 2 CLI scripts
2019-09-12 11:38:34 +02:00
Sofie Van Landeghem 0ba1b5eebc CLI scripts for entity linking (wikipedia & generic) (#4091)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bool's instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustements to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* turn kb_creator into CLI script (wip)

* proper parameters for training entity vectors

* wikidata pipeline split up into two executable scripts

* remove context_width

* move wikidata scripts in bin directory, remove old dummy script

* refine KB script with logs and preprocessing options

* small edits

* small improvements to logging of EL CLI script
2019-08-13 15:38:59 +02:00
svlandeg b1911f7105 Errors.E146 for IO error when FP is null 2019-07-22 14:56:13 +02:00
svlandeg 5d544f89ba Errors.E145 for IO errors when reading KB 2019-07-22 14:36:07 +02:00
svlandeg 76184374e2 test corner cases 2019-07-22 13:39:32 +02:00
svlandeg 9f8c1e71a2 fix for Issue #4000 2019-07-22 13:34:12 +02:00
svlandeg dae8a21282 rename entity frequency 2019-07-19 17:40:28 +02:00
svlandeg ec55d2fccd filter training data beforehand (+black formatting) 2019-07-18 10:22:24 +02:00
svlandeg d833d4c358 fixes in kb and gold 2019-07-17 17:18:26 +02:00
svlandeg 4086c6ff60 get vector functionality + unit test 2019-07-17 12:17:02 +02:00
svlandeg dbc53b9870 rename to KBEntryC 2019-06-26 15:55:26 +02:00
svlandeg cc9ae28a52 custom error and warning messages 2019-06-19 12:35:26 +02:00
svlandeg a31648d28b further code cleanup 2019-06-19 09:15:43 +02:00
svlandeg 78dd3e11da write entity linking pipe to file and keep vocab consistent between kb and nlp 2019-06-13 16:25:39 +02:00
svlandeg a5c061f506 storing NEL training data in GoldParse objects 2019-06-07 12:58:42 +02:00
svlandeg d8b435ceff pretraining description vectors and storing them in the KB 2019-06-06 19:51:27 +02:00
svlandeg 5c723c32c3 entity vectors in the KB + serialization of them 2019-06-05 18:29:18 +02:00
svlandeg 1ae41daaa9 allow small rounding errors 2019-05-01 23:05:40 +02:00
svlandeg 60b54ae8ce bulk entity writing and experiment with regex wikidata reader to speed up processing 2019-05-01 00:00:38 +02:00
svlandeg 54d0cea062 unit test for KB serialization 2019-04-24 23:52:34 +02:00
svlandeg 3e0cb69065 KB aliases to and from file 2019-04-24 20:24:24 +02:00
svlandeg ad6c5e581c writing and reading number of entries to/from header 2019-04-24 15:31:44 +02:00
svlandeg 6e3223f234 bulk loading in proper order of entity indices 2019-04-24 11:26:38 +02:00
svlandeg 694fea597a dumping all entryC entries + (inefficient) reading back in 2019-04-23 18:36:50 +02:00
svlandeg 8e70a564f1 custom reader and writer for _EntryC fields (first stab at it - not complete) 2019-04-23 16:33:40 +02:00
svlandeg 10ee8dfea2 poc with few entities and collecting aliases from the WP links 2019-04-18 14:12:17 +02:00
svlandeg 9a7d534b1b enable nogil for cython functions in kb.pxd 2019-04-10 17:25:10 +02:00
svlandeg 61a33f55d2 little fixes 2019-04-10 16:06:09 +02:00
svlandeg 8814b9010d entity as one field instead of both ID and name 2019-03-25 18:10:41 +01:00
svlandeg 46f4eb5db3 error and warning messages 2019-03-22 16:55:05 +01:00
svlandeg b4cd5d5ee9 property annotations for fields with only a getter 2019-03-22 16:10:49 +01:00
svlandeg a48241e9a2 use nlp's vocab for stringstore 2019-03-22 11:36:45 +01:00
svlandeg 1ee0e78fd7 select candidate with highest prior probabiity 2019-03-22 11:36:45 +01:00
svlandeg 7b708ab8a4 name per entity 2019-03-22 11:36:45 +01:00
svlandeg c593607ce2 minimal EL pipe 2019-03-22 11:36:45 +01:00
svlandeg c71123dd0c ensure no candidates are returned for unknown aliases 2019-03-22 11:36:45 +01:00
svlandeg b6c3255a9f Entity class 2019-03-22 11:36:45 +01:00
svlandeg 1289cd6e8f property getters and keep track of KB internally 2019-03-22 11:36:45 +01:00
svlandeg 9a46c431c3 store entity hash instead of pointer 2019-03-22 11:36:45 +01:00
svlandeg 9819dca80e create candidate object from entry pointer (not fully functional yet) 2019-03-22 11:36:45 +01:00
svlandeg a9074e0886 check the length of entities and probabilities vector + unit test 2019-03-22 11:36:45 +01:00
svlandeg d133ffaff9 correct size, not counting dummy elements in the vector 2019-03-22 11:36:45 +01:00
svlandeg 33f8a0fe2e check and unit test in case prior probs exceed 1 2019-03-22 11:36:45 +01:00
svlandeg b55baaa1dc avoid value 0 in preshmap and helpful user warnings 2019-03-22 11:36:45 +01:00
svlandeg 20a7b7b1c0 raising error when adding alias for unknown entity + unit test 2019-03-22 11:36:45 +01:00
svlandeg 8843f9279c use StringStore 2019-03-22 11:36:45 +01:00
svlandeg 51560bf0ed bugfix adding aliases 2019-03-22 11:36:45 +01:00
svlandeg c4ba942765 get candidates by alias 2019-03-22 11:36:45 +01:00