spaCy

Sofie Van Landeghem 7b96a5e10f Reduce mem usage in training Entity Linker (#4811 ) * move nlp processing for el pipe to batch training instead of preprocessing * adding dev eval back in, and limit in articles instead of entities * use pipe whenever possible * few more small doc changes * access dev data through generator * tqdm description * small fixes * update documentation		2020-01-06 14:59:50 +01:00
README.md	Reduce mem usage in training Entity Linker (#4811 )	2020-01-06 14:59:50 +01:00
__init__.py	KB extensions and better parsing of WikiData (#4375 )	2019-10-14 12:28:53 +02:00
entity_linker_evaluation.py	Reduce mem usage in training Entity Linker (#4811 )	2020-01-06 14:59:50 +01:00
kb_creator.py	KB extensions and better parsing of WikiData (#4375 )	2019-10-14 12:28:53 +02:00
train_descriptions.py	KB extensions and better parsing of WikiData (#4375 )	2019-10-14 12:28:53 +02:00
wiki_io.py	KB extensions and better parsing of WikiData (#4375 )	2019-10-14 12:28:53 +02:00
wiki_namespaces.py	KB extensions and better parsing of WikiData (#4375 )	2019-10-14 12:28:53 +02:00
wikidata_pretrain_kb.py	KB extensions and better parsing of WikiData (#4375 )	2019-10-14 12:28:53 +02:00
wikidata_processor.py	KB extensions and better parsing of WikiData (#4375 )	2019-10-14 12:28:53 +02:00
wikidata_train_entity_linker.py	Reduce mem usage in training Entity Linker (#4811 )	2020-01-06 14:59:50 +01:00
wikipedia_processor.py	Reduce mem usage in training Entity Linker (#4811 )	2020-01-06 14:59:50 +01:00

Entity Linking with Wikipedia and Wikidata

Step 1: Run wikidata_pretrain_kb.py

This takes as input the locations of a Wikipedia dump and a Wikidata dump, and produces a KB directory and a training file.
- Wikidata: get latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/
- Wikipedia: get enwiki-latest-pages-articles-multistream.xml.bz2 from https://dumps.wikimedia.org/enwiki/latest/ (or the corresponding dump for any other language)
You can set the filtering parameters for KB construction:
- max_per_alias (-a): maximum number of candidate entities stored in the KB per alias/synonym
- min_freq (-f): minimum number of times an entity must occur in the corpus to be included in the KB
- min_pair (-c): minimum number of times an entity+alias combination must occur in the corpus to be included in the KB
Further parameters to set:
- descriptions_from_wikipedia (-wp): whether to parse descriptions from Wikipedia (True) or Wikidata (False)
- entity_vector_length (-v): length of the pre-trained entity description vectors
- lang (-la): language for which to fetch Wikidata information (as the dump contains all languages)
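
To make the interplay of the three KB filtering parameters (max_per_alias, min_freq, min_pair) more concrete, here is a minimal sketch in plain Python. It is not the code used by the script: the function name filter_candidates, the input dictionaries, and the choice to treat both thresholds as inclusive minimums are assumptions made for illustration only.

```python
from collections import defaultdict


def filter_candidates(entity_freq, pair_counts, max_per_alias=10, min_freq=20, min_pair=5):
    """Illustrative only: apply the three KB filters to toy counts.

    entity_freq: {entity_id: number of occurrences in the corpus}
    pair_counts: {(alias, entity_id): number of co-occurrences in the corpus}
    Returns {alias: [(entity_id, count), ...]} sorted by descending count.
    """
    by_alias = defaultdict(list)
    for (alias, entity), count in pair_counts.items():
        # min_freq: skip entities that are too rare overall (assumed inclusive minimum).
        if entity_freq.get(entity, 0) < min_freq:
            continue
        # min_pair: skip alias+entity combinations that are too rare.
        if count < min_pair:
            continue
        by_alias[alias].append((entity, count))

    result = {}
    for alias, pairs in by_alias.items():
        # max_per_alias: keep only the most frequent candidates for each alias.
        # Ties are broken by entity ID so the output is deterministic.
        pairs.sort(key=lambda pair: (-pair[1], pair[0]))
        result[alias] = pairs[:max_per_alias]
    return result
```

With toy counts where one entity is rare overall and one alias+entity pair is rare, both are dropped before the per-alias cap is applied. Raising the thresholds gives a smaller KB with fewer candidates per mention; lowering them gives a larger one.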

Quick testing and rerunning:

For a quick test of the pipeline, set limit_prior (-lp), limit_train (-lt) and/or limit_wd (-lw) to read only part of each dump instead of everything.
If you only want to (re)run certain parts of the pipeline, remove the corresponding output files and they will be recalculated or reparsed.
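
The two ideas in this section, reading only part of a compressed dump and recomputing a step only when its output file is missing, can be sketched with the standard library as follows. This is an illustration of the general pattern, not the scripts' own code: read_dump_lines, count_lines_cached, and the cache file are hypothetical names, and counting lines stands in for whatever a real pipeline step computes.

```python
import bz2
from itertools import islice
from pathlib import Path


def read_dump_lines(dump_path, limit=None):
    """Yield text lines from a .bz2 file, stopping after `limit` lines if given."""
    with bz2.open(dump_path, mode="rt", encoding="utf8") as dump:
        # islice(dump, None) reads everything; islice(dump, n) stops after n lines.
        yield from islice(dump, limit)


def count_lines_cached(dump_path, cache_path, limit=None):
    """Hypothetical pipeline step: reuse the stored result unless its file was removed."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # The output of an earlier run is still there, so skip the expensive pass.
        return int(cache_path.read_text(encoding="utf8"))
    n_lines = sum(1 for _ in read_dump_lines(dump_path, limit=limit))
    cache_path.write_text(str(n_lines), encoding="utf8")
    return n_lines
```

One consequence of this pattern is that a result produced with a small limit stays in place until its file is deleted, so remove the intermediate files from a quick test before starting a full run.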

Step 2: Run wikidata_train_entity_linker.py

This takes the KB directory produced by Step 1 and trains an Entity Linking model.
Specify the output directory (-o) in which the final, trained model will be saved.
You can set the learning parameters for the EL training:
- epochs (-e): number of training iterations
- dropout (-p): dropout rate
- lr (-n): learning rate
- l2 (-r): L2 regularization
Specify the number of training articles and dev (evaluation) articles with train_articles (-t) and dev_articles (-d), respectively.
- If these are not specified, the full dataset will be processed, which may take a long time.
Further parameters to set:
- labels_discard (-l): NER label types to discard during training
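
As a rough sketch of how the Step 2 options fit together, the helper below assembles a command line from the flags listed above. It only builds the argument list and does not run anything. The helper name build_train_command is hypothetical, and passing the KB directory as the first positional argument is an assumption, since the text above does not say how it is supplied; check the script's own help output before relying on it.

```python
import sys


def build_train_command(kb_dir, output_dir, epochs=None, dropout=None, lr=None,
                        l2=None, train_articles=None, dev_articles=None):
    """Assemble an argument list for wikidata_train_entity_linker.py (illustrative)."""
    # ASSUMPTION: the KB directory is passed positionally; only -o is documented above.
    cmd = [sys.executable, "wikidata_train_entity_linker.py", str(kb_dir), "-o", str(output_dir)]
    options = [
        ("-e", epochs),          # number of training iterations
        ("-p", dropout),         # dropout rate
        ("-n", lr),              # learning rate
        ("-r", l2),              # L2 regularization
        ("-t", train_articles),  # number of training articles
        ("-d", dev_articles),    # number of dev articles
    ]
    for flag, value in options:
        # Leave out unset options so the script falls back to its own defaults.
        if value is not None:
            cmd.extend([flag, str(value)])
    return cmd
```

The resulting list could be handed to subprocess.run. Setting train_articles and dev_articles to small values is the simplest way to keep a first training run short.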