spaCy

Commit Graph

Author	SHA1	Message	Date
Sofie Van Landeghem	2d249a9502	KB extensions and better parsing of WikiData (#4375) * fix overflow error on windows * more documentation & logging fixes * md fix * 3 different limit parameters to play with execution time * bug fixes directory locations * small fixes * exclude dev test articles from prior probabilities stats * small fixes * filtering wikidata entities, removing numeric and meta items * adding aliases from wikidata also to the KB * fix adding WD aliases * adding also new aliases to previously added entities * fixing comma's * small doc fixes * adding subclassof filtering * append alias functionality in KB * prevent appending the same entity-alias pair * fix for appending WD aliases * remove date filter * remove unnecessary import * small corrections and reformatting * remove WD aliases for now (too slow) * removing numeric entities from training and evaluation * small fixes * shortcut during prediction if there is only one candidate * add counts and fscore logging, remove FP NER from evaluation * fix entity_linker.predict to take docs instead of single sentences * remove enumeration sentences from the WP dataset * entity_linker.update to process full doc instead of single sentence * spelling corrections and dump locations in readme * NLP IO fix * reading KB is unnecessary at the end of the pipeline * small logging fix * remove empty files	2019-10-14 12:28:53 +02:00
Sofie Van Landeghem	0b4b4f1819	Documentation for Entity Linking (#4065) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * typo fix * add candidate API to kb documentation * update API sidebar with EntityLinker and KnowledgeBase * remove EL from 101 docs * remove entity linker from 101 pipelines / rephrase * custom el model instead of existing model * set version to 2.2 for EL functionality * update documentation for 2 CLI scripts	2019-09-12 11:38:34 +02:00
Sofie Van Landeghem	0ba1b5eebc	CLI scripts for entity linking (wikipedia & generic) (#4091) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * turn kb_creator into CLI script (wip) * proper parameters for training entity vectors * wikidata pipeline split up into two executable scripts * remove context_width * move wikidata scripts in bin directory, remove old dummy script * refine KB script with logs and preprocessing options * small edits * small improvements to logging of EL CLI script	2019-08-13 15:38:59 +02:00
svlandeg	cd6c263fe4	format offsets	2019-07-23 11:31:29 +02:00
svlandeg	9f8c1e71a2	fix for Issue #4000	2019-07-22 13:34:12 +02:00
svlandeg	dae8a21282	rename entity frequency	2019-07-19 17:40:28 +02:00
svlandeg	21176517a7	have gold.links correspond exactly to doc.ents	2019-07-19 12:36:15 +02:00
svlandeg	e1213eaf6a	use original gold object in get_loss function	2019-07-18 13:35:10 +02:00
svlandeg	ec55d2fccd	filter training data beforehand (+black formatting)	2019-07-18 10:22:24 +02:00
svlandeg	b7a0c9bf60	fixing the context/prior weight settings	2019-07-03 17:48:09 +02:00
svlandeg	8840d4b1b3	fix for context encoder optimizer	2019-07-03 13:35:36 +02:00
svlandeg	3420cbe496	small fixes	2019-07-03 10:25:51 +02:00
svlandeg	2d2dea9924	experiment with adding NER types to the feature vector	2019-06-29 14:52:36 +02:00
svlandeg	c664f58246	adding prior probability as feature in the model	2019-06-28 16:22:58 +02:00
svlandeg	1c80b85241	fix tests	2019-06-28 08:59:23 +02:00
svlandeg	68a0662019	context encoder with Tok2Vec + linking model instead of cosine	2019-06-28 08:29:31 +02:00
svlandeg	dbc53b9870	rename to KBEntryC	2019-06-26 15:55:26 +02:00
svlandeg	1de61f68d6	improve speed of prediction loop	2019-06-26 13:53:10 +02:00
svlandeg	bee23cd8af	try Tok2Vec instead of SpacyVectors	2019-06-25 16:09:22 +02:00
svlandeg	b58bace84b	small fixes	2019-06-24 10:55:04 +02:00
svlandeg	a31648d28b	further code cleanup	2019-06-19 09:15:43 +02:00
svlandeg	478305cd3f	small tweaks and documentation	2019-06-18 18:38:09 +02:00
svlandeg	0d177c1146	clean up code, remove old code, move to bin	2019-06-18 13:20:40 +02:00
svlandeg	ffae7d3555	sentence encoder only (removing article/mention encoder)	2019-06-18 00:05:47 +02:00
svlandeg	6332af40de	baseline performances: oracle KB, random and prior prob	2019-06-17 14:39:40 +02:00
svlandeg	24db1392b9	reprocessing all of wikipedia for training data	2019-06-16 21:14:45 +02:00
svlandeg	81731907ba	performance per entity type	2019-06-14 19:55:46 +02:00
svlandeg	b312f2d0e7	redo training data to be independent of KB and entity-level instead of doc-level	2019-06-14 15:55:26 +02:00
svlandeg	0b04d142de	regenerating KB	2019-06-13 22:32:56 +02:00
svlandeg	78dd3e11da	write entity linking pipe to file and keep vocab consistent between kb and nlp	2019-06-13 16:25:39 +02:00
svlandeg	b12001f368	small fixes	2019-06-12 22:05:53 +02:00
svlandeg	6521cfa132	speeding up training	2019-06-12 13:37:05 +02:00
svlandeg	66813a1fdc	speed up predictions	2019-06-11 14:18:20 +02:00
svlandeg	fe1ed432ef	eval on dev set, varying combo's of prior and context scores	2019-06-11 11:40:58 +02:00
svlandeg	83dc7b46fd	first tests with EL pipe	2019-06-10 21:25:26 +02:00
svlandeg	7de1ee69b8	training loop in proper pipe format	2019-06-07 15:55:10 +02:00
svlandeg	0486ccabfd	introduce goldparse.links	2019-06-07 13:54:45 +02:00
svlandeg	a5c061f506	storing NEL training data in GoldParse objects	2019-06-07 12:58:42 +02:00
svlandeg	61f0e2af65	code cleanup	2019-06-06 20:22:14 +02:00
svlandeg	d8b435ceff	pretraining description vectors and storing them in the KB	2019-06-06 19:51:27 +02:00
svlandeg	5c723c32c3	entity vectors in the KB + serialization of them	2019-06-05 18:29:18 +02:00
svlandeg	9abbd0899f	separate entity encoder to get 64D descriptions	2019-06-05 00:09:46 +02:00
svlandeg	fb37cdb2d3	implementing el pipe in pipes.pyx (not tested yet)	2019-06-03 21:32:54 +02:00
svlandeg	9e88763dab	60% acc run	2019-06-03 08:04:49 +02:00
svlandeg	268a52ead7	experimenting with cosine sim for negative examples (not OK yet)	2019-05-29 16:07:53 +02:00
svlandeg	a761929fa5	context encoder combining sentence and article	2019-05-28 18:14:49 +02:00
svlandeg	992fa92b66	refactor again to clusters of entities and cosine similarity	2019-05-28 00:05:22 +02:00
svlandeg	8c4aa076bc	small fixes	2019-05-27 14:29:38 +02:00
svlandeg	cfc27d7ff9	using Tok2Vec instead	2019-05-26 23:39:46 +02:00
svlandeg	abf9af81c9	learn rate en epochs	2019-05-24 22:04:25 +02:00

126 Commits