Commit Graph

340 Commits

Author SHA1 Message Date
svlandeg d8b435ceff pretraining description vectors and storing them in the KB 2019-06-06 19:51:27 +02:00
svlandeg 5c723c32c3 entity vectors in the KB + serialization of them 2019-06-05 18:29:18 +02:00
svlandeg 9abbd0899f separate entity encoder to get 64D descriptions 2019-06-05 00:09:46 +02:00
svlandeg fb37cdb2d3 implementing el pipe in pipes.pyx (not tested yet) 2019-06-03 21:32:54 +02:00
svlandeg d83a1e3052 Merge branch 'master' into feature/nel-wiki 2019-06-03 09:35:10 +02:00
svlandeg 9e88763dab 60% acc run 2019-06-03 08:04:49 +02:00
svlandeg 268a52ead7 experimenting with cosine sim for negative examples (not OK yet) 2019-05-29 16:07:53 +02:00
svlandeg a761929fa5 context encoder combining sentence and article 2019-05-28 18:14:49 +02:00
svlandeg 992fa92b66 refactor again to clusters of entities and cosine similarity 2019-05-28 00:05:22 +02:00
svlandeg 8c4aa076bc small fixes 2019-05-27 14:29:38 +02:00
svlandeg cfc27d7ff9 using Tok2Vec instead 2019-05-26 23:39:46 +02:00
svlandeg abf9af81c9 learn rate and epochs 2019-05-24 22:04:25 +02:00
svlandeg 86ed771e0b adding local sentence encoder 2019-05-23 16:59:11 +02:00
svlandeg 4392c01b7b obtain sentence for each mention 2019-05-23 15:37:05 +02:00
svlandeg 97241a3ed7 upsampling and batch processing 2019-05-22 23:40:10 +02:00
svlandeg 1a16490d20 update per entity 2019-05-22 12:46:40 +02:00
svlandeg eb08bdb11f hidden width for encoders 2019-05-21 23:42:46 +02:00
svlandeg 7b13e3d56f undersampling negatives 2019-05-21 18:35:10 +02:00
svlandeg 2fa3fac851 fix concat bp and more efficient batch calls 2019-05-21 13:43:59 +02:00
svlandeg 0a15ee4541 fix in bp call 2019-05-20 23:54:55 +02:00
svlandeg 89e322a637 small fixes 2019-05-20 17:20:39 +02:00
svlandeg 7edb2e1711 fix convolution layer 2019-05-20 11:58:48 +02:00
svlandeg dd691d0053 debugging 2019-05-17 17:44:11 +02:00
svlandeg 400b19353d simplify architecture and larger-scale test runs 2019-05-17 01:51:18 +02:00
svlandeg d51bffe63b clean up code 2019-05-16 18:36:15 +02:00
svlandeg b5470f3d75 various tests, architectures and experiments 2019-05-16 18:25:34 +02:00
svlandeg 9ffe5437ae calculate gradient for entity encoding 2019-05-15 02:23:08 +02:00
svlandeg 2713abc651 implement loss function using dot product and prob estimate per candidate cluster 2019-05-14 22:55:56 +02:00
svlandeg 09ed446b20 different architecture / settings 2019-05-14 08:37:52 +02:00
svlandeg 4142e8dd1b train and predict per article (saving time for doc encoding) 2019-05-13 17:02:34 +02:00
svlandeg 3b81b00954 evaluating on dev set during training 2019-05-13 14:26:04 +02:00
svlandeg b6d788064a some first experiments with different architectures and metrics 2019-05-10 12:53:14 +02:00
svlandeg 9d089c0410 grouping clusters of instances per doc+mention 2019-05-09 18:11:49 +02:00
svlandeg c6ca8649d7 first stab at model - not functional yet 2019-05-09 17:23:19 +02:00
svlandeg 9f33732b96 using entity descriptions and article texts as input embedding vectors for training 2019-05-07 16:03:42 +02:00
svlandeg 7e348d7f7f baseline evaluation using highest-freq candidate 2019-05-06 15:13:50 +02:00
Ines Montani dd153b2b33 Simplify helper (see #3681) [ci skip] 2019-05-06 15:13:10 +02:00
Ines Montani f8fce6c03c Fix typo (see #3681) 2019-05-06 15:02:11 +02:00
Ines Montani f2a56c1b56 Rewrite example to use Retokenizer (resolves #3681). Also add helper to filter spans 2019-05-06 14:51:18 +02:00
svlandeg 6961215578 refactor code to separate functionality into different files 2019-05-06 10:56:56 +02:00
svlandeg f5190267e7 run only 100M of WP data as training dataset (9%) 2019-05-03 18:09:09 +02:00
svlandeg 4e929600e5 fix WP id parsing, speed up processing and remove ambiguous strings in one doc (for now) 2019-05-03 17:37:47 +02:00
svlandeg 34600c92bd try catch per article to ensure the pipeline goes on 2019-05-03 15:10:09 +02:00
svlandeg bbcb9da466 creating training data with clean WP texts and QID entities true/false 2019-05-03 10:44:29 +02:00
svlandeg cba9680d13 run NER on clean WP text and link to gold-standard entity IDs 2019-05-02 17:24:52 +02:00
svlandeg 581dc9742d parsing clean text from WP articles to use as input data for NER and NEL 2019-05-02 17:09:56 +02:00
svlandeg 8353552191 cleanup 2019-05-01 23:26:16 +02:00
svlandeg 1ae41daaa9 allow small rounding errors 2019-05-01 23:05:40 +02:00
svlandeg 3629a52ede reading all persons in wikidata 2019-05-01 01:00:59 +02:00
svlandeg 60b54ae8ce bulk entity writing and experiment with regex wikidata reader to speed up processing 2019-05-01 00:00:38 +02:00