svlandeg
c6ca8649d7
first stab at model - not functional yet
2019-05-09 17:23:19 +02:00
svlandeg
9f33732b96
using entity descriptions and article texts as input embedding vectors for training
2019-05-07 16:03:42 +02:00
svlandeg
7e348d7f7f
baseline evaluation using highest-freq candidate
2019-05-06 15:13:50 +02:00
svlandeg
6961215578
refactor code to separate functionality into different files
2019-05-06 10:56:56 +02:00
svlandeg
f5190267e7
run only 100M of WP data as training dataset (9%)
2019-05-03 18:09:09 +02:00
svlandeg
4e929600e5
fix WP id parsing, speed up processing and remove ambiguous strings in one doc (for now)
2019-05-03 17:37:47 +02:00
svlandeg
34600c92bd
try catch per article to ensure the pipeline goes on
2019-05-03 15:10:09 +02:00
svlandeg
bbcb9da466
creating training data with clean WP texts and QID entities true/false
2019-05-03 10:44:29 +02:00
svlandeg
cba9680d13
run NER on clean WP text and link to gold-standard entity IDs
2019-05-02 17:24:52 +02:00
svlandeg
581dc9742d
parsing clean text from WP articles to use as input data for NER and NEL
2019-05-02 17:09:56 +02:00
svlandeg
8353552191
cleanup
2019-05-01 23:26:16 +02:00
svlandeg
1ae41daaa9
allow small rounding errors
2019-05-01 23:05:40 +02:00
svlandeg
3629a52ede
reading all persons in wikidata
2019-05-01 01:00:59 +02:00
svlandeg
60b54ae8ce
bulk entity writing and experiment with regex wikidata reader to speed up processing
2019-05-01 00:00:38 +02:00
svlandeg
653b7d9c87
calculate entity raw counts offline to speed up KB construction
2019-04-30 11:39:42 +02:00
svlandeg
19e8f339cb
deduce entity freq from WP corpus and serialize vocab in WP test
2019-04-29 17:37:29 +02:00
svlandeg
54d0cea062
unit test for KB serialization
2019-04-24 23:52:34 +02:00
svlandeg
3e0cb69065
KB aliases to and from file
2019-04-24 20:24:24 +02:00
svlandeg
ad6c5e581c
writing and reading number of entries to/from header
2019-04-24 15:31:44 +02:00
svlandeg
6e3223f234
bulk loading in proper order of entity indices
2019-04-24 11:26:38 +02:00
svlandeg
694fea597a
dumping all entryC entries + (inefficient) reading back in
2019-04-23 18:36:50 +02:00
svlandeg
8e70a564f1
custom reader and writer for _EntryC fields (first stab at it - not complete)
2019-04-23 16:33:40 +02:00
svlandeg
004e5e7d1c
little fixes
2019-04-19 14:24:02 +02:00
svlandeg
9a8197185b
fix alias capitalization
2019-04-18 22:37:50 +02:00
svlandeg
9f308eb5dc
fixes for prior prob and linking wikidata IDs with wikipedia titles
2019-04-18 16:14:25 +02:00
svlandeg
10ee8dfea2
poc with few entities and collecting aliases from the WP links
2019-04-18 14:12:17 +02:00
svlandeg
6763e025e1
parse wp dump for links to determine prior probabilities
2019-04-15 11:41:57 +02:00
svlandeg
3163331b1e
wikipedia dump parser and mediawiki format regex cleanup
2019-04-14 21:52:01 +02:00
svlandeg
b31a390a9a
reading types, claims and sitelinks
2019-04-11 21:42:44 +02:00
svlandeg
6e997be4b4
reading wikidata descriptions and aliases
2019-04-11 21:08:22 +02:00
svlandeg
9a7d534b1b
enable nogil for cython functions in kb.pxd
2019-04-10 17:25:10 +02:00
Ines Montani
24cecdb44f
Update compatibility [ci skip]
2019-04-01 16:25:16 +02:00
Sofie
a4a6bfa4e1
Merge branch 'master' into feature/el-framework
2019-03-26 11:00:02 +01:00
svlandeg
8814b9010d
entity as one field instead of both ID and name
2019-03-25 18:10:41 +01:00
Matthew Honnibal
6c783f8045
Bug fixes and options for TextCategorizer (#3472)
* Fix code for bag-of-words feature extraction
The _ml.py module had a redundant copy of a function to extract unigram
bag-of-words features, except one had a bug that set values to 0.
Another function allowed extraction of bigram features. Replace all three
with a new function that supports arbitrary ngram sizes and also allows
control of which attribute is used (e.g. ORTH, LOWER, etc).
* Support 'bow' architecture for TextCategorizer
This allows efficient ngram bag-of-words models, which are better when
the classifier needs to run quickly, especially when the texts are long.
Pass architecture="bow" to use it. The extra arguments ngram_size and
attr are also available, e.g. ngram_size=2 means unigram and bigram
features will be extracted.
* Fix size limits in train_textcat example
* Explain architectures better in docs
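
The n-gram behaviour described above (ngram_size=2 yields unigram and bigram features) can be illustrated with a minimal standalone sketch. This is not the code from the commit: the function name `extract_ngrams` and the `lower` flag are hypothetical, and plain token strings stand in for token attributes such as ORTH or LOWER.

```python
from collections import Counter


def extract_ngrams(tokens, ngram_size=1, lower=False):
    """Count every n-gram of length 1..ngram_size in a token list.

    Hypothetical helper for illustration only; it is not part of spaCy.
    `lower=True` loosely mimics choosing a LOWER-style attribute.
    """
    if ngram_size < 1:
        raise ValueError("ngram_size must be at least 1")
    if lower:
        tokens = [t.lower() for t in tokens]
    counts = Counter()
    # ngram_size=2 collects unigrams (n=1) and bigrams (n=2).
    for n in range(1, ngram_size + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


features = extract_ngrams(["The", "cat", "the", "hat"], ngram_size=2, lower=True)
```

Here `features` holds counts for both single tokens and adjacent pairs, which is the kind of sparse feature set a bag-of-words classifier would consume.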
2019-03-23 16:44:44 +01:00
svlandeg
9de9900510
adding future import unicode literals to .py files
2019-03-22 16:18:04 +01:00
Matthew Honnibal
4c5f265884
Fix train loop for train_textcat example
2019-03-22 16:10:11 +01:00
svlandeg
5318ce88fa
'entity_linker' instead of 'el'
2019-03-22 13:55:10 +01:00
svlandeg
a48241e9a2
use nlp's vocab for stringstore
2019-03-22 11:36:45 +01:00
svlandeg
1ee0e78fd7
select candidate with highest prior probability
2019-03-22 11:36:45 +01:00
Matthew Honnibal
4e3ed2ea88
Add -t2v argument to train_textcat script
2019-03-20 23:05:42 +01:00
Ines Montani
399987c216
Test and update examples [ci skip]
2019-03-16 14:15:49 +01:00
Ines Montani
cb5dbfa63a
Tidy up references to n_threads and fix default
2019-03-15 16:24:26 +01:00
Matthew Honnibal
4dc57d9e15
Update train_new_entity_type example
2019-02-24 16:41:03 +01:00
Matthew Honnibal
7ac0f9626c
Update rehearsal example
2019-02-24 16:17:41 +01:00
Matthew Honnibal
981cb89194
Fix f-score calculation if zero
2019-02-23 12:45:41 +01:00
Matthew Honnibal
5063d999e5
Set architecture in textcat example
2019-02-23 11:57:59 +01:00
Matthew Honnibal
582be8746c
Update multi_processing example
2019-02-21 10:33:16 +01:00
Ines Montani
9696cf16c1
Merge branch 'master' into develop
2019-02-20 21:31:27 +01:00
Michael Liberman
386cec1979
- Json fix in comment (#3294)
2019-02-19 18:01:35 +01:00