10357 Commits

Author SHA1 Message Date
Ines Montani f2ea3e3ea2 Merge branch 'master' into feature/nel-wiki 2019-07-09 21:57:47 +02:00
Ines Montani 547464609d Remove merge_subtokens from parser postprocessing for now 2019-07-09 21:50:30 +02:00
Björn Böing 04982ccc40 Update pretrain to prevent unintended overwriting of weight fil… (#3902)
* Update pretrain to prevent unintended overwriting of weight files for #3859

* Add '--epoch-start' to pretrain docs

* Add missing pretrain arguments to bash example

* Update doc tag for v2.1.5
2019-07-09 21:48:30 +02:00
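The overwrite guard described in the commit above can be illustrated with a small sketch. This is not the actual `pretrain` code: the helper names `save_weights` and `run_epochs`, the `model{N}.bin` file naming, and the way `epoch_start` offsets the numbering are all assumptions made for illustration.

```python
from pathlib import Path


def save_weights(output_dir, epoch, data):
    """Write one weights file, refusing to replace an existing one (hypothetical helper)."""
    path = Path(output_dir) / "model{}.bin".format(epoch)
    if path.exists():
        raise FileExistsError("Refusing to overwrite {}".format(path))
    path.write_bytes(data)
    return path


def run_epochs(output_dir, n_iter, epoch_start=0):
    """Number the saved files from epoch_start so a resumed run does not clash (assumed behavior)."""
    saved = []
    for epoch in range(epoch_start, epoch_start + n_iter):
        # Placeholder bytes stand in for the real serialized weights.
        saved.append(save_weights(output_dir, epoch, b"weights"))
    return saved
```

Under these assumptions, resuming a run that already wrote `model0.bin` and `model1.bin` would pass `epoch_start=2`, and a resume that forgot the offset would fail loudly instead of silently replacing earlier files.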
Alejandro Alcalde 6d577f0b92 Evaluation of NER model per entity type, closes #3490 (#3911)
* Evaluation of NER model per entity type, closes #3490

Now each ent score is tracked individually in order to have its own Precision, Recall and F1 Score

* Keep track of each entity individually using dicts

* Improving how to compute the scores for each entity

* Fixed bug computing scores for ents

* Formatting with black

* Added key ents_per_type to the scores function

The key `ents_per_type` contains the metrics Precision, Recall and F1-Score for each entity individually
2019-07-09 20:54:59 +02:00
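The per-type scoring described in the commit above can be sketched roughly as follows. This is a minimal, self-contained illustration and not spaCy's actual scorer code: the `PRF` class, the `score_ents_per_type` helper, and the `(start, end, label)` tuple format are assumptions made for the example.

```python
from collections import defaultdict


class PRF:
    """Running true-positive, false-positive and false-negative counts."""

    def __init__(self):
        self.tp = 0
        self.fp = 0
        self.fn = 0

    @property
    def precision(self):
        total = self.tp + self.fp
        return self.tp / total if total else 0.0

    @property
    def recall(self):
        total = self.tp + self.fn
        return self.tp / total if total else 0.0

    @property
    def fscore(self):
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


def score_ents_per_type(gold, pred):
    """Score sets of (start, end, label) tuples, keeping one PRF per label."""
    per_type = defaultdict(PRF)
    for ent in pred:
        if ent in gold:
            per_type[ent[2]].tp += 1  # exact span and label match
        else:
            per_type[ent[2]].fp += 1  # predicted but not in gold
    for ent in gold - pred:
        per_type[ent[2]].fn += 1  # in gold but missed
    return {
        label: {"p": s.precision, "r": s.recall, "f": s.fscore}
        for label, s in per_type.items()
    }


gold = {(0, 1, "PERSON"), (3, 4, "ORG"), (6, 7, "ORG")}
pred = {(0, 1, "PERSON"), (3, 4, "ORG"), (8, 9, "ORG")}
scores = score_ents_per_type(gold, pred)
```

Keeping one counter per label in a dict mirrors the "keep track of each entity individually using dicts" step in the commit message; the resulting mapping plays the role of the `ents_per_type` key.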
Joshua Smith 2eb925bd05 Added an argument to `EntityRuler` constructor to pass attrs to… (#3919)
* Preserve flags in EntityRuler

The EntityRuler (explosion/spaCy#3526) does not preserve
overwrite flags (or `ent_id_sep`) when serialized.  This
commit adds support for serialization/deserialization preserving
overwrite and ent_id_sep flags.

* add signed contributor agreement

* flake8 cleanup

mostly blank line issues.

* mark test from the issue as needing a model

The test from the issue needs some language model for serialization
but the test wasn't originally marked correctly.

* Adds `phrase_matcher_attr` to allow args to PhraseMatcher

This is an added arg to pass to the `PhraseMatcher`. For example,
this allows creation of a case insensitive phrase matcher when the
`EntityRuler` is created.  References explosion/spaCy#3822

* remove unneeded model loading

The model didn't need to be loaded, and I replaced it with
a change that doesn't require it (using existing fixtures)

* updated docstring for new argument

* updated docs to reflect new argument to the EntityRuler constructor

* change tempdir handling to be compatible with python 2.7

* return conflicted code to entityruler

Some stuff got cut out because of merge conflicts, this
returns that code for the phrase_matcher_attr.

* fixed typo in the code added back after conflicts

* flake8 compliance

When I deconflicted the branch there were some flake8 issues
introduced. This resolves the spacing problems.

* test changes:  attempts to fix flaky test in python3.5

These tests seem to be a little flaky in 3.5, so I changed the check to avoid
the comparisons that seem to fail sometimes.
2019-07-09 20:09:17 +02:00
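To make the `phrase_matcher_attr` idea concrete, here is a toy sketch of attribute-based phrase matching. It does not use spaCy's `PhraseMatcher`; the `make_phrase_matcher` function and its whitespace tokenization are hypothetical, and only the attribute names `ORTH` (exact text) and `LOWER` (lowercased text) are borrowed from spaCy's token attributes.

```python
def make_phrase_matcher(phrases, attr="ORTH"):
    """Build a matcher that compares tokens on the chosen attribute (toy version)."""
    normalizers = {
        "ORTH": lambda text: text,           # exact, case-sensitive comparison
        "LOWER": lambda text: text.lower(),  # case-insensitive comparison
    }
    normalize = normalizers[attr]
    # Store every phrase as a tuple of normalized tokens.
    patterns = {tuple(normalize(tok) for tok in phrase.split()) for phrase in phrases}

    def match(tokens):
        keys = [normalize(tok) for tok in tokens]
        hits = []
        for start in range(len(keys)):
            for end in range(start + 1, len(keys) + 1):
                if tuple(keys[start:end]) in patterns:
                    hits.append((start, end))
        return hits

    return match


tokens = ["I", "love", "new", "york"]
exact = make_phrase_matcher(["New York"])
lower = make_phrase_matcher(["New York"], attr="LOWER")
exact_hits = exact(tokens)
lower_hits = lower(tokens)
```

With the default attribute the lowercase text is not matched, while the `LOWER` variant finds the span covering tokens 2 and 3. Passing the attribute at construction time is the design choice the commit describes for the `EntityRuler`.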
Alex a795fbd3b2 added contributor agreement ameyuuno.md (#3925)
@ines hi! 
I asked to change my username (yuukos -> ameyuuno). So I added a new contributor agreement.
2019-07-09 10:09:52 +02:00
Joshua Smith e8420ab2b7 Added support for serializing overwrite and ent_id_sep (#3918)
* Preserve flags in EntityRuler

The EntityRuler (explosion/spaCy#3526) does not preserve
overwrite flags (or `ent_id_sep`) when serialized.  This
commit adds support for serialization/deserialization preserving
overwrite and ent_id_sep flags.

* add signed contributor agreement

* flake8 cleanup

mostly blank line issues.

* mark test from the issue as needing a model

The test from the issue needs some language model for serialization
but the test wasn't originally marked correctly.

* remove unneeded model loading

The model didn't need to be loaded, and I replaced it with
a change that doesn't require it (using existing fixtures)

* change tempdir handling to be compatible with python 2.7

* Adds code to handle items saved before this change.

This code changes how the save files are handled and how the bytes
are stored as well.  This code adds a check to dispatch correctly
if it encounters bytes or files saved in the old format (and tests
for those cases).

* use util function for tempdir management

Updated after PR comments: this code now uses the make_tempdir function from util
instead of doing it by hand.
2019-07-08 17:28:28 +02:00
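The backward-compatible serialization described in the commit above could look roughly like the sketch below. It is an illustration under stated assumptions, not the `EntityRuler` implementation: the `ToyRuler` class is hypothetical, JSON stands in for whatever byte format the library uses, the old format is assumed to be a bare list of patterns, and the `"||"` default separator is also an assumption.

```python
import json

DEFAULT_ENT_ID_SEP = "||"  # assumed default, for illustration only


class ToyRuler:
    """Hypothetical stand-in that stores patterns plus two flags."""

    def __init__(self, patterns=None, overwrite=False, ent_id_sep=DEFAULT_ENT_ID_SEP):
        self.patterns = list(patterns or [])
        self.overwrite = overwrite
        self.ent_id_sep = ent_id_sep

    def to_bytes(self):
        # New format: a dict that carries the flags alongside the patterns.
        data = {
            "overwrite": self.overwrite,
            "ent_id_sep": self.ent_id_sep,
            "patterns": self.patterns,
        }
        return json.dumps(data).encode("utf8")

    def from_bytes(self, raw):
        data = json.loads(raw.decode("utf8"))
        if isinstance(data, dict):
            # New format: restore flags and patterns.
            self.patterns = data.get("patterns", [])
            self.overwrite = data.get("overwrite", False)
            self.ent_id_sep = data.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
        else:
            # Old format (assumed to be a bare pattern list): keep current flags.
            self.patterns = data
        return self


saved = ToyRuler([{"label": "ORG", "pattern": "Apple"}], overwrite=True, ent_id_sep="::")
restored = ToyRuler().from_bytes(saved.to_bytes())
legacy = ToyRuler().from_bytes(json.dumps([{"label": "ORG", "pattern": "Apple"}]).encode("utf8"))
```

Dispatching on the shape of the decoded payload lets data written before the change load without errors, at the cost of falling back to default flags for those older files.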
Knut O. Hellan a54f0cfc2b Norwegian tweaks (#3894)
* Norwegian fix

Add support for alternative past tense verb form (vaska).

* Norwegian months

Add all Norwegian months to tokenizer exceptions.

* More Norwegian abbreviations

Add more Norwegian abbreviations to tokenizer_exceptions.

* Contributor agreement khellan

Add signed contributor agreement for khellan (Knut O. Hellan).
2019-07-08 10:28:47 +02:00
Patrick Hogan 8c0586fd9c Update example and sign contributor agreement (#3916)
* Sign contributor agreement for askhogan

* Remove unneeded `seen_tokens` which is never used within the scope
2019-07-08 10:27:20 +02:00
Rokas Ramanauskas 61ce126d4c Lithuanian language support (#3895)
* initial LT lang support

* Added more stopwords. Started setting up some basic test environment (not complete)

* Initial morph rules for LT lang

* Closes #1 Adds tokenizer exceptions for Lithuanian

* Closes #5 Punctuation rules. Closes #6 Lexical Attributes

* test: add native examples to basic tests

* feat: add tag map for lt lang

* fix: remove undefined tag attribute 'Definite'

* feat: add lemmatizer for lt lang

* refactor: add new instances to lt lang morph rules; use tags from tag map

* refactor: add morph rules to lt lang defaults

* refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup

* refactor: add capitalized words to lt lang lemmatizer

* refactor: add more num words to lt lang lex attrs

* refactor: update lt lang stop word set

* refactor: add new instances to lt lang tokenizer exceptions

* refactor: remove comments from lt lang init file

* refactor: use function instead of lambda in lt lex lang getter

* refactor: remove conversion to dict in lt init when dict is already provided

* chore: rename lt 'test_basic' to 'test_text'

* feat: add more lt text tests

* feat: add lemmatizer tests

* refactor: remove unused imports, add newline to end of file

* chore: add contributor agreement

* chore: change 'en' to 'lt' in lt example description

* fix: add missing encoding info

* style: add newline to end of file

* refactor: use python2 compatible syntax

* style: reformat code using black
2019-07-08 10:25:22 +02:00
svlandeg b7a0c9bf60 fixing the context/prior weight settings 2019-07-03 17:48:09 +02:00
svlandeg 0ea52c86b8 remove redundancy 2019-07-03 15:02:10 +02:00
svlandeg 668b17ea4a deuglify kb deserializer 2019-07-03 15:00:42 +02:00
svlandeg 8840d4b1b3 fix for context encoder optimizer 2019-07-03 13:35:36 +02:00
svlandeg 3420cbe496 small fixes 2019-07-03 10:25:51 +02:00
svlandeg 2d2dea9924 experiment with adding NER types to the feature vector 2019-06-29 14:52:36 +02:00
svlandeg c664f58246 adding prior probability as feature in the model 2019-06-28 16:22:58 +02:00
svlandeg 1c80b85241 fix tests 2019-06-28 08:59:23 +02:00
svlandeg 68a0662019 context encoder with Tok2Vec + linking model instead of cosine 2019-06-28 08:29:31 +02:00
Ines Montani 4f1dae1c6b Update languages and examples (see #1107) 2019-06-26 16:19:17 +02:00
svlandeg dbc53b9870 rename to KBEntryC 2019-06-26 15:55:26 +02:00
Ines Montani 37f744ca00 Auto-format [ci skip] 2019-06-26 14:48:09 +02:00
Ines Montani d361e380b8 Fix matcher callback example (closes #3862) 2019-06-26 14:47:26 +02:00
Ines Montani 6ccdf37574 Exclude user_data when copying doc in displaCy (closes #3882) 2019-06-26 14:37:05 +02:00
svlandeg 1de61f68d6 improve speed of prediction loop 2019-06-26 13:53:10 +02:00
svlandeg bee23cd8af try Tok2Vec instead of SpacyVectors 2019-06-25 16:09:22 +02:00
svlandeg 8608685543 ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
svlandeg 58a5b40ef6 clean up duplicate code 2019-06-24 15:19:58 +02:00
svlandeg ddc73b11a9 fix unicode literals 2019-06-24 12:58:18 +02:00
Bram Vanroy f22704621e Update CITATION (#3873)
As discussed in https://github.com/explosion/spaCy/pull/2167 the citation should look slightly different.
2019-06-24 11:03:16 +02:00
svlandeg f4af47ce4a Merge branch 'feature/nel-wiki' of https://github.com/svlandeg/spaCy into feature/nel-wiki 2019-06-24 10:57:07 +02:00
svlandeg b58bace84b small fixes 2019-06-24 10:55:04 +02:00
Ines Montani c833d9b314 Add "v.s." to English tokenizer exceptions (see #3868) 2019-06-20 17:48:45 +02:00
Ines Montani ae2c208735 Auto-format [ci skip] 2019-06-20 10:36:38 +02:00
Ines Montani 872121955c Update error code 2019-06-20 10:35:51 +02:00
Ines Montani e1be80e3ec Merge branch 'master' into pr/3864 2019-06-20 10:35:37 +02:00
Guillaume Claret d7a519a922 Typo (#3865)
* Typo

* Add contributor agreement
2019-06-20 10:31:19 +02:00
Björn Böing ebf5a04d6c Update pretrain docs and add unsupported loss_func error (#3860)
* Add error to `get_vectors_loss` for unsupported loss function of `pretrain`

* Add missing "--loss-func" argument to pretrain docs. Update pretrain plac annotations to match docs.

* Add missing quotation marks
2019-06-20 10:30:44 +02:00
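The "unsupported loss function" check from the commit above can be sketched as a simple lookup that fails early. This is not the code in `get_vectors_loss`: the `select_loss` helper, the plain-list vector representation, and the assumption that the accepted names are `"L2"` and `"cosine"` are all illustrative.

```python
import math


def l2_loss(pred, target):
    """Sum of squared differences between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def cosine_loss(pred, target):
    """One minus the cosine similarity of the two vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm if norm else 1.0


LOSSES = {"L2": l2_loss, "cosine": cosine_loss}  # assumed set of supported names


def select_loss(name):
    """Return the loss function for name, or raise a descriptive error."""
    if name not in LOSSES:
        raise ValueError(
            "Unsupported loss function '{}'. Expected one of: {}".format(
                name, ", ".join(sorted(LOSSES))
            )
        )
    return LOSSES[name]
```

Raising at selection time means a mistyped `--loss-func` value is reported before any training work starts, rather than surfacing later as an obscure failure.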
Alejandro Alcalde 4866a7ee9e Changed learning rate by its param name. (#3855)
* Changed learning rate by its param name.

I've been searching for a while for how the learning rate parameter was named. With `beta1` and `beta2` it's easy, as they are marked as code, but the learning rate wasn't. I think writing the actual parameter name would be helpful.

* Signing SCA
2019-06-20 10:29:20 +02:00
svlandeg b76a43bee4 unicode strings 2019-06-19 13:26:33 +02:00
svlandeg 0b0959b363 UTF8 encoding 2019-06-19 13:11:39 +02:00
svlandeg cc9ae28a52 custom error and warning messages 2019-06-19 12:35:26 +02:00
svlandeg 791327e3c5 Merge remote-tracking branch 'upstream/master' into feature/nel-wiki 2019-06-19 09:44:05 +02:00
svlandeg a31648d28b further code cleanup 2019-06-19 09:15:43 +02:00
svlandeg 478305cd3f small tweaks and documentation 2019-06-18 18:38:09 +02:00
svlandeg 0d177c1146 clean up code, remove old code, move to bin 2019-06-18 13:20:40 +02:00
svlandeg ffae7d3555 sentence encoder only (removing article/mention encoder) 2019-06-18 00:05:47 +02:00
svlandeg 6332af40de baseline performances: oracle KB, random and prior prob 2019-06-17 14:39:40 +02:00
svlandeg 24db1392b9 reprocessing all of wikipedia for training data 2019-06-16 21:14:45 +02:00
Ines Montani 81c12640ab Auto-format [ci skip] 2019-06-16 14:33:20 +02:00