Commit Graph

6373 Commits

Author SHA1 Message Date
veer-bains 874bd8c8dd Fixed syntax error in lang/ko when using python 2 (#4082) (closes #4068)
* fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py

* fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py

* Update __init__.py

* Create veer-bains.md

* Update __init__.py

fixed syntax errors in variable datatype assignment when calling spacy.blank("ko") with python 2.7
2019-08-05 10:19:32 +02:00
Ines Montani 87ddbdc33e Fix handling of kwargs in Language.evaluate
Makes it consistent with other methods
2019-08-04 13:44:21 +02:00
Muhammad Irfan d1d30b0442 added missing punctuation following conventions. (#4066) 2019-08-04 13:41:18 +02:00
Anastassia 33b14724a5 Update gold corpus code to properly ingest a directory of jsonl… (#4067)
* Update gold corpus code to properly ingest a directory of jsonlines files

In response to: https://github.com/explosion/spaCy/issues/3975

* Update spacy/gold.pyx

Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-02 09:58:51 +02:00
Matthew Honnibal 944a66c326 Add span.tensor and token.tensor attributes 2019-08-01 18:30:50 +02:00
Matthew Honnibal d3071ecdbc Set version to v2.1.7 2019-08-01 18:09:19 +02:00
Matthew Honnibal 97c51ef93b Set version to v2.1.7.dev1 2019-08-01 17:29:25 +02:00
Matthew Honnibal 4632c597e7 Fix Pipe base class 2019-08-01 17:29:01 +02:00
Ines Montani 8718ca8b1f
Fix init_model if there's no vocab (closes #4048) (#4049) 2019-08-01 17:26:09 +02:00
adrianeboyd 925a852bb6 Improve NER per type scoring (#4052)
* Improve NER per type scoring

* include all gold labels in per type scoring, not only when recall > 0
* improve efficiency of per type scoring

* Create Scorer tests, initially with NER tests

* move regression test #3968 (per type NER scoring) to Scorer tests

* add new test for per type NER scoring with imperfect P/R/F and per
type P/R/F including a case where R == 0.0
2019-08-01 17:15:36 +02:00
Sofie Van Landeghem f7d950de6d ensure the lang of vocab and nlp stay consistent (#4057)
* ensure the language of vocab and nlp stay consistent across serialization

* equality with =
2019-08-01 17:13:01 +02:00
Sofie Van Landeghem 7de3b129ab Resolve edge case when calling textcat.predict with empty doc (#4035)
* resolve edge case where no doc has tokens when calling textcat.predict

* more explicit value test
2019-07-30 14:58:01 +02:00
Matthew Honnibal 89c92c65fb Update version 2019-07-28 17:56:38 +02:00
Matthew Honnibal 06eb428ed1 Make pipe base class a bit less presumptuous 2019-07-28 17:56:11 +02:00
Matthew Honnibal 16b5144095 Don't raise NotImplemented in Pipe.update 2019-07-28 17:54:11 +02:00
Ines Montani fc69da0acb
💫 Support simple training format in nlp.evaluate and add tests (#4033)
* Support simple training format in nlp.evaluate and add tests

* Update docs [ci skip]
2019-07-27 17:30:18 +02:00
Ines Montani a3723f439c Fix formatting [ci skip] 2019-07-27 16:35:42 +02:00
Ines Montani d5bce35fb1 Fix bug in Span.similarity when called via hook 2019-07-27 15:33:27 +02:00
Ines Montani 109b5e1798 Fix bug in Token.similarity when called via hook 2019-07-27 15:26:01 +02:00
Ines Montani e000b5ed82 Also support "requirements" in model.json 2019-07-27 13:34:57 +02:00
Ines Montani 307ffe472d
Support custom language factory setting in meta.json (#4031) 2019-07-27 13:17:43 +02:00
Bae Yong-Ju 05fbf5d976 Fix error when Korean text contains regexp special characters. (#4022) 2019-07-25 17:53:33 +02:00
Matthew Honnibal 73e095923f 💫 Improve error message when model.from_bytes() dies (#4014)
* Improve error message when model.from_bytes() dies

When Thinc's model.from_bytes() is called with a mismatched model, often
we get a particularly ungraceful error,

e.g. "AttributeError: FunctionLayer has no attribute G"

This is because we're trying to load the parameters for something like
a LayerNorm layer, and the model architecture has some other layer there
instead. This is obviously terrible, especially since the error *type*
is wrong.

I've changed it to raise a ValueError. The error message is still
probably a bit terse, but it's hard to be sure exactly what's gone
wrong.

* Update spacy/pipeline/pipes.pyx

* Update spacy/pipeline/pipes.pyx

* Update spacy/pipeline/pipes.pyx

* Update spacy/syntax/nn_parser.pyx

* Update spacy/syntax/nn_parser.pyx

* Update spacy/pipeline/pipes.pyx

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>

* Update spacy/pipeline/pipes.pyx

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>


Co-authored-by: Ines Montani <ines@ines.io>
2019-07-24 11:27:34 +02:00
Ines Montani 87fcf3141c
Merge pull request #4003 from svlandeg/feature/nel-fixes
API changes for Entity linking functionality
2019-07-23 23:17:07 +02:00
Paul O'Leary McCann c8949ce88a Remove old comment (#4012)
Norwegian used to borrow from French but that doesn't appear to have
been true for a while now, so the comment that was here is no longer
relevant.
2019-07-23 23:10:06 +02:00
Sofie Van Landeghem ba02957c80 Fix dependency copy for as_doc (#3969)
* failing unit test for issue 3962

* attempt to fix Issue #3962

* create artificial unit test example

* using length instead of self.length

* sp

* reformat with black

* find better ancestor within span and use generic 'dep'

* attach to span.root if there is no appropriate ancestor

* comment span text

* clean up ancestor code

* reconstruct dep tree to keep same number of sentences
2019-07-23 18:28:54 +02:00
svlandeg 4e7ec1ed31 return fix 2019-07-23 14:23:58 +02:00
svlandeg 400ff342cf replace assert's with custom error messages 2019-07-23 11:52:48 +02:00
svlandeg 20389e4553 format and bugfix 2019-07-22 15:08:17 +02:00
svlandeg b1911f7105 Errors.E146 for IO error when FP is null 2019-07-22 14:56:13 +02:00
svlandeg 5d544f89ba Errors.E145 for IO errors when reading KB 2019-07-22 14:36:07 +02:00
Ines Montani a32b033b8c Add regression test for #4002
Test that the PhraseMatcher can match on overwritten NORM attributes.
2019-07-22 14:18:24 +02:00
svlandeg ad65171837 Merge remote-tracking branch 'upstream/master' into feature/nel-fixes 2019-07-22 13:41:28 +02:00
svlandeg 76184374e2 test corner cases 2019-07-22 13:39:32 +02:00
svlandeg 9f8c1e71a2 fix for Issue #4000 2019-07-22 13:34:12 +02:00
svlandeg dae8a21282 rename entity frequency 2019-07-19 17:40:28 +02:00
svlandeg 41fb5204ba output tensors as part of predict 2019-07-19 14:47:36 +02:00
svlandeg 21176517a7 have gold.links correspond exactly to doc.ents 2019-07-19 12:36:15 +02:00
BreakBB 3e370cf2ba Add 'Prof.' to Englisch tokenizer_exceptions 2019-07-19 10:00:45 +02:00
svlandeg e1213eaf6a use original gold object in get_loss function 2019-07-18 13:35:10 +02:00
svlandeg ec55d2fccd filter training data beforehand (+black formatting) 2019-07-18 10:22:24 +02:00
Falak Asad ff1e73e35c Bugfix/issue 3968 (#3982)
* Fix for issue-3968

* Added contributor agreement

* Made suggested changes
2019-07-18 00:20:32 +02:00
svlandeg d833d4c358 fixes in kb and gold 2019-07-17 17:18:26 +02:00
Ines Montani 73565c6d9d Rename function arguments 2019-07-17 14:29:52 +02:00
Matthew Honnibal 394e4d8058 Add docstring for spacy.gold.align 2019-07-17 13:59:17 +02:00
Ines Montani 073013f129 Auto-format [ci skip] 2019-07-17 12:34:13 +02:00
svlandeg 4086c6ff60 get vector functionality + unit test 2019-07-17 12:17:02 +02:00
Ines Montani 62ff128888 Add regression test for #3951 2019-07-16 14:00:00 +02:00
Ines Montani 7f551050b1 Add regression test for #3972 2019-07-16 13:07:35 +02:00
svlandeg a63d15a142 code cleanup 2019-07-15 17:36:43 +02:00
svlandeg cdc589d344 small fix 2019-07-15 12:04:45 +02:00
svlandeg 60f299374f set default context width 2019-07-15 12:03:09 +02:00
svlandeg 6e809e9b8b proper error for missing cfg arguments 2019-07-15 11:42:50 +02:00
svlandeg 6026958957 tokenizer doc fix 2019-07-15 11:19:34 +02:00
Ines Montani c0e29f7029
Merge pull request #3957 from sorenlind/danish-tokenizer-slash
Make Danish tokenizer split on forward slash
2019-07-12 18:19:22 +02:00
Matthew Honnibal ef666656b3 Fix attrs alignment 2019-07-12 17:59:47 +02:00
Matthew Honnibal c345c042b0 Fix symbol alignment 2019-07-12 17:48:38 +02:00
Ines Montani 7281026879 Increment version [ci skip] 2019-07-12 17:40:00 +02:00
Søren Lind Kristiansen 26aee70d95 Make Danish tokenizer split on forward slash 2019-07-12 15:20:42 +02:00
Matthew Honnibal 3bc4d618f9 Set version to v2.1.5 2019-07-12 13:26:12 +02:00
Sofie Van Landeghem ed774cb953 Fixing ngram bug (#3953)
* minimal failing example for Issue #3661

* referenced Issue #3661 instead of Issue #3611

* cleanup
2019-07-12 10:01:35 +02:00
Matthew Honnibal 09dc01a426 Fix #3853, and add warning 2019-07-11 14:46:47 +02:00
Matthew Honnibal 7369949d2e Add warning for #3853 2019-07-11 14:46:47 +02:00
Ines Montani 673c864a06
Fix doc.count_by functionality (#3950)
Fix doc.count_by functionality
2019-07-11 13:44:00 +02:00
Ines Montani 2426f4d44c
Fix default punctuation rules for splitting Hindi text (#3948)
Fix default punctuation rules for splitting Hindi text

Co-authored-by: yash <patadiayash@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2019-07-11 13:36:28 +02:00
svlandeg 349107daa3 cleanup 2019-07-11 13:09:22 +02:00
svlandeg 0f0f07318a counter instead of preshcounter 2019-07-11 13:05:53 +02:00
Matthew Honnibal b40b4c2c31
💫 Fix issue #3839: Incorrect entity IDs from Matcher with operators (#3949)
* Add regression test for issue #3541

* Add comment on bugfix

* Remove incorrect test

* Un-xfail test
2019-07-11 12:55:11 +02:00
Matthew Honnibal e19f4ee719 Add warning message re Issue #3853 2019-07-11 12:50:38 +02:00
Ines Montani 197cfd7ebc Merge branch 'master' into pr/3948 2019-07-11 12:18:31 +02:00
Ines Montani d166756607 Fix test 2019-07-11 12:16:43 +02:00
Ines Montani 0b8406a05c Tidy up and auto-format 2019-07-11 12:02:25 +02:00
yash 6751af3e78 Merge branch 'master' of https://github.com/yash1994/spaCy 2019-07-11 15:26:57 +05:30
yash ae2d52e323 Add default encoding utf-8 for test file 2019-07-11 15:26:27 +05:30
Ines Montani 33ca0a036a Merge branch 'master' into pr/3948 2019-07-11 11:55:54 +02:00
Matthew Honnibal 0491a8e7c8 Reformat 2019-07-11 11:49:36 +02:00
Matthew Honnibal bd3c3f342b Fix _serialize 2019-07-11 11:48:55 +02:00
yash 815f8d13dd Fix default punctuation rules for hindi text (#3625 explosion) 2019-07-11 15:00:51 +05:30
yash d5311b3c42 Add test file for issue (#3625) and spacy contributor agreement 2019-07-11 14:53:14 +05:30
svlandeg e080412385 tracked the bug down to PreshCounter.inc - still unclear what goes wrong 2019-07-11 01:53:06 +02:00
svlandeg a89fecce97 failing unit test for issue #3869 2019-07-11 00:43:55 +02:00
Matthew Honnibal a388888074 Merge branch 'master' of https://github.com/explosion/spaCy 2019-07-10 22:54:17 +02:00
Matthew Honnibal c6cb782758 Set version to 2.1.5.dev0 2019-07-10 22:54:09 +02:00
Sofie Van Landeghem c4c21cb428 more friendly textcat errors (#3946)
* more friendly textcat errors with require_model and require_labels

* update thinc version with recent bugfix
2019-07-10 19:39:38 +02:00
Matthew Honnibal b94c5443d9 Rename Binder->DocBox, and improve it. 2019-07-10 19:37:20 +02:00
Matthew Honnibal 3d18600c05 Return True from doc.is_... when no ambiguity
* Make doc.is_sentenced return True if len(doc) < 2.

* Make doc.is_nered return True if len(doc) == 0, for consistency.

Closes #3934
2019-07-10 19:21:42 +02:00
Matthew Honnibal 465456edb9 Un-xfail test #3880 2019-07-10 14:01:17 +02:00
Matthew Honnibal 87f7ec34d5 Add test for #3880 2019-07-10 13:53:55 +02:00
Ines Montani 4e04080b76 Only compare sorted patterns in test
Try to work around flaky tests on Python 3.5
2019-07-10 13:00:52 +02:00
Ines Montani 82045aac8a Merge regression tests 2019-07-10 12:49:18 +02:00
Ines Montani 40cd03fc35 Improve EntityRuler serialization 2019-07-10 12:25:45 +02:00
Ines Montani 570ab1f481 Fix handling of old entity ruler files
Expected an `entity_ruler.jsonl` file in the top-level model directory, so the path passed to from_disk by default (model path plus componentn name), but with the suffix ".jsonl".
2019-07-10 12:14:12 +02:00
Ines Montani 874d914a44 Tidy up test 2019-07-10 12:13:23 +02:00
Ines Montani ea2050079b Auto-format 2019-07-10 12:03:05 +02:00
Ines Montani 6ba5ddbd5f
Merge pull request #3864 from svlandeg/feature/nel-wiki
Entity linking using Wikipedia & Wikidata
2019-07-10 11:25:41 +02:00
Ines Montani 8721849423 Update Scorer.ents_per_type 2019-07-10 11:19:28 +02:00
Björn Böing 205c73a589 Update tokenizer and doc init example (#3939)
* Fix Doc.to_json hyperlink

* Update tokenizer and doc init examples

* Change "matchin rules" to "punctuation rules"

* Auto-format
2019-07-10 10:16:48 +02:00
cedar101 58f06e6180 Korean support (#3901)
* start lang/ko

* add test codes

* using natto-py

* add test_ko_tokenizer_full_tags()

* spaCy contributor agreement

* external dependency for ko

* collections.namedtuple for python version < 3.5

* case fix

* tuple unpacking

* add jongseong(final consonant)

* apply mecab option

* Remove Pipfile for now


Co-authored-by: Ines Montani <ines@ines.io>
2019-07-09 22:23:16 +02:00
Ines Montani f2ea3e3ea2
Merge branch 'master' into feature/nel-wiki 2019-07-09 21:57:47 +02:00
Ines Montani 547464609d Remove merge_subtokens from parser postprocessing for now 2019-07-09 21:50:30 +02:00
Björn Böing 04982ccc40 Update pretrain to prevent unintended overwriting of weight fil… (#3902)
* Update pretrain to prevent unintended overwriting of weight files for #3859

* Add '--epoch-start' to pretrain docs

* Add mising pretrain arguments to bash example

* Update doc tag for v2.1.5
2019-07-09 21:48:30 +02:00
Alejandro Alcalde 6d577f0b92 Evaluation of NER model per entity type, closes #3490 (#3911)
* Evaluation of NER model per entity type, closes ##3490

Now each ent score is tracked individually in order to have its own Precision, Recall and F1 Score

* Keep track of each entity individually using dicts

* Improving how to compute the scores for each entity

* Fixed bug computing scores for ents

* Formatting with black

* Added key ents_per_type to the scores function

The key `ents_per_type` contains the metrics Precision, Recall and F1-Score for each entity individually
2019-07-09 20:54:59 +02:00
Joshua Smith 2eb925bd05 Added an argument to `EntityRuler` constructor to pass attrs to… (#3919)
* Perserve flags in EntityRuler

The EntityRuler (explosion/spaCy#3526) does not preserve
overwrite flags (or `ent_id_sep`) when serialized.  This
commit adds support for serialization/deserialization preserving
overwrite and ent_id_sep flags.

* add signed contributor agreement

* flake8 cleanup

mostly blank line issues.

* mark test from the issue as needing a model

The test from the issue needs some language model for serialization
but the test wasn't originally marked correctly.

* Adds `phrase_matcher_attr` to allow args to PhraseMatcher

This is an added arg to pass to the `PhraseMatcher`. For example,
this allows creation of a case insensitive phrase matcher when the
`EntityRuler` is created.  References explosion/spaCy#3822

* remove unneeded model loading

The model didn't need to be loaded, and I replaced it with
a change that doesn't require it (using existings fixtures)

* updated docstring for new argument

* updated docs to reflect new argument to the EntityRuler constructor

* change tempdir handling to be compatible with python 2.7

* return conflicted code to entityruler

Some stuff got cut out because of merge conflicts, this
returns that code for the phrase_matcher_attr.

* fixed typo in the code added back after conflicts

* flake8 compliance

When I deconflicted the branch there were some flake8 issues
introduced. This resolves the spacing problems.

* test changes:  attempts to fix flaky test in python3.5

These tests seem to be alittle flaky in 3.5 so I changed the check to avoid
the comparisons that seem to be fail sometimes.
2019-07-09 20:09:17 +02:00
Joshua Smith e8420ab2b7 Added support for serializing overwrite and ent_id_sep (#3918)
* Perserve flags in EntityRuler

The EntityRuler (explosion/spaCy#3526) does not preserve
overwrite flags (or `ent_id_sep`) when serialized.  This
commit adds support for serialization/deserialization preserving
overwrite and ent_id_sep flags.

* add signed contributor agreement

* flake8 cleanup

mostly blank line issues.

* mark test from the issue as needing a model

The test from the issue needs some language model for serialization
but the test wasn't originally marked correctly.

* remove unneeded model loading

The model didn't need to be loaded, and I replaced it with
a change that doesn't require it (using existings fixtures)

* change tempdir handling to be compatible with python 2.7

* Adds code to handle item saved before this change.

This code chanes how the save files are handled and how the bytes
are stored as well.  This code adds check to dispatch correctly
if it encounters bytes or files saved in the old format (and tests
for those cases).

* use util function for tempdir management

Updated after PR comments: this code now uses the make_tempdir function from util
instead of doing it by hand.
2019-07-08 17:28:28 +02:00
Knut O. Hellan a54f0cfc2b Norwegian tweaks (#3894)
* Norwegian fix

Add support for alternative past tense verb form (vaska).

* Norwegian months

Add all Norwegian months to tokenizer excpetions.

* More Norwegian abbreviations

Add more Norwegian abbreviations to tokenizer_exceptions.

* Contributor agreement khellan

Add signed contributor agreement for khellan (Knut O. Hellan).
2019-07-08 10:28:47 +02:00
Rokas Ramanauskas 61ce126d4c Lithuanian language support (#3895)
* initial LT lang support

* Added more stopwords. Started setting up some basic test environment (not complete)

* Initial morph rules for LT lang

* Closes #1 Adds tokenizer exceptions for Lithuanian

* Closes #5 Punctuation rules. Closes #6 Lexical Attributes

* test: add native examples to basic tests

* feat: add tag map for lt lang

* fix: remove undefined tag attribute 'Definite'

* feat: add lemmatizer for lt lang

* refactor: add new instances to lt lang morph rules; use tags from tag map

* refactor: add morph rules to lt lang defaults

* refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup

* refactor: add capitalized words to lt lang lemmatizer

* refactor: add more num words to lt lang lex attrs

* refactor: update lt lang stop word set

* refactor: add new instances to lt lang tokenizer exceptions

* refactor: remove comments form lt lang init file

* refactor: use function instead of lambda in lt lex lang getter

* refactor: remove conversion to dict in lt init when dict is already provided

* chore: rename lt 'test_basic' to 'test_text'

* feat: add more lt text tests

* feat: add lemmatizer tests

* refactor: remove unused imports, add newline to end of file

* chore: add contributor agreement

* chore: change 'en' to 'lt' in lt example description

* fix: add missing encoding info

* style: add newline to end of file

* refactor: use python2 compatible syntax

* style: reformat code using black
2019-07-08 10:25:22 +02:00
svlandeg 0ea52c86b8 remove redundancy 2019-07-03 15:02:10 +02:00
svlandeg 668b17ea4a deuglify kb deserializer 2019-07-03 15:00:42 +02:00
svlandeg 8840d4b1b3 fix for context encoder optimizer 2019-07-03 13:35:36 +02:00
svlandeg 2d2dea9924 experiment with adding NER types to the feature vector 2019-06-29 14:52:36 +02:00
svlandeg c664f58246 adding prior probability as feature in the model 2019-06-28 16:22:58 +02:00
svlandeg 1c80b85241 fix tests 2019-06-28 08:59:23 +02:00
svlandeg 68a0662019 context encoder with Tok2Vec + linking model instead of cosine 2019-06-28 08:29:31 +02:00
Ines Montani 4f1dae1c6b Update languages and examples (see #1107) 2019-06-26 16:19:17 +02:00
svlandeg dbc53b9870 rename to KBEntryC 2019-06-26 15:55:26 +02:00
Ines Montani 37f744ca00 Auto-format [ci skip] 2019-06-26 14:48:09 +02:00
Ines Montani 6ccdf37574 Exclude user_data when copying doc in displaCy (closes #3882) 2019-06-26 14:37:05 +02:00
svlandeg 1de61f68d6 improve speed of prediction loop 2019-06-26 13:53:10 +02:00
svlandeg bee23cd8af try Tok2Vec instead of SpacyVectors 2019-06-25 16:09:22 +02:00
svlandeg 8608685543 ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
svlandeg 58a5b40ef6 clean up duplicate code 2019-06-24 15:19:58 +02:00
svlandeg ddc73b11a9 fix unicode literals 2019-06-24 12:58:18 +02:00
svlandeg f4af47ce4a Merge branch 'feature/nel-wiki' of https://github.com/svlandeg/spaCy into feature/nel-wiki 2019-06-24 10:57:07 +02:00
svlandeg b58bace84b small fixes 2019-06-24 10:55:04 +02:00
Ines Montani c833d9b314 Add "v.s." to English tokenizer exceptions (see #3868) 2019-06-20 17:48:45 +02:00
Ines Montani ae2c208735 Auto-format [ci skip] 2019-06-20 10:36:38 +02:00
Ines Montani 872121955c Update error code 2019-06-20 10:35:51 +02:00
Ines Montani e1be80e3ec Merge branch 'master' into pr/3864 2019-06-20 10:35:37 +02:00
Björn Böing ebf5a04d6c Update pretrain docs and add unsupported loss_func error (#3860)
* Add error to `get_vectors_loss` for unsupported loss function of `pretrain`

* Add missing "--loss-func" argument to pretrain docs. Update pretrain plac annotations to match docs.

* Add missing quotation marks
2019-06-20 10:30:44 +02:00
svlandeg b76a43bee4 unicode strings 2019-06-19 13:26:33 +02:00
svlandeg 0b0959b363 UTF8 encoding 2019-06-19 13:11:39 +02:00
svlandeg cc9ae28a52 custom error and warning messages 2019-06-19 12:35:26 +02:00
svlandeg 791327e3c5 Merge remote-tracking branch 'upstream/master' into feature/nel-wiki 2019-06-19 09:44:05 +02:00
svlandeg a31648d28b further code cleanup 2019-06-19 09:15:43 +02:00
svlandeg 478305cd3f small tweaks and documentation 2019-06-18 18:38:09 +02:00
svlandeg 0d177c1146 clean up code, remove old code, move to bin 2019-06-18 13:20:40 +02:00
svlandeg ffae7d3555 sentence encoder only (removing article/mention encoder) 2019-06-18 00:05:47 +02:00
Kabir Khan 1e19f34e29 Add optional `id` property to EntityRuler patterns (#3591)
* Adding support for entity_id in EntityRuler pipeline component

* Adding Spacy Contributor aggreement

* Updating EntityRuler to use string.format instead of f strings

* Update Entity Ruler to support an 'id' attribute per pattern that explicitly identifies an entity.

* Fixing tests

* Remove custom extension entity_id and use built in ent_id token attribute.

* Changing entity_id to ent_id for consistent naming

* entity_ids => ent_ids

* Removing kb, cleaning up tests, making util functions private, use rsplit instead of split
2019-06-16 13:29:04 +02:00
Suraj Rajan 46c78d0a41 Dependency tree pattern matcher (#3465)
* Functional dependency tree pattern matcher

* Tests fail due to inconsistent behaviour

* Renamed dependencymatcher and added optimizations
2019-06-16 13:25:32 +02:00
BreakBB d8573ee715 Update error raising for CLI pretrain to fix #3840 (#3843)
* Add check for empty input file to CLI pretrain

* Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key

* Skip empty values for correct pretrain keys and log a counter as warning

* Add tests for CLI pretrain core function make_docs.

* Add a short hint for the `tokens` key to the CLI pretrain docs

* Add success message to CLI pretrain

* Update model loading to fix the tests

* Skip empty values and do not create docs out of it
2019-06-16 13:22:57 +02:00
svlandeg b312f2d0e7 redo training data to be independent of KB and entity-level instead of doc-level 2019-06-14 15:55:26 +02:00
Azagh3l 5accfbb938 Update exemples.py (#3838)
Added missing hyphen and accent.
2019-06-14 09:31:05 +02:00
svlandeg 78dd3e11da write entity linking pipe to file and keep vocab consistent between kb and nlp 2019-06-13 16:25:39 +02:00
svlandeg b12001f368 small fixes 2019-06-12 22:05:53 +02:00
Ines Montani f35ce09776 Add regression test for #3839 2019-06-12 13:38:30 +02:00
Ines Montani aae9034492 Tidy up [ci skip] 2019-06-12 13:38:23 +02:00
svlandeg 6521cfa132 speeding up training 2019-06-12 13:37:05 +02:00
Motoki Wu 9c064e6ad9 Add resume logic to spacy pretrain (#3652)
* Added ability to resume training

* Add to readmee

* Remove duplicate entry
2019-06-12 13:29:23 +02:00
svlandeg fe1ed432ef eval on dev set, varying combo's of prior and context scores 2019-06-11 11:40:58 +02:00
Azagh3l eb3e4263ee Update lex_attrs.py (#3835)
Corrected typos, added french (from France) versions of some numbers.
2019-06-11 10:59:16 +02:00
svlandeg 83dc7b46fd first tests with EL pipe 2019-06-10 21:25:26 +02:00
Matthew Honnibal 7f71cf0b02 Merge branch 'master' of https://github.com/explosion/spaCy 2019-06-07 20:41:00 +02:00
Matthew Honnibal a931d72459 Add merge_subtokens as parser post-process. Re #3830 2019-06-07 20:40:41 +02:00
svlandeg 7de1ee69b8 training loop in proper pipe format 2019-06-07 15:55:10 +02:00
svlandeg 0486ccabfd introduce goldparse.links 2019-06-07 13:54:45 +02:00
svlandeg a5c061f506 storing NEL training data in GoldParse objects 2019-06-07 12:58:42 +02:00
svlandeg 61f0e2af65 code cleanup 2019-06-06 20:22:14 +02:00
svlandeg d8b435ceff pretraining description vectors and storing them in the KB 2019-06-06 19:51:27 +02:00
svlandeg 5c723c32c3 entity vectors in the KB + serialization of them 2019-06-05 18:29:18 +02:00
svlandeg 9abbd0899f separate entity encoder to get 64D descriptions 2019-06-05 00:09:46 +02:00
svlandeg fb37cdb2d3 implementing el pipe in pipes.pyx (not tested yet) 2019-06-03 21:32:54 +02:00
intrafind 2bba2a3536 Fix for #3811 (#3815)
Corrected type of seed parameter.
2019-06-03 18:32:47 +02:00
svlandeg d83a1e3052 Merge branch 'master' into feature/nel-wiki 2019-06-03 09:35:10 +02:00
Germán 86eb817b74 Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (#3810) (closes #3803))
* (#3803) Spanish like_num returning false for number-like token

* (#3803) Spanish like_num now returning True for number-like token
2019-06-02 12:22:57 +02:00
Ines Montani 09e78b52cf Improve E024 text for incorrect GoldParse (closes #3558) 2019-06-01 14:37:27 +02:00
Ramanan Balakrishnan 26c37c5a4d fix all references to BILUO annotation format (#3797) 2019-05-31 12:19:19 +02:00
Ines Montani a7fd42d937 Make jsonschema dependency optional (#3784) 2019-05-30 14:34:58 +02:00
Ujwal Narayan ed7be3f64c Update norm_exceptions.py (#3778)
* Update norm_exceptions.py

Extended the Currency set to include Franc, Indian Rupee, Bangladeshi Taka, Korean Won, Mexican Dollar, and Egyptian Pound

* Fix formatting [ci skip]
2019-05-27 11:52:52 +02:00
estr4ng7d 604acb6ace Marathi Language Support (#3767)
* Adding Marathi language details and folder to it

* Adding few changes and running tests

* Adding few changes and running tests

* Update __init__.py

mh -> mr

* Rename spacy/lang/mh/__init__.py to spacy/lang/mr/__init__.py

* mh -> mr
2019-05-24 14:29:42 +02:00
Ines Montani 7634812172 Document Language.evaluate 2019-05-24 14:06:36 +02:00
Ines Montani 45e6855550 Update Language.update docs 2019-05-24 14:06:26 +02:00
Ines Montani b78a8dc1d2 Update Scorer and add API docs 2019-05-24 14:06:04 +02:00
Ujwal Narayan 4d550a3055 Enhancing Kannada language Resources (#3755)
* Updated stop_words.py

Added more stopwords

* Create ujwal-narayan.md

Enhancing Kannada language resources
2019-05-20 12:56:10 +02:00
svlandeg dd691d0053 debugging 2019-05-17 17:44:11 +02:00
BreakBB ed18a6efbd Add check for callable to 'Language.replace_pipe' to fix #3737 (#3741) 2019-05-14 16:59:31 +02:00
Ines Montani 8baff1c7c0
💫 Improve introspection of custom extension attributes (#3729)
* Add custom __dir__ to Underscore (see #3707)

* Make sure custom extension methods keep their docstrings (see #3707)

* Improve tests

* Prepend note on partial to docstring (see #3707)

* Remove print statement

* Handle cases where docstring is None
2019-05-12 00:53:11 +02:00
Matthew Honnibal 3aceeeaaeb Set version to v2.1.4 2019-05-11 22:57:53 +02:00
Ines Montani aea1c93a05 Replace cytoolz.partition_all with util.minibatch 2019-05-11 21:12:09 +02:00
Ines Montani 0bf6441863 Fix .iob converter (closes #3620) 2019-05-11 19:15:26 +02:00
Matthew Honnibal a5159ddcf5 Set version to v2.1.4.dev1 2019-05-11 19:03:51 +02:00
Ines Montani 6b3a79ac96 Call rmtree and copytree with strings (closes #3713) 2019-05-11 15:48:35 +02:00
devforfu 21af12eb53 Make "text" key in JSONL format optional when "tokens" key is provided (#3721)
* Fix issue with forcing text key when it is not required

* Extending the docs to reflect the new behavior
2019-05-11 15:41:29 +02:00
Luca Dorigo 82d034f976 Update glossary.py to match information found in documentation (#3704) (closes ##3679)
* Update glossary.py to match information found in documentation

I used regexes to add any dependency tag that was in the documentation but not in the glossary. Solves #3679 👍

* Adds forgotten colon
2019-05-10 14:23:20 +02:00
Wannaphong Phatthiyaphaibun 5a14a13f64 fix thai bug (#3693)
fix tokenize for pythainlp
2019-05-10 14:21:34 +02:00
Ines Montani 505c9e0e19 Add util.filter_spans helper (#3686) 2019-05-08 02:33:40 +02:00
F0rge1cE dd1e6b0bc6 Fix offset bug in loading pre-trained word2vec. (#3689)
* Fix offset bug in loading pre-trained word2vec.

* add contributor agreement
2019-05-06 23:00:38 +02:00
Ines Montani 78cb807a9a Auto-format [ci skip] 2019-05-06 16:58:29 +02:00
Brad Jascob 955b95cb8b Fix inconsistant lemmatizer issue #3484 (#3646)
* Fix inconsistant lemmatizer issue #3484

* Remove test case
2019-05-04 18:16:03 +02:00
svlandeg 1ae41daaa9 allow small rounding errors 2019-05-01 23:05:40 +02:00
Dobita21 f95ecedd83 Add Thai lex_attrs (#3655)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults

* add acronym and some norm exception words

* add lex_attrs

* Add lexical attribute getters into the language defaults

* fix LEX_ATTRS


Co-authored-by: Donut <dobita21@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2019-05-01 12:03:14 +02:00
BreakBB 8952004dfc Update French example sents and add two German stop words (#3662)
* Update french example sentences

* Add 'anderem' and 'ihren' to German stop words
2019-05-01 12:01:35 +02:00
svlandeg 60b54ae8ce bulk entity writing and experiment with regex wikidata reader to speed up processing 2019-05-01 00:00:38 +02:00
svlandeg 19e8f339cb deduce entity freq from WP corpus and serialize vocab in WP test 2019-04-29 17:37:29 +02:00
svlandeg 387263d618 simplify chains 2019-04-29 13:58:07 +02:00
svlandeg 54d0cea062 unit test for KB serialization 2019-04-24 23:52:34 +02:00
svlandeg 3e0cb69065 KB aliases to and from file 2019-04-24 20:24:24 +02:00
svlandeg ad6c5e581c writing and reading number of entries to/from header 2019-04-24 15:31:44 +02:00
svlandeg 6e3223f234 bulk loading in proper order of entity indices 2019-04-24 11:26:38 +02:00
svlandeg 694fea597a dumping all entryC entries + (inefficient) reading back in 2019-04-23 18:36:50 +02:00
svlandeg 8e70a564f1 custom reader and writer for _EntryC fields (first stab at it - not complete) 2019-04-23 16:33:40 +02:00