10070 Commits

Author SHA1 Message Date
svlandeg 5318ce88fa 'entity_linker' instead of 'el' 2019-03-22 13:55:10 +01:00
svlandeg ec3e860b44 Merge remote-tracking branch 'upstream/master' into feature/el-framework 2019-03-22 13:47:08 +01:00
Ines Montani c9bd0e5a96 Set version to 2.1.2 2019-03-22 13:44:47 +01:00
svlandeg 12d4caf341 Merge remote-tracking branch 'upstream/master' into feature/el-framework 2019-03-22 13:44:36 +01:00
Matthew Honnibal e65b5bb9a0 Fix tokenizer on Python2.7 (#3460)
spaCy v2.1 switched to the built-in re module, where v2.0 had been using
the third-party regex library. When the tokenizer was deserialized on
Python2.7, the `re.compile()` function was called with expressions that
featured escaped unicode codepoints that were not in Python2.7's unicode
database.

Problems occurred when we had a range between two of these unknown
codepoints, like this:

```
    '[\\uAA77-\\uAA79]'
```

On Python2.7, the unknown codepoints are not unescaped correctly,
resulting in arbitrary out-of-range characters being matched by the
expression.

This problem does not occur if we instead have a range between two
unicode literals, rather than the escape sequences. To fix the bug, we
therefore add a new compat function that unescapes unicode sequences
using the `ast.literal_eval()` function. Care is taken to ensure we
do not also unescape non-unicode sequences.

Closes #3356.

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-22 13:42:47 +01:00
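The unescaping idea described in e65b5bb9a0 can be sketched roughly as follows. This is an illustration only, not the code from that commit: the function name `unescape_unicode_sketch` is hypothetical, the sketch assumes Python 3, and it assumes that only `\uXXXX` and `\UXXXXXXXX` sequences should be converted while every other backslash sequence (such as `\d`) is left alone.

```python
import ast
import re

# Matches a literal backslash followed by u + 4 hex digits or U + 8 hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}")


def unescape_unicode_sketch(pattern):
    """Replace escaped unicode codepoints with the characters they name.

    Hypothetical helper for illustration. Other backslash sequences
    (for example \\d or \\s) are not touched.
    """
    def to_literal(match):
        # literal_eval only parses the string literal; it cannot run code.
        return ast.literal_eval("u'" + match.group(0) + "'")

    return _UNICODE_ESCAPE.sub(to_literal, pattern)


# The range from the commit message, written with escape sequences.
escaped = "\\d+[\\uAA77-\\uAA79]"
unescaped = unescape_unicode_sketch(escaped)

# The character class now holds two literal characters, and \d is unchanged.
char_class = re.compile(unescape_unicode_sketch("[\\uAA77-\\uAA79]"))
```

With the range expressed as literal characters, the compiled class no longer depends on how the regex engine interprets the escape sequences. The sketch does not handle an escaped backslash directly in front of `u` (a literal backslash followed by the letter u), which a production version would need to consider.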
Ines Montani c81923ee30 Update wasabi pin 2019-03-22 13:31:58 +01:00
Ines Montani 188ccd5750 Fix xfail marker 2019-03-22 12:54:14 +01:00
Ines Montani 7dd5e2f564 Update v2-1.md 2019-03-22 12:43:23 +01:00
svlandeg 7cf0bc9a8c delete sandbox folder 2019-03-22 12:25:11 +01:00
svlandeg 5b1cd49222 error msg and unit tests for setting kb_id on span 2019-03-22 12:05:35 +01:00
svlandeg 3c9ac59ea0 Merge branch 'backup_el' of https://github.com/svlandeg/spaCy into backup_el 2019-03-22 11:43:52 +01:00
svlandeg a48241e9a2 use nlp's vocab for stringstore 2019-03-22 11:36:45 +01:00
svlandeg 1ee0e78fd7 select candidate with highest prior probability 2019-03-22 11:36:45 +01:00
svlandeg 7b708ab8a4 name per entity 2019-03-22 11:36:45 +01:00
svlandeg c593607ce2 minimal EL pipe 2019-03-22 11:36:45 +01:00
svlandeg c71123dd0c ensure no candidates are returned for unknown aliases 2019-03-22 11:36:45 +01:00
svlandeg b6c3255a9f Entity class 2019-03-22 11:36:45 +01:00
svlandeg 1289cd6e8f property getters and keep track of KB internally 2019-03-22 11:36:45 +01:00
svlandeg 98ae77a682 unit test on number of candidates generated 2019-03-22 11:36:45 +01:00
svlandeg 9a46c431c3 store entity hash instead of pointer 2019-03-22 11:36:45 +01:00
svlandeg 9819dca80e create candidate object from entry pointer (not fully functional yet) 2019-03-22 11:36:45 +01:00
svlandeg a9074e0886 check the length of entities and probabilities vector + unit test 2019-03-22 11:36:45 +01:00
svlandeg d133ffaff9 correct size, not counting dummy elements in the vector 2019-03-22 11:36:45 +01:00
svlandeg 33f8a0fe2e check and unit test in case prior probs exceed 1 2019-03-22 11:36:45 +01:00
svlandeg b55baaa1dc avoid value 0 in preshmap and helpful user warnings 2019-03-22 11:36:45 +01:00
svlandeg 20a7b7b1c0 raising error when adding alias for unknown entity + unit test 2019-03-22 11:36:45 +01:00
svlandeg 8843f9279c use StringStore 2019-03-22 11:36:45 +01:00
svlandeg 51560bf0ed bugfix adding aliases 2019-03-22 11:36:45 +01:00
svlandeg c4ba942765 get candidates by alias 2019-03-22 11:36:45 +01:00
svlandeg 151b855cc8 adding and retrieving aliases 2019-03-22 11:36:45 +01:00
svlandeg cf34113250 very minimal KB functionality working 2019-03-22 11:36:44 +01:00
svlandeg af281c5466 adding aliases per entity in the KB 2019-03-22 11:36:44 +01:00
svlandeg f77b99c103 fix compile errors 2019-03-22 11:36:44 +01:00
svlandeg 27483f9080 add pyx and separate method to add aliases 2019-03-22 11:36:44 +01:00
svlandeg feb71e15fd hash the entity name 2019-03-22 11:36:44 +01:00
svlandeg 839dafa104 documented some comments and todos 2019-03-22 11:36:44 +01:00
svlandeg 7f37737878 kb snippet, draft by Matt (wip) 2019-03-22 11:36:44 +01:00
svlandeg 735fc2a735 annotate kb_id through ents in doc 2019-03-22 11:36:44 +01:00
svlandeg d849eb2455 adding kb_id as field to token, el as nlp pipeline component 2019-03-22 11:34:46 +01:00
Matthew Honnibal d811c97da1 Fix test that caused pytest to choke on Python3 2019-03-22 10:28:51 +01:00
Matthew Honnibal a2ad9832e5 Add failing test for #3356 2019-03-22 02:42:37 +01:00
svlandeg 4820b43313 use nlp's vocab for stringstore 2019-03-21 23:17:25 +01:00
Matthew Honnibal 7ec64a36fd
Merge pull request #3455 from explosion/bugfix/fix-en-tag-map
💫 Bring English tag_map in line with UD Treebank
2019-03-21 21:19:30 +01:00
svlandeg 6e2433b95e select candidate with highest prior probability 2019-03-21 18:55:01 +01:00
svlandeg 24a0c4a8d4 name per entity 2019-03-21 18:20:57 +01:00
svlandeg d0c763ba44 minimal EL pipe 2019-03-21 17:33:25 +01:00
svlandeg 26afa4800f ensure no candidates are returned for unknown aliases 2019-03-21 15:24:40 +01:00
Matthew Honnibal c66bd61e88 Fix lemmas 2019-03-21 14:22:12 +01:00
Matthew Honnibal 04395ffa49 Bring English tag_map in line with UD Treebank
I wrote a small script to read the UD English training data and check
that our tag map and morph rules were resulting in the best POS map.
This hadn't been done for some time, and there have been various changes
to the UD schema since it was last done. After these changes we should
see much better agreement between our POS assignments and the UD POS
tags.
2019-03-21 13:53:44 +01:00
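The kind of check described in 04395ffa49 might look something like the sketch below. The actual script is not shown in the log, so everything here is an assumption: the function names, the tiny inline sample, and the simplified tag map (a plain fine-grained tag to coarse POS string mapping) are all hypothetical. The sketch relies only on the CoNLL-U column layout, where column 4 is the UPOS tag and column 5 is the language-specific XPOS tag.

```python
from collections import Counter, defaultdict


def best_pos_per_tag(conllu_text):
    """Return the most frequent UPOS tag for each fine-grained (XPOS) tag."""
    counts = defaultdict(Counter)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword token ranges and empty nodes
        upos, xpos = cols[3], cols[4]
        counts[xpos][upos] += 1
    return {tag: c.most_common(1)[0][0] for tag, c in counts.items()}


def find_disagreements(tag_map, best):
    """Map each tag to (current POS, most frequent UD POS) where they differ."""
    return {
        tag: (tag_map.get(tag), pos)
        for tag, pos in best.items()
        if tag_map.get(tag) != pos
    }


# Tiny made-up sample in CoNLL-U layout, for illustration only.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["1", "The", "the", "DET", "DT", "_", "2", "det", "_", "_"],
    ["2", "dogs", "dog", "NOUN", "NNS", "_", "3", "nsubj", "_", "_"],
    ["3", "bark", "bark", "VERB", "VBP", "_", "0", "root", "_", "_"],
])

# Hypothetical, deliberately wrong entry for VBP to show a disagreement.
hypothetical_tag_map = {"DT": "DET", "NNS": "NOUN", "VBP": "AUX"}
disagreements = find_disagreements(hypothetical_tag_map, best_pos_per_tag(SAMPLE))
```

Picking the single most frequent UPOS per fine-grained tag is one simple way to define the "best" map; tags whose UPOS depends on context would need the morph rules mentioned in the commit message, which this sketch does not model.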
svlandeg a5d5a05930 Entity class 2019-03-21 13:32:21 +01:00