Commit Graph

1247 Commits

Author SHA1 Message Date
Ines Montani 6ccdf37574 Exclude user_data when copying doc in displaCy (closes #3882) 2019-06-26 14:37:05 +02:00
Kabir Khan 1e19f34e29 Add optional `id` property to EntityRuler patterns (#3591)
* Adding support for entity_id in EntityRuler pipeline component

* Adding Spacy Contributor aggreement

* Updating EntityRuler to use string.format instead of f strings

* Update Entity Ruler to support an 'id' attribute per pattern that explicitly identifies an entity.

* Fixing tests

* Remove custom extension entity_id and use built in ent_id token attribute.

* Changing entity_id to ent_id for consistent naming

* entity_ids => ent_ids

* Removing kb, cleaning up tests, making util functions private, use rsplit instead of split
2019-06-16 13:29:04 +02:00
Suraj Rajan 46c78d0a41 Dependency tree pattern matcher (#3465)
* Functional dependency tree pattern matcher

* Tests fail due to inconsistent behaviour

* Renamed dependencymatcher and added optimizations
2019-06-16 13:25:32 +02:00
BreakBB d8573ee715 Update error raising for CLI pretrain to fix #3840 (#3843)
* Add check for empty input file to CLI pretrain

* Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key

* Skip empty values for correct pretrain keys and log a counter as warning

* Add tests for CLI pretrain core function make_docs.

* Add a short hint for the `tokens` key to the CLI pretrain docs

* Add success message to CLI pretrain

* Update model loading to fix the tests

* Skip empty values and do not create docs out of it
2019-06-16 13:22:57 +02:00
Ines Montani f35ce09776 Add regression test for #3839 2019-06-12 13:38:30 +02:00
Ines Montani aae9034492 Tidy up [ci skip] 2019-06-12 13:38:23 +02:00
Germán 86eb817b74 Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (#3810) (closes #3803))
* (#3803) Spanish like_num returning false for number-like token

* (#3803) Spanish like_num now returning True for number-like token
2019-06-02 12:22:57 +02:00
BreakBB ed18a6efbd Add check for callable to 'Language.replace_pipe' to fix #3737 (#3741) 2019-05-14 16:59:31 +02:00
Ines Montani 8baff1c7c0
💫 Improve introspection of custom extension attributes (#3729)
* Add custom __dir__ to Underscore (see #3707)

* Make sure custom extension methods keep their docstrings (see #3707)

* Improve tests

* Prepend note on partial to docstring (see #3707)

* Remove print statement

* Handle cases where docstring is None
2019-05-12 00:53:11 +02:00
Ines Montani 505c9e0e19 Add util.filter_spans helper (#3686) 2019-05-08 02:33:40 +02:00
BreakBB 5b8dbe4975 Fix symlink creation to show error message on failure (#3589) (resolves #3307))
* Fix symlink creation to show error message on failure. Update tests to reflect those changes.

* Fix test to succeed on non windows systems.
2019-04-16 11:58:31 +02:00
Ines Montani 4d198a7e92 Ensure match pattern error isn't raised on empty errors (closes #3549) 2019-04-09 12:50:43 +02:00
Ines Montani 145c0b7e88 Tidy up and auto-format 2019-04-09 11:40:19 +02:00
Ines Montani 5f005adf61 Add xfailing test for #3555 2019-04-09 11:07:14 +02:00
Ines Montani 4faf62d515
Merge pull request #3530 from svlandeg/fix/issue_3521
Allow English stopwords with any type of apostrophe
2019-04-03 14:14:03 +02:00
Yves Peirsman 951825532c Improved Dutch language resources and Dutch lemmatization (#3409)
* Improved Dutch language resources and Dutch lemmatization

* Fix conftest

* Update punctuation.py

* Auto-format

* Format and fix tests

* Remove unused test file

* Re-add deleted test

* removed redundant infix regex pattern for ','; note: brackets + simple hyphen remains

* Cleaner lemmatization files
2019-04-03 14:13:26 +02:00
svlandeg 4ff786e113 addressed all comments by Ines 2019-04-03 13:50:33 +02:00
Ines Montani 6a4575a56c Don't make "settings" or "title" required in displaCy data (closes #3531) 2019-04-03 10:13:16 +02:00
svlandeg 85b4319f33 specify encoding in files 2019-04-02 15:05:31 +02:00
svlandeg 673c81bbb4 unicode string for python 2.7 2019-04-02 13:52:07 +02:00
svlandeg eca9cc5417 fixing Issue #3521 by adding all hyphen variants for each stopword 2019-04-02 13:24:59 +02:00
svlandeg e7062cf699 failing test for Issue #3521 2019-04-02 13:15:35 +02:00
svlandeg 1424b12b09 failing test for Issue #3449 2019-04-02 13:06:37 +02:00
Ines Montani c23e234d65 Auto-format 2019-04-01 12:11:27 +02:00
Ines Montani 68900066e0
Merge pull request #3459 from svlandeg/feature/el-framework
Basic framework and APIs for entity linker
2019-03-29 14:02:22 +01:00
Hiromu Hota 914b9ff3d2 Tags are joined with a comma and padded with asterisks (#3491)
<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

Fix a bug in the test of JapaneseTokenizer.
This PR may require @polm's review.

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

Bug fix

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-28 16:17:31 +01:00
Samuel Kane 06a1846379 fix(util): fix decaying function output (#3495)
* fix(util): fix decaying function output

* fix(util): better test and adhere to code standards

* fix(util): correct variable name, pytestify test, update website text
2019-03-28 13:24:47 +01:00
Duygu Altinok 5a7bc6b39d Fix/irreg adverbs extension (#3499)
* extended list of irreg adverbs

* added test to exceptions

* fixed typo
2019-03-28 13:23:33 +01:00
Sofie a4a6bfa4e1
Merge branch 'master' into feature/el-framework 2019-03-26 11:00:02 +01:00
svlandeg 8814b9010d entity as one field instead of both ID and name 2019-03-25 18:10:41 +01:00
Ines Montani 06bf130890 💫 Add better and serializable sentencizer (#3471)
* Add better serializable sentencizer component

* Replace default factory

* Add tests

* Tidy up

* Pass test

* Update docs
2019-03-23 15:45:02 +01:00
Matthew Honnibal d9a07a7f6e
💫 Fix class mismap on parser deserializing (closes #3433) (#3470)
v2.1 introduced a regression when deserializing the parser after
parser.add_label() had been called. The code around the class mapping is
pretty confusing currently, as it was written to accommodate backwards
model compatibility. It needs to be revised when the models are next
retrained.

Closes #3433
2019-03-23 13:46:25 +01:00
Matthew Honnibal 444a3abfe5 Add xfail test for #3433. Improve test for add label. 2019-03-23 12:36:00 +01:00
Ines Montani 6b6e9b638e Fix test for #3468 2019-03-23 11:24:29 +01:00
Ines Montani fbec72b4c3 Slightly modify test for #3468
Check for Token.is_sent_start first (which is serialized/deserialized correctly)
2019-03-23 11:22:44 +01:00
Ines Montani 02d9378d8c Add xfailing test for #3468 2019-03-23 11:19:11 +01:00
svlandeg 9de9900510 adding future import unicode literals to .py files 2019-03-22 16:18:04 +01:00
svlandeg 9751312aff specify unicode strings for python 2.7 2019-03-22 14:15:18 +01:00
svlandeg ec3e860b44 Merge remote-tracking branch 'upstream/master' into feature/el-framework 2019-03-22 13:47:08 +01:00
svlandeg 12d4caf341 Merge remote-tracking branch 'upstream/master' into feature/el-framework 2019-03-22 13:44:36 +01:00
Matthew Honnibal e65b5bb9a0 Fix tokenizer on Python2.7 (#3460)
spaCy v2.1 switched to the built-in re module, where v2.0 had been using
the third-party regex library. When the tokenizer was deserialized on
Python2.7, the `re.compile()` function was called with expressions that
featured escaped unicode codepoints that were not in Python2.7's unicode
database.

Problems occurred when we had a range between two of these unknown
codepoints, like this:

```
    '[\\uAA77-\\uAA79]'
```

On Python2.7, the unknown codepoints are not unescaped correctly,
resulting in arbitrary out-of-range characters being matched by the
expression.

This problem does not occur if we instead have a range between two
unicode literals, rather than the escape sequences. To fix the bug, we
therefore add a new compat function that unescapes unicode sequences
using the `ast.literal_eval()` function. Care is taken to ensure we
do not also escape non-unicode sequences.

Closes #3356.

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-22 13:42:47 +01:00
Ines Montani 188ccd5750 Fix xfail marker 2019-03-22 12:54:14 +01:00
svlandeg 5b1cd49222 error msg and unit tests for setting kb_id on span 2019-03-22 12:05:35 +01:00
svlandeg a48241e9a2 use nlp's vocab for stringstore 2019-03-22 11:36:45 +01:00
svlandeg c71123dd0c ensure no candidates are returned for unknown aliases 2019-03-22 11:36:45 +01:00
svlandeg 98ae77a682 unit test on number of candidates generated 2019-03-22 11:36:45 +01:00
svlandeg a9074e0886 check the length of entities and probabilities vector + unit test 2019-03-22 11:36:45 +01:00
svlandeg d133ffaff9 correct size, not counting dummy elements in the vector 2019-03-22 11:36:45 +01:00
svlandeg 33f8a0fe2e check and unit test in case prior probs exceed 1 2019-03-22 11:36:45 +01:00
svlandeg 20a7b7b1c0 raising error when adding alias for unknown entity + unit test 2019-03-22 11:36:45 +01:00