Commit Graph

5937 Commits

Author SHA1 Message Date
Matthew Honnibal daa8c3787a Add eval_beam_widths argument to spacy train 2019-03-16 15:02:39 +01:00
Ines Montani 2eecd756fa Update package name 2019-03-16 14:43:53 +01:00
Ines Montani f55a52a2dd Set version to v2.1.0.dev0 2019-03-16 13:47:03 +01:00
Ryan Ford 00842d7f1b Merging conversion scripts for conll formats (#3405)
* merging conllu/conll and conllubio scripts

* tabs to spaces

* removing conllubio2json from converters/__init__.py

* Move not-really-CLI tests to misc

* Add converter test using no-ud data

* Fix test I broke

* removing include_biluo parameter

* fixing read_conllx

* remove include_biluo from convert.py
2019-03-15 18:14:46 +01:00
Ines Montani bec8db91e6 Add actual deprecation warning for n_threads (resolves #3410) 2019-03-15 16:38:44 +01:00
Ines Montani cb5dbfa63a Tidy up references to n_threads and fix default 2019-03-15 16:24:26 +01:00
Ines Montani 852e1f105c Tidy up docstrings 2019-03-15 16:23:17 +01:00
Matthew Honnibal b13b2aeb54 Use hash_state in beam 2019-03-15 15:22:58 +01:00
Matthew Honnibal 693c8934e8 Normalize over all actions in parser, not just valid ones 2019-03-15 15:22:16 +01:00
Matthew Honnibal b94b2b1168 Export hash_state from beam_utils 2019-03-15 15:20:28 +01:00
Matthew Honnibal ad56641324 Fix Language.evaluate 2019-03-15 15:20:09 +01:00
Matthew Honnibal f762c36e61 Evaluate accuracy at multiple beam widths 2019-03-15 15:19:49 +01:00
Matthew Honnibal 0703f5986b Remove hack from beam 2019-03-15 00:48:39 +01:00
Sofie c45ed32c74 label in span not writable anymore (#3408)
* label in span not writable anymore

* more explicit unit test and error message for readonly label

* bit more explanation (view)

* error msg tailored to specific case

* fix None case
2019-03-15 00:46:45 +01:00
Ines Montani 8ac197d443 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-03-12 15:22:11 +01:00
Matthew Honnibal 6aab2d8533 Set version to v2.1.0a13 2019-03-12 15:14:06 +01:00
Ines Montani 8ee6514ab8 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-03-12 15:11:39 +01:00
Ines Montani 479b5cff43 Auto-format [ci skip] 2019-03-12 13:35:34 +01:00
Matthew Honnibal 1179de0860 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-03-12 13:33:22 +01:00
Matthew Honnibal 8a4121cbc2 Fix bug introduced by component_cfg 2019-03-12 13:32:56 +01:00
Ines Montani 2912ddc9a6 Don't set extension attribute in Japanese (closes #3398) 2019-03-12 13:30:33 +01:00
Matthew Honnibal 062934aa12 Set version to v2.1.0a12 2019-03-11 22:26:19 +01:00
Ines Montani 886e5966c0 Update test_displacy.py 2019-03-11 19:03:52 +01:00
Ines Montani 4bd2688eac 💫 Fix displaCy support for RTL languages (#3393)
Closes #2091.

## Description

With the new `vocab.writing_system` property introduced in #3390 (exposed via the language defaults), I was able to finally fix this (I think!). Based on the `Doc`, displaCy now detects whether it's an RTL or LTR language and adjusts the visualization accordingly. Wherever possible, I've also added `direction` and `lang` attributes.

Entity visualization now looks like this:

<img width="318" alt="Screenshot 2019-03-11 at 16 06 51" src="https://user-images.githubusercontent.com/13643239/54136866-d97afd80-441c-11e9-8c27-3d46994cc833.png">

And dependencies like this (ignore the most likely incorrect tags and dependencies):

<img width="621" alt="Screenshot 2019-03-11 at 16 51 59" src="https://user-images.githubusercontent.com/13643239/54137771-8b66f980-441e-11e9-8460-0682b95eef2a.png">

### Types of change
enhancement, bug fix

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-11 18:52:50 +01:00
Ines Montani cdd418b93e Auto-format [ci skip] 2019-03-11 17:10:50 +01:00
Matthew Honnibal b0b990e405 Fix token.conjuncts (closes #795) (#3392)
* Implement conjuncts method

* Add span.conjuncts property

* Un-xfail token.conjuncts tests

* Update docs for token.conjuncts and span.conjuncts

* Fix merge error in token.conjuncts
2019-03-11 17:05:45 +01:00
Matthew Honnibal e2b9b523ce Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-03-11 15:59:28 +01:00
Ines Montani 47e9c274ef Tidy up property code style (#3391)
Use decorator if properties only have a getter and existing syntax if there's getter and setter
2019-03-11 15:59:09 +01:00
Matthew Honnibal db79a704bf Add xfail tests for token.conjuncts 2019-03-11 15:46:52 +01:00
Ines Montani c3df4d1108 Move displaCy tests to own file 2019-03-11 15:28:34 +01:00
Ines Montani c5a407e95a Fix code style 2019-03-11 15:28:22 +01:00
Matthew Honnibal 39a4741e26 Add support for vocab.writing_system property (#3390)
* Add xfail test for vocab.writing_system

* Add vocab.writing_system property

* Set Language.Defaults.writing_system

* Set default writing system

* Remove xfail on test_vocab_writing_system
2019-03-11 15:23:20 +01:00
Matthew Honnibal 05ef0a5abb Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-03-11 14:33:15 +01:00
Ines Montani ee4f312e89 Add writing_system to ArabicDefaults (experimental) 2019-03-11 14:22:23 +01:00
Ines Montani ebcf2bb1c3 Add Doc.lang and Doc.lang_ 2019-03-11 14:21:40 +01:00
Ines Montani ef80cfde6f Fix pickling of Japanese (closes #3191) 2019-03-11 13:34:23 +01:00
Ines Montani c399162a82 Tidy up 2019-03-11 13:34:14 +01:00
Ines Montani 7c05ca01e8 💫 Support mutable default values for extension attributes (#3389)
* Support mutable default values in extensions

* Update documentation
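
A minimal sketch of why mutable defaults need special handling, using a hypothetical `ToyExtension` class rather than spaCy's actual extension machinery: each object gets its own copy of the default, so mutating one value does not leak into the others.

```python
import copy


class ToyExtension:
    """Hypothetical per-object attribute store; not spaCy's implementation."""

    def __init__(self, default):
        self.default = default
        self._values = {}

    def get(self, obj_id):
        # Copy the default on first access so a mutable default (list, dict)
        # is never shared between objects.
        if obj_id not in self._values:
            self._values[obj_id] = copy.copy(self.default)
        return self._values[obj_id]


ext = ToyExtension(default=[])
ext.get("doc1").append("x")  # only affects the value stored for "doc1"
```

Without the copy, every object would share the same list and the append would show up everywhere.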
2019-03-11 12:50:44 +01:00
Matthew Honnibal 4e8a07c7d3 Set version to v2.1.0a11 2019-03-11 10:45:06 +01:00
Matthew Honnibal 80b94313b6 💫 Fix interaction of lemmatizer and tokenizer exceptions (#3388)
Closes #2203. Closes #3268.

Lemmas set from outside the `Morphology` class were being overwritten. The result was especially confusing when deserialising, as it meant some lemmas could change when storing and retrieving a `Doc` object.

This PR applies two fixes:

1) When we go to set the lemma in the `Morphology` class, first check whether a lemma is already set. If so, don't overwrite.
2) When we load with `doc.from_array()`, take care to apply the `TAG` field first. This allows other fields to overwrite the `TAG` implied properties, if they're provided explicitly (e.g. the `LEMMA`).
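
The first fix can be sketched as follows. This is an illustration only: `ToyToken` and `assign_lemma` are hypothetical names, not part of spaCy's `Morphology` class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyToken:
    """Hypothetical stand-in for a token; not spaCy's Token class."""
    text: str
    lemma: Optional[str] = None


def assign_lemma(token: ToyToken, computed_lemma: str) -> None:
    """Write the computed lemma only if no lemma has been set yet."""
    if token.lemma is None:
        token.lemma = computed_lemma


preset = ToyToken("gonna", lemma="going to")  # lemma set from outside
fresh = ToyToken("cats")                      # no lemma yet

assign_lemma(preset, "gonna")  # keeps "going to"
assign_lemma(fresh, "cat")     # sets "cat"
```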

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-11 01:31:21 +01:00
Matthew Honnibal 04ca710da7 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-03-11 01:07:34 +01:00
Matthew Honnibal 5d25ee52fb Fix English tag map 2019-03-11 01:06:02 +01:00
Ines Montani 8f45ff3dc2 Adjust formatting [ci skip] 2019-03-11 00:47:41 +01:00
Matthew Honnibal 7503e1e505 Improve English tag map. Re #593, #3311 2019-03-10 23:50:00 +01:00
Matthew Honnibal 98acf5ffe4 💫 Allow passing of config parameters to specific pipeline components (#3386)
* Add component_cfg kwarg to begin_training

* Document component_cfg arg to begin_training

* Update docs and auto-format

* Support component_cfg across Language

* Format

* Update docs and docstrings [ci skip]

* Fix begin_training
2019-03-10 23:36:47 +01:00
Ines Montani c998cde7e2 Auto-format [ci skip] 2019-03-10 19:22:59 +01:00
Ines Montani 7ba3a5d95c 💫 Make serialization methods consistent (#3385)
* Make serialization methods consistent

exclude keyword argument instead of random named keyword arguments and deprecation handling

* Update docs and add section on serialization fields
2019-03-10 19:16:45 +01:00
Ines Montani 67e38690d4 Un-xfail passing tests and tidy up 2019-03-10 18:42:16 +01:00
Matthew Honnibal 27dd820753 Fix vocab deserialization when loading already present lexemes (#3383)
* Fix vocab deserialization bug. Closes #2153

* Un-xfail test for #2153
2019-03-10 17:21:19 +01:00
Matthew Honnibal d6eaa71afc Handle scalar values in doc.from_array() 2019-03-10 16:54:03 +01:00
Matthew Honnibal 61e5ce02a4 Add xfailing test for #2153 2019-03-10 16:36:29 +01:00
Matthew Honnibal 7461e5e055 Fix batch bug in issue #3344 2019-03-10 16:01:34 +01:00
Matthew Honnibal 8a6272f842 Un-xfail test 2019-03-10 15:51:15 +01:00
Matthew Honnibal 4e80fc41ad Make doc.from_array() consistent with doc.to_array(). Closes #3382 2019-03-10 15:50:48 +01:00
Ines Montani 0426689db8 💫 Improve Doc.to_json and add Doc.is_nered (#3381)
* Use default return instead of else

* Add Doc.is_nered to indicate if entities have been set

* Add properties in Doc.to_json if they were set, not if they're available

This way, if a processed Doc exports "pos": None, it means that the tag was explicitly unset. If it exports "ents": [], it means that entity annotations are available but that this document doesn't contain any entities. Before, this would have been unclear and problematic for training.
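
The distinction between "set" and "available" can be sketched with a sentinel value. The `export_annotations` helper and the `_UNSET` sentinel are hypothetical and only illustrate the idea; they are not how `Doc.to_json` is implemented.

```python
from typing import Any, Dict

_UNSET = object()  # sentinel meaning "this annotation was never set"


def export_annotations(pos: Any = _UNSET, ents: Any = _UNSET) -> Dict[str, Any]:
    """Export only the annotations that were explicitly set."""
    data: Dict[str, Any] = {}
    if pos is not _UNSET:
        data["pos"] = pos    # may be None: explicitly unset
    if ents is not _UNSET:
        data["ents"] = ents  # may be []: annotated, but no entities
    return data


nothing_set = export_annotations()
no_entities = export_annotations(ents=[])
pos_unset = export_annotations(pos=None)
```

A missing key means the annotation was never provided, while `None` or `[]` carries information a training loop can rely on.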
2019-03-10 15:24:34 +01:00
Ines Montani 7984543953 Add xfailing test for to_array/from_array string attrs 2019-03-10 15:08:15 +01:00
Ines Montani 6bbf4ea309 Simplify tests and avoid tokenizing 2019-03-10 15:05:56 +01:00
Matthew Honnibal a5b1f6dcec Fix NER when preset entities cross sentence boundaries (#3379)
💫 Fix NER when preset entities cross sentence boundaries
2019-03-10 14:53:03 +01:00
Ines Montani 3fe5811fa7 Only link model after download if shortcut link (#3378) 2019-03-10 13:02:24 +01:00
Matthew Honnibal 231bc7bb7b Add xfailing test for #3345 2019-03-10 13:00:15 +01:00
Matthew Honnibal bdc77848f5 Add helper method to apply a transition in parser/NER 2019-03-10 13:00:00 +01:00
Matthew Honnibal ce1fe8a510 Add comment 2019-03-09 17:51:17 +00:00
Matthew Honnibal 28c26e212d Fix textcat model for GPU 2019-03-09 17:50:08 +00:00
Ines Montani 610fb306bd Revert hyphens 2019-03-09 12:51:53 +01:00
Ines Montani bbabb6aaae Escape more hyphens 2019-03-09 12:41:05 +01:00
Ines Montani b8db219850 Auto-format 2019-03-09 12:40:58 +01:00
Ines Montani a145bfe627 Try escaping hyphens again 2019-03-09 03:06:50 +01:00
Ines Montani b9c71fc0f0 Fix flags 2019-03-09 02:46:04 +01:00
Ines Montani ae09b6a6cf Try fixing unicode inconsistencies on Python 2 2019-03-09 02:37:50 +01:00
Ines Montani d957d7a697 Auto-format 2019-03-09 02:37:41 +01:00
Ines Montani 65402c3d02 Revert "Experiment with escaping hyphens"
This reverts commit 9b42e2d5dd.
2019-03-09 02:13:00 +01:00
Ines Montani 9b42e2d5dd Experiment with escaping hyphens 2019-03-09 02:05:26 +01:00
Ines Montani 76764fcf59 💫 Improve converters and training data file formats (#3374)
* Populate converter argument info automatically

* Add conversion option for msgpack

* Update docs

* Allow reading training data from JSONL
2019-03-08 23:15:23 +01:00
Ines Montani 296446a1c8 Tidy up and improve docs and docstrings (#3370)

## Description
* tidy up and adjust Cython code to code style
* improve docstrings and make calling `help()` nicer
* add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects
* fix various typos and inconsistencies in docs

### Types of change
enhancement, docs

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-08 11:42:26 +01:00
Ines Montani daaeeb7a2b Merge branch 'master' into develop 2019-03-07 22:07:31 +01:00
Adrien Ball 88909a9adb Fix egg fragments in direct download (#3369)
## Description
The egg fragment in the URL must be of the form `#egg=package_name==version` instead of `#egg=package_name-version`.
One of the consequences of specifying wrong egg fragments is that `pip` does not recognize the package and its version properly, and thus it re-downloads the package systematically.

I'm not sure how this should be tested properly. 
Here is what I had before the fix when running the same direct download twice:
```
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm-2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 1.6MB/s
  Generating metadata for package en-core-web-sm-2.0.0 produced metadata for project name en-core-web-sm. Fix your #egg=en-core-web-sm-2.0.0 fragments.
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm-2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 919kB/s
  Generating metadata for package en-core-web-sm-2.0.0 produced metadata for project name en-core-web-sm. Fix your #egg=en-core-web-sm-2.0.0 fragments.
Requirement already satisfied (use --upgrade to upgrade): en-core-web-sm from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0 in ./venv3/lib/python3.6/site-packages
```

And after the fix:
```
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 1.1MB/s
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0
$ python -m spacy download en_core_web_sm-2.0.0 --direct
Looking in indexes: https://pypi.python.org/simple/
Requirement already satisfied: en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0 in ./venv3/lib/python3.6/site-packages (2.0.0)
```
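
The corrected fragment format can be sketched as a small URL builder. The function name `direct_download_url` is hypothetical; only the URL layout and the `#egg=package_name==version` form are taken from the output above.

```python
def direct_download_url(base_url: str, name: str, version: str) -> str:
    """Build a direct-download URL with a pip-compatible egg fragment."""
    filename = f"{name}-{version}"
    # The fragment uses "==" between name and version, not "-".
    return f"{base_url}/{filename}/{filename}.tar.gz#egg={name}=={version}"


url = direct_download_url(
    "https://github.com/explosion/spacy-models/releases/download",
    "en_core_web_sm",
    "2.0.0",
)
```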

### Types of change
This is an enhancement as it avoids unnecessary downloads of (potentially big) spacy models, when they have already been downloaded.

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-07 21:07:19 +01:00
Ines Montani 96b91a8898 Fix noqa [ci skip] 2019-03-07 12:25:00 +01:00
Ines Montani 9d6ca18a10 Tidy up and only use self.vector once 2019-03-07 01:06:12 +01:00
Ines Montani a8f1efd2f5 Merge branch 'master' into develop 2019-03-07 00:56:31 +01:00
Daniel King 5f40229397 Don't use numpy directly for similarity (#3362)
* Don't use numpy directly for similarity

* Contributor agreement
2019-03-06 22:58:38 +00:00
Ines Montani 6bd34e9d54 Expose Japanese stop words (closes #3346) 2019-03-06 14:21:15 +01:00
Ines Montani 85deb96278 Fix whitespace 2019-03-06 14:20:34 +01:00
Ines Montani 23f6ebf0f3 Add missing " (closes #3343) 2019-02-27 16:37:03 +01:00
Ines Montani 533b580c19 Add test for stray print statements in languages (see #3342) 2019-02-27 16:04:30 +01:00
Ines Montani 48a2046d1c Remove stray print statement (closes #3342) 2019-02-27 15:35:04 +01:00
Ines Montani 07d7c0a1af Fix whitespace 2019-02-27 15:34:21 +01:00
Ines Montani 9b62639d19 Auto-format [ci skip] 2019-02-27 14:24:55 +01:00
Matthew Honnibal 656edcb984 Set version to v2.1.0a10 2019-02-27 12:26:13 +01:00
Matthew Honnibal f1d77eb140 💫 Improve handling of missing NER tags (closes #2603) (#3341)
* Improve handling of missing NER tags

GoldParse can accept missing NER tags, if entities is provided
in BILUO format (rather than as spans). Missing tags can be provided
as None values.

Fix bug that occurred when first tag was a None value. Closes #2603.

* Document specification of missing NER tags.
2019-02-27 12:06:32 +01:00
Ines Montani e359bdd0e3 Auto-format 2019-02-27 11:56:45 +01:00
Matthew Honnibal 4a3371acd5 Make doc[0].is_sent_start == True (closes #2869) (#3340)
* Make doc[0] have sent_start True. Closes #2869

* Document that doc[0].is_sent_start defaults True.
2019-02-27 11:17:17 +01:00
Matthew Honnibal 2d3ce89b78 Improve matcher tests re issue #3328 2019-02-27 10:25:56 +01:00
Matthew Honnibal 8d6954e0e7 Fix matcher bug #3328 2019-02-27 10:25:39 +01:00
Ines Montani aadf586789 Add xfailing test for #3331 2019-02-25 22:33:30 +01:00
Matthew Honnibal 3cdd3eb518 Set version to v2.1.0a9 2019-02-25 21:55:19 +01:00
Matthew Honnibal b449be0f04 Add comment re issue #3170 2019-02-25 21:24:03 +01:00
Matthew Honnibal 9ccd6a3062 Fix head-outside-sentence bug. Fixes #3170 2019-02-25 21:21:44 +01:00
Matthew Honnibal f2fae1f186 Add batch size argument to Language.evaluate(). Closes #3263 2019-02-25 19:30:33 +01:00
Ines Montani f135d663f7 Update conftest.py 2019-02-25 15:55:29 +01:00
Ines Montani 76ce8b2662 Merge branch 'master' into develop 2019-02-25 15:54:55 +01:00
Julia Makogon f1c3108d52 Fixing pymorphy2 dependency issue (#3329) (closes #3327)
* Classes for Ukrainian; small fix in Russian.

* Contributor agreement

* pymorphy2 initialization split for ru and uk (#3327)

* stop-words fixed

* Unit-tests updated
2019-02-25 15:48:17 +01:00
Ines Montani 1a735e0f1f Add regression test for #3328 2019-02-25 10:12:58 +01:00
Ines Montani dfbed07d3b Remove unused temp errors 2019-02-24 22:26:08 +01:00
Ines Montani 62b558ab72 💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325)
* Fix formatting and whitespace

* Add support for lexical attributes (closes #2390)

* Document lexical attribute setting during retokenization

* Assign variable outside of nested loop
2019-02-24 21:13:51 +01:00
Ines Montani a48deb4081 Merge regression tests 2019-02-24 21:03:39 +01:00
Ines Montani 8f6c193a4d Delete _test_issue1622.py 2019-02-24 20:33:31 +01:00
Ines Montani c8e967c78d Try include previously segfaulting test 2019-02-24 20:32:46 +01:00
Ines Montani 328b589deb Merge regression tests 2019-02-24 20:31:38 +01:00
Ines Montani 3bc53905cc Remove print statements from test 2019-02-24 20:31:15 +01:00
Ines Montani 1ae0df3da9 Un-x-fail passing test 2019-02-24 20:24:15 +01:00
Ines Montani 399a5803d0 Tidy up tests [ci skip] 2019-02-24 19:02:16 +01:00
Ines Montani 2011563c51 Update docstrings [ci skip] 2019-02-24 18:39:59 +01:00
Ines Montani df19e2bff6 💫 Allow setting of custom attributes during retokenization (closes #3314) (#3324)

## Description

This PR adds the ability to override custom extension attributes during merging. This will only work for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attributes that have both a getter *and* a setter implemented.

```python
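import spacy
from spacy.tokens import Token

# Assumption: a blank English pipeline is enough for this example; the
# original snippet relied on an existing `nlp` object.
nlp = spacy.blank("en")

# Register a writable custom attribute with a default value
Token.set_extension('is_musician', default=False)

doc = nlp("I like David Bowie.")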
with doc.retokenize() as retokenizer:
    attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}}
    retokenizer.merge(doc[2:4], attrs=attrs)

assert doc[2].text == "David Bowie"
assert doc[2].lemma_ == "David Bowie"
assert doc[2]._.is_musician
```

### Types of change
enhancement

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-24 18:38:47 +01:00
Ines Montani 1ea1bc98e7 Document regex utilities [ci skip] 2019-02-24 18:34:10 +01:00
Matthew Honnibal 1f7c56cd93 Fix parser.add_label() 2019-02-24 16:53:22 +01:00
Matthew Honnibal 893aa40d73 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-02-24 16:43:01 +01:00
Matthew Honnibal 5882d82915 Set version to v2.1.0a9.dev2 2019-02-24 16:42:06 +01:00
Matthew Honnibal 0367f864fe Fix handling of added labels. Resolves #3189 2019-02-24 16:41:41 +01:00
Matthew Honnibal d74dbde828 Fix order of actions when labels added to parser
When labels were added to the parser or NER, we weren't loading back the
classes in the correct order. Re issue #3189
2019-02-24 16:36:29 +01:00
Ines Montani 6de81ae310 Fix formatting of errors 2019-02-24 15:11:28 +01:00
Ines Montani d8f69d592f Tidy up retokenizer tests 2019-02-24 14:14:11 +01:00
Ines Montani 723e27cb8c Tidy up tests 2019-02-24 14:11:23 +01:00
Ines Montani 2982f82934 Auto-format 2019-02-24 14:09:15 +01:00
Matthew Honnibal 909a9d9932 Set version to v2.1.0a9.dev1 2019-02-23 13:10:42 +01:00
Matthew Honnibal 6b0008afc6 Clean up TextCategorizer slightly 2019-02-23 12:28:06 +01:00
Matthew Honnibal d13b9373bf Improve initialization for mutually textcat 2019-02-23 12:27:45 +01:00
Matthew Honnibal e9dd5943b9 Support exclusive_classes setting for textcat models 2019-02-23 11:57:16 +01:00
Matthew Honnibal ce1e4eace2 Default to former TextCategorizer model
* Keep TextCategorizer default model same as v2.0
* Add option 'architecture' that allows "simple_cnn" to switch to
simpler model.
* Add option exclusive_classes, defaulting to False. If set to True,
the model treats classes as mutually exclusive, i.e. only one class can
be true per instance.
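
The difference between the two settings can be sketched with plain Python. This assumes the common formulation (softmax for mutually exclusive classes, independent sigmoids otherwise); the commit message does not state which functions the model uses, and the helper names are hypothetical.

```python
import math
from typing import List


def exclusive_scores(logits: List[float]) -> List[float]:
    """Softmax: scores sum to 1, so exactly one class is favoured."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def independent_scores(logits: List[float]) -> List[float]:
    """Sigmoid per class: each score is in (0, 1) independently."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


logits = [2.0, 1.0, 0.1]
exclusive = exclusive_scores(logits)      # competes across classes
independent = independent_scores(logits)  # several classes can be high
```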
2019-02-23 11:55:16 +01:00
Matthew Honnibal 829c9091a4 Set version to v2.1.0a9.dev0 2019-02-21 17:13:34 +01:00
Matthew Honnibal d396a69c7b More fixes for issue #3112 2019-02-21 17:12:23 +01:00
Ines Montani 80bdcb99c5 Fix escaping of HTML in displacy ENT (closes #2728) 2019-02-21 14:30:39 +01:00
Matthew Honnibal 7d529ebdfb Set version to v2.1.0a8 2019-02-21 12:09:34 +01:00
Matthew Honnibal f75be6e7be Set version to v2.1.0a8.dev1 2019-02-21 11:57:06 +01:00
Matthew Honnibal c5f947f194 Fix regex deprecation warnings 2019-02-21 11:56:47 +01:00
Matthew Honnibal 7f02464494 Set version to v2.1.0a8.dev0 2019-02-21 11:42:23 +01:00
Matthew Honnibal f31dbec528 More fixes for #3112 2019-02-21 11:10:10 +01:00
Matthew Honnibal 80195bc2d1 Fix issue #3288 (#3308) 2019-02-21 09:48:53 +01:00
Matthew Honnibal a137e8b418 Fix Pipe.to_bytes() when model uninitialized
Closes #3289
2019-02-21 09:42:02 +01:00
Matthew Honnibal 6574e4f2d3 Fix issue #3112 part 1 2019-02-21 09:27:38 +01:00
Matthew Honnibal b21481eeca Load token_match regex with .match, not .search 2019-02-21 09:09:03 +01:00
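
The difference between `.match` and `.search` in the commit above can be shown with Python's standard `re` module. The URL pattern is only an illustrative stand-in, not spaCy's actual `token_match` expression.

```python
import re

# Illustrative pattern; an assumption, not the tokenizer's real regex.
token_match = re.compile(r"https?://\S+")

text = "see http://example.com"

anchored = token_match.match(text)   # only matches at the start of the string
anywhere = token_match.search(text)  # matches anywhere in the string

whole_token = token_match.match("http://example.com")
```

With `.match`, a string only qualifies when the pattern applies from its first character, which is the stricter behaviour a token-level check usually wants.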
Sofie 9a478b6db8 Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293)
* splitting up latin unicode interval

* removing hyphen as infix for French

* adding failing test for issue 1235

* test for issue #3002 which now works

* partial fix for issue #2070

* keep the hyphen as infix for French (as it was)

* restore french expressions with hyphen as infix (as it was)

* added succeeding unit test for Issue #2656

* Fix issue #2822 with custom Italian exception

* Fix issue #2926 by allowing numbers right before infix /

* remove duplicate

* remove xfail for Issue #2179 fixed by Matt

* adjust documentation and remove reference to regex lib
2019-02-20 22:10:13 +01:00
Matthew Honnibal 0d1ca15b13 💫 Fix bugs in matcher extensions. Closes #1971 (#3301)
* Fix matching on extension attrs and predicates

* Fix detection of match_id when using extension attributes. The match
ID is stored as the last entry in the pattern. We were checking for this
with nr_attr == 0, which didn't account for extension attributes.

* Fix handling of predicates. The wrong count was being passed through,
so even patterns that didn't have a predicate were being checked.

* Fix regex pattern

* Fix matcher set value test
2019-02-20 21:30:39 +01:00
Ines Montani 3b667787a9 Add xfailing test for #3289 2019-02-18 16:45:04 +01:00
Ines Montani 91f260f2c4 Add another test for #1971 2019-02-18 13:36:20 +01:00
Ines Montani f30aac324c Update test_issue1971.py 2019-02-18 13:36:15 +01:00
Ines Montani 8fa26ca97e Fix tensor shape in test for #3288 2019-02-18 11:01:54 +01:00
Ines Montani c32290557f Add xfailing test for #3288 2019-02-18 10:59:31 +01:00
Ines Montani 3fdcdec6a0 Merge branch 'master' into develop 2019-02-18 10:03:32 +01:00
Roshni Biswas e09f1347fa updates for Bengali language (#3286)
* Update morph_rules.py

* contributor agreement for roshni-b

* created example sentences
2019-02-18 10:02:28 +01:00
Ines Montani 043e8186f3 Merge branch 'master' into develop 2019-02-17 17:51:17 +01:00
Marc Puig 51268e9f21 Typo error fixed (#3284) 2019-02-17 17:51:02 +01:00
Ines Montani 3af0b2dd1c Add xfailing test for #1971 [ci skip] 2019-02-17 13:04:47 +01:00
Ines Montani 19a002bfd3 Merge branch 'master' into develop 2019-02-17 12:22:54 +01:00
Ines Montani 1e252b129c Auto-format 2019-02-17 12:22:07 +01:00
Roshni Biswas e26d923726 Update morph_rules.py (#3283) 2019-02-17 12:21:47 +01:00
Matthew Honnibal 7d4a52a4d0 Set version to v2.1.0a7 2019-02-16 17:48:34 +01:00
Matthew Honnibal 07617b6b7f Set version to v2.1.0a7.dev12 2019-02-16 17:30:29 +01:00
Matthew Honnibal 1dc314bada Set version to v2.1.0a7.dev11 2019-02-16 17:02:49 +01:00
Matthew Honnibal 2ef227c313 Set version to v2.1.0a7.dev1 2019-02-16 16:22:46 +01:00
Matthew Honnibal 22923b9cb1 Set version to v2.1.0a7.dev9 2019-02-16 15:47:19 +01:00
Matthew Honnibal e0c91a4c8d Set version to 2.1.0a7 2019-02-16 14:43:38 +01:00
Matthew Honnibal 92b6bd2977 Refinements to retokenize.split() function (#3282)
* Change retokenize.split() API for heads

* Pass lists as values for attrs in split

* Fix test_doc_split filename

* Add error for mismatched tokens after split

* Raise error if new tokens don't match text

* Fix doc test

* Fix error

* Move deps under attrs

* Fix split tests

* Fix retokenize.split
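
The "new tokens don't match text" check listed above can be sketched as a simple validation. `check_split_text` is a hypothetical helper, not the retokenizer's real code, and it assumes the new token texts must concatenate to the original token text.

```python
from typing import List


def check_split_text(token_text: str, orths: List[str]) -> None:
    """Raise if the proposed sub-token texts don't add up to the original."""
    joined = "".join(orths)
    if joined != token_text:
        raise ValueError(
            f"Can't split {token_text!r}: new tokens join to {joined!r}"
        )


check_split_text("NewYork", ["New", "York"])  # passes silently
```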
2019-02-15 17:32:31 +01:00
Matthew Honnibal 2dbc61bc26 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-02-15 14:03:54 +01:00
Ines Montani 1aa57690dc Add xfailing test for orth mismatch in retokenizer.split 2019-02-15 13:55:04 +01:00
Ines Montani 819768483f Add xfailing test for out-of-bounds heads 2019-02-15 13:09:07 +01:00
Ines Montani d8051e89ca Tidy up tests 2019-02-15 12:56:51 +01:00
Matthew Honnibal 58aac58631 Set version to v2.1.0a7.dev8 2019-02-15 12:39:26 +01:00
Matthew Honnibal 5f1abe2cc7 Set version to v2.1.0a7.dev7 2019-02-15 10:30:53 +01:00
Matthew Honnibal a66e8e0c8a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-02-15 10:30:22 +01:00
Ines Montani c31a9dabd5 💫 Add en/em dash to prefixes and suffixes (#3281)
* Auto-format

* Add en/em dash to prefixes and suffixes
2019-02-15 10:29:59 +01:00
Ines Montani 5651a0d052 💫 Replace {Doc,Span}.merge with Doc.retokenize (#3280)
* Add deprecation warning to Doc.merge and Span.merge

* Replace {Doc,Span}.merge with Doc.retokenize
2019-02-15 10:29:44 +01:00
Matthew Honnibal dcf79c5ef3 Set version to v2.1.0a7.dev6 2019-02-14 20:12:02 +01:00
Matthew Honnibal 0371ac23e7 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-02-14 20:09:10 +01:00
Ines Montani f146121092 💫 Make handling of [Pipe].labels consistent (#3273)
* Make handling of [Pipe].labels consistent

* Un-xfail passing test

* Update spacy/pipeline/pipes.pyx

Co-Authored-By: ines <ines@ines.io>

* Update spacy/pipeline/pipes.pyx

Co-Authored-By: ines <ines@ines.io>

* Update spacy/tests/pipeline/test_pipe_methods.py

Co-Authored-By: ines <ines@ines.io>

* Update spacy/pipeline/pipes.pyx

Co-Authored-By: ines <ines@ines.io>

* Move error message to spacy.errors

* Fix textcat labels and test

* Make EntityRuler.labels return tuple as well
2019-02-15 06:03:19 +11:00
Ines Montani 3d577b77c6 Auto-formatting 2019-02-14 19:56:38 +01:00
Ines Montani 2569339a98 Formatting and whitespace [ci skip] 2019-02-14 18:05:07 +01:00
Matthew Honnibal aebf71bc72 Set version to v2.1.0a7.dev5 2019-02-14 15:51:42 +01:00
Matthew Honnibal 6ccd67c682 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-02-14 15:51:12 +01:00
Ines Montani e104e47c21 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-02-14 15:35:34 +01:00
Ines Montani 0cd01a8c5e Merge branch 'master' into develop 2019-02-14 15:35:20 +01:00
Ines Montani 2e31921d0a 💫 Add base Language classes for more languages (#3276)
* Add base classes for more languages

* Add test for language class initialization

Make sure the language can be initialized; otherwise, it's difficult to catch serious errors in the test suite, because languages are lazy-loaded
2019-02-15 01:31:19 +11:00
Grivaz 39815513e2 Add split one token into several (resolves #2838) (#3253)
* Add split one token into several (resolves #2838)

* Improve error message for token splitting

* Make retokenizer.split() tests use a Token object

Change retokenizer.split() to use a Token object, instead of an index.

* Pass Token into retokenize.split()

Tweak retokenize.split() API so that we pass the `Token` object, not the index.

* Fix token.idx in retokenize.split()

* Test that token.idx is correct after split

* Fix token.idx for split tokens

* Fix retokenize.split()

* Fix retokenize.split

* Fix retokenize.split() test
2019-02-15 01:27:13 +11:00
Ines Montani 743ecf728c Tidy up conftest 2019-02-14 13:27:13 +01:00
Ines Montani 106d95b01a Fix typo 2019-02-14 12:26:56 +01:00
Ines Montani 11d6b874db Update stop_words.py 2019-02-14 12:25:19 +01:00
Ines Montani 60c2a3bb65 Also raise original error message in util.get_lang_class
Otherwise, the true error that happens within a Language subclass is swallowed, because if it's imported lazily like that, it'll always be an ImportError
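
A minimal sketch of the idea, assuming a hypothetical `load_lang_module` helper and package name `toy_languages` (neither is spaCy's `util.get_lang_class`): the lazy import is wrapped so that the original error text is carried along instead of being replaced by a generic message.

```python
import importlib


def load_lang_module(lang: str, package: str = "toy_languages"):
    """Lazily import a language module and keep the original error visible."""
    try:
        return importlib.import_module(f"{package}.{lang}")
    except ImportError as err:
        # Include the original message and chain the exception so the real
        # cause inside the language module is not swallowed.
        raise ImportError(
            f"Can't import language '{lang}' from {package}: {err}"
        ) from err
```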
2019-02-13 16:52:25 +01:00
Ines Montani 4d2438f985 Tidy up and auto-format 2019-02-13 15:29:08 +01:00
Ines Montani fbf9f1edf1 Also raise error in Span.__reduce__ 2019-02-13 13:22:05 +01:00
Matthew Honnibal 1831e1423d Set version to v2.1.0a7.dev4 2019-02-13 23:08:40 +11:00
Matthew Honnibal 63dc4234a3 Set version to v2.1.0a7.dev3 2019-02-13 22:53:10 +11:00
Matthew Honnibal b7ea39564f Set version to v2.1.0a7.dev2 2019-02-13 22:52:43 +11:00
Ines Montani 2d0c3c73f4 Raise better error if token is pickled (resolves #2833) (#3267) 2019-02-13 11:27:04 +01:00
Ines Montani 2f45bd94c0 Auto-formatting 2019-02-12 18:30:11 +01:00
Ines Montani 0184a95340 Merge branch 'master' into develop 2019-02-12 18:29:24 +01:00
Akhilesh a78db10941 add kannada support (#3264)
* add kannada support

* add few more stop words

* add support for Kannada Language
2019-02-12 18:28:39 +01:00
Ines Montani b589b945db Fix PhraseMatcher pickling and length (resolves #3248) (#3252) 2019-02-12 18:27:54 +01:00
Ines Montani 483dddc9bc 💫 Add token match pattern validation via JSON schemas (#3244)
* Add custom MatchPatternError

* Improve validators and add validation option to Matcher

* Adjust formatting

* Never validate in Matcher within PhraseMatcher

If we do decide to make validate default to True, the PhraseMatcher's Matcher shouldn't ever validate. Here, we create the patterns automatically anyways (and it's currently unclear whether the validation has performance impacts at a very large scale).
2019-02-13 01:47:26 +11:00
Ines Montani ad2a514cdf Show warning if phrase pattern Doc was overprocessed (#3255)
In most cases, the PhraseMatcher will match on the verbatim token text or as of v2.1, sometimes the lowercase text. This means that we only need a tokenized Doc, without any other attributes.

If phrase patterns are created by processing large terminology lists with the full `nlp` object, this can easily make things a lot slower, because all components will be applied, even if we don't actually need the attributes they set (like part-of-speech tags, dependency labels).

The warning message also includes a suggestion to use nlp.make_doc or nlp.tokenizer.pipe for even faster processing. For now, the validation has to be enabled explicitly by setting validate=True.
2019-02-13 01:45:31 +11:00
Matthew Honnibal 6ec834dc72 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-02-13 01:14:44 +11:00
Matthew Honnibal 43fa039d96 xfail regression test for model labels 2019-02-13 01:14:26 +11:00