Commit Graph

1190 Commits

Author SHA1 Message Date
Ines Montani 4bd2688eac
💫 Fix displaCy support for RTL languages (#3393)
Closes #2091.

## Description

With the new `vocab.writing_system` property introduced in #3390 (exposed via the language defaults), I was able to finally fix this (I think!). Based on the `Doc`, dispaCy now detects whether it's a RTL or LTR language and adjusts the visualization accordingly. Wherever possible, I've also added `direction` and `lang` attributes.

Entity visualization now looks like this:

<img width="318" alt="Screenshot 2019-03-11 at 16 06 51" src="https://user-images.githubusercontent.com/13643239/54136866-d97afd80-441c-11e9-8c27-3d46994cc833.png">

And dependencies like this (ignore the most likely incorrect tags and dependencies):

<img width="621" alt="Screenshot 2019-03-11 at 16 51 59" src="https://user-images.githubusercontent.com/13643239/54137771-8b66f980-441e-11e9-8460-0682b95eef2a.png">

### Types of change
enhancement, bug fix

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-11 18:52:50 +01:00
Matthew Honnibal b0b990e405 Fix token.conjuncts (closes #795) (#3392)
* Implement conjuncts method

* Add span.conjuncts property

* Un-xfail token.conjuncts tests

* Update docs for token.conjuncts and span.conjuncts

* Fix merge error in token.conjuncts
2019-03-11 17:05:45 +01:00
Matthew Honnibal e2b9b523ce Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2019-03-11 15:59:28 +01:00
Matthew Honnibal db79a704bf Add xfail tests for token.conjuncts 2019-03-11 15:46:52 +01:00
Ines Montani c3df4d1108 Move displaCy tests to own file 2019-03-11 15:28:34 +01:00
Ines Montani c5a407e95a Fix code style 2019-03-11 15:28:22 +01:00
Matthew Honnibal 39a4741e26 Add support for vocab.writing_system property (#3390)
* Add xfail test for vocab.writing_system

* Add vocab.writing_system property

* Set Language.Defaults.writing_system

* Set default writing system

* Remove xfail on test_vocab_writing_system
2019-03-11 15:23:20 +01:00
Ines Montani ebcf2bb1c3 Add Doc.lang and Doc.lang_ 2019-03-11 14:21:40 +01:00
Ines Montani c399162a82 Tidy up 2019-03-11 13:34:14 +01:00
Ines Montani 7c05ca01e8 💫 Support mutable default values for extension attributes (#3389)
* Support mutable default values in extensions

* Update documentation
2019-03-11 12:50:44 +01:00
Matthew Honnibal 80b94313b6 💫 Fix interaction of lemmatizer and tokenizer exceptions (#3388)
Closes #2203. Closes #3268.

Lemmas set from outside the `Morphology` class were being overwritten. The result was especially confusing when deserialising, as it meant some lemmas could change when storing and retrieving a `Doc` object.

This PR applies two fixes:

1) When we go to set the lemma in the `Morphology` class, first check whether a lemma is already set. If so, don't overwrite.
2) When we load with `doc.from_array()`, take care to apply the `TAG` field first. This allows other fields to overwrite the `TAG` implied properties, if they're provided explicitly (e.g. the `LEMMA`).

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-11 01:31:21 +01:00
Ines Montani 8f45ff3dc2 Adjust formatting [ci skip] 2019-03-11 00:47:41 +01:00
Ines Montani 7ba3a5d95c 💫 Make serialization methods consistent (#3385)
* Make serialization methods consistent

exclude keyword argument instead of random named keyword arguments and deprecation handling

* Update docs and add section on serialization fields
2019-03-10 19:16:45 +01:00
Ines Montani 67e38690d4 Un-xfail passing tests and tidy up 2019-03-10 18:42:16 +01:00
Matthew Honnibal 27dd820753
Fix vocab deserialization when loading already present lexemes (#3383)
* Fix vocab deserialization bug. Closes #2153

* Un-xfail test for #2153
2019-03-10 17:21:19 +01:00
Matthew Honnibal 61e5ce02a4 Add xfailing test for #2153 2019-03-10 16:36:29 +01:00
Matthew Honnibal 8a6272f842 Un-xfail test 2019-03-10 15:51:15 +01:00
Ines Montani 0426689db8 💫 Improve Doc.to_json and add Doc.is_nered (#3381)
* Use default return instead of else

* Add Doc.is_nered to indicate if entities have been set

* Add properties in Doc.to_json if they were set, not if they're available

This way, if a processed Doc exports "pos": None, it means that the tag was explicitly unset. If it exports "ents": [], it means that entity annotations are available but that this document doesn't contain any entities. Before, this would have been unclear and problematic for training.
2019-03-10 15:24:34 +01:00
Ines Montani 7984543953 Add xfailing test for to_array/from_array string attrs 2019-03-10 15:08:15 +01:00
Ines Montani 6bbf4ea309 Simplify tests and avoid tokenizing 2019-03-10 15:05:56 +01:00
Matthew Honnibal a5b1f6dcec Fix NER when preset entities cross sentence boundaries (#3379)
💫 Fix NER when preset entities cross sentence boundaries
2019-03-10 14:53:03 +01:00
Matthew Honnibal 231bc7bb7b Add xfailing test for #3345 2019-03-10 13:00:15 +01:00
Ines Montani 96b91a8898 Fix noqa [ci skip] 2019-03-07 12:25:00 +01:00
Ines Montani 533b580c19 Add test for stray print statements in languages (see #3342) 2019-02-27 16:04:30 +01:00
Ines Montani 9b62639d19 Auto-format [ci skip] 2019-02-27 14:24:55 +01:00
Matthew Honnibal f1d77eb140
💫 Improve handling of missing NER tags (closes #2603) (#3341)
* Improve handling of missing NER tags

GoldParse can accept missing NER tags, if entities is provided
in BILUO format (rather than as spans). Missing tags can be provided
as None values.

Fix bug that occurred when first tag was a None value. Closes #2603.

* Document specification of missing NER tags.
2019-02-27 12:06:32 +01:00
Ines Montani e359bdd0e3 Auto-format 2019-02-27 11:56:45 +01:00
Matthew Honnibal 4a3371acd5
Make doc[0].is_sent_start == True (closes #2869) (#3340)
* Make doc[0] have sent_start True. Closes #2869

* Document that doc[0].is_sent_start defaults True.
2019-02-27 11:17:17 +01:00
Matthew Honnibal 2d3ce89b78 Improve matcher tests re issue #3328 2019-02-27 10:25:56 +01:00
Matthew Honnibal 8d6954e0e7 Fix matcher bug #3328 2019-02-27 10:25:39 +01:00
Ines Montani aadf586789 Add xfailing test for #3331 2019-02-25 22:33:30 +01:00
Ines Montani f135d663f7 Update conftest.py 2019-02-25 15:55:29 +01:00
Ines Montani 76ce8b2662 Merge branch 'master' into develop 2019-02-25 15:54:55 +01:00
Julia Makogon f1c3108d52 Fixing pymorphy2 dependency issue (#3329) (closes #3327)
* Classes for Ukrainian; small fix in Russian.

* Contributor agreement

* pymorphy2 initialization split for ru and uk (#3327)

* stop-words fixed

* Unit-tests updated
2019-02-25 15:48:17 +01:00
Ines Montani 1a735e0f1f Add regression test for #3328 2019-02-25 10:12:58 +01:00
Ines Montani 62b558ab72 💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325)
* Fix formatting and whitespace

* Add support for lexical attributes (closes #2390)

* Document lexical attribute setting during retokenization

* Assign variable oputside of nested loop
2019-02-24 21:13:51 +01:00
Ines Montani a48deb4081 Merge regression tests 2019-02-24 21:03:39 +01:00
Ines Montani 8f6c193a4d Delete _test_issue1622.py 2019-02-24 20:33:31 +01:00
Ines Montani c8e967c78d Try include previously segfaulting test 2019-02-24 20:32:46 +01:00
Ines Montani 328b589deb Merge regression tests 2019-02-24 20:31:38 +01:00
Ines Montani 3bc53905cc Remove print statements from test 2019-02-24 20:31:15 +01:00
Ines Montani 1ae0df3da9 Un-x-fail passing test 2019-02-24 20:24:15 +01:00
Ines Montani 399a5803d0 Tidy up tests [ci skip] 2019-02-24 19:02:16 +01:00
Ines Montani df19e2bff6
💫 Allow setting of custom attributes during retokenization (closes #3314) (#3324)
<!--- Provide a general summary of your changes in the title. -->

## Description

This PR adds the abilility to override custom extension attributes during merging. This will only work for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attribute that have both a getter *and* a setter implemented.

```python
Token.set_extension('is_musician', default=False)

doc = nlp("I like David Bowie.")
with doc.retokenize() as retokenizer:
    attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}}
    retokenizer.merge(doc[2:4], attrs=attrs)

assert doc[2].text == "David Bowie"
assert doc[2].lemma_ == "David Bowie"
assert doc[2]._.is_musician
```

### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-24 18:38:47 +01:00
Ines Montani d8f69d592f Tidy up retokenizer tests 2019-02-24 14:14:11 +01:00
Ines Montani 723e27cb8c Tidy up tests 2019-02-24 14:11:23 +01:00
Ines Montani 80bdcb99c5 Fix escaping of HTML in displacy ENT (closes #2728) 2019-02-21 14:30:39 +01:00
Matthew Honnibal c5f947f194 Fix regex deprecation warnings 2019-02-21 11:56:47 +01:00
Matthew Honnibal 80195bc2d1
Fix issue #3288 (#3308) 2019-02-21 09:48:53 +01:00
Matthew Honnibal a137e8b418 Fix Pipe.to_bytes() when model uninitialized
Closes #3289
2019-02-21 09:42:02 +01:00