Commit Graph

9509 Commits

Author SHA1 Message Date
Matthew Honnibal 63b7accd74
💫 Make span.as_doc() return a copy, not a view. Closes #1537 (#3107)
Initially span.as_doc() was designed to return a view of the span's contents, as a Doc object. This was a nice idea, but it fails due to the token.idx property, which refers to the character offset within the string. In a span, the idx of the first token might not be 0. Because this data is different, we can't have a view --- it'll be inconsistent.

This patch changes span.as_doc() to instead return a copy. The docs are updated accordingly. Closes #1537

* Update test for span.as_doc()

* Make span.as_doc() return a copy. Closes #1537

* Document change to Span.as_doc()
2018-12-30 15:17:46 +01:00
Matthew Honnibal 72e4d3782a
Resize doc.tensor when merging spans. Closes #1963 (#3106)
The doc.retokenize() context manager wasn't resizing doc.tensor, leading to a mismatch between the number of tokens in the doc and the number of rows in the tensor. We fix this by deleting rows from the tensor. Merged spans are represented by the vector of their last token.

* Add test for resizing doc.tensor when merging

* Add test for resizing doc.tensor when merging. Closes #1963

* Update get_lca_matrix test for develop

* Fix retokenize if tensor unset
2018-12-30 15:17:17 +01:00
Matthew Honnibal 3d64eb4a74 Update get_lca_matrix test for develop 2018-12-30 14:28:07 +01:00
Matthew Honnibal ac9e3a4a8b Add test for #1773 2018-12-30 13:16:05 +01:00
Matthew Honnibal ee4d06fb1b Prevent exceptions from setting POS but not TAG. Closes #1773 2018-12-30 13:16:05 +01:00
Kirill Bulygin b665a32b95 Enabling `tests/lang/ru/test_lemmatizer.py`, fixing a `unicode` issue (#3084)
<!--- Provide a general summary of your changes in the title. -->

## Description

See #3079. Here I'm merging into `develop` instead of `master`.

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

Bug fix.

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-12-30 12:10:26 +01:00
Álvaro Abella Bascarán 9bc4cc1352 Fix issue 2396 (#3089)
* Test on #2396: bug in Doc.get_lca_matrix()

* reimplementation of Doc.get_lca_matrix(), (closes #2396)

* reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()

* tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()

* implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix

* use memory view instead of np.ndarray in _get_lca_matrix (faster)

* fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview

* cleaner conditional, add comment
2018-12-29 18:05:52 +01:00
Álvaro Abella Bascarán 6fe276f85d Fix issue 2396 (#3089)
* Test on #2396: bug in Doc.get_lca_matrix()

* reimplementation of Doc.get_lca_matrix(), (closes #2396)

* reimplement Span.get_lca_matrix(), and call it from Doc.get_lca_matrix()

* tests Span.get_lca_matrix() as well as Doc.get_lca_matrix()

* implement _get_lca_matrix as a helper function in doc.pyx; call it from Doc.get_lca_matrix and Span.get_lca_matrix

* use memory view instead of np.ndarray in _get_lca_matrix (faster)

* fix bug when calling Span.get_lca_matrix; return lca matrix as np.array instead of memoryview

* cleaner conditional, add comment
2018-12-29 18:02:26 +01:00
Matthew Honnibal 76e3e695af Allow single string attributes in doc.to_array()
Previously inputs like doc.to_array('ORTH') didn't work.

Closes #3064
2018-12-29 16:24:40 +01:00
Matthew Honnibal 174e85439b
Fix behaviour of Matcher's ? quantifier for v2.1 (#3105)
* Add failing test for matcher bug #3009

* Deduplicate matches from Matcher

* Update matcher ? quantifier test

* Fix bug with ? quantifier in Matcher

The ? quantifier indicates a token may occur zero or one times. If the
token pattern fit, the matcher would fail to consider valid matches
where the token pattern did not fit. Consider a simple regex like:

.?b

If we have the string 'b', the .? part will fit --- but then the 'b' in
the pattern will not fit, leaving us with no match. The same bug left us
with too few matches in some cases. For instance, consider:

.?.?

If we have a string of length two, like 'ab', we actually have three
possible matches here: [a, b, ab]. We were only recovering 'ab'. This
should now be fixed. Note that the fix also uncovered another bug, where
we weren't deduplicating the matches. There are actually two ways we
might match 'a' and two ways we might match 'b': as the second token of the pattern,
or as the first token of the pattern. This ambiguity is spurious, so we
need to deduplicate.

Closes #2464 and #3009

* Fix Python2
2018-12-29 16:18:09 +01:00
Matthew Honnibal e808bdd076 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-12-29 13:54:15 +01:00
Sofie b7916fffcf Fixing few typos in the documentation (#3103)
* few typos / small grammatical errors corrected in documentation

* one more typo

* one last typo
2018-12-28 15:52:26 +01:00
Jari Bakken ba8a840f84 spacy.cli.evaluate: fix TypeError (#3101) 2018-12-28 11:14:28 +01:00
Jari Bakken 0546135fba Set vectors.name when updating meta.json during training (#3100)
* Set vectors.name when updating meta.json during training

* add vectors name to meta in `spacy package`
2018-12-27 19:55:40 +01:00
Jari Bakken cc95167b6d cli.convert: fix typo in converter arguments (#3099) 2018-12-27 18:08:41 +01:00
Jari Bakken e172f2478e Add three missing tags from the `nb` tag map (#3085)
* Contributors agreement for jarib

* Add tags from the UD/NORNE dataset that is missing in the nb tag map. Relates to #3082.
2018-12-27 14:48:40 +01:00
Will Price 4a6af0852a Improve random prefix generation in displaCy arcs (#3096)
* Improve random prefix generation in displaCy arcs

* Add @willprice contributor agreement
2018-12-27 14:46:02 +01:00
Özcan Kasal b573ebca77 trilyon forgotten (#3083)
* trilyon forgotten

* contributor added
2018-12-27 14:44:23 +01:00
Matthew Honnibal 978d8be8f9 Set version to v2.1.0a5 2018-12-21 00:26:39 +01:00
Matthew Honnibal d3f03b1668 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-12-21 00:25:57 +01:00
Ines Montani bb9ad37e05 Improve entry points and allow custom language classes via entry points (#3080)
* Remove check for overwritten factory

This needs to be handled differently – on first initialization, a new factory will be added and any subsequent initializations will trigger this warning, even if it's a new entry point that doesn't overwrite a built-in.

* Add helper to only load specific entry point

Useful for loading languages via entry points, so that they can be lazy-loaded. Otherwise, all entry point languages would have to be loaded upfront.

* Check entry points for custom languages
2018-12-20 23:58:43 +01:00
Matthew Honnibal f6ac00fab3 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-12-20 18:45:44 +01:00
Matthew Honnibal d8d27f9129 Set version to v2.1.0a5.dev0 2018-12-20 18:45:34 +01:00
Ines Montani 2dc6c52ccc Update displayed Binder version (see #3077) [ci skip] 2018-12-20 17:36:19 +01:00
Ines Montani ca244f5f84
Small fixes to displaCy (#3076)
## Description
- [x] fix auto-detection of Jupyter notebooks (even if `jupyter=True` isn't set)
- [x] add `displacy.set_render_wrapper` method to define a custom function called around the HTML markup generated in all calls to `displacy.render` (can be used to allow custom integrations, callbacks and page formatting)
- [x] add option to customise host for web server
- [x] show warning if `displacy.serve` is called from within Jupyter notebooks
- [x] move error message to `spacy.errors.Errors`.

### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-12-20 17:32:04 +01:00
Matthew Honnibal aeb59f6791 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-12-20 16:15:01 +01:00
Matthew Honnibal f57bea8ab6
💫 Prevent parser from predicting unseen classes (#3075)
The output weights often return negative scores for classes, especially
via the bias terms. This means that when we add a new class, we can't
rely on just zeroing the weights, or we'll end up with positive
predictions for those labels.

To solve this, we use nan values as the initial weights for new labels.
This prevents them from ever coming out on top. During backprop, we
replace the nan values with the minimum assigned score, so that we're
still able to learn these classes.
2018-12-20 16:12:22 +01:00
Matthew Honnibal 9ec9f89b99 💫 Raise better error when using uninitialized pipeline component (#3074)
After creating a component, the `.model` attribute is left with the value `True`, to indicate it should be created later during `from_disk()`, `from_bytes()` or `begin_training()`. This had led to confusing errors if you try to use the component without initializing the model.

To fix this, we add a method `require_model()` to the `Pipe` base class. The `require_model()` method needs to be called at the start of the `.predict()` and `.update()` methods of the components. It raises a `ValueError` if the model is not initialized. An error message has been added to `spacy.errors`.
2018-12-20 15:54:53 +01:00
Matthew Honnibal 1788bf1af7 Unbreak progress bar 2018-12-20 13:57:00 +01:00
Muhammad Irfan 2e84ec1513 Fixed ISO code for Urdu. (#3073) 2018-12-20 12:28:53 +01:00
Matthew Honnibal c315e08e6e Fix formatting of meta.json after spacy package 2018-12-19 14:36:08 +01:00
Matthew Honnibal b7ce85a6f3 Fix packaging of json schemas 2018-12-19 13:54:02 +01:00
Matthew Honnibal 35ff889852 Fix OSX wheel building 2018-12-19 13:14:57 +01:00
Matthew Honnibal e24f94ce39 Fix handling of preset entities. closes #2779 2018-12-19 02:13:31 +01:00
Matthew Honnibal faa8656582 Port parser fix for large label sets from master 2018-12-19 02:11:26 +01:00
Matthew Honnibal 99a84e4d0e Make ParserModel.resize_output idempotent 2018-12-19 02:10:36 +01:00
Matthew Honnibal 9fc8ce0c4d Add schemas to MANIFEST 2018-12-19 01:18:50 +01:00
Matthew Honnibal a2b75036e9 Try to make sure json schemas are packaged 2018-12-19 01:08:51 +01:00
Matthew Honnibal 0f83b98afa Remove unused code from spacy pretrain 2018-12-18 19:19:26 +01:00
Ken 5f0c5fbfa4 issue #3012: add test (#3021)
* issue #3012: add test

* add contributor aggreement

* Make test work without models and fix typos

ten.pos_ instead of ten.orth_ and comparison against "10" instead of integer 10
2018-12-18 15:02:49 +01:00
Ines Montani 77a47b2b20 Auto-format 2018-12-18 15:02:11 +01:00
Kirill Bulygin 2fb004832f Fix the first `nlp` call for `ja` (closes #2901) (#3065)
* Fix the first `nlp` call for `ja` (closes #2901)

* Add unicode declaration, formatting and use relative import
2018-12-18 15:01:06 +01:00
Kirill Bulygin 10189d9092 Fix the first `nlp` call for `ja` (closes #2901) (#3065)
* Fix the first `nlp` call for `ja` (closes #2901)

* Add unicode declaration, formatting and use relative import
2018-12-18 14:53:50 +01:00
Ines Montani ae880ef912 Tidy up merge conflict leftovers 2018-12-18 13:58:30 +01:00
Ines Montani 61d09c481b Merge branch 'master' into develop 2018-12-18 13:48:10 +01:00
Brixjohn 52f3c95004 Added alpha support for Tagalog language (#3062)
I have added alpha support for the Tagalog language from the Philippines. It is the basis for the country's national language Filipino. I have heavily based the format to the EN and ES languages.

I have provided several words in the lemmatizer lookup table, added stop words from a source, translated numeric words to its Tagalog counterpart, added some tokenizer exceptions, and kept the tag map the same as the English language.

While the alpha language passed the preliminary testing that you provided, I think it needs more data to be useful for most cases.

* Added alpha support for Tagalog language

* Edited contributor template

* Included SCA; Reverted templates

* Fixed SCA template

* Fixed changes in SCA template
2018-12-18 13:08:38 +01:00
Matthew Honnibal 92f4b9c8ea set max batch size to 1000 2018-12-17 23:15:39 +00:00
Matthew Honnibal 3c4a2edf4a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-12-17 23:08:40 +00:00
Matthew Honnibal 95fc0176d1 Pass tagger options in begin_training 2018-12-17 23:08:31 +00:00
Matthew Honnibal 7c504b6ddb Try to implement more losses for pretraining
* Try to implement cosine loss
This one seems to be correct? Still unsure, but it performs okay

* Try to implement the von Mises-Fisher loss
This one's definitely not right yet.
2018-12-17 14:48:27 +00:00