Commit Graph

9878 Commits

Author SHA1 Message Date
Ines Montani 1e5b917d75 Fix formatting [ci skip] 2019-03-23 16:45:50 +01:00
Matthew Honnibal 6c783f8045 Bug fixes and options for TextCategorizer (#3472)
* Fix code for bag-of-words feature extraction

The _ml.py module had two copies of a function to extract unigram
bag-of-words features, one of which had a bug that set values to 0.
A third function allowed extraction of bigram features. Replace all three
with a new function that supports arbitrary ngram sizes and also allows
control of which attribute is used (e.g. ORTH, LOWER, etc).

* Support 'bow' architecture for TextCategorizer

This allows efficient ngram bag-of-words models, which are better when
the classifier needs to run quickly, especially when the texts are long.
Pass architecture="bow" to use it. The extra arguments ngram_size and
attr are also available, e.g. ngram_size=2 means unigram and bigram
features will be extracted.

* Fix size limits in train_textcat example

* Explain architectures better in docs
2019-03-23 16:44:44 +01:00
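
The ngram extraction described in 6c783f8045 can be illustrated with a minimal sketch. This is not the code from _ml.py: the function name `extract_ngram_counts` and the `lowercase` flag (a stand-in for choosing between attributes such as ORTH and LOWER) are hypothetical, and plain strings are used instead of spaCy tokens. It only shows what "ngram_size=2 means unigram and bigram features" amounts to.

```python
from collections import Counter


def extract_ngram_counts(tokens, ngram_size=1, lowercase=False):
    """Count all ngrams of length 1..ngram_size in a list of token strings.

    Hypothetical helper for illustration only; `lowercase` stands in for
    selecting a token attribute such as LOWER instead of ORTH.
    """
    if ngram_size < 1:
        raise ValueError("ngram_size must be at least 1")
    if lowercase:
        tokens = [token.lower() for token in tokens]
    counts = Counter()
    # Sizes are cumulative: ngram_size=2 yields unigrams and bigrams.
    for n in range(1, ngram_size + 1):
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


# Toy input, not taken from the commit.
features = extract_ngram_counts(["The", "cat", "the"], ngram_size=2, lowercase=True)
```

Keying the counts by tuples keeps unigrams and bigrams in one table without collisions; a real implementation would more likely hash the attribute IDs into a fixed-size feature space.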
Ines Montani 06bf130890 💫 Add better and serializable sentencizer (#3471)
* Add better serializable sentencizer component

* Replace default factory

* Add tests

* Tidy up

* Pass test

* Update docs
2019-03-23 15:45:02 +01:00
Matthew Honnibal d9a07a7f6e 💫 Fix class mismap on parser deserializing (closes #3433) (#3470)
v2.1 introduced a regression when deserializing the parser after
parser.add_label() had been called. The code around the class mapping is
pretty confusing currently, as it was written to accommodate backwards
model compatibility. It needs to be revised when the models are next
retrained.

Closes #3433
2019-03-23 13:46:25 +01:00
Matthew Honnibal 444a3abfe5 Add xfail test for #3433. Improve test for add label. 2019-03-23 12:36:00 +01:00
Ines Montani 6b6e9b638e Fix test for #3468 2019-03-23 11:24:29 +01:00
Ines Montani fbec72b4c3 Slightly modify test for #3468
Check for Token.is_sent_start first (which is serialized/deserialized correctly)
2019-03-23 11:22:44 +01:00
Ines Montani 02d9378d8c Add xfailing test for #3468 2019-03-23 11:19:11 +01:00
Ines Montani ed91592726 Merge branch 'master' into spacy.io 2019-03-22 19:02:26 +01:00
Ines Montani dcd6e06c47 Improve landing example [ci skip] 2019-03-22 19:02:15 +01:00
Ines Montani c2bb39dcb4 Merge branch 'master' into spacy.io 2019-03-22 18:50:16 +01:00
Ines Montani a841324034 Update landing example [ci skip] 2019-03-22 18:50:00 +01:00
Ines Montani a9ad735241 Merge branch 'master' into spacy.io 2019-03-22 18:36:28 +01:00
Ines Montani b532386a60 Fix typo [ci skip] 2019-03-22 18:36:17 +01:00
Ines Montani 7b5496027b Merge branch 'master' into spacy.io 2019-03-22 18:21:16 +01:00
Ines Montani d8533f0149 Update Binder [ci skip] 2019-03-22 18:16:46 +01:00
Matthew Honnibal 4c5f265884 Fix train loop for train_textcat example 2019-03-22 16:10:11 +01:00
Ines Montani 680eafab94 Merge branch 'master' into spacy.io 2019-03-22 15:17:51 +01:00
Christos Aridas 9cee3f702a Add missing space in landing page (#3462) [ci skip] 2019-03-22 15:17:35 +01:00
Ines Montani 5073ce63fd Merge branch 'spacy.io' [ci skip] 2019-03-22 15:17:11 +01:00
Ines Montani c9bd0e5a96 Set version to 2.1.2 2019-03-22 13:44:47 +01:00
Matthew Honnibal e65b5bb9a0 Fix tokenizer on Python2.7 (#3460)
spaCy v2.1 switched to the built-in re module, where v2.0 had been using
the third-party regex library. When the tokenizer was deserialized on
Python2.7, the `re.compile()` function was called with expressions that
featured escaped unicode codepoints that were not in Python2.7's unicode
database.

Problems occurred when we had a range between two of these unknown
codepoints, like this:

```
    '[\\uAA77-\\uAA79]'
```

On Python2.7, the unknown codepoints are not unescaped correctly,
resulting in arbitrary out-of-range characters being matched by the
expression.

This problem does not occur if we instead have a range between two
unicode literals, rather than the escape sequences. To fix the bug, we
therefore add a new compat function that unescapes unicode sequences
using the `ast.literal_eval()` function. Care is taken to ensure we
do not also unescape non-unicode sequences.

Closes #3356.

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-22 13:42:47 +01:00
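
The unescaping approach described in e65b5bb9a0 can be sketched as follows. This is an illustration written for Python 3, not the compat function added by the commit: the name `unescape_unicode_escapes` and the regular expression are assumptions. It replaces only `\uXXXX` and `\UXXXXXXXX` sequences with the literal characters and leaves other backslash sequences such as `\d` untouched.

```python
import ast
import re

# Matches only escaped unicode codepoints: \uXXXX or \UXXXXXXXX.
_UNICODE_ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}")


def unescape_unicode_escapes(pattern):
    """Turn escaped unicode codepoints in a pattern into literal characters.

    Hypothetical helper for illustration. Known limitation: an escaped
    backslash directly followed by "uXXXX" is not treated specially.
    """
    def to_literal(match):
        # literal_eval only evaluates the string literal; it cannot run code.
        return ast.literal_eval('u"' + match.group(0) + '"')

    return _UNICODE_ESCAPE.sub(to_literal, pattern)


# The range from the commit message, now between two unicode literals.
char_range = unescape_unicode_escapes("[\\uAA77-\\uAA79]")
compiled = re.compile(char_range)
```

Matching each escape individually is one way to make sure non-unicode sequences pass through unchanged; the commit message does not say how the actual function achieves this.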
Ines Montani c81923ee30 Update wasabi pin 2019-03-22 13:31:58 +01:00
Ines Montani 188ccd5750 Fix xfail marker 2019-03-22 12:54:14 +01:00
Ines Montani 7dd5e2f564 Update v2-1.md 2019-03-22 12:43:23 +01:00
Matthew Honnibal d811c97da1 Fix test that caused pytest to choke on Python3 2019-03-22 10:28:51 +01:00
Matthew Honnibal a2ad9832e5 Add failing test for #3356 2019-03-22 02:42:37 +01:00
Matthew Honnibal 7ec64a36fd Merge pull request #3455 from explosion/bugfix/fix-en-tag-map
💫 Bring English tag_map in line with UD Treebank
2019-03-21 21:19:30 +01:00
Matthew Honnibal c66bd61e88 Fix lemmas 2019-03-21 14:22:12 +01:00
Matthew Honnibal 04395ffa49 Bring English tag_map in line with UD Treebank
I wrote a small script to read the UD English training data and check
that our tag map and morph rules were resulting in the best POS map.
This hadn't been done for some time, and there have been various changes
to the UD schema since it was last done. After these changes we should
see much better agreement between our POS assignments and the UD POS
tags.
2019-03-21 13:53:44 +01:00
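
The kind of check described in 04395ffa49 might look roughly like the sketch below. The script itself is not part of the commit message, so everything here is an assumption: the function names, the `(fine_tag, ud_pos)` input format, and the toy data. The idea is to find, for each fine-grained tag, the UD POS tag it most often co-occurs with in the training data and report where the tag map disagrees.

```python
from collections import Counter, defaultdict


def best_pos_per_tag(tagged_tokens):
    """Map each fine-grained tag to its most frequent UD POS tag."""
    counts = defaultdict(Counter)
    for fine_tag, ud_pos in tagged_tokens:
        counts[fine_tag][ud_pos] += 1
    return {tag: pos_counts.most_common(1)[0][0] for tag, pos_counts in counts.items()}


def find_disagreements(tag_map, tagged_tokens):
    """Return {tag: (mapped_pos, best_pos)} where the tag map is not optimal."""
    best = best_pos_per_tag(tagged_tokens)
    return {
        tag: (tag_map.get(tag), pos)
        for tag, pos in best.items()
        if tag_map.get(tag) != pos
    }


# Toy data for illustration; not real UD English counts.
toy_tokens = [("NN", "NOUN"), ("NN", "NOUN"), ("IN", "ADP"),
              ("IN", "ADP"), ("IN", "SCONJ"), ("TO", "PART")]
toy_tag_map = {"NN": "NOUN", "IN": "ADP", "TO": "ADP"}
disagreements = find_disagreements(toy_tag_map, toy_tokens)
```

A majority vote per tag ignores morph rules, which the commit message says were also checked, so this covers only the tag map half of the comparison.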
Ines Montani 375fbf3586 Update v2-1.md 2019-03-21 12:29:08 +01:00
Ines Montani 9394ca1f29 Update index.md 2019-03-21 10:24:55 +01:00
Ines Montani 0c82a5ddb2 Merge branch 'master' of https://github.com/explosion/spaCy 2019-03-21 10:23:56 +01:00
Ines Montani 0712efc6b3 Update version requirements [ci skip] 2019-03-21 10:23:54 +01:00
Matthew Honnibal 4e3ed2ea88 Add -t2v argument to train_textcat script 2019-03-20 23:05:42 +01:00
Ines Montani 764359c952 Merge branch 'master' into spacy.io 2019-03-20 17:24:28 +01:00
Ines Montani dac8f8ff99 Update Span.__init__ docs (see #3445) [ci skip] 2019-03-20 17:24:17 +01:00
Matthew Honnibal c7f26abe5f Merge pull request #3434 from Bharat123rox/narrow-unicode
Raise Error for a narrow unicode build of Python
2019-03-20 12:19:52 +01:00
Matthew Honnibal 1c8ff59185 Merge pull request #3441 from explosion/fix/cli-ud-scripts
💫 Move UD scripts to bin
2019-03-20 12:19:15 +01:00
Matthew Honnibal 72889a16d5 Fix similarity calculation if vectors are on GPU (#3440) 2019-03-20 12:09:59 +01:00
Matthew Honnibal 1612990e88 Implement cosine loss for spacy pretrain. Make default 2019-03-20 11:06:58 +00:00
Ines Montani ae5b4d0e84 Fix formatting (hopefully also restarts build properly) 2019-03-20 09:55:45 +01:00
Ines Montani 6abc1ddb26 Update __main__.py 2019-03-20 09:43:26 +01:00
Bharat123Rox f2547f02d6 Made changes suggested by @ines 2019-03-20 07:43:19 +05:30
Ines Montani 7400c7f8a7 Move UD scripts to bin 2019-03-20 01:19:34 +01:00
Ines Montani 685fff40cf Revert "Add --always-link flag to cli.download (see #3435)"
This reverts commit 583a566843.
2019-03-20 01:03:40 +01:00
Matthew Honnibal 6cfbb2d34e Merge branch 'master' of https://github.com/explosion/spaCy 2019-03-20 00:59:54 +01:00
Matthew Honnibal 5a53e9358a Set version to 2.1.1 2019-03-20 00:59:45 +01:00
Matthew Honnibal 02d7b41893 Fix GPU installation. Closes #3437 2019-03-20 00:59:27 +01:00
Ines Montani 583a566843 Add --always-link flag to cli.download (see #3435) 2019-03-19 22:03:27 +01:00