Commit Graph

9855 Commits

Author SHA1 Message Date
Ines Montani c9bd0e5a96 Set version to 2.1.2 2019-03-22 13:44:47 +01:00
Matthew Honnibal e65b5bb9a0 Fix tokenizer on Python2.7 (#3460)
spaCy v2.1 switched to the built-in re module, where v2.0 had been using
the third-party regex library. When the tokenizer was deserialized on
Python2.7, the `re.compile()` function was called with expressions that
featured escaped unicode codepoints that were not in Python2.7's unicode
database.

Problems occurred when we had a range between two of these unknown
codepoints, like this:

```
    '[\\uAA77-\\uAA79]'
```

On Python2.7, the unknown codepoints are not unescaped correctly,
resulting in arbitrary out-of-range characters being matched by the
expression.

This problem does not occur if we instead have a range between two
unicode literals, rather than the escape sequences. To fix the bug, we
therefore add a new compat function that unescapes unicode sequences
using the `ast.literal_eval()` function. Care is taken to ensure we
do not also escape non-unicode sequences.

Closes #3356.

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-22 13:42:47 +01:00
Ines Montani c81923ee30 Update wasabi pin 2019-03-22 13:31:58 +01:00
Ines Montani 188ccd5750 Fix xfail marker 2019-03-22 12:54:14 +01:00
Matthew Honnibal d811c97da1 Fix test that caused pytest to choke on Python3 2019-03-22 10:28:51 +01:00
Matthew Honnibal a2ad9832e5 Add failing test for #3356 2019-03-22 02:42:37 +01:00
Matthew Honnibal 7ec64a36fd
Merge pull request #3455 from explosion/bugfix/fix-en-tag-map
💫 Bring English tag_map in line with UD Treebank
2019-03-21 21:19:30 +01:00
Matthew Honnibal c66bd61e88 Fix lemmas 2019-03-21 14:22:12 +01:00
Matthew Honnibal 04395ffa49 Bring English tag_map in line with UD Treebank
I wrote a small script to read the UD English training data and check
that our tag map and morph rules were resulting in the best POS map.
This hadn't been done for some time, and there have been various changes
to the UD schema since it has been done. After these changes we should
see much better agreement between our POS assignments and the UD POS
tags.
2019-03-21 13:53:44 +01:00
Ines Montani 0c82a5ddb2 Merge branch 'master' of https://github.com/explosion/spaCy 2019-03-21 10:23:56 +01:00
Ines Montani 0712efc6b3 Update version requirements [ci skip] 2019-03-21 10:23:54 +01:00
Matthew Honnibal 4e3ed2ea88 Add -t2v argument to train_textcat script 2019-03-20 23:05:42 +01:00
Ines Montani 764359c952 Merge branch 'master' into spacy.io 2019-03-20 17:24:28 +01:00
Ines Montani dac8f8ff99 Update Span.__init__ docs (see #3445) [ci skip] 2019-03-20 17:24:17 +01:00
Matthew Honnibal c7f26abe5f
Merge pull request #3434 from Bharat123rox/narrow-unicode
Raise Error for a narrow unicode build of Python
2019-03-20 12:19:52 +01:00
Matthew Honnibal 1c8ff59185
Merge pull request #3441 from explosion/fix/cli-ud-scripts
💫 Move UD scripts to bin
2019-03-20 12:19:15 +01:00
Matthew Honnibal 72889a16d5 Fix similarity calculation if vectors are on GPU (#3440) 2019-03-20 12:09:59 +01:00
Matthew Honnibal 1612990e88 Implement cosine loss for spacy pretrain. Make default 2019-03-20 11:06:58 +00:00
Ines Montani ae5b4d0e84 Fix formatting (hopefully also restarts build properly) 2019-03-20 09:55:45 +01:00
Ines Montani 6abc1ddb26 Update __main__.py 2019-03-20 09:43:26 +01:00
Bharat123Rox f2547f02d6 Made changes suggested by @ines 2019-03-20 07:43:19 +05:30
Ines Montani 7400c7f8a7 Move UD scripts to bin 2019-03-20 01:19:34 +01:00
Ines Montani 685fff40cf Revert "Add --always-link flag to cli.download (see #3435)"
This reverts commit 583a566843.
2019-03-20 01:03:40 +01:00
Matthew Honnibal 6cfbb2d34e Merge branch 'master' of https://github.com/explosion/spaCy 2019-03-20 00:59:54 +01:00
Matthew Honnibal 5a53e9358a Set version to 2.1.1 2019-03-20 00:59:45 +01:00
Matthew Honnibal 02d7b41893 Fix GPU installation. Closes #3437 2019-03-20 00:59:27 +01:00
Ines Montani 583a566843 Add --always-link flag to cli.download (see #3435) 2019-03-19 22:03:27 +01:00
Bharat123Rox b5f077dcf4 Sign the Contributor Agreement and update details 2019-03-19 23:07:54 +05:30
Bharat123Rox 6db1ddd9c7 Raise ValueError for narrow unicode build 2019-03-19 23:02:58 +05:30
Ines Montani 1aff3ad770 Update netlify.toml 2019-03-19 14:49:35 +01:00
Ines Montani f7b5ff7907 Move netlify.toml to root 2019-03-19 14:40:14 +01:00
Ines Montani c6ee030721 Fix docsearch 2019-03-19 14:38:49 +01:00
Ines Montani 0155083e01 Update netlify.toml 2019-03-19 14:07:00 +01:00
Mehdi Hamoumi 9211f30ee3 Tiny correction in french lookup dictionary (#3427) 2019-03-19 13:00:19 +01:00
Ines Montani d4eed4a84f Add note on unicode build to troubleshooting guide (see #3421) [ci skip] 2019-03-19 10:27:02 +01:00
Ines Montani 42d4b818e4 Redirect Netlify URL 2019-03-19 10:17:56 +01:00
Ines Montani 1ee97bc282 Add page title fallback, just in case 2019-03-18 18:58:55 +01:00
Ines Montani 728ae7651b Fix universe page titles if no separate title is set 2019-03-18 18:58:46 +01:00
Ines Montani a20d3772fd FIx responsive landing 2019-03-18 16:24:52 +01:00
Ines Montani 08284f3a11
💫 v2.1.0 launch updates (only merge on launch!) (#3414)
* Update README.md

* Use production docsearch [ci skip]

* Add option to exclude pages from search
2019-03-18 16:07:26 +01:00
Ines Montani f0c1efcb00 Set version to 2.1.0 2019-03-17 22:42:58 +01:00
Matthew Honnibal 47e110375d Fix jsonl to json conversion (#3419)
* Fix spacy.gold.docs_to_json function

* Fix jsonl2json converter
2019-03-17 22:12:54 +01:00
Matthew Honnibal 0a4b074184 Improve beam search defaults 2019-03-17 21:47:45 +01:00
Ines Montani 226db621d0 Strip out .dev versions in spacy validate [ci skip] 2019-03-17 12:16:53 +01:00
Ines Montani a611b32fbf Update model docs [ci skip] 2019-03-17 11:48:18 +01:00
Matthew Honnibal c6be9964ec Set version to v2.1.0.dev1 2019-03-16 21:47:41 +01:00
Matthew Honnibal 61617c64d5 Revert changes to optimizer default hyper-params (WIP) (#3415)
While developing v2.1, I ran a bunch of hyper-parameter search
experiments to find settings that performed well for spaCy's NER and
parser. I ended up changing the default Adam settings from beta1=0.9,
beta2=0.999, eps=1e-8 to beta1=0.8, beta2=0.8, eps=1e-5. This was giving
a small improvement in accuracy (like, 0.4%).

Months later, I run the models with Prodigy, which uses beam-search
decoding even when the model has been trained with a greedy objective.
The new models performed terribly...So, wtf? After a couple of days
debugging, I figured out that the new optimizer settings was causing the
model to converge to solutions where the top-scoring class often had
a score of like, -80. The variance on the weights had gone up
enormously. I guess I needed to update the L2 regularisation as well?

Anyway. Let's just revert the change --- if the optimizer is finding
such extreme solutions, that seems bad, and not nearly worth the small
improvement in accuracy.

Currently training a slate of models, to verify the accuracy change is minimal.
Once the training is complete, we can merge this.

<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-16 21:39:02 +01:00
Matthew Honnibal 62afa64a8d Expose batch size and length caps on CLI for pretrain (#3417)
Add and document CLI options for batch size, max doc length, min doc length for `spacy pretrain`.

Also improve CLI output.

Closes #3216 

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-16 21:38:45 +01:00
Matthew Honnibal 58d562d9b0
Merge pull request #3416 from explosion/feature/improve-beam
Improve beam search support
2019-03-16 18:42:18 +01:00
Ines Montani 2c5dd4d602 Update Vectors.find docs [ci skip] 2019-03-16 17:10:57 +01:00