Commit Graph

8522 Commits

Author SHA1 Message Date
Otto Sulin 4ec3f19e2b fixed stop words -> to-do lex_attrs.py 2018-03-23 22:18:17 +02:00
Justin DuJardin c7ff8ee66c Add contributor agreement 2018-03-23 13:11:56 -07:00
Justin DuJardin eef9430f07 Add example for visualizing word vectors with TensorBoard Projector
Use:

```bash
python vectors_tensorboard.py en_core_web_lg ./output_folder spaCy_large
```
2018-03-23 12:49:01 -07:00
Matthew Honnibal 85717f570c Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-23 20:30:42 +01:00
Matthew Honnibal 8902754f0b Fix vector loading for ud_train 2018-03-23 20:30:00 +01:00
Ines Montani 782ec6f4f2
Merge pull request #2131 from calumcalder/fix-displacy-docs-typo
Fix typo in documentation for displacy Visualizer
2018-03-23 13:03:00 +01:00
Xiaoquan Kong a71b99d7ff bugfix for global-variable-change-in-runtime related issue (#2135)
* Bugfix: setting pollution from spacy/cli/ud_train.py to whole package

* Add contributor agreement of howl-anderson
2018-03-23 11:36:38 +01:00
Calum Calder d000b4323a
Add contributor agreement 2018-03-22 19:29:22 +00:00
Calum Calder c6a0c1cc38
Fix typo in documentation for displacy Visualizer
The word_spacing variable affects the vertical spacing between the words and arcs, not the horizontal spacing.
2018-03-22 19:23:32 +00:00
Ines Montani c94139e436
Merge pull request #2126 from iann0036/patch-1
Add contributor doc
2018-03-22 09:04:17 +01:00
Ines Montani 40c444eaae
Merge pull request #2127 from SebastinSanty/docs-patch
Docs patch
2018-03-22 09:03:50 +01:00
Sebastin Santy 793d29904f
Update _similarity.jade 2018-03-22 03:51:38 +05:30
Ian Mckay c33d6ca360
Add contributor doc 2018-03-22 09:04:58 +11:00
Sebastin Santy 720d2231f6
Update doc.jade 2018-03-22 03:13:23 +05:30
Matthew Honnibal 044397e269 Support .gz and .tar.gz files in spacy init-model 2018-03-21 14:33:23 +01:00
Ines Montani 15eb54fecc
Merge pull request #2123 from iann0036/master
drop should be a float
2018-03-21 13:43:06 +01:00
Ian Mckay 4fbd9897f4
drop should be a float 2018-03-21 23:16:56 +11:00
Matthew Honnibal 49fbe2dfee Use thinc.openblas in spacy.syntax.nn_parser 2018-03-20 02:22:09 +01:00
Ines Montani d24190ee9b
Merge pull request #2116 from DuyguA/minor-enhancements
Minor enhancements
2018-03-19 16:32:37 +01:00
DuyguA f708d7443b added contractions to stopwords #2020 2018-03-19 14:06:39 +01:00
DuyguA ad598c66db added forgotten C for spaCy 2018-03-19 12:47:34 +01:00
Matthew Honnibal bede11b67c
Improve label management in parser and NER (#2108)
This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly.

Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important: it ensures that the label-to-class mapping is stable.

We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense.

To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort.
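A minimal sketch of the negative-frequency idea, not the actual spaCy code. The helper names `add_new_label` and `ordered_labels` and the action key `"LEFT"` are hypothetical, and sorting by descending frequency is an assumption made here so that observed labels come first and new labels follow in the order they were added.

```python
def add_new_label(labels, action, label):
    """Add a label under an action with the next negative pseudo-frequency."""
    by_label = labels.setdefault(action, {})
    if label not in by_label:
        # Lowest value seen so far; real frequencies are positive, so clamp to 0.
        lowest = min(by_label.values(), default=0)
        by_label[label] = min(lowest, 0) - 1
    return by_label[label]


def ordered_labels(labels, action):
    """Return labels in a reproducible order: frequent first, then new ones."""
    by_label = labels.get(action, {})
    # Assumption: descending frequency, with the label text as tie-breaker.
    return sorted(by_label, key=lambda label: (-by_label[label], label))


# Nested dict: keyed by action first, then by label. Values are frequencies.
labels = {"LEFT": {"nsubj": 10, "dobj": 7}}
for name in ["zeta", "alpha", "mid"]:
    add_new_label(labels, "LEFT", name)

print(labels["LEFT"])
# {'nsubj': 10, 'dobj': 7, 'zeta': -1, 'alpha': -2, 'mid': -3}
print(ordered_labels(labels, "LEFT"))
# ['nsubj', 'dobj', 'zeta', 'alpha', 'mid']
```

The new labels come back in insertion order rather than alphabetical order, so the class IDs stay the same after the dictionary is saved and reloaded.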

Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set to pre-process it for the deprojectivisation. This meant holding the whole training set in memory, which accounted for most of the memory required during training.

To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make.
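As a rough illustration only: the sketch below counts labels while consuming a stream, then applies a cut-off afterwards. The functions `stream_docs`, `count_labels` and `select_classes` are hypothetical names, the data is made up, and the deprojectivisation step itself is not modelled.

```python
from collections import Counter


def stream_docs():
    """Stand-in for a training stream: yields one document's labels at a time."""
    yield ["nsubj", "dobj", "nsubj"]
    yield ["nsubj", "amod"]
    yield ["dobj", "rare"]


def count_labels(doc_stream):
    """Accumulate label frequencies without keeping the documents around."""
    freqs = Counter()
    for doc_labels in doc_stream:
        freqs.update(doc_labels)
    return freqs


def select_classes(freqs, min_freq):
    """Keep labels at or above the cut-off, most frequent first."""
    kept = [label for label, freq in freqs.items() if freq >= min_freq]
    return sorted(kept, key=lambda label: (-freqs[label], label))


freqs = count_labels(stream_docs())
classes = select_classes(freqs, min_freq=2)
print(classes)
# ['nsubj', 'dobj']
```

Only the counter is held in memory here, never the full list of documents.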

Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths.
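The shuffle-by-paths idea can be sketched as follows. This is an assumption-laden illustration, not the GoldCorpus implementation: it uses JSON from the standard library in place of msgpack so that it runs without extra dependencies, and `write_docs` and `stream_shuffled` are hypothetical helpers.

```python
import json
import random
import tempfile
from pathlib import Path


def write_docs(docs, directory):
    """Write each document to its own file and return the list of paths."""
    paths = []
    for i, doc in enumerate(docs):
        path = Path(directory) / "{}.json".format(i)
        path.write_text(json.dumps(doc), encoding="utf8")
        paths.append(path)
    return paths


def stream_shuffled(paths, seed=None):
    """Shuffle the paths (cheap), then load one document at a time."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    for path in paths:
        yield json.loads(path.read_text(encoding="utf8"))


docs = [{"id": i, "words": ["doc", str(i)]} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = write_docs(docs, tmp_dir)
    loaded = list(stream_shuffled(paths, seed=0))

# Every document comes back exactly once, in whatever order the shuffle chose.
print(sorted(doc["id"] for doc in loaded))
```

Shuffling a list of paths costs almost nothing, whereas shuffling the documents themselves would require loading all of them first.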

This is a squash merge, as I made a lot of very small commits. Individual commit messages below.

* Simplify label management for TransitionSystem and its subclasses

* Fix serialization for new label handling format in parser

* Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir

* Set actions in transition system

* Require thinc 6.11.1.dev4

* Fix error in parser init

* Add unicode declaration

* Fix unicode declaration

* Update textcat test

* Try to get model training on less memory

* Print json loc for now

* Try rapidjson to reduce memory use

* Remove rapidjson requirement

* Try rapidjson for reduced mem usage

* Handle None heads when projectivising

* Stream json docs

* Fix train script

* Handle projectivity in GoldParse

* Fix projectivity handling

* Add minibatch_by_words util from ud_train

* Minibatch by number of words in spacy.cli.train

* Move minibatch_by_words util to spacy.util

* Fix label handling

* More hacking at label management in parser

* Fix encoding in msgpack serialization in GoldParse

* Adjust batch sizes in parser training

* Fix minibatch_by_words

* Add merge_subtokens function to pipeline.pyx

* Register merge_subtokens factory

* Restore use of msgpack tmp directory

* Use minibatch-by-words in train

* Handle retokenization in scorer

* Change back-off approach for missing labels. Use 'dep' label

* Update NER for new label management

* Set NER tags for over-segmented words

* Fix label alignment in gold

* Fix label back-off for infrequent labels

* Fix int type in labels dict key

* Fix int type in labels dict key

* Update feature definition for 8 feature set

* Update ud-train script for new label stuff

* Fix json streamer

* Print the line number if conll eval fails

* Update children and sentence boundaries after deprojectivisation

* Export set_children_from_heads from doc.pxd

* Render parses during UD training

* Remove print statement

* Require thinc 6.11.1.dev6. Try adding wheel as install_requires

* Set different dev version, to flush pip cache

* Update thinc version

* Update GoldCorpus docs

* Remove print statements

* Fix formatting and links [ci skip]
2018-03-19 02:58:08 +01:00
Matthew Honnibal 13c060b90c Merge branch 'master' of https://github.com/explosion/spaCy 2018-03-19 02:04:45 +01:00
Matthew Honnibal ff42b726c1 Fix unicode declaration on test 2018-03-19 02:04:24 +01:00
Ines Montani affc43ef61
Merge pull request #2102 from doug-descombaz/patch-1 (resolves #2103)
Fix typo ditectory -> directory
2018-03-19 02:02:29 +01:00
Matthew Honnibal 318c23d318 Increment thinc 2018-03-16 13:12:53 +01:00
Matthew Honnibal 7dc76c6ff6 Add test for textcat 2018-03-16 12:39:45 +01:00
Matthew Honnibal 3cdee79a0c Add depth argument for text classifier 2018-03-16 12:37:31 +01:00
Matthew Honnibal 13067095a1 Disable broken add-after-train in textcat 2018-03-16 12:33:33 +01:00
Matthew Honnibal 565ef8c4d8 Improve argument passing in textcat 2018-03-16 12:30:51 +01:00
Matthew Honnibal eb2a3c5971 Remove unused function 2018-03-16 12:30:33 +01:00
Matthew Honnibal 307d6bf6d3 Fix parser for Thinc 6.11 2018-03-16 10:59:31 +01:00
Matthew Honnibal 9a389c4490 Fix parser for Thinc 6.11 2018-03-16 10:38:13 +01:00
Matthew Honnibal 3cdfe1ee4d
Merge pull request #2104 from explosion/feature/single-thread
Update parser for Thinc 6.11.0
2018-03-16 04:28:56 +01:00
Matthew Honnibal 39c50225e8 Update thinc 2018-03-16 03:57:47 +01:00
Matthew Honnibal 7be561c8be Fix thinc requirement 2018-03-16 03:34:12 +01:00
Matthew Honnibal 53df6d867b Require new thinc 2018-03-16 03:20:01 +01:00
Matthew Honnibal 791631f433 Require thinc 6.11.0 2018-03-16 02:51:54 +01:00
Matthew Honnibal 648532d647 Don't assume blas methods are present 2018-03-16 02:48:20 +01:00
Matthew Honnibal e85dd038fe Merge remote-tracking branch 'origin/master' into feature/single-thread 2018-03-16 02:41:11 +01:00
doug 9bd899c0e9 Fixed typo for #2103 2018-03-15 10:22:54 -07:00
Doug DesCombaz 6b1e4997e9
Fix typo ditectory -> directory 2018-03-15 10:08:50 -07:00
Matthew Honnibal e3be3d65b3 Version as v2.0.10.dev0 2018-03-15 17:31:22 +01:00
ines f3f8bfc367 Add built-in factories for merge_entities and merge_noun_chunks
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 17:16:54 +01:00
Ines Montani 0d17377e8b
Merge pull request #2095 from DuyguA/quick-typo-fix (resolves #2063)
Quick typo fix
2018-03-15 00:29:56 +01:00
ines d854f69fe3 Add built-in factories for merge_entities and merge_noun_chunks
Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).
2018-03-15 00:18:51 +01:00
ines 9ad5df41fe Fix whitespace 2018-03-15 00:11:18 +01:00
Matthew Honnibal d7ce6527fb Use increasing batch sizes in ud-train 2018-03-14 20:15:28 +01:00
alldefector f4e5904fc2 Fix Spanish noun_chunks failure caused by typo 2018-03-14 17:03:17 +01:00
Thomas Opsomer fbf48b3f9f lemma property to return hash instead of unicode 2018-03-14 17:03:00 +01:00