Commit Graph

1029 Commits

Author SHA1 Message Date
Calum Calder c6a0c1cc38
Fix typo in documentation for displacy Visualizer
The word_spacing variable affects the vertical spacing between the words and arcs, not the horizontal spacing.
2018-03-22 19:23:32 +00:00
Sebastin Santy 793d29904f
Update _similarity.jade 2018-03-22 03:51:38 +05:30
Sebastin Santy 720d2231f6
Update doc.jade 2018-03-22 03:13:23 +05:30
Ian Mckay 4fbd9897f4
drop should be a float 2018-03-21 23:16:56 +11:00
DuyguA ad598c66db added forgotten C for spaCy 2018-03-19 12:47:34 +01:00
Matthew Honnibal bede11b67c
Improve label management in parser and NER (#2108)
This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly.

Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important, to ensure that the label-to-class mapping is stable.

We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense.

To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort.
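
The negative-frequency scheme described above can be sketched roughly as follows. This is a minimal illustration, not the parser's actual code: the function names `add_new_label` and `ordered_labels`, the per-action counting of new labels, and the choice to sort by descending frequency are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical layout: action name -> label -> frequency.
Labels = Dict[str, Dict[str, int]]


def add_new_label(labels: Labels, action: str, label: str) -> None:
    """Record a label that has no observed frequency yet.

    New labels get "frequencies" -1, -2, -3, ... in the order they are added.
    """
    by_label = labels.setdefault(action, {})
    if label in by_label:
        return
    # Count how many labels under this action already carry a negative value.
    n_new = sum(1 for freq in by_label.values() if freq < 0)
    by_label[label] = -(n_new + 1)


def ordered_labels(labels: Labels, action: str) -> List[str]:
    """Return labels in a reproducible order (assumed: highest frequency first).

    Observed labels come first; new labels follow in insertion order,
    because -1 sorts ahead of -2 when frequencies are descending.
    """
    by_label = labels.get(action, {})
    items = sorted(by_label.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _freq in items]
```

Because the order depends only on the stored values, it survives a round trip through any serialization format that does not preserve dictionary order.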

Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training.

To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make.
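
As a rough sketch of the idea, frequencies can be accumulated from a generator so that only the counts stay in memory, and the cut-off is applied once the stream is exhausted. The helper names `count_labels` and `apply_cutoff` and the `min_freq` parameter are hypothetical, and the deprojectivization step is omitted here.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_labels(stream: Iterable[List[str]]) -> Counter:
    """Consume documents one at a time and count their labels."""
    freqs: Counter = Counter()
    for doc_labels in stream:
        # Each document is discarded after its labels are counted.
        freqs.update(doc_labels)
    return freqs


def apply_cutoff(freqs: Counter, min_freq: int) -> List[str]:
    """Keep only labels seen at least `min_freq` times, in sorted order."""
    return sorted(label for label, freq in freqs.items() if freq >= min_freq)


def example_stream() -> Iterator[List[str]]:
    """Stand-in for a corpus reader that yields one document's labels at a time."""
    yield ["nsubj", "dobj", "nsubj"]
    yield ["nsubj", "amod"]
    yield ["dobj"]
```

With the example stream, `apply_cutoff(count_labels(example_stream()), min_freq=2)` keeps `dobj` and `nsubj` and drops `amod`, which occurs only once.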

Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths.
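
The one-file-per-document approach might look something like the sketch below. It uses JSON from the standard library as a stand-in for msgpack so that the example has no dependencies; the function names `write_docs` and `iter_shuffled` are invented for illustration and are not the GoldCorpus API.

```python
import json
import random
from pathlib import Path
from typing import Dict, Iterable, Iterator, List, Optional


def write_docs(docs: Iterable[Dict], directory: Path) -> List[Path]:
    """Write each document to its own file and return the file paths."""
    paths = []
    for i, doc in enumerate(docs):
        path = directory / f"{i}.json"  # assumption: JSON instead of msgpack
        path.write_text(json.dumps(doc), encoding="utf8")
        paths.append(path)
    return paths


def iter_shuffled(paths: Iterable[Path], seed: Optional[int] = None) -> Iterator[Dict]:
    """Yield documents in random order by shuffling paths, not loaded data."""
    order = list(paths)
    random.Random(seed).shuffle(order)
    for path in order:
        # Only one document is loaded into memory at a time.
        yield json.loads(path.read_text(encoding="utf8"))
```

Only the list of paths is held in memory between epochs, so the shuffle costs the same regardless of how large the individual documents are.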

This is a squash merge, as I made a lot of very small commits. Individual commit messages below.

* Simplify label management for TransitionSystem and its subclasses

* Fix serialization for new label handling format in parser

* Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir

* Set actions in transition system

* Require thinc 6.11.1.dev4

* Fix error in parser init

* Add unicode declaration

* Fix unicode declaration

* Update textcat test

* Try to get model training on less memory

* Print json loc for now

* Try rapidjson to reduce memory use

* Remove rapidjson requirement

* Try rapidjson for reduced mem usage

* Handle None heads when projectivising

* Stream json docs

* Fix train script

* Handle projectivity in GoldParse

* Fix projectivity handling

* Add minibatch_by_words util from ud_train

* Minibatch by number of words in spacy.cli.train

* Move minibatch_by_words util to spacy.util

* Fix label handling

* More hacking at label management in parser

* Fix encoding in msgpack serialization in GoldParse

* Adjust batch sizes in parser training

* Fix minibatch_by_words

* Add merge_subtokens function to pipeline.pyx

* Register merge_subtokens factory

* Restore use of msgpack tmp directory

* Use minibatch-by-words in train

* Handle retokenization in scorer

* Change back-off approach for missing labels. Use 'dep' label

* Update NER for new label management

* Set NER tags for over-segmented words

* Fix label alignment in gold

* Fix label back-off for infrequent labels

* Fix int type in labels dict key

* Fix int type in labels dict key

* Update feature definition for 8 feature set

* Update ud-train script for new label stuff

* Fix json streamer

* Print the line number if conll eval fails

* Update children and sentence boundaries after deprojectivisation

* Export set_children_from_heads from doc.pxd

* Render parses during UD training

* Remove print statement

* Require thinc 6.11.1.dev6. Try adding wheel as install_requires

* Set different dev version, to flush pip cache

* Update thinc version

* Update GoldCorpus docs

* Remove print statements

* Fix formatting and links [ci skip]
2018-03-19 02:58:08 +01:00
Doug DesCombaz 6b1e4997e9
Fix typo ditectory -> directory 2018-03-15 10:08:50 -07:00
ines 7e80550f13 Remove alpha preview image [ci skip] 2018-03-09 13:32:08 +01:00
ines ad36b3d677 Add more model licenses to website [ci skip] 2018-03-09 13:31:23 +01:00
M. Willis Monroe b03948aaa5
Broken github link to NLTK 2018-03-06 16:22:46 -08:00
ines 29106ec740 Add "new" tag to is_currency [ci skip] 2018-02-18 14:16:26 +01:00
ines ca2fcad5a3 Add v2.1 tag to new arguments [ci skip] 2018-02-18 14:15:18 +01:00
ines 64f97adef1 Document new Matcher.pipe keyword args [ci skip]
See 1cf774bdc1
2018-02-18 14:13:58 +01:00
ines 61052df31f Document is_currency 2018-02-18 13:30:03 +01:00
Matthew Honnibal f9f46e5a07 Revert matcher fixes from GregDubbin 2018-02-18 10:59:28 +01:00
ines 612c79a4f5 Update first matcher example and match_id (resolves #1989) 2018-02-17 11:57:38 +01:00
ines ca56fb53d1 Add user survey to navigation [ci skip] 2018-02-15 12:14:30 +01:00
ines cab5b775e7 Document ENT_TYPE matcher attribute [ci skip] 2018-02-15 12:14:19 +01:00
Pradeep Kumar Tippa 416cd021ce Added TAG from spacy symbols, which is used below 2018-02-09 19:16:59 +05:30
Pradeep Kumar Tippa 01cc9cd9c0 assert statement syntax fix in doc 2018-02-09 19:16:25 +05:30
Pradeep Kumar Tippa a78062e466 Merge remote-tracking branch 'upstream/master' into web-doc-patches 2018-02-09 19:13:19 +05:30
ines ab33e274f5 Add more details on symlink error & Windows solution (resolves #1941) [ci skip] 2018-02-09 10:43:33 +01:00
ines 8eaa934382 Merge branch 'master' of https://github.com/explosion/spaCy 2018-02-09 10:23:36 +01:00
ines e9f67be04d Fix regex flag matcher example (resolves #1950) 2018-02-09 10:23:33 +01:00
ines fc4ae04c55 Document LENGTH attribute in matcher 2018-02-09 10:23:03 +01:00
Pradeep Kumar Tippa 8a7467b26e Merge remote-tracking branch 'upstream/master' into web-doc-patches 2018-02-09 13:54:26 +05:30
Orion Montoya 24af6375db
update link to Honnibal and Johnson 2015
aclweb.org is throwing a gateway timeout on the link as `https`+`aclweb.org`, but is fine with `https`+`www.aclweb.org` (also with `http`+`aclweb.org`, but let's keep it in `https`, shall we?)
2018-02-08 10:49:09 -08:00
Pradeep Kumar Tippa 03113d6779 Fixing navigating parse tree doc under dependency parse 2018-02-08 19:34:15 +05:30
ines a3b965b29d Remove UPPER from Matcher attributes docs (resolves #1949) 2018-02-08 11:29:27 +01:00
ines 696ae87b47 Fix whitespace 2018-02-08 11:28:54 +01:00
ines 26bc75134d Fix typo 2018-02-08 11:28:44 +01:00
Pradeep Kumar Tippa da9d687e75
Fixing typo from taining to training 2018-02-07 16:49:25 +05:30
Pradeep Kumar Tippa ed7d268e93
Fixing vocab doc
Replacing "like" with "love", coffee suffix should be "fee" but not "ffe"
2018-02-07 14:55:12 +05:30
ines f377c483e4 Add note on manual entity order in displaCy [ci skip] 2018-02-07 01:08:42 +01:00
ines 58eb178667 Update Doc.char_span docs [ci skip] 2018-02-07 01:08:30 +01:00
sayf eddine hammemi 86e7727855 Fix typo in the word build. 2018-02-04 20:48:45 +01:00
ines 901bc0e85f Add Persian to list of languages [ci skip] 2018-02-01 04:47:34 +01:00
Hassan Shamim a0b912c528 fix broken link to test suite models 2018-01-30 15:01:01 -08:00
greg daefed0a34 Correct documentation of '+' and '*' ops 2018-01-22 15:55:44 -05:00
ines 67ba73351d Fix typo and use better serialization example (resolves #1851) [ci skip] 2018-01-16 18:42:03 +01:00
ines 7943a8e90c Add spacy-lookup by @mpuig [ci skip] 2018-01-16 00:28:46 +01:00
ines 5684206154 Add LanguageCrunch by @artpar [ci skip] 2018-01-15 16:14:26 +01:00
Mateusz Tatusko dda0e58c11
Update _pos-tags.jade
really small changes to English tags description, but might help some people while working on projects
1) -PRB- should be -RRB- instead 
2) space gets tagged as _SP, and not SP
2018-01-15 12:01:51 +09:00
ines 0536e91564 Add note on Tagger.tag_names vs. Tagger.labels (see #1666) [ci skip] 2018-01-14 14:37:19 +01:00
ines bbee48080d Clarify hyperparameters and alias usage in spacy train (resolves #1838) [ci skip] 2018-01-14 14:32:50 +01:00
ines 4daba3abda Add regex section to rule-based matching docs (see #1567, #1833) [ci skip] 2018-01-14 14:22:13 +01:00
Ines Montani 36f426fe0a
Merge pull request #1808 from fucking-signup/master
Fix issue #1769
2018-01-12 21:12:02 +00:00
ines cfac5b955f Fix alignment issues with newsletter signup form 2018-01-12 22:06:44 +01:00
ines 65babd9e2e Fix typo, formatting and operator descriptions (resolves #1820) 2018-01-12 22:06:27 +01:00
Matthew Honnibal a2a06dce24
Merge pull request #1792 from explosion/feature-improve-model-download
💫 Improve model downloading and linking
2018-01-11 20:02:08 +01:00