5231 Commits

Author SHA1 Message Date
Matthew Honnibal 48de4ed49f Require thinc 6.6, and compile the nn_parser module 2017-05-14 01:20:28 +02:00
Matthew Honnibal 4b9d69f428 Merge branch 'v2' into develop
* Move v2 parser into nn_parser.pyx
* New TokenVectorEncoder class in pipeline.pyx
* New spacy/_ml.py module

Currently the two parsers live side-by-side, until we figure out how to
organize them.
2017-05-14 01:10:23 +02:00
Matthew Honnibal 5cac951a16 Move new parser to nn_parser.pyx, and restore old parser, to make tests pass. 2017-05-14 00:55:01 +02:00
Matthew Honnibal f8c02b4341 Remove cupy imports from parser, so it can work on CPU 2017-05-14 00:37:53 +02:00
Matthew Honnibal 613ba79e2e Fiddle with sizings for parser 2017-05-13 17:20:23 -05:00
Matthew Honnibal e6d71e1778 Small fixes to parser 2017-05-13 17:19:04 -05:00
Matthew Honnibal 188c0f6949 Clean up unused import 2017-05-13 17:18:27 -05:00
Matthew Honnibal f85c8464f7 Draft support of regression loss in parser 2017-05-13 17:17:27 -05:00
ines 1465c6c221 Add API docs for util functions 2017-05-13 21:23:12 +02:00
ines 144161c58c Update links to dev resources 2017-05-13 21:23:02 +02:00
ines 1694c24e52 Add docstrings, error messages and fix consistency 2017-05-13 21:22:49 +02:00
ines ee7dcf65c9 Fix expand_exc to make sure it returns combined dict 2017-05-13 21:22:25 +02:00
ines 824d09bb74 Move resolve_load_name to deprecated 2017-05-13 21:21:47 +02:00
ines 0095d5322b Update adding languages docs 2017-05-13 18:54:10 +02:00
ines a4a37a783e Remove import from non-existing module 2017-05-13 16:00:09 +02:00
ines 1d94c0e98a Update table of contents 2017-05-13 15:42:51 +02:00
ines a48e21755e Add section on testing language tokenizers 2017-05-13 15:39:27 +02:00
ines 5858857a78 Update languages list in conftest 2017-05-13 15:37:54 +02:00
ines 326e677882 Fix syntax highlighting colour of keyword 2017-05-13 15:37:43 +02:00
ines 9f004394aa Use thicker & round dotted lines in graphic 2017-05-13 15:37:28 +02:00
ines 2f54fefb5d Update adding languages docs 2017-05-13 14:54:58 +02:00
ines 9d85cda8e4 Fix models error message and use about.__docs_models__ (see #1051) 2017-05-13 13:05:47 +02:00
ines 6b942763f0 Tidy up imports 2017-05-13 13:04:40 +02:00
ines 3665acc0de Update adding languages docs 2017-05-13 12:39:36 +02:00
ines 8c2a0c026d Fix parse_tree test 2017-05-13 12:32:45 +02:00
ines 6129016e15 Replace deepcopy 2017-05-13 12:32:37 +02:00
ines df68bf45ce Set defaults for light and flat kwargs 2017-05-13 12:32:23 +02:00
ines b9dea345e5 Remove old import 2017-05-13 12:32:11 +02:00
ines 293ee359c5 Fix formatting 2017-05-13 12:32:06 +02:00
ines 2e4db1beb9 Fix formatting 2017-05-13 12:02:39 +02:00
ines 3454f2aca8 Update showcase 2017-05-13 03:32:03 +02:00
ines 4eefb288e3 Port over PR #1055 2017-05-13 03:25:32 +02:00
Matthew Honnibal ee1d35bdb0 Fix merge conflict 2017-05-13 03:20:19 +02:00
Matthew Honnibal b2540d2379 Merge Kengz's tree_print patch 2017-05-13 03:18:49 +02:00
ines 67726d1837 Update data model docs 2017-05-13 03:10:56 +02:00
ines 915b50c736 Update adding languages docs 2017-05-13 03:10:50 +02:00
ines 7f331eafcd Add SVG object 2017-05-13 03:10:41 +02:00
ines d5c83a5810 Fix image mixin to allow figure with no args 2017-05-13 03:10:35 +02:00
ines a74376dca9 Add flow chart graphics 2017-05-13 03:10:21 +02:00
Matthew Honnibal 827b5af697 Update draft of parser neural network model
Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU.

Outline of the model:

We first predict context-sensitive vectors for each word in the input:

(embed_lower | embed_prefix | embed_suffix | embed_shape)
>> Maxout(token_width)
>> convolution ** 4
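
The expression above reads as operator notation for combining layers: `|` concatenates the outputs of several layers, `>>` feeds one layer into the next, and `** 4` stacks four layers. A minimal pure-Python sketch of that notation follows. The `Layer` class and the toy feature functions are hypothetical stand-ins, not the library's implementation, and the real model works on whole sequences with learned weights (a real stack would also give each of the four layers its own weights).

```python
class Layer:
    """Hypothetical stand-in for a model layer: wraps a function returning a list of floats."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

    def __or__(self, other):
        # a | b: run both layers on the same input and concatenate their outputs.
        return Layer(lambda x: self(x) + other(x))

    def __rshift__(self, other):
        # a >> b: feed the output of a into b.
        return Layer(lambda x: other(self(x)))

    def __pow__(self, n):
        # a ** n: chain n applications. This toy layer is stateless, so the
        # same function is simply reused n times.
        stacked = self
        for _ in range(n - 1):
            stacked = stacked >> self
        return stacked


# Toy "embeddings": each maps a word to a one-element feature list.
embed_lower = Layer(lambda w: [float(len(w.lower()))])
embed_prefix = Layer(lambda w: [float(ord(w[0]))])
embed_suffix = Layer(lambda w: [float(ord(w[-1]))])
embed_shape = Layer(lambda w: [1.0 if w[0].isupper() else 0.0])

# Stand-ins for Maxout(token_width) and for one convolution layer.
mix = Layer(lambda v: [max(v), min(v)])
convolution = Layer(lambda v: [x + 1.0 for x in v])

# Python precedence matches the intended reading: ** binds tighter than >>,
# and the parenthesised concatenation is evaluated first.
model = (embed_lower | embed_prefix | embed_suffix | embed_shape) >> mix >> convolution ** 4

features = (embed_lower | embed_prefix | embed_suffix | embed_shape)("Apple")
token_vector = model("Apple")
```

The point of the sketch is only the composition order: four feature views are concatenated, mixed down to the token width, and then passed through four stacked layers.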

This convolutional layer is shared between the tagger and the parser. This prevents the parser from needing tag features.
To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this
by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a
representation that's one affine transform from this informative lexical information. This is obviously good for the
parser (which backprops to the convolutions too).

The parser model makes a state vector by concatenating the vector representations for its context tokens. Current
results suggest that a small number of context tokens works well. Maybe this is a bug.

The current context tokens:

* S0, S1, S2: Top three words on the stack
* B0, B1: First two words of the buffer
* S0L1, S0L2: Leftmost and second leftmost children of S0
* S0R1, S0R2: Rightmost and second rightmost children of S0
* S1L1, S1L2, S1R1, S1R2, B0L1, B0L2: Likewise for S1 and B0

This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately,
there's a way to structure the computation to save some expense (and make it more GPU friendly).

The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks
with a non-monotonic transition). A naive implementation would require 2*N (B, 13*T) @ (13*T, H) matrix multiplications
for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication, to pre-compute the hidden
weights for each positional feature with respect to the words in the batch. (Note that our token vectors come from the CNN
-- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model
is so big.)
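
The pre-computation works because multiplying a concatenation of token vectors by one tall matrix equals summing per-slot products, and each per-slot product depends only on the token, not on the parser state. A minimal pure-Python sketch with tiny made-up sizes follows. All names here (`W_slots`, `cache`, `hidden_naive`, `hidden_cached`) are illustrative assumptions, the batch dimension is folded into N, and missing context slots (for example an empty stack) are not handled.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


random.seed(0)
N, T, H, F = 5, 4, 3, 13  # tokens, token width, hidden width, feature slots

tokens = [[random.uniform(-1, 1) for _ in range(T)] for _ in range(N)]
# One (T, H) weight block per feature slot. Stacked vertically, the blocks
# form the (F*T, H) matrix used by the naive implementation.
W_slots = [[[random.uniform(-1, 1) for _ in range(H)] for _ in range(T)]
           for _ in range(F)]


def hidden_naive(context):
    """Concatenate the F context token vectors, then do (1, F*T) @ (F*T, H)."""
    state = [x for tok_id in context for x in tokens[tok_id]]
    W_full = [row for block in W_slots for row in block]
    return matmul([state], W_full)[0]


# Pre-computation: a single (N, T) @ (T, F*H) product, done once per sentence.
W_wide = [[W_slots[f][t][h] for f in range(F) for h in range(H)] for t in range(T)]
cache = matmul(tokens, W_wide)  # shape (N, F*H)


def hidden_cached(context):
    """Per state: look up F cached rows of width H and add them up."""
    out = [0.0] * H
    for f, tok_id in enumerate(context):
        for h in range(H):
            out[h] += cache[tok_id][f * H + h]
    return out


# A random parser state: which token fills each of the F feature slots.
context = [random.randrange(N) for _ in range(F)]
naive = hidden_naive(context)
cached = hidden_cached(context)
max_diff = max(abs(a - b) for a, b in zip(naive, cached))
```

After the one-off product, each visited state costs F lookups and additions of width H instead of a full matrix multiplication, which is why the per-state work can reasonably stay on the CPU.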

This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity.
The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved
to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier.
We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle
in CUDA to train.

Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple states to
be 0 cost. This is defined as:

(exp(score) / Z) - (exp(score) / gZ)

Where Z is the sum of exp(score) over all classes, and gZ is the same sum restricted to the gold classes. I'm very interested in regressing on the cost directly,
but so far this isn't working well.
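
One way to read the expression above is as the gradient of the loss -log(gZ / Z) with respect to the score of a gold class; for a non-gold class the second term drops out. The sketch below works under that reading. The function name, the choice to treat zero-cost classes as gold, and the max-subtraction for numerical stability are assumptions made for illustration, not details taken from the parser code.

```python
import math


def multilabel_log_loss_grad(scores, costs):
    """Gradient of -log(gZ / Z) with respect to each class score.

    scores: one float per class.
    costs:  oracle cost per class; classes with cost 0 are treated as gold.
    """
    is_gold = [c == 0 for c in costs]
    if not any(is_gold):
        raise ValueError("at least one class must have zero cost")
    # Subtract the max score before exponentiating; this leaves the ratios
    # unchanged and avoids overflow.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    Z = sum(exps)                                       # all classes
    gZ = sum(e for e, g in zip(exps, is_gold) if g)     # gold classes only
    return [e / Z - (e / gZ if g else 0.0) for e, g in zip(exps, is_gold)]


# Two zero-cost classes and one costly class.
grad = multilabel_log_loss_grad([2.0, 1.0, 0.5], [0, 0, 1])
# When every class is gold, the two terms cancel and there is nothing to learn.
flat = multilabel_log_loss_grad([0.0, 0.0], [0, 0])
```

The gradient always sums to zero: probability mass is pushed away from the costly classes and redistributed among the gold ones in proportion to their current scores, without forcing a choice between equally good classes.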

Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit
greatly from the pre-computation trick.
2017-05-12 16:09:15 -05:00
ines 19879cb693 Update alpha support docs 2017-05-12 15:57:49 +02:00
ines 1774cf5152 Fix light versions of colors 2017-05-12 15:57:42 +02:00
ines 63d79947c8 Update title in navigation 2017-05-12 15:40:43 +02:00
ines 531ee1373b Rename "Language models" to "Languages" in API 2017-05-12 15:38:56 +02:00
ines c4d2c3cac7 Update adding languages docs 2017-05-12 15:38:17 +02:00
ines c4857bc7db Remove unused argument 2017-05-12 15:37:54 +02:00
ines c13b3fa052 Add LEX_ATTRS 2017-05-12 15:37:45 +02:00
ines bca2ea9c72 Update Portuguese lexical attributes 2017-05-12 15:37:39 +02:00
ines 2f870123bf Fix formatting 2017-05-12 15:37:20 +02:00
ines ca65993d59 Add basic Polish Language class 2017-05-12 09:25:37 +02:00