Commit Graph

726 Commits

Author SHA1 Message Date
Matthew Honnibal 20193371f5 Don't share CNN, to reduce complexities 2017-09-21 14:59:48 +02:00
Matthew Honnibal 24e85c2048 Pass values for CNN maxout pieces option 2017-09-20 19:16:12 -05:00
Matthew Honnibal 2489dcaccf Fix serialization of parser 2017-09-19 23:42:12 +02:00
Matthew Honnibal 2b0efc77ae Fix wiring of pre-trained vectors in parser loading 2017-09-17 05:47:34 -05:00
Matthew Honnibal 31c2e91c35 Fix wiring of pre-trained vectors in parser loading 2017-09-17 05:46:55 -05:00
Matthew Honnibal c003c561c3 Revert NER action loading change, for model compatibility 2017-09-17 05:46:03 -05:00
Matthew Honnibal 43210abacc Resolve fine-tuning conflict 2017-09-17 05:30:04 -05:00
Matthew Honnibal 5ff2491f24 Pass option for pre-trained vectors in parser 2017-09-16 12:47:21 -05:00
Matthew Honnibal 8665a77f48 Fix feature error in NER 2017-09-16 12:46:57 -05:00
Matthew Honnibal f730d07e4e Fix prange error for Windows 2017-09-16 00:25:33 +02:00
Matthew Honnibal 8b481e0465 Remove redundant brackets 2017-09-15 10:38:08 +02:00
Matthew Honnibal 8c503487af Fix lookup of missing NER actions 2017-09-14 16:59:45 +02:00
Matthew Honnibal 664c5af745 Revert padding in parser 2017-09-14 16:59:25 +02:00
Matthew Honnibal c6395b057a Improve parser feature extraction, for missing values 2017-09-14 16:18:02 +02:00
Matthew Honnibal daf869ab3b Fix add_action for NER, so labelled 'O' actions aren't added 2017-09-14 16:16:41 +02:00
Matthew Honnibal dd9cab0faf Fix type-check for int/long 2017-09-06 19:03:05 +02:00
Matthew Honnibal dcbf866970 Merge parser changes 2017-09-06 18:41:05 +02:00
Matthew Honnibal 24ff6b0ad9 Fix parsing and tok2vec models 2017-09-06 05:50:58 -05:00
Matthew Honnibal 33fa91feb7 Restore correctness of parser model 2017-09-04 21:19:30 +02:00
Matthew Honnibal 9d65d67985 Preserve model compatibility in parser, for now 2017-09-04 16:46:22 +02:00
Matthew Honnibal 789e1a3980 Use 13 parser features, not 8 2017-08-31 14:13:00 -05:00
Matthew Honnibal 4ceebde523 Fix gradient bug in parser 2017-08-30 17:32:56 -05:00
Matthew Honnibal 44589fb38c Fix Break oracle 2017-08-25 19:50:55 -05:00
Matthew Honnibal 20dd66ddc2 Constrain sentence boundaries to IS_PUNCT and IS_SPACE tokens 2017-08-25 19:35:47 +02:00
Matthew Honnibal 682346dd66 Restore optimized hidden_depth=0 for parser 2017-08-21 19:18:04 -05:00
Matthew Honnibal 62878e50db Fix misalignment caused by filtering inputs at wrong point in parser 2017-08-20 15:59:28 -05:00
Matthew Honnibal 84b7ed49e4 Ensure updates aren't made if no gold available 2017-08-20 14:41:38 +02:00
Matthew Honnibal ab28f911b4 Fix parser learning rates 2017-08-19 09:02:57 -05:00
Matthew Honnibal c307a0ffb8 Restore patches from nn-beam-parser to spacy/syntax 2017-08-18 22:38:59 +02:00
Matthew Honnibal 5f81d700ff Restore patches from nn-beam-parser to spacy/syntax 2017-08-18 22:23:03 +02:00
Matthew Honnibal d456d2efe1 Fix conflicts in nn_parser 2017-08-18 20:55:58 +02:00
Matthew Honnibal 1cec1efca7 Fix merge conflicts in nn_parser from beam stuff 2017-08-18 20:50:49 +02:00
Matthew Honnibal 426f84937f Resolve conflicts when merging new beam parsing stuff 2017-08-18 13:38:32 -05:00
Matthew Honnibal f75420ae79 Unhack beam parsing, moving it under options instead of global flags 2017-08-18 13:31:15 -05:00
Matthew Honnibal 0209a06b4e Update beam parser 2017-08-16 18:25:49 -05:00
Matthew Honnibal a6d8d7c82e Add is_gold_parse method to transition system 2017-08-16 18:24:09 -05:00
Matthew Honnibal 3533bb61cb Add option of 8 feature parse state 2017-08-16 18:23:27 -05:00
Matthew Honnibal 210f6d5175 Fix efficiency error in batch parse 2017-08-15 03:19:03 -05:00
Matthew Honnibal 23537a011d Tweaks to beam parser 2017-08-15 03:15:28 -05:00
Matthew Honnibal 500e92553d Fix memory error when copying scores in beam 2017-08-15 03:15:04 -05:00
Matthew Honnibal a8e4064dd8 Fix tensor gradient in parser 2017-08-15 03:14:36 -05:00
Matthew Honnibal e420e0366c Remove use of hash function in beam parser 2017-08-15 03:13:57 -05:00
Matthew Honnibal 52c180ecf5 Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit ea8de11ad5, reversing
changes made to 08e443e083.
2017-08-14 13:00:23 +02:00
Matthew Honnibal 0ae045256d Fix beam training 2017-08-13 18:02:05 -05:00
Matthew Honnibal 6a42cc16ff Fix beam parser, improve efficiency of non-beam 2017-08-13 12:37:26 +02:00
Matthew Honnibal 12de263813 Bug fixes to beam parsing. Learns small sample 2017-08-13 09:33:39 +02:00
Matthew Honnibal 17874fe491 Disable beam parsing 2017-08-12 19:35:40 -05:00
Matthew Honnibal 3e30712b62 Improve defaults 2017-08-12 19:24:17 -05:00
Matthew Honnibal 28e930aae0 Fixes for beam parsing. Not working 2017-08-12 19:22:52 -05:00
Matthew Honnibal c96d769836 Fix beam parse. Not sure if working 2017-08-12 18:21:54 -05:00
Matthew Honnibal 4638f4b869 Fix beam update 2017-08-12 17:15:16 -05:00
Matthew Honnibal d4308d2363 Initialize State offset to 0 2017-08-12 17:14:39 -05:00
Matthew Honnibal b353e4d843 Work on parser beam training 2017-08-12 14:47:45 -05:00
Matthew Honnibal cd5ecedf6a Try drop_layer in parser 2017-08-12 08:56:33 -05:00
Matthew Honnibal 1a59db1c86 Fix dropout and learn rate in parser 2017-08-12 05:44:39 -05:00
Matthew Honnibal d01dc3704a Adjust parser model 2017-08-09 20:06:33 -05:00
Matthew Honnibal f37528ef58 Pass embed size for parser fine-tune. Use SELU 2017-08-09 17:52:53 -05:00
Matthew Honnibal bbace204be Gate parser fine-tuning behind feature flag 2017-08-09 16:40:42 -05:00
Matthew Honnibal dbdd8afc4b Fix parser fine-tune training 2017-08-08 15:46:07 -05:00
Matthew Honnibal 88bf1cf87c Update parser for fine tuning 2017-08-08 15:34:17 -05:00
Matthew Honnibal 42bd26f6f3 Give parser its own tok2vec weights 2017-08-06 18:33:46 +02:00
Matthew Honnibal 78498a072d Return Transition for missing actions in lookup_action 2017-08-06 14:16:36 +02:00
Matthew Honnibal bfffdeabb2 Fix parser batch-size bug introduced during cleanup 2017-08-06 14:10:48 +02:00
Matthew Honnibal 7f876a7a82 Clean up some unused code in parser 2017-08-06 00:00:21 +02:00
Matthew Honnibal 8fce187de4 Fix ArcEager for missing values 2017-08-01 22:10:05 +02:00
Matthew Honnibal 27abc56e98 Add method to get beam entities 2017-07-29 21:59:02 +02:00
Matthew Honnibal c86445bdfd Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-07-22 01:14:28 +02:00
Matthew Honnibal 3da1063b36 Add beam decoding to parser, to allow NER uncertainties 2017-07-20 15:02:55 +02:00
Matthew Honnibal 0ca5832427 Improve negative example handling in NER oracle 2017-07-20 00:18:49 +02:00
Tpt 57e8254f63 Adds function to extract french noun chunks 2017-06-12 15:20:49 +02:00
Matthew Honnibal 6d0356e6cc Whitespace 2017-06-04 14:55:24 -05:00
ines 6669583f4e Use OrderedDict 2017-06-02 21:07:56 +02:00
ines 2f1025a94c Port over Spanish changes from #1096 2017-06-02 19:09:58 +02:00
ines fdd0923be4 Translate model=True in exclude to lower_model and upper_model 2017-06-02 18:37:07 +02:00
Matthew Honnibal 4c97371051 Fixes for thinc 6.7 2017-06-01 04:22:16 -05:00
Matthew Honnibal ae8010b526 Move weight serialization to Thinc 2017-06-01 02:56:12 -05:00
Matthew Honnibal 097ab9c6e4 Fix transition system to/from disk 2017-05-31 13:44:00 +02:00
Matthew Honnibal 33e5ec737f Fix to/from disk methods 2017-05-31 13:43:10 +02:00
Matthew Honnibal 53a3824334 Fix mistake in ner feature 2017-05-31 03:01:02 +02:00
Matthew Honnibal cc911feab2 Fix bug in NER state 2017-05-30 22:12:19 +02:00
Matthew Honnibal be4a640f0c Fix arc eager label costs for uint64 2017-05-30 20:37:58 +02:00
Matthew Honnibal aa4c33914b Work on serialization 2017-05-29 08:40:45 -05:00
Matthew Honnibal 59f355d525 Fixes for serialization 2017-05-29 13:38:20 +02:00
Matthew Honnibal ff26aa6c37 Work on to/from bytes/disk serialization methods 2017-05-29 11:45:45 +02:00
Matthew Honnibal 6b019b0540 Update to/from bytes methods 2017-05-29 10:14:20 +02:00
Matthew Honnibal 9239f06ed3 Fix german noun chunks iterator 2017-05-28 20:13:03 +02:00
Matthew Honnibal fd9b6722a9 Fix noun chunks iterator for new stringstore 2017-05-28 20:12:10 +02:00
Matthew Honnibal 7996d21717 Fixes for new StringStore 2017-05-28 11:09:27 -05:00
Matthew Honnibal 8a24c60c1e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-28 08:12:05 -05:00
Matthew Honnibal bc97bc292c Fix __call__ method 2017-05-28 08:11:58 -05:00
Matthew Honnibal 84e66ca6d4 WIP on stringstore change. 27 failures 2017-05-28 14:06:40 +02:00
Matthew Honnibal 39293ab2ee Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-28 11:46:57 +02:00
Matthew Honnibal dd052572d4 Update arc eager for SBD changes 2017-05-28 11:46:51 +02:00
Matthew Honnibal c1263a844b Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-27 18:32:57 -05:00
Matthew Honnibal 9e711c3476 Divide d_loss by batch size 2017-05-27 18:32:46 -05:00
Matthew Honnibal a1d4c97fb7 Improve correctness of minibatching 2017-05-27 17:59:00 -05:00
Matthew Honnibal 49235017bf Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-27 16:34:28 -05:00
Matthew Honnibal 7ebd26b8aa Use ordered dict to specify transitions 2017-05-27 15:52:20 -05:00
Matthew Honnibal 3eea5383a1 Add move_names property to parser 2017-05-27 15:51:55 -05:00
Matthew Honnibal 99316fa631 Use ordered dict to specify actions 2017-05-27 15:50:21 -05:00
Matthew Honnibal 655ca58c16 Clarifying change to StateC.clone 2017-05-27 15:49:37 -05:00
Matthew Honnibal 3d22fcaf0b Return None from parser if there are no annotations 2017-05-26 14:02:59 -05:00
Matthew Honnibal 3d5a536eaa Improve efficiency of parser batching 2017-05-26 11:31:23 -05:00
Matthew Honnibal 2cb7cc2db7 Remove commented code from parser 2017-05-25 14:55:09 -05:00
Matthew Honnibal c245ff6b27 Rebatch parser inputs, with mid-sentence states 2017-05-25 11:18:59 -05:00
Matthew Honnibal 679efe79c8 Make parser update less hacky 2017-05-25 06:49:00 -05:00
Matthew Honnibal e1cb5be0c7 Adjust dropout, depth and multi-task in parser 2017-05-24 20:11:41 -05:00
Matthew Honnibal 620df0414f Fix dropout in parser 2017-05-23 15:20:45 -05:00
Matthew Honnibal 8026c183d0 Add hacky logic to accelerate depth=0 case in parser 2017-05-23 11:06:49 -05:00
Matthew Honnibal a8b6d11c5b Support optional maxout layer 2017-05-23 05:58:07 -05:00
Matthew Honnibal c55b8fa7c5 Fix bugs in parse_batch 2017-05-23 05:57:52 -05:00
Matthew Honnibal 964707d795 Restore support for deeper networks in parser 2017-05-23 05:31:13 -05:00
Matthew Honnibal 6b918cc58e Support making updates periodically during training 2017-05-23 04:23:29 -05:00
Matthew Honnibal 3f725ff7b3 Roll back changes to parser update 2017-05-23 04:23:05 -05:00
Matthew Honnibal 3959d778ac Revert "Revert "WIP on improving parser efficiency""
This reverts commit 532afef4a8.
2017-05-23 03:06:53 -05:00
Matthew Honnibal 532afef4a8 Revert "WIP on improving parser efficiency"
This reverts commit bdaac7ab44.
2017-05-23 03:05:25 -05:00
Matthew Honnibal bdaac7ab44 WIP on improving parser efficiency 2017-05-23 02:59:31 -05:00
Matthew Honnibal 8a9e318deb Put the parsing loop in a nogil prange block 2017-05-22 17:58:12 -05:00
Matthew Honnibal e2136232f9 Exclude states with no matching gold annotations from parsing 2017-05-22 10:30:12 -05:00
Matthew Honnibal f00f821496 Fix pseudoprojectivity->nonproj 2017-05-22 06:14:42 -05:00
Matthew Honnibal 5d59e74cf6 PseudoProjectivity->nonproj 2017-05-22 05:49:53 -05:00
Matthew Honnibal b45b4aa392 PseudoProjectivity --> nonproj 2017-05-22 05:17:44 -05:00
Matthew Honnibal aae97f00e9 Fix nonproj import 2017-05-22 05:15:06 -05:00
Matthew Honnibal 2a5eb9f61e Make nonproj methods top-level functions, instead of class methods 2017-05-22 04:51:08 -05:00
Matthew Honnibal 33e2222839 Remove unused code in deprojectivize 2017-05-22 04:51:08 -05:00
Matthew Honnibal 025d9bbc37 Fix handling of non-projective deps 2017-05-22 04:51:08 -05:00
Matthew Honnibal 1b5fa68996 Do pseudo-projective pre-processing for parser 2017-05-22 04:51:08 -05:00
Matthew Honnibal 1d5d9838a2 Fix action collection for parser 2017-05-22 04:51:08 -05:00
Matthew Honnibal 3b7c108246 Pass tokvecs through as a list, instead of concatenated. Also fix padding 2017-05-20 13:23:32 -05:00
Matthew Honnibal d52b65aec2 Revert "Move to contiguous buffer for token_ids and d_vectors"
This reverts commit 3ff8c35a79.
2017-05-20 11:26:23 -05:00
Matthew Honnibal b272890a8c Try to move parser to simpler PrecomputedAffine class. Currently broken -- maybe the previous change 2017-05-20 06:40:10 -05:00
Matthew Honnibal 3ff8c35a79 Move to contiguous buffer for token_ids and d_vectors 2017-05-20 04:17:30 -05:00
Matthew Honnibal 8b04b0af9f Remove freqs from transition_system 2017-05-20 02:20:48 -05:00
Matthew Honnibal a1ba20e2b1 Fix over-run on parse_batch 2017-05-19 18:57:30 -05:00
Matthew Honnibal e84de028b5 Remove 'rebatch' op, and remove min-batch cap 2017-05-19 18:16:36 -05:00
Matthew Honnibal c12ab47a56 Remove state argument in pipeline. Other changes 2017-05-19 13:26:36 -05:00
Matthew Honnibal c2c825127a Fix use_params and pipe methods 2017-05-18 08:30:59 -05:00
Matthew Honnibal fc8d3a112c Add util.env_opt support: Can set hyper params through environment variables. 2017-05-18 04:36:53 -05:00
Matthew Honnibal d2626fdb45 Fix name error in nn parser 2017-05-18 04:31:01 -05:00
Matthew Honnibal 793430aa7a Get spaCy train command working with neural network
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
Matthew Honnibal 8cf097ca88 Redesign training to integrate NN components
* Obsolete .parser, .entity etc names in favour of .pipeline
* Components no longer create models on initialization
* Models created by loading method (from_disk(), from_bytes() etc), or
    .begin_training()
* Add .predict(), .set_annotations() methods in components
* Pass state through pipeline, to allow components to share information
    more flexibly.
2017-05-16 16:17:30 +02:00
Matthew Honnibal 5211645af3 Get data flowing through pipeline. Needs redesign 2017-05-16 11:21:59 +02:00
Matthew Honnibal a9edb3aa1d Improve integration of NN parser, to support unified training API 2017-05-15 21:53:27 +02:00
Matthew Honnibal 4b9d69f428 Merge branch 'v2' into develop
* Move v2 parser into nn_parser.pyx
* New TokenVectorEncoder class in pipeline.pyx
* New spacy/_ml.py module

Currently the two parsers live side-by-side, until we figure out how to
organize them.
2017-05-14 01:10:23 +02:00
Matthew Honnibal 5cac951a16 Move new parser to nn_parser.pyx, and restore old parser, to make tests pass. 2017-05-14 00:55:01 +02:00
Matthew Honnibal f8c02b4341 Remove cupy imports from parser, so it can work on CPU 2017-05-14 00:37:53 +02:00
Matthew Honnibal e6d71e1778 Small fixes to parser 2017-05-13 17:19:04 -05:00
Matthew Honnibal 188c0f6949 Clean up unused import 2017-05-13 17:18:27 -05:00
Matthew Honnibal f85c8464f7 Draft support of regression loss in parser 2017-05-13 17:17:27 -05:00
Matthew Honnibal 827b5af697 Update draft of parser neural network model
Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU.

Outline of the model:

We first predict context-sensitive vectors for each word in the input:

(embed_lower | embed_prefix | embed_suffix | embed_shape)
>> Maxout(token_width)
>> convolution ** 4
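
As a rough numpy sketch of the Maxout step in the stack above (sizes, scaling, and the number of pieces are illustrative assumptions, not the real configuration):

```python
import numpy as np

rng = np.random.default_rng(1)
width, pieces = 128, 3        # token_width and an assumed maxout-pieces setting
x = rng.normal(size=(width,))

# One weight matrix and bias per maxout piece (toy random init)
W = rng.normal(size=(pieces, width, width)) * 0.01
b = rng.normal(size=(pieces, width)) * 0.01

# Maxout: compute `pieces` candidate outputs, keep the element-wise max
candidates = np.einsum('pio,i->po', W, x) + b
out = candidates.max(axis=0)

assert out.shape == (width,)
```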

This convolutional layer is shared between the tagger and the parser. This prevents the parser from needing tag features.
To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this
by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a
representation that's one affine transform from this informative lexical information. This is obviously good for the
parser (which backprops to the convolutions too).

The parser model makes a state vector by concatenating the vector representations for its context tokens. Current
results suggest that a small number of context tokens works well. Maybe this is a bug.

The current context tokens:

* S0, S1, S2: Top three words on the stack
* B0, B1: First two words of the buffer
* S0L1, S0L2: Leftmost and second leftmost children of S0
* S0R1, S0R2: Rightmost and second rightmost children of S0
* S1L1, S1L2, S1R1, S1R2, B0L1, B0L2: Likewise for S1 and B0

This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately,
there's a way to structure the computation to save some expense (and make it more GPU friendly).

The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks
with a non-monotonic transition). A naive implementation would require 2*N (B, 13*T) @ (13*T, H) matrix multiplications
for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication, to pre-compute the hidden
weights for each positional feature wrt the words in the batch. (Note that our token vectors come from the CNN
-- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model
is so big.)
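
To make the trick concrete, here's a toy numpy check that the pre-computed form gives the same hidden vector as the naive concatenation (all sizes below are illustrative, not the real model's):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, F = 8, 6, 13            # token width, hidden width, number of features (toy sizes)
n_tokens = 10

tokens = rng.normal(size=(n_tokens, T))
# Hidden-layer weights: one (T, H) block per positional feature
W = rng.normal(size=(F, T, H))

# Naive computation for one state: concatenate the 13 feature tokens,
# then multiply by the full (13*T, H) weight matrix
feats = rng.integers(0, n_tokens, size=F)     # token index filling each feature slot
state_vec = tokens[feats].reshape(F * T)
direct = state_vec @ W.reshape(F * T, H)

# Pre-computation trick: one (n_tokens, T) @ (T, F*H) multiply up front...
precomputed = (tokens @ W.transpose(1, 0, 2).reshape(T, F * H)).reshape(n_tokens, F, H)
# ...then each state just gathers its per-feature rows and sums them
trick = precomputed[feats, np.arange(F)].sum(axis=0)

assert np.allclose(direct, trick)
```

The gather-and-sum per state is cheap, so it can happen on the CPU during transition-based parsing while the one big matrix multiplication stays on the GPU.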

This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity.
The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved
to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier.
We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle
in CUDA to train.

Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple actions to
be 0 cost. The gradient with respect to each score is:

(exp(score) / Z) - (exp(score) / gZ)

Where Z is the sum of exp(score) over all classes, gZ is the sum of exp(score) over the gold (zero-cost) classes,
and the second term applies only to the gold classes. I'm very interested in regressing on the cost directly,
but so far this isn't working well.
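
A toy sketch of this gradient with made-up scores and an assumed gold mask (the values are purely illustrative):

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.5, -1.0])          # hypothetical action scores
is_gold = np.array([True, True, False, False])    # zero-cost actions per the oracle

exp = np.exp(scores)
Z = exp.sum()
gZ = exp[is_gold].sum()

# Softmax over all classes, minus softmax restricted to the gold classes
d_scores = exp / Z - np.where(is_gold, exp / gZ, 0.0)

assert np.allclose(d_scores.sum(), 0.0)   # the gradient sums to zero
```

Non-gold actions get a positive gradient (their scores are pushed down), and the gold actions share a compensating negative mass.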

Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit
greatly from the pre-computation trick.
2017-05-12 16:09:15 -05:00