Commit Graph

192 Commits

Author SHA1 Message Date
Matthew Honnibal 1cb2f15d65 Clean up unused predict_confidences function 2017-08-16 18:22:26 -05:00
Matthew Honnibal 52c180ecf5 Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit ea8de11ad5, reversing
changes made to 08e443e083.
2017-08-14 13:00:23 +02:00
Matthew Honnibal 3e30712b62 Improve defaults 2017-08-12 19:24:17 -05:00
Matthew Honnibal 680043ebca Improve efficiency of tagger.set_annotations for GPU 2017-08-12 08:54:21 -05:00
Matthew Honnibal 3cb8f06881 Fix NeuralLabeller 2017-08-06 14:15:14 +02:00
Matthew Honnibal e9ab800e15 Fix tagging model 2017-08-06 01:50:08 +02:00
Matthew Honnibal 468c138ab3 WIP: Add fine-tuning logic to tagger model, re #1182 2017-08-06 01:13:23 +02:00
Matthew Honnibal 6780132821 Fix tagger loading 2017-07-25 19:41:11 +02:00
Matthew Honnibal c4a81a47a4 Fix deserialization 2017-07-23 14:11:07 +02:00
Matthew Honnibal 4fe77bced2 Add cfg attr to pipeline components 2017-07-23 00:52:47 +02:00
Matthew Honnibal a88a7deffe Five save/load of textcat config 2017-07-23 00:33:43 +02:00
Matthew Honnibal b55714d5d1 Make gold_tuples arg optional in begin_training 2017-07-22 20:04:43 +02:00
Matthew Honnibal b3a749610e Fix name of TextCategorizer 2017-07-22 01:14:07 +02:00
Matthew Honnibal a231b56d40 Add text-classification hook to pipeline 2017-07-20 00:18:15 +02:00
Matthew Honnibal d59fa32df1 Add experimental SimilarityHook omponent 2017-06-05 15:40:03 +02:00
Matthew Honnibal b3b5521625 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-04 20:17:18 -05:00
Matthew Honnibal 7b2ede783d Add SP tag to tag map if missing 2017-06-04 20:16:30 -05:00
Matthew Honnibal 516798e9fc Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-05 01:35:21 +02:00
Matthew Honnibal 193bf913c0 Set is_tagged=True after tagging 2017-06-05 01:35:07 +02:00
Matthew Honnibal b78cc318c3 Fix loading of morphology exceptions 2017-06-04 16:34:32 -05:00
Matthew Honnibal 3680c51b8f Avoid clobbering preset POS tags 2017-06-04 15:52:42 -05:00
ines 1b593bbd6d Fix encoding on tagger serialization 2017-06-02 17:29:21 +02:00
Matthew Honnibal 5f4d328e2c Fix serialization of tag_map in NeuralTagger 2017-06-02 10:18:37 -05:00
Matthew Honnibal 307d615c5f Fix serialization for tagger when tag_map has changed 2017-06-01 12:18:36 -05:00
ines 7a2380f617 Rename "nn_tagger" to "tagger" 2017-06-01 17:37:53 +02:00
Matthew Honnibal 5eae3b9a1e Fix to/from disk in tagger 2017-06-01 04:55:49 -05:00
Matthew Honnibal 53d00a0371 Move weight serialization to Thinc 2017-06-01 03:04:36 -05:00
Matthew Honnibal ae8010b526 Move weight serialization to Thinc 2017-06-01 02:56:12 -05:00
Matthew Honnibal 33e5ec737f Fix to/from disk methods 2017-05-31 13:43:10 +02:00
Matthew Honnibal 293d1b425b Serialize in consistent order 2017-05-29 17:53:06 -05:00
Matthew Honnibal 6522ea6c8b More serialization fixes. Still broken 2017-05-29 13:23:47 -05:00
Matthew Honnibal aa4c33914b Work on serialization 2017-05-29 08:40:45 -05:00
Matthew Honnibal ff26aa6c37 Work on to/from bytes/disk serialization methods 2017-05-29 11:45:45 +02:00
Matthew Honnibal 6b019b0540 Update to/from bytes methods 2017-05-29 10:14:20 +02:00
Matthew Honnibal 6dad4117ad Work on serialization for models 2017-05-29 01:37:57 +02:00
Matthew Honnibal 8a24c60c1e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-28 08:12:05 -05:00
Matthew Honnibal bc97bc292c Fix __call__ method 2017-05-28 08:11:58 -05:00
Matthew Honnibal c1263a844b Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-27 18:32:57 -05:00
Matthew Honnibal 9e711c3476 Divide d_loss by batch size 2017-05-27 18:32:46 -05:00
Matthew Honnibal 34bbad8e0e Add __reduce__ methods on parser subclasses. Fixes pickling. 2017-05-27 15:46:06 -05:00
Matthew Honnibal 467bbeadb8 Add hidden layers for tagger 2017-05-24 20:09:51 -05:00
Matthew Honnibal 5b67bcbee0 Increase default embed size to 7500 2017-05-23 15:20:16 -05:00
Matthew Honnibal 3959d778ac Revert "Revert "WIP on improving parser efficiency""
This reverts commit 532afef4a8.
2017-05-23 03:06:53 -05:00
Matthew Honnibal 532afef4a8 Revert "WIP on improving parser efficiency"
This reverts commit bdaac7ab44.
2017-05-23 03:05:25 -05:00
Matthew Honnibal bdaac7ab44 WIP on improving parser efficiency 2017-05-23 02:59:31 -05:00
Matthew Honnibal a7ee63c0ac Fix labeller loss for unseen labels 2017-05-22 10:41:20 -05:00
Matthew Honnibal 83ffd16474 Fix offset calculation for other negative values 2017-05-22 08:00:53 -05:00
Matthew Honnibal b45b4aa392 PseudoProjectivity --> nonproj 2017-05-22 05:17:44 -05:00
Matthew Honnibal 8d1e64be69 Add experimental NeuralLabeller 2017-05-22 04:51:08 -05:00
Matthew Honnibal 9b1b0742fd Fix prediction for tok2vec 2017-05-22 04:51:08 -05:00
Matthew Honnibal 5db89053aa Merge docstrings 2017-05-21 13:46:23 -05:00
Matthew Honnibal 180e5afede Fix tokvecs flattening in pipeline 2017-05-21 09:05:34 -05:00
ines 99b631617d Reformat docstrings 2017-05-21 13:32:15 +02:00
ines d82ae9a585 Change "function" to "callable" in docs 2017-05-21 13:17:40 +02:00
Matthew Honnibal 3b7c108246 Pass tokvecs through as a list, instead of concatenated. Also fix padding 2017-05-20 13:23:32 -05:00
Matthew Honnibal d52b65aec2 Revert "Move to contiguous buffer for token_ids and d_vectors"
This reverts commit 3ff8c35a79.
2017-05-20 11:26:23 -05:00
Matthew Honnibal 3ff8c35a79 Move to contiguous buffer for token_ids and d_vectors 2017-05-20 04:17:30 -05:00
Matthew Honnibal c12ab47a56 Remove state argument in pipeline. Other changes 2017-05-19 13:26:36 -05:00
ines 0fc05e54e4 Document TokenVectorEncoder 2017-05-19 00:00:02 +02:00
Matthew Honnibal c2c825127a Fix use_params and pipe methods 2017-05-18 08:30:59 -05:00
Matthew Honnibal b460533827 Bug fixes to pipeline 2017-05-18 04:29:51 -05:00
Matthew Honnibal 692bd2a186 Bug fix to tagger: wasnt backproping to token vectors 2017-05-17 13:13:14 +02:00
Matthew Honnibal 793430aa7a Get spaCy train command working with neural network
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
Matthew Honnibal 8cf097ca88 Redesign training to integrate NN components
* Obsolete .parser, .entity etc names in favour of .pipeline
* Components no longer create models on initialization
* Models created by loading method (from_disk(), from_bytes() etc), or
    .begin_training()
* Add .predict(), .set_annotations() methods in components
* Pass state through pipeline, to allow components to share information
    more flexibly.
2017-05-16 16:17:30 +02:00
Matthew Honnibal 5211645af3 Get data flowing through pipeline. Needs redesign 2017-05-16 11:21:59 +02:00
Matthew Honnibal a9edb3aa1d Improve integration of NN parser, to support unified training API 2017-05-15 21:53:27 +02:00
Matthew Honnibal 4b9d69f428 Merge branch 'v2' into develop
* Move v2 parser into nn_parser.pyx
* New TokenVectorEncoder class in pipeline.pyx
* New spacy/_ml.py module

Currently the two parsers live side-by-side, until we figure out how to
organize them.
2017-05-14 01:10:23 +02:00
Matthew Honnibal 5cac951a16 Move new parser to nn_parser.pyx, and restore old parser, to make tests pass. 2017-05-14 00:55:01 +02:00
Matthew Honnibal 613ba79e2e Fiddle with sizings for parser 2017-05-13 17:20:23 -05:00
Matthew Honnibal 827b5af697 Update draft of parser neural network model
Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU.

Outline of the model:

We first predict context-sensitive vectors for each word in the input:

(embed_lower | embed_prefix | embed_suffix | embed_shape)
>> Maxout(token_width)
>> convolution ** 4

This convolutional layer is shared between the tagger and the parser. This prevents the parser from needing tag features.
To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this
by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a
representation that's one affine transform from this informative lexical information. This is obviously good for the
parser (which backprops to the convolutions too).

The parser model makes a state vector by concatenating the vector representations for its context tokens. Current
results suggest few context tokens works well. Maybe this is a bug.

The current context tokens:

* S0, S1, S2: Top three words on the stack
* B0, B1: First two words of the buffer
* S0L1, S0L2: Leftmost and second leftmost children of S0
* S0R1, S0R2: Rightmost and second rightmost children of S0
* S1L1, S1L2, S1R2, S1R, B0L1, B0L2: Likewise for S1 and B0

This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately,
there's a way to structure the computation to save some expense (and make it more GPU friendly).

The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks
with a non-monotonic transition). A naive implementation would require 2*N (B, 13*T) @ (13*T, H) matrix multiplications
for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication, to pre-compute the hidden
weights for each positional feature wrt the words in the batch. (Note that our token vectors come from the CNN
-- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model
is so big.)

This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity.
The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved
to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier.
We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle
in CUDA to train.

Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple states to
be 0 cost. This is defined as:

(exp(score) / Z) - (exp(score) / gZ)

Where gZ is the sum of the scores assigned to gold classes. I'm very interested in regressing on the cost directly,
but so far this isn't working well.

Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit
greatly from the pre-computation trick.
2017-05-12 16:09:15 -05:00
Matthew Honnibal b16ae75824 Remove serializer hacks from pipeline classes 2017-05-09 18:16:40 +02:00
Matthew Honnibal bef89ef23d Mergery 2017-05-08 08:29:36 -05:00
Matthew Honnibal 94e86ae00a Predict tags with encoder 2017-05-08 07:53:45 -05:00
Matthew Honnibal a66a4a4d0f Replace einsums 2017-05-08 14:46:50 +02:00
Matthew Honnibal 6782eedf9b Tmp GPU code 2017-05-07 11:04:24 -05:00
Matthew Honnibal f99f5b75dc working residual net 2017-05-07 03:57:26 +02:00
Matthew Honnibal 7e04260d38 Data running through, likely errors in model 2017-05-06 14:22:20 +02:00
ines d24589aa72 Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
ines 561f2a3eb4 Use consistent formatting for docstrings 2017-04-15 11:59:21 +02:00
Matthew Honnibal 354458484c WIP on add_label bug during NER training
Currently when a new label is introduced to NER during training,
it causes the labels to be read in in an unexpected order. This
invalidates the model.
2017-04-14 23:52:17 +02:00
Matthew Honnibal 2f63806ddb Update config when adding label. Re #910 2017-03-25 22:35:44 +01:00
Raphaël Bournhonesque f332bf05be Remove unused import statements 2017-03-21 21:08:54 +01:00
Matthew Honnibal 7769bc31e3 Add beam-search classes 2017-03-15 09:27:41 -05:00
Matthew Honnibal fa23278ee3 Add classes for beam parser and beam NER 2017-03-11 12:45:37 -06:00
Matthew Honnibal f77a5bb60a Switch back to greedy parser 2017-03-11 11:11:30 -06:00
Matthew Honnibal dcce9ca3f3 Use beam parser 2017-03-11 07:00:20 -06:00
Matthew Honnibal b86f8af0c1 Fix doc strings 2016-11-01 12:25:36 +01:00
Matthew Honnibal 3e688e6d4b Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness. 2016-10-23 17:45:44 +02:00
Matthew Honnibal f787cd29fe Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor. 2016-10-16 21:34:57 +02:00
Matthew Honnibal 4bb73b1a93 Fix parser labels in pipeline 2016-10-16 17:03:22 +02:00
Matthew Honnibal a079677984 Fix omission of O action when creating blank entity recognizer 2016-10-16 11:43:25 +02:00
Matthew Honnibal 509b30834f Add a pipeline module, to collect and wrap processes for annotation 2016-10-16 01:47:12 +02:00