3020 Commits

Author SHA1 Message Date
Matthew Honnibal e6d71e1778 Small fixes to parser 2017-05-13 17:19:04 -05:00
Matthew Honnibal 188c0f6949 Clean up unused import 2017-05-13 17:18:27 -05:00
Matthew Honnibal f85c8464f7 Draft support of regression loss in parser 2017-05-13 17:17:27 -05:00
ines 1694c24e52 Add docstrings, error messages and fix consistency 2017-05-13 21:22:49 +02:00
ines ee7dcf65c9 Fix expand_exc to make sure it returns combined dict 2017-05-13 21:22:25 +02:00
ines 824d09bb74 Move resolve_load_name to deprecated 2017-05-13 21:21:47 +02:00
ines a4a37a783e Remove import from non-existing module 2017-05-13 16:00:09 +02:00
ines 5858857a78 Update languages list in conftest 2017-05-13 15:37:54 +02:00
ines 9d85cda8e4 Fix models error message and use about.__docs_models__ (see #1051) 2017-05-13 13:05:47 +02:00
ines 6b942763f0 Tidy up imports 2017-05-13 13:04:40 +02:00
ines 8c2a0c026d Fix parse_tree test 2017-05-13 12:32:45 +02:00
ines 6129016e15 Replace deepcopy 2017-05-13 12:32:37 +02:00
ines df68bf45ce Set defaults for light and flat kwargs 2017-05-13 12:32:23 +02:00
ines b9dea345e5 Remove old import 2017-05-13 12:32:11 +02:00
ines 293ee359c5 Fix formatting 2017-05-13 12:32:06 +02:00
ines 4eefb288e3 Port over PR #1055 2017-05-13 03:25:32 +02:00
Matthew Honnibal ee1d35bdb0 Fix merge conflict 2017-05-13 03:20:19 +02:00
Matthew Honnibal b2540d2379 Merge Kengz's tree_print patch 2017-05-13 03:18:49 +02:00
Matthew Honnibal 827b5af697 Update draft of parser neural network model
Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU.

Outline of the model:

We first predict context-sensitive vectors for each word in the input:

(embed_lower | embed_prefix | embed_suffix | embed_shape)
>> Maxout(token_width)
>> convolution ** 4

This convolutional layer is shared between the tagger and the parser, so the parser does not need tag features.
To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this
by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a
representation that's one affine transform from this informative lexical information. This is obviously good for the
parser (which backprops to the convolutions too).
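
A rough NumPy sketch of the first two stages of the pipeline above, reading `|` as concatenation of the four embeddings and `Maxout` as an affine map to several candidate "pieces" per output unit followed by a max. The table size, embedding width, piece count and the `embed`/`maxout` helper names are illustrative assumptions, not taken from the commit, and the `convolution ** 4` stage is left out because its details are not given here.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are illustrative assumptions, not values from the commit.
ROWS, EMBED_WIDTH, TOKEN_WIDTH, PIECES = 100, 8, 16, 3
ATTRS = ("lower", "prefix", "suffix", "shape")

# One embedding table per lexical attribute named in the outline.
tables = {name: rng.normal(size=(ROWS, EMBED_WIDTH)) for name in ATTRS}

# Maxout parameters: PIECES affine maps per output unit.
W = rng.normal(size=(TOKEN_WIDTH, PIECES, len(ATTRS) * EMBED_WIDTH))
b = np.zeros((TOKEN_WIDTH, PIECES))


def embed(attr_ids):
    """Look up each attribute and concatenate: (n_words, 4 * EMBED_WIDTH)."""
    return np.hstack([tables[name][attr_ids[name] % ROWS] for name in ATTRS])


def maxout(X):
    """Compute every piece, then keep the largest one per output unit."""
    pieces = np.einsum("nd,opd->nop", X, W) + b  # (n_words, TOKEN_WIDTH, PIECES)
    return pieces.max(axis=-1)                   # (n_words, TOKEN_WIDTH)


# Five words, with made-up integer IDs for each attribute.
attr_ids = {name: rng.integers(0, 1000, size=5) for name in ATTRS}
token_vectors = maxout(embed(attr_ids))
```

The result has one row of width `TOKEN_WIDTH` per word. In the model described here, these rows would then pass through the shared convolutional layers.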

The parser model makes a state vector by concatenating the vector representations for its context tokens. Current
results suggest that using only a few context tokens works well. Maybe this is a bug.

The current context tokens:

* S0, S1, S2: Top three words on the stack
* B0, B1: First two words of the buffer
* S0L1, S0L2: Leftmost and second leftmost children of S0
* S0R1, S0R2: Rightmost and second rightmost children of S0
* S1L1, S1L2, S1R1, S1R2, B0L1, B0L2: Likewise for S1 and B0

This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately,
there's a way to structure the computation to save some expense (and make it more GPU friendly).

The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks
with a non-monotonic transition). A naive implementation would require 2*N (B, 13*T) @ (13*T, H) matrix multiplications
for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication, to pre-compute the hidden
weights for each positional feature wrt the words in the batch. (Note that our token vectors come from the CNN
-- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model
is so big.)
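
The pre-computation trick can be checked numerically. The sketch below is a simplified illustration, not the parser's code: it uses a single sentence instead of a batch, random stand-in token vectors, and assumed names (`naive_hidden`, `fast_hidden`, `slot_ids`). `F = 13` follows the 13*T figure above, and a slot ID of -1 marks an empty context slot, which contributes a zero vector.

```python
import numpy as np

rng = np.random.default_rng(0)

N, T, H, F = 6, 4, 5, 13  # words, token width, hidden width, context slots

tokens = rng.normal(size=(N, T))  # stand-in for the CNN token vectors
W = rng.normal(size=(F, T, H))    # one (T, H) weight block per context slot


def naive_hidden(slot_ids):
    """Concatenate the F context vectors, then do one (F*T,) @ (F*T, H) product."""
    parts = [tokens[i] if i >= 0 else np.zeros(T) for i in slot_ids]
    return np.concatenate(parts) @ W.reshape(F * T, H)


# One large product up front: (N, T) @ (T, F*H), viewed as (N, F, H).
# precomputed[n, f] is what word n contributes if it fills slot f.
precomputed = (tokens @ W.transpose(1, 0, 2).reshape(T, F * H)).reshape(N, F, H)


def fast_hidden(slot_ids):
    """Per state, just sum F precomputed rows; no matrix multiplication needed."""
    out = np.zeros(H)
    for f, i in enumerate(slot_ids):
        if i >= 0:
            out += precomputed[i, f]
    return out


# A made-up parser state: which word (or -1 for none) fills each slot.
slot_ids = rng.integers(-1, N, size=F)
assert np.allclose(naive_hidden(slot_ids), fast_hidden(slot_ids))
```

Both paths give the same pre-activation because the wide matrix multiplication is a sum of per-slot products. The per-state work drops to a gather and a sum, which is cheap enough to run on the CPU.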

This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity.
The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved
to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier.
We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle
in CUDA to train.

Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple states to
be 0 cost. This is defined as:

(exp(score) / Z) - (exp(score) / gZ)

Where gZ is the sum of the exponentiated scores assigned to gold classes. I'm very interested in regressing on the cost directly,
but so far this isn't working well.
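
One way to read the expression above, offered as an interpretation rather than something the commit states: it is the gradient of the loss with respect to each class score, where Z sums exp(score) over all classes, gZ sums it over the zero-cost (gold) classes, and the second term applies only to gold classes. Under that reading the loss itself is -log(gZ / Z). The function name `multilabel_log_loss` is an assumption made for this sketch.

```python
import numpy as np


def multilabel_log_loss(scores, is_gold):
    """Return (loss, gradient) for scores where several classes may be gold.

    Assumed reading: loss = -log(gZ / Z), with Z the sum of exp(score) over
    all classes and gZ the same sum restricted to gold classes.
    """
    scores = np.asarray(scores, dtype=float)
    is_gold = np.asarray(is_gold, dtype=bool)
    exp_scores = np.exp(scores - scores.max())  # shift for numerical stability
    Z = exp_scores.sum()
    gZ = exp_scores[is_gold].sum()
    loss = -np.log(gZ / Z)
    # (exp(score) / Z) - (exp(score) / gZ), second term for gold classes only
    grad = exp_scores / Z - (exp_scores * is_gold) / gZ
    return loss, grad


loss, grad = multilabel_log_loss([1.0, 2.0, 0.5], [True, False, True])
```

With exactly one gold class this reduces to the usual softmax cross-entropy gradient (softmax minus one-hot). With several gold classes, the probability mass can go to any of them without penalty.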

Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit
greatly from the pre-computation trick.
2017-05-12 16:09:15 -05:00
ines c4857bc7db Remove unused argument 2017-05-12 15:37:54 +02:00
ines c13b3fa052 Add LEX_ATTRS 2017-05-12 15:37:45 +02:00
ines bca2ea9c72 Update Portuguese lexical attributes 2017-05-12 15:37:39 +02:00
ines 2f870123bf Fix formatting 2017-05-12 15:37:20 +02:00
ines ca65993d59 Add basic Polish Language class 2017-05-12 09:25:37 +02:00
ines 48177c4f92 Add missing tokenizer exceptions 2017-05-12 09:25:24 +02:00
ines bb8be3d194 Add Danish language data 2017-05-10 21:15:12 +02:00
Matthew Honnibal 4efb391994 Fix serializer 2017-05-09 18:45:18 +02:00
Matthew Honnibal b16ae75824 Remove serializer hacks from pipeline classes 2017-05-09 18:16:40 +02:00
Matthew Honnibal 7253b4e649 Remove old serialization tests 2017-05-09 18:12:58 +02:00
Matthew Honnibal f9327343ce Start updating serializer test 2017-05-09 18:12:03 +02:00
Matthew Honnibal 1166b0c491 Implement Doc.to_bytes and Doc.from_bytes methods 2017-05-09 18:11:34 +02:00
Matthew Honnibal 9e167b7bb6 Strip serializer from code 2017-05-09 17:28:50 +02:00
Matthew Honnibal b53f7dfdc3 Remove spacy.serialize 2017-05-09 17:22:06 +02:00
Matthew Honnibal 62ecdea9f2 Add binder class for document serialization 2017-05-09 17:21:00 +02:00
ines a0b00624bb Make sure like_email returns bool 2017-05-09 11:37:29 +02:00
ines ea60932e1b Fix formatting 2017-05-09 11:08:14 +02:00
ines 2c3bdd09b1 Add English test for like_num 2017-05-09 11:06:34 +02:00
ines 22375eafb0 Fix and merge attrs and lex_attrs tests 2017-05-09 11:06:25 +02:00
ines 02d0ac5cab Remove redundant function and fix formatting 2017-05-09 11:06:04 +02:00
ines b5ca50607e Reorganise entity rules 2017-05-09 01:37:10 +02:00
ines 564939391a Remove spacy.orth 2017-05-09 01:21:47 +02:00
ines 12c3d5fbba Fix formatting 2017-05-09 01:15:28 +02:00
ines 2829a024ef Re-add basic like_num check to global lex_attrs 2017-05-09 01:15:23 +02:00
ines 88adeee548 Add English lex_attrs overrides 2017-05-09 01:09:52 +02:00
ines 8f3fbbb147 Fix typos 2017-05-09 01:09:37 +02:00
ines ea5fa46475 Import LEX_ATTRS from lang.lex_attrs 2017-05-09 00:58:10 +02:00
ines 2216e5f326 Reorganise lex_attrs and add dict 2017-05-09 00:57:54 +02:00
ines e666f14d20 Add global lex_attrs 2017-05-09 00:41:53 +02:00
ines 41972c43fe Use consistent regex imports 2017-05-09 00:34:31 +02:00
ines 7b83977020 Remove unused munge package 2017-05-09 00:16:16 +02:00
ines c714841cc8 Move language-specific tests to tests/lang 2017-05-09 00:02:37 +02:00
ines bd57b611cc Update conftest to lazy load languages 2017-05-09 00:02:21 +02:00
ines 9f0fd5963f Reorganise Hungarian punctuation rules 2017-05-09 00:01:59 +02:00
ines fc0d793360 Reorganise Bengali punctuation rules 2017-05-09 00:01:52 +02:00
ines e895d1afd7 Reorganise French punctuation rules 2017-05-09 00:00:54 +02:00
ines 014bda0ae3 Reorganise global punctuation rules 2017-05-09 00:00:46 +02:00
ines a91278cb32 Rename _URL_PATTERN to URL_PATTERN 2017-05-09 00:00:00 +02:00
ines 604f299cf6 Add char classes to global language data 2017-05-08 23:59:33 +02:00
ines f6f5d78cb9 Fix formatting 2017-05-08 23:59:17 +02:00
ines 6eb6306843 Fix language data imports 2017-05-08 23:58:31 +02:00
ines 3c0f85de8e Remove imports in /lang/__init__.py 2017-05-08 23:58:07 +02:00
ines 86d9c29f30 Reorder util functions 2017-05-08 23:51:15 +02:00
ines 9a0d2fdef1 Add load_lang_class() util function 2017-05-08 23:50:45 +02:00
ines 614aa09582 Tidy up Bengali tokenizer exceptions 2017-05-08 22:29:49 +02:00
ines 73b577cb01 Fix relative imports 2017-05-08 22:29:04 +02:00
ines ae99990f63 Fix formatting 2017-05-08 22:23:48 +02:00
ines f46ffe3e89 Move language data to /lang module 2017-05-08 20:00:40 +02:00
ines 41a322c733 Fix LEMMA in exceptions and morph rules 2017-05-08 19:57:36 +02:00
ines 2edc0aee12 Update warning message 2017-05-08 19:53:36 +02:00
ines 6025cdb992 Fix string interpolation in times 2017-05-08 16:38:16 +02:00
ines b9ba58ba5c Add function to resolve load name
Warn if old 'path' keyword argument is used.
2017-05-08 16:33:37 +02:00
ines e6f1a5d0a1 Add unicode declaration 2017-05-08 16:22:17 +02:00
ines be5541bd16 Fix import and tokenizer exceptions 2017-05-08 16:20:14 +02:00
ines 2324788970 Remove bad tests 2017-05-08 16:15:27 +02:00
ines b88c4193e7 Add missing symbol 2017-05-08 16:15:20 +02:00
ines 9a5b2bdd4c Don't set morph rules without tag map 2017-05-08 16:15:12 +02:00
ines 4930f0fa8f Explicitly import TOKEN_MATCH 2017-05-08 16:11:54 +02:00
ines 50b7ec03ca Fix typo 2017-05-08 16:11:45 +02:00
ines 3ca611fe48 Fix wildcard imports 2017-05-08 15:56:29 +02:00
ines c2469b8135 Remove __all__ export 2017-05-08 15:56:22 +02:00
ines 14a9c3ee7a Fix wildcard import 2017-05-08 15:56:13 +02:00
ines deed623864 Remove comment 2017-05-08 15:56:05 +02:00
ines e7f95c37ee Merge base tokenizer exceptions 2017-05-08 15:55:52 +02:00
ines 24606d364c Remove redundant language_data.py files in languages
Originally intended to collect all components of a language, but just
made things messy. Now each component is in charge of exporting itself
properly.
2017-05-08 15:55:29 +02:00
ines a627d3e3b0 Reorganise Chinese language data 2017-05-08 15:54:36 +02:00
ines 7b86ee093a Reorganise Swedish language data 2017-05-08 15:54:29 +02:00
ines 50510fa947 Reorganise Portuguese language data 2017-05-08 15:52:01 +02:00
ines 279895ea83 Reorganise Dutch language data 2017-05-08 15:51:39 +02:00
ines 04ef5025bd Reorganise Norwegian language data 2017-05-08 15:51:22 +02:00
ines 5edbc725d8 Reorganise Japanese language data 2017-05-08 15:50:46 +02:00
ines 51a389d3bb Reorganise Italian language data 2017-05-08 15:50:17 +02:00
ines 1bbfa14436 Reorganise Hungarian language data 2017-05-08 15:49:56 +02:00
ines a77c9fc60d Reorganise Hebrew language data 2017-05-08 15:49:28 +02:00
ines 7f05e977fa Reorganise French language data 2017-05-08 15:49:05 +02:00
ines 0207ffdd52 Reorganise Finnish language data 2017-05-08 15:48:31 +02:00
ines 8e483ec950 Reorganise Spanish language data 2017-05-08 15:48:04 +02:00
ines c7c21b980f Reorganise English language data 2017-05-08 15:47:25 +02:00
ines 1bf9d5ec8b Reorganise German language data 2017-05-08 15:44:26 +02:00
ines 7b3a983f96 Reorganise Bengali language data 2017-05-08 15:43:50 +02:00
ines 607ba458e7 Fix whitespace 2017-05-08 15:42:31 +02:00