Commit Graph

3135 Commits

Author SHA1 Message Date
Matthew Honnibal 5211645af3 Get data flowing through pipeline. Needs redesign 2017-05-16 11:21:59 +02:00
Matthew Honnibal 1d7c18e58a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-15 21:53:47 +02:00
Matthew Honnibal a9edb3aa1d Improve integration of NN parser, to support unified training API 2017-05-15 21:53:27 +02:00
ines 98354be150 Only get user_data if it exists on doc 2017-05-15 13:39:47 +02:00
ines c33bdeb564 Use uppercase for entity types 2017-05-15 01:24:57 +02:00
ines 4aaa607b8d Add xmlns:xlink so SVGs are rendered properly as individual files 2017-05-14 19:54:13 +02:00
ines 9dd13cd76a Update docstrings 2017-05-14 19:30:47 +02:00
ines a04550605a Add Jupyter notebook support (see #1058) 2017-05-14 18:39:01 +02:00
ines c31792aaec Add displaCy visualisers (see #1058) 2017-05-14 17:50:23 +02:00
ines b462076d80 Merge load_lang_class and get_lang_class 2017-05-14 01:31:10 +02:00
ines 36bebe7164 Update docstrings 2017-05-14 01:30:29 +02:00
Matthew Honnibal 4b9d69f428 Merge branch 'v2' into develop
* Move v2 parser into nn_parser.pyx
* New TokenVectorEncoder class in pipeline.pyx
* New spacy/_ml.py module

Currently the two parsers live side-by-side, until we figure out how to
organize them.
2017-05-14 01:10:23 +02:00
Matthew Honnibal 5cac951a16 Move new parser to nn_parser.pyx, and restore old parser, to make tests pass. 2017-05-14 00:55:01 +02:00
Matthew Honnibal f8c02b4341 Remove cupy imports from parser, so it can work on CPU 2017-05-14 00:37:53 +02:00
Matthew Honnibal 613ba79e2e Fiddle with sizings for parser 2017-05-13 17:20:23 -05:00
Matthew Honnibal e6d71e1778 Small fixes to parser 2017-05-13 17:19:04 -05:00
Matthew Honnibal 188c0f6949 Clean up unused import 2017-05-13 17:18:27 -05:00
Matthew Honnibal f85c8464f7 Draft support of regression loss in parser 2017-05-13 17:17:27 -05:00
ines 1694c24e52 Add docstrings, error messages and fix consistency 2017-05-13 21:22:49 +02:00
ines ee7dcf65c9 Fix expand_exc to make sure it returns combined dict 2017-05-13 21:22:25 +02:00
ines 824d09bb74 Move resolve_load_name to deprecated 2017-05-13 21:21:47 +02:00
ines a4a37a783e Remove import from non-existing module 2017-05-13 16:00:09 +02:00
ines 5858857a78 Update languages list in conftest 2017-05-13 15:37:54 +02:00
ines 9d85cda8e4 Fix models error message and use about.__docs_models__ (see #1051) 2017-05-13 13:05:47 +02:00
ines 6b942763f0 Tidy up imports 2017-05-13 13:04:40 +02:00
ines 8c2a0c026d Fix parse_tree test 2017-05-13 12:32:45 +02:00
ines 6129016e15 Replace deepcopy 2017-05-13 12:32:37 +02:00
ines df68bf45ce Set defaults for light and flat kwargs 2017-05-13 12:32:23 +02:00
ines b9dea345e5 Remove old import 2017-05-13 12:32:11 +02:00
ines 293ee359c5 Fix formatting 2017-05-13 12:32:06 +02:00
ines 4eefb288e3 Port over PR #1055 2017-05-13 03:25:32 +02:00
Matthew Honnibal ee1d35bdb0 Fix merge conflict 2017-05-13 03:20:19 +02:00
Matthew Honnibal b2540d2379 Merge Kengz's tree_print patch 2017-05-13 03:18:49 +02:00
Matthew Honnibal 827b5af697 Update draft of parser neural network model
Model is good, but code is messy. Currently requires Chainer, which may cause the build to fail on machines without a GPU.

Outline of the model:

We first predict context-sensitive vectors for each word in the input:

(embed_lower | embed_prefix | embed_suffix | embed_shape)
>> Maxout(token_width)
>> convolution ** 4

This convolutional layer is shared between the tagger and the parser. This prevents the parser from needing tag features.
To boost the representation, we make a "super tag" with POS, morphology and dependency label. The tagger predicts this
by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a
representation that's one affine transform from this informative lexical information. This is obviously good for the
parser (which backprops to the convolutions too).

The parser model makes a state vector by concatenating the vector representations for its context tokens. Current
results suggest that using only a few context tokens works well. Maybe this is a bug.

The current context tokens:

* S0, S1, S2: Top three words on the stack
* B0, B1: First two words of the buffer
* S0L1, S0L2: Leftmost and second leftmost children of S0
* S0R1, S0R2: Rightmost and second rightmost children of S0
* S1L1, S1L2, S1R1, S1R2, B0L1, B0L2: Likewise for S1 and B0

This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately,
there's a way to structure the computation to save some expense (and make it more GPU friendly).

The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks
with a non-monotonic transition). A naive implementation would require 2*N (B, 13*T) @ (13*T, H) matrix multiplications
for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication, to pre-compute the hidden
weights for each positional feature wrt the words in the batch. (Note that our token vectors come from the CNN
-- so we can't play this trick over the vocabulary. That's how Stanford's NN parser works --- and why its model
is so big.)
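
The equivalence behind this trick can be checked with a small sketch. It is not the parser's code: it uses plain Python lists so it runs without dependencies, tiny illustrative sizes, and hypothetical names (`blocks`, `naive_hidden`, `fast_hidden`). It also assumes every feature slot is filled and ignores biases and non-linearities.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

random.seed(0)
N, T, H, F = 5, 4, 3, 13  # words, token width, hidden width, feature slots

# Token vectors for one sentence, shape (N, T).
tokens = [[random.uniform(-1, 1) for _ in range(T)] for _ in range(N)]
# One (T, H) weight block per feature slot; stacked vertically they form (F*T, H).
blocks = [[[random.uniform(-1, 1) for _ in range(H)] for _ in range(T)]
          for _ in range(F)]

def naive_hidden(ids):
    """Concatenate the F context vectors, then do one (1, F*T) @ (F*T, H) product."""
    state = [x for i in ids for x in tokens[i]]
    stacked = [row for block in blocks for row in block]
    return matmul([state], stacked)[0]

# Pre-computation: a single (N, T) @ (T, F*H) product for the whole sentence.
wide = [[blocks[f][t][h] for f in range(F) for h in range(H)] for t in range(T)]
precomputed = matmul(tokens, wide)  # shape (N, F*H)

def fast_hidden(ids):
    """Sum the pre-computed slice for each (feature slot, token) pair."""
    out = [0.0] * H
    for f, i in enumerate(ids):
        for h in range(H):
            out[h] += precomputed[i][f * H + h]
    return out

# A parser state is described by which token fills each feature slot.
ids = [random.randrange(N) for _ in range(F)]
assert all(abs(a - b) < 1e-9 for a, b in zip(naive_hidden(ids), fast_hidden(ids)))
```

After the single wide multiplication, each state costs only F lookups and additions of H-dimensional slices.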

This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity.
The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved
to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier.
We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle
in CUDA to train.

Currently the parser's loss function is multilabel log loss, as the dynamic oracle allows multiple actions to
be 0 cost. Its gradient with respect to each class score is:

(exp(score) / Z) - (exp(score) / gZ)

Where Z is the sum of exp(score) over all classes and gZ is the same sum restricted to the gold (0-cost) classes;
the second term applies only to gold classes. I'm very interested in regressing on the cost directly,
but so far this isn't working well.
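
A minimal sketch of this objective, assuming the loss is the negative log of the total probability mass on gold classes (which yields the expression above as its gradient). The function name and argument layout are hypothetical, and at least one class must be gold.

```python
import math

def multilabel_log_loss(scores, is_gold):
    """Return (loss, gradient) for loss = -log(sum of softmax probs of gold classes)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)                                    # normaliser over all classes
    gZ = sum(e for e, g in zip(exps, is_gold) if g)  # normaliser over gold classes
    loss = -math.log(gZ / Z)
    # d loss / d score_i = exp_i / Z - (exp_i / gZ if class i is gold else 0)
    grad = [e / Z - (e / gZ if g else 0.0) for e, g in zip(exps, is_gold)]
    return loss, grad

loss, grad = multilabel_log_loss([1.0, 2.0, 0.5], [True, True, False])
```

With a single gold class this reduces to the usual softmax cross-entropy gradient (probabilities minus a one-hot target).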

Machinery is in place for beam-search, which has been working well for the linear model. Beam search should benefit
greatly from the pre-computation trick.
2017-05-12 16:09:15 -05:00
ines c4857bc7db Remove unused argument 2017-05-12 15:37:54 +02:00
ines c13b3fa052 Add LEX_ATTRS 2017-05-12 15:37:45 +02:00
ines bca2ea9c72 Update Portuguese lexical attributes 2017-05-12 15:37:39 +02:00
ines 2f870123bf Fix formatting 2017-05-12 15:37:20 +02:00
ines ca65993d59 Add basic Polish Language class 2017-05-12 09:25:37 +02:00
ines 48177c4f92 Add missing tokenizer exceptions 2017-05-12 09:25:24 +02:00
ines bb8be3d194 Add Danish language data 2017-05-10 21:15:12 +02:00
Matthew Honnibal 4efb391994 Fix serializer 2017-05-09 18:45:18 +02:00
Matthew Honnibal b16ae75824 Remove serializer hacks from pipeline classes 2017-05-09 18:16:40 +02:00
Matthew Honnibal 7253b4e649 Remove old serialization tests 2017-05-09 18:12:58 +02:00
Matthew Honnibal f9327343ce Start updating serializer test 2017-05-09 18:12:03 +02:00
Matthew Honnibal 1166b0c491 Implement Doc.to_bytes and Doc.from_bytes methods 2017-05-09 18:11:34 +02:00
Matthew Honnibal 9e167b7bb6 Strip serializer from code 2017-05-09 17:28:50 +02:00
Matthew Honnibal b53f7dfdc3 Remove spacy.serialize 2017-05-09 17:22:06 +02:00
Matthew Honnibal 62ecdea9f2 Add binder class for document serialization 2017-05-09 17:21:00 +02:00
ines a0b00624bb Make sure like_email returns bool 2017-05-09 11:37:29 +02:00
ines ea60932e1b Fix formatting 2017-05-09 11:08:14 +02:00
ines 2c3bdd09b1 Add English test for like_num 2017-05-09 11:06:34 +02:00
ines 22375eafb0 Fix and merge attrs and lex_attrs tests 2017-05-09 11:06:25 +02:00
ines 02d0ac5cab Remove redundant function and fix formatting 2017-05-09 11:06:04 +02:00
ines b5ca50607e Reorganise entity rules 2017-05-09 01:37:10 +02:00
ines 564939391a Remove spacy.orth 2017-05-09 01:21:47 +02:00
ines 12c3d5fbba Fix formatting 2017-05-09 01:15:28 +02:00
ines 2829a024ef Re-add basic like_num check to global lex_attrs 2017-05-09 01:15:23 +02:00
ines 88adeee548 Add English lex_attrs overrides 2017-05-09 01:09:52 +02:00
ines 8f3fbbb147 Fix typos 2017-05-09 01:09:37 +02:00
ines ea5fa46475 Import LEX_ATTRS from lang.lex_attrs 2017-05-09 00:58:10 +02:00
ines 2216e5f326 Reorganise lex_attrs and add dict 2017-05-09 00:57:54 +02:00
ines e666f14d20 Add global lex_attrs 2017-05-09 00:41:53 +02:00
ines 41972c43fe Use consistent regex imports 2017-05-09 00:34:31 +02:00
ines 7b83977020 Remove unused munge package 2017-05-09 00:16:16 +02:00
ines c714841cc8 Move language-specific tests to tests/lang 2017-05-09 00:02:37 +02:00
ines bd57b611cc Update conftest to lazy load languages 2017-05-09 00:02:21 +02:00
ines 9f0fd5963f Reorganise Hungarian punctuation rules 2017-05-09 00:01:59 +02:00
ines fc0d793360 Reorganise Bengali punctuation rules 2017-05-09 00:01:52 +02:00
ines e895d1afd7 Reorganise French punctuation rules 2017-05-09 00:00:54 +02:00
ines 014bda0ae3 Reorganise global punctuation rules 2017-05-09 00:00:46 +02:00
ines a91278cb32 Rename _URL_PATTERN to URL_PATTERN 2017-05-09 00:00:00 +02:00
ines 604f299cf6 Add char classes to global language data 2017-05-08 23:59:33 +02:00
ines f6f5d78cb9 Fix formatting 2017-05-08 23:59:17 +02:00
ines 6eb6306843 Fix language data imports 2017-05-08 23:58:31 +02:00
ines 3c0f85de8e Remove imports in /lang/__init__.py 2017-05-08 23:58:07 +02:00
ines 86d9c29f30 Reorder util functions 2017-05-08 23:51:15 +02:00
ines 9a0d2fdef1 Add load_lang_class() util function 2017-05-08 23:50:45 +02:00
ines 614aa09582 Tidy up Bengali tokenizer exceptions 2017-05-08 22:29:49 +02:00
ines 73b577cb01 Fix relative imports 2017-05-08 22:29:04 +02:00
ines ae99990f63 Fix formatting 2017-05-08 22:23:48 +02:00
ines f46ffe3e89 Move language data to /lang module 2017-05-08 20:00:40 +02:00
ines 41a322c733 Fix LEMMA in exceptions and morph rules 2017-05-08 19:57:36 +02:00
ines 2edc0aee12 Update warning message 2017-05-08 19:53:36 +02:00
ines 6025cdb992 Fix string interpolation in times 2017-05-08 16:38:16 +02:00
ines b9ba58ba5c Add function to resolve load name
Warn if old 'path' keyword argument is used.
2017-05-08 16:33:37 +02:00
ines e6f1a5d0a1 Add unicode declaration 2017-05-08 16:22:17 +02:00
ines be5541bd16 Fix import and tokenizer exceptions 2017-05-08 16:20:14 +02:00
ines 2324788970 Remove bad tests 2017-05-08 16:15:27 +02:00
ines b88c4193e7 Add missing symbol 2017-05-08 16:15:20 +02:00
ines 9a5b2bdd4c Don't set morph rules without tag map 2017-05-08 16:15:12 +02:00
ines 4930f0fa8f Explicitly import TOKEN_MATCH 2017-05-08 16:11:54 +02:00
ines 50b7ec03ca Fix typo 2017-05-08 16:11:45 +02:00
ines 3ca611fe48 Fix wildcard imports 2017-05-08 15:56:29 +02:00
ines c2469b8135 Remove __all__ export 2017-05-08 15:56:22 +02:00
ines 14a9c3ee7a Fix wildcard import 2017-05-08 15:56:13 +02:00
ines deed623864 Remove comment 2017-05-08 15:56:05 +02:00
ines e7f95c37ee Merge base tokenizer exceptions 2017-05-08 15:55:52 +02:00
ines 24606d364c Remove redundant language_data.py files in languages
Originally intended to collect all components of a language, but just
made things messy. Now each component is in charge of exporting itself
properly.
2017-05-08 15:55:29 +02:00
ines a627d3e3b0 Reorganise Chinese language data 2017-05-08 15:54:36 +02:00
ines 7b86ee093a Reorganise Swedish language data 2017-05-08 15:54:29 +02:00
ines 50510fa947 Reorganise Portuguese language data 2017-05-08 15:52:01 +02:00
ines 279895ea83 Reorganise Dutch language data 2017-05-08 15:51:39 +02:00
ines 04ef5025bd Reorganise Norwegian language data 2017-05-08 15:51:22 +02:00
ines 5edbc725d8 Reorganise Japanese language data 2017-05-08 15:50:46 +02:00
ines 51a389d3bb Reorganise Italian language data 2017-05-08 15:50:17 +02:00
ines 1bbfa14436 Reorganise Hungarian language data 2017-05-08 15:49:56 +02:00
ines a77c9fc60d Reorganise Hebrew language data 2017-05-08 15:49:28 +02:00
ines 7f05e977fa Reorganise French language data 2017-05-08 15:49:05 +02:00
ines 0207ffdd52 Reorganise Finnish language data 2017-05-08 15:48:31 +02:00
ines 8e483ec950 Reorganise Spanish language data 2017-05-08 15:48:04 +02:00
ines c7c21b980f Reorganise English language data 2017-05-08 15:47:25 +02:00
ines 1bf9d5ec8b Reorganise German language data 2017-05-08 15:44:26 +02:00
ines 7b3a983f96 Reorganise Bengali language data 2017-05-08 15:43:50 +02:00
ines 607ba458e7 Fix whitespace 2017-05-08 15:42:31 +02:00
ines 60db497525 Add update_exc and expand_exc to util
Doesn't require separate language data util anymore
2017-05-08 15:42:12 +02:00
Matthew Honnibal b44f7e259c Clean up unused parser code 2017-05-08 15:42:04 +02:00
ines 6e5bd4f228 Remove unused functions from deprecated 2017-05-08 15:40:16 +02:00
Matthew Honnibal 17efb1c001 Change width 2017-05-08 08:40:13 -05:00
ines f68e420bc0 Add PRON_LEMMA and DET_LEMMA to deprecated
Will be replaced with proper values across the language data later.
2017-05-08 15:35:30 +02:00
ines bd6a7cf4f6 Simplify deprecated model downloading
Only relevant for spaCy < v1.7.0.
2017-05-08 15:32:10 +02:00
ines 95edd9e896 Let parse_package_meta take full path 2017-05-08 15:30:48 +02:00
ines 326746eb15 Add util function to resolve arg to model path
1. check if in data dir or shortcut link
2. check if installed as a pip package
3. check if string is path to model
4. check if Path or Path-like object
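
The four checks above could be sketched roughly as follows. This is not the function added in the commit: the name `resolve_model_path`, the `data_dir` parameter, and the error type are all hypothetical, and shortcut links are treated simply as entries in the data directory.

```python
import importlib.util
import os
from pathlib import Path

def resolve_model_path(name, data_dir):
    """Hypothetical resolver that applies the four checks in order."""
    data_dir = Path(data_dir)
    if isinstance(name, str):
        # 1. A plain name found in the data directory (including a shortcut link).
        if Path(name).name == name and (data_dir / name).exists():
            return data_dir / name
        # 2. An importable package with that name (e.g. installed via pip).
        if name.isidentifier():
            spec = importlib.util.find_spec(name)
            if spec is not None and spec.submodule_search_locations:
                return Path(list(spec.submodule_search_locations)[0])
        # 3. A string that is itself a path to a model directory.
        if Path(name).exists():
            return Path(name)
    elif isinstance(name, os.PathLike):
        # 4. A Path or other path-like object.
        if Path(name).exists():
            return Path(name)
    raise FileNotFoundError("Can't find model: {}".format(name))
```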
2017-05-08 15:29:47 +02:00
Matthew Honnibal bef89ef23d Mergery 2017-05-08 08:29:36 -05:00
ines a7801e7342 Update spacy.load()
The path argument is now deprecated, and name can take either a model name
or a path. Implement lazy loading by importing the module and reading the
Language class name off __all__.
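
A minimal sketch of the lazy-loading idea, under the assumption that each language module lists its Language class as the first entry of `__all__`. The helper name and the stand-in module `fake_lang_en` are hypothetical and exist only so the example runs on its own.

```python
import importlib
import sys
import types

def get_lang_class_lazily(module_name):
    """Import the module only when asked, then return the class named in __all__."""
    module = importlib.import_module(module_name)
    class_name = module.__all__[0]  # assumption: the class is the first export
    return getattr(module, class_name)

# Stand-in for a language module, registered so import_module can find it.
class English:
    lang = "en"

fake = types.ModuleType("fake_lang_en")
fake.English = English
fake.__all__ = ["English"]
sys.modules["fake_lang_en"] = fake

cls = get_lang_class_lazily("fake_lang_en")
```

Because the import happens inside the function, no language data is loaded until a specific language is requested.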
2017-05-08 15:27:25 +02:00
Matthew Honnibal 50ddc9fc45 Fix infinite loop bug 2017-05-08 07:54:26 -05:00
Matthew Honnibal 94e86ae00a Predict tags with encoder 2017-05-08 07:53:45 -05:00
Matthew Honnibal 56073a11ef Don't use tags when calculating token vectors 2017-05-08 07:52:24 -05:00
Matthew Honnibal a66a4a4d0f Replace einsums 2017-05-08 14:46:50 +02:00
Matthew Honnibal 8d2eab74da Use PretrainableMaxouts 2017-05-08 14:24:55 +02:00
Matthew Honnibal 807cb2e370 Add PretrainableMaxouts 2017-05-08 14:24:43 +02:00
Matthew Honnibal 2e2268a442 Precomputable hidden now working 2017-05-08 11:36:37 +02:00
ines 94697e9afc Fix typo 2017-05-08 02:00:37 +02:00
ines 0ee2a22b67 Merge branch 'pr/1024' into develop 2017-05-08 01:12:44 +02:00
ines c4492d260a Fix kwargs 2017-05-08 01:05:24 +02:00
Matthew Honnibal 10682d35ab Get pre-computed version working 2017-05-08 00:38:35 +02:00
ines b5a726c5cd Tidy up deprecated.py 2017-05-07 23:29:22 +02:00
ines 59c3b9d4dd Tidy up CLI and fix print functions 2017-05-07 23:25:29 +02:00
ines 311704674d Add path2str compat function 2017-05-07 23:24:56 +02:00
ines e34069db9f Move is_package and get_model_package_path to util 2017-05-07 23:24:51 +02:00
ines 957ba676b4 Add model files base path to about.py 2017-05-07 23:22:35 +02:00
ines 8d8dd9ceb2 Don't set default value for model 2017-05-07 23:22:21 +02:00
Matthew Honnibal 35458987e8 Checkpoint -- nearly finished reimpl 2017-05-07 23:05:01 +02:00
Matthew Honnibal 4441866f55 Checkpoint -- nearly finished reimpl 2017-05-07 22:47:06 +02:00
Matthew Honnibal 6782eedf9b Tmp GPU code 2017-05-07 11:04:24 -05:00
Matthew Honnibal e420e5a809 Tmp 2017-05-07 07:31:09 -05:00
Matthew Honnibal 12039e80ca Switch to single matmul for state layer 2017-05-07 14:26:34 +02:00
Matthew Honnibal 700979fb3c CPU/GPU compat 2017-05-07 04:01:11 +02:00
Matthew Honnibal f99f5b75dc working residual net 2017-05-07 03:57:26 +02:00
Matthew Honnibal bdf2dba9fb WIP on refactor, with hidden pre-computing 2017-05-07 02:02:43 +02:00
Matthew Honnibal b439e04f8d Learning smoothly 2017-05-06 20:38:12 +02:00
Matthew Honnibal 08bee76790 Learns things 2017-05-06 18:24:38 +02:00
Matthew Honnibal 04ae1c01f1 Learns things 2017-05-06 18:21:02 +02:00
Matthew Honnibal bcf4cd0a5f Learns things 2017-05-06 17:37:36 +02:00
Matthew Honnibal 8e48b58cd6 Gradients look correct 2017-05-06 16:47:15 +02:00
Matthew Honnibal 7e04260d38 Data running through, likely errors in model 2017-05-06 14:22:20 +02:00
Matthew Honnibal fa7c1990b6 Restore tok2vec function 2017-05-05 20:12:03 +02:00
Matthew Honnibal efe9630e1c Bug fixes 2017-05-05 20:09:50 +02:00
Matthew Honnibal ef4fa594aa Draft of NN parser, to be tested 2017-05-05 19:20:39 +02:00
Matthew Honnibal 7d1df50aec Draft up Parser model 2017-05-04 13:31:40 +02:00
Matthew Honnibal ccaf26206b Pseudocode for parser 2017-05-04 12:17:59 +02:00
ines b1f22c5a10 Fix formatting 2017-05-03 20:11:02 +02:00
ines a04b5be1b2 Add glossary for annotation scheme (closes #1034)
Can be imported as explain from spacy.glossary, or called as
spacy.explain(term)
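
The idea is a plain lookup from annotation labels to descriptions. A self-contained sketch, not the contents of spacy.glossary: the three entries are illustrative, and returning None for unknown terms is an assumption.

```python
# Illustrative subset of a glossary mapping labels to human-readable descriptions.
GLOSSARY = {
    "NN": "noun, singular or mass",
    "nsubj": "nominal subject",
    "ORG": "Companies, agencies, institutions, etc.",
}

def explain(term):
    """Return the description for a tag or label, or None if it is unknown."""
    return GLOSSARY.get(term)
```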
2017-05-03 17:02:17 +02:00
Gregory Howard 929f2792a7 Renaming cls in module. cls is now a class 2017-05-03 15:41:07 +02:00
Gregory Howard 0e8c41ea4f Adding method lemmatizer for every class 2017-05-03 12:14:42 +02:00
Gregory Howard 32ca07989e adding export japanese 2017-05-03 11:07:29 +02:00
Grégory Howard f9d7144224 Merge branch 'master' into master 2017-05-03 11:04:51 +02:00
Gregory Howard f2ab7d77b4 Lazy imports language 2017-05-03 11:01:42 +02:00
Ines Montani 3ea23a3f4d Fix formatting 2017-05-03 09:44:38 +02:00
Ines Montani d730eb0c0d Raise custom ImportError if importing janome fails 2017-05-03 09:43:29 +02:00
Ines Montani 949ad6594b Add newline 2017-05-03 09:38:43 +02:00
Ines Montani d12ca587ea Add newline 2017-05-03 09:38:29 +02:00
Ines Montani 8676cd0135 Add newline 2017-05-03 09:38:07 +02:00
Yasuaki Uechi c8f83aeb87 Add basic Japanese support 2017-05-03 13:56:21 +09:00
Gregory Howard c0afcd22bb Merge remote-tracking branch 'remotes/upstream/master' 2017-04-27 14:42:54 +02:00
Matthew Honnibal 31ec9e1371 Merge branch 'master' of https://github.com/explosion/spaCy 2017-04-27 13:21:39 +02:00
Matthew Honnibal 2da16adcc2 Add dropout option for parser and NER
Dropout can now be specified in the `Parser.update()` method via
the `drop` keyword argument, e.g.

    nlp.entity.update(doc, gold, drop=0.4)

This will randomly drop 40% of features, and multiply the value of the
others by 1. / 0.4. This may be useful for generalising from small data
sets.
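
For illustration, here is a sketch of feature dropout in its conventional "inverted" form, which rescales surviving values by 1 / (1 - drop) so that the expected value of each feature is unchanged. The message above states a factor of 1. / 0.4, so the actual implementation may differ. The function name and the `rng` parameter are hypothetical.

```python
import random

def drop_features(values, drop, rng=random):
    """Zero each value with probability `drop`; rescale survivors by 1 / (1 - drop)."""
    if not 0.0 <= drop < 1.0:
        raise ValueError("drop must be in [0, 1)")
    scale = 1.0 / (1.0 - drop)  # conventional inverted-dropout factor (assumption)
    return [0.0 if rng.random() < drop else v * scale for v in values]

# With drop=0.4, roughly 40% of the features are zeroed on each call.
noisy = drop_features([1.0, 2.0, 3.0, 4.0], 0.4, random.Random(0))
```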

This commit also patches the examples/training/train_new_entity_type.py
example, to use dropout and fix the output (previously it did not output
the learned entity).
2017-04-27 13:18:39 +02:00
Gregory Howard 92f368f83b Removing extra spaces 2017-04-27 12:02:14 +02:00
Gregory Howard 13b6957c8e Adding unit test for tokenization in French (with title) 2017-04-27 11:53:44 +02:00
Gregory Howard 8ff4682255 correcting tokenizer exception.
Adding tests for lemmatization
2017-04-27 11:52:14 +02:00
Ines Montani 7da9cefd25 Merge pull request #1022 from luvogels/master
Initial support for Norwegian Bokmål
2017-04-27 11:16:06 +02:00
Ines Montani c9e592ae6c Add newline 2017-04-27 11:15:41 +02:00
Ines Montani 5942adccc2 Add newline 2017-04-27 11:15:19 +02:00
Ines Montani 4cd9269aef Add newline 2017-04-27 11:15:04 +02:00
Ines Montani ccf13ecc21 Add newline 2017-04-27 11:14:42 +02:00
Ines Montani 03d2b0cc05 Add newline 2017-04-27 11:14:26 +02:00
Gregory Howard 44cb486849 Adding unit test for tokenization in French (with title) 2017-04-27 10:59:38 +02:00
Gregory Howard ad8129cb45 Improve rules: now title-insensitive, with the same declaration format 2017-04-27 10:23:56 +02:00
luvogels d12a0b6431 Hooked up tokenizer tests 2017-04-26 23:21:41 +02:00
Matthew Honnibal f0e1606d27 Increment version 2017-04-26 20:25:41 +02:00
luvogels b331929a7e Merge branch 'master' of https://github.com/luvogels/spaCy 2017-04-26 19:15:48 +02:00
luvogels 8de59ce3b9 Added tokenizer tests 2017-04-26 19:10:18 +02:00
Matthew Honnibal 4d98511db7 Make Span hashable. Closes #1019 2017-04-26 19:01:05 +02:00
Matthew Honnibal 24c4c51f13 Try to make test999 less flakey 2017-04-26 18:42:06 +02:00
Leif Uwe Vogelsang 460094bf09 Update __init__.py 2017-04-26 18:27:55 +02:00
ines 527d51ac9a Fetch shortcuts from GitHub and improve error handling 2017-04-26 18:00:28 +02:00
Gregory Howard ed5f094451 Adding insensitive lemmatisation test 2017-04-25 18:07:02 +02:00
ghoward 26e31afc18 Renaming tests 2017-04-25 17:46:01 +02:00
ghoward c085c2d391 Adding some unit tests 2017-04-25 17:44:16 +02:00
ghoward 55c6910f90 Lookup table for languages in spaCy.
Need to find another name for lemmatizerlookup. I was not inspired.
Trying to use new files in the fr language.
2017-04-24 16:39:00 +02:00