Matthew Honnibal
fd65cf6cbb
Finish refactoring data loading
2016-09-24 20:26:17 +02:00
Matthew Honnibal
83e364188c
Mostly finished loading refactoring. Design is in place, but doesn't work yet.
2016-09-24 15:42:01 +02:00
Matthew Honnibal
60fdf4d5f1
Remove commented-out debugging code
2016-09-24 01:17:18 +02:00
Matthew Honnibal
070af4af9d
Revert "* Working neural net, but features hacky. Switching to extractor."
...
This reverts commit 7c2f1a673b.
2016-09-21 12:26:14 +02:00
Matthew Honnibal
7c2f1a673b
* Working neural net, but features hacky. Switching to extractor.
2016-05-26 19:06:10 +02:00
Wolfgang Seeker
7b246c13cb
reformulate noun chunk tests for English
2016-05-03 14:24:35 +02:00
Matthew Honnibal
508fd1f6dc
* Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples.
2016-05-02 14:25:10 +02:00
Wolfgang Seeker
d99a9cbce9
different handling of space tokens
...
space tokens are now always attached to the previous non-space token
there are two exceptions:
- leading space tokens are attached to the first following non-space token
- in input that consists exclusively of space tokens, the last space token
is the head of all others.
2016-04-13 15:28:28 +02:00
Wolfgang Seeker
80bea62842
bugfix in unit test
2016-04-08 16:46:44 +02:00
Wolfgang Seeker
5e2e8e951a
add baseclass DocIterator for iterators over documents
...
add classes for English and German noun chunks
the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Wolfgang Seeker
690c5acabf
adjust train.py to train both English and German models
2016-03-03 15:21:00 +01:00
Wolfgang Seeker
3448cb40a4
integrated pseudo-projective parsing into parser
...
- nonproj.pyx holds a class PseudoProjectivity which currently holds
all functionality to implement Nivre & Nilsson 2005's pseudo-projective
parsing using the HEAD decoration scheme
- changed lefts/rights in Token to account for possible non-projective
structures
2016-03-01 10:09:08 +01:00
Wolfgang Seeker
56b7210e82
moved nonproj.py to syntax/nonproj.pyx
2016-02-25 15:08:49 +01:00
Matthew Honnibal
1b41f868d2
* Check for errors in parser, and parallelise the left-over batch
2016-02-06 10:06:30 +01:00
Matthew Honnibal
165ca28b80
* Set is_parsed flag in Parser.pipe
2016-02-05 19:51:44 +01:00
Matthew Honnibal
bdd579db0a
* Set is_parsed flag in Parser.pipe
2016-02-05 19:50:11 +01:00
Matthew Honnibal
b04c9aad71
* Fix off-by-one in Parser.pipe
2016-02-05 19:37:50 +01:00
Matthew Honnibal
048dfe35aa
* cimport cython.parallel
2016-02-05 12:20:42 +01:00
Matthew Honnibal
8a13cebdcc
* Update for modified thinc interface
2016-02-05 11:44:39 +01:00
Matthew Honnibal
84b247ef83
* Add a .pipe method, that takes a stream of input, operates on it, and streams the output. Internally, the stream may be buffered, to allow multi-threading.
2016-02-03 02:10:58 +01:00
Matthew Honnibal
b3802562d6
Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2
2016-02-01 08:59:24 +01:00
Matthew Honnibal
4b08a3fafd
* Fix merge conflict
2016-02-01 08:58:18 +01:00
Matthew Honnibal
5188f6d9d8
* Fix parseC function
2016-02-01 08:48:48 +01:00
Matthew Honnibal
bcf8f7ba40
* Add a parse_batch method to Parser, that releases the GIL around a batch of documents.
2016-02-01 08:34:55 +01:00
Matthew Honnibal
490ba65398
* Use openmp in parser
2016-02-01 03:08:42 +01:00
Matthew Honnibal
28e5ad62bc
* Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents
2016-02-01 03:00:15 +01:00
Matthew Honnibal
a47f00901b
* Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents
2016-02-01 02:58:14 +01:00
Matthew Honnibal
daaad66448
* Now fully proxied
2016-02-01 02:37:08 +01:00
Matthew Honnibal
7a0e3bb9c1
* Continue proxying. Some problem currently
2016-02-01 02:22:21 +01:00
Matthew Honnibal
9410e74c92
* Switch parser to use nogil functions
2016-01-30 20:27:07 +01:00
Matthew Honnibal
10877a7791
* Update for thinc 5.0, including changing cost from int to weight_t, and updating the tagger and parser
2016-01-30 14:31:36 +01:00
Matthew Honnibal
84c5dfbfc3
* Clean up debugging python list
2016-01-19 20:10:32 +01:00
Matthew Honnibal
65c5bc4988
* Add add_label method, to allow users to register new entity types and dependency labels.
2016-01-19 19:11:02 +01:00
Matthew Honnibal
3dc398b727
* Fix merge conflict in requirements.txt
2016-01-16 16:20:49 +01:00
Matthew Honnibal
c025a0c64b
* Check for KeyboardInterrupt in parser.__call__
2016-01-16 16:18:44 +01:00
Matthew Honnibal
aec130af56
Use util.Package class for io
...
Previous Sputnik integration caused API change: Vocab, Tagger, etc
were loaded via a from_package classmethod, that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance, in order to acquire a Package via
sp.pool().
Instead I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod, that accepts either
util.Package objects, or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.
Sputnik is now only used to download and install the data, in
spacy.en.download
2015-12-29 18:00:48 +01:00
Matthew Honnibal
6f47074214
* Make constructor of ParserModel and TaggerModel the same as AveragedPerceptron, for easy pickling.
2015-11-07 18:25:17 +11:00
Matthew Honnibal
888c05a7fa
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 11:02:44 +11:00
Matthew Honnibal
fc2185bfe3
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 10:48:31 +11:00
Matthew Honnibal
954442a807
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 10:30:45 +11:00
Matthew Honnibal
19136b0e7d
* Add better debug message for illegal move
2015-11-07 05:34:37 +11:00
Matthew Honnibal
3c162dcac3
* Refactor away from the _ml module, to use thinc 4.0. Still some work needs to be done, e.g. to add __reduce__ to the models, more testing, etc.
2015-11-07 03:24:30 +11:00
Matthew Honnibal
b9991fbd20
* Update to use thinc 3.0
2015-11-06 00:25:59 +11:00
Matthew Honnibal
68f479e821
* Rename Doc.data to Doc.c
2015-11-04 00:15:14 +11:00
Matthew Honnibal
20fd36a0f7
* Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125: allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve.
2015-10-13 13:44:41 +11:00
Matthew Honnibal
86c888667f
* Merge in changes from de branch
2015-09-06 19:49:28 +02:00
Matthew Honnibal
5edac11225
* Wrap self.parse in nogil, and break if an invalid move is predicted. The invalid break is a work-around that papers over likely bugs, but we can't easily break in the nogil block, and otherwise we'll get an infinite loop. Need to set this as an error flag.
2015-09-06 04:15:00 +02:00
Matthew Honnibal
a3d5e6c0dd
* Reform constructor and save/load workflow in parser model
2015-08-26 19:19:01 +02:00
Matthew Honnibal
bf38b3b883
* Hack on l/r reversal bug
2015-08-10 05:58:43 +02:00
Matthew Honnibal
6116413b47
* Fix label prediction in StepwiseState
2015-08-10 05:05:31 +02:00