Commit Graph

466 Commits

Author SHA1 Message Date
Matthew Honnibal af81ac8bb0 Use thinc 6.0 2016-12-29 11:58:42 +01:00
Matthew Honnibal bc0a202c9c Fix unicode problem in nonproj module 2016-11-25 17:29:17 -06:00
Matthew Honnibal 159e8c46e1 Merge old training fixes with newer state 2016-11-25 09:16:36 -06:00
Matthew Honnibal 39341598bb Fix NER label calculation 2016-11-25 09:02:22 -06:00
Matthew Honnibal ca773a1f53 Tweak arc_eager n_gold to deal with negative costs, and improve error message. 2016-11-25 09:01:52 -06:00
Matthew Honnibal 608d8f5421 Pass cfg through parser, and have is_valid default to 1, not 0 when resetting state 2016-11-25 09:00:21 -06:00
Matthew Honnibal b8c4f5ea76 Allow German noun chunks to work on Span
Update the German noun chunks iterator, so that it also works on Span objects.
2016-11-24 23:30:15 +11:00
Pokey Rule 3e3bda142d Add noun_chunks to Span 2016-11-24 10:47:20 +00:00
Matthew Honnibal b86f8af0c1 Fix doc strings 2016-11-01 12:25:36 +01:00
Matthew Honnibal 708ea22208 Infer types in transition_system.pyx 2016-10-27 18:08:13 +02:00
Matthew Honnibal 301f3cc898 Fix Issue #429. Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found. 2016-10-27 18:01:55 +02:00
Matthew Honnibal 03a520ec4f Change signature of Parser.parseC, so that nr_class is read from the transition system. This allows the transition system to modify the number of actions in initialize_state. 2016-10-27 17:58:56 +02:00
Matthew Honnibal a209b10579 Improve error message when oracle fails for non-projective trees, re Issue #571. 2016-10-24 20:31:30 +02:00
Matthew Honnibal 3e688e6d4b Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness. 2016-10-23 17:45:44 +02:00
Matthew Honnibal 59038f7efa Restore support for prior data format -- specifically, the labels field of the config. 2016-10-17 00:53:26 +02:00
Matthew Honnibal 7887ab3b36 Fix default use of feature_templates in parser 2016-10-16 21:41:56 +02:00
Matthew Honnibal f787cd29fe Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor. 2016-10-16 21:34:57 +02:00
Matthew Honnibal 274a4d4272 Fix queue Python property in StateClass 2016-10-16 17:04:41 +02:00
Matthew Honnibal e8c8aa08ce Make action_name optional in StepwiseState 2016-10-16 17:04:16 +02:00
Matthew Honnibal 4fc56d4a31 Rename 'labels' to 'actions' in parser options 2016-10-16 11:42:26 +02:00
Matthew Honnibal 3259a63779 Whitespace 2016-10-16 01:47:28 +02:00
Matthew Honnibal d9ae2d68af Load features by string-name for backwards compatibility. 2016-10-12 20:15:11 +02:00
Matthew Honnibal 3a03c668c3 Fix message in ParserStateError 2016-10-12 14:44:31 +02:00
Matthew Honnibal 6bf505e865 Fix error on ParserStateError 2016-10-12 14:35:55 +02:00
Matthew Honnibal ea23b64cc8 Refactor training, with new spacy.train module. Defaults still a little awkward. 2016-10-09 12:24:24 +02:00
Matthew Honnibal 1d70db58aa Revert "Changes to iterators.pyx for new StringStore scheme"
This reverts commit 4f794b215a.
2016-09-30 20:19:53 +02:00
Matthew Honnibal 9e09b39b9f Revert "Changes to transition systems for new StringStore scheme"
This reverts commit 0442e0ab1e.
2016-09-30 20:11:49 +02:00
Matthew Honnibal e3285f6f30 Revert "Fix report of ParserStateError"
This reverts commit 78f19baafa.
2016-09-30 20:11:33 +02:00
Matthew Honnibal 78f19baafa Fix report of ParserStateError 2016-09-30 19:59:22 +02:00
Matthew Honnibal 0442e0ab1e Changes to transition systems for new StringStore scheme 2016-09-30 19:58:51 +02:00
Matthew Honnibal 4f794b215a Changes to iterators.pyx for new StringStore scheme 2016-09-30 19:57:49 +02:00
Matthew Honnibal 4cbf0d3bb6 Handle errors when no valid actions are available, pointing users to the issue tracker. 2016-09-27 19:19:53 +02:00
Matthew Honnibal 430473bd98 Raise errors when no actions are available, re Issue #429 2016-09-27 19:09:37 +02:00
Matthew Honnibal 8e7df3c4ca Expect the parser data, if parser.load() is called. 2016-09-27 14:02:12 +02:00
Matthew Honnibal a44763af0e Fix Issue #469: Incorrectly cased root label in noun chunk iterator 2016-09-27 13:13:01 +02:00
Matthew Honnibal e07b9665f7 Don't expect parser model 2016-09-26 18:09:33 +02:00
Matthew Honnibal ee6fa106da Fix parser features 2016-09-26 17:57:32 +02:00
Matthew Honnibal e607e4b598 Fix parser loading 2016-09-26 17:51:11 +02:00
Matthew Honnibal 2debc4e0a2 Add .blank() method to Parser. Start housing default dep labels and entity types within the Defaults class. 2016-09-26 11:57:54 +02:00
Matthew Honnibal fd65cf6cbb Finish refactoring data loading 2016-09-24 20:26:17 +02:00
Matthew Honnibal 83e364188c Mostly finished loading refactoring. Design is in place, but doesn't work yet. 2016-09-24 15:42:01 +02:00
Matthew Honnibal 60fdf4d5f1 Remove commented out debuggng code 2016-09-24 01:17:18 +02:00
Matthew Honnibal 070af4af9d Revert "* Working neural net, but features hacky. Switching to extractor."
This reverts commit 7c2f1a673b.
2016-09-21 12:26:14 +02:00
Matthew Honnibal 7c2f1a673b * Working neural net, but features hacky. Switching to extractor. 2016-05-26 19:06:10 +02:00
Matthew Honnibal 13fad36e49 * Cosmetic change to english noun chunks iterator -- use enumerate instead of range loop 2016-05-20 10:11:05 +02:00
Wolfgang Seeker 7b78239436 add fix for German noun chunk iterator (issue #365) 2016-05-06 01:41:26 +02:00
Matthew Honnibal bb94022975 * Fix Issue #365: Error introduced during noun phrase chunking, due to use of corrected PRON/PROPN/etc tags. 2016-05-06 00:21:05 +02:00
Wolfgang Seeker dbf8f5f3ec fix bug in StateC.set_break() 2016-05-05 15:15:34 +02:00
Wolfgang Seeker 3c44b5dc1a call deprojectivization after parsing 2016-05-05 15:10:36 +02:00
Matthew Honnibal 472f576b82 * Deprojectivize German parses 2016-05-05 15:01:10 +02:00
Wolfgang Seeker e4ea2bea01 fix whitespace 2016-05-04 07:40:38 +02:00
Wolfgang Seeker 5bf2fd1f78 make the code less cryptic 2016-05-03 17:19:05 +02:00
Wolfgang Seeker a06fca9fdf German noun chunk iterator now doesn't return tokens more than once 2016-05-03 16:58:59 +02:00
Wolfgang Seeker 7b246c13cb reformulate noun chunk tests for English 2016-05-03 14:24:35 +02:00
Matthew Honnibal 1f1532142f * Fix cost calculation on non-monotonic oracle 2016-05-03 00:21:08 +02:00
Matthew Honnibal 508fd1f6dc * Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples. 2016-05-02 14:25:10 +02:00
Matthew Honnibal 77609588b6 * Fix assignment of root label to words left as root implicitly, after parsing ends. 2016-04-25 19:41:59 +00:00
Matthew Honnibal 7c2d2deaa7 * Revise transition system so that the Break transition retains sole responsibility for setting sentence boundaries. Re Issue #322 2016-04-25 19:41:59 +00:00
Wolfgang Seeker 12024b0b0a bugfix: introducing multiple roots now updates original head's properties
adjust tests to rely less on statistical model
2016-04-20 16:42:41 +02:00
Wolfgang Seeker b98cc3266d bugfix: iterators now reset properly when called a second time 2016-04-15 17:49:16 +02:00
Wolfgang Seeker 289b10f441 remove some comments 2016-04-14 15:37:51 +02:00
Wolfgang Seeker d99a9cbce9 different handling of space tokens
space tokens are now always attached to the previous non-space token
there are two exceptions:
leading space tokens are attached to the first following non-space token
in input that consists exclusively of space tokens, the last space token
is the head of all others.
2016-04-13 15:28:28 +02:00
Wolfgang Seeker d328e0b4a8 Merge branch 'master' into space_head_bug 2016-04-11 12:11:01 +02:00
Wolfgang Seeker 80bea62842 bugfix in unit test 2016-04-08 16:46:44 +02:00
Wolfgang Seeker 1fe911cdb0 bigfix 2016-04-07 18:19:51 +02:00
Matthew Honnibal 872695759d Merge pull request #306 from wbwseeker/german_noun_chunks
add German noun chunk functionality
2016-04-08 00:54:24 +10:00
Wolfgang Seeker 7195b6742d add restrictions to L-arc and R-arc to prevent space heads 2016-03-28 10:40:52 +02:00
Wolfgang Seeker 5e2e8e951a add baseclass DocIterator for iterators over documents
add classes for English and German noun chunks

the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Wolfgang Seeker 46e3f979f1 add function for setting head and label to token
change PseudoProjectivity.deprojectivize to use these functions
2016-03-11 17:31:06 +01:00
Wolfgang Seeker 7adbd7a785 replace Counter with normal dict 2016-03-03 21:36:27 +01:00
Wolfgang Seeker 1ae487a4f6 add backwards compatibility with python 2.6 2016-03-03 21:18:12 +01:00
Wolfgang Seeker 72b8df0684 turned PseudoProjectivity into a normal python class 2016-03-03 19:05:08 +01:00
Wolfgang Seeker 690c5acabf adjust train.py to train both english and german models 2016-03-03 15:21:00 +01:00
Wolfgang Seeker 3448cb40a4 integrated pseudo-projective parsing into parser
- nonproj.pyx holds a class PseudoProjectivity which currently holds
  all functionality to implement Nivre & Nilsson 2005's pseudo-projective
  parsing using the HEAD decoration scheme
- changed lefts/rights in Token to account for possible non-projective
  structures
2016-03-01 10:09:08 +01:00
Wolfgang Seeker 56b7210e82 moved nonproj.py to syntax/nonproj.pyx 2016-02-25 15:08:49 +01:00
Matthew Honnibal 1b83cb9dfa * Fix Issue #251: Incorrect right edge calculation on left-clobber low in the tree 2016-02-07 00:00:42 +01:00
Matthew Honnibal 4412a70dc5 * Initialize StateC._empty_token to 0, to avoid undefined behaviour. 2016-02-06 13:34:38 +01:00
Matthew Honnibal 1b41f868d2 * Check for errors in parser, and parallelise the left-over batch 2016-02-06 10:06:30 +01:00
Matthew Honnibal 165ca28b80 * Set is_parsed flag in Parser.pipe 2016-02-05 19:51:44 +01:00
Matthew Honnibal bdd579db0a * Set is_parsed flag in Parser.pipe 2016-02-05 19:50:11 +01:00
Matthew Honnibal b04c9aad71 * Fix off-by-one in Parser.pipe 2016-02-05 19:37:50 +01:00
Matthew Honnibal 048dfe35aa * cimport cython.parallel 2016-02-05 12:20:42 +01:00
Matthew Honnibal 8a13cebdcc * Update for modified thinc interface 2016-02-05 11:44:39 +01:00
Matthew Honnibal 84b247ef83 * Add a .pipe method, that takes a stream of input, operates on it, and streams the output. Internally, the stream may be buffered, to allow multi-threading. 2016-02-03 02:10:58 +01:00
Matthew Honnibal e3db39dd21 * Fix compiler warning about signed/unsigned comparison 2016-02-01 09:08:07 +01:00
Matthew Honnibal b3802562d6 Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2 2016-02-01 08:59:24 +01:00
Matthew Honnibal 4b08a3fafd * Fix merge conflict 2016-02-01 08:58:18 +01:00
Matthew Honnibal 5188f6d9d8 * Fix parseC function 2016-02-01 08:48:48 +01:00
Matthew Honnibal bcf8f7ba40 * Add a parse_batch method to Parser, that releases the GIL around a batch of documents. 2016-02-01 08:34:55 +01:00
Matthew Honnibal d5579cd0d8 Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2 2016-02-01 03:08:49 +01:00
Matthew Honnibal 490ba65398 * Use openmp in parser 2016-02-01 03:08:42 +01:00
Matthew Honnibal cb78d91ec5 * Fix ArcEager.set_valid 2016-02-01 03:07:37 +01:00
Matthew Honnibal 28e5ad62bc * Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents 2016-02-01 03:00:15 +01:00
Matthew Honnibal a47f00901b * Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents 2016-02-01 02:58:14 +01:00
Matthew Honnibal daaad66448 * Now fully proxied 2016-02-01 02:37:08 +01:00
Matthew Honnibal 7a0e3bb9c1 * Continue proxying. Some problem currently 2016-02-01 02:22:21 +01:00
Matthew Honnibal 2169bbb7ea * Shadow StateClass with StateC, to start proxying 2016-02-01 01:16:14 +01:00
Matthew Honnibal 2fa228458e * Add _state file, which StateClass will proxy to 2016-02-01 01:09:21 +01:00
Matthew Honnibal 9410e74c92 * Switch parser to use nogil functions 2016-01-30 20:27:07 +01:00
Matthew Honnibal 10877a7791 * Update for thinc 5.0, including changing cost from int to weight_t, and updating the tagger and parser 2016-01-30 14:31:36 +01:00