Commit Graph

146 Commits

Author SHA1 Message Date
Matthew Honnibal 2512ea9eeb Fix memory leak in beam parser 2017-11-14 02:11:40 +01:00
ines b4d226a3f1 Tidy up syntax 2017-10-27 19:45:57 +02:00
Matthew Honnibal f111b228e0 Fix re-parsing of previously parsed text
If a Doc object had been previously parsed, it was possible for
invalid parses to be added. There were two problems:

1) The parse was only being partially erased
2) The RightArc action was able to create a 1-cycle.

This patch fixes both errors, and avoids resetting the parse if one is
present. In theory this might allow a better parse to be predicted by
running the parser twice.

Closes #1253.
2017-10-20 16:27:36 +02:00
Matthew Honnibal 4cc84b0234 Prohibit Break when sent_start < 0 2017-10-09 00:02:45 +02:00
Matthew Honnibal e938bce320 Adjust parsing transition system to allow preset sentence segments. 2017-10-08 23:53:34 +02:00
Matthew Honnibal 44589fb38c Fix Break oracle 2017-08-25 19:50:55 -05:00
Matthew Honnibal 20dd66ddc2 Constrain sentence boundaries to IS_PUNCT and IS_SPACE tokens 2017-08-25 19:35:47 +02:00
Matthew Honnibal c307a0ffb8 Restore patches from nn-beam-parser to spacy/syntax 2017-08-18 22:38:59 +02:00
Matthew Honnibal 5f81d700ff Restore patches from nn-beam-parser to spacy/syntax 2017-08-18 22:23:03 +02:00
Matthew Honnibal 426f84937f Resolve conflicts when merging new beam parsing stuff 2017-08-18 13:38:32 -05:00
Matthew Honnibal a6d8d7c82e Add is_gold_parse method to transition system 2017-08-16 18:24:09 -05:00
Matthew Honnibal 52c180ecf5 Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit ea8de11ad5, reversing
changes made to 08e443e083.
2017-08-14 13:00:23 +02:00
Matthew Honnibal 78498a072d Return Transition for missing actions in lookup_action 2017-08-06 14:16:36 +02:00
Matthew Honnibal 8fce187de4 Fix ArcEager for missing values 2017-08-01 22:10:05 +02:00
Matthew Honnibal 3da1063b36 Add beam decoding to parser, to allow NER uncertainties 2017-07-20 15:02:55 +02:00
Matthew Honnibal be4a640f0c Fix arc eager label costs for uint64 2017-05-30 20:37:58 +02:00
Matthew Honnibal 84e66ca6d4 WIP on stringstore change. 27 failures 2017-05-28 14:06:40 +02:00
Matthew Honnibal 39293ab2ee Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-28 11:46:57 +02:00
Matthew Honnibal dd052572d4 Update arc eager for SBD changes 2017-05-28 11:46:51 +02:00
Matthew Honnibal 99316fa631 Use ordered dict to specify actions 2017-05-27 15:50:21 -05:00
Matthew Honnibal 3d5a536eaa Improve efficiency of parser batching 2017-05-26 11:31:23 -05:00
Matthew Honnibal e2136232f9 Exclude states with no matching gold annotations from parsing 2017-05-22 10:30:12 -05:00
Matthew Honnibal aae97f00e9 Fix nonproj import 2017-05-22 05:15:06 -05:00
Matthew Honnibal 1d5d9838a2 Fix action collection for parser 2017-05-22 04:51:08 -05:00
Matthew Honnibal 8b04b0af9f Remove freqs from transition_system 2017-05-20 02:20:48 -05:00
ines 0739ae7b76 Tidy up and fix formatting and imports 2017-04-15 13:05:15 +02:00
Matthew Honnibal 354458484c WIP on add_label bug during NER training
Currently when a new label is introduced to NER during training,
it causes the labels to be read in in an unexpected order. This
invalidates the model.
2017-04-14 23:52:17 +02:00
Matthew Honnibal 47a3ef06a6 Unhack deprojetivization, moving it into pipeline
Previously the deprojectivize() call was attached to the transition
system, and only called for German. Instead it should be a separate
process, called after the parser. This makes it available for any
language. Closes #898.
2017-03-31 12:31:50 +02:00
Matthew Honnibal 4ef68c413f Approximate cost in Break transition, to speed things up a bit. 2017-03-15 16:40:27 -05:00
Matthew Honnibal 931feb3360 Allow beam parsing for NER 2017-03-11 11:12:01 -06:00
Matthew Honnibal d11f1a4ddf Record negative costs in non-monotonic arc eager oracle 2017-03-10 11:22:04 -06:00
Matthew Honnibal ca773a1f53 Tweak arc_eager n_gold to deal with negative costs, and improve error message. 2016-11-25 09:01:52 -06:00
Matthew Honnibal a209b10579 Improve error message when oracle fails for non-projective trees, re Issue #571. 2016-10-24 20:31:30 +02:00
Matthew Honnibal 59038f7efa Restore support for prior data format -- specifically, the labels field of the config. 2016-10-17 00:53:26 +02:00
Matthew Honnibal f787cd29fe Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor. 2016-10-16 21:34:57 +02:00
Matthew Honnibal ea23b64cc8 Refactor training, with new spacy.train module. Defaults still a little awkward. 2016-10-09 12:24:24 +02:00
Wolfgang Seeker 3c44b5dc1a call deprojectivization after parsing 2016-05-05 15:10:36 +02:00
Matthew Honnibal 472f576b82 * Deprojectivize German parses 2016-05-05 15:01:10 +02:00
Matthew Honnibal 1f1532142f * Fix cost calculation on non-monotonic oracle 2016-05-03 00:21:08 +02:00
Matthew Honnibal 508fd1f6dc * Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples. 2016-05-02 14:25:10 +02:00
Matthew Honnibal 77609588b6 * Fix assignment of root label to words left as root implicitly, after parsing ends. 2016-04-25 19:41:59 +00:00
Matthew Honnibal 7c2d2deaa7 * Revise transition system so that the Break transition retains sole responsibility for setting sentence boundaries. Re Issue #322 2016-04-25 19:41:59 +00:00
Wolfgang Seeker 12024b0b0a bugfix: introducing multiple roots now updates original head's properties
adjust tests to rely less on statistical model
2016-04-20 16:42:41 +02:00
Wolfgang Seeker 289b10f441 remove some comments 2016-04-14 15:37:51 +02:00
Wolfgang Seeker d99a9cbce9 different handling of space tokens
space tokens are now always attached to the previous non-space token
there are two exceptions:
leading space tokens are attached to the first following non-space token
in input that consists exclusively of space tokens, the last space token
is the head of all others.
2016-04-13 15:28:28 +02:00
Wolfgang Seeker 7195b6742d add restrictions to L-arc and R-arc to prevent space heads 2016-03-28 10:40:52 +02:00
Matthew Honnibal bcf8f7ba40 * Add a parse_batch method to Parser, that releases the GIL around a batch of documents. 2016-02-01 08:34:55 +01:00
Matthew Honnibal cb78d91ec5 * Fix ArcEager.set_valid 2016-02-01 03:07:37 +01:00
Matthew Honnibal a47f00901b * Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents 2016-02-01 02:58:14 +01:00
Matthew Honnibal daaad66448 * Now fully proxied 2016-02-01 02:37:08 +01:00