Commit Graph

55 Commits

Author SHA1 Message Date
adrianeboyd d359da9687 Replace Entity/MatchStruct with SpanC (#4459)
* Replace MatchStruct with Entity

Replace MatchStruct with Entity since the existing Entity struct is
nearly identical.

* Replace Entity with more general SpanC
2019-10-18 11:01:47 +02:00
Matthew Honnibal bcd08f20af Merge changes from master 2019-08-21 14:18:52 +02:00
svlandeg dae8a21282 rename entity frequency 2019-07-19 17:40:28 +02:00
svlandeg dbc53b9870 rename to KBEntryC 2019-06-26 15:55:26 +02:00
svlandeg 5c723c32c3 entity vectors in the KB + serialization of them 2019-06-05 18:29:18 +02:00
svlandeg 6e3223f234 bulk loading in proper order of entity indices 2019-04-24 11:26:38 +02:00
svlandeg 735fc2a735 annotate kb_id through ents in doc 2019-03-22 11:36:44 +01:00
svlandeg d849eb2455 adding kb_id as field to token, el as nlp pipeline component 2019-03-22 11:34:46 +01:00
Matthew Honnibal 9a2d1cc6e0 Add length attribute to MorphAnalysisC 2019-03-08 00:08:57 +01:00
Matthew Honnibal fed0371db7 Remove enums from morphology 2019-03-07 17:14:57 +01:00
Matthew Honnibal b9ade7d4e0 Add MorphAnalysisC struct 2019-03-07 14:03:07 +01:00
Matthew Honnibal 3993f41cc4 Update morphology branch from develop 2019-03-07 00:14:43 +01:00
Matthew Honnibal 8aa7882762
Make NORM a token attribute (#3029)
See #3028. The solution in this patch is pretty debateable.

What we do is give the TokenC struct a .norm field, by repurposing the previously idle .sense attribute. It's nice to repurpose a previous field because it means the TokenC doesn't change size, so even if someone's using the internals very deeply, nothing will break.

The weird thing here is that the TokenC and the LexemeC both have an attribute named NORM. This arguably assists in backwards compatibility. On the other hand, maybe it's really bad! We're changing the semantics of the attribute subtly, so maybe it's better if someone calling lex.norm gets a breakage, and instead is told to write lex.default_norm?

Overall I believe this patch makes the NORM feature work the way we sort of expected it to work. Certainly it's much more like how the docs describe it, and more in line with how we've been directing people to use the norm attribute. We'll also be able to use token.norm to do stuff like spelling correction, which is pretty cool.
2018-12-08 10:49:10 +01:00
Matthew Honnibal 3bba8e9245 Update structs 2018-09-24 23:58:08 +02:00
Matthew Honnibal 18063803de Make TokenC.sent_tart an int, to allow ternary value 2017-10-08 19:58:54 +02:00
Matthew Honnibal 84e66ca6d4 WIP on stringstore change. 27 failures 2017-05-28 14:06:40 +02:00
Matthew Honnibal f51e6a6c16 Adjust lexeme sizing for attr_t being 64 bit 2017-05-28 12:51:09 +02:00
Matthew Honnibal 3ea98e2043 Remove vector member from lexeme 2017-05-28 11:46:24 +02:00
Matthew Honnibal 793430aa7a Get spaCy train command working with neural network
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
Matthew Honnibal 58e83fe34b Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match. 2016-09-21 14:54:55 +02:00
Wolfgang Seeker 03fb498dbe introduce lang field for LexemeC to hold language id
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Matthew Honnibal 9ec7b9c454 * Clean up unused Constituent struct. 2015-11-03 23:48:21 +11:00
Matthew Honnibal 1e99fcd413 * Rename .repvec to .vector in C API 2015-11-03 23:47:59 +11:00
Matthew Honnibal 7ac6cacc26 * Remove const qualifier on LexemeC.repvec 2015-09-15 14:42:51 +10:00
Matthew Honnibal c2307fa9ee * More work on language-generic parsing 2015-08-28 02:02:33 +02:00
Matthew Honnibal 1d7f2d3abc * Hack on morphology structs 2015-08-26 19:18:36 +02:00
Matthew Honnibal 815bda201d * Remove UniStr struct 2015-07-22 13:39:17 +02:00
Matthew Honnibal 128b6d9714 * Move Utf8Str struct to strings module, as that's the only place it's relevant 2015-07-20 12:06:41 +02:00
Matthew Honnibal 4dddc8a69b * Fix type declarations for attr_t. Remove unused id_t. 2015-07-18 22:39:57 +02:00
Matthew Honnibal 95e57c2780 * Remove unnecessary key and id properties from Utf8String. 2015-07-17 01:40:18 +02:00
Matthew Honnibal aa82caf8f5 * Add TokenC.spacy attr 2015-07-13 19:48:07 +02:00
Matthew Honnibal 1d3a592edf * Remove the senses attr from LexemeC, to keep data compatibility 2015-07-08 19:24:44 +02:00
Matthew Honnibal e23d1582a2 * Add supersense data to Lexeme objects. Add simple has_sense method to check the flag. 2015-07-01 18:50:37 +02:00
Matthew Honnibal a7bf7b0626 * Rename sent_start to sent_end, to reflect its new usage in the Break transition 2015-06-23 05:39:43 +02:00
Matthew Honnibal 8ee7c541f1 * Update Constituent definition 2015-05-20 16:03:26 +02:00
Matthew Honnibal 03a6626545 * Tmp commit 2015-05-12 20:27:56 +02:00
Matthew Honnibal d2ac8d8007 * Add ctnt field to State, in preparation for constituency parsing 2015-05-12 20:27:56 +02:00
Matthew Honnibal d634038eb6 * Add l_edge and r_edge props in TokenC for tracking the parse-yield of the token 2015-05-12 20:26:41 +02:00
Jordan Suchow 3a8d9b37a6 Remove trailing whitespace 2015-04-19 13:01:38 -07:00
Matthew Honnibal 8057a95f20 * NER seems to be working, scoring 69 F. Need to add decision-history features --- currently only use current word, 2 words context. Need refactoring. 2015-03-26 16:44:44 +01:00
Matthew Honnibal b3eda03c9c * Tmp 2015-03-26 16:44:44 +01:00
Matthew Honnibal 135756ac3d * Tmp commit of NER refactoring 2015-03-26 16:44:42 +01:00
Matthew Honnibal b139aa92ba * Start setting out how NER will be implemented in the data model 2015-03-26 16:44:41 +01:00
Matthew Honnibal 75f9b7d6bf * Add L2 norm field to LexemeC struct 2015-02-07 08:43:17 -05:00
Matthew Honnibal 08ca5c8970 * Add sent_end flag to TokenC struct 2015-01-31 13:44:16 +11:00
Matthew Honnibal 12b034e3ef * Move POS tag definitions to parts_of_speech.pxd 2015-01-25 16:31:07 +11:00
Matthew Honnibal fda94271af * Rename NORM1 and NORM2 attrs to lower and norm 2015-01-24 06:17:03 +11:00
Matthew Honnibal 5ed8b2b98f * Rename sic to orth 2015-01-23 02:08:25 +11:00
Matthew Honnibal 45264e356b * Rename vec to repvec 2015-01-22 02:04:24 +11:00
Matthew Honnibal 6c7e44140b * Work on word vectors, and other stuff 2015-01-17 16:21:17 +11:00