255 Commits

Author SHA1 Message Date
ines 30ce2a6793 Exclude "shed" and "Shed" from tokenizer exceptions (see #847) 2017-02-18 14:10:44 +01:00
Ines Montani 209c37bbcf Exclude "shell" and "Shell" from English tokenizer exceptions (resolves #775) 2017-01-25 13:15:02 +01:00
Ines Montani 50878ef598 Exclude "were" and "Were" from tokenizer exceptions and add regression test (resolves #744) 2017-01-16 13:10:38 +01:00
Matthew Honnibal 4e48862fa8 Remove print statement 2017-01-12 11:25:39 +01:00
Matthew Honnibal fba67fa342 Fix Issue #736: Times were being tokenized with incorrect string values. 2017-01-12 11:21:01 +01:00
Ines Montani 0dec90e9f7 Use global abbreviation data languages and remove duplicates 2017-01-08 20:36:00 +01:00
Ines Montani cab39c59c5 Add missing contractions to English tokenizer exceptions
Inspired by
https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py
2017-01-05 19:59:06 +01:00
Ines Montani a23504fe07 Move abbreviations below other exceptions 2017-01-05 19:58:07 +01:00
Ines Montani 7d2cf934b9 Generate he/she/it correctly with 's instead of 've 2017-01-05 19:57:00 +01:00
Ines Montani bc911322b3 Move ") to emoticons (see Tweebo challenge test) 2017-01-05 18:05:38 +01:00
Ines Montani 1d237664af Add lowercase lemma to tokenizer exceptions 2017-01-03 23:02:21 +01:00
Ines Montani 84a87951eb Fix typos 2017-01-03 18:27:43 +01:00
Ines Montani 35b39f53c3 Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:26:09 +01:00
Ines Montani 461cbb99d8 Revert "Reorganise English tokenizer exceptions (as discussed in #718)"
This reverts commit b19cfcc144.
2017-01-03 18:21:29 +01:00
Ines Montani b19cfcc144 Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:17:57 +01:00
Ines Montani 78e63dc7d0 Update tokenizer exceptions for English 2016-12-21 18:06:34 +01:00
JM 70ff0639b5 Fixed missing vec_path declaration that was failing if 'add_vectors' was set
Added vec_path variable declaration to avoid accessing it before assignment in case 'add_vectors' is in overrides.
2016-12-20 18:21:05 +01:00
Matthew Honnibal 13a0b31279 Another tweak to GloVe path hackery. 2016-12-18 23:12:49 +01:00
Matthew Honnibal 2c6228565e Fix vector loading re glove hack 2016-12-18 23:06:44 +01:00
Matthew Honnibal 618b50a064 Fix issue #684: GloVe vectors not loaded in spacy.en.English. 2016-12-18 22:46:31 +01:00
Matthew Honnibal 2ef9d53117 Untested fix for issue #684: GloVe vectors hack should be inserted in English, not in spacy.load. 2016-12-18 22:29:31 +01:00
Matthew Honnibal 7a98ee5e5a Merge language data change 2016-12-18 17:03:52 +01:00
Ines Montani b99d683a93 Fix formatting 2016-12-18 16:58:28 +01:00
Ines Montani b11d8cd3db Merge remote-tracking branch 'origin/organize-language-data' into organize-language-data 2016-12-18 16:57:12 +01:00
Ines Montani 2b2ea8ca11 Reorganise language data 2016-12-18 16:54:19 +01:00
Matthew Honnibal 44f4f008bd Wire up lemmatizer rules for English 2016-12-18 15:50:09 +01:00
Ines Montani 1bff59a8db Update English language data 2016-12-18 15:36:53 +01:00
Ines Montani 2eb163c5dd Add lemma rules 2016-12-18 15:36:53 +01:00
Ines Montani 29ad8143d8 Add morph rules 2016-12-18 15:36:53 +01:00
Ines Montani 704c7442e0 Break language data components into their own files 2016-12-18 15:36:53 +01:00
Ines Montani 28326649f3 Fix typo 2016-12-18 13:30:03 +01:00
Matthew Honnibal 28d63ec58e Restore missing '' character in tokenizer exceptions. 2016-12-18 05:34:51 +01:00
Ines Montani a9421652c9 Remove duplicates in tag map 2016-12-17 22:44:31 +01:00
Ines Montani 577adad945 Fix formatting 2016-12-17 14:00:52 +01:00
Ines Montani bb94e784dc Fix typo 2016-12-17 13:59:30 +01:00
Ines Montani a22322187f Add missing lemmas to tokenizer exceptions (fixes #674) 2016-12-17 12:42:41 +01:00
Ines Montani 5445074cbd Expand tokenizer exceptions with unicode apostrophe (fixes #685) 2016-12-17 12:34:08 +01:00
Ines Montani e0a7b5c612 Fix formatting 2016-12-17 12:33:09 +01:00
Ines Montani 08162dce67 Move shared functions and constants to global language data 2016-12-17 12:32:48 +01:00
Ines Montani 6a60a61086 Move update_exc to global language data utils 2016-12-17 12:29:02 +01:00
Ines Montani 487ce1e20a Add encoding declaration 2016-12-17 12:25:44 +01:00
Ines Montani d8d50a0334 Add tokenizer exception for "gonna" (fixes #691) 2016-12-17 11:59:28 +01:00
Ines Montani c69b77d8aa Revert "Add exception for "gonna""
This reverts commit 280c03f67b.
2016-12-17 11:56:44 +01:00
Ines Montani 280c03f67b Add exception for "gonna" 2016-12-17 11:54:59 +01:00
Ines Montani c0c5f31950 Remove unused data and download script 2016-12-08 20:39:49 +01:00
Ines Montani 0c39654786 Remove unused import 2016-12-08 19:46:53 +01:00
Ines Montani e47ee94761 Split punctuation into its own file 2016-12-08 19:46:43 +01:00
Ines Montani 311b30ab35 Reorganize exceptions for English and German 2016-12-08 13:58:32 +01:00
Ines Montani 877f09218b Add more custom rules for abbreviations 2016-12-08 12:47:01 +01:00
Ines Montani ec44bee321 Fix capitalization on morphological features 2016-12-08 12:00:54 +01:00