spaCy

Author	SHA1	Message	Date
ines	30ce2a6793	Exclude "shed" and "Shed" from tokenizer exceptions (see #847)	2017-02-18 14:10:44 +01:00
Ines Montani	209c37bbcf	Exclude "shell" and "Shell" from English tokenizer exceptions (resolves #775)	2017-01-25 13:15:02 +01:00
Ines Montani	50878ef598	Exclude "were" and "Were" from tokenizer exceptions and add regression test (resolves #744)	2017-01-16 13:10:38 +01:00
Matthew Honnibal	4e48862fa8	Remove print statement	2017-01-12 11:25:39 +01:00
Matthew Honnibal	fba67fa342	Fix Issue #736: Times were being tokenized with incorrect string values.	2017-01-12 11:21:01 +01:00
Ines Montani	0dec90e9f7	Use global abbreviation data languages and remove duplicates	2017-01-08 20:36:00 +01:00
Ines Montani	cab39c59c5	Add missing contractions to English tokenizer exceptions Inspired by https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py	2017-01-05 19:59:06 +01:00
Ines Montani	a23504fe07	Move abbreviations below other exceptions	2017-01-05 19:58:07 +01:00
Ines Montani	7d2cf934b9	Generate he/she/it correctly with 's instead of 've	2017-01-05 19:57:00 +01:00
Ines Montani	bc911322b3	Move ") to emoticons (see Tweebo challenge test)	2017-01-05 18:05:38 +01:00
Ines Montani	1d237664af	Add lowercase lemma to tokenizer exceptions	2017-01-03 23:02:21 +01:00
Ines Montani	84a87951eb	Fix typos	2017-01-03 18:27:43 +01:00
Ines Montani	35b39f53c3	Reorganise English tokenizer exceptions (as discussed in #718) Add logic to generate exceptions that follow a consistent pattern (like verbs and pronouns) and allow certain tokens to be excluded explicitly.	2017-01-03 18:26:09 +01:00
Ines Montani	461cbb99d8	Revert "Reorganise English tokenizer exceptions (as discussed in #718)" This reverts commit `b19cfcc144`.	2017-01-03 18:21:29 +01:00
Ines Montani	b19cfcc144	Reorganise English tokenizer exceptions (as discussed in #718) Add logic to generate exceptions that follow a consistent pattern (like verbs and pronouns) and allow certain tokens to be excluded explicitly.	2017-01-03 18:17:57 +01:00
Ines Montani	78e63dc7d0	Update tokenizer exceptions for English	2016-12-21 18:06:34 +01:00
JM	70ff0639b5	Fixed missing vec_path declaration that was failing if 'add_vectors' was set Added vec_path variable declaration to avoid accessing it before assignment in case 'add_vectors' is in overrides.	2016-12-20 18:21:05 +01:00
Matthew Honnibal	13a0b31279	Another tweak to GloVe path hackery.	2016-12-18 23:12:49 +01:00
Matthew Honnibal	2c6228565e	Fix vector loading re glove hack	2016-12-18 23:06:44 +01:00
Matthew Honnibal	618b50a064	Fix issue #684: GloVe vectors not loaded in spacy.en.English.	2016-12-18 22:46:31 +01:00
Matthew Honnibal	2ef9d53117	Untested fix for issue #684: GloVe vectors hack should be inserted in English, not in spacy.load.	2016-12-18 22:29:31 +01:00
Matthew Honnibal	7a98ee5e5a	Merge language data change	2016-12-18 17:03:52 +01:00
Ines Montani	b99d683a93	Fix formatting	2016-12-18 16:58:28 +01:00
Ines Montani	b11d8cd3db	Merge remote-tracking branch 'origin/organize-language-data' into organize-language-data	2016-12-18 16:57:12 +01:00
Ines Montani	2b2ea8ca11	Reorganise language data	2016-12-18 16:54:19 +01:00
Matthew Honnibal	44f4f008bd	Wire up lemmatizer rules for English	2016-12-18 15:50:09 +01:00
Ines Montani	1bff59a8db	Update English language data	2016-12-18 15:36:53 +01:00
Ines Montani	2eb163c5dd	Add lemma rules	2016-12-18 15:36:53 +01:00
Ines Montani	29ad8143d8	Add morph rules	2016-12-18 15:36:53 +01:00
Ines Montani	704c7442e0	Break language data components into their own files	2016-12-18 15:36:53 +01:00
Ines Montani	28326649f3	Fix typo	2016-12-18 13:30:03 +01:00
Matthew Honnibal	28d63ec58e	Restore missing '' character in tokenizer exceptions.	2016-12-18 05:34:51 +01:00
Ines Montani	a9421652c9	Remove duplicates in tag map	2016-12-17 22:44:31 +01:00
Ines Montani	577adad945	Fix formatting	2016-12-17 14:00:52 +01:00
Ines Montani	bb94e784dc	Fix typo	2016-12-17 13:59:30 +01:00
Ines Montani	a22322187f	Add missing lemmas to tokenizer exceptions (fixes #674)	2016-12-17 12:42:41 +01:00
Ines Montani	5445074cbd	Expand tokenizer exceptions with unicode apostrophe (fixes #685)	2016-12-17 12:34:08 +01:00
Ines Montani	e0a7b5c612	Fix formatting	2016-12-17 12:33:09 +01:00
Ines Montani	08162dce67	Move shared functions and constants to global language data	2016-12-17 12:32:48 +01:00
Ines Montani	6a60a61086	Move update_exc to global language data utils	2016-12-17 12:29:02 +01:00
Ines Montani	487ce1e20a	Add encoding declaration	2016-12-17 12:25:44 +01:00
Ines Montani	d8d50a0334	Add tokenizer exception for "gonna" (fixes #691)	2016-12-17 11:59:28 +01:00
Ines Montani	c69b77d8aa	Revert "Add exception for "gonna"" This reverts commit `280c03f67b`.	2016-12-17 11:56:44 +01:00
Ines Montani	280c03f67b	Add exception for "gonna"	2016-12-17 11:54:59 +01:00
Ines Montani	c0c5f31950	Remove unused data and download script	2016-12-08 20:39:49 +01:00
Ines Montani	0c39654786	Remove unused import	2016-12-08 19:46:53 +01:00
Ines Montani	e47ee94761	Split punctuation into its own file	2016-12-08 19:46:43 +01:00
Ines Montani	311b30ab35	Reorganize exceptions for English and German	2016-12-08 13:58:32 +01:00
Ines Montani	877f09218b	Add more custom rules for abbreviations	2016-12-08 12:47:01 +01:00
Ines Montani	ec44bee321	Fix capitalization on morphological features	2016-12-08 12:00:54 +01:00

255 Commits