Commit Graph

4011 Commits

Author SHA1 Message Date
Ines Montani de5aa92bc2 Handle deprecated tokenizer prefix data 2017-01-08 20:33:28 +01:00
Ines Montani abb09782f9 Move sun.txt to original location and fix path to not break parser tests 2017-01-08 20:32:54 +01:00
Ines Montani 57919566b8 Add Jupyter notebooks repo to resources list 2017-01-05 20:50:08 +01:00
Ines Montani cab39c59c5 Add missing contractions to English tokenizer exceptions
Inspired by
https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py
2017-01-05 19:59:06 +01:00
Ines Montani a23504fe07 Move abbreviations below other exceptions 2017-01-05 19:58:07 +01:00
Ines Montani 7d2cf934b9 Generate he/she/it correctly with 's instead of 've 2017-01-05 19:57:00 +01:00
Ines Montani 8328925e1f Add newlines to long German text 2017-01-05 18:13:30 +01:00
Ines Montani 55b46d7cf6 Add tokenizer tests for German 2017-01-05 18:11:25 +01:00
Ines Montani 5bb4081f52 Remove redundant test_tokenizer.py for English 2017-01-05 18:11:11 +01:00
Ines Montani 8216ba599b Add tests for longer and mixed English texts 2017-01-05 18:11:04 +01:00
Ines Montani 65f937d5c6 Move basic contraction tests to test_contractions.py 2017-01-05 18:09:53 +01:00
Ines Montani bbe7cab3a1 Move non-English-specific tests back to general tokenizer tests 2017-01-05 18:09:29 +01:00
Ines Montani 038002d616 Reformat HU tokenizer tests and adapt to general style
Improve readability of test cases and add conftest.py with fixture
2017-01-05 18:06:44 +01:00
Ines Montani bc911322b3 Move ") to emoticons (see Tweebo challenge test) 2017-01-05 18:05:38 +01:00
Ines Montani 637f785036 Add general sanity tests for all tokenizers 2017-01-05 16:25:38 +01:00
Ines Montani c5f2dc15de Move English tokenizer tests to directory /en 2017-01-05 16:25:04 +01:00
Ines Montani 8b45363b4d Modernize and merge general tokenizer tests 2017-01-05 13:17:05 +01:00
Ines Montani 02cfda48c9 Modernize and merge tokenizer tests for string loading 2017-01-05 13:16:55 +01:00
Ines Montani a11f684822 Modernize and merge tokenizer tests for whitespace 2017-01-05 13:16:33 +01:00
Ines Montani 8b284fc6f1 Modernize and merge tokenizer tests for text from file 2017-01-05 13:15:52 +01:00
Ines Montani 2c2e878653 Modernize and merge tokenizer tests for punctuation 2017-01-05 13:14:16 +01:00
Ines Montani 8a74129cdf Modernize and merge tokenizer tests for prefixes/suffixes/infixes 2017-01-05 13:13:12 +01:00
Ines Montani 0e65dca9a5 Modernize and merge tokenizer tests for exception and emoticons 2017-01-05 13:11:31 +01:00
Ines Montani 34c47bb20d Fix formatting 2017-01-05 13:10:51 +01:00
Ines Montani 2e72683baa Add missing docstrings 2017-01-05 13:10:21 +01:00
Ines Montani da10a049a6 Add unicode declarations 2017-01-05 13:09:48 +01:00
Ines Montani 58adae8774 Remove unused file 2017-01-05 13:09:22 +01:00
Ines Montani c6e5a5349d Move regression test for #360 into own file 2017-01-04 00:49:31 +01:00
Ines Montani 8279993a6f Modernize and merge tokenizer tests for punctuation 2017-01-04 00:49:20 +01:00
Ines Montani 550630df73 Update tokenizer tests for contractions 2017-01-04 00:48:42 +01:00
Ines Montani 109f202e8f Update conftest fixture 2017-01-04 00:48:21 +01:00
Ines Montani ee6b49b293 Modernize tokenizer tests for emoticons 2017-01-04 00:47:59 +01:00
Ines Montani f09b5a5dfd Modernize tokenizer tests for infixes 2017-01-04 00:47:42 +01:00
Ines Montani 59059fed27 Move regression test for #351 to own file 2017-01-04 00:47:11 +01:00
Ines Montani 667051375d Modernize tokenizer tests for whitespace 2017-01-04 00:46:35 +01:00
Ines Montani aafc894285 Modernize tokenizer tests for contractions
Use @pytest.mark.parametrize.
2017-01-03 23:02:21 +01:00
Ines Montani 1d237664af Add lowercase lemma to tokenizer exceptions 2017-01-03 23:02:21 +01:00
Ines Montani dd7cd44ba5 Update README.rst 2017-01-03 21:27:25 +01:00
Ines Montani d677db6277 Change "Multi-language support" to amber for spaCy 2017-01-03 21:24:35 +01:00
Ines Montani 6f51609b5e Use yellow color for neutral pro/con icon 2017-01-03 21:24:14 +01:00
Ines Montani 84a87951eb Fix typos 2017-01-03 18:27:43 +01:00
Ines Montani 35b39f53c3 Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:26:09 +01:00
Ines Montani fb9d3bb022 Revert "Merge remote-tracking branch 'origin/master'"
This reverts commit d3b181cdf1, reversing
changes made to b19cfcc144.
2017-01-03 18:21:36 +01:00
Ines Montani 461cbb99d8 Revert "Reorganise English tokenizer exceptions (as discussed in #718)"
This reverts commit b19cfcc144.
2017-01-03 18:21:29 +01:00
Ines Montani d3b181cdf1 Merge remote-tracking branch 'origin/master'
# Conflicts:
#	spacy/en/tokenizer_exceptions.py
2017-01-03 18:20:01 +01:00
Ines Montani b19cfcc144 Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:17:57 +01:00
Ines Montani 4fc4d3d0e3 Update PULL_REQUEST_TEMPLATE.md 2017-01-03 15:41:16 +01:00
Ines Montani 1bd53bbf89 Fix typos (resolves #718) 2017-01-03 11:26:21 +01:00
Matthew Honnibal 9b48bd161b Merge pull request #700 from oroszgy/tokenization_w_exception_patterns
Tokenization with exception patterns
2017-01-03 09:56:37 +11:00
Ines Montani 1b82756cc7 Tidy up and fix formatting and consistency 2017-01-02 00:29:24 +01:00