Commit Graph

2122 Commits

Author SHA1 Message Date
Ines Montani da10a049a6 Add unicode declarations 2017-01-05 13:09:48 +01:00
Ines Montani 58adae8774 Remove unused file 2017-01-05 13:09:22 +01:00
Ines Montani c6e5a5349d Move regression test for #360 into own file 2017-01-04 00:49:31 +01:00
Ines Montani 8279993a6f Modernize and merge tokenizer tests for punctuation 2017-01-04 00:49:20 +01:00
Ines Montani 550630df73 Update tokenizer tests for contractions 2017-01-04 00:48:42 +01:00
Ines Montani 109f202e8f Update conftest fixture 2017-01-04 00:48:21 +01:00
Ines Montani ee6b49b293 Modernize tokenizer tests for emoticons 2017-01-04 00:47:59 +01:00
Ines Montani f09b5a5dfd Modernize tokenizer tests for infixes 2017-01-04 00:47:42 +01:00
Ines Montani 59059fed27 Move regression test for #351 to own file 2017-01-04 00:47:11 +01:00
Ines Montani 667051375d Modernize tokenizer tests for whitespace 2017-01-04 00:46:35 +01:00
Ines Montani aafc894285 Modernize tokenizer tests for contractions
Use @pytest.mark.parametrize.
2017-01-03 23:02:21 +01:00
Ines Montani 1d237664af Add lowercase lemma to tokenizer exceptions 2017-01-03 23:02:21 +01:00
Ines Montani 84a87951eb Fix typos 2017-01-03 18:27:43 +01:00
Ines Montani 35b39f53c3 Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:26:09 +01:00
Ines Montani fb9d3bb022 Revert "Merge remote-tracking branch 'origin/master'"
This reverts commit d3b181cdf1, reversing
changes made to b19cfcc144.
2017-01-03 18:21:36 +01:00
Ines Montani 461cbb99d8 Revert "Reorganise English tokenizer exceptions (as discussed in #718)"
This reverts commit b19cfcc144.
2017-01-03 18:21:29 +01:00
Ines Montani d3b181cdf1 Merge remote-tracking branch 'origin/master'
# Conflicts:
#	spacy/en/tokenizer_exceptions.py
2017-01-03 18:20:01 +01:00
Ines Montani b19cfcc144 Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:17:57 +01:00
Ines Montani 1bd53bbf89 Fix typos (resolves #718) 2017-01-03 11:26:21 +01:00
Matthew Honnibal fde53be3b4 Move whole token match inside _split_affixes. 2016-12-30 17:11:50 -06:00
Matthew Honnibal 3ba7c167a8 Fix URL tests 2016-12-30 17:10:08 -06:00
Matthew Honnibal 9936a1b9b5 Merge branch 'tokenization_w_exception_patterns' of https://github.com/oroszgy/spaCy.hu into oroszgy-tokenization_w_exception_patterns 2016-12-30 14:53:40 -06:00
Matthew Honnibal 3e8d9c772e Test interaction of token_match and punctuation
Check that the new token_match function applies after punctuation is split off.
2016-12-31 00:52:17 +11:00
Matthew Honnibal 623d94e14f Whitespace 2016-12-31 00:30:28 +11:00
Petter Hohle f112e7754e Add PART to tag map
16 of the 17 PoS tags in the UD tag set are added; PART is missing.
2016-12-28 18:39:01 +01:00
Matthew Honnibal f62db78dc3 Increment version 2016-12-27 21:11:22 +01:00
Matthew Honnibal cade536d1e Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-12-27 21:04:10 +01:00
Matthew Honnibal ce4539dafd Allow the vocabulary to grow to 10,000, to prevent cold-start problem. 2016-12-27 21:03:45 +01:00
Ines Montani ad3669cef5 Merge pull request #703 from magnusburton/master
Added Swedish abbreviations
2016-12-27 01:01:49 +01:00
Ines Montani 78f754dd9a Merge pull request #705 from oroszgy/hu_tokenizer
Initial support for Hungarian
2016-12-27 00:48:13 +01:00
Ines Montani 8785706039 Reformat stop words for better readability 2016-12-24 00:58:40 +01:00
Gyorgy Orosz 45e045a87b Unicode/UTF8 compatibility for Python2 2016-12-24 00:21:00 +01:00
Gyorgy Orosz 72b61b6d03 Typo fix. 2016-12-24 00:10:29 +01:00
Gyorgy Orosz 3a9be4d485 Updated token exception handling mechanism to allow the usage of arbitrary functions as token exception matchers. 2016-12-23 23:49:34 +01:00
Ines Montani 1436b9f15a Fix formatting and consistency 2016-12-23 21:36:01 +01:00
Ines Montani 1d64527727 Update Spanish tokenizer
Remove reflexive pronouns as they're part of an open class, fix
mistakes and add exceptions
2016-12-23 21:36:01 +01:00
Ines Montani 7f411fd01c Remove exceptions containing whitespace / no special chars 2016-12-23 14:30:06 +01:00
Magnus Burton fdf4776262 Added Swedish abbreviations 2016-12-22 22:45:18 +01:00
Gyorgy Orosz d9c59c4751 Maintaining backward compatibility. 2016-12-21 23:30:49 +01:00
Gyorgy Orosz 1748549aeb Added exception pattern mechanism to the tokenizer. 2016-12-21 23:16:19 +01:00
Gyorgy Orosz 35aa54765d Hungarian module is exposed in spacy. 2016-12-21 20:45:36 +01:00
Gyorgy Orosz ab2f6ea46c Removed data files from tests. 2016-12-21 20:22:09 +01:00
Ines Montani 3c87c71d43 Add tokenizer exceptions for a.m. and p.m. in Spanish 2016-12-21 18:19:10 +01:00
Ines Montani 78e63dc7d0 Update tokenizer exceptions for English 2016-12-21 18:06:34 +01:00
Ines Montani 702d1eed93 Update tokenizer exceptions for German 2016-12-21 18:06:27 +01:00
Ines Montani d60380418e Update tokenizer exceptions for Spanish 2016-12-21 18:06:17 +01:00
Ines Montani 920fa0fed2 Add DET_LEMMA constant 2016-12-21 18:05:41 +01:00
Ines Montani 8978806ea6 Allow Vocab to load without serializer_freqs 2016-12-21 18:05:23 +01:00
Ines Montani be8ed811f6 Remove trailing whitespace 2016-12-21 18:04:41 +01:00
Ines Montani 926e19184a Merge pull request #695 from magnusburton/master
Added Swedish morph rules
2016-12-21 01:06:00 +01:00