287 Commits

Author SHA1 Message Date
ines f2ea6d4713 Add Dutch example sentences (see #1107) 2017-12-01 23:36:05 +01:00
Canbey Bilgili abe098b255 Adds Turkish Lemmatization 2017-12-01 17:04:32 +03:00
Søren Lind Kristiansen d86b537a38 Enable morph rules for Danish 2017-11-30 15:58:02 +01:00
Søren Lind Kristiansen 13a988adc3 Remove 'Number[psor]' 2017-11-30 15:55:04 +01:00
Søren Lind Kristiansen dd6fde18a9 Add more Danish morph rules and clean up existing ones 2017-11-30 11:17:19 +01:00
Vadim Mazaev 4ba7ddf651 Bugfixes 2017-11-30 12:29:38 +03:00
Jim O'Regan c3e6cee17a use inan in polimorf tagset conversion 2017-11-29 23:15:47 +00:00
Jim O'Regan b32575e78c imports 2017-11-29 23:03:41 +00:00
Jim O'Regan 3696ce6a7b add UD mapping 2017-11-29 22:59:19 +00:00
Matthew Honnibal f9ed9ea529 Merge pull request #1624 from GreenRiverRUS/russian
Add support for Russian
2017-11-29 23:10:01 +01:00
Jim O'Regan 076a6fc60a symbols 2017-11-29 20:11:20 +00:00
Jim O'Regan 834ba3c69a (semi generated) Polimorf mapping 2017-11-29 20:08:24 +00:00
Jim O'Regan ba6a23fd11 BOM in Italian lemmatiser 2017-11-29 17:40:07 +00:00
Ines Montani 9052643e2c Merge pull request #1653 from sorenlind/da_example_typo
Fix typo
2017-11-27 14:47:42 +00:00
Søren Lind Kristiansen 5fe58b885b Fix typo 2017-11-27 15:36:18 +01:00
Ines Montani d52b1ab245 Add unicode_literals (hopefully fixes test failure on Python 2) 2017-11-27 15:16:54 +01:00
Søren Lind Kristiansen 0ffd27b0f6 Add several Danish alternative spellings 2017-11-27 13:35:41 +01:00
Vadim Mazaev cacd859dcd Added tag map, fixed tests fails, added more exceptions 2017-11-26 20:54:48 +03:00
Søren Lind Kristiansen ef03e9ea53 Remove unused import. 2017-11-25 13:04:02 +01:00
Søren Lind Kristiansen 6aa241bcec Add day of month tokenizer exceptions for Danish. 2017-11-24 15:03:24 +01:00
Søren Lind Kristiansen 0c276ed020 Add weekday abbreviations and remove ambiguous month abbreviations for Danish. 2017-11-24 14:43:29 +01:00
Søren Lind Kristiansen 056547e989 Add multiple tokenizer exceptions for Danish. 2017-11-24 11:51:26 +01:00
Søren Lind Kristiansen ac8116510d Fix tokenization of 'i.' for Danish. 2017-11-24 11:16:53 +01:00
Vadim Mazaev 81314f8659 Fixed tokenizer: added char classes; added first lemmatizer and tokenizer tests 2017-11-21 22:23:59 +03:00
Vadim Mazaev 52ee1f9bf9 Updated Russian Language, added lemmatizer, norm exceptions and lex attrs 2017-11-21 11:44:46 +03:00
Vadim Mazaev a0739a06d4 Returned russian support from v1.10 branch 2017-11-17 17:06:15 +03:00
ines c9d72de0fb Add dummy serialization methods for Japanese and missing lang getter (resolves #1557) 2017-11-15 12:44:02 +01:00
Mathias Deschamps c0691b2ab4 Add tokenizer exceptions for ing verbs
Extend list of tokenizing exceptions introduced in 123810b
2017-11-13 17:46:05 +01:00
Mathias Deschamps 288298ead9 Add norm exception for ing verbs
Some ing verbs are sometimes written in or in'. Make the NORM form correct
2017-11-13 17:46:05 +01:00
Abhinav Sharma 59f5740ede improved upon the list of included stop_words 2017-11-13 17:13:49 +05:30
ines 123810b6de Add "lovin'" to tokenizer exceptions (see #1248) 2017-11-09 17:09:30 +01:00
Ines Montani 42b241ccd0 Update language code in usage example in comment 2017-11-08 11:36:38 +01:00
Abhinav Sharma 84edade82d Create examples.py
Populated the file with the translations of English example sentences
2017-11-08 13:23:08 +05:30
ines bcf42b8846 Fix typo 2017-11-08 01:06:37 +01:00
ines acb9bdb852 Fix PRON_LEMMA imports 2017-11-06 17:41:53 +01:00
ines baa231745c Fix Dutch tag map 2017-11-05 21:41:50 +01:00
ines 507ecb67af Fix Spanish tag map 2017-11-05 19:23:34 +01:00
ines 975e1042ff Fix Italian tag map 2017-11-05 18:34:09 +01:00
ines 6b2d6e4937 Fix Portuguese tag map 2017-11-05 18:31:00 +01:00
ines fa2687fded Fix Dutch tag map 2017-11-05 17:57:59 +01:00
ines fb8990d916 Fix Spanish tag map 2017-11-05 17:48:46 +01:00
ines 9d13288f73 Fix French tag map 2017-11-05 17:47:59 +01:00
ines 54579805c5 Fix French tag map 2017-11-05 17:44:05 +01:00
Matthew Honnibal 0d4bd6414e Fix Italian tag map 2017-11-05 14:11:03 +01:00
ines ef597622a6 Add Portuguese tag map 2017-11-05 13:58:34 +01:00
ines 793c62dfda Add Dutch tag map 2017-11-05 13:48:07 +01:00
ines f7485a09c8 Fix Italian tag map 2017-11-05 13:12:58 +01:00
ines 3cef901834 Add tag map for French and Italian 2017-11-04 23:32:51 +01:00
ines 6c15aafebd Fix formatting 2017-11-04 23:07:02 +01:00
ines 9baab241b4 Add skeleton language data for Turkish 2017-11-02 16:32:24 +01:00
ines c6fea3e5f6 Add Romanian and Croatian skeletons (experimental)
Add language data templates to make it easier for others to contribute to the language support
2017-11-01 23:04:28 +01:00
ines 18c859500b Add missing imports 2017-11-01 23:02:51 +01:00
ines 819e30a26e Tidy up tokenizer exceptions 2017-11-01 23:02:45 +01:00
ines 9659391944 Update deprecated methods and add warnings 2017-11-01 16:49:42 +01:00
Ines Montani d11659463b Merge pull request #1152 from jimregan/develop-irish
[WIP] attempt a port from #1147
2017-11-01 00:23:43 +01:00
ines 7e424a1804 Don't copy exception dicts if not necessary and tidy up 2017-10-31 21:05:29 +01:00
Ines Montani 06c25a8882 Remove comma that caused list to wrap in tuple!
Also removed extra dict wrappings for performance (we used to have them in there, but they should only really exist if copying the dict is absolutely necessary)
2017-10-31 20:13:16 +01:00
Ines Montani 147448b65b Add missing symbols 2017-10-31 19:34:45 +01:00
Ines Montani 9b0de9fb43 Fix import of symbols (now nested one level lower) 2017-10-31 19:17:58 +01:00
Jim O'Regan 41dd29e48e merge 2017-10-31 14:07:45 +00:00
Ines Montani 090bd00369 Merge pull request #1464 from mayukh18/develop_bengali_pronouns
added the bengali pronouns for v2.0
2017-10-25 21:55:25 +02:00
mayukh18 1bc07758fa added few bengali pronouns 2017-10-25 22:24:40 +05:30
Ines Montani d3bf488e16 Merge pull request #1171 from mollerhoj/support-danish
Improve basic support for Danish
2017-10-24 20:29:57 +02:00
Matthew Honnibal 66766c1454 Restore SP tag to English tag_map, until models migrate 2017-10-24 17:05:00 +02:00
ines c55db0a4a1 Add example sentences for Japanese and Chinese (see #1107) 2017-10-24 13:02:24 +02:00
ines 66f8f9d4a0 Fix Japanese tokenizer
JapaneseTokenizer now returns a Doc, not individual words
2017-10-24 13:02:19 +02:00
Ines Montani facf77e541 Merge branch 'develop' into support-danish 2017-10-24 11:53:19 +02:00
Matthew Honnibal 49895fbef6 Rename 'SP' special tag to '_SP'
Renaming the tag with an underscore lets us add it to the tag map
without worrying that we'll change the sequence of tags, which throws
off the tag-to-ID mapping. For instance, if we inserted a 'SP' tag,
the "VERB" tag is pushed to a different class ID, and the model is all
messed up.
2017-10-20 14:01:12 +02:00
Ines Montani f0d577e460 Merge pull request #1425 from explosion/feature/hindi-tokenizer
💫 Basic Hindi tokenization support
2017-10-18 13:34:52 +02:00
Matthew Honnibal 839de87ca9 Make lambda func a named function, for pickling 2017-10-17 18:21:20 +02:00
Matthew Honnibal 9ce7d6af87 Make lex attr functions top-level functions, to promote pickling 2017-10-17 18:19:18 +02:00
Ines Montani aab299c8ae Merge pull request #1429 from vishnunekkanti/develop
fix syntax error in zh
2017-10-17 14:45:02 +02:00
ines 485c4f6df5 Add Hungarian examples (see #1107) 2017-10-17 02:37:45 +02:00
Vishnu Kumar Nekkanti d3c54cf39a fixed SyntaxError while checking for jieba 2017-10-16 18:51:33 +05:30
ines 266e7180a7 Add Language class, stop words and basic stemmer that sets NORM 2017-10-14 14:59:52 +02:00
ines e85e1d571b Update base punctuation 2017-10-14 14:59:23 +02:00
ines 9d6c8eaa49 Update base norm exceptions with more unicode characters
e.g. unicode variations of punctuation used in Chinese
2017-10-14 14:58:52 +02:00
ines 38c756fd85 Port over changes from #1287 2017-10-14 13:16:21 +02:00
ines 612224c10d Port over changes from #1157 2017-10-14 13:11:39 +02:00
ines a4d974d97b Port over URL pattern changes from #1411 2017-10-14 12:58:07 +02:00
ines 09aed58140 Port over changes from #1333 and add comments 2017-10-14 12:52:59 +02:00
ines 8ce6f96180 Don't make copies of language data components 2017-10-11 15:34:55 +02:00
ines 417d45f5d0 Add lemmatizer data as variable on language data
Don't create lookup lemmatizer within Language class and just pass in
the data so it can be set on Token creation
2017-10-11 02:24:58 +02:00
ines 0c2343d73a Tidy up language data 2017-10-11 02:22:49 +02:00
Matthew Honnibal 8143618497 Set prefix length back to 1 2017-10-10 19:32:54 +02:00
Matthew Honnibal dce8afb9cf Set prefix length to 3 2017-10-09 21:55:55 -05:00
Ines Montani 959c46eabe Merge pull request #1365 from wannaphongcom/develop
Add Thai language for spaCy v2
2017-09-26 23:43:05 +02:00
Wannaphong Phatthiyaphaibun 3d5046c499 fix import in th 2017-09-26 22:41:20 +07:00
Wannaphong Phatthiyaphaibun a63f790b8c fix thai tag_map 2017-09-26 22:28:57 +07:00
Wannaphong Phatthiyaphaibun 2ea27d07f4 fix tokenizer_exceptions in thai 2017-09-26 22:14:47 +07:00
Wannaphong Phatthiyaphaibun a2bf4cc7bf fix newline in file 2017-09-26 21:49:43 +07:00
ines bb5c631402 Implement like_num getter for French (via #1161) 2017-09-26 16:47:45 +02:00
ines 15479b3bae Add comment to like_num re: future work 2017-09-26 16:43:28 +02:00
ines adda08fe14 Implement like_num getter for Dutch (via #1177) 2017-09-26 16:39:15 +02:00
ines 5ee10379db Port over changes from #1340 2017-09-26 16:38:08 +02:00
Wannaphong Phatthiyaphaibun 5cba67146c add thai in spacy2 2017-09-26 21:36:27 +07:00
ines 10d291f129 Port over change from #1351 2017-09-26 16:11:41 +02:00
ines ece30c28a8 Don't split hyphenated words in German
This way, the tokenizer matches the tokenization in German treebanks
2017-09-16 20:40:15 +02:00
Ines Montani bd3da3d6fb Port over change from #1323 and tidy up 2017-09-14 19:23:13 +02:00
Jim O'Regan 9dfd301962 rearrange 2017-09-11 10:14:18 +01:00
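
Note on commit 49895fbef6 (Rename 'SP' special tag to '_SP'): the message says that inserting an 'SP' tag pushes "VERB" to a different class ID, while '_SP' does not. A minimal sketch of why that could happen, assuming (this is not stated in the log) that class IDs come from each tag's position in a sorted tag list. The helper tag_ids and the three-tag inventory are hypothetical and used only for illustration; this is not the project's actual code.

```python
def tag_ids(tags):
    """Assign each tag an integer ID from its position in sorted order.

    Assumption for illustration only: IDs are derived by sorting tag names.
    """
    return {tag: i for i, tag in enumerate(sorted(tags))}


# Hypothetical, tiny tag inventory.
base = ["ADJ", "NOUN", "VERB"]

before = tag_ids(base)
with_sp = tag_ids(base + ["SP"])           # 'SP' sorts before 'VERB'
with_underscore = tag_ids(base + ["_SP"])  # '_' sorts after 'A'-'Z' in ASCII

print(before["VERB"], with_sp["VERB"], with_underscore["VERB"])
```

Under that assumption, adding 'SP' changes the ID of "VERB", whereas '_SP' lands after every upper-case tag and leaves the existing IDs unchanged.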