Commit Graph

22 Commits

Author SHA1 Message Date
ines 66c1f194f9 Use consistent unicode declarations 2017-03-12 13:07:28 +01:00
Ines Montani 55c9c62abc Use relative import 2017-01-23 21:27:49 +01:00
Gyorgy Orosz b4df202bfa Better error handling 2017-01-14 22:24:58 +01:00
Gyorgy Orosz b03a46792c Better error handling 2017-01-14 22:09:29 +01:00
Gyorgy Orosz a45f22913f Added further abbreviations present in the Szeged corpus 2017-01-14 22:08:55 +01:00
Gyorgy Orosz 9505c6a72b Passing all old tests. 2017-01-14 20:39:21 +01:00
Gyorgy Orosz 63037e79af Fixed hyphen handling in the Hungarian tokenizer. 2017-01-14 16:30:11 +01:00
Gyorgy Orosz f77c0284d6 Maintaining compatibility with other spacy tokenizers. 2017-01-14 16:19:15 +01:00
Gyorgy Orosz be7a7aeb1a Reversed accidental changes. 2017-01-14 15:59:36 +01:00
Gyorgy Orosz 1be5da1ac6 Fixed Hungarian tokenizer for numbers 2017-01-14 15:51:59 +01:00
Ines Montani 53362b6b93 Reorganise Hungarian prefixes/suffixes/infixes. Use global prefixes and suffixes for non-language-specific rules, import list of alpha unicode characters and adjust regexes. 2017-01-08 20:40:33 +01:00
Ines Montani 0dec90e9f7 Use global abbreviation data languages and remove duplicates 2017-01-08 20:36:00 +01:00
Ines Montani 8785706039 Reformat stop words for better readability 2016-12-24 00:58:40 +01:00
Gyorgy Orosz 45e045a87b Unicode/UTF8 compatibility for Python2 2016-12-24 00:21:00 +01:00
Gyorgy Orosz 3d5306acb9 Added further testcases. 2016-12-20 23:49:35 +01:00
Gyorgy Orosz 23956e72ff Improved partial support for tokenzing Hungarian numbers 2016-12-20 23:36:59 +01:00
Gyorgy Orosz 6add156075 Refactored language data structure 2016-12-20 22:28:20 +01:00
Gyorgy Orosz c035928156 Partial Hungarian number tokenization is added. 2016-12-20 20:46:20 +01:00
Gyorgy Orosz 0cf2144d24 Adding partial hyphen and quote handling support. 2016-12-11 00:14:36 +01:00
Gyorgy Orosz 2051726fd3 Passing Hungatian abbrev tests. 2016-12-10 23:37:58 +01:00
Gyorgy Orosz 90d22db023 Added Hungarian resource files. 2016-12-08 12:06:36 +01:00
Gyorgy Orosz 5b00039955 First steps towards the Hungarian tokenizer code. 2016-12-07 23:07:43 +01:00