Add note on languages with non-latin characters (see #996)

ines 2017-04-23 15:58:38 +02:00
parent 3a9710f356
commit 2bfec1a4f8
1 changed file with 11 additions and 0 deletions


@@ -98,6 +98,17 @@ p
| so that Python functions can be used to help you generalise and combine
| the data as you require.
+infobox("For languages with non-latin characters")
| In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
| needs to know the language's character set. If the language you're adding
| uses non-latin characters, you might need to add the required character
| classes to the global
| #[+src(gh("spacy", "spacy/language_data/punctuation.py")) punctuation.py].
| spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
| to keep this simple and readable. If the language requires very specific
| punctuation rules, you should consider overwriting the default regular
| expressions with your own in the language's #[code Defaults].
+h(3, "stop-words") Stop words
p
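
Note on the infobox added in this commit: the point about character classes can be illustrated with a small sketch. It is not spaCy's actual punctuation.py. The names LATIN, CYRILLIC, ALPHA, INFIX_HYPHEN and split_infixes are hypothetical, the character ranges are rough approximations, and the standard-library re module stands in for the regex library (which would allow classes such as \p{Cyrillic} instead of explicit ranges). It shows why an infix rule built only from Latin letters may fail to split a token written in another script.

```python
import re

# Hypothetical character classes for illustration only; these ranges are
# rough approximations and are not taken from spaCy's punctuation.py.
LATIN = "A-Za-z\u00C0-\u024F"
CYRILLIC = "\u0400-\u04FF"
ALPHA = LATIN + CYRILLIC

# Infix rule: a hyphen counts as an infix only when it sits between two
# characters from the known letter classes.
INFIX_HYPHEN = re.compile(r"(?<=[{a}])-(?=[{a}])".format(a=ALPHA))


def split_infixes(token, pattern=INFIX_HYPHEN):
    """Split a token at every infix match, keeping each infix as its own piece."""
    pieces = []
    start = 0
    for match in pattern.finditer(token):
        if match.start() > start:
            pieces.append(token[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(token):
        pieces.append(token[start:])
    return pieces
```

With ALPHA as defined, both "well-known" and a hyphenated Cyrillic token are split at the hyphen. Passing a pattern built from Latin letters alone leaves the Cyrillic token in one piece, which is the situation the infobox warns about.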