mirror of https://github.com/explosion/spaCy.git
Update adding languages docs
This commit is contained in:
parent 9d85cda8e4
commit 2f54fefb5d

@@ -206,6 +206,14 @@ p
        being below beside besides between beyond both bottom but by
        """).split())

+infobox("Important note")
    | When adding stop words from an online source, always #[strong include the link]
    | in a comment. Make sure to #[strong proofread] and double-check the words
    | carefully. A lot of the lists available online have been passed around
    | for years and often contain mistakes, like unicode errors or random words
    | that have once been added for a specific use case, but don't actually
    | qualify.
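For context, the quoted `being below beside …` lines come from a stop-word list defined as a Python set. A minimal self-contained sketch of that pattern, reproducing only the one line of words visible in the diff above:

```python
# Sketch of the stop-word list pattern used in spaCy language packages:
# a triple-quoted string of words, split on whitespace into a set.
# Only one line of words from the diff is reproduced here.
STOP_WORDS = set("""
being below beside besides between beyond both bottom but by
""".split())
```

Membership checks like `"below" in STOP_WORDS` are then O(1), which is why a set is used rather than a list.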

+h(3, "tokenizer-exceptions") Tokenizer exceptions

p

@@ -263,6 +271,15 @@ p
    # only declare this at the bottom
    TOKENIZER_EXCEPTIONS = dict(_exc)

+aside("Generating tokenizer exceptions")
    | Keep in mind that generating exceptions only makes sense if there's a
    | clearly defined and #[strong finite number] of them, like common
    | contractions in English. This is not always the case – in Spanish for
    | instance, infinitive or imperative reflexive verbs and pronouns are one
    | token (e.g. "vestirme"). In cases like this, spaCy shouldn't be
    | generating exceptions for #[em all verbs]. Instead, this will be handled
    | at a later stage during lemmatization.

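The aside above names English contractions as the good case for hand-written exceptions, and the diffed code ends with `TOKENIZER_EXCEPTIONS = dict(_exc)`. A dependency-free sketch of that `_exc` dict pattern follows; note that spaCy itself keys the sub-token dicts with attribute IDs such as `ORTH` and `LEMMA` imported from `spacy.symbols`, for which plain strings stand in here:

```python
# Stand-ins for spaCy's ORTH/LEMMA attribute IDs (normally imported
# from spacy.symbols); plain strings keep this sketch dependency-free.
ORTH, LEMMA = "orth", "lemma"

# Each exception maps a raw string to the list of tokens it splits into.
_exc = {
    "don't": [
        {ORTH: "do", LEMMA: "do"},
        {ORTH: "n't", LEMMA: "not"},
    ],
    "can't": [
        {ORTH: "ca", LEMMA: "can"},
        {ORTH: "n't", LEMMA: "not"},
    ],
}

# only declare this at the bottom
TOKENIZER_EXCEPTIONS = dict(_exc)
```

Declaring `TOKENIZER_EXCEPTIONS` last, after all additions to `_exc`, ensures the exported dict reflects every exception defined above it.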
p
    | When adding the tokenizer exceptions to the #[code Defaults], you can use
    | the #[code update_exc()] helper function to merge them with the global
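The real `update_exc()` helper lives in `spacy.util`; the simplified stand-in below shows only the merge semantics (the actual helper also validates the exception entries), with hypothetical example data:

```python
def update_exc(base_exceptions, *addition_dicts):
    # Simplified stand-in for spacy.util.update_exc: copy the base
    # exceptions, then let each additions dict override or extend
    # them; later dicts win on key collisions.
    exc = dict(base_exceptions)
    for additions in addition_dicts:
        exc.update(additions)
    return exc

# Hypothetical usage: merge language-specific exceptions into a base set.
base = {"a.m.": [{"orth": "a.m."}]}
lang_exc = {"don't": [{"orth": "do"}, {"orth": "n't"}]}
merged = update_exc(base, lang_exc)
```

Because the base dict is copied first, the merge never mutates the shared global exceptions, which matters when several languages build on the same base.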

@@ -380,6 +397,8 @@ p

+h(3, "morph-rules") Morph rules

+h(2, "testing") Testing the new language tokenizer

+h(2, "vocabulary") Building the vocabulary

p