//- 💫 DOCS > USAGE > ADDING LANGUAGES > TRAINING
p
| spaCy expects that common words will be cached in a
| #[+api("vocab") #[code Vocab]] instance. The vocabulary caches lexical
| features, and makes it easy to use information from unlabelled text
| samples in your models. Specifically, you'll usually want to collect
| word frequencies, and train word vectors. To generate the word frequencies
| from a large, raw corpus, you can use the
| #[+src(gh("spacy-dev-resources", "training/word_freqs.py")) #[code word_freqs.py]]
| script from the spaCy developer resources.
+github("spacy-dev-resources", "training/word_freqs.py")
p
| Note that your corpus should not be preprocessed: it should still
| include punctuation, for instance. The word frequencies should be
| generated as a tab-separated file with three columns:
+list("numbers")
+item The number of times the word occurred in your language sample.
+item The number of distinct documents the word occurred in.
+item The word itself.
+code("es_word_freqs.txt", "text").
6361109 111 Aunque
23598543 111 aunque
10097056 111 claro
193454 111 aro
7711123 111 viene
12812323 111 mal
23414636 111 momento
2014580 111 felicidad
233865 111 repleto
15527 111 eto
235565 111 deliciosos
17259079 111 buena
71155 111 Anímate
37705 111 anímate
33155 111 cuéntanos
2389171 111 cuál
961576 111 típico
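p
| As a rough illustration of what the script produces, the following sketch
| counts word and document frequencies with the spaCy tokenizer and writes
| them in the three-column format shown above. The corpus file name and the
| one-document-per-line layout are assumptions for this example; for real
| corpora, the #[code word_freqs.py] script above is the better starting point.
+code("word_freqs_sketch.py", "python").
# Minimal sketch: count word and document frequencies with the spaCy
# tokenizer. The file names and one-document-per-line layout are
# illustrative assumptions.
from collections import Counter
from spacy.lang.es import Spanish

nlp = Spanish()  # tokenizer only, no other pipeline components

word_counts = Counter()
doc_counts = Counter()
with open("es_corpus.txt", encoding="utf8") as file_:
    for line in file_:
        words = [token.text for token in nlp(line.strip())]
        word_counts.update(words)
        doc_counts.update(set(words))

with open("es_word_freqs.txt", "w", encoding="utf8") as out_file:
    for word, freq in word_counts.items():
        out_file.write("{}\t{}\t{}\n".format(freq, doc_counts[word], word))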
+aside("Brown Clusters")
| Additionally, you can use distributional similarity features provided by the
| #[+a("https://github.com/percyliang/brown-cluster") Brown clustering algorithm].
| You should train a model with between 500 and 1000 clusters. A minimum
| frequency threshold of 10 usually works well.
p
| You should make sure you use the spaCy tokenizer for your
| language to segment the text for your word frequencies. This will ensure
| that the frequencies refer to the same segmentation standards you'll be
| using at run-time. For instance, spaCy's English tokenizer segments
| "can't" into two tokens. If we segmented the text by whitespace to
| produce the frequency counts, we would end up with incorrect counts for
| the tokens "ca" and "n't".
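p
| For instance, a quick way to check how the tokenizer segments your text
| (the sentence below is just an illustration):
+code("tokenizer_check.py", "python").
# See how the English tokenizer splits "can't" into "ca" and "n't".
from spacy.lang.en import English

nlp = English()
doc = nlp("I can't count words by whitespace.")
print([token.text for token in doc])
# ['I', 'ca', "n't", 'count', 'words', 'by', 'whitespace', '.']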
+h(4, "word-vectors") Training the word vectors
p
| #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec] and related
| algorithms let you train useful word similarity models from unlabelled
| text. This is a key part of using
| #[+a("/usage/deep-learning") deep learning] for NLP with limited
| labelled data. The vectors are also useful by themselves: they power
| the #[code .similarity()] methods in spaCy. For best results, you should
| pre-process the text with spaCy before training the Word2vec model. This
| ensures your tokenization will match. You can use our
| #[+src(gh("spacy-dev-resources", "training/word_vectors.py")) word vectors training script],
| which pre-processes the text with your language-specific tokenizer and
| trains the model using #[+a("https://radimrehurek.com/gensim/") Gensim].
| The #[code vectors.bin] file should consist of one word and vector per line.
+github("spacy-dev-resources", "training/word_vectors.py")
+h(3, "train-tagger-parser") Training the tagger and parser
p
| You can now train the model using a corpus for your language annotated
| with #[+a("http://universaldependencies.org/") Universal Dependencies].
| If your corpus uses the
| #[+a("http://universaldependencies.org/docs/format.html") CoNLL-U] format,
| i.e. files with the extension #[code .conllu], you can use the
| #[+api("cli#convert") #[code convert]] command to convert it to spaCy's
| #[+a("/api/annotation#json-input") JSON format] for training.
| Once you have your UD corpus transformed into JSON, you can train your
| model using spaCy's #[+api("cli#train") #[code train]] command.
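p
| Assuming your treebank provides the usual training and development splits,
| the workflow looks roughly like this. The file and directory names are
| placeholders, and the JSON file names produced by #[code convert] follow
| the names of your #[code .conllu] files.
+code(false, "bash").
python -m spacy convert es-ud-train.conllu corpus/
python -m spacy convert es-ud-dev.conllu corpus/
python -m spacy train es models/ corpus/es-ud-train.json corpus/es-ud-dev.json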
+infobox
| For more details and examples of how to
| #[strong train the tagger and dependency parser], see the
| #[+a("/usage/training#tagger-parser") usage guide on training].