mirror of https://github.com/explosion/spaCy.git
94 lines
4.1 KiB
Plaintext
94 lines
4.1 KiB
Plaintext
|
//- 💫 DOCS > USAGE > ADDING LANGUAGES > TRAINING
|
|||
|
|
|||
|
p
|
|||
|
| spaCy expects that common words will be cached in a
|
|||
|
| #[+api("vocab") #[code Vocab]] instance. The vocabulary caches lexical
|
|||
|
| features, and makes it easy to use information from unlabelled text
|
|||
|
| samples in your models. Specifically, you'll usually want to collect
|
|||
|
| word frequencies, and train word vectors. To generate the word frequencies
|
|||
|
| from a large, raw corpus, you can use the
|
|||
|
| #[+src(gh("spacy-dev-resources", "training/word_freqs.py")) #[code word_freqs.py]]
|
|||
|
| script from the spaCy developer resources.
|
|||
|
|
|||
|
+github("spacy-dev-resources", "training/word_freqs.py")
|
|||
|
|
|||
|
p
|
|||
|
| Note that your corpus should not be preprocessed (i.e. you need
|
|||
|
| punctuation for example). The word frequencies should be generated as a
|
|||
|
| tab-separated file with three columns:
|
|||
|
|
|||
|
+list("numbers")
|
|||
|
+item The number of times the word occurred in your language sample.
|
|||
|
+item The number of distinct documents the word occurred in.
|
|||
|
+item The word itself.
|
|||
|
|
|||
|
+code("es_word_freqs.txt", "text").
|
|||
|
6361109 111 Aunque
|
|||
|
23598543 111 aunque
|
|||
|
10097056 111 claro
|
|||
|
193454 111 aro
|
|||
|
7711123 111 viene
|
|||
|
12812323 111 mal
|
|||
|
23414636 111 momento
|
|||
|
2014580 111 felicidad
|
|||
|
233865 111 repleto
|
|||
|
15527 111 eto
|
|||
|
235565 111 deliciosos
|
|||
|
17259079 111 buena
|
|||
|
71155 111 Anímate
|
|||
|
37705 111 anímate
|
|||
|
33155 111 cuéntanos
|
|||
|
2389171 111 cuál
|
|||
|
961576 111 típico
|
|||
|
|
|||
|
+aside("Brown Clusters")
|
|||
|
| Additionally, you can use distributional similarity features provided by the
|
|||
|
| #[+a("https://github.com/percyliang/brown-cluster") Brown clustering algorithm].
|
|||
|
| You should train a model with between 500 and 1000 clusters. A minimum
|
|||
|
| frequency threshold of 10 usually works well.
|
|||
|
|
|||
|
p
|
|||
|
| You should make sure you use the spaCy tokenizer for your
|
|||
|
| language to segment the text for your word frequencies. This will ensure
|
|||
|
| that the frequencies refer to the same segmentation standards you'll be
|
|||
|
| using at run-time. For instance, spaCy's English tokenizer segments
|
|||
|
| "can't" into two tokens. If we segmented the text by whitespace to
|
|||
|
| produce the frequency counts, we'll have incorrect frequency counts for
|
|||
|
| the tokens "ca" and "n't".
|
|||
|
|
|||
|
+h(4, "word-vectors") Training the word vectors
|
|||
|
|
|||
|
p
|
|||
|
| #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec] and related
|
|||
|
| algorithms let you train useful word similarity models from unlabelled
|
|||
|
| text. This is a key part of using
|
|||
|
| #[+a("/usage/deep-learning") deep learning] for NLP with limited
|
|||
|
| labelled data. The vectors are also useful by themselves – they power
|
|||
|
| the #[code .similarity()] methods in spaCy. For best results, you should
|
|||
|
| pre-process the text with spaCy before training the Word2vec model. This
|
|||
|
| ensures your tokenization will match. You can use our
|
|||
|
| #[+src(gh("spacy-dev-resources", "training/word_vectors.py")) word vectors training script],
|
|||
|
| which pre-processes the text with your language-specific tokenizer and
|
|||
|
| trains the model using #[+a("https://radimrehurek.com/gensim/") Gensim].
|
|||
|
| The #[code vectors.bin] file should consist of one word and vector per line.
|
|||
|
|
|||
|
+github("spacy-dev-resources", "training/word_vectors.py")
|
|||
|
|
|||
|
+h(3, "train-tagger-parser") Training the tagger and parser
|
|||
|
|
|||
|
p
|
|||
|
| You can now train the model using a corpus for your language annotated
|
|||
|
| with #[+a("http://universaldependencies.org/") Universal Dependencies].
|
|||
|
| If your corpus uses the
|
|||
|
| #[+a("http://universaldependencies.org/docs/format.html") CoNLL-U] format,
|
|||
|
| i.e. files with the extension #[code .conllu], you can use the
|
|||
|
| #[+api("cli#convert") #[code convert]] command to convert it to spaCy's
|
|||
|
| #[+a("/api/annotation#json-input") JSON format] for training.
|
|||
|
| Once you have your UD corpus transformed into JSON, you can train your
|
|||
|
| model use the using spaCy's #[+api("cli#train") #[code train]] command.
|
|||
|
|
|||
|
+infobox
|
|||
|
| For more details and examples of how to
|
|||
|
| #[strong train the tagger and dependency parser], see the
|
|||
|
| #[+a("/usage/training#tagger-parser") usage guide on training].
|