spaCy/website/docs/usage/vectors-similarity.md

Word Vectors and Semantic Similarity

Basics

Training word vectors

Dense, real-valued vectors representing distributional similarity information are now a cornerstone of practical NLP. The most common way to train these vectors is the word2vec family of algorithms. If you need to train a word2vec model, we recommend the implementation in the Python library Gensim.

Customizing word vectors

Word vectors let you import knowledge from raw text into your model. The knowledge is represented as a table of numbers, with one row per term in your vocabulary. If two terms are used in similar contexts, the algorithm that learns the vectors should assign them rows that are quite similar, while words that are used in different contexts will have quite different values. This lets you use the row-values assigned to the words as a kind of dictionary, to tell you some things about what the words in your text mean.
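
To make "rows that are quite similar" concrete, here is a minimal sketch that compares rows with cosine similarity. The tiny four-dimensional table and the `cosine` helper are hypothetical and only for illustration; real tables are learned from text and are much wider.

```python
import numpy

# Hypothetical 4-dimensional rows; real tables are learned and typically far wider
table = {
    "dog": numpy.array([0.9, 0.8, 0.1, 0.0]),
    "cat": numpy.array([0.8, 0.9, 0.2, 0.1]),
    "orange": numpy.array([0.1, 0.0, 0.9, 0.8]),
}


def cosine(a, b):
    """Cosine similarity: close to 1.0 for rows pointing in the same direction."""
    return float(numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b)))


sim_dog_cat = cosine(table["dog"], table["cat"])
sim_dog_orange = cosine(table["dog"], table["orange"])
print(sim_dog_cat > sim_dog_orange)
```

In this toy table, "dog" and "cat" score much higher with each other than either does with "orange", which is the kind of signal a model can pick up from the rows.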

Word vectors are particularly useful for terms which aren't well represented in your labelled training data. For instance, if you're doing named entity recognition, there will always be lots of names that you don't have examples of. Imagine your training data happens to contain some examples of the term "Microsoft", but no examples of the term "Symantec". In your raw text sample, there are plenty of examples of both terms, and they're used in similar contexts. The word vectors make that fact available to the entity recognition model. It still won't see examples of "Symantec" labelled as a company. However, it'll see that "Symantec" has a word vector that usually corresponds to company terms, so it can make the inference.

To make the best use of the word vectors, you want the word vectors table to cover a very large vocabulary. However, most words are rare, so most of the rows in a large word vectors table will be accessed very rarely, or never at all. You can usually cover more than 95% of the tokens in your corpus with just a few thousand rows in the vector table, but it's the remaining 5% of rare terms where the word vectors are most useful. The problem is that increasing the size of the vector table produces rapidly diminishing returns in coverage over these rare terms.
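
If you want to check how this trade-off looks on your own data, a rough sketch like the following can help. It measures the fraction of tokens covered by the most frequent word types. The `coverage` helper and the toy corpus are hypothetical; on a real corpus you would pass in your own token list.

```python
from collections import Counter


def coverage(tokens, n_rows):
    """Fraction of tokens covered by the n_rows most frequent word types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n_rows))
    return covered / len(tokens)


# Hypothetical toy corpus with a heavily skewed frequency distribution
tokens = (
    ["the"] * 50 + ["of"] * 30 + ["cat"] * 10 + ["dog"] * 6
    + ["Symantec"] * 3 + ["biostatistician"]
)

for n_rows in range(1, 7):
    print(n_rows, coverage(tokens, n_rows))
```

Each extra row buys less coverage than the one before it, which is the diminishing-returns effect described above in miniature.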

Converting word vectors for use in spaCy

Custom word vectors can be trained using a number of open-source libraries, such as Gensim, FastText, or Tomas Mikolov's original word2vec implementation. Most word vector libraries output an easy-to-read text-based format, where each line consists of the word followed by its vector. For everyday use, we want to convert the vectors model into a binary format that loads faster and takes up less space on disk. The easiest way to do this is the init-model command-line utility:

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz

This will output a spaCy model in the directory /tmp/la_vectors_wiki_lg, giving you access to some nice Latin vectors 😉 You can then pass the directory path to spacy.load().

import spacy

nlp_latin = spacy.load("/tmp/la_vectors_wiki_lg")
doc1 = nlp_latin("Caecilius est in horto")
doc2 = nlp_latin("servus est in atrio")
doc1.similarity(doc2)

The model directory will have a /vocab directory with the strings, lexical entries and word vectors from the input vectors model. The init-model command supports a number of archive formats for the word vectors: the vectors can be in plain text (.txt), zipped (.zip), or tarred and zipped (.tgz).
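
If you want to inspect a text-format vectors file before converting it, a small reader is usually enough. The sketch below assumes the word-then-values layout described earlier, separated by single spaces, and assumes that an optional first line with exactly two fields is a header giving the row count and width. `read_text_vectors` is a hypothetical helper, not part of spaCy.

```python
import io

import numpy


def read_text_vectors(lines):
    """Parse 'word v1 v2 ... vn' lines into a {word: vector} dictionary."""
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Assumption: a first line with exactly two fields is a "rows width" header
        if i == 0 and len(parts) == 2:
            continue
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        vectors[word] = numpy.asarray(values, dtype="f")
    return vectors


# Hypothetical sample standing in for an opened vectors file
sample = io.StringIO("2 3\ncanis 0.1 0.2 0.3\nfeles 0.4 0.5 0.6\n")
latin_vectors = read_text_vectors(sample)
print(sorted(latin_vectors))
```

For a real file you would pass an open file handle instead of the `io.StringIO` sample. Note that the header check would misread a file of one-dimensional vectors, so treat it as a convenience rather than a robust parser.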

Optimizing vector coverage

To help you strike a good balance between coverage and memory usage, spaCy's Vectors class lets you map multiple keys to the same row of the table. If you're using the spacy init-model command to create a vocabulary, pruning the vectors will be taken care of automatically if you set the --prune-vectors flag. You can also do it manually in the following steps:

  1. Start with a word vectors model that covers a huge vocabulary. For instance, the en_vectors_web_lg model provides 300-dimensional GloVe vectors for over 1 million terms of English.
  2. If your vocabulary has values set for the Lexeme.prob attribute, the lexemes will be sorted by descending probability to determine which vectors to prune. Otherwise, lexemes will be sorted by their order in the Vocab.
  3. Call Vocab.prune_vectors with the number of vectors you want to keep.

import spacy

nlp = spacy.load('en_vectors_web_lg')
n_vectors = 105000  # number of vectors to keep
removed_words = nlp.vocab.prune_vectors(n_vectors)

assert len(nlp.vocab.vectors) <= n_vectors  # unique vectors have been pruned
assert nlp.vocab.vectors.n_keys > n_vectors  # but not the total entries

Vocab.prune_vectors reduces the current vector table to a given number of unique entries, and returns a dictionary containing the removed words, mapped to (string, score) tuples, where string is the entry the removed word was mapped to, and score the similarity score between the two words.

### Removed words
{
    "Shore": ("coast", 0.732257),
    "Precautionary": ("caution", 0.490973),
    "hopelessness": ("sadness", 0.742366),
    "Continous": ("continuous", 0.732549),
    "Disemboweled": ("corpse", 0.499432),
    "biostatistician": ("scientist", 0.339724),
    "somewheres": ("somewheres", 0.402736),
    "observing": ("observe", 0.823096),
    "Leaving": ("leaving", 1.0),
}

In the example above, the vector for "Shore" was removed and remapped to the vector of "coast", which is deemed about 73% similar. "Leaving" was remapped to the vector of "leaving", which is identical.

If you're using the init-model command, you can set the --prune-vectors option to easily reduce the size of the vectors as you add them to a spaCy model:

python -m spacy init-model en /tmp/la_vectors_web_md --vectors-loc la.300d.vec.tgz --prune-vectors 10000

This will create a spaCy model with vectors for the first 10,000 words in the vectors model. All other words in the vectors model are mapped to the closest vector among those retained.
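
The remapping idea can be sketched in a few lines of NumPy. This is an illustration of the technique under simple assumptions (rows are already ordered by priority and none is all zeros), not spaCy's actual implementation. The `prune` function and the toy data are hypothetical.

```python
import numpy


def prune(words, table, n_keep):
    """Keep the first n_keep rows; remap every other word to its nearest kept row."""
    kept = table[:n_keep]
    # Normalize rows so that a dot product equals cosine similarity
    unit = table / numpy.linalg.norm(table, axis=1, keepdims=True)
    scores = unit[n_keep:] @ unit[:n_keep].T
    best = scores.argmax(axis=1)
    key2row = {word: i for i, word in enumerate(words[:n_keep])}
    remap = {}
    for j, word in enumerate(words[n_keep:]):
        row = int(best[j])
        key2row[word] = row  # the pruned word now shares a kept row
        remap[word] = (words[row], float(scores[j, row]))
    return kept, key2row, remap


# Hypothetical 2-dimensional vectors, ordered from most to least important
words = ["coast", "leaving", "Shore", "Leaving"]
table = numpy.array([[1.0, 0.2], [0.1, 1.0], [0.9, 0.4], [0.1, 1.0]], dtype="f")
kept, key2row, remap = prune(words, table, n_keep=2)
print(remap)
```

After pruning, the table holds two rows but four keys still resolve to a vector, which mirrors the relationship between the number of unique vectors and the number of keys checked in the assertions earlier in this section.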

Adding vectors

spaCy's new Vectors class greatly improves the way word vectors are stored, accessed and used. The data is stored in two structures:

  • An array, which can be either on CPU or GPU.
  • A dictionary mapping string-hashes to rows in the table.

Keep in mind that the Vectors class itself has no StringStore, so you have to store the hash-to-string mapping separately. If you need to manage the strings, you should use the Vectors via the Vocab class, e.g. vocab.vectors. To add vectors to the vocabulary, you can use the Vocab.set_vector method.

### Adding vectors
import numpy
from spacy.vocab import Vocab

vector_data = {"dog": numpy.random.uniform(-1, 1, (300,)),
               "cat": numpy.random.uniform(-1, 1, (300,)),
               "orange": numpy.random.uniform(-1, 1, (300,))}
vocab = Vocab()
for word, vector in vector_data.items():
    vocab.set_vector(word, vector)
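
To picture the two structures described above, here is a toy stand-in that keeps an array plus a key-to-row dictionary, with the hash-to-string mapping stored separately. `TinyVectors` and `toy_hash` are hypothetical names for illustration only. spaCy uses its own 64-bit string hashes and a far more efficient implementation.

```python
import zlib

import numpy


class TinyVectors:
    """Toy vector table: one array plus a dictionary mapping keys to rows."""

    def __init__(self, width):
        self.data = numpy.zeros((0, width), dtype="f")
        self.key2row = {}

    def add(self, key, vector=None, row=None):
        """Append a new row for key, or point key at an existing row."""
        if row is None:
            self.data = numpy.vstack([self.data, numpy.asarray(vector, dtype="f")])
            row = self.data.shape[0] - 1
        self.key2row[key] = row
        return row

    def __getitem__(self, key):
        return self.data[self.key2row[key]]


def toy_hash(string):
    # Hypothetical hash for illustration; not the hash function spaCy uses
    return zlib.crc32(string.encode("utf8"))


strings = {}  # hash-to-string mapping, kept outside the table
vectors = TinyVectors(width=3)
for word, vec in [("dog", [1, 0, 0]), ("cat", [0, 1, 0])]:
    key = toy_hash(word)
    strings[key] = word
    vectors.add(key, vector=vec)

# Map a second key onto the row that already holds "dog"
vectors.add(toy_hash("Dog"), row=vectors.key2row[toy_hash("dog")])
print(vectors.data.shape, len(vectors.key2row))
```

The table itself never sees the strings, which is why you need the `Vocab` (or your own mapping) whenever you want to get from a key back to the text.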

Using custom similarity methods

By default, Token.vector returns the vector for its underlying Lexeme, while Doc.vector and Span.vector return an average of the vectors of their tokens. You can customize these behaviors by modifying the doc.user_hooks, doc.user_span_hooks and doc.user_token_hooks dictionaries.

For more details on adding hooks and overwriting the built-in Doc, Span and Token methods, see the usage guide on user hooks.
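
As a rough sketch of what such a hook could look like, the functions below compute a plain average (similar in spirit to the default) and a custom average that skips stop words. To keep the example runnable without a model, it uses stand-in token objects built with `SimpleNamespace`. The function names and the stop-word rule are assumptions for illustration.

```python
from types import SimpleNamespace

import numpy


def mean_vector(doc):
    """Default-style behavior: plain average of the token vectors."""
    return numpy.mean([token.vector for token in doc], axis=0)


def content_vector(doc):
    """Custom behavior: average only the tokens not flagged as stop words."""
    kept = [token.vector for token in doc if not token.is_stop]
    return numpy.mean(kept, axis=0) if kept else mean_vector(doc)


# Stand-in tokens so the sketch runs without a model; a real Doc yields Token objects
fake_doc = [
    SimpleNamespace(vector=numpy.array([1.0, 0.0]), is_stop=True),
    SimpleNamespace(vector=numpy.array([0.0, 1.0]), is_stop=False),
]
plain = mean_vector(fake_doc)
custom = content_vector(fake_doc)
print(plain, custom)

# With a real Doc, the hook would be registered per document, for example:
# doc.user_hooks["vector"] = content_vector
```

Because hooks live on the individual `Doc`, they are usually set inside a pipeline component so that every document gets them.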

Storing vectors on a GPU

If you're using a GPU, it's much more efficient to keep the word vectors on the device. You can do that by setting the Vectors.data attribute to a cupy.ndarray object if you're using spaCy or Chainer, or a torch.Tensor object if you're using PyTorch. The data object just needs to support __iter__ and __getitem__, so if you're using another library such as TensorFlow, you could also create a wrapper for your vectors data.

### spaCy, Thinc or Chainer
import cupy.cuda
import numpy
from spacy.vectors import Vectors

vector_table = numpy.zeros((3, 300), dtype="f")
vectors = Vectors(data=vector_table, keys=["dog", "cat", "orange"])
with cupy.cuda.Device(0):
    vectors.data = cupy.asarray(vectors.data)

### PyTorch
import numpy
import torch
from spacy.vectors import Vectors

vector_table = numpy.zeros((3, 300), dtype="f")
vectors = Vectors(data=vector_table, keys=["dog", "cat", "orange"])
vectors.data = torch.Tensor(vectors.data).cuda(0)