mirror of https://github.com/explosion/spaCy.git
225 lines
9.6 KiB
Markdown
225 lines
9.6 KiB
Markdown
---
|
|
title: Vectors and Embeddings
|
|
menu:
|
|
- ["What's a Word Vector?", 'whats-a-vector']
|
|
- ['Word Vectors', 'vectors']
|
|
- ['Other Embeddings', 'embeddings']
|
|
next: /usage/transformers
|
|
---
|
|
|
|
An old idea in linguistics is that you can "know a word by the company it
|
|
keeps": that is, word meanings can be understood relationally, based on their
|
|
patterns of usage. This idea inspired a branch of NLP research known as
|
|
"distributional semantics" that has aimed to compute databases of lexical
|
|
knowledge automatically. The [Word2vec](https://en.wikipedia.org/wiki/Word2vec)
|
|
family of algorithms are a key milestone in this line of research. For
|
|
simplicity, we will refer to a distributional word representation as a "word
|
|
vector", and algorithms that computes word vectors (such as
|
|
[GloVe](https://nlp.stanford.edu/projects/glove/),
|
|
[FastText](https://fasttext.cc), etc.) as "Word2vec algorithms".
|
|
|
|
Word vector tables are included in some of the spaCy [model packages](/models)
|
|
we distribute, and you can easily create your own model packages with word
|
|
vectors you train or download yourself. In some cases you can also add word
|
|
vectors to an existing pipeline, although each pipeline can only have a single
|
|
word vectors table, and a model package that already has word vectors is
|
|
unlikely to work correctly if you replace the vectors with new ones.
|
|
|
|
## What's a word vector? {#whats-a-vector}
|
|
|
|
For spaCy's purposes, a "word vector" is a 1-dimensional slice from a
|
|
2-dimensional **vectors table**, with a deterministic mapping from word types to
|
|
rows in the table.
|
|
|
|
```python
|
|
def what_is_a_word_vector(
|
|
word_id: int,
|
|
key2row: Dict[int, int],
|
|
vectors_table: Floats2d,
|
|
*,
|
|
default_row: int=0
|
|
) -> Floats1d:
|
|
return vectors_table[key2row.get(word_id, default_row)]
|
|
```
|
|
|
|
Word2vec algorithms try to produce vectors tables that let you estimate useful
|
|
relationships between words using simple linear algebra operations. For
|
|
instance, you can often find close synonyms of a word by finding the vectors
|
|
closest to it by cosine distance, and then finding the words that are mapped to
|
|
those neighboring vectors. Word vectors can also be useful as features in
|
|
statistical models.
|
|
|
|
### Word vectors vs. contextual language models {#vectors-vs-language-models}
|
|
|
|
The key difference between word vectors and contextual language models such as
|
|
ElMo, BERT and GPT-2 is that word vectors model **lexical types**, rather than
|
|
_tokens_. If you have a list of terms with no context around them, a model like
|
|
BERT can't really help you. BERT is designed to understand language **in
|
|
context**, which isn't what you have. A word vectors table will be a much better
|
|
fit for your task. However, if you do have words in context — whole sentences or
|
|
paragraphs of running text — word vectors will only provide a very rough
|
|
approximation of what the text is about.
|
|
|
|
Word vectors are also very computationally efficient, as they map a word to a
|
|
vector with a single indexing operation. Word vectors are therefore useful as a
|
|
way to **improve the accuracy** of neural network models, especially models that
|
|
are small or have received little or no pretraining. In spaCy, word vector
|
|
tables are only used as **static features**. spaCy does not backpropagate
|
|
gradients to the pretrained word vectors table. The static vectors table is
|
|
usually used in combination with a smaller table of learned task-specific
|
|
embeddings.
|
|
|
|
## Using word vectors directly {#vectors}
|
|
|
|
spaCy stores word vector information in the
|
|
[`Vocab.vectors`](/api/vocab#attributes) attribute, so you can access the whole
|
|
vectors table from most spaCy objects. You can also access the vector for a
|
|
[`Doc`](/api/doc), [`Span`](/api/span), [`Token`](/api/token) or
|
|
[`Lexeme`](/api/lexeme) instance via the `vector` attribute. If your `Doc` or
|
|
`Span` has multiple tokens, the average of the word vectors will be returned,
|
|
excluding any "out of vocabulary" entries that have no vector available. If none
|
|
of the words have a vector, a zeroed vector will be returned.
|
|
|
|
The `vector` attribute is a **read-only** numpy or cupy array (depending on
|
|
whether you've configured spaCy to use GPU memory), with dtype `float32`. The
|
|
array is read-only so that spaCy can avoid unnecessary copy operations where
|
|
possible. You can modify the vectors via the `Vocab` or `Vectors` table.
|
|
|
|
### Converting word vectors for use in spaCy
|
|
|
|
Custom word vectors can be trained using a number of open-source libraries, such
|
|
as [Gensim](https://radimrehurek.com/gensim), [Fast Text](https://fasttext.cc),
|
|
or Tomas Mikolov's original
|
|
[Word2vec implementation](https://code.google.com/archive/p/word2vec/). Most
|
|
word vector libraries output an easy-to-read text-based format, where each line
|
|
consists of the word followed by its vector. For everyday use, we want to
|
|
convert the vectors model into a binary format that loads faster and takes up
|
|
less space on disk. The easiest way to do this is the
|
|
[`init-model`](/api/cli#init-model) command-line utility:
|
|
|
|
```bash
|
|
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
|
|
python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz
|
|
```
|
|
|
|
This will output a spaCy model in the directory `/tmp/la_vectors_wiki_lg`,
|
|
giving you access to some nice Latin vectors 😉 You can then pass the directory
|
|
path to [`spacy.load()`](/api/top-level#spacy.load).
|
|
|
|
```python
|
|
nlp_latin = spacy.load("/tmp/la_vectors_wiki_lg")
|
|
doc1 = nlp_latin("Caecilius est in horto")
|
|
doc2 = nlp_latin("servus est in atrio")
|
|
doc1.similarity(doc2)
|
|
```
|
|
|
|
The model directory will have a `/vocab` directory with the strings, lexical
|
|
entries and word vectors from the input vectors model. The
|
|
[`init-model`](/api/cli#init-model) command supports a number of archive formats
|
|
for the word vectors: the vectors can be in plain text (`.txt`), zipped
|
|
(`.zip`), or tarred and zipped (`.tgz`).
|
|
|
|
### Optimizing vector coverage {#custom-vectors-coverage new="2"}
|
|
|
|
To help you strike a good balance between coverage and memory usage, spaCy's
|
|
[`Vectors`](/api/vectors) class lets you map **multiple keys** to the **same
|
|
row** of the table. If you're using the
|
|
[`spacy init-model`](/api/cli#init-model) command to create a vocabulary,
|
|
pruning the vectors will be taken care of automatically if you set the
|
|
`--prune-vectors` flag. You can also do it manually in the following steps:
|
|
|
|
1. Start with a **word vectors model** that covers a huge vocabulary. For
|
|
instance, the [`en_vectors_web_lg`](/models/en-starters#en_vectors_web_lg)
|
|
model provides 300-dimensional GloVe vectors for over 1 million terms of
|
|
English.
|
|
2. If your vocabulary has values set for the `Lexeme.prob` attribute, the
|
|
lexemes will be sorted by descending probability to determine which vectors
|
|
to prune. Otherwise, lexemes will be sorted by their order in the `Vocab`.
|
|
3. Call [`Vocab.prune_vectors`](/api/vocab#prune_vectors) with the number of
|
|
vectors you want to keep.
|
|
|
|
```python
|
|
nlp = spacy.load('en_vectors_web_lg')
|
|
n_vectors = 105000 # number of vectors to keep
|
|
removed_words = nlp.vocab.prune_vectors(n_vectors)
|
|
|
|
assert len(nlp.vocab.vectors) <= n_vectors # unique vectors have been pruned
|
|
assert nlp.vocab.vectors.n_keys > n_vectors # but not the total entries
|
|
```
|
|
|
|
[`Vocab.prune_vectors`](/api/vocab#prune_vectors) reduces the current vector
|
|
table to a given number of unique entries, and returns a dictionary containing
|
|
the removed words, mapped to `(string, score)` tuples, where `string` is the
|
|
entry the removed word was mapped to, and `score` the similarity score between
|
|
the two words.
|
|
|
|
```python
|
|
### Removed words
|
|
{
|
|
"Shore": ("coast", 0.732257),
|
|
"Precautionary": ("caution", 0.490973),
|
|
"hopelessness": ("sadness", 0.742366),
|
|
"Continous": ("continuous", 0.732549),
|
|
"Disemboweled": ("corpse", 0.499432),
|
|
"biostatistician": ("scientist", 0.339724),
|
|
"somewheres": ("somewheres", 0.402736),
|
|
"observing": ("observe", 0.823096),
|
|
"Leaving": ("leaving", 1.0),
|
|
}
|
|
```
|
|
|
|
In the example above, the vector for "Shore" was removed and remapped to the
|
|
vector of "coast", which is deemed about 73% similar. "Leaving" was remapped to
|
|
the vector of "leaving", which is identical. If you're using the
|
|
[`init-model`](/api/cli#init-model) command, you can set the `--prune-vectors`
|
|
option to easily reduce the size of the vectors as you add them to a spaCy
|
|
model:
|
|
|
|
```bash
|
|
$ python -m spacy init-model /tmp/la_vectors_web_md --vectors-loc la.300d.vec.tgz --prune-vectors 10000
|
|
```
|
|
|
|
This will create a spaCy model with vectors for the first 10,000 words in the
|
|
vectors model. All other words in the vectors model are mapped to the closest
|
|
vector among those retained.
|
|
|
|
### Adding vectors {#adding-vectors}
|
|
|
|
```python
|
|
### Adding vectors
|
|
from spacy.vocab import Vocab
|
|
|
|
vector_data = {"dog": numpy.random.uniform(-1, 1, (300,)),
|
|
"cat": numpy.random.uniform(-1, 1, (300,)),
|
|
"orange": numpy.random.uniform(-1, 1, (300,))}
|
|
vocab = Vocab()
|
|
for word, vector in vector_data.items():
|
|
vocab.set_vector(word, vector)
|
|
```
|
|
|
|
### Using custom similarity methods {#custom-similarity}
|
|
|
|
By default, [`Token.vector`](/api/token#vector) returns the vector for its
|
|
underlying [`Lexeme`](/api/lexeme), while [`Doc.vector`](/api/doc#vector) and
|
|
[`Span.vector`](/api/span#vector) return an average of the vectors of their
|
|
tokens. You can customize these behaviors by modifying the `doc.user_hooks`,
|
|
`doc.user_span_hooks` and `doc.user_token_hooks` dictionaries.
|
|
|
|
<Infobox title="Custom user hooks" emoji="📖">
|
|
|
|
For more details on **adding hooks** and **overwriting** the built-in `Doc`,
|
|
`Span` and `Token` methods, see the usage guide on
|
|
[user hooks](/usage/processing-pipelines#custom-components-user-hooks).
|
|
|
|
</Infobox>
|
|
|
|
<!-- TODO:
|
|
|
|
### Storing vectors on a GPU {#gpu}
|
|
|
|
-->
|
|
|
|
## Other embeddings {#embeddings}
|
|
|
|
<!-- TODO: something about other embeddings -->
|