diff --git a/website/docs/usage/vectors-embeddings.md b/website/docs/usage/vectors-embeddings.md
index 823b30c20..184436d12 100644
--- a/website/docs/usage/vectors-embeddings.md
+++ b/website/docs/usage/vectors-embeddings.md
@@ -2,28 +2,17 @@
 title: Vectors and Embeddings
 menu:
   - ["What's a Word Vector?", 'whats-a-vector']
-  - ['Word Vectors', 'vectors']
-  - ['Other Embeddings', 'embeddings']
+  - ['Using Word Vectors', 'usage']
+  - ['Converting and Importing', 'converting']
 next: /usage/transformers
 ---

-An old idea in linguistics is that you can "know a word by the company it
-keeps": that is, word meanings can be understood relationally, based on their
-patterns of usage. This idea inspired a branch of NLP research known as
-"distributional semantics" that has aimed to compute databases of lexical
-knowledge automatically. The [Word2vec](https://en.wikipedia.org/wiki/Word2vec)
-family of algorithms are a key milestone in this line of research. For
-simplicity, we will refer to a distributional word representation as a "word
-vector", and algorithms that computes word vectors (such as
-[GloVe](https://nlp.stanford.edu/projects/glove/),
-[FastText](https://fasttext.cc), etc.) as "Word2vec algorithms".
-
+Word vector tables (or "embeddings") let you find similar terms, and can improve
+the accuracy of some of your components. You can even use word vectors as a
+quick-and-dirty text-classification solution when you don't have any training data.
 Word vector tables are included in some of the spaCy [model packages](/models)
 we distribute, and you can easily create your own model packages with word
-vectors you train or download yourself. In some cases you can also add word
-vectors to an existing pipeline, although each pipeline can only have a single
-word vectors table, and a model package that already has word vectors is
-unlikely to work correctly if you replace the vectors with new ones.
+vectors you train or download yourself.

 ## What's a word vector? {#whats-a-vector}

@@ -42,6 +31,17 @@ def what_is_a_word_vector(
     return vectors_table[key2row.get(word_id, default_row)]
 ```

+An old idea in linguistics is that you can "know a word by the company it
+keeps": that is, word meanings can be understood relationally, based on their
+patterns of usage. This idea inspired a branch of NLP research known as
+"distributional semantics" that has aimed to compute databases of lexical
+knowledge automatically. The [Word2vec](https://en.wikipedia.org/wiki/Word2vec)
+family of algorithms is a key milestone in this line of research. For
+simplicity, we will refer to a distributional word representation as a "word
+vector", and to algorithms that compute word vectors (such as
+[GloVe](https://nlp.stanford.edu/projects/glove/),
+[FastText](https://fasttext.cc), etc.) as "Word2vec algorithms".
+
 Word2vec algorithms try to produce vectors tables that let you estimate useful
 relationships between words using simple linear algebra operations. For
 instance, you can often find close synonyms of a word by finding the vectors
@@ -51,14 +51,15 @@ statistical models.

 ### Word vectors vs. contextual language models {#vectors-vs-language-models}

-The key difference between word vectors and contextual language models such as
-ElMo, BERT and GPT-2 is that word vectors model **lexical types**, rather than
-_tokens_. If you have a list of terms with no context around them, a model like
-BERT can't really help you. BERT is designed to understand language **in
-context**, which isn't what you have. A word vectors table will be a much better
-fit for your task. However, if you do have words in context — whole sentences or
-paragraphs of running text — word vectors will only provide a very rough
-approximation of what the text is about.
+The key difference between word vectors and contextual language models such
+as [transformers](/usage/transformers) is that word vectors model **lexical
+types**, rather than _tokens_. If you have a list of terms with no context
+around them, a transformer model like BERT can't really help you. BERT is
+designed to understand language **in context**, which isn't what you have.
+A word vectors table will be a much better fit for your task. However, if you
+do have words in context — whole sentences or paragraphs of running text —
+word vectors will only provide a very rough approximation of what the text is
+about.
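+
+For example, because the vectors table is keyed on the lexical type, a word
+gets exactly the same vector no matter which sentence it appears in. Here's a
+quick check, assuming a package with vectors such as `en_core_web_md` is
+installed:
+
+```python
+import numpy
+import spacy
+
+nlp = spacy.load("en_core_web_md")
+doc1 = nlp("I deposited money at the bank.")
+doc2 = nlp("We had a picnic on the river bank.")
+bank1 = [token for token in doc1 if token.text == "bank"][0]
+bank2 = [token for token in doc2 if token.text == "bank"][0]
+# Same lexical type, so the same vector, regardless of the surrounding words
+print(numpy.array_equal(bank1.vector, bank2.vector))  # True
+```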

 Word vectors are also very computationally efficient, as they map a word to a
 vector with a single indexing operation. Word vectors are therefore useful as a
@@ -69,7 +70,7 @@ gradients to the pretrained word vectors table. The static vectors table is
 usually used in combination with a smaller table of learned task-specific
 embeddings.

-## Using word vectors directly {#vectors}
+## Using word vectors {#usage}

 spaCy stores word vector information in the
 [`Vocab.vectors`](/api/vocab#attributes) attribute, so you can access the whole
@@ -85,7 +86,141 @@ whether you've configured spaCy to use GPU memory), with dtype `float32`. The
 array is read-only so that spaCy can avoid unnecessary copy operations where
 possible. You can modify the vectors via the `Vocab` or `Vectors` table.

-### Converting word vectors for use in spaCy
+### Word vectors and similarity
+
+A common use case for word vectors is to answer _similarity questions_. You can
+ask how similar a `Token`, `Span`, `Doc` or `Lexeme` is to another object using
+the `.similarity()` method. You can even check the similarity of mismatched
+types, asking how similar a whole document is to a particular word, how similar
+a span is to a document, etc. By default, the `.similarity()` method returns
+the cosine similarity of the `.vector` attributes of the two objects being
+compared. You can customize this behavior by setting one or more
+[user hooks](/usage/processing-pipelines#custom-components-user-hooks) for the
+types you want to override.
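+
+For instance, here's a quick way to get a feel for these scores, again assuming
+a package with vectors such as `en_core_web_md` is installed:
+
+```python
+import spacy
+
+nlp = spacy.load("en_core_web_md")
+doc1 = nlp("I like salty fries and hamburgers.")
+doc2 = nlp("Fast food tastes very good.")
+# Doc-to-Doc similarity: the cosine of the averaged word vectors
+print(doc1.similarity(doc2))
+# Mismatched types work too, e.g. comparing a Span to a Token
+french_fries = doc1[2:4]
+burgers = doc1[5]
+print(french_fries.similarity(burgers))
+```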
+
+Word vector similarity is a practical technique for many situations, especially
+since it's easy to use and relatively efficient to compute. However, it's
+important to maintain realistic expectations about what information it can
+provide. Words can be related to each other in many ways, so a single
+"similarity" score will always be a mix of different signals. The word vectors
+are also not trained for your specific use case, so you have no way of telling
+the model which results are more or less useful for your purpose. These
+problems are even more acute when you go from measuring the similarity of
+single words to the similarity of spans or documents. The vector averaging
+process is insensitive to the order of the words, so `doc1.similarity(doc2)`
+will mostly be based on the overlap in lexical items between the two `Doc`
+objects. Two documents expressing the same meaning with dissimilar wording will
+return a lower similarity score than two documents that happen to contain the
+same words while expressing different meanings.
+
+### Using word vectors in your models
+
+Many neural network models are able to use word vector tables as additional
+features, which sometimes results in significant improvements in accuracy.
+spaCy's built-in embedding layer, `spacy.MultiHashEmbed.v1`, can be configured
+to use word vector tables using the `also_use_static_vectors` flag. This
+setting is also available on the `spacy.MultiHashEmbedCNN.v1` layer, which
+builds the default token-to-vector encoding architecture.
+
+```
+[tagger.model.tok2vec.embed]
+@architectures = "spacy.MultiHashEmbed.v1"
+width = 128
+rows = 7000
+also_embed_subwords = true
+also_use_static_vectors = true
+```
+
+The configuration system will look up the string `spacy.MultiHashEmbed.v1`
+in the `architectures` registry, and call the returned object with the rest of
+the arguments from the block. This will result in a call to the
+`spacy.ml.models.tok2vec.MultiHashEmbed` function, which will return a Thinc
+model object with the type signature `Model[List[Doc], List[Floats2d]]`.
+Because the embedding layer takes a list of `Doc` objects as input, it does
+not need to store a copy of the vectors table. The vectors will be retrieved
+from the `Doc` objects that are passed in, via the `doc.vocab.vectors`
+attribute. This part of the process is handled by the
+`spacy.ml.staticvectors.StaticVectors` layer.
+
+#### Creating a custom embedding layer
+
+The `MultiHashEmbed` layer is spaCy's recommended strategy for constructing
+initial word representations for your neural network models, but you can also
+implement your own. You can register any function to a string name, and then
+reference that function within your config (see the [training](/usage/training)
+section for more details). To try this out, you can save the following little
+example to a new Python file:
+
+```python
+from typing import List
+from thinc.api import Model
+from thinc.types import Floats2d
+from spacy.tokens import Doc
+from spacy.ml.staticvectors import StaticVectors
+from spacy.util import registry
+
+print("I was imported!")
+
+@registry.architectures("my_example.MyEmbedding.v1")
+def MyEmbedding(output_width: int) -> Model[List[Doc], List[Floats2d]]:
+    print("I was called!")
+    return StaticVectors(nO=output_width)
+```
+
+If you pass the path to your file to the `spacy train` command using the `-c`
+argument, your file will be imported, which means the decorator registering the
+function will be run. Your function is now on equal footing with any of spaCy's
+built-ins, so you can drop it in instead of any other model with the same input
+and output signature. For instance, you could use it in the tagger model as
+follows:
+
+```
+[tagger.model.tok2vec.embed]
+@architectures = "my_example.MyEmbedding.v1"
+output_width = 128
+```
+
+Now that you have a custom function wired into the network, you can start
+implementing the logic you're interested in. For example, let's say you want to
+try a relatively simple embedding strategy that makes use of static word
+vectors, but combines them via summation with a smaller table of learned
+embeddings.
+
+```python
+from typing import Dict, List
+from thinc.api import add, chain, remap_ids, Embed, Model
+from thinc.types import Floats2d
+from spacy.tokens import Doc
+from spacy.ml.featureextractor import FeatureExtractor
+from spacy.ml.staticvectors import StaticVectors
+from spacy.util import registry
+
+@registry.architectures("my_example.MyEmbedding.v1")
+def MyCustomVectors(
+    output_width: int,
+    vector_width: int,
+    embed_rows: int,
+    key2row: Dict[int, int]
+) -> Model[List[Doc], List[Floats2d]]:
+    return add(
+        StaticVectors(nO=output_width),
+        chain(
+            FeatureExtractor(["ORTH"]),
+            remap_ids(key2row),
+            Embed(nO=output_width, nV=embed_rows)
+        )
+    )
+```
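+
+This sketch leaves open where the `key2row` mapping comes from. One option is
+to map the hashes of your most frequent words to consecutive rows, reserving
+row 0 as a shared fallback for everything else. The helper below is purely
+illustrative: the `make_key2row` name and the source of the frequent words are
+assumptions, not part of spaCy's API.
+
+```python
+from typing import Dict, List
+from spacy.vocab import Vocab
+
+def make_key2row(
+    vocab: Vocab, frequent_words: List[str], embed_rows: int
+) -> Dict[int, int]:
+    # Map each frequent word's ORTH hash to its own row, starting at 1 so that
+    # row 0 stays free as a shared fallback row for all other words.
+    return {
+        vocab.strings.add(word): row
+        for row, word in enumerate(frequent_words[: embed_rows - 1], start=1)
+    }
+```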
+
+#### When should you add word vectors to your model?
+
+Word vectors are not compatible with most [transformer models](/usage/transformers),
+but if you're training another type of NLP network, it's almost always worth
+adding word vectors to your model. As well as improving your final accuracy,
+word vectors often make experiments more consistent, as the accuracy you reach
+will be less sensitive to how the network is randomly initialized. High
+variance due to random chance can slow down your progress significantly, as you
+need to run many experiments to filter the signal from the noise.
+
+Word vector features need to be enabled prior to training, and the same word
+vectors table will need to be available at runtime as well. You cannot add word
+vector features once the model has already been trained, and you usually cannot
+replace one word vectors table with another without causing a significant loss
+of performance.
+
+## Converting word vectors for use in spaCy {#converting}

 Custom word vectors can be trained using a number of open-source libraries,
 such as [Gensim](https://radimrehurek.com/gensim), [Fast Text](https://fasttext.cc),
@@ -185,6 +320,13 @@ vector among those retained.

 ### Adding vectors {#adding-vectors}

+You can also add word vectors individually, using the
+[`Vocab.set_vector`](/api/vocab#set_vector) method. This is often the easiest
+approach if you have vectors in an arbitrary format, as you can read in the
+vectors with your own logic, and just set them with a simple loop. This method
+is likely to be slower than approaches that work with the whole vectors table
+at once, but it's a great option for one-off conversions before you save out
+your model to disk.
+
 ```python
 ### Adding vectors
 from spacy.vocab import Vocab
@@ -196,29 +338,3 @@ vocab = Vocab()
 for word, vector in vector_data.items():
     vocab.set_vector(word, vector)
 ```
-
-### Using custom similarity methods {#custom-similarity}
-
-By default, [`Token.vector`](/api/token#vector) returns the vector for its
-underlying [`Lexeme`](/api/lexeme), while [`Doc.vector`](/api/doc#vector) and
-[`Span.vector`](/api/span#vector) return an average of the vectors of their
-tokens. You can customize these behaviors by modifying the `doc.user_hooks`,
-`doc.user_span_hooks` and `doc.user_token_hooks` dictionaries.
-
-
-
-For more details on **adding hooks** and **overwriting** the built-in `Doc`,
-`Span` and `Token` methods, see the usage guide on
-[user hooks](/usage/processing-pipelines#custom-components-user-hooks).
-
-
-
-
-
-## Other embeddings {#embeddings}
-
-
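+
+If your vectors come in a simple text format, the same idea extends to a small
+one-off conversion script. The following is only a sketch: it assumes a
+hypothetical `custom_vectors.txt` file with one word followed by its values per
+line, whitespace-separated, and an output directory of your choice.
+
+```python
+import numpy
+import spacy
+
+nlp = spacy.blank("en")
+with open("custom_vectors.txt", encoding="utf8") as file_:
+    for line in file_:
+        word, *values = line.rstrip().split(" ")
+        # Every vector needs to have the same width
+        vector = numpy.asarray([float(value) for value in values], dtype="f")
+        nlp.vocab.set_vector(word, vector)
+nlp.to_disk("./my_model_with_vectors")
+```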