Upd MultiHashEmbed docs

Matthew Honnibal 2020-10-05 20:28:21 +02:00
parent b392d48e76
commit 919790cb47
1 changed file with 38 additions and 13 deletions


@@ -136,25 +136,50 @@ argument that connects to the shared `tok2vec` component in the pipeline.
> ```ini
> [model]
> @architectures = "spacy.MultiHashEmbed.v1"
> width = 64
> attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
> rows = [2000, 1000, 1000, 1000]
> include_static_vectors = true
> ```

Construct an embedding layer that separately embeds a number of lexical
attributes using hash embedding, concatenates the results, and passes it
through a feed-forward subnetwork to build mixed representations.
The features used can be configured with the `attrs` argument. The suggested
attributes are `NORM`, `PREFIX`, `SUFFIX` and `SHAPE`. This lets the model take
into account some subword information without constructing a fully
character-based representation. If pretrained vectors are available, they can
also be included in the representation, in which case the vectors table is
kept static (i.e. it is not updated).
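
For illustration only (this snippet is not part of the docs), the same setup
can be built programmatically from spaCy's architecture registry; the registry
lookup and the keyword arguments below are assumed to match the registered
`v1` signature in a spaCy v3 install:

```python
# Hedged sketch: build the embedding layer from spaCy's architecture registry.
# Assumes spaCy v3, where "spacy.MultiHashEmbed.v1" accepts these arguments.
import spacy

make_embed = spacy.registry.architectures.get("spacy.MultiHashEmbed.v1")
embed = make_embed(
    width=64,
    attrs=["NORM", "PREFIX", "SUFFIX", "SHAPE"],
    rows=[2000, 1000, 1000, 1000],  # one row count per attribute
    include_static_vectors=False,   # no vectors table loaded in this sketch
)
print(embed.name)  # a Thinc Model that maps List[Doc] -> List[Floats2d]
```
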
The `width` parameter specifies the output width of the layer and the widths
of all embedding tables. If static vectors are included, a learned linear
layer is used to map the vectors to the specified width before concatenating
them with the other embedding outputs. A single Maxout layer is then used to
reduce the concatenated vectors to the final width.
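
As a back-of-the-envelope check of those widths (illustrative only, following
the description above):

```python
# Illustrative arithmetic only: each attribute table contributes `width`
# dimensions, as do the static vectors once they are mapped to `width`.
width = 64
n_attrs = 4                      # NORM, PREFIX, SUFFIX, SHAPE
include_static_vectors = True    # vectors are first mapped to `width`
concatenated = width * (n_attrs + int(include_static_vectors))
print(concatenated)              # 320, reduced back to 64 by the Maxout layer
```
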
The `rows` parameter controls the number of rows used by the `HashEmbed`
tables. The `HashEmbed` layer needs surprisingly few rows, due to its use of
the hashing trick. Generally between 2000 and 10000 rows is sufficient,
even for very large vocabularies. A number of rows must be specified for each
table, so the `rows` list must be of the same length as the `attrs` parameter.
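
For example, a hypothetical configuration that gives the `NORM` table more rows
than the subword features while keeping `rows` aligned with `attrs` (the
numbers are arbitrary):

```python
# Hypothetical values: `rows` must line up one-to-one with `attrs`.
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 2500, 2500, 2500]  # larger table for NORM, smaller subword tables
assert len(rows) == len(attrs)
```
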
| Name | Description |
| ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `width` | The output width. Also used as the width of the embedding tables. Recommended values are between `64` and `300`. ~~int~~ |
| `attrs` | The token attributes to embed. A separate embedding table will be constructed for each attribute. ~~List[Union[int, str]]~~ |
| `rows` | The number of rows for each embedding table. Can be low, due to the hashing trick. Recommended values are between `1000` and `10000`. ~~List[int]~~ |
| `include_static_vectors` | Whether to also use static word vectors. Requires a vectors table to be loaded in the [Doc](/api/doc) objects' vocab. ~~bool~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
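
As a usage sketch (my own, not taken from this page's examples), the embedding
layer is typically wired into a `tok2vec` pipeline component together with an
encoder such as `spacy.MaxoutWindowEncoder.v1`; the encoder choice and all
values below are illustrative assumptions for a spaCy v3 install:

```python
# Hedged end-to-end sketch: plug MultiHashEmbed into a tok2vec component.
# Assumes spaCy v3; encoder settings and row counts are arbitrary choices.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "tok2vec",
    config={
        "model": {
            "@architectures": "spacy.Tok2Vec.v1",
            "embed": {
                "@architectures": "spacy.MultiHashEmbed.v1",
                "width": 96,
                "attrs": ["NORM", "PREFIX", "SUFFIX", "SHAPE"],
                "rows": [5000, 2500, 2500, 2500],
                "include_static_vectors": False,
            },
            "encode": {
                "@architectures": "spacy.MaxoutWindowEncoder.v1",
                "width": 96,
                "depth": 4,
                "window_size": 1,
                "maxout_pieces": 3,
            },
        }
    },
)
nlp.initialize()  # random weights; assumed to need no examples for tok2vec
doc = nlp("Embedding tables can stay small thanks to the hashing trick.")
print(doc.tensor.shape)  # (n_tokens, 96)
```
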
### spacy.CharacterEmbed.v1 {#CharacterEmbed}