spaCy/website/docs/api/span.md

513 lines
22 KiB
Markdown
Raw Normal View History

---
title: Span
tag: class
source: spacy/tokens/span.pyx
---
A slice from a [`Doc`](/api/doc) object.
## Span.\_\_init\_\_ {#init tag="method"}
Create a Span object from the slice `doc[start : end]`.
> #### Example
>
> ```python
> doc = nlp("Give it back! He pleaded.")
> span = doc[1:4]
> assert [t.text for t in span] == ["it", "back", "!"]
> ```
Documentation for Entity Linking (#4065) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * typo fix * add candidate API to kb documentation * update API sidebar with EntityLinker and KnowledgeBase * remove EL from 101 docs * remove entity linker from 101 pipelines / rephrase * custom el model instead of existing model * set version to 2.2 for EL functionality * update documentation for 2 CLI scripts
2019-09-12 09:38:34 +00:00
| Name | Type | Description |
| ----------- | ---------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
Documentation for Entity Linking (#4065) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * typo fix * add candidate API to kb documentation * update API sidebar with EntityLinker and KnowledgeBase * remove EL from 101 docs * remove entity linker from 101 pipelines / rephrase * custom el model instead of existing model * set version to 2.2 for EL functionality * update documentation for 2 CLI scripts
2019-09-12 09:38:34 +00:00
| `doc` | `Doc` | The parent document. |
| `start` | int | The index of the first token of the span. |
| `end` | int | The index of the first token after the span. |
| `label` | int / unicode | A label to attach to the span, e.g. for named entities. As of v2.1, the label can also be a unicode string. |
| `kb_id` | int / unicode | A knowledge base ID to attach to the span, e.g. for named entities. The ID can be an integer or a unicode string. |
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object. |
## Span.\_\_getitem\_\_ {#getitem tag="method"}
Get a `Token` object.
> #### Example
>
> ```python
> doc = nlp("Give it back! He pleaded.")
> span = doc[1:4]
> assert span[1].text == "back"
> ```
| Name | Type | Description |
| ----------- | ------- | --------------------------------------- |
| `i` | int | The index of the token within the span. |
| **RETURNS** | `Token` | The token at `span[i]`. |
Get a `Span` object.
> #### Example
>
> ```python
> doc = nlp("Give it back! He pleaded.")
> span = doc[1:4]
> assert span[1:3].text == "back!"
> ```
| Name | Type | Description |
| ----------- | ------ | -------------------------------- |
| `start_end` | tuple | The slice of the span to get. |
| **RETURNS** | `Span` | The span at `span[start : end]`. |
## Span.\_\_iter\_\_ {#iter tag="method"}
Iterate over `Token` objects.
> #### Example
>
> ```python
> doc = nlp("Give it back! He pleaded.")
> span = doc[1:4]
> assert [t.text for t in span] == ["it", "back", "!"]
> ```
| Name | Type | Description |
| ---------- | ------- | ----------------- |
| **YIELDS** | `Token` | A `Token` object. |
## Span.\_\_len\_\_ {#len tag="method"}
Get the number of tokens in the span.
> #### Example
>
> ```python
> doc = nlp("Give it back! He pleaded.")
> span = doc[1:4]
> assert len(span) == 3
> ```
| Name | Type | Description |
| ----------- | ---- | --------------------------------- |
| **RETURNS** | int | The number of tokens in the span. |
## Span.set_extension {#set_extension tag="classmethod" new="2"}
Define a custom attribute on the `Span` which becomes available via `Span._`.
For details, see the documentation on
[custom attributes](/usage/processing-pipelines#custom-components-attributes).
> #### Example
>
> ```python
> from spacy.tokens import Span
> city_getter = lambda span: any(city in span.text for city in ("New York", "Paris", "Berlin"))
> Span.set_extension("has_city", getter=city_getter)
> doc = nlp("I like New York in Autumn")
> assert doc[1:4]._.has_city
> ```
| Name | Type | Description |
| --------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `name` | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `span._.my_attr`. |
| `default` | - | Optional default value of the attribute if no getter or method is defined. |
| `method` | callable | Set a custom method on the object, for example `span._.compare(other_span)`. |
| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
| `setter` | callable | Setter function that takes the `Span` and a value, and modifies the object. Is called when the user writes to the `Span._` attribute. |
| `force` | bool | Force overwriting existing attribute. |
## Span.get_extension {#get_extension tag="classmethod" new="2"}
Look up a previously registered extension by name. Returns a 4-tuple
`(default, method, getter, setter)` if the extension is registered. Raises a
`KeyError` otherwise.
> #### Example
>
> ```python
> from spacy.tokens import Span
> Span.set_extension("is_city", default=False)
> extension = Span.get_extension("is_city")
> assert extension == (False, None, None, None)
> ```
| Name | Type | Description |
| ----------- | ------- | ------------------------------------------------------------- |
| `name` | unicode | Name of the extension. |
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
## Span.has_extension {#has_extension tag="classmethod" new="2"}
Check whether an extension has been registered on the `Span` class.
> #### Example
>
> ```python
> from spacy.tokens import Span
> Span.set_extension("is_city", default=False)
> assert Span.has_extension("is_city")
> ```
| Name | Type | Description |
| ----------- | ------- | ------------------------------------------ |
| `name` | unicode | Name of the extension to check. |
| **RETURNS** | bool | Whether the extension has been registered. |
## Span.remove_extension {#remove_extension tag="classmethod" new="2.0.12"}
Remove a previously registered extension.
> #### Example
>
> ```python
> from spacy.tokens import Span
> Span.set_extension("is_city", default=False)
> removed = Span.remove_extension("is_city")
> assert not Span.has_extension("is_city")
> ```
| Name | Type | Description |
| ----------- | ------- | --------------------------------------------------------------------- |
| `name` | unicode | Name of the extension. |
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
## Span.char_span {#char_span tag="method" new="2.2.4"}
Create a `Span` object from the slice `span.text[start:end]`. Returns `None` if
the character indices don't map to a valid span.
> #### Example
>
> ```python
> doc = nlp("I like New York")
> span = doc[1:4].char_span(5, 13, label="GPE")
> assert span.text == "New York"
> ```
| Name | Type | Description |
| ----------- | ---------------------------------------- | --------------------------------------------------------------------- |
| `start` | int | The index of the first character of the span. |
| `end` | int | The index of the last character after the span. |
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
| `kb_id` | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object or `None`. |
## Span.similarity {#similarity tag="method" model="vectors"}
Make a semantic similarity estimate. The default estimate is cosine similarity
using an average of word vectors.
> #### Example
>
> ```python
> doc = nlp("green apples and red oranges")
> green_apples = doc[:2]
> red_oranges = doc[3:]
> apples_oranges = green_apples.similarity(red_oranges)
> oranges_apples = red_oranges.similarity(green_apples)
> assert apples_oranges == oranges_apples
> ```
| Name | Type | Description |
| ----------- | ----- | -------------------------------------------------------------------------------------------- |
| `other` | - | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. |
| **RETURNS** | float | A scalar similarity score. Higher is more similar. |
## Span.get_lca_matrix {#get_lca_matrix tag="method"}
Calculates the lowest common ancestor matrix for a given `Span`. Returns LCA
matrix containing the integer index of the ancestor, or `-1` if no common
ancestor is found, e.g. if span excludes a necessary ancestor.
> #### Example
>
> ```python
> doc = nlp("I like New York in Autumn")
> span = doc[1:4]
> matrix = span.get_lca_matrix()
> # array([[0, 0, 0], [0, 1, 2], [0, 2, 2]], dtype=int32)
> ```
| Name | Type | Description |
| ----------- | -------------------------------------- | ------------------------------------------------ |
| **RETURNS** | `numpy.ndarray[ndim=2, dtype='int32']` | The lowest common ancestor matrix of the `Span`. |
## Span.to_array {#to_array tag="method" new="2"}
Given a list of `M` attribute IDs, export the tokens to a numpy `ndarray` of
shape `(N, M)`, where `N` is the length of the document. The values will be
32-bit integers.
> #### Example
>
> ```python
> from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
> doc = nlp("I like New York in Autumn.")
> span = doc[2:3]
> # All strings mapped to integers, for easy export to numpy
> np_array = span.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
> ```
| Name | Type | Description |
| ----------- | ----------------------------- | -------------------------------------------------------------------------------------------------------- |
| `attr_ids` | list | A list of attribute ID ints. |
| **RETURNS** | `numpy.ndarray[long, ndim=2]` | A feature matrix, with one row per word, and one column per attribute indicated in the input `attr_ids`. |
## Span.merge {#merge tag="method"}
<Infobox title="Deprecation note" variant="danger">
As of v2.1.0, `Span.merge` still works but is considered deprecated. You should
use the new and less error-prone [`Doc.retokenize`](/api/doc#retokenize)
instead.
</Infobox>
Retokenize the document, such that the span is merged into a single token.
> #### Example
>
> ```python
> doc = nlp("I like New York in Autumn.")
> span = doc[2:4]
> span.merge()
> assert len(doc) == 6
> assert doc[2].text == "New York"
> ```
| Name | Type | Description |
| -------------- | ------- | ------------------------------------------------------------------------------------------------------------------------- |
| `**attributes` | - | Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span. |
| **RETURNS** | `Token` | The newly merged token. |
## Span.ents {#ents tag="property" new="2.0.13" model="ner"}
The named entities in the span. Returns a tuple of named entity `Span` objects,
if the entity recognizer has been applied.
> #### Example
>
> ```python
> doc = nlp("Mr. Best flew to New York on Saturday morning.")
> span = doc[0:6]
> ents = list(span.ents)
> assert ents[0].label == 346
> assert ents[0].label_ == "PERSON"
> assert ents[0].text == "Mr. Best"
> ```
| Name | Type | Description |
| ----------- | ----- | -------------------------------------------- |
| **RETURNS** | tuple | Entities in the span, one `Span` per entity. |
## Span.as_doc {#as_doc tag="method"}
Create a new `Doc` object corresponding to the `Span`, with a copy of the data.
> #### Example
>
> ```python
> doc = nlp("I like New York in Autumn.")
> span = doc[2:4]
> doc2 = span.as_doc()
> assert doc2.text == "New York"
> ```
| Name | Type | Description |
| ---------------- | ----- | ---------------------------------------------------- |
| `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |
## Span.root {#root tag="property" model="parser"}
The token with the shortest path to the root of the sentence (or the root
itself). If multiple tokens are equally high in the tree, the first token is
taken.
> #### Example
>
> ```python
> doc = nlp("I like New York in Autumn.")
> i, like, new, york, in_, autumn, dot = range(len(doc))
> assert doc[new].head.text == "York"
> assert doc[york].head.text == "like"
> new_york = doc[new:york+1]
> assert new_york.root.text == "York"
> ```
| Name | Type | Description |
| ----------- | ------- | --------------- |
| **RETURNS** | `Token` | The root token. |
## Span.conjuncts {#conjuncts tag="property" model="parser"}
A tuple of tokens coordinated to `span.root`.
> #### Example
>
> ```python
> doc = nlp("I like apples and oranges")
> apples_conjuncts = doc[2:3].conjuncts
> assert [t.text for t in apples_conjuncts] == ["oranges"]
> ```
2019-03-11 16:10:50 +00:00
| Name | Type | Description |
| ----------- | ------- | ----------------------- |
| **RETURNS** | `tuple` | The coordinated tokens. |
## Span.lefts {#lefts tag="property" model="parser"}
Tokens that are to the left of the span, whose heads are within the span.
> #### Example
>
> ```python
> doc = nlp("I like New York in Autumn.")
> lefts = [t.text for t in doc[3:7].lefts]
> assert lefts == ["New"]
> ```
| Name | Type | Description |
| ---------- | ------- | ------------------------------------ |
| **YIELDS** | `Token` | A left-child of a token of the span. |
## Span.rights {#rights tag="property" model="parser"}
Tokens that are to the right of the span, whose heads are within the span.
> #### Example
>
> ```python
> doc = nlp("I like New York in Autumn.")
> rights = [t.text for t in doc[2:4].rights]
> assert rights == ["in"]
> ```
| Name | Type | Description |
| ---------- | ------- | ------------------------------------- |
| **YIELDS** | `Token` | A right-child of a token of the span. |
## Span.n_lefts {#n_lefts tag="property" model="parser"}
The number of tokens that are to the left of the span, whose heads are within
the span.
> #### Example
>
> ```python
> doc = nlp("I like New York in Autumn.")
> assert doc[3:7].n_lefts == 1
> ```
| Name | Type | Description |
| ----------- | ---- | -------------------------------- |
| **RETURNS** | int | The number of left-child tokens. |
## Span.n_rights {#n_rights tag="property" model="parser"}
The number of tokens that are to the right of the span, whose heads are within
the span.
> #### Example
>
> ```python
> doc = nlp("I like New York in Autumn.")
> assert doc[2:4].n_rights == 1
> ```
| Name | Type | Description |
| ----------- | ---- | --------------------------------- |
| **RETURNS** | int | The number of right-child tokens. |
## Span.subtree {#subtree tag="property" model="parser"}
Tokens within the span and tokens which descend from them.
> #### Example
>
> ```python
> doc = nlp("Give it back! He pleaded.")
> subtree = [t.text for t in doc[:3].subtree]
> assert subtree == ["Give", "it", "back", "!"]
> ```
| Name | Type | Description |
| ---------- | ------- | ------------------------------------------------- |
| **YIELDS** | `Token` | A token within the span, or a descendant from it. |
## Span.has_vector {#has_vector tag="property" model="vectors"}
A boolean value indicating whether a word vector is associated with the object.
> #### Example
>
> ```python
> doc = nlp("I like apples")
> assert doc[1:].has_vector
> ```
| Name | Type | Description |
| ----------- | ---- | -------------------------------------------- |
| **RETURNS** | bool | Whether the span has a vector data attached. |
## Span.vector {#vector tag="property" model="vectors"}
A real-valued meaning representation. Defaults to an average of the token
vectors.
> #### Example
>
> ```python
> doc = nlp("I like apples")
> assert doc[1:].vector.dtype == "float32"
> assert doc[1:].vector.shape == (300,)
> ```
| Name | Type | Description |
| ----------- | ---------------------------------------- | --------------------------------------------------- |
| **RETURNS** | `numpy.ndarray[ndim=1, dtype='float32']` | A 1D numpy array representing the span's semantics. |
## Span.vector_norm {#vector_norm tag="property" model="vectors"}
The L2 norm of the span's vector representation.
> #### Example
>
> ```python
> doc = nlp("I like apples")
> doc[1:].vector_norm # 4.800883928527915
> doc[2:].vector_norm # 6.895897646384268
> assert doc[1:].vector_norm != doc[2:].vector_norm
> ```
| Name | Type | Description |
| ----------- | ----- | ----------------------------------------- |
| **RETURNS** | float | The L2 norm of the vector representation. |
## Attributes {#attributes}
2019-08-01 16:37:09 +00:00
| Name | Type | Description |
| --------------------------------------- | ------------ | -------------------------------------------------------------------------------------------------------------- |
| `doc` | `Doc` | The parent document. |
| `tensor` <Tag variant="new">2.1.7</Tag> | `ndarray` | The span's slice of the parent `Doc`'s tensor. |
| `sent` | `Span` | The sentence span that this span is a part of. |
| `start` | int | The token offset for the start of the span. |
| `end` | int | The token offset for the end of the span. |
| `start_char` | int | The character offset for the start of the span. |
| `end_char` | int | The character offset for the end of the span. |
| `text` | unicode | A unicode representation of the span text. |
| `text_with_ws` | unicode | The text content of the span with a trailing whitespace character if the last token has one. |
| `orth` | int | ID of the verbatim text content. |
| `orth_` | unicode | Verbatim text content (identical to `Span.text`). Exists mostly for consistency with the other attributes. |
Documentation for Entity Linking (#4065) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * typo fix * add candidate API to kb documentation * update API sidebar with EntityLinker and KnowledgeBase * remove EL from 101 docs * remove entity linker from 101 pipelines / rephrase * custom el model instead of existing model * set version to 2.2 for EL functionality * update documentation for 2 CLI scripts
2019-09-12 09:38:34 +00:00
| `label` | int | The hash value of the span's label. |
2019-08-01 16:37:09 +00:00
| `label_` | unicode | The span's label. |
| `lemma_` | unicode | The span's lemma. |
Documentation for Entity Linking (#4065) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * typo fix * add candidate API to kb documentation * update API sidebar with EntityLinker and KnowledgeBase * remove EL from 101 docs * remove entity linker from 101 pipelines / rephrase * custom el model instead of existing model * set version to 2.2 for EL functionality * update documentation for 2 CLI scripts
2019-09-12 09:38:34 +00:00
| `kb_id` | int | The hash value of the knowledge base ID referred to by the span. |
| `kb_id_` | unicode | The knowledge base ID referred to by the span. |
2019-08-01 16:37:09 +00:00
| `ent_id` | int | The hash value of the named entity the token is an instance of. |
| `ent_id_` | unicode | The string ID of the named entity the token is an instance of. |
| `sentiment` | float | A scalar value indicating the positivity or negativity of the span. |
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |