spaCy/website/docs/api/doc.md

39 KiB
Raw Blame History

title tag teaser source
Doc class A container for accessing linguistic annotations. spacy/tokens/doc.pyx

A Doc is a sequence of Token objects. Access sentences and named entities, export annotations to numpy arrays, losslessly serialize to compressed binary strings. The Doc object holds an array of TokenC structs. The Python-level Token and Span objects are views of this array, i.e. they don't own the data themselves.

Doc.__init__

Construct a Doc object. The most common way to get a Doc object is via the nlp object.

Example

# Construction 1
doc = nlp("Some text")

# Construction 2
from spacy.tokens import Doc

words = ["hello", "world", "!"]
spaces = [True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
Name Description
vocab A storage container for lexical types. Vocab
words A list of strings to add to the container. Optional[List[str]]
spaces A list of boolean values indicating whether each word has a subsequent space. Must have the same length as words, if specified. Defaults to a sequence of True. Optional[List[bool]]
keyword-only
user\_data Optional extra data to attach to the Doc. Dict
tags 3 A list of strings, of the same length as words, to assign as token.tag for each word. Defaults to None. Optional[List[str]]
pos 3 A list of strings, of the same length as words, to assign as token.pos for each word. Defaults to None. Optional[List[str]]
morphs 3 A list of strings, of the same length as words, to assign as token.morph for each word. Defaults to None. Optional[List[str]]
lemmas 3 A list of strings, of the same length as words, to assign as token.lemma for each word. Defaults to None. Optional[List[str]]
heads 3 A list of values, of the same length as words, to assign as the head for each word. Head indices are the absolute position of the head in the Doc. Defaults to None. Optional[List[int]]
deps 3 A list of strings, of the same length as words, to assign as token.dep for each word. Defaults to None. Optional[List[str]]
sent_starts 3 A list of values, of the same length as words, to assign as token.is_sent_start. Will be overridden by heads if heads is provided. Defaults to None. Optional[List[Union[bool, None]]
ents 3 A list of strings, of the same length of words, to assign the token-based IOB tag. Defaults to None. Optional[List[str]]

Doc.__getitem__

Get a Token object at position i, where i is an integer. Negative indexing is supported, and follows the usual Python semantics, i.e. doc[-2] is doc[len(doc) - 2].

Example

doc = nlp("Give it back! He pleaded.")
assert doc[0].text == "Give"
assert doc[-1].text == "."
span = doc[1:3]
assert span.text == "it back"
Name Description
i The index of the token. int
RETURNS The token at doc[i]. Token

Get a Span object, starting at position start (token index) and ending at position end (token index). For instance, doc[2:5] produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step]) are not supported, as Span objects must be contiguous (cannot have gaps). You can use negative indices and open-ended ranges, which have their normal Python semantics.

Name Description
start_end The slice of the document to get. Tuple[int, int]
RETURNS The span at doc[start:end]. Span

Doc.__iter__

Iterate over Token objects, from which the annotations can be easily accessed.

Example

doc = nlp("Give it back")
assert [t.text for t in doc] == ["Give", "it", "back"]

This is the main way of accessing Token objects, which are the main way annotations are accessed from Python. If faster-than-Python speeds are required, you can instead access the annotations as a numpy array, or access the underlying C data directly from Cython.

Name Description
YIELDS A Token object. Token

Doc.__len__

Get the number of tokens in the document.

Example

doc = nlp("Give it back! He pleaded.")
assert len(doc) == 7
Name Description
RETURNS The number of tokens in the document. int

Doc.set_extension

Define a custom attribute on the Doc which becomes available via Doc._. For details, see the documentation on custom attributes.

Example

from spacy.tokens import Doc
city_getter = lambda doc: any(city in doc.text for city in ("New York", "Paris", "Berlin"))
Doc.set_extension("has_city", getter=city_getter)
doc = nlp("I like New York")
assert doc._.has_city
Name Description
name Name of the attribute to set by the extension. For example, "my_attr" will be available as doc._.my_attr. str
default Optional default value of the attribute if no getter or method is defined. Optional[Any]
method Set a custom method on the object, for example doc._.compare(other_doc). Optional[CallableDoc, ...], Any
getter Getter function that takes the object and returns an attribute value. Is called when the user accesses the ._ attribute. Optional[CallableDoc], Any
setter Setter function that takes the Doc and a value, and modifies the object. Is called when the user writes to the Doc._ attribute. Optional[CallableDoc, Any], None
force Force overwriting existing attribute. bool

Doc.get_extension

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

Example

from spacy.tokens import Doc
Doc.set_extension("has_city", default=False)
extension = Doc.get_extension("has_city")
assert extension == (False, None, None, None)
Name Description
name Name of the extension. str
RETURNS A (default, method, getter, setter) tuple of the extension. Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]

Doc.has_extension

Check whether an extension has been registered on the Doc class.

Example

from spacy.tokens import Doc
Doc.set_extension("has_city", default=False)
assert Doc.has_extension("has_city")
Name Description
name Name of the extension to check. str
RETURNS Whether the extension has been registered. bool

Doc.remove_extension

Remove a previously registered extension.

Example

from spacy.tokens import Doc
Doc.set_extension("has_city", default=False)
removed = Doc.remove_extension("has_city")
assert not Doc.has_extension("has_city")
Name Description
name Name of the extension. str
RETURNS A (default, method, getter, setter) tuple of the removed extension. Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]

Doc.char_span

Create a Span object from the slice doc.text[start_idx:end_idx]. Returns None if the character indices don't map to a valid span using the default alignment mode `"strict".

Example

doc = nlp("I like New York")
span = doc.char_span(7, 15, label="GPE")
assert span.text == "New York"
Name Description
start The index of the first character of the span. int
end The index of the last character after the span. ~int~~
label A label to attach to the span, e.g. for named entities. Union[int, str]
kb_id 2.2 An ID from a knowledge base to capture the meaning of a named entity. Union[int, str]
vector A meaning representation of the span. numpy.ndarray[ndim=1, dtype=float32]
alignment_mode How character indices snap to token boundaries. Options: "strict" (no snapping), "contract" (span of all tokens completely within the character span), "expand" (span of all tokens at least partially covered by the character span). Defaults to "strict". str
RETURNS The newly constructed object or None. Optional[Span]

Doc.set_ents

Set the named entities in the document.

Example

from spacy.tokens import Span
doc = nlp("Mr. Best flew to New York on Saturday morning.")
doc.set_ents([Span(doc, 0, 2, "PERSON")])
ents = list(doc.ents)
assert ents[0].label_ == "PERSON"
assert ents[0].text == "Mr. Best"
Name Description
entities Spans with labels to set as entities. List[Span]
keyword-only
blocked Spans to set as "blocked" (never an entity) for spacy's built-in NER component. Other components may ignore this setting. Optional[List[Span]]
missing Spans with missing/unknown entity information. Optional[List[Span]]
outside Spans outside of entities (O in IOB). Optional[List[Span]]
default How to set entity annotation for tokens outside of any provided spans. Options: "blocked", "missing", "outside" and "unmodified" (preserve current state). Defaults to "outside". str

Doc.similarity

Make a semantic similarity estimate. The default estimate is cosine similarity using an average of word vectors.

Example

apples = nlp("I like apples")
oranges = nlp("I like oranges")
apples_oranges = apples.similarity(oranges)
oranges_apples = oranges.similarity(apples)
assert apples_oranges == oranges_apples
Name Description
other The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects. Union[Doc, Span, Token, Lexeme]
RETURNS A scalar similarity score. Higher is more similar. float

Doc.count_by

Count the frequencies of a given attribute. Produces a dict of {attr (int): count (ints)} frequencies, keyed by the values of the given attribute ID.

Example

from spacy.attrs import ORTH
doc = nlp("apple apple orange banana")
assert doc.count_by(ORTH) == {7024: 1, 119552: 1, 2087: 2}
doc.to_array([ORTH])
# array([[11880], [11880], [7561], [12800]])
Name Description
attr_id The attribute ID. int
RETURNS A dictionary mapping attributes to integer counts. Dict[int, int]

Doc.get_lca_matrix

Calculates the lowest common ancestor matrix for a given Doc. Returns LCA matrix containing the integer index of the ancestor, or -1 if no common ancestor is found, e.g. if span excludes a necessary ancestor.

Example

doc = nlp("This is a test")
matrix = doc.get_lca_matrix()
# array([[0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 2, 3], [1, 1, 3, 3]], dtype=int32)
Name Description
RETURNS The lowest common ancestor matrix of the Doc. numpy.ndarray[ndim=2, dtype=int32]

Doc.has_annotation

Check whether the doc contains annotation on a token attribute.

This method replaces the previous boolean attributes like Doc.is_tagged, Doc.is_parsed or Doc.is_sentenced.

doc = nlp("This is a text")
- assert doc.is_parsed
+ assert doc.has_annotation("DEP")
Name Description
attr The attribute string name or int ID. Union[int, str]
keyword-only
require_complete Whether to check that the attribute is set on every token in the doc. Defaults to False. bool
RETURNS Whether specified annotation is present in the doc. bool

Doc.to_array

Export given token attributes to a numpy ndarray. If attr_ids is a sequence of M attributes, the output array will be of shape (N, M), where N is the length of the Doc (in tokens). If attr_ids is a single attribute, the output shape will be (N,). You can specify attributes by integer ID (e.g. spacy.attrs.LEMMA) or string name (e.g. "LEMMA" or "lemma"). The values will be 64-bit integers.

Returns a 2D array with one row per token and one column per attribute (when attr_ids is a list), or as a 1D numpy array, with one item per attribute (when attr_ids is a single value).

Example

from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
doc = nlp(text)
# All strings mapped to integers, for easy export to numpy
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
np_array = doc.to_array("POS")
Name Description
attr_ids A list of attributes (int IDs or string names) or a single attribute (int ID or string name). Union[int, str, List[Union[int, str]]]
RETURNS The exported attributes as a numpy array. Union[numpy.ndarray[ndim=2, dtype=uint64], numpy.ndarray[ndim=1, dtype=uint64]]

Doc.from_array

Load attributes from a numpy array. Write to a Doc object, from an (M, N) array of attributes.

Example

from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc
doc = nlp("Hello world!")
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
doc2 = Doc(doc.vocab, words=[t.text for t in doc])
doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)
assert doc[0].pos_ == doc2[0].pos_
Name Description
attrs A list of attribute ID ints. List[int]
array The attribute values to load. numpy.ndarray[ndim=2, dtype=int32]
exclude String names of serialization fields to exclude. Iterable[str]
RETURNS The Doc itself. Doc

Doc.from_docs

Concatenate multiple Doc objects to form a new one. Raises an error if the Doc objects do not all share the same Vocab.

Example

from spacy.tokens import Doc
texts = ["London is the capital of the United Kingdom.",
         "The River Thames flows through London.",
         "The famous Tower Bridge crosses the River Thames."]
docs = list(nlp.pipe(texts))
c_doc = Doc.from_docs(docs)
assert str(c_doc) == " ".join(texts)
assert len(list(c_doc.sents)) == len(docs)
assert [str(ent) for ent in c_doc.ents] == \
       [str(ent) for doc in docs for ent in doc.ents]
Name Description
docs A list of Doc objects. List[Doc]
ensure_whitespace Insert a space between two adjacent docs whenever the first doc does not end in whitespace. bool
attrs Optional list of attribute ID ints or attribute name strings. Optional[List[Union[str, int]]]
RETURNS The new Doc object that is containing the other docs or None, if docs is empty or None. Optional[Doc]

Doc.to_disk

Save the current state to a directory.

Example

doc.to_disk("/path/to/doc")
Name Description
path A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. Union[str, Path]
keyword-only
exclude String names of serialization fields to exclude. Iterable[str]

Doc.from_disk

Loads state from a directory. Modifies the object in place and returns it.

Example

from spacy.tokens import Doc
from spacy.vocab import Vocab
doc = Doc(Vocab()).from_disk("/path/to/doc")
Name Description
path A path to a directory. Paths may be either strings or Path-like objects. Union[str, Path]
keyword-only
exclude String names of serialization fields to exclude. Iterable[str]
RETURNS The modified Doc object. Doc

Doc.to_bytes

Serialize, i.e. export the document contents to a binary string.

Example

doc = nlp("Give it back! He pleaded.")
doc_bytes = doc.to_bytes()
Name Description
keyword-only
exclude String names of serialization fields to exclude. Iterable[str]
RETURNS A losslessly serialized copy of the Doc, including all annotations. bytes

Doc.from_bytes

Deserialize, i.e. import the document contents from a binary string.

Example

from spacy.tokens import Doc
doc = nlp("Give it back! He pleaded.")
doc_bytes = doc.to_bytes()
doc2 = Doc(doc.vocab).from_bytes(doc_bytes)
assert doc.text == doc2.text
Name Description
data The string to load from. bytes
keyword-only
exclude String names of serialization fields to exclude. Iterable[str]
RETURNS The Doc object. Doc

Doc.retokenize

Context manager to handle retokenization of the Doc. Modifications to the Doc's tokenization are stored, and then made all at once when the context manager exits. This is much more efficient, and less error-prone. All views of the Doc (Span and Token) created before the retokenization are invalidated, although they may accidentally continue to work.

Example

doc = nlp("Hello world!")
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])
Name Description
RETURNS The retokenizer. Retokenizer

Retokenizer.merge

Mark a span for merging. The attrs will be applied to the resulting token (if they're context-dependent token attributes like LEMMA or DEP) or to the underlying lexeme (if they're context-independent lexical attributes like LOWER or IS_STOP). Writable custom extension attributes can be provided using the "_" key and specifying a dictionary that maps attribute names to values.

Example

doc = nlp("I like David Bowie")
with doc.retokenize() as retokenizer:
    attrs = {"LEMMA": "David Bowie"}
    retokenizer.merge(doc[2:4], attrs=attrs)
Name Description
span The span to merge. Span
attrs Attributes to set on the merged token. Dict[Union[str, int], Any]

Retokenizer.split

Mark a token for splitting, into the specified orths. The heads are required to specify how the new subtokens should be integrated into the dependency tree. The list of per-token heads can either be a token in the original document, e.g. doc[2], or a tuple consisting of the token in the original document and its subtoken index. For example, (doc[3], 1) will attach the subtoken to the second subtoken of doc[3].

This mechanism allows attaching subtokens to other newly created subtokens, without having to keep track of the changing token indices. If the specified head token will be split within the retokenizer block and no subtoken index is specified, it will default to 0. Attributes to set on subtokens can be provided as a list of values. They'll be applied to the resulting token (if they're context-dependent token attributes like LEMMA or DEP) or to the underlying lexeme (if they're context-independent lexical attributes like LOWER or IS_STOP).

Example

doc = nlp("I live in NewYork")
with doc.retokenize() as retokenizer:
    heads = [(doc[3], 1), doc[2]]
    attrs = {"POS": ["PROPN", "PROPN"],
             "DEP": ["pobj", "compound"]}
    retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
Name Description
token The token to split. Token
orths The verbatim text of the split tokens. Needs to match the text of the original token. List[str]
heads List of token or (token, subtoken) tuples specifying the tokens to attach the newly split subtokens to. List[Union[Token, Tuple[Token, int]]]
attrs Attributes to set on all split tokens. Attribute names mapped to list of per-token attribute values. Dict[Union[str, int], List[Any]]

Doc.ents

The named entities in the document. Returns a tuple of named entity Span objects, if the entity recognizer has been applied.

Example

doc = nlp("Mr. Best flew to New York on Saturday morning.")
ents = list(doc.ents)
assert ents[0].label_ == "PERSON"
assert ents[0].text == "Mr. Best"
Name Description
RETURNS Entities in the document, one Span per entity. Tuple[Span, ...]

Doc.noun_chunks

Iterate over the base noun phrases in the document. Yields base noun-phrase Span objects, if the document has been syntactically parsed. A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it so no NP-level coordination, no prepositional phrases, and no relative clauses.

Example

doc = nlp("A phrase with another phrase occurs.")
chunks = list(doc.noun_chunks)
assert chunks[0].text == "A phrase"
assert chunks[1].text == "another phrase"
Name Description
YIELDS Noun chunks in the document. Span

Doc.sents

Iterate over the sentences in the document. Sentence spans have no label. To improve accuracy on informal texts, spaCy calculates sentence boundaries from the syntactic dependency parse. If the parser is disabled, the sents iterator will be unavailable.

Example

doc = nlp("This is a sentence. Here's another...")
sents = list(doc.sents)
assert len(sents) == 2
assert [s.root.text for s in sents] == ["is", "'s"]
Name Description
YIELDS Sentences in the document. Span

Doc.has_vector

A boolean value indicating whether a word vector is associated with the object.

Example

doc = nlp("I like apples")
assert doc.has_vector
Name Description
RETURNS Whether the document has a vector data attached. bool

Doc.vector

A real-valued meaning representation. Defaults to an average of the token vectors.

Example

doc = nlp("I like apples")
assert doc.vector.dtype == "float32"
assert doc.vector.shape == (300,)
Name Description
RETURNS A 1-dimensional array representing the document's vector. numpy.ndarray[ndim=1, dtype=float32]

Doc.vector_norm

The L2 norm of the document's vector representation.

Example

doc1 = nlp("I like apples")
doc2 = nlp("I like oranges")
doc1.vector_norm  # 4.54232424414368
doc2.vector_norm  # 3.304373298575751
assert doc1.vector_norm != doc2.vector_norm
Name Description
RETURNS The L2 norm of the vector representation. float

Attributes

Name Description
text A string representation of the document text. str
text_with_ws An alias of Doc.text, provided for duck-type compatibility with Span and Token. str
mem The document's local memory heap, for all C data it owns. cymem.Pool
vocab The store of lexical types. Vocab
tensor 2 Container for dense vector representations. numpy.ndarray
cats 2 Maps a label to a score for categories applied to the document. The label is a string and the score should be a float. Dict[str, float]
user_data A generic storage area, for user custom data. Dict[str, Any]
lang 2.1 Language of the document's vocabulary. int
lang_ 2.1 Language of the document's vocabulary. str
sentiment The document's positivity/negativity score, if available. float
user_hooks A dictionary that allows customization of the Doc's properties. Dict[str, Callable]
user_token_hooks A dictionary that allows customization of properties of Token children. Dict[str, Callable]
user_span_hooks A dictionary that allows customization of properties of Span children. Dict[str, Callable]
_ User space for adding custom attribute extensions. Underscore

Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.

Example

data = doc.to_bytes(exclude=["text", "tensor"])
doc.from_disk("./doc.bin", exclude=["user_data"])
Name Description
text The value of the Doc.text attribute.
sentiment The value of the Doc.sentiment attribute.
tensor The value of the Doc.tensor attribute.
user_data The value of the Doc.user_data dictionary.
user_data_keys The keys of the Doc.user_data dictionary.
user_data_values The values of the Doc.user_data dictionary.