spaCy/example.md at 61dfdd9fbd962d389c7f46b3595fc48f8cf4590f

14 KiB

Raw Blame History

title	teaser	tag	source	new
Example	A training instance	class	spacy/gold/example.pyx	3.0

An Example holds the information for one training instance. It stores two Doc objects: one for holding the gold-standard reference data, and one for holding the predictions of the pipeline. An Alignment object stores the alignment between these two documents, as they can differ in tokenization.

Example.init

Construct an Example object from the predicted document and the reference document. If alignment is None, it will be initialized from the words in both documents.

Example

from spacy.tokens import Doc
from spacy.gold import Example

words = ["hello", "world", "!"]
spaces = [True, False, False]
predicted = Doc(nlp.vocab, words=words, spaces=spaces)
reference = parse_gold_doc(my_data)
example = Example(predicted, reference)

Name	Type	Description
`predicted`	`Doc`	The document containing (partial) predictions. Can not be `None`.
`reference`	`Doc`	The document containing gold-standard annotations. Can not be `None`.
keyword-only
`alignment`	`Alignment`	An object holding the alignment between the tokens of the `predicted` and `reference` documents.

Example.from_dict

Construct an Example object from the predicted document and the reference annotations provided as a dictionary. For more details on the required format, see the training format documentation.

Example

from spacy.tokens import Doc
from spacy.gold import Example

predicted = Doc(vocab, words=["Apply", "some", "sunscreen"])
token_ref = ["Apply", "some", "sun", "screen"]
tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})

Name	Type	Description
`predicted`	`Doc`	The document containing (partial) predictions. Can not be `None`.
`example_dict`	`Dict[str, obj]`	The gold-standard annotations as a dictionary. Can not be `None`.
RETURNS	`Example`	The newly constructed object.

Example.text

The text of the predicted document in this Example.

Example

raw_text = example.text

Name	Type	Description
RETURNS	str	The text of the `predicted` document.

Example.predicted

Example

docs = [eg.predicted for eg in examples]
predictions, _ = model.begin_update(docs)
set_annotations(docs, predictions)

The Doc holding the predictions. Occassionally also refered to as example.x.

Name	Type	Description
RETURNS	`Doc`	The document containing (partial) predictions.

Example.reference

Example

for i, eg in enumerate(examples):
    for j, label in enumerate(all_labels):
        gold_labels[i][j] = eg.reference.cats.get(label, 0.0)

The Doc holding the gold-standard annotations. Occassionally also refered to as example.y.

Name	Type	Description
RETURNS	`Doc`	The document containing gold-standard annotations.

Example.alignment

Example

tokens_x = ["Apply", "some", "sunscreen"]
x = Doc(vocab, words=tokens_x)
tokens_y = ["Apply", "some", "sun", "screen"]
example = Example.from_dict(x, {"words": tokens_y})
alignment = example.alignment
assert list(alignment.y2x.data) == [[0], [1], [2], [2]]

The Alignment object mapping the tokens of the predicted document to those of the reference document.

Name	Type	Description
RETURNS	`Alignment`	The document containing gold-standard annotations.

Example.get_aligned

Example

predicted = Doc(vocab, words=["Apply", "some", "sunscreen"])
token_ref = ["Apply", "some", "sun", "screen"]
tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
assert example.get_aligned("TAG", as_string=True) == ["VERB", "DET", "NOUN"]

Get the aligned view of a certain token attribute, denoted by its int ID or string name.

Name	Type	Description	Default
`field`	int or str	Attribute ID or string name
`as_string`	bool	Whether or not to return the list of values as strings.	`False`
RETURNS	`List[int]` or `List[str]`	List of integer values, or string values if `as_string` is `True`.

Example.get_aligned_parse

Example

doc = nlp("He pretty quickly walks away")
example = Example.from_dict(doc, {"heads": [3, 2, 3, 0, 2]})
proj_heads, proj_labels = example.get_aligned_parse(projectivize=True)
assert proj_heads == [3, 2, 3, 0, 3]

Get the aligned view of the dependency parse. If projectivize is set to True, non-projective dependency trees are made projective through the Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005).

Name	Type	Description	Default
`projectivize`	bool	Whether or not to projectivize the dependency trees	`True`
RETURNS	`List[int]` or `List[str]`	List of integer values, or string values if `as_string` is `True`.

Example.get_aligned_ner

Example

words = ["Mrs", "Smith", "flew", "to", "New York"]
doc = Doc(en_vocab, words=words)
entities = [(0, 9, "PERSON"), (18, 26, "LOC")]
gold_words = ["Mrs Smith", "flew", "to", "New", "York"]
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
ner_tags = example.get_aligned_ner()
assert ner_tags == ["B-PERSON", "L-PERSON", "O", "O", "U-LOC"]

Get the aligned view of the NER BILUO tags.

Name	Type	Description
RETURNS	`List[str]`	List of BILUO values, denoting whether tokens are part of an NER annotation or not.

Example.get_aligned_spans_y2x

Example

words = ["Mr and Mrs Smith", "flew", "to", "New York"]
doc = Doc(en_vocab, words=words)
entities = [(0, 16, "PERSON")]
tokens_ref = ["Mr", "and", "Mrs", "Smith", "flew", "to", "New", "York"]
example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
ents_ref = example.reference.ents
assert [(ent.start, ent.end) for ent in ents_ref] == [(0, 4)]
ents_y2x = example.get_aligned_spans_y2x(ents_ref)
assert [(ent.start, ent.end) for ent in ents_y2x] == [(0, 1)]

Get the aligned view of any set of Span objects defined over example.reference. The resulting span indices will align to the tokenization in example.predicted.

Name	Type	Description
`y_spans`	`Iterable[Span]`	`Span` objects aligned to the tokenization of `self.reference`.
RETURNS	`Iterable[Span]`	`Span` objects aligned to the tokenization of `self.predicted`.

Example.get_aligned_spans_x2y

Example

nlp.add_pipe("my_ner")
doc = nlp("Mr and Mrs Smith flew to New York")
tokens_ref = ["Mr and Mrs", "Smith", "flew", "to", "New York"]
example = Example.from_dict(doc, {"words": tokens_ref})
ents_pred = example.predicted.ents
# Assume the NER model has found "Mr and Mrs Smith" as a named entity
assert [(ent.start, ent.end) for ent in ents_pred] == [(0, 4)]
ents_x2y = example.get_aligned_spans_x2y(ents_pred)
assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]

Get the aligned view of any set of Span objects defined over example.predicted. The resulting span indices will align to the tokenization in example.reference. This method is particularly useful to assess the accuracy of predicted entities against the original gold-standard annotation.

Name	Type	Description
`x_spans`	`Iterable[Span]`	`Span` objects aligned to the tokenization of `self.predicted`.
RETURNS	`Iterable[Span]`	`Span` objects aligned to the tokenization of `self.reference`.

Example.to_dict

Return a dictionary representation of the reference annotation contained in this Example.

Example

eg_dict = example.to_dict()

Name	Type	Description
RETURNS	`Dict[str, Any]`	Dictionary representation of the reference annotation.

Example.split_sents

Example

doc = nlp("I went yesterday had lots of fun")
tokens_ref = ["I", "went", "yesterday", "had", "lots", "of", "fun"]
sents_ref = [True, False, False, True, False, False, False]
example = Example.from_dict(doc, {"words": tokens_ref, "sent_starts": sents_ref})
split_examples = example.split_sents()
assert split_examples[0].text == "I went yesterday "
assert split_examples[1].text == "had lots of fun"

Split one Example into multiple Example objects, one for each sentence.

Name	Type	Description
RETURNS	`List[Example]`	List of `Example` objects, one for each original sentence.

Alignment

Calculate alignment tables between two tokenizations.

Alignment attributes {#alignment-attributes"}

Name	Type	Description
`x2y`	`Ragged`	The `Ragged` object holding the alignment from `x` to `y`.
`y2x`	`Ragged`	The `Ragged` object holding the alignment from `y` to `x`.

The current implementation of the alignment algorithm assumes that both tokenizations add up to the same string. For example, you'll be able to align ["I", "'", "m"] and ["I", "'m"], which both add up to "I'm", but not ["I", "'m"] and ["I", "am"].

Example

from spacy.gold import Alignment

bert_tokens = ["obama", "'", "s", "podcast"]
spacy_tokens = ["obama", "'s", "podcast"]
alignment = Alignment.from_strings(bert_tokens, spacy_tokens)
a2b = alignment.x2y
assert list(a2b.dataXd) == [0, 1, 1, 2]

If a2b.dataXd[1] == a2b.dataXd[2] == 1, that means that A[1] ("'") and A[2] ("s") both align to B[1] ("'s").

Alignment.from_strings

Name	Type	Description
`A`	list	String values of candidate tokens to align.
`B`	list	String values of reference tokens to align.
RETURNS	`Alignment`	An `Alignment` object describing the alignment.

14 KiB Raw Blame History

Example.__init__

Example

Example.from_dict

Example

Example.text

Example

Example.predicted

Example

Example.reference

Example

Example.alignment

Example

Example.get_aligned

Example

Example.get_aligned_parse

Example

Example.get_aligned_ner

Example

Example.get_aligned_spans_y2x

Example

Example.get_aligned_spans_x2y

Example

Example.to_dict

Example

Example.split_sents

Example

Alignment

Alignment attributes {#alignment-attributes"}

Example

Alignment.from_strings

14 KiB

Raw Blame History

Example.init