spaCy/website/docs/api/goldparse.md

| title | teaser | tag | source |
| --- | --- | --- | --- |
| GoldParse | A collection for training annotations | class | spacy/gold.pyx |

GoldParse.__init__

Create a GoldParse.

| Name | Type | Description |
| --- | --- | --- |
| `doc` | `Doc` | The document the annotations refer to. |
| `words` | iterable | A sequence of unicode word strings. |
| `tags` | iterable | A sequence of strings, representing tag annotations. |
| `heads` | iterable | A sequence of integers, representing syntactic head offsets. |
| `deps` | iterable | A sequence of strings, representing the syntactic relation types. |
| `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. |
| RETURNS | `GoldParse` | The newly constructed object. |

GoldParse.__len__

Get the number of gold-standard tokens.

| Name | Type | Description |
| --- | --- | --- |
| RETURNS | int | The number of gold-standard tokens. |

GoldParse.is_projective

Whether the provided syntactic annotations form a projective dependency tree.

| Name | Type | Description |
| --- | --- | --- |
| RETURNS | bool | Whether the annotations form a projective tree. |
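As a rough illustration of what projectivity means here, the sketch below checks that no two dependency arcs cross, which is one common working definition. It is not spaCy's implementation: the function name `has_no_crossing_arcs` is hypothetical, and it assumes that heads are given as absolute token indices with the root pointing to itself.

```python
def has_no_crossing_arcs(heads):
    """Return True if no two dependency arcs cross.

    Assumption: heads[i] is the absolute index of token i's head,
    and a root token points to itself.
    """
    # Store each arc as (left, right), skipping the root's self-loop.
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if i != h]
    for left1, right1 in arcs:
        for left2, right2 in arcs:
            # Two arcs cross when exactly one endpoint of the second
            # lies strictly inside the first.
            if left1 < left2 < right1 < right2:
                return False
    return True


# "I like London ." with "like" (index 1) as the root.
projective_heads = [1, 1, 1, 1]
# Arcs 0 -> 2 and 1 -> 3 cross each other.
crossing_heads = [2, 3, 2, 2]

print(has_no_crossing_arcs(projective_heads))
print(has_no_crossing_arcs(crossing_heads))
```

The pairwise comparison is quadratic in the number of tokens, which is acceptable for a sentence-sized illustration.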

Attributes

| Name | Type | Description |
| --- | --- | --- |
| `tags` | list | The part-of-speech tag annotations. |
| `heads` | list | The syntactic head annotations. |
| `labels` | list | The syntactic relation-type annotations. |
| `ents` | list | The named entity annotations. |
| `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. |
| `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. |
| `cats` (new in v2) | list | Entries in the list should be either a label, or a `(start, end, label)` triple. The triple form is used for categories applied to spans of the document. |
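The two alignment attributes can be pictured as index lists between two tokenizations of the same text. The sketch below is an illustration under stated assumptions rather than spaCy's alignment code: the helper `align_tokens` is hypothetical, it compares tokens by their character spans with whitespace ignored, and it uses `None` for tokens that have no one-to-one counterpart.

```python
def align_tokens(source_words, target_words):
    """Map each source token to the index of the identical target token.

    Assumption: both word lists spell out the same text once whitespace
    is ignored. Tokens without an exact counterpart map to None.
    """
    def char_spans(words):
        spans, position = [], 0
        for word in words:
            spans.append((position, position + len(word)))
            position += len(word)
        return spans

    target_index = {span: i for i, span in enumerate(char_spans(target_words))}
    return [target_index.get(span) for span in char_spans(source_words)]


candidate = ["I", "like", "London", "."]
gold = ["I", "like", "London."]

cand_to_gold = align_tokens(candidate, gold)
gold_to_cand = align_tokens(gold, candidate)
print(cand_to_gold)
print(gold_to_cand)
```

Here the candidate tokenizer splits off the final period while the gold tokenization does not, so the last two candidate tokens and the last gold token have no exact match.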

Utilities

gold.biluo_tags_from_offsets

Encode labelled spans into per-token tags, using the BILUO scheme (Begin/In/Last/Unit/Out).

Returns a list of unicode strings describing the tags. Each tag string has one of the forms "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". The string "-" is used where the entity offsets don't align with the tokenization in the Doc object; the training algorithm treats these as missing values.

"O" denotes a non-entity token. "B" denotes the beginning of a multi-token entity, "I" the inside of an entity of three or more tokens, and "L" the end of an entity of two or more tokens. "U" denotes a single-token entity.

Example

```python
import spacy
from spacy.gold import biluo_tags_from_offsets

# Assumption: a blank English pipeline; any pipeline with a tokenizer works.
nlp = spacy.blank("en")
doc = nlp(u"I like London.")
entities = [(7, 13, "LOC")]
tags = biluo_tags_from_offsets(doc, entities)
assert tags == ["O", "O", "U-LOC", "O"]
```

| Name | Type | Description |
| --- | --- | --- |
| `doc` | `Doc` | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. |
| `entities` | iterable | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. |
| RETURNS | list | Unicode strings, describing the BILUO tags. |
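To make the encoding concrete without depending on a loaded pipeline, here is a minimal pure-Python sketch of the same idea. It is not spaCy's implementation: `simple_token_spans` is a hypothetical regex stand-in for the Doc tokenization, `encode_biluo` is a hypothetical name, and marking every token that overlaps a misaligned entity with "-" is an assumption about how misalignment is reported.

```python
import re


def simple_token_spans(text):
    """Hypothetical tokenizer: words and single punctuation marks."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def encode_biluo(token_spans, entities):
    """Turn (start_char, end_char, label) triples into per-token BILUO tags."""
    starts = {start: i for i, (start, _) in enumerate(token_spans)}
    ends = {end: i for i, (_, end) in enumerate(token_spans)}
    tags = ["O"] * len(token_spans)
    for start_char, end_char, label in entities:
        first = starts.get(start_char)
        last = ends.get(end_char)
        if first is None or last is None:
            # Offsets do not fall on token boundaries: flag the
            # overlapping tokens as missing values.
            for i, (tok_start, tok_end) in enumerate(token_spans):
                if tok_start < end_char and tok_end > start_char:
                    tags[i] = "-"
            continue
        if first == last:
            tags[first] = "U-" + label
        else:
            tags[first] = "B-" + label
            for i in range(first + 1, last):
                tags[i] = "I-" + label
            tags[last] = "L-" + label
    return tags


single = encode_biluo(simple_token_spans("I like London."), [(7, 13, "LOC")])
multi = encode_biluo(simple_token_spans("I like New York City."), [(7, 20, "GPE")])
misaligned = encode_biluo(simple_token_spans("I like London."), [(7, 12, "LOC")])
print(single)
print(multi)
print(misaligned)
```

The three calls cover a single-token entity, a three-token entity, and an entity whose end offset falls inside a token.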

gold.offsets_from_biluo_tags

Decode per-token tags following the BILUO scheme into entity offsets.

Example

```python
import spacy
from spacy.gold import offsets_from_biluo_tags

# Assumption: a blank English pipeline; any pipeline with a tokenizer works.
nlp = spacy.blank("en")
doc = nlp(u"I like London.")
tags = ["O", "O", "U-LOC", "O"]
entities = offsets_from_biluo_tags(doc, tags)
assert entities == [(7, 13, "LOC")]
```

| Name | Type | Description |
| --- | --- | --- |
| `doc` | `Doc` | The document that the BILUO tags refer to. |
| `tags` | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". |
| RETURNS | list | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string. |
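The reverse direction can be sketched in plain Python as well. This is an illustration rather than spaCy's code: `decode_biluo` is a hypothetical name, token boundaries are passed in as explicit `(start_char, end_char)` pairs instead of being read from a Doc, and skipping "", "O" and "-" tags is an assumption about how missing values are handled.

```python
def decode_biluo(token_spans, tags):
    """Turn per-token BILUO tags into (start_char, end_char, label) triples."""
    entities = []
    entity_start = None
    for (tok_start, tok_end), tag in zip(token_spans, tags):
        if tag in ("", "O", "-"):
            # Not part of an entity, or a missing value.
            entity_start = None
            continue
        action, label = tag.split("-", 1)
        if action == "U":
            entities.append((tok_start, tok_end, label))
        elif action == "B":
            entity_start = tok_start
        elif action == "L" and entity_start is not None:
            entities.append((entity_start, tok_end, label))
            entity_start = None
        # An "I" tag simply continues the open entity.
    return entities


# Character spans for the tokens of "I like London."
london_spans = [(0, 1), (2, 6), (7, 13), (13, 14)]
# Character spans for the tokens of "I like New York City."
nyc_spans = [(0, 1), (2, 6), (7, 10), (11, 15), (16, 20), (20, 21)]

single = decode_biluo(london_spans, ["O", "O", "U-LOC", "O"])
multi = decode_biluo(nyc_spans, ["O", "O", "B-GPE", "I-GPE", "L-GPE", "O"])
print(single)
print(multi)
```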

gold.spans_from_biluo_tags

Convert per-token tags following the BILUO scheme into Span objects. This can be used to create entity spans from token-based tags, for example to overwrite doc.ents.

Example

```python
import spacy
from spacy.gold import spans_from_biluo_tags

# Assumption: a blank English pipeline; any pipeline with a tokenizer works.
nlp = spacy.blank("en")
doc = nlp(u"I like London.")
tags = ["O", "O", "U-LOC", "O"]
doc.ents = spans_from_biluo_tags(doc, tags)
```

| Name | Type | Description |
| --- | --- | --- |
| `doc` | `Doc` | The document that the BILUO tags refer to. |
| `tags` | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". |
| RETURNS | list | A sequence of `Span` objects with added entity labels. |