spaCy/goldparse.md at 23ec07debdd568f09c7c83b10564850f9fa67ad4


title	teaser	tag	source
GoldParse	A collection for training annotations	class	spacy/gold.pyx

GoldParse.__init__

Create a GoldParse. Unlike annotations in entities, label annotations in cats can overlap, i.e. a single word can be covered by multiple labelled spans. The TextCategorizer component expects true examples of a label to have the value 1.0, and negative examples of a label to have the value 0.0. Labels not in the dictionary are treated as missing – the gradient for those labels will be zero.

Name	Type	Description
`doc`	`Doc`	The document the annotations refer to.
`words`	iterable	A sequence of unicode word strings.
`tags`	iterable	A sequence of strings, representing tag annotations.
`heads`	iterable	A sequence of integers, representing syntactic head offsets.
`deps`	iterable	A sequence of strings, representing the syntactic relation types.
`entities`	iterable	A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None.
`cats`	dict	Labels for text classification. Each key in the dictionary may be a string or an int, or a `(start_char, end_char, label)` tuple, indicating that the label is applied to only part of the document (usually a sentence).
RETURNS	`GoldParse`	The newly constructed object.
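
The treatment of missing labels described above can be illustrated in plain Python. This is a minimal sketch, not spaCy's implementation: the function name cats_gradient, the label names, the scores and the "score minus truth" gradient are all assumptions made for illustration. The only behaviour taken from the text is that a label absent from the cats dictionary receives a zero gradient.

```python
def cats_gradient(scores, cats):
    """Return a per-label gradient, treating labels absent from `cats` as missing.

    Assumption: the gradient is `score - truth`; the real loss may differ.
    """
    gradient = {}
    for label, score in scores.items():
        if label in cats:
            gradient[label] = score - float(cats[label])
        else:
            # Missing annotation: no learning signal for this label.
            gradient[label] = 0.0
    return gradient


# Hypothetical model scores and gold annotations for one document.
scores = {"POSITIVE": 0.75, "NEGATIVE": 0.5, "SPAM": 0.25}
cats = {"POSITIVE": 1.0, "NEGATIVE": 0.0}  # "SPAM" is not annotated
gradient = cats_gradient(scores, cats)
```

Here only POSITIVE and NEGATIVE contribute to the update; SPAM is left alone because nothing is known about it.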

GoldParse.__len__

Get the number of gold-standard tokens.

Name	Type	Description
RETURNS	int	The number of gold-standard tokens.

GoldParse.is_projective

Whether the provided syntactic annotations form a projective dependency tree.

Name	Type	Description
RETURNS	bool	Whether annotations form projective tree.
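
For readers unfamiliar with projectivity, the following sketch shows one common way to test it. It is a standalone illustration, not the code behind GoldParse.is_projective. It assumes heads holds absolute token indices and that the root token points to itself. Both conventions are assumptions of this sketch. A tree is treated as projective when every token lying between a head and its dependent is a descendant of that head.

```python
def is_projective(heads):
    """Check projectivity for a list of absolute head indices (root heads itself)."""

    def is_descendant(token, ancestor):
        # Walk up the tree from `token` until we hit `ancestor` or the root.
        seen = set()
        while token not in seen:
            if token == ancestor:
                return True
            seen.add(token)
            token = heads[token]
        return False

    for dependent, head in enumerate(heads):
        if head == dependent:
            continue  # the root has no incoming arc to check
        left, right = sorted((dependent, head))
        for between in range(left + 1, right):
            if not is_descendant(between, head):
                return False
    return True


# "I like London": both other tokens attach to "like".
projective = is_projective([1, 1, 1])
# "A hearing is scheduled on the issue today": "on" attaches to "hearing",
# and that arc spans "is" and "scheduled", which "hearing" does not dominate.
non_projective = is_projective([1, 3, 3, 3, 1, 6, 4, 3])
```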

Attributes

Name	Type	Description
`words`	list	The words.
`tags`	list	The part-of-speech tag annotations.
`heads`	list	The syntactic head annotations.
`labels`	list	The syntactic relation-type annotations.
`ner`	list	The named entity annotations as BILUO tags.
`cand_to_gold`	list	The alignment from candidate tokenization to gold tokenization.
`gold_to_cand`	list	The alignment from gold tokenization to candidate tokenization.
`cats` (new in v2)	list	Entries in the list should be either a label, or a `(start, end, label)` triple. The tuple form is used for categories applied to spans of the document.

Utilities

gold.docs_to_json

Convert a list of Doc objects into the JSON-serializable format used by the spacy train command.

Example

from spacy.gold import docs_to_json

doc = nlp(u"I like London")
json_data = docs_to_json([doc])

Name	Type	Description
`docs`	iterable / `Doc`	The `Doc` object(s) to convert.
`id`	int	ID to assign to the JSON. Defaults to `0`.
RETURNS	list	The data in spaCy's JSON format.

gold.align

Calculate alignment tables between two tokenizations, using the Levenshtein algorithm. The alignment is case-insensitive.

The current implementation of the alignment algorithm assumes that both tokenizations add up to the same string. For example, you'll be able to align ["I", "'", "m"] and ["I", "'m"], which both add up to "I'm", but not ["I", "'m"] and ["I", "am"].

Example

from spacy.gold import align

bert_tokens = ["obama", "'", "s", "podcast"]
spacy_tokens = ["obama", "'s", "podcast"]
alignment = align(bert_tokens, spacy_tokens)
cost, a2b, b2a, a2b_multi, b2a_multi = alignment

Name	Type	Description
`tokens_a`	list	String values of candidate tokens to align.
`tokens_b`	list	String values of reference tokens to align.
RETURNS	tuple	A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment.

The returned tuple contains the following alignment information:

Example

a2b = array([0, -1, -1, 2])
b2a = array([0, 2, 3])
a2b_multi = {1: 1, 2: 1}
b2a_multi = {}

If a2b[3] == 2, that means that tokens_a[3] aligns to tokens_b[2]. If there's no one-to-one alignment for a token, it has the value -1.

Name	Type	Description
`cost`	int	The number of misaligned tokens.
`a2b`	`numpy.ndarray[ndim=1, dtype='int32']`	One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`.
`b2a`	`numpy.ndarray[ndim=1, dtype='int32']`	One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`.
`a2b_multi`	dict	A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`.
`b2a_multi`	dict	A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`.
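
To make the shape of the returned tuple concrete, here is a simplified sketch that builds the same five values from character offsets. It is not the Levenshtein-based algorithm that gold.align uses, and its output may differ from the library's in some positions. The names char_spans and align_by_offsets are hypothetical, and counting every -1 entry as one unit of cost is an assumption. The sketch relies on the requirement stated above that both tokenizations add up to the same string.

```python
def char_spans(tokens):
    """Return (start, end) character offsets for tokens joined without spaces."""
    spans, start = [], 0
    for token in tokens:
        spans.append((start, start + len(token)))
        start += len(token)
    return spans


def align_by_offsets(tokens_a, tokens_b):
    if "".join(tokens_a).lower() != "".join(tokens_b).lower():
        raise ValueError("Both tokenizations must add up to the same string.")
    spans_a, spans_b = char_spans(tokens_a), char_spans(tokens_b)

    def one_to_one(source, target):
        # A token aligns one-to-one when the other side has an identical span.
        index = {span: j for j, span in enumerate(target)}
        return [index.get(span, -1) for span in source]

    def many_to_one(source, target, direct):
        # Unaligned tokens that sit inside a single token of the other side.
        multi = {}
        for i, (start, end) in enumerate(source):
            if direct[i] != -1:
                continue
            for j, (t_start, t_end) in enumerate(target):
                if t_start <= start and end <= t_end:
                    multi[i] = j
                    break
        return multi

    a2b = one_to_one(spans_a, spans_b)
    b2a = one_to_one(spans_b, spans_a)
    a2b_multi = many_to_one(spans_a, spans_b, a2b)
    b2a_multi = many_to_one(spans_b, spans_a, b2a)
    cost = a2b.count(-1) + b2a.count(-1)  # assumption: one unit per unaligned token
    return cost, a2b, b2a, a2b_multi, b2a_multi


bert_tokens = ["obama", "'", "s", "podcast"]
spacy_tokens = ["obama", "'s", "podcast"]
cost, a2b, b2a, a2b_multi, b2a_multi = align_by_offsets(bert_tokens, spacy_tokens)
```

In this sketch the two tokens "'" and "s" have no one-to-one partner, so they appear as -1 in a2b and as entries in a2b_multi pointing at "'s".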

gold.biluo_tags_from_offsets

Encode labelled spans into per-token tags, using the BILUO scheme (Begin, In, Last, Unit, Out). Returns a list of unicode strings describing the tags. Each tag string is one of "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". The string "-" is used where the entity offsets don't align with the tokenization in the Doc object. The training algorithm treats these as missing values.

O denotes a non-entity token. B denotes the beginning of a multi-token entity, I the inside of an entity of three or more tokens, and L the end of an entity of two or more tokens. U denotes a single-token entity.

Example

from spacy.gold import biluo_tags_from_offsets

doc = nlp(u"I like London.")
entities = [(7, 13, "LOC")]
tags = biluo_tags_from_offsets(doc, entities)
assert tags == ["O", "O", "U-LOC", "O"]

Name	Type	Description
`doc`	`Doc`	The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document.
`entities`	iterable	A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string.
RETURNS	list	Unicode strings, describing the BILUO tags.
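
The encoding can be sketched without a Doc by working on token character spans directly. This is an illustration under stated assumptions, not the library's code: the function name biluo_from_spans is hypothetical, and the token spans are written out by hand instead of coming from a tokenizer. Entities whose boundaries do not fall on token boundaries leave the overlapping tokens tagged "-".

```python
def biluo_from_spans(token_spans, entities):
    """Encode (start_char, end_char, label) entities as per-token BILUO tags."""
    start_to_token = {start: i for i, (start, _) in enumerate(token_spans)}
    end_to_token = {end: i for i, (_, end) in enumerate(token_spans)}
    tags = ["-"] * len(token_spans)
    entity_chars = set()
    for start, end, label in entities:
        entity_chars.update(range(start, end))
        first = start_to_token.get(start)
        last = end_to_token.get(end)
        if first is None or last is None:
            continue  # misaligned entity: overlapping tokens stay "-"
        if first == last:
            tags[first] = "U-" + label
        else:
            tags[first] = "B-" + label
            for i in range(first + 1, last):
                tags[i] = "I-" + label
            tags[last] = "L-" + label
    # Tokens that no entity touches are ordinary non-entity tokens.
    for i, (start, end) in enumerate(token_spans):
        if tags[i] == "-" and not entity_chars.intersection(range(start, end)):
            tags[i] = "O"
    return tags


# Character spans for "I like London." split as I / like / London / .
token_spans = [(0, 1), (2, 6), (7, 13), (13, 14)]
aligned = biluo_from_spans(token_spans, [(7, 13, "LOC")])
misaligned = biluo_from_spans(token_spans, [(7, 10, "LOC")])
```

The second call uses an entity that ends in the middle of "London", so that token keeps the "-" tag instead of being marked as a non-entity.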

gold.offsets_from_biluo_tags

Encode per-token tags following the BILUO scheme into entity offsets.

Example

from spacy.gold import offsets_from_biluo_tags

doc = nlp(u"I like London.")
tags = ["O", "O", "U-LOC", "O"]
entities = offsets_from_biluo_tags(doc, tags)
assert entities == [(7, 13, "LOC")]

Name	Type	Description
`doc`	`Doc`	The document that the BILUO tags refer to.
`tags`	iterable	A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`.
RETURNS	list	A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string.
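
The reverse direction can be sketched in the same way. Again this is a hedged illustration and not the library's implementation: offsets_from_tags is a hypothetical name, the token spans are supplied by hand, and treating None, "", "O" and "-" uniformly as "no entity here" is an assumption of this sketch.

```python
def offsets_from_tags(token_spans, tags):
    """Decode per-token BILUO tags into (start_char, end_char, label) triples."""
    entities = []
    open_start = None  # start offset of a multi-token entity in progress
    for (start, end), tag in zip(token_spans, tags):
        if tag in (None, "", "O", "-"):
            open_start = None
            continue
        action, label = tag.split("-", 1)
        if action == "U":
            entities.append((start, end, label))
            open_start = None
        elif action == "B":
            open_start = start
        elif action == "L" and open_start is not None:
            entities.append((open_start, end, label))
            open_start = None
        # "I" keeps the current entity open.
    return entities


# "I like London." split as I / like / London / .
single = offsets_from_tags(
    [(0, 1), (2, 6), (7, 13), (13, 14)], ["O", "O", "U-LOC", "O"]
)
# "I like New York" split as I / like / New / York
multi = offsets_from_tags(
    [(0, 1), (2, 6), (7, 10), (11, 15)], ["O", "O", "B-GPE", "L-GPE"]
)
```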

gold.spans_from_biluo_tags

Encode per-token tags following the BILUO scheme into Span objects. This can be used to create entity spans from token-based tags, e.g. to overwrite the doc.ents.

Example

from spacy.gold import spans_from_biluo_tags

doc = nlp(u"I like London.")
tags = ["O", "O", "U-LOC", "O"]
doc.ents = spans_from_biluo_tags(doc, tags)

Name	Type	Description
`doc`	`Doc`	The document that the BILUO tags refer to.
`tags`	iterable	A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`.
RETURNS	list	A sequence of `Span` objects with added entity labels.
