spaCy/website/docs/api/scorer.md

title: Scorer
teaser: Compute evaluation scores
tag: class
source: spacy/scorer.py

The Scorer computes evaluation scores. It's typically created by Language.evaluate. In addition, the Scorer provides a number of methods for evaluating Token and Doc attributes.

Scorer.__init__

Create a new Scorer.

Example

import spacy
from spacy.scorer import Scorer

# Default scoring pipeline
scorer = Scorer()

# Provided scoring pipeline
nlp = spacy.load("en_core_web_sm")
scorer = Scorer(nlp)
Name Description
nlp The pipeline to use for scoring, where each pipeline component may provide a scoring method. If none is provided, then a default pipeline for the multi-language code xx is constructed containing: senter, tagger, morphologizer, parser, ner, textcat. Language

Scorer.score

Calculate the scores for a list of Example objects using the scoring methods provided by the components in the pipeline.

The returned Dict contains the scores provided by the individual pipeline components. For the scoring methods provided by the Scorer and used by the core pipeline components, the individual score names start with the Token or Doc attribute being scored:

  • token_acc, token_p, token_r, token_f
  • sents_p, sents_r, sents_f
  • tag_acc, pos_acc, morph_acc, morph_per_feat, lemma_acc
  • dep_uas, dep_las, dep_las_per_type
  • ents_p, ents_r, ents_f, ents_per_type
  • textcat_macro_auc, textcat_macro_f

Example

scorer = Scorer()
scores = scorer.score(examples)
Name Description
examples The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
RETURNS A dictionary of scores. Dict[str, Union[float, Dict[str, float]]]
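
Because the returned dictionary mixes plain floats with nested per-type dictionaries, it can be convenient to flatten it before logging. The following is a minimal sketch using only the standard library; `flatten_scores` is a hypothetical helper, not part of spaCy, and the score values are placeholders rather than real evaluation results.

```python
from typing import Any, Dict


def flatten_scores(scores: Dict[str, Any], prefix: str = "") -> Dict[str, float]:
    """Flatten nested score dicts into 'outer/inner' keys, skipping None values."""
    flat: Dict[str, float] = {}
    for name, value in scores.items():
        key = f"{prefix}{name}"
        if isinstance(value, dict):
            # Per-type or per-feature scores: recurse with an extended prefix.
            flat.update(flatten_scores(value, prefix=f"{key}/"))
        elif value is not None:
            flat[key] = value
    return flat


# Placeholder values for illustration only (assumption, not real results).
example_scores = {"tag_acc": 0.5, "ents_f": None, "morph_per_feat": {"Case": 0.25}}
flat = flatten_scores(example_scores)
```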

Scorer.score_tokenization

Scores the tokenization:

  • token_acc: number of correct tokens / number of gold tokens
  • token_p, token_r, token_f: precision, recall and F-score for token character spans

Example

scores = Scorer.score_tokenization(examples)
Name Description
examples The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
RETURNS A dictionary containing the scores token_acc, token_p, token_r, token_f. Dict[str, float]
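
To make the character-span definition of token_p, token_r and token_f concrete, here is a small standalone sketch. It does not call spaCy and is not its internal implementation; `char_spans` and `token_prf` are hypothetical helpers, and whitespace between tokens is ignored for simplicity.

```python
from typing import List, Set, Tuple


def char_spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Return (start, end) character offsets of tokens, ignoring whitespace."""
    spans, offset = set(), 0
    for tok in tokens:
        spans.add((offset, offset + len(tok)))
        offset += len(tok)
    return spans


def token_prf(pred: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and F-score over exactly matching character spans."""
    pred_spans, gold_spans = char_spans(pred), char_spans(gold)
    tp = len(pred_spans & gold_spans)
    p = tp / len(pred_spans) if pred_spans else 0.0
    r = tp / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = ["I", "do", "n't", "know"]
pred = ["I", "don't", "know"]  # the tokenizer failed to split "don't"
p, r, f = token_prf(pred, gold)
```

Here two of the three predicted spans match a gold span, and two of the four gold spans are recovered.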

Scorer.score_token_attr

Scores a single token attribute.

Example

scores = Scorer.score_token_attr(examples, "pos")
print(scores["pos_acc"])
Name Description
examples The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
attr The attribute to score. str
keyword-only
getter Defaults to getattr. If provided, getter(token, attr) should return the value of the attribute for an individual Token. Callable[[Token, str], Any]
RETURNS A dictionary containing the score {attr}_acc. Dict[str, float]
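
The getter contract described above, getter(token, attr), can be tried out without a pipeline. The sketch below uses stand-in token objects from the standard library rather than real Token instances, and assumes predicted and gold tokens are aligned one-to-one; `lower_getter` and `attr_accuracy` are hypothetical names.

```python
from types import SimpleNamespace


def lower_getter(token, attr):
    """Hypothetical getter: read the attribute and lowercase string values."""
    value = getattr(token, attr)
    return value.lower() if isinstance(value, str) else value


def attr_accuracy(pred_tokens, gold_tokens, attr, getter=getattr):
    """Fraction of aligned token pairs whose attribute values agree."""
    pairs = list(zip(pred_tokens, gold_tokens))
    correct = sum(getter(p, attr) == getter(g, attr) for p, g in pairs)
    return correct / len(pairs) if pairs else 0.0


# Stand-in tokens (assumption: real code would pass spaCy Token objects).
gold = [SimpleNamespace(pos_="NOUN"), SimpleNamespace(pos_="VERB")]
pred = [SimpleNamespace(pos_="noun"), SimpleNamespace(pos_="ADJ")]

strict = attr_accuracy(pred, gold, "pos_")  # default getter is getattr
relaxed = attr_accuracy(pred, gold, "pos_", getter=lower_getter)
```

With the default getter the case mismatch counts as an error; the custom getter normalizes values before they are compared.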

Scorer.score_token_attr_per_feat

Scores a single token attribute per feature, for an attribute in the Universal Dependencies FEATS format.

Example

scores = Scorer.score_token_attr_per_feat(examples, "morph")
print(scores["morph_per_feat"])
Name Description
examples The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
attr The attribute to score. str
keyword-only
getter Defaults to getattr. If provided, getter(token, attr) should return the value of the attribute for an individual Token. Callable[[Token, str], Any]
RETURNS A dictionary containing the per-feature PRF scores under the key {attr}_per_feat. Dict[str, Dict[str, float]]
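
As an illustration of per-feature scoring, the sketch below splits FEATS strings such as "Case=Nom|Number=Sing" into name/value pairs and tallies true positives, false positives and false negatives per feature name. It is a simplified standalone approximation, not spaCy's code; `parse_feats` and `per_feat_counts` are hypothetical helpers, and tokens are assumed to be aligned one-to-one.

```python
from collections import defaultdict


def parse_feats(feats):
    """Split a UD FEATS string into a {name: value} dict ('_' means empty)."""
    if not feats or feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def per_feat_counts(pred_feats, gold_feats):
    """Count tp/fp/fn per feature name over aligned tokens."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for pred, gold in zip(pred_feats, gold_feats):
        pred_items = set(parse_feats(pred).items())
        gold_items = set(parse_feats(gold).items())
        for name, _ in pred_items & gold_items:
            counts[name]["tp"] += 1
        for name, _ in pred_items - gold_items:
            counts[name]["fp"] += 1
        for name, _ in gold_items - pred_items:
            counts[name]["fn"] += 1
    return dict(counts)


gold = ["Case=Nom|Number=Sing", "Tense=Past"]
pred = ["Case=Acc|Number=Sing", "_"]
counts = per_feat_counts(pred, gold)
```

A wrong value for a feature counts as both a false positive (the predicted value) and a false negative (the missed gold value) in this scheme.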

Scorer.score_spans

Returns PRF scores for labeled or unlabeled spans.

Example

scores = Scorer.score_spans(examples, "ents")
print(scores["ents_f"])
Name Description
examples The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
attr The attribute to score. str
keyword-only
getter Defaults to getattr. If provided, getter(doc, attr) should return the Span objects for an individual Doc. Callable[[Doc, str], Iterable[Span]]
RETURNS A dictionary containing the PRF scores under the keys {attr}_p, {attr}_r, {attr}_f and the per-type PRF scores under {attr}_per_type. Dict[str, Union[float, Dict[str, float]]]
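
The relationship between the overall and per-type span scores can be sketched with plain tuples. The example treats each span as a (label, start, end) triple and requires an exact match on all three; this is an assumption made for illustration, and `prf` and `span_scores` are hypothetical helpers rather than spaCy functions.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


def span_scores(pred, gold):
    """pred and gold are sets of (label, start, end) tuples."""
    overall = prf(len(pred & gold), len(pred - gold), len(gold - pred))
    per_type = {}
    for label in sorted({span[0] for span in pred | gold}):
        pred_l = {s for s in pred if s[0] == label}
        gold_l = {s for s in gold if s[0] == label}
        per_type[label] = prf(
            len(pred_l & gold_l), len(pred_l - gold_l), len(gold_l - pred_l)
        )
    return overall, per_type


gold = {("PERSON", 0, 2), ("ORG", 5, 6)}
pred = {("PERSON", 0, 2), ("ORG", 4, 6)}  # ORG boundary is off by one token
overall, per_type = span_scores(pred, gold)
```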

Scorer.score_deps

Calculate the UAS, LAS, and LAS per type scores for dependency parses.

Example

def dep_getter(token, attr):
    dep = getattr(token, attr)
    dep = token.vocab.strings.as_string(dep).lower()
    return dep

scores = Scorer.score_deps(
    examples,
    "dep",
    getter=dep_getter,
    ignore_labels=("p", "punct")
)
print(scores["dep_uas"], scores["dep_las"])
Name Description
examples The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
attr The attribute to score. str
keyword-only
getter Defaults to getattr. If provided, getter(token, attr) should return the value of the attribute for an individual Token. Callable[[Token, str], Any]
head_attr The attribute containing the head token. str
head_getter Defaults to getattr. If provided, head_getter(token, attr) should return the head for an individual Token. Callable[[Token, str], Token]
ignore_labels Labels to ignore while scoring (e.g. "punct"). Iterable[str]
RETURNS A dictionary containing the scores: {attr}_uas, {attr}_las, and {attr}_las_per_type. Dict[str, Union[float, Dict[str, float]]]
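
For intuition about how UAS and LAS differ, and what ignore_labels does, here is a simplified accuracy-style sketch over aligned tokens. It is not spaCy's implementation; `uas_las` is a hypothetical helper, and each token is represented as a (head_index, dep_label) pair.

```python
def uas_las(pred, gold, ignore_labels=()):
    """Unlabeled and labeled attachment accuracy over aligned tokens."""
    total = uas = las = 0
    for (p_head, p_dep), (g_head, g_dep) in zip(pred, gold):
        if g_dep in ignore_labels:
            continue  # e.g. punctuation is excluded from scoring
        total += 1
        if p_head == g_head:
            uas += 1  # correct head, label not considered
            if p_dep == g_dep:
                las += 1  # correct head and correct label
    if total == 0:
        return None, None
    return uas / total, las / total


gold = [(1, "nsubj"), (1, "root"), (1, "dobj"), (1, "punct")]
pred = [(1, "nsubj"), (1, "root"), (1, "iobj"), (2, "punct")]
uas, las = uas_las(pred, gold, ignore_labels=("punct",))
```

Note the trailing comma in ("punct",): without it the argument is a plain string, and the membership test would match substrings instead of whole labels.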

Scorer.score_cats

Calculate PRF and ROC AUC scores for a doc-level attribute that is a dict containing scores for each label like Doc.cats. The reported overall score depends on the scorer settings:

  1. all: {attr}_score (one of {attr}_f / {attr}_macro_f / {attr}_macro_auc), {attr}_score_desc (text description of the overall score), {attr}_f_per_type, {attr}_auc_per_type
  2. binary exclusive with positive label: {attr}_p, {attr}_r, {attr}_f
  3. 3+ exclusive classes, macro-averaged F-score: {attr}_macro_f
  4. multilabel, macro-averaged AUC: {attr}_macro_auc

Example

labels = ["LABEL_A", "LABEL_B", "LABEL_C"]
scores = Scorer.score_cats(
    examples,
    "cats",
    labels=labels
)
print(scores["cats_macro_auc"])
Name Description
examples The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
attr The attribute to score. str
keyword-only
getter Defaults to getattr. If provided, getter(doc, attr) should return the cats for an individual Doc. Callable[[Doc, str], Dict[str, float]]
labels The set of possible labels. Defaults to []. Iterable[str]
multi_label Whether the attribute allows multiple labels. Defaults to True. bool
positive_label The positive label for a binary task with exclusive classes. Defaults to None. Optional[str]
RETURNS A dictionary containing the scores, with inapplicable scores as None. Dict[str, Optional[float]]
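
The macro-averaged F-score for exclusive classes can be illustrated in a few lines. This sketch assumes each document's predicted label is the highest-scoring entry of its cats dictionary; `macro_f` is a hypothetical helper and the scores are made-up inputs, not output from a trained model.

```python
def macro_f(pred_cats, gold_cats, labels):
    """Unweighted mean of per-label F-scores for mutually exclusive labels."""
    f_scores = []
    for label in labels:
        tp = fp = fn = 0
        for pred, gold in zip(pred_cats, gold_cats):
            pred_label = max(pred, key=pred.get)  # highest-scoring prediction
            gold_label = max(gold, key=gold.get)
            if pred_label == label and gold_label == label:
                tp += 1
            elif pred_label == label:
                fp += 1
            elif gold_label == label:
                fn += 1
        denom = 2 * tp + fp + fn
        f_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f_scores) / len(f_scores)


labels = ["LABEL_A", "LABEL_B", "LABEL_C"]
gold = [
    {"LABEL_A": 1.0, "LABEL_B": 0.0, "LABEL_C": 0.0},
    {"LABEL_A": 0.0, "LABEL_B": 1.0, "LABEL_C": 0.0},
    {"LABEL_A": 0.0, "LABEL_B": 0.0, "LABEL_C": 1.0},
]
pred = [
    {"LABEL_A": 0.8, "LABEL_B": 0.1, "LABEL_C": 0.1},
    {"LABEL_A": 0.6, "LABEL_B": 0.3, "LABEL_C": 0.1},
    {"LABEL_A": 0.1, "LABEL_B": 0.2, "LABEL_C": 0.7},
]
score = macro_f(pred, gold, labels)
```

Because every label contributes equally to the average, the completely missed LABEL_B pulls the score down as much as a frequent label would.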

Scorer.score_links

Returns PRF for predicted links on the entity level. To disentangle the performance of the NEL from the NER, this method only evaluates NEL links for entities that overlap between the gold reference and the predictions.

Example

scores = Scorer.score_links(
    examples,
    negative_labels=["NIL", ""]
)
print(scores["nel_micro_f"])
Name Description
examples The Example objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
keyword-only
negative_labels The string values that refer to no annotation (e.g. "NIL"). Iterable[str]
RETURNS A dictionary containing the scores. Dict[str, Optional[float]]
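
One plausible way to count link errors while leaving NER mistakes out of the picture is sketched below. It is an assumption-laden illustration, not spaCy's implementation: entities are keyed by (start, end) offsets, "overlap" is taken to mean identical offsets, and `link_counts` is a hypothetical helper.

```python
def link_counts(pred_links, gold_links, negative_labels=("NIL", "")):
    """Count tp/fp/fn for KB links on entities present in both pred and gold."""
    tp = fp = fn = 0
    for span, gold_kb in gold_links.items():
        if span not in pred_links:
            continue  # entity missed by the NER: not held against the linker
        pred_kb = pred_links[span]
        gold_neg = gold_kb in negative_labels
        pred_neg = pred_kb in negative_labels
        if gold_neg and pred_neg:
            continue  # both agree there is no link: nothing to score
        if gold_kb == pred_kb:
            tp += 1
        else:
            if not pred_neg:
                fp += 1  # a link was predicted, but it is wrong
            if not gold_neg:
                fn += 1  # the gold link was not recovered
    return tp, fp, fn


# Hypothetical KB IDs, keyed by entity offsets.
gold = {(0, 2): "Q1", (5, 6): "Q2", (8, 9): "NIL", (11, 12): "Q4"}
pred = {(0, 2): "Q1", (5, 6): "Q9", (8, 9): "Q7"}
tp, fp, fn = link_counts(pred, gold)
```

The entity at (11, 12) is absent from the predictions, so it affects none of the counts.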