From 14a796e3f9ecfd5a6db969032324d83d40883704 Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Tue, 7 Jul 2020 14:46:41 +0200
Subject: [PATCH] add Example API with examples of Example usage

---
 website/docs/api/example.md | 274 +++++++++++++++++++++++++++++++++++-
 1 file changed, 272 insertions(+), 2 deletions(-)

diff --git a/website/docs/api/example.md b/website/docs/api/example.md
index 9dabaf851..0f1ed618d 100644
--- a/website/docs/api/example.md
+++ b/website/docs/api/example.md
@@ -1,10 +1,280 @@
 ---
 title: Example
-teaser: A training example
+teaser: A training instance
 tag: class
 source: spacy/gold/example.pyx
+new: 3.0
 ---
-
+An `Example` holds the information for one training instance. It stores two
+`Doc` objects: one for holding the gold-standard reference data, and one for
+holding the predictions of the pipeline. An `Alignment`
+object stores the alignment between these two documents, as they can differ in
+tokenization.

 ## Example.\_\_init\_\_ {#init tag="method"}
+
+Construct an `Example` object from the `predicted` document and the `reference`
+document. If `alignment` is `None`, it will be initialized from the words in
+both documents.
+
+> #### Example
+>
+> ```python
+> from spacy.tokens import Doc
+> from spacy.gold import Example
+> words = ["hello", "world", "!"]
+> spaces = [True, False, False]
+> predicted = Doc(nlp.vocab, words=words, spaces=spaces)
+> reference = parse_gold_doc(my_data)
+> example = Example(predicted, reference)
+> ```
+
+| Name | Type | Description |
+| -------------- | ----------- | ------------------------------------------------------------------------------------------------ |
+| `predicted` | `Doc` | The document containing (partial) predictions. Cannot be `None`. |
+| `reference` | `Doc` | The document containing gold-standard annotations. Cannot be `None`. |
+| _keyword-only_ | | |
+| `alignment` | `Alignment` | An object holding the alignment between the tokens of the `predicted` and `reference` documents. |
+| **RETURNS** | `Example` | The newly constructed object. |
+
+## Example.from_dict {#from_dict tag="classmethod"}
+
+Construct an `Example` object from the `predicted` document and the reference
+annotations provided as a dictionary.
+
+
+
+> #### Example
+>
+> ```python
+> from spacy.tokens import Doc
+> from spacy.gold import Example
+> predicted = Doc(vocab, words=["Apply", "some", "sunscreen"])
+> token_ref = ["Apply", "some", "sun", "screen"]
+> tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
+> example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
+> ```
+
+| Name | Type | Description |
+| -------------- | ---------------- | ----------------------------------------------------------------- |
+| `predicted` | `Doc` | The document containing (partial) predictions. Cannot be `None`. |
+| `example_dict` | `Dict[str, obj]` | The gold-standard annotations as a dictionary. Cannot be `None`. |
+| **RETURNS** | `Example` | The newly constructed object. |
+
+## Example.text {#text tag="property"}
+
+The text of the `predicted` document in this `Example`.
+
+> #### Example
+>
+> ```python
+> raw_text = example.text
+> ```
+
+| Name | Type | Description |
+| ----------- | ---- | ------------------------------------- |
+| **RETURNS** | str | The text of the `predicted` document. |
+
+## Example.predicted {#predicted tag="property"}
+
+> #### Example
+>
+> ```python
+> docs = [eg.predicted for eg in examples]
+> predictions, _ = model.begin_update(docs)
+> set_annotations(docs, predictions)
+> ```
+
+The `Doc` holding the predictions. Occasionally also referred to as `example.x`.
+
+| Name | Type | Description |
+| ----------- | ----- | ---------------------------------------------- |
+| **RETURNS** | `Doc` | The document containing (partial) predictions. |
+
+## Example.reference {#reference tag="property"}
+
+> #### Example
+>
+> ```python
+> for i, eg in enumerate(examples):
+>     for j, label in enumerate(all_labels):
+>         gold_labels[i][j] = eg.reference.cats.get(label, 0.0)
+> ```
+
+The `Doc` holding the gold-standard annotations. Occasionally also referred to
+as `example.y`.
+
+| Name | Type | Description |
+| ----------- | ----- | -------------------------------------------------- |
+| **RETURNS** | `Doc` | The document containing gold-standard annotations. |
+
+## Example.alignment {#alignment tag="property"}
+
+> #### Example
+>
+> ```python
+> tokens_x = ["Apply", "some", "sunscreen"]
+> x = Doc(vocab, words=tokens_x)
+> tokens_y = ["Apply", "some", "sun", "screen"]
+> example = Example.from_dict(x, {"words": tokens_y})
+> alignment = example.alignment
+> assert list(alignment.y2x.data) == [[0], [1], [2], [2]]
+> ```
+
+The `Alignment` object mapping the tokens of the `predicted` document to those
+of the `reference` document.
+
+| Name | Type | Description |
+| ----------- | ----------- | -------------------------------------------------- |
+| **RETURNS** | `Alignment` | The alignment between the tokens of the `predicted` and `reference` documents. |
+
+## Example.get_aligned {#get_aligned tag="method"}
+
+> #### Example
+>
+> ```python
+> predicted = Doc(vocab, words=["Apply", "some", "sunscreen"])
+> token_ref = ["Apply", "some", "sun", "screen"]
+> tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
+> example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
+> assert example.get_aligned("TAG", as_string=True) == ["VERB", "DET", "NOUN"]
+> ```
+
+Get the aligned view of a certain token attribute, denoted by its int ID or string name.
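
To make the idea of an aligned view more concrete, here is a minimal pure-Python sketch. It is not spaCy's implementation: the helper `align_to_predicted` and its parameters are hypothetical, and returning `None` when the contributing reference tokens disagree is an assumption made for illustration. The mapping reuses the `y2x` data shown in the `Example.alignment` example.

```python
from typing import List, Optional, Sequence


def align_to_predicted(
    ref_values: Sequence[str],
    y2x: Sequence[Sequence[int]],
    n_predicted: int,
) -> List[Optional[str]]:
    """Project one value per reference token onto the predicted tokens.

    Hypothetical helper for illustration only, not part of spaCy.
    """
    # For every predicted token, collect the values of all reference
    # tokens that map onto it.
    collected: List[List[str]] = [[] for _ in range(n_predicted)]
    for y_index, x_indices in enumerate(y2x):
        for x_index in x_indices:
            collected[x_index].append(ref_values[y_index])

    # Keep a value only if every contributing reference token agrees
    # (assumption: disagreement or no mapping yields None).
    aligned: List[Optional[str]] = []
    for values in collected:
        if values and all(value == values[0] for value in values):
            aligned.append(values[0])
        else:
            aligned.append(None)
    return aligned


tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
y2x = [[0], [1], [2], [2]]  # reference token index -> predicted token indices
print(align_to_predicted(tags_ref, y2x, n_predicted=3))
# prints ['VERB', 'DET', 'NOUN']
```

Both `"sun"` and `"screen"` map onto the single predicted token `"sunscreen"`, and because they carry the same tag, that tag is kept.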
+
+| Name | Type | Description | Default |
+| ----------- | -------------------------- | ------------------------------------------------------------------ | ------- |
+| `field` | int or str | Attribute ID or string name. | |
+| `as_string` | bool | Whether or not to return the list of values as strings. | `False` |
+| **RETURNS** | `List[int]` or `List[str]` | List of integer values, or string values if `as_string` is `True`. | |
+
+## Example.get_aligned_parse {#get_aligned_parse tag="method"}
+
+> #### Example
+>
+> ```python
+> doc = nlp("He pretty quickly walks away")
+> example = Example.from_dict(doc, {"heads": [3, 2, 3, 0, 2]})
+> proj_heads, proj_labels = example.get_aligned_parse(projectivize=True)
+> assert proj_heads == [3, 2, 3, 0, 3]
+> ```
+
+Get the aligned view of the dependency parse. If `projectivize` is set to
+`True`, non-projective dependency trees are made projective through the
+Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005).
+
+| Name | Type | Description | Default |
+| -------------- | -------------------------- | ------------------------------------------------------------------ | ------- |
+| `projectivize` | bool | Whether or not to projectivize the dependency trees. | `True` |
+| **RETURNS** | `Tuple[List[int], List[str]]` | The aligned head indices and dependency labels, as unpacked in the example. | |
+
+## Example.get_aligned_ner {#get_aligned_ner tag="method"}
+
+> #### Example
+>
+> ```python
+> words = ["Mrs", "Smith", "flew", "to", "New York"]
+> doc = Doc(en_vocab, words=words)
+> entities = [(0, len("Mrs Smith"), "PERSON"), (18, 18 + len("New York"), "LOC")]
+> gold_words = ["Mrs Smith", "flew", "to", "New", "York"]
+> example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
+> ner_tags = example.get_aligned_ner()
+> assert ner_tags == ["B-PERSON", "L-PERSON", "O", "O", "U-LOC"]
+> ```
+
+Get the aligned view of the NER
+[BILUO](/usage/linguistic-features#accessing-ner) tags.
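
For readers unfamiliar with the BILUO scheme, the sketch below shows how token-level entity spans could be turned into tags. `spans_to_biluo` is a hypothetical helper, not part of spaCy, and it assumes non-overlapping spans given as `(start, end, label)` token indices with an exclusive `end`.

```python
from typing import List, Sequence, Tuple


def spans_to_biluo(n_tokens: int, spans: Sequence[Tuple[int, int, str]]) -> List[str]:
    """Encode token-level entity spans as BILUO tags.

    Hypothetical helper for illustration only. Assumes the spans do not
    overlap and that each span covers at least one token.
    """
    tags = ["O"] * n_tokens  # O: outside any entity
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"  # U: single-token (unit) entity
        else:
            tags[start] = f"B-{label}"  # B: first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # I: inner tokens
            tags[end - 1] = f"L-{label}"  # L: last token of the entity
    return tags


# Predicted tokenization from the example: "Mrs", "Smith", "flew", "to", "New York"
print(spans_to_biluo(5, [(0, 2, "PERSON"), (4, 5, "LOC")]))
# prints ['B-PERSON', 'L-PERSON', 'O', 'O', 'U-LOC']
```

The printed tags match the ones asserted in the `get_aligned_ner` example, where `"New York"` is a single predicted token and therefore receives a `U-` tag.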
+
+| Name | Type | Description |
+| ----------- | ----------- | ----------------------------------------------------------------------------------- |
+| **RETURNS** | `List[str]` | List of BILUO values, denoting whether tokens are part of an NER annotation or not. |
+
+## Example.get_aligned_spans_y2x {#get_aligned_spans_y2x tag="method"}
+
+> #### Example
+>
+> ```python
+> words = ["Mr and Mrs Smith", "flew", "to", "New York"]
+> doc = Doc(en_vocab, words=words)
+> entities = [(0, len("Mr and Mrs Smith"), "PERSON")]
+> tokens_ref = ["Mr", "and", "Mrs", "Smith", "flew", "to", "New", "York"]
+> example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
+> ents_ref = example.reference.ents
+> assert [(ent.start, ent.end) for ent in ents_ref] == [(0, 4)]
+> ents_y2x = example.get_aligned_spans_y2x(ents_ref)
+> assert [(ent.start, ent.end) for ent in ents_y2x] == [(0, 1)]
+> ```
+
+Get the aligned view of any set of [`Span`](/api/span) objects defined over
+`example.reference`. The resulting span indices will align to the tokenization
+in `example.predicted`.
+
+| Name | Type | Description |
+| ----------- | ---------------- | --------------------------------------------------------------- |
+| `y_spans` | `Iterable[Span]` | `Span` objects aligned to the tokenization of `self.reference`. |
+| **RETURNS** | `Iterable[Span]` | `Span` objects aligned to the tokenization of `self.predicted`. |
+
+## Example.get_aligned_spans_x2y {#get_aligned_spans_x2y tag="method"}
+
+> #### Example
+>
+> ```python
+> ruler = EntityRuler(nlp)
+> patterns = [{"label": "PERSON", "pattern": "Mr and Mrs Smith"}]
+> ruler.add_patterns(patterns)
+> nlp.add_pipe(ruler)
+> doc = nlp("Mr and Mrs Smith flew to New York")
+> entities = [(0, len("Mr and Mrs Smith"), "PERSON")]
+> tokens_ref = ["Mr and Mrs", "Smith", "flew", "to", "New York"]
+> example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
+> ents_pred = example.predicted.ents
+> assert [(ent.start, ent.end) for ent in ents_pred] == [(0, 4)]
+> ents_x2y = example.get_aligned_spans_x2y(ents_pred)
+> assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]
+> ```
+
+Get the aligned view of any set of [`Span`](/api/span) objects defined over
+`example.predicted`. The resulting span indices will align to the tokenization
+in `example.reference`. This method is particularly useful to assess the
+accuracy of predicted entities against the original gold-standard annotation.
+
+| Name | Type | Description |
+| ----------- | ---------------- | --------------------------------------------------------------- |
+| `x_spans` | `Iterable[Span]` | `Span` objects aligned to the tokenization of `self.predicted`. |
+| **RETURNS** | `Iterable[Span]` | `Span` objects aligned to the tokenization of `self.reference`. |
+
+## Example.to_dict {#to_dict tag="method"}
+
+Return a dictionary representation of the reference annotation contained in this
+`Example`.
+
+> #### Example
+>
+> ```python
+> eg_dict = example.to_dict()
+> ```
+
+| Name | Type | Description |
+| ----------- | ---------------- | ------------------------------------------------------ |
+| **RETURNS** | `Dict[str, obj]` | Dictionary representation of the reference annotation. |
+
+## Example.split_sents {#split_sents tag="method"}
+
+> #### Example
+>
+> ```python
+> doc = nlp("I went yesterday had lots of fun")
+> tokens_ref = ["I", "went", "yesterday", "had", "lots", "of", "fun"]
+> sents_ref = [True, False, False, True, False, False, False]
+> example = Example.from_dict(doc, {"words": tokens_ref, "sent_starts": sents_ref})
+> split_examples = example.split_sents()
+> assert split_examples[0].text == "I went yesterday "
+> assert split_examples[1].text == "had lots of fun"
+> ```
+
+Split one `Example` into multiple `Example` objects, one for each sentence.
+
+| Name | Type | Description |
+| ----------- | --------------- | ---------------------------------------------------------- |
+| **RETURNS** | `List[Example]` | List of `Example` objects, one for each original sentence. |
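
The sentence split described for `Example.split_sents` can be pictured with a small pure-Python sketch that groups tokens by their `sent_starts` flags. This only illustrates the grouping idea: `split_by_sent_starts` is a hypothetical helper that works on plain lists, whereas the real method returns `Example` objects.

```python
from typing import List, Sequence


def split_by_sent_starts(words: Sequence[str], sent_starts: Sequence[bool]) -> List[List[str]]:
    """Group tokens into sentences using per-token sentence-start flags.

    Hypothetical helper for illustration only, not part of spaCy.
    """
    sentences: List[List[str]] = []
    for word, is_start in zip(words, sent_starts):
        # Open a new sentence at every flagged token, and also at the very
        # first token in case it is not flagged.
        if is_start or not sentences:
            sentences.append([])
        sentences[-1].append(word)
    return sentences


tokens_ref = ["I", "went", "yesterday", "had", "lots", "of", "fun"]
sents_ref = [True, False, False, True, False, False, False]
print(split_by_sent_starts(tokens_ref, sents_ref))
# prints [['I', 'went', 'yesterday'], ['had', 'lots', 'of', 'fun']]
```

The two groups correspond to the two `Example` objects whose texts are asserted in the `split_sents` example.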