From 55ed2b4e8254af9331ebd3cc5316315e22d2e96f Mon Sep 17 00:00:00 2001
From: Raphael Mitsch <r.mitsch@outlook.com>
Date: Mon, 4 Dec 2023 15:23:28 +0100
Subject: [PATCH] Add documentation for EL task (#12988)

* Add documentation for EL task.

* Fix EL factory name.

* Add llm_entity_linker_mentio.

* Apply suggestions from code review

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Incorporate feedback.

* Format.

* Fix link to KB data.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
---
 website/docs/api/large-language-models.mdx   | 168 ++++++++++++++++++-
 website/docs/usage/large-language-models.mdx |   1 +
 2 files changed, 168 insertions(+), 1 deletion(-)
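
Note on the default entity description format that this patch documents for
`spacy.KBObjectLoader.v1` (a CSV file without header row, ";" as delimiter, and
two columns for entity ID and description): the following is a minimal,
stdlib-only sketch of a reader for that layout. It is an illustration under
those stated assumptions, not spacy-llm's internal implementation;
`read_entity_descriptions` and the sample rows are hypothetical.

```python
import csv
import io
from typing import Dict, TextIO


def read_entity_descriptions(handle: TextIO) -> Dict[str, str]:
    """Hypothetical reader mapping entity ID -> description."""
    descriptions: Dict[str, str] = {}
    # No header row is expected; columns are separated by ";".
    for row in csv.reader(handle, delimiter=";"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"Expected 2 columns, got {len(row)}: {row!r}")
        entity_id, description = row
        descriptions[entity_id] = description
    return descriptions


# Toy data for illustration only.
sample = io.StringIO(
    "Q312;American technology company\n"
    "Q89;Fruit of the apple tree\n"
)
descriptions = read_entity_descriptions(sample)
```

A custom `ent_desc_reader` would presumably be the place to swap in a different
file layout; how spacy-llm calls it is not shown here.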

diff --git a/website/docs/api/large-language-models.mdx b/website/docs/api/large-language-models.mdx
index f8404cb2e..730ef5054 100644
--- a/website/docs/api/large-language-models.mdx
+++ b/website/docs/api/large-language-models.mdx
@@ -20,7 +20,8 @@ An LLM component is implemented through the `LLMWrapper` class. It is accessible
 through a generic `llm`
 [component factory](https://spacy.io/usage/processing-pipelines#custom-components-factories)
 as well as through task-specific component factories: `llm_ner`, `llm_spancat`,
-`llm_rel`, `llm_textcat`, `llm_sentiment` and `llm_summarization`.
+`llm_rel`, `llm_textcat`, `llm_sentiment`, `llm_summarization` and
+`llm_entity_linker`.
 
 ### LLMWrapper.\_\_init\_\_ {id="init",tag="method"}
 
@@ -302,6 +303,171 @@ max_n_words = 20
 path = "summarization_examples.yml"
 ```
 
+### EL (Entity Linking) {id="nel"}
+
+The EL task links recognized entities (see [NER](#ner)) to entries in a
+knowledge base (KB). It prompts the LLM to select the most likely candidate
+from the KB, whose structure can be arbitrary.
+
+Note that the documents processed by the entity linking task are expected to
+have recognized entities in their `.ents` attribute. This can be achieved by
+either running the [NER task](#ner), using a trained spaCy NER model or setting
+the entities manually prior to running the EL task.
+
+In order to be able to pull data from the KB, an object implementing the
+`CandidateSelector` protocol has to be provided. This requires two functions:
+(1) `__call__()` to fetch candidate entities for entity mentions in the text
+(assumed to be available in `Doc.ents`) and (2) `get_entity_description()` to
+fetch descriptions for any given entity ID. Descriptions can be empty, but
+ideally provide more context for entities stored in the KB.
+
+`spacy-llm` provides a `CandidateSelector` implementation
+(`spacy.CandidateSelector.v1`) that leverages a spaCy knowledge base, as used
+in an `entity_linker` component, to select candidates. This knowledge base can
+be loaded from an existing spaCy pipeline (note that the pipeline's EL component
+doesn't have to be trained) or from a separate `.yaml` file.
+
+#### spacy.EntityLinker.v1 {id="el-v1"}
+
+Supports zero- and few-shot prompting. Relies on a configurable component
+suggesting viable entities before letting the LLM pick the most likely
+candidate.
+
+> #### Example config for spacy.EntityLinker.v1
+>
+> ```ini
+> [paths]
+> el_kb = null
+>
+> ...
+>
+> [components.llm.task]
+> @llm_tasks = "spacy.EntityLinker.v1"
+>
+> [initialize]
+> [initialize.components]
+> [initialize.components.llm]
+> [initialize.components.llm.candidate_selector]
+> @llm_misc = "spacy.CandidateSelector.v1"
+>
+> # Load a KB from a KB file. For loading KBs from spaCy pipelines see spacy.KBObjectLoader.v1.
+> [initialize.components.llm.candidate_selector.kb_loader]
+> @llm_misc = "spacy.KBFileLoader.v1"
+> # Path to knowledge base .yaml file.
+> path = ${paths.el_kb}
+> ```
+
+| Argument              | Description                                                                                                                                                                                   |
+| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `template`            | Custom prompt template to send to LLM model. Defaults to [entity_linker.v1.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/entity_linker.v1.jinja). ~~str~~ |
+| `parse_responses`     | Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. ~~Optional[TaskResponseParser[EntityLinkerTask]]~~                                   |
+| `prompt_example_type` | Type to use for few-shot examples. Defaults to `ELExample`. ~~Optional[Type[FewshotExample]]~~                                                                                                |
+| `examples`            | Optional callable that reads a file containing task examples for few-shot learning. If `None` is passed, zero-shot learning will be used. Defaults to `None`. ~~ExamplesConfigType~~          |
+| `scorer`              | Scorer function. Defaults to the metric used by spaCy to evaluate entity linking performance. ~~Optional[Scorer]~~                                                                            |
+
+##### spacy.CandidateSelector.v1 {id="candidate-selector-v1"}
+
+`spacy.CandidateSelector.v1` is an implementation of the `CandidateSelector`
+protocol required by [`spacy.EntityLinker.v1`](#el-v1). The built-in candidate
+selector can load existing knowledge bases in several ways, e.g. from a spaCy
+pipeline with a (not necessarily trained) entity linking component, or from a
+`.yaml` file describing the knowledge base. Either way, the loaded data is
+converted to a spaCy `InMemoryLookupKB` instance. The KB's selection
+capabilities are then used to select the most likely entity candidates for the
+specified mentions.
+
+> #### Example config for spacy.CandidateSelector.v1
+>
+> ```ini
+> [initialize]
+> [initialize.components]
+> [initialize.components.llm]
+> [initialize.components.llm.candidate_selector]
+> @llm_misc = "spacy.CandidateSelector.v1"
+>
+> # Load a KB from a KB file. For loading KBs from spaCy pipelines see spacy.KBObjectLoader.v1.
+> [initialize.components.llm.candidate_selector.kb_loader]
+> @llm_misc = "spacy.KBFileLoader.v1"
+> # Path to knowledge base .yaml file.
+> path = ${paths.el_kb}
+> ```
+
+| Argument    | Description                                                       |
+| ----------- | ----------------------------------------------------------------- |
+| `kb_loader` | KB loader object. ~~InMemoryLookupKBLoader~~                      |
+| `top_n`     | Top-n candidates to include in the prompt. Defaults to 5. ~~int~~ |
+
+##### spacy.KBObjectLoader.v1 {id="kb-object-loader-v1"}
+
+Adheres to the `InMemoryLookupKBLoader` interface required by
+[`spacy.CandidateSelector.v1`](#candidate-selector-v1). Loads a knowledge base
+from an existing spaCy pipeline.
+
+> #### Example config for spacy.KBObjectLoader.v1
+>
+> ```ini
+> [initialize.components.llm.candidate_selector.kb_loader]
+> @llm_misc = "spacy.KBObjectLoader.v1"
+> # Path to knowledge base directory in serialized spaCy pipeline.
+> path = ${paths.el_kb}
+> # Path to spaCy pipeline. If this is not specified, spacy-llm tries to determine this automatically (but may fail).
+> nlp_path = ${paths.el_nlp}
+> # Path to file with descriptions for entities.
+> desc_path = ${paths.el_desc}
+> ```
+
+| Argument          | Description                                                                                                                                                                                                                         |
+| ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `path`            | Path to KB file. ~~Union[str, Path]~~                                                                                                                                                                                               |
+| `nlp_path`        | Path to serialized NLP pipeline. If None, path will be guessed. ~~Optional[Union[Path, str]]~~                                                                                                                                      |
+| `desc_path`       | Path to file with descriptions for entities. ~~Optional[Union[Path, str]]~~                                                                                                                                                         |
+| `ent_desc_reader` | Entity description reader. Defaults to an internal method expecting a CSV file without header row, with ";" as delimiters, and with two columns: one for the entity IDs, one for their descriptions. ~~Optional[EntDescReader]~~    |
+
+##### spacy.KBFileLoader.v1 {id="kb-file-loader-v1"}
+
+Adheres to the `InMemoryLookupKBLoader` interface required by
+[`spacy.CandidateSelector.v1`](#candidate-selector-v1). Loads a knowledge base
+from a knowledge base file. The KB .yaml file has to stick to the following
+format:
+
+```yaml
+entities:
+  # The key should be whatever ID identifies this entity uniquely in your knowledge base.
+  ID1:
+      name: "..."
+      desc: "..."
+  ID2:
+      ...
+# Data on aliases in your knowledge base - e.g. "Apple" for the entity "Apple Inc.".
+aliases:
+  - alias: "..."
+    # List of all entities that this alias refers to.
+    entities: ["ID1", "ID2", ...]
+    # Optional: prior probabilities that this alias refers to the n-th entity in the "entities" attribute.
+    probabilities: [0.5, 0.2, ...]
+  - alias: "..."
+    entities: [...]
+    probabilities: [...]
+  ...
+```
+
+See
+[here](https://github.com/explosion/spacy-llm/blob/main/usage_examples/el_openai/el_kb_data.yml)
+for a toy example of what such a KB file might look like.
+
+> #### Example config for spacy.KBFileLoader.v1
+>
+> ```ini
+> [initialize.components.llm.candidate_selector.kb_loader]
+> @llm_misc = "spacy.KBFileLoader.v1"
+> # Path to knowledge base file.
+> path = ${paths.el_kb}
+> ```
+
+| Argument | Description                           |
+| -------- | ------------------------------------- |
+| `path`   | Path to KB file. ~~Union[str, Path]~~ |
+
 ### NER {id="ner"}
 
 The NER task identifies non-overlapping entities in text.
diff --git a/website/docs/usage/large-language-models.mdx b/website/docs/usage/large-language-models.mdx
index 94494b4e1..43b22ce07 100644
--- a/website/docs/usage/large-language-models.mdx
+++ b/website/docs/usage/large-language-models.mdx
@@ -357,6 +357,7 @@ evaluate the component.
 
 | Component                                                               | Description                                                                                                       |
 | ----------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
+| [`spacy.EntityLinker.v1`](/api/large-language-models#el-v1)             | The entity linking task prompts the model to link all entities in a given text to entries in a knowledge base.    |
 | [`spacy.Summarization.v1`](/api/large-language-models#summarization-v1) | The summarization task prompts the model for a concise summary of the provided text.                              |
 | [`spacy.NER.v3`](/api/large-language-models#ner-v3)                     | Implements Chain-of-Thought reasoning for NER extraction - obtains higher accuracy than v1 or v2.                 |
 | [`spacy.NER.v2`](/api/large-language-models#ner-v2)                     | Builds on v1 and additionally supports defining the provided labels with explicit descriptions.                   |
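
The `CandidateSelector` protocol described in the API docs above requires
`__call__()` and `get_entity_description()`. As a rough, dependency-free sketch
of that shape, the class below works on plain mention strings and a dict that
mirrors the documented KB `.yaml` layout. The real protocol operates on spaCy
objects, so every name and signature here (`ToyCandidateSelector`, `KB_DATA`,
the `top_n` handling) is a hypothetical simplification, not spacy-llm's API.

```python
from typing import Dict, List

# Toy data mirroring the documented KB file layout (entities + aliases).
KB_DATA = {
    "entities": {
        "Q312": {"name": "Apple Inc.", "desc": "American technology company"},
        "Q89": {"name": "apple", "desc": "Fruit of the apple tree"},
    },
    "aliases": [
        {"alias": "Apple", "entities": ["Q89", "Q312"], "probabilities": [0.2, 0.8]},
    ],
}


class ToyCandidateSelector:
    """Hypothetical selector: ranks candidate IDs per alias by prior probability."""

    def __init__(self, kb_data: dict, top_n: int = 5) -> None:
        self._entities = kb_data["entities"]
        self._top_n = top_n
        self._candidates: Dict[str, List[str]] = {}
        for entry in kb_data["aliases"]:
            ids = entry["entities"]
            # Probabilities are optional; assume equal-length lists when given.
            probs = entry.get("probabilities") or [0.0] * len(ids)
            ranked = sorted(zip(ids, probs), key=lambda pair: pair[1], reverse=True)
            self._candidates[entry["alias"]] = [ent_id for ent_id, _ in ranked]

    def __call__(self, mentions: List[str]) -> List[List[str]]:
        # One candidate list per mention; unknown mentions get an empty list.
        return [self._candidates.get(m, [])[: self._top_n] for m in mentions]

    def get_entity_description(self, entity_id: str) -> str:
        # Descriptions may be empty, e.g. for unknown IDs.
        return self._entities.get(entity_id, {}).get("desc", "")


selector = ToyCandidateSelector(KB_DATA, top_n=1)
candidates = selector(["Apple", "Pear"])
```

With `top_n=1`, only the highest-prior candidate per mention would reach the
prompt; a larger `top_n` trades prompt length for recall.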