mirror of https://github.com/explosion/spaCy.git
REL intro and get_candidates function
This commit is contained in:
parent
df06f7a792
commit
2c4b2ee5e9
|
@ -486,6 +486,60 @@ with Model.define_operators({">>": chain}):
|
|||
|
||||
## Create new trainable components {#components}
|
||||
|
||||
In addition to [swapping out](#swap-architectures) default models in built-in
|
||||
components, you can also implement an entirely new,
|
||||
[trainable pipeline component](usage/processing-pipelines#trainable-components)
|
||||
from scratch. This can be done by creating a new class inheriting from [`Pipe`](/api/pipe),
|
||||
and linking it up to your custom model implementation.
|
||||
|
||||
### Example: Pipeline component for relation extraction {#component-rel}
|
||||
|
||||
This section will run through an example of implementing a novel relation extraction
|
||||
component from scratch. As a first step, we need a method that will generate pairs of
|
||||
entities that we want to classify as being related or not. These candidate pairs are
|
||||
typically formed within one document, which means we'll have a function that takes a
|
||||
`Doc` as input and outputs a `List` of `Span` tuples. In this example, we will focus
|
||||
on binary relation extraction, i.e. the tuple will be of length 2.
|
||||
|
||||
We register this function in the 'misc' register so we can easily refer to it from the config,
|
||||
and allow swapping it out for any candidate
|
||||
generation function. For instance, a very straightforward implementation would be to just
|
||||
take any two entities from the same document:
|
||||
|
||||
```python
|
||||
@registry.misc.register("rel_cand_generator.v1")
|
||||
def create_candidate_indices() -> Callable[[Doc], List[Tuple[Span, Span]]]:
|
||||
def get_candidate_indices(doc: "Doc"):
|
||||
indices = []
|
||||
for ent1 in doc.ents:
|
||||
for ent2 in doc.ents:
|
||||
indices.append((ent1, ent2))
|
||||
return indices
|
||||
return get_candidate_indices
|
||||
```
|
||||
|
||||
But we could also refine this further by excluding relations of an entity with itself,
|
||||
and posing a maximum distance (in number of tokens) between two entities:
|
||||
|
||||
```python
|
||||
### {highlight="1,2,7,8"}
|
||||
@registry.misc.register("rel_cand_generator.v2")
|
||||
def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
|
||||
def get_candidate_indices(doc: "Doc"):
|
||||
indices = []
|
||||
for ent1 in doc.ents:
|
||||
for ent2 in doc.ents:
|
||||
if ent1 != ent2:
|
||||
if max_length and abs(ent2.start - ent1.start) <= max_length:
|
||||
indices.append((ent1, ent2))
|
||||
return indices
|
||||
return get_candidate_indices
|
||||
```
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<Infobox title="This section is still under construction" emoji="🚧" variant="warning">
|
||||
</Infobox>
|
||||
|
||||
|
|
|
@ -1035,7 +1035,7 @@ plug fully custom machine learning components into your pipeline. You'll need
|
|||
the following:
|
||||
|
||||
1. **Model:** A Thinc [`Model`](https://thinc.ai/docs/api-model) instance. This
|
||||
can be a model using implemented in
|
||||
can be a model implemented in
|
||||
[Thinc](/usage/layers-architectures#thinc), or a
|
||||
[wrapped model](/usage/layers-architectures#frameworks) implemented in
|
||||
PyTorch, TensorFlow, MXNet or a fully custom solution. The model must take a
|
||||
|
|
Loading…
Reference in New Issue