spaCy/website/docs/usage/transformers.md


Transformers
Using transformer models like BERT in spaCy

Installation

spaCy v3.0 lets you use almost any statistical model to power your pipeline. You can use models implemented in a variety of frameworks, including TensorFlow, PyTorch and MXNet. To keep things sane, spaCy expects models from these frameworks to be wrapped with a common interface, using our machine learning library Thinc. A transformer model is just a statistical model, so the spacy-transformers package actually has very little work to do: we just have to provide a few functions that do the required plumbing. We also provide a pipeline component, Transformer, that lets you do multi-task learning and lets you save the transformer outputs for later use.

To use transformers with spaCy, you need the spacy-transformers package installed. It takes care of all the setup behind the scenes, and makes sure the transformer pipeline component is available to spaCy.

$ pip install spacy-transformers

Runtime usage

Transformer models can be used as drop-in replacements for other types of neural networks, so your spaCy pipeline can include them in a way that's completely invisible to the user. Users will download, load and use the model in the standard way, like any other spaCy pipeline. Instead of using the transformers as subnetworks directly, you can also use them via the Transformer pipeline component.

The processing pipeline with the transformer component

The Transformer component sets the Doc._.trf_data extension attribute, which lets you access the transformer's outputs at runtime.

$ python -m spacy download en_core_trf_lg
### Example
import spacy

nlp = spacy.load("en_core_trf_lg")
for doc in nlp.pipe(["some text", "some other text"]):
    tokvecs = doc._.trf_data.tensors[-1]

You can also customize how the Transformer component sets annotations onto the Doc by providing a custom annotation_setter. This callback is called with a batch of Doc objects and a FullTransformerBatch containing the raw transformer input and output data for the whole batch, so you can implement whatever you need.

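import spacy
from spacy.tokens import Doc

# Assumption: "custom_attr" is an illustrative extension name, registered here
Doc.set_extension("custom_attr", default=None)

def custom_annotation_setter(docs, trf_data):
    # trf_data.doc_data splits the batch output into one entry per Doc
    for doc, data in zip(docs, trf_data.doc_data):
        doc._.custom_attr = data

nlp = spacy.load("en_core_trf_lg")
nlp.get_pipe("transformer").annotation_setter = custom_annotation_setter
doc = nlp("This is a text")
print(doc._.custom_attr.tensors)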

Training usage

The recommended workflow for training is to use spaCy's config system, usually via the spacy train command. The training config defines all component settings and hyperparameters in one place and lets you describe a tree of objects by referring to creation functions, including functions you register yourself.

The easiest way to get started is to clone a transformers-based project template. Swap in your data, edit the settings and hyperparameters and train, evaluate, package and visualize your model.

The [components] section in the config.cfg describes the pipeline components and the settings used to construct them, including their model implementation. Here's a config snippet for the Transformer component, along with matching Python code. In this case, the [components.transformer] block describes the transformer component:

Python equivalent

from spacy_transformers import Transformer, TransformerModel
from spacy_transformers.annotation_setters import null_annotation_setter
from spacy_transformers.span_getters import get_doc_spans

trf = Transformer(
    nlp.vocab,
    TransformerModel(
        "bert-base-cased",
        get_spans=get_doc_spans,
        tokenizer_config={"use_fast": True},
    ),
    annotation_setter=null_annotation_setter,
    max_batch_items=4096,
)
### config.cfg (excerpt)
[components.transformer]
factory = "transformer"
max_batch_items = 4096

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-cased"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "doc_spans.v1"

[components.transformer.annotation_setter]
@annotation_setters = "spacy-transformers.null_annotation_setter.v1"

The [components.transformer.model] block describes the model argument passed to the transformer component. It's a Thinc Model object that will be passed into the component. Here, it references the function spacy-transformers.TransformerModel.v1 registered in the architectures registry. If a key in a block starts with @, it's resolved to a function and all other settings are passed to the function as arguments. In this case, those arguments are name, tokenizer_config and get_spans.
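The resolution rule can be illustrated with a toy resolver in plain Python. This is only a sketch of the idea, not Thinc's actual implementation: `REGISTRY`, `register`, `resolve` and the two stand-in creation functions are hypothetical names invented for this illustration, and the stand-ins return simple dicts and lambdas rather than real models.

```python
# Toy illustration of how "@" keys in a config block are resolved.
# NOT Thinc's real resolver: REGISTRY, register and resolve are hypothetical.

REGISTRY = {"architectures": {}, "span_getters": {}}


def register(registry_name, func_name):
    """Decorator that stores a creation function under a registry and name."""
    def decorator(func):
        REGISTRY[registry_name][func_name] = func
        return func
    return decorator


@register("span_getters", "doc_spans.v1")
def configure_doc_spans():
    # Stand-in: the created function returns one "span" (the whole doc) per doc.
    return lambda docs: [[doc] for doc in docs]


@register("architectures", "spacy-transformers.TransformerModel.v1")
def transformer_model(name, tokenizer_config, get_spans):
    # Stand-in for the real architecture: it only records its arguments.
    return {"name": name, "tokenizer_config": tokenizer_config, "get_spans": get_spans}


def resolve(block):
    """Resolve nested blocks first, then call the function named by the @ key."""
    if not isinstance(block, dict):
        return block
    resolved = {key: resolve(value) for key, value in block.items()}
    at_keys = [key for key in resolved if key.startswith("@")]
    if not at_keys:
        return resolved  # a plain dict of settings, e.g. tokenizer_config
    registry_name = at_keys[0][1:]
    func = REGISTRY[registry_name][resolved.pop(at_keys[0])]
    return func(**resolved)  # all remaining settings become arguments


model_block = {
    "@architectures": "spacy-transformers.TransformerModel.v1",
    "name": "bert-base-cased",
    "tokenizer_config": {"use_fast": True},
    "get_spans": {"@span_getters": "doc_spans.v1"},
}
model = resolve(model_block)
print(model["name"])
print(model["get_spans"](["doc one", "doc two"]))
```

Note that the nested get_spans block is resolved before the outer block, so the architecture function receives an already-created function rather than a dict of settings.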

get_spans is a function that takes a batch of Doc objects and returns lists of potentially overlapping Span objects for the transformer to process. Several built-in functions are available, for example to process the whole document or individual sentences. When the config is resolved, the function is created and passed into the model as an argument.
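To make the shape of a span getter concrete, here is a minimal sketch of two strategies: one span covering the whole document, and overlapping fixed-size windows. The names `whole_doc_spans` and `configure_strided_spans`, and the `window` and `stride` parameters, are assumptions made for this illustration and are not claimed to match the built-in functions. Plain lists of strings stand in for Doc objects; the functions only rely on `len()` and slicing.

```python
# Sketch of two span-getter strategies. Plain lists stand in for Doc objects;
# only len() and slicing are used. All names here are illustrative.


def whole_doc_spans(docs):
    """Return one span per document, covering the whole document."""
    return [[doc[:]] for doc in docs]


def configure_strided_spans(window, stride):
    """Create a span getter that returns overlapping windows of up to
    `window` tokens, starting a new window every `stride` tokens."""
    def get_strided_spans(docs):
        spans = []
        for doc in docs:
            doc_spans = []
            start = 0
            while start < len(doc):
                doc_spans.append(doc[start:start + window])
                if start + window >= len(doc):
                    break  # this window already reaches the end of the doc
                start += stride
            spans.append(doc_spans)
        return spans
    return get_strided_spans


tokens = "The quick brown fox jumps over the lazy dog".split()
get_spans = configure_strided_spans(window=4, stride=2)
for span in get_spans([tokens])[0]:
    print(span)
```

With a stride smaller than the window, neighbouring spans overlap, which is why the spans are described as potentially overlapping.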

Remember that the config.cfg used for training should contain no missing values and requires all settings to be defined. You don't want any hidden defaults creeping in and changing your results! spaCy will tell you if settings are missing, and you can run spacy debug config with --auto-fill to automatically fill in all defaults.

Customizing the settings

To change any of the settings, you can edit the config.cfg and re-run the training. To change any of the functions, like the span getter, you can replace the name of the referenced function, for example by setting @span_getters = "sent_spans.v1" to process sentences. You can also register your own functions using the span_getters registry:

config.cfg

[components.transformer.model.get_spans]
@span_getters = "custom_sent_spans"
### code.py
import spacy_transformers

@spacy_transformers.registry.span_getters("custom_sent_spans")
def configure_custom_sent_spans():
    # The registered function creates and returns the actual span getter
    def get_sent_spans(docs):
        return [list(doc.sents) for doc in docs]

    return get_sent_spans

To resolve the config during training, spaCy needs to know about your custom function. You can make it available via the --code argument that can point to a Python file:

$ python -m spacy train ./train.spacy ./dev.spacy ./config.cfg --code ./code.py

Customizing the model implementations

The Transformer component expects a Thinc Model object to be passed in as its model argument. You're not limited to the implementation provided by spacy-transformers. The only requirement is that your registered function returns an object of type Model[List[Doc], FullTransformerBatch]: that is, a Thinc model that takes a list of Doc objects and returns a FullTransformerBatch object with the transformer data.

Model type annotations

In the documentation and code base, you may come across type annotations and descriptions of Thinc model types, like Model[List[Doc], List[Floats2d]]. This so-called generic type describes the layer and its input and output types. In this case, it takes a list of Doc objects as the input and returns a list of 2-dimensional arrays of floats as the output. You can read more in Thinc's documentation on defining models. Also see the notes on type checking for how to enable linting in your editor, so you get live feedback if your inputs and outputs don't match.

The same idea applies to task models that power the downstream components. Most of spaCy's built-in model creation functions support a tok2vec argument, which should be a Thinc layer of type Model[List[Doc], List[Floats2d]]. This is where we'll plug in our transformer model, using the Tok2VecListener layer, which sneakily delegates to the Transformer pipeline component.

### config.cfg (excerpt)
[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 3
hidden_width = 128
maxout_pieces = 3
use_upper = false

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"

The Tok2VecListener layer expects a pooling layer as the argument pooling, which needs to be of type Model[Ragged, Floats2d]. This layer determines how the vector for each spaCy token will be computed from the zero or more source rows the token is aligned against. Here we use the reduce_mean layer, which averages the wordpiece rows. You could instead use reduce_last, reduce_max or a custom function of your own.
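The pooling step can be sketched in plain Python. This is a simplified illustration of mean pooling over ragged data, not Thinc's implementation: the wordpiece rows are assumed to be already gathered in token order, `lengths` gives the number of rows per token, and returning a zero vector for a token with no aligned rows is an assumption made for this sketch.

```python
# Sketch of mean pooling from wordpiece rows to one vector per token.
# Assumptions: rows are already in token order, lengths[i] is the number of
# rows for token i, and tokens with zero rows get a zero vector.


def reduce_mean_rows(rows, lengths):
    """Average each token's wordpiece rows into a single vector."""
    width = len(rows[0]) if rows else 0
    pooled = []
    start = 0
    for length in lengths:
        chunk = rows[start:start + length]
        if chunk:
            # zip(*chunk) iterates over columns, so each column is averaged
            pooled.append([sum(column) / length for column in zip(*chunk)])
        else:
            pooled.append([0.0] * width)
        start += length
    return pooled


# Three wordpiece rows of width 2, aligned to three tokens:
# token 0 has two rows, token 1 has none, token 2 has one.
wordpiece_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
rows_per_token = [2, 0, 1]
print(reduce_mean_rows(wordpiece_rows, rows_per_token))
# → [[2.0, 3.0], [0.0, 0.0], [5.0, 6.0]]
```

Swapping the averaging line for a maximum over each column, or for the last row of the chunk, would give the behaviour of a max or last-row pooling strategy instead.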

You can have multiple components all listening to the same transformer model, and all passing gradients back to it. By default, all of the gradients will be equally weighted. You can control this with the grad_factor setting, which lets you reweight the gradients from the different listeners. For instance, setting grad_factor = 0 would disable gradients from one of the listeners, while grad_factor = 2.0 would multiply them by 2. This is similar to having a custom learning rate for each component. Instead of a constant, you can also provide a schedule, allowing you to freeze the shared parameters at the start of training.
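The effect of grad_factor can be sketched as a weighted sum of the gradients coming back from each listener. This is a conceptual illustration in plain Python, not the library's code: `combine_listener_gradients` and `frozen_then_constant` are hypothetical helpers, gradients are shown as flat lists of floats, and the schedule is modelled as a simple function of the training step.

```python
# Conceptual sketch of reweighting listener gradients with grad_factor.
# All names are illustrative; gradients are flat lists of floats.


def combine_listener_gradients(gradients, grad_factors):
    """Scale each listener's gradient by its grad_factor and sum them.
    Listeners without an explicit factor default to 1.0."""
    size = len(next(iter(gradients.values())))
    total = [0.0] * size
    for name, gradient in gradients.items():
        factor = grad_factors.get(name, 1.0)
        for i, value in enumerate(gradient):
            total[i] += factor * value
    return total


def frozen_then_constant(step, freeze_steps, factor=1.0):
    """A simple schedule: 0.0 (frozen) for the first steps, then constant."""
    return 0.0 if step < freeze_steps else factor


gradients = {"ner": [1.0, -2.0], "tagger": [0.5, 0.5]}

# Default: every listener is weighted equally.
print(combine_listener_gradients(gradients, {}))
# → [1.5, -1.5]

# Double the NER gradient and disable the tagger's contribution.
print(combine_listener_gradients(gradients, {"ner": 2.0, "tagger": 0.0}))
# → [2.0, -4.0]

# With a schedule, the shared weights receive no gradient early in training.
for step in (0, 99, 100):
    factor = frozen_then_constant(step, freeze_steps=100)
    print(step, combine_listener_gradients(gradients, {"ner": factor, "tagger": factor}))
```

While every factor is zero, the combined gradient is zero and the shared transformer weights stay fixed, which is what freezing them at the start of training amounts to.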