title | teaser | tag | source |
---|---|---|---|
Language | A text-processing pipeline | class | spacy/language.py |
Usually you'll load this once per process as nlp
and pass the instance around
your application. The Language
class is created when you call
spacy.load()
and contains the shared vocabulary
and language data, optional model data loaded from a
model package or a path, and a
processing pipeline containing components like
the tagger or parser that are called on a document in order. You can also add
your own processing pipeline components that take a Doc
object, modify it and
return it.
Language.__init__
Initialize a Language
object.
Example
    # Construction from subclass
    from spacy.lang.en import English
    nlp = English()

    # Construction from scratch
    from spacy.vocab import Vocab
    from spacy.language import Language
    nlp = Language(Vocab())
Name | Type | Description |
---|---|---|
vocab | Vocab | A Vocab object. If True, a vocab is created using the default language data settings. |
keyword-only | ||
max_length | int | Maximum number of characters allowed in a single text. Defaults to 10 ** 6. |
meta | dict | Custom meta data for the Language class. Is written to by models to add model meta data. |
create_tokenizer | Callable | Optional function that receives the nlp object and returns a tokenizer. |
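As a minimal sketch of the max_length setting described above, the snippet below builds a blank English pipeline with a deliberately tiny limit and checks that an over-long text is rejected. The limit of 20 characters and the variable names are assumptions chosen for illustration, and the snippet does not rely on the exact error message.

```python
from spacy.lang.en import English

# Assumption for illustration: a very small limit so the check is easy to trigger.
nlp = English(max_length=20)

# 11 characters, so this stays under the limit and is tokenized normally.
short_doc = nlp("Short text.")

try:
    nlp("x" * 21)  # one character over the limit
    too_long_rejected = False
except ValueError:
    too_long_rejected = True

print(len(short_doc), too_long_rejected)
```

Raising the limit is mostly a memory trade-off, so a small value like this is only useful for demonstration.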
Language.from_config
Create a Language
object from a loaded config. Will set up the tokenizer and
language data, add pipeline components based on the pipeline and components
defined in the config and validate the results. If no config is provided, the
default config of the given language is used. This is also how spaCy loads a
model under the hood based on its config.cfg
.
Example
    from thinc.api import Config
    from spacy.language import Language

    config = Config().from_disk("./config.cfg")
    nlp = Language.from_config(config)
Name | Type | Description |
---|---|---|
config | Dict[str, Any] / Config | The loaded config. |
keyword-only | ||
disable | Iterable[str] | List of pipeline component names to disable. |
auto_fill | bool | Whether to automatically fill in missing values in the config, based on defaults and function argument annotations. Defaults to True. |
validate | bool | Whether to validate the component config and arguments against the types expected by the factory. Defaults to True. |
RETURNS | Language | The initialized object. |
Language.component
Register a custom pipeline component under a given name. This allows
initializing the component by name using
Language.add_pipe
and referring to it in
config files. This classmethod and decorator is
intended for simple stateless functions that take a Doc
and return it. For
more complex stateful components that allow settings and need access to the
shared nlp
object, use the Language.factory
decorator. For more details and examples, see the
usage documentation.
Example
    from spacy.language import Language

    # Usage as a decorator
    @Language.component("my_component")
    def my_component(doc):
        # Do something to the doc
        return doc

    # Usage as a function
    Language.component("my_component2", func=my_component)
Name | Type | Description |
---|---|---|
name | str | The name of the component factory. |
keyword-only | ||
assigns | Iterable[str] | Doc or Token attributes assigned by this component, e.g. ["token.ent_id"]. Used for pipeline analysis. |
requires | Iterable[str] | Doc or Token attributes required by this component, e.g. ["token.ent_id"]. Used for pipeline analysis. |
retokenizes | bool | Whether the component changes tokenization. Used for pipeline analysis. |
scores | Iterable[str] | All scores set by the component if it's trainable, e.g. ["ents_f", "ents_r", "ents_p"]. |
default_score_weights | Dict[str, float] | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to 1.0 per component and will be combined and normalized for the whole pipeline. |
func | Optional[Callable] | Optional function if not used as a decorator. |
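To make the stateless case concrete, here is a small sketch of a registered function component being added to a blank pipeline and run on a text. The component name token_counter_sketch and the use of doc.user_data to store the result are assumptions made for this illustration only.

```python
import spacy
from spacy.language import Language


# "token_counter_sketch" is a made-up name used only for this illustration.
@Language.component("token_counter_sketch")
def token_counter(doc):
    # Stateless: record the token count on the Doc and hand the Doc back.
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")  # blank pipeline, no trained model required
nlp.add_pipe("token_counter_sketch")  # refer to the component by its registered name

doc = nlp("A short example text")
n_tokens = doc.user_data["n_tokens"]
print(n_tokens)
```

Because the function keeps no state and needs no settings, registering it with Language.component is enough; anything configurable is a better fit for Language.factory.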
Language.factory
Register a custom pipeline component factory under a given name. This allows
initializing the component by name using
Language.add_pipe
and referring to it in
config files. The registered factory function needs to
take at least two named arguments which spaCy fills in automatically: nlp
for the current nlp
object and name
for the component instance name. This
can be useful to distinguish multiple instances of the same component and allows
trainable components to add custom losses using the component instance name. The
default_config
defines the default values of the remaining factory arguments.
It's merged into the nlp.config
. For more details and
examples, see the
usage documentation.
Example
    from spacy.language import Language

    # Usage as a decorator. MyComponent stands in for your own component class.
    @Language.factory(
        "my_component",
        default_config={"some_setting": True},
    )
    def create_my_component(nlp, name, some_setting):
        return MyComponent(some_setting)

    # Usage as a function
    Language.factory(
        "my_component",
        default_config={"some_setting": True},
        func=create_my_component
    )
Name | Type | Description |
---|---|---|
name | str | The name of the component factory. |
keyword-only | ||
default_config | Dict[str, Any] | The default config, describing the default values of the factory arguments. |
assigns | Iterable[str] | Doc or Token attributes assigned by this component, e.g. ["token.ent_id"]. Used for pipeline analysis. |
requires | Iterable[str] | Doc or Token attributes required by this component, e.g. ["token.ent_id"]. Used for pipeline analysis. |
retokenizes | bool | Whether the component changes tokenization. Used for pipeline analysis. |
scores | Iterable[str] | All scores set by the component if it's trainable, e.g. ["ents_f", "ents_r", "ents_p"]. |
default_score_weights | Dict[str, float] | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to 1.0 per component and will be combined and normalized for the whole pipeline. |
func | Optional[Callable] | Optional function if not used as a decorator. |
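The following sketch shows one way a stateful component could be wired up through a factory, with the same factory added twice under different instance names and settings. The class LongTokenCollector, the factory name long_tokens_sketch and the min_length setting are all hypothetical and exist only for this example.

```python
import spacy
from spacy.language import Language


class LongTokenCollector:
    """Hypothetical stateful component: collects tokens of a minimum length."""

    def __init__(self, name, min_length):
        self.name = name  # instance name, filled in by spaCy via the factory
        self.min_length = min_length

    def __call__(self, doc):
        # Store results under the instance name so two instances don't clash.
        doc.user_data[self.name] = [
            token.text for token in doc if len(token.text) >= self.min_length
        ]
        return doc


# The factory receives nlp and name automatically; min_length comes from the config.
@Language.factory("long_tokens_sketch", default_config={"min_length": 8})
def create_long_token_collector(nlp, name, min_length: int):
    return LongTokenCollector(name, min_length)


nlp = spacy.blank("en")
nlp.add_pipe("long_tokens_sketch", name="long_tokens")  # uses the default of 8
nlp.add_pipe("long_tokens_sketch", name="medium_tokens", config={"min_length": 6})

doc = nlp("spaCy builds processing pipelines")
long_tokens = doc.user_data["long_tokens"]
medium_tokens = doc.user_data["medium_tokens"]
print(long_tokens, medium_tokens)
```

Keying the output by the instance name is one simple way to use the name argument to tell multiple instances of the same component apart.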
Language.__call__
Apply the pipeline to some text. The text can span multiple sentences, and can contain arbitrary whitespace. Alignment into the original string is preserved.
Example
    doc = nlp("An example sentence. Another sentence.")
    assert (doc[0].text, doc[0].head.tag_) == ("An", "NN")
Name | Type | Description |
---|---|---|
text | str | The text to be processed. |
keyword-only | ||
disable | List[str] | Names of pipeline components to disable. |
component_cfg | Dict[str, dict] | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to None. |
RETURNS | Doc | A container for accessing the annotations. |
Language.pipe
Process texts as a stream, and yield Doc
objects in order. This is usually
more efficient than processing texts one-by-one.
Example
    texts = ["One document.", "...", "Lots of documents"]
    for doc in nlp.pipe(texts, batch_size=50):
        assert doc.is_parsed
Name | Type | Description |
---|---|---|
texts | Iterable[str] | A sequence of strings. |
keyword-only | ||
as_tuples | bool | If set to True, inputs should be a sequence of (text, context) tuples. Output will then be a sequence of (doc, context) tuples. Defaults to False. |
batch_size | int | The number of texts to buffer. |
disable | List[str] | Names of pipeline components to disable. |
cleanup | bool | If True, unneeded strings are freed to control memory use. Experimental. |
component_cfg | Dict[str, dict] | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to None. |
n_process 2.2.2 | int | Number of processors to use, only supported in Python 3. Defaults to 1. |
YIELDS | Doc | Documents in the order of the original text. |
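The as_tuples option is easiest to see with a small example. The sketch below streams (text, context) pairs through a blank English pipeline and keeps each context next to its Doc; the sample texts and the id key in the context dictionaries are assumptions for illustration.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components needed

# Each input is a (text, context) tuple; the context is passed through untouched.
data = [
    ("First text.", {"id": 1}),
    ("Second longer text here.", {"id": 2}),
]

results = []
for doc, context in nlp.pipe(data, as_tuples=True, batch_size=2):
    results.append((len(doc), context["id"]))

print(results)
```

Carrying an ID or metadata record as the context is a convenient way to map processed documents back to their source rows.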
Language.begin_training
Initialize the pipeline for training, using data examples if available. Returns an
Optimizer
object.
Example
optimizer = nlp.begin_training(get_examples)
Name | Type | Description |
---|---|---|
get_examples | Callable[[], Iterable[Example]] | Optional function that returns gold-standard annotations in the form of Example objects. |
keyword-only | ||
sgd | Optimizer | An optional optimizer. Will be created via create_optimizer if not set. |
RETURNS | Optimizer | The optimizer. |
Language.resume_training
Continue training a pretrained model. Create and return an optimizer, and
initialize "rehearsal" for any pipeline component that has a rehearse
method.
Rehearsal is used to prevent models from "forgetting" their initialized
"knowledge". To perform rehearsal, collect samples of text you want the models
to retain performance on, and call nlp.rehearse
with
a batch of Example objects.
Example
    optimizer = nlp.resume_training()
    nlp.rehearse(examples, sgd=optimizer)
Name | Type | Description |
---|---|---|
keyword-only | ||
sgd | Optimizer | An optional optimizer. Will be created via create_optimizer if not set. |
RETURNS | Optimizer | The optimizer. |
Language.update
Update the models in the pipeline.
Example
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        example = Example.from_dict(doc, {"entities": entity_offsets})
        nlp.update([example], sgd=optimizer)
Name | Type | Description |
---|---|---|
examples | Iterable[Example] | A batch of Example objects to learn from. |
keyword-only | ||
drop | float | The dropout rate. |
sgd | Optimizer | The optimizer. |
losses | Dict[str, float] | Dictionary to update with the loss, keyed by pipeline component. |
component_cfg | Dict[str, dict] | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to None. |
RETURNS | Dict[str, float] | The updated losses dictionary. |
Language.rehearse
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the current model to make predictions similar to an initial model, to try to address the "catastrophic forgetting" problem. This feature is experimental.
Example
    optimizer = nlp.resume_training()
    losses = nlp.rehearse(examples, sgd=optimizer)
Name | Type | Description |
---|---|---|
examples | Iterable[Example] | A batch of Example objects to learn from. |
keyword-only | ||
drop | float | The dropout rate. |
sgd | Optimizer | The optimizer. |
losses | Dict[str, float] | Optional record of the loss during training. Updated using the component name as the key. |
RETURNS | Dict[str, float] | The updated losses dictionary. |
Language.evaluate
Evaluate a model's pipeline components.
Example
    scores = nlp.evaluate(examples, verbose=True)
    print(scores)
Name | Type | Description |
---|---|---|
examples | Iterable[Example] | A batch of Example objects to evaluate the pipeline on. |
keyword-only | ||
verbose | bool | Print debugging information. |
batch_size | int | The batch size to use. |
scorer | Scorer | Optional Scorer to use. If not passed in, a new one will be created. |
component_cfg | Dict[str, dict] | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to None. |
RETURNS | Dict[str, Union[float, dict]] | A dictionary of evaluation scores. |
Language.use_params
Replace weights of models in the pipeline with those provided in the params dictionary. Can be used as a context manager, in which case, models go back to their original weights after the block.
Example
    with nlp.use_params(optimizer.averages):
        nlp.to_disk("/tmp/checkpoint")
Name | Type | Description |
---|---|---|
params | dict | A dictionary of parameters keyed by model ID. |
Language.create_pipe
Create a pipeline component from a factory.
As of v3.0, the Language.add_pipe
method also takes
the string name of the factory, creates the component, adds it to the pipeline
and returns it. The Language.create_pipe
method is now mostly used internally.
To create a component and add it to the pipeline, you should always use
Language.add_pipe
.
Example
parser = nlp.create_pipe("parser")
Name | Type | Description |
---|---|---|
factory_name | str | Name of the registered component factory. |
name | str | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. |
keyword-only | ||
config 3 | Dict[str, Any] | Optional config parameters to use for this component. Will be merged with the default_config specified by the component factory. |
validate 3 | bool | Whether to validate the component config and arguments against the types expected by the factory. Defaults to True. |
RETURNS | callable | The pipeline component. |
Language.add_pipe
Add a component to the processing pipeline. Expects a name that maps to a
component factory registered using
@Language.component
or
@Language.factory
. Components should be callables
that take a Doc
object, modify it and return it. Only one of before
,
after
, first
or last
can be set. Default behavior is last=True
.
As of v3.0, the Language.add_pipe
method doesn't
take callables anymore and instead expects the name of a component factory
registered using @Language.component
or
@Language.factory
. It now takes care of creating the
component, adds it to the pipeline and returns it.
Example
    @Language.component("component")
    def component_func(doc):
        # modify Doc and return it
        return doc

    nlp.add_pipe("component", before="ner")
    component = nlp.add_pipe("component", name="custom_name", last=True)
Name | Type | Description |
---|---|---|
factory_name | str | Name of the registered component factory. |
name | str | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. |
keyword-only | ||
before | str / int | Component name or index to insert component directly before. |
after | str / int | Component name or index to insert component directly after. |
first | bool | Insert component first / not first in the pipeline. |
last | bool | Insert component last / not last in the pipeline. |
config 3 | Dict[str, Any] | Optional config parameters to use for this component. Will be merged with the default_config specified by the component factory. |
validate 3 | bool | Whether to validate the component config and arguments against the types expected by the factory. Defaults to True. |
RETURNS 3 | callable | The pipeline component. |
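The positioning arguments can be checked by inspecting nlp.pipe_names after each call. This is a small sketch using three do-nothing components; the names sketch_a, sketch_b and sketch_c are placeholders invented for the example.

```python
import spacy
from spacy.language import Language


# Three placeholder components that simply return the Doc unchanged.
@Language.component("sketch_a")
def sketch_a(doc):
    return doc


@Language.component("sketch_b")
def sketch_b(doc):
    return doc


@Language.component("sketch_c")
def sketch_c(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sketch_a")                     # default: appended last
nlp.add_pipe("sketch_b", first=True)         # inserted at the front
nlp.add_pipe("sketch_c", before="sketch_a")  # directly before an existing component
# The same factory can be added again under a different instance name.
nlp.add_pipe("sketch_a", name="sketch_a_again", after="sketch_b")

order = nlp.pipe_names
print(order)
```

Since only one of before, after, first or last can be set per call, building up a specific order is done one add_pipe call at a time.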
Language.has_factory
Check whether a factory name is registered on the Language
class or subclass.
Will check for
language-specific factories
registered on the subclass, as well as general-purpose factories registered on
the Language
base class, available to all subclasses.
Example
    from spacy.language import Language
    from spacy.lang.en import English

    @English.component("component")
    def component(doc):
        return doc

    assert English.has_factory("component")
    assert not Language.has_factory("component")
Name | Type | Description |
---|---|---|
name |
str | Name of the pipeline factory to check. |
RETURNS | bool | Whether a factory of that name is registered on the class. |
Language.has_pipe
Check whether a component is present in the pipeline. Equivalent to
name in nlp.pipe_names
.
Example
    @Language.component("component")
    def component(doc):
        return doc

    nlp.add_pipe("component", name="my_component")
    assert "my_component" in nlp.pipe_names
    assert nlp.has_pipe("my_component")
Name | Type | Description |
---|---|---|
name |
str | Name of the pipeline component to check. |
RETURNS | bool | Whether a component of that name exists in the pipeline. |
Language.get_pipe
Get a pipeline component for a given component name.
Example
    parser = nlp.get_pipe("parser")
    custom_component = nlp.get_pipe("custom_component")
Name | Type | Description |
---|---|---|
name |
str | Name of the pipeline component to get. |
RETURNS | callable | The pipeline component. |
Language.replace_pipe
Replace a component in the pipeline.
Example
nlp.replace_pipe("parser", my_custom_parser)
Name | Type | Description |
---|---|---|
name | str | Name of the component to replace. |
component | callable | The pipeline component to insert. |
keyword-only | ||
config 3 | Dict[str, Any] | Optional config parameters to use for the new component. Will be merged with the default_config specified by the component factory. |
validate 3 | bool | Whether to validate the component config and arguments against the types expected by the factory. Defaults to True. |
Language.rename_pipe
Rename a component in the pipeline. Useful to create custom names for
pre-defined and pre-loaded components. To change the default name of a component
added to the pipeline, you can also use the name
argument on
add_pipe
.
Example
nlp.rename_pipe("parser", "spacy_parser")
Name | Type | Description |
---|---|---|
old_name |
str | Name of the component to rename. |
new_name |
str | New name of the component. |
Language.remove_pipe
Remove a component from the pipeline. Returns the removed component name and component function.
Example
    name, component = nlp.remove_pipe("parser")
    assert name == "parser"
Name | Type | Description |
---|---|---|
name |
str | Name of the component to remove. |
RETURNS | tuple | A (name, component) tuple of the removed component. |
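Renaming, checking and removing a component fit together naturally, so the sketch below chains rename_pipe, has_pipe and remove_pipe on a blank pipeline. The component name noop_sketch and the instance names are assumptions for this illustration.

```python
import spacy
from spacy.language import Language


# Placeholder component invented for this example.
@Language.component("noop_sketch")
def noop(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("noop_sketch", name="original_name")

# Give the instance a new name, then confirm the pipeline knows it by that name.
nlp.rename_pipe("original_name", "new_name")
had_pipe = nlp.has_pipe("new_name")

# remove_pipe hands back the name and the component that was removed.
removed_name, removed_component = nlp.remove_pipe("new_name")
print(removed_name, nlp.pipe_names)
```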
Language.select_pipes
Disable one or more pipeline components. If used as a context manager, the
pipeline will be restored to the initial state at the end of the block.
Otherwise, a DisabledPipes
object is returned, that has a .restore()
method
you can use to undo your changes. You can specify either disable
(as a list or
string), or enable
. In the latter case, all components not in the enable
list will be disabled.
Example
    with nlp.select_pipes(disable=["tagger", "parser"]):
        nlp.begin_training()

    with nlp.select_pipes(enable="ner"):
        nlp.begin_training()

    disabled = nlp.select_pipes(disable=["tagger", "parser"])
    nlp.begin_training()
    disabled.restore()
As of spaCy v3.0, the disable_pipes
method has been renamed to select_pipes
:
- nlp.disable_pipes(["tagger", "parser"])
+ nlp.select_pipes(disable=["tagger", "parser"])
Name | Type | Description |
---|---|---|
keyword-only | ||
disable | str / list | Name(s) of pipeline components to disable. |
enable | str / list | Name(s) of pipeline components that will not be disabled. |
RETURNS | DisabledPipes | The disabled pipes that can be restored by calling the object's .restore() method. |
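The difference between disable and enable, and the restore-on-exit behavior, can be observed through nlp.pipe_names. The following is a sketch with two placeholder components; the names sp_a and sp_b are invented for the example.

```python
import spacy
from spacy.language import Language


# Two placeholder components invented for this example.
@Language.component("sp_a")
def sp_a(doc):
    return doc


@Language.component("sp_b")
def sp_b(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sp_a")
nlp.add_pipe("sp_b")

# disable: the listed components are switched off inside the block.
with nlp.select_pipes(disable=["sp_a"]):
    inside_disable = list(nlp.pipe_names)

# enable: everything except the listed components is switched off.
with nlp.select_pipes(enable="sp_a"):
    inside_enable = list(nlp.pipe_names)

# After both blocks the pipeline is back to its initial state.
after_blocks = list(nlp.pipe_names)
print(inside_disable, inside_enable, after_blocks)
```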
Language.get_factory_meta
Get the factory meta information for a given pipeline component name. Expects
the name of the component factory. The factory meta is an instance of the
FactoryMeta
dataclass and contains the
information about the component and its defaults provided by the
@Language.component
or
@Language.factory
decorator.
Example
    factory_meta = Language.get_factory_meta("ner")
    assert factory_meta.factory == "ner"
    print(factory_meta.default_config)
Name | Type | Description |
---|---|---|
name | str | The factory name. |
RETURNS | FactoryMeta | The factory meta. |
Language.get_pipe_meta
Get the factory meta information for a given pipeline component name. Expects
the name of the component instance in the pipeline. The factory meta is an
instance of the FactoryMeta
dataclass and
contains the information about the component and its defaults provided by the
@Language.component
or
@Language.factory
decorator.
Example
    nlp.add_pipe("ner", name="entity_recognizer")
    factory_meta = nlp.get_pipe_meta("entity_recognizer")
    assert factory_meta.factory == "ner"
    print(factory_meta.default_config)
Name | Type | Description |
---|---|---|
name | str | The pipeline component name. |
RETURNS | FactoryMeta | The factory meta. |
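The two lookups differ only in the key they expect: get_factory_meta takes the factory name, while get_pipe_meta takes the instance name in the pipeline. The sketch below contrasts them using a hypothetical factory called meta_sketch with an invented threshold setting.

```python
import spacy
from spacy.language import Language


# Hypothetical factory with one setting, used only to have some meta to inspect.
@Language.factory("meta_sketch", default_config={"threshold": 0.5})
def create_meta_sketch(nlp, name, threshold: float):
    def component(doc):
        return doc

    return component


nlp = spacy.blank("en")
nlp.add_pipe("meta_sketch", name="my_instance", config={"threshold": 0.9})

# Looked up by factory name on the class ...
factory_meta = Language.get_factory_meta("meta_sketch")
# ... and by instance name on the nlp object.
pipe_meta = nlp.get_pipe_meta("my_instance")

print(pipe_meta.factory, factory_meta.default_config)
```

Note that default_config on the meta reports the factory defaults, not the values passed to a particular instance.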
Language.meta
Custom meta data for the Language class. If a model is loaded, contains meta
data of the model. The Language.meta
is also what's serialized as the
meta.json
when you save an nlp
object to disk.
Example
print(nlp.meta)
Name | Type | Description |
---|---|---|
RETURNS | dict | The meta data. |
Language.config
Export a trainable config.cfg
for the current
nlp
object. Includes the current pipeline, all configs used to create the
currently active pipeline components, as well as the default training config
that can be used with spacy train
. Language.config
returns
a Thinc Config
object, which is a
subclass of the built-in dict
. It supports the additional methods to_disk
(serialize the config to a file) and to_str
(output the config as a string).
Example
    nlp.config.to_disk("./config.cfg")
    print(nlp.config.to_str())
Name | Type | Description |
---|---|---|
RETURNS | Config | The config. |
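Because the config is a dict subclass, it can be inspected with ordinary key access as well as exported as a string. Here is a short sketch, assuming a blank pipeline with one placeholder component named cfg_sketch (a name invented for this example).

```python
import spacy
from spacy.language import Language


# Placeholder component so the exported config has a pipeline entry.
@Language.component("cfg_sketch")
def cfg_sketch(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("cfg_sketch")

config = nlp.config
pipeline = list(config["nlp"]["pipeline"])                 # component names, in order
factory_name = config["components"]["cfg_sketch"]["factory"]
config_text = config.to_str()                              # same content as a string

print(pipeline, factory_name)
```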
Language.to_disk
Save the current state to a directory. If a model is loaded, this will include the model.
Example
nlp.to_disk("/path/to/models")
Name | Type | Description |
---|---|---|
path | str / Path | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. |
exclude | list | Names of pipeline components or serialization fields to exclude. |
Language.from_disk
Loads state from a directory. Modifies the object in place and returns it. If
the saved Language
object contains a model, the model will be loaded. Note
that this method is commonly used via the subclasses like English
or German
to make language-specific functionality like the
lexical attribute getters available to the
loaded object.
Example
    from spacy.language import Language
    nlp = Language().from_disk("/path/to/model")

    # using language-specific subclass
    from spacy.lang.en import English
    nlp = English().from_disk("/path/to/en_model")
Name | Type | Description |
---|---|---|
path | str / Path | A path to a directory. Paths may be either strings or Path-like objects. |
exclude | list | Names of pipeline components or serialization fields to exclude. |
RETURNS | Language | The modified Language object. |
Language.to_bytes
Serialize the current state to a binary string.
Example
nlp_bytes = nlp.to_bytes()
Name | Type | Description |
---|---|---|
exclude |
list | Names of pipeline components or serialization fields to exclude. |
RETURNS | bytes | The serialized form of the Language object. |
Language.from_bytes
Load state from a binary string. Note that this method is commonly used via the
subclasses like English
or German
to make language-specific functionality
like the lexical attribute getters
available to the loaded object.
Example
    from spacy.lang.en import English

    nlp_bytes = nlp.to_bytes()
    nlp2 = English()
    nlp2.from_bytes(nlp_bytes)
Name | Type | Description |
---|---|---|
bytes_data | bytes | The data to load from. |
exclude | list | Names of pipeline components or serialization fields to exclude. |
RETURNS | Language | The Language object. |
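A full round trip through to_bytes and from_bytes might look like the sketch below, which also shows the effect of the exclude argument on the meta field. The description text stored in the meta is an assumption made purely so there is something to compare after loading.

```python
from spacy.lang.en import English

nlp = English()
# Assumption for illustration: a custom description we can look for after loading.
nlp.meta["description"] = "Round-trip sketch"

data = nlp.to_bytes()

# Load into a fresh object of the same language subclass.
nlp2 = English().from_bytes(data)
restored_description = nlp2.meta["description"]

# Excluding "meta" skips restoring the meta data, so the custom text is not loaded.
nlp3 = English().from_bytes(data, exclude=["meta"])
excluded_description = nlp3.meta.get("description")

print(restored_description, repr(excluded_description))
```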
Attributes
Name | Type | Description |
---|---|---|
vocab | Vocab | A container for the lexical types. |
tokenizer | Tokenizer | The tokenizer. |
make_doc | Callable | Callable that takes a string and returns a Doc. |
pipeline | List[Tuple[str, Callable]] | List of (name, component) tuples describing the current processing pipeline, in order. |
pipe_names 2 | List[str] | List of pipeline component names, in order. |
pipe_labels 2.2 | Dict[str, List[str]] | Labels set by the pipeline components, if available, keyed by component name. |
pipe_factories 2.2 | Dict[str, str] | Dictionary of pipeline component names, mapped to their factory names. |
factories | Dict[str, Callable] | All available factory functions, keyed by name. |
factory_names 3 | List[str] | List of all available factory names. |
path 2 | Path | Path to the model data directory, if a model is loaded. Otherwise None. |
Class attributes
Name | Type | Description |
---|---|---|
Defaults |
class | Settings, data and factory methods for creating the nlp object and processing pipeline. |
lang |
str | Two-letter language ID, i.e. ISO code. |
default_config |
dict | Base config to use for Language.config. Defaults to default_config.cfg . |
Defaults
The following attributes can be set on the Language.Defaults
class to
customize the default language data:
Example
    from spacy.language import Language
    from spacy.lang.tokenizer_exceptions import URL_MATCH
    from thinc.api import Config

    DEFAULT_CONFIG = """
    [nlp.tokenizer]
    @tokenizers = "MyCustomTokenizer.v1"
    """

    class Defaults(Language.Defaults):
        stop_words = set()
        tokenizer_exceptions = {}
        prefixes = tuple()
        suffixes = tuple()
        infixes = tuple()
        token_match = None
        url_match = URL_MATCH
        lex_attr_getters = {}
        syntax_iterators = {}
        writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
        config = Config().from_str(DEFAULT_CONFIG)
Name | Description |
---|---|
stop_words | List of stop words, used for Token.is_stop. Example: stop_words.py |
tokenizer_exceptions | Tokenizer exception rules, string mapped to list of token attributes. Example: de/tokenizer_exceptions.py |
prefixes, suffixes, infixes | Prefix, suffix and infix rules for the default tokenizer. Example: punctuation.py |
token_match | Optional regex for matching strings that should never be split, overriding the infix rules. Example: fr/tokenizer_exceptions.py |
url_match | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match. Example: tokenizer_exceptions.py |
lex_attr_getters | Custom functions for setting lexical attributes on tokens, e.g. like_num. Example: lex_attrs.py |
syntax_iterators | Functions that compute views of a Doc object based on its syntax. At the moment, only used for noun chunks. Example: syntax_iterators.py |
writing_system | Information about the language's writing system, available via Vocab.writing_system. Defaults to: {"direction": "ltr", "has_case": True, "has_letters": True}. Example: zh/__init__.py |
config | Default config added to nlp.config. This can include references to custom tokenizers or lemmatizers. Example: zh/__init__.py |
Serialization fields
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the exclude
argument.
Example
    data = nlp.to_bytes(exclude=["tokenizer", "vocab"])
    nlp.from_disk("./model-data", exclude=["ner"])
Name | Description |
---|---|
vocab | The shared Vocab. |
tokenizer | Tokenization rules and exceptions. |
meta | The meta data, available as Language.meta. |
... | String names of pipeline components, e.g. "ner". |
FactoryMeta
The FactoryMeta
contains the information about the component and its defaults
provided by the @Language.component
or
@Language.factory
decorator. It's created whenever a
component is defined and stored on the Language
class for each component
instance and factory instance.
Name | Type | Description |
---|---|---|
factory | str | The name of the registered component factory. |
default_config | Dict[str, Any] | The default config, describing the default values of the factory arguments. |
assigns | Iterable[str] | Doc or Token attributes assigned by this component, e.g. ["token.ent_id"]. Used for pipeline analysis. |
requires | Iterable[str] | Doc or Token attributes required by this component, e.g. ["token.ent_id"]. Used for pipeline analysis. |
retokenizes | bool | Whether the component changes tokenization. Used for pipeline analysis. |
scores | Iterable[str] | All scores set by the component if it's trainable, e.g. ["ents_f", "ents_r", "ents_p"]. |
default_score_weights | Dict[str, float] | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to 1.0 per component and will be combined and normalized for the whole pipeline. |
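To illustrate the arithmetic behind "combined and normalized for the whole pipeline", the sketch below merges per-component weights and rescales them so the total is 1.0. This is not spaCy's implementation: the function combine_weights_sketch and the example weights are assumptions, and the library may handle rounding or overrides differently.

```python
from typing import Dict, List


def combine_weights_sketch(per_component: List[Dict[str, float]]) -> Dict[str, float]:
    """Illustrative only: merge score weights and rescale them to sum to 1.0."""
    combined: Dict[str, float] = {}
    for weights in per_component:
        for score_name, weight in weights.items():
            combined[score_name] = combined.get(score_name, 0.0) + weight
    total = sum(combined.values())
    if total == 0:
        return combined  # nothing to normalize
    return {name: weight / total for name, weight in combined.items()}


# Each component's weights sum to 1.0 on their own (hypothetical values).
tagger_weights = {"tag_acc": 1.0}
ner_weights = {"ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0}

combined = combine_weights_sketch([tagger_weights, ner_weights])
print(combined)
```

With two components that each sum to 1.0, every weight is halved, so each component contributes equally to the final score in this simplified model.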