---
title: Language
teaser: A text-processing pipeline
tag: class
source: spacy/language.py
---

Usually you'll load this once per process as `nlp` and pass the instance around
your application. The `Language` class is created when you call
[`spacy.load()`](/api/top-level#spacy.load) and contains the shared vocabulary
and [language data](/usage/adding-languages), optional model data loaded from a
[model package](/models) or a path, and a
[processing pipeline](/usage/processing-pipelines) containing components like
the tagger or parser that are called on a document in order. You can also add
your own processing pipeline components that take a `Doc` object, modify it and
return it.

## Language.\_\_init\_\_ {#init tag="method"}

Initialize a `Language` object.

> #### Example
>
> ```python
> # Construction from a shared Vocab
> from spacy.vocab import Vocab
> from spacy.language import Language
> nlp = Language(Vocab())
>
> # Construction from a language subclass
> from spacy.lang.en import English
> nlp = English()
> ```

| Name        | Type       | Description                                                                            |
| ----------- | ---------- | -------------------------------------------------------------------------------------- |
| `vocab`     | `Vocab`    | A `Vocab` object. If `True`, a vocab is created via `Language.Defaults.create_vocab`.  |
| `make_doc`  | callable   | A function that takes text and returns a `Doc` object. Usually a `Tokenizer`.          |
| `meta`      | dict       | Custom meta data for the `Language` class. Models write to this to add model meta data. |
| **RETURNS** | `Language` | The newly constructed object.                                                           |

## Language.\_\_call\_\_ {#call tag="method"}

Apply the pipeline to some text. The text can span multiple sentences, and can
contain arbitrary whitespace. Alignment into the original string is preserved.

> #### Example
>
> ```python
> doc = nlp(u"An example sentence. Another sentence.")
> assert (doc[0].text, doc[0].head.tag_) == ("An", "NN")
> ```

| Name        | Type    | Description                                                                       |
| ----------- | ------- | ---------------------------------------------------------------------------------- |
| `text`      | unicode | The text to be processed.                                                           |
| `disable`   | list    | Names of pipeline components to [disable](/usage/processing-pipelines#disabling).  |
| **RETURNS** | `Doc`   | A container for accessing the annotations.                                          |

Pipeline components to prevent from being loaded can now be added as a list to
`disable`, instead of specifying one keyword argument per component.

```diff
- doc = nlp(u"I don't want parsed", parse=False)
+ doc = nlp(u"I don't want parsed", disable=["parser"])
```

## Language.pipe {#pipe tag="method"}

Process texts as a stream, and yield `Doc` objects in order. This is usually
more efficient than processing texts one-by-one.

Early versions of spaCy used simple statistical models that could be
efficiently multi-threaded, as we were able to entirely release Python's global
interpreter lock. The multi-threading was controlled using the `n_threads`
keyword argument to the `.pipe` method. This keyword argument is deprecated as
of v2.1.0. Future versions may introduce an `n_process` argument for parallel
inference via multiprocessing.

> #### Example
>
> ```python
> texts = [u"One document.", u"...", u"Lots of documents"]
> for doc in nlp.pipe(texts, batch_size=50):
>     assert doc.is_parsed
> ```

| Name         | Type  | Description                                                                                                                                                |
| ------------ | ----- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `texts`      | -     | A sequence of unicode objects.                                                                                                                                |
| `as_tuples`  | bool  | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`.   |
| `n_threads`  | int   | The number of worker threads to use. If `-1`, OpenMP will decide how many to use at run time. Default is `2`. Deprecated as of v2.1.0 (see note above).      |
| `batch_size` | int   | The number of texts to buffer.                                                                                                                                |
| `disable`    | list  | Names of pipeline components to [disable](/usage/processing-pipelines#disabling).                                                                             |
| **YIELDS**   | `Doc` | Documents in the order of the original text.                                                                                                                  |
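The `as_tuples` mode is useful for carrying metadata alongside each streamed
document. Below is a minimal sketch; the texts and context dicts are made up
for illustration:

```python
from spacy.lang.en import English

nlp = English()

# Hypothetical (text, context) pairs: the context travels with each Doc
data = [(u"First document.", {"id": 1}),
        (u"Second document.", {"id": 2})]

for doc, context in nlp.pipe(data, as_tuples=True):
    # Each yielded item is a (Doc, context) tuple, in the original order
    print(context["id"], doc.text)
```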
## Language.update {#update tag="method"}

Update the models in the pipeline.

> #### Example
>
> ```python
> from spacy.gold import GoldParse
>
> for raw_text, entity_offsets in train_data:
>     doc = nlp.make_doc(raw_text)
>     gold = GoldParse(doc, entities=entity_offsets)
>     nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
> ```

| Name        | Type     | Description                                                                                                                                                                                                        |
| ----------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs`      | iterable | A batch of `Doc` objects or unicode. If unicode, a `Doc` object will be created from the text.                                                                                                                        |
| `golds`     | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init).  |
| `drop`      | float    | The dropout rate.                                                                                                                                                                                                     |
| `sgd`       | callable | An optimizer.                                                                                                                                                                                                         |
| **RETURNS** | dict     | Results from the update.                                                                                                                                                                                              |

## Language.begin_training {#begin_training tag="method"}

Allocate models, pre-process training data and acquire an optimizer.

> #### Example
>
> ```python
> optimizer = nlp.begin_training(gold_tuples)
> ```

| Name          | Type     | Description                  |
| ------------- | -------- | ---------------------------- |
| `gold_tuples` | iterable | Gold-standard training data. |
| `**cfg`       | -        | Config parameters.           |
| **RETURNS**   | callable | An optimizer.                |

## Language.use_params {#use_params tag="contextmanager, method"}

Replace weights of models in the pipeline with those provided in the params
dictionary. Can be used as a context manager, in which case, models go back to
their original weights after the block.

> #### Example
>
> ```python
> with nlp.use_params(optimizer.averages):
>     nlp.to_disk("/tmp/checkpoint")
> ```

| Name     | Type | Description                                   |
| -------- | ---- | --------------------------------------------- |
| `params` | dict | A dictionary of parameters keyed by model ID. |
| `**cfg`  | -    | Config parameters.                            |

## Language.preprocess_gold {#preprocess_gold tag="method"}

Can be called before training to pre-process gold data. By default, it handles
non-projectivity and adds missing tags to the tag map.

| Name         | Type     | Description                              |
| ------------ | -------- | ---------------------------------------- |
| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects. |
| **YIELDS**   | tuple    | Tuples of `Doc` and `GoldParse` objects. |

## Language.create_pipe {#create_pipe tag="method" new="2"}

Create a pipeline component from a factory.

> #### Example
>
> ```python
> parser = nlp.create_pipe("parser")
> nlp.add_pipe(parser)
> ```

| Name        | Type     | Description                                                                        |
| ----------- | -------- | ----------------------------------------------------------------------------------- |
| `name`      | unicode  | Factory name to look up in [`Language.factories`](/api/language#class-attributes).  |
| `config`    | dict     | Configuration parameters to initialize the component.                                |
| **RETURNS** | callable | The pipeline component.                                                              |
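Taken together, the methods above support a simple training loop: create and
add a component, call `begin_training`, update on batches, and save a
checkpoint with the averaged weights. Below is a minimal end-to-end sketch;
the training data, label and paths are made up for illustration:

```python
from spacy.lang.en import English
from spacy.gold import GoldParse

nlp = English()
# Create the built-in entity recognizer from its factory and add it
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("GPE")

# Hypothetical training data: (text, entity offsets) pairs
train_data = [(u"I like London.", [(7, 13, "GPE")])]

optimizer = nlp.begin_training()
for i in range(10):
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.update([doc], [gold], drop=0.5, sgd=optimizer)

# Save a checkpoint using the averaged weights
with nlp.use_params(optimizer.averages):
    nlp.to_disk("/tmp/checkpoint")
```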
## Language.add_pipe {#add_pipe tag="method" new="2"}

Add a component to the processing pipeline. Valid components are callables
that take a `Doc` object, modify it and return it. Only one of `before`,
`after`, `first` or `last` can be set. Default behavior is `last=True`.

> #### Example
>
> ```python
> def component(doc):
>     # Modify the Doc and return it
>     return doc
>
> nlp.add_pipe(component, before="ner")
> nlp.add_pipe(component, name="custom_name", last=True)
> ```

| Name        | Type     | Description                                                                                                                                                                                                                                            |
| ----------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `component` | callable | The pipeline component.                                                                                                                                                                                                                                     |
| `name`      | unicode  | Name of pipeline component. Overwrites existing `component.name` attribute if available. If no `name` is set and the component exposes no name attribute, `component.__name__` is used. An error is raised if the name already exists in the pipeline.      |
| `before`    | unicode  | Component name to insert component directly before.                                                                                                                                                                                                         |
| `after`     | unicode  | Component name to insert component directly after.                                                                                                                                                                                                          |
| `first`     | bool     | Insert component first / not first in the pipeline.                                                                                                                                                                                                         |
| `last`      | bool     | Insert component last / not last in the pipeline.                                                                                                                                                                                                           |

## Language.has_pipe {#has_pipe tag="method" new="2"}

Check whether a component is present in the pipeline. Equivalent to
`name in nlp.pipe_names`.

> #### Example
>
> ```python
> nlp.add_pipe(lambda doc: doc, name="component")
> assert "component" in nlp.pipe_names
> assert nlp.has_pipe("component")
> ```

| Name        | Type    | Description                                               |
| ----------- | ------- | --------------------------------------------------------- |
| `name`      | unicode | Name of the pipeline component to check.                  |
| **RETURNS** | bool    | Whether a component of that name exists in the pipeline.  |

## Language.get_pipe {#get_pipe tag="method" new="2"}

Get a pipeline component for a given component name.

> #### Example
>
> ```python
> parser = nlp.get_pipe("parser")
> custom_component = nlp.get_pipe("custom_component")
> ```

| Name        | Type     | Description                            |
| ----------- | -------- | -------------------------------------- |
| `name`      | unicode  | Name of the pipeline component to get. |
| **RETURNS** | callable | The pipeline component.                |

## Language.replace_pipe {#replace_pipe tag="method" new="2"}

Replace a component in the pipeline.

> #### Example
>
> ```python
> nlp.replace_pipe("parser", my_custom_parser)
> ```

| Name        | Type     | Description                       |
| ----------- | -------- | --------------------------------- |
| `name`      | unicode  | Name of the component to replace. |
| `component` | callable | The pipeline component to insert. |

## Language.rename_pipe {#rename_pipe tag="method" new="2"}

Rename a component in the pipeline. Useful to create custom names for
pre-defined and pre-loaded components. To change the default name of a
component added to the pipeline, you can also use the `name` argument on
[`add_pipe`](/api/language#add_pipe).

> #### Example
>
> ```python
> nlp.rename_pipe("parser", "spacy_parser")
> ```

| Name       | Type    | Description                      |
| ---------- | ------- | -------------------------------- |
| `old_name` | unicode | Name of the component to rename. |
| `new_name` | unicode | New name of the component.       |
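The methods above compose naturally. Below is a small sketch that inspects,
replaces and renames a component; the no-op component and the wrapper are
hypothetical, purely for illustration:

```python
from spacy.lang.en import English

nlp = English()
nlp.add_pipe(lambda doc: doc, name="custom_component")

if nlp.has_pipe("custom_component"):
    # Fetch the current component so the wrapper can delegate to it
    original = nlp.get_pipe("custom_component")

    def wrapped(doc):
        # Hypothetical wrapper: call the original, then return the Doc
        return original(doc)

    # Swap in the wrapper under the same name, then give it a new name
    nlp.replace_pipe("custom_component", wrapped)
    nlp.rename_pipe("custom_component", "wrapped_component")

print(nlp.pipe_names)  # ["wrapped_component"]
```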
## Language.remove_pipe {#remove_pipe tag="method" new="2"}

Remove a component from the pipeline. Returns the removed component name and
component function.

> #### Example
>
> ```python
> name, component = nlp.remove_pipe("parser")
> assert name == "parser"
> ```

| Name        | Type    | Description                                            |
| ----------- | ------- | ------------------------------------------------------ |
| `name`      | unicode | Name of the component to remove.                       |
| **RETURNS** | tuple   | A `(name, component)` tuple of the removed component.  |

## Language.disable_pipes {#disable_pipes tag="contextmanager, method" new="2"}

Disable one or more pipeline components. If used as a context manager, the
pipeline will be restored to the initial state at the end of the block.
Otherwise, a `DisabledPipes` object is returned, which has a `.restore()`
method you can use to undo your changes.

> #### Example
>
> ```python
> with nlp.disable_pipes("tagger", "parser"):
>     nlp.begin_training()
>
> disabled = nlp.disable_pipes("tagger", "parser")
> nlp.begin_training()
> disabled.restore()
> ```

| Name        | Type            | Description                                                                           |
| ----------- | --------------- | -------------------------------------------------------------------------------------- |
| `*disabled` | unicode         | Names of pipeline components to disable.                                                |
| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method.   |

## Language.to_disk {#to_disk tag="method" new="2"}

Save the current state to a directory. If a model is loaded, this will
**include the model**.

> #### Example
>
> ```python
> nlp.to_disk("/path/to/models")
> ```

| Name      | Type             | Description                                                                                                            |
| --------- | ---------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `path`    | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects.     |
| `disable` | list             | Names of pipeline components to [disable](/usage/processing-pipelines#disabling) and prevent from being saved.            |

## Language.from_disk {#from_disk tag="method" new="2"}

Load state from a directory. Modifies the object in place and returns it. If
the saved `Language` object contains a model, the model will be loaded. Note
that this method is commonly used via subclasses like `English` or `German` to
make language-specific functionality like the
[lexical attribute getters](/usage/adding-languages#lex-attrs) available to the
loaded object.

> #### Example
>
> ```python
> from spacy.language import Language
> nlp = Language().from_disk("/path/to/model")
>
> # Using a language-specific subclass
> from spacy.lang.en import English
> nlp = English().from_disk("/path/to/en_model")
> ```

| Name        | Type             | Description                                                                        |
| ----------- | ---------------- | ------------------------------------------------------------------------------------ |
| `path`      | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects.            |
| `disable`   | list             | Names of pipeline components to [disable](/usage/processing-pipelines#disabling).     |
| **RETURNS** | `Language`       | The modified `Language` object.                                                       |
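The `disable` argument works the same way on save and load. Below is a small
round-trip sketch; it assumes the `en_core_web_sm` package is installed and
that `/tmp/model` is writable:

```python
import spacy
from spacy.lang.en import English

nlp = spacy.load("en_core_web_sm")

# Save the pipeline, skipping the named entity recognizer
nlp.to_disk("/tmp/model", disable=["ner"])

# Reload, disabling the same component
nlp2 = English().from_disk("/tmp/model", disable=["ner"])
```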
As of spaCy v2.0, the `save_to_directory` method has been renamed to
`to_disk`, to improve consistency across classes. Pipeline components to
prevent from being loaded can now be added as a list to `disable`, instead of
specifying one keyword argument per component.

```diff
- nlp = spacy.load("en", tagger=False, entity=False)
+ nlp = English().from_disk("/model", disable=["tagger", "ner"])
```

## Language.to_bytes {#to_bytes tag="method"}

Serialize the current state to a binary string.

> #### Example
>
> ```python
> nlp_bytes = nlp.to_bytes()
> ```

| Name        | Type  | Description                                                                                                          |
| ----------- | ----- | ------------------------------------------------------------------------------------------------------------------- |
| `disable`   | list  | Names of pipeline components to [disable](/usage/processing-pipelines#disabling) and prevent from being serialized.  |
| **RETURNS** | bytes | The serialized form of the `Language` object.                                                                         |

## Language.from_bytes {#from_bytes tag="method"}

Load state from a binary string. Note that this method is commonly used via
subclasses like `English` or `German` to make language-specific functionality
like the [lexical attribute getters](/usage/adding-languages#lex-attrs)
available to the loaded object.

> #### Example
>
> ```python
> from spacy.lang.en import English
> nlp = English()
> nlp_bytes = nlp.to_bytes()
> nlp2 = English()
> nlp2.from_bytes(nlp_bytes)
> ```

| Name         | Type       | Description                                                                        |
| ------------ | ---------- | ------------------------------------------------------------------------------------ |
| `bytes_data` | bytes      | The data to load from.                                                                |
| `disable`    | list       | Names of pipeline components to [disable](/usage/processing-pipelines#disabling).     |
| **RETURNS**  | `Language` | The `Language` object.                                                                |

Pipeline components to prevent from being loaded can now be added as a list to
`disable`, instead of specifying one keyword argument per component.

```diff
- nlp = English().from_bytes(bytes, tagger=False, entity=False)
+ nlp = English().from_bytes(bytes, disable=["tagger", "ner"])
```

## Attributes {#attributes}

| Name         | Type               | Description                                                                                        |
| ------------ | ------------------ | ---------------------------------------------------------------------------------------------------- |
| `vocab`      | `Vocab`            | A container for the lexical types.                                                                    |
| `tokenizer`  | `Tokenizer`        | The tokenizer.                                                                                        |
| `make_doc`   | `lambda text: Doc` | Create a `Doc` object from unicode text.                                                              |
| `pipeline`   | list               | List of `(name, component)` tuples describing the current processing pipeline, in order.              |
| `pipe_names` | list               | List of pipeline component names, in order.                                                           |
| `meta`       | dict               | Custom meta data for the `Language` class. If a model is loaded, contains meta data of the model.     |
| `path`       | `Path`             | Path to the model data directory, if a model is loaded. Otherwise `None`.                             |

## Class attributes {#class-attributes}

| Name        | Type    | Description                                                                                                                            |
| ----------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `Defaults`  | class   | Settings, data and factory methods for creating the `nlp` object and processing pipeline.                                                 |
| `lang`      | unicode | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).                                           |
| `factories` | dict    | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name.       |
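The `factories` mapping is also writable, which is how custom components can
be made available to [`create_pipe`](/api/language#create_pipe). Below is a
minimal sketch; the factory and component names are hypothetical:

```python
from spacy.language import Language

# Hypothetical component factory: receives the nlp object plus config
def create_my_component(nlp, **cfg):
    def my_component(doc):
        # Modify the Doc and return it
        return doc
    return my_component

# Register the factory under a name, then create the component from it
Language.factories["my_component"] = create_my_component
nlp = Language()
component = nlp.create_pipe("my_component")
nlp.add_pipe(component)
```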