spaCy/website/usage/_processing-pipelines/_extensions.jade

439 lines
21 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

//- 💫 DOCS > USAGE > PROCESSING PIPELINES > EXTENSIONS
p
| As of v2.0, spaCy allows you to set any custom attributes and methods
| on the #[code Doc], #[code Span] and #[code Token], which become
| available as #[code Doc._], #[code Span._] and #[code Token._] for
| example, #[code Token._.my_attr]. This lets you store additional
| information relevant to your application, add new features and
| functionality to spaCy, and implement your own models trained with other
| machine learning libraries. It also lets you take advantage of spaCy's
| data structures and the #[code Doc] object as the "single source of
| truth".
+aside("Why ._?")
| Writing to a #[code ._] attribute instead of to the #[code Doc] directly
| keeps a clearer separation and makes it easier to ensure backwards
| compatibility. For example, if you've implemented your own #[code .coref]
| property and spaCy claims it one day, it'll break your code. Similarly,
| just by looking at the code, you'll immediately know what's built-in and
| what's custom for example, #[code doc.sentiment] is spaCy, while
| #[code doc._.sent_score] isn't.
p
| There are three main types of extensions, which can be defined using the
| #[+api("doc#set_extension") #[code Doc.set_extension]],
| #[+api("span#set_extension") #[code Span.set_extension]] and
| #[+api("token#set_extension") #[code Token.set_extension]] methods.
+list("numbers")
+item #[strong Attribute extensions].
| Set a default value for an attribute, which can be overwritten
| manually at any time. Attribute extensions work like "normal"
| variables and are the quickest way to store arbitrary information
| on a #[code Doc], #[code Span] or #[code Token]. Attribute defaults
| behaves just like argument defaults
| #[+a("http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments") in Python functions],
| and should not be used for mutable values like dictionaries or lists.
+code-wrapper
+code.
Doc.set_extension('hello', default=True)
assert doc._.hello
doc._.hello = False
+item #[strong Property extensions].
| Define a getter and an optional setter function. If no setter is
| provided, the extension is immutable. Since the getter and setter
| functions are only called when you #[em retrieve] the attribute,
| you can also access values of previously added attribute extensions.
| For example, a #[code Doc] getter can average over #[code Token]
| attributes. For #[code Span] extensions, you'll almost always want
| to use a property otherwise, you'd have to write to
| #[em every possible] #[code Span] in the #[code Doc] to set up the
| values correctly.
+code-wrapper
+code.
Doc.set_extension('hello', getter=get_hello_value, setter=set_hello_value)
assert doc._.hello
doc._.hello = 'Hi!'
+item #[strong Method extensions].
| Assign a function that becomes available as an object method. Method
| extensions are always immutable. For more details and implementation
| ideas, see
| #[+a("/usage/examples#custom-components-attr-methods") these examples].
+code-wrapper
+code.o-no-block.
Doc.set_extension('hello', method=lambda doc, name: 'Hi {}!'.format(name))
assert doc._.hello('Bob') == 'Hi Bob!'
p
| Before you can access a custom extension, you need to register it using
| the #[code set_extension] method on the object you want
| to add it to, e.g. the #[code Doc]. Keep in mind that extensions are
| always #[strong added globally] and not just on a particular instance.
| If an attribute of the same name
| already exists, or if you're trying to access an attribute that hasn't
| been registered, spaCy will raise an #[code AttributeError].
+code("Example").
from spacy.tokens import Doc, Span, Token
fruits = ['apple', 'pear', 'banana', 'orange', 'strawberry']
is_fruit_getter = lambda token: token.text in fruits
has_fruit_getter = lambda obj: any([t.text in fruits for t in obj])
Token.set_extension('is_fruit', getter=is_fruit_getter)
Doc.set_extension('has_fruit', getter=has_fruit_getter)
Span.set_extension('has_fruit', getter=has_fruit_getter)
+aside-code("Usage example").
doc = nlp(u"I have an apple and a melon")
assert doc[3]._.is_fruit # get Token attributes
assert not doc[0]._.is_fruit
assert doc._.has_fruit # get Doc attributes
assert doc[1:4]._.has_fruit # get Span attributes
p
| Once you've registered your custom attribute, you can also use the
| built-in #[code set], #[code get] and #[code has] methods to modify and
| retrieve the attributes. This is especially useful it you want to pass in
| a string instead of calling #[code doc._.my_attr].
+table(["Method", "Description", "Valid for", "Example"])
+row
+cell #[code ._.set()]
+cell Set a value for an attribute.
+cell Attributes, mutable properties.
+cell #[code.u-break token._.set('my_attr', True)]
+row
+cell #[code ._.get()]
+cell Get the value of an attribute.
+cell Attributes, mutable properties, immutable properties, methods.
+cell #[code.u-break my_attr = span._.get('my_attr')]
+row
+cell #[code ._.has()]
+cell Check if an attribute exists.
+cell Attributes, mutable properties, immutable properties, methods.
+cell #[code.u-break doc._.has('my_attr')]
+infobox("How the ._ is implemented")
| Extension definitions the defaults, methods, getters and setters you
| pass in to #[code set_extension] are stored in class attributes on the
| #[code Underscore] class. If you write to an extension attribute, e.g.
| #[code doc._.hello = True], the data is stored within the
| #[+api("doc#attributes") #[code Doc.user_data]] dictionary. To keep the
| underscore data separate from your other dictionary entries, the string
| #[code "._."] is placed before the name, in a tuple.
+h(4, "component-example1") Example: Custom sentence segmentation logic
p
| Let's say you want to implement custom logic to improve spaCy's sentence
| boundary detection. Currently, sentence segmentation is based on the
| dependency parse, which doesn't always produce ideal results. The custom
| logic should therefore be applied #[strong after] tokenization, but
| #[strong before] the dependency parsing this way, the parser can also
| take advantage of the sentence boundaries.
+code-exec.
import spacy
def custom_sentencizer(doc):
for i, token in enumerate(doc[:-2]):
# define sentence start if period + titlecase token
if token.text == '.' and doc[i+1].is_title:
doc[i+1].sent_start = True
return doc
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(custom_sentencizer, before='parser') # insert before parser
doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:
print(sent.text)
+h(4, "component-example2")
| Example: Pipeline component for entity matching and tagging with
| custom attributes
p
| This example shows how to create a spaCy extension that takes a
| terminology list (in this case, single- and multi-word company names),
| matches the occurences in a document, labels them as #[code ORG] entities,
| merges the tokens and sets custom #[code is_tech_org] and
| #[code has_tech_org] attributes. For efficient matching, the example uses
| the #[+api("phrasematcher") #[code PhraseMatcher]] which accepts
| #[code Doc] objects as match patterns and works well for large
| terminology lists. It also ensures your patterns will always match, even
| when you customise spaCy's tokenization rules. When you call #[code nlp]
| on a text, the custom pipeline component is applied to the #[code Doc]
+github("spacy", "examples/pipeline/custom_component_entities.py", 500)
p
| Wrapping this functionality in a
| pipeline component allows you to reuse the module with different
| settings, and have all pre-processing taken care of when you call
| #[code nlp] on your text and receive a #[code Doc] object.
+h(4, "component-example3")
| Example: Pipeline component for GPE entities and country meta data via a
| REST API
p
| This example shows the implementation of a pipeline component
| that fetches country meta data via the
| #[+a("https://restcountries.eu") REST Countries API] sets entity
| annotations for countries, merges entities into one token and
| sets custom attributes on the #[code Doc], #[code Span] and
| #[code Token] for example, the capital, latitude/longitude coordinates
| and even the country flag.
+github("spacy", "examples/pipeline/custom_component_countries_api.py", 500)
p
| In this case, all data can be fetched on initialisation in one request.
| However, if you're working with text that contains incomplete country
| names, spelling mistakes or foreign-language versions, you could also
| implement a #[code like_country]-style getter function that makes a
| request to the search API endpoint and returns the best-matching
| result.
+h(4, "custom-components-usage-ideas") Other usage ideas
+list
+item
| #[strong Adding new features and hooking in models]. For example,
| a sentiment analysis model, or your preferred solution for
| lemmatization or sentiment analysis. spaCy's built-in tagger,
| parser and entity recognizer respect annotations that were already
| set on the #[code Doc] in a previous step of the pipeline.
+item
| #[strong Integrating other libraries and APIs]. For example, your
| pipeline component can write additional information and data
| directly to the #[code Doc] or #[code Token] as custom attributes,
| while making sure no information is lost in the process. This can
| be output generated by other libraries and models, or an external
| service with a REST API.
+item
| #[strong Debugging and logging]. For example, a component which
| stores and/or exports relevant information about the current state
| of the processed document, and insert it at any point of your
| pipeline.
+infobox("Developing third-party extensions")
| The new pipeline management and custom attributes finally make it easy
| to develop your own spaCy extensions and plugins and share them with
| others. Extensions can claim their own #[code ._] namespace and exist as
| standalone packages. If you're developing a tool or library and want to
| make it easy for others to use it with spaCy and add it to their
| pipeline, all you have to do is expose a function that takes a
| #[code Doc], modifies it and returns it. For more details and
| #[strong best practices], see the section on
| #[+a("#extensions") developing spaCy extensions].
+h(3, "custom-components-user-hooks") User hooks
p
| While it's generally recommended to use the #[code Doc._], #[code Span._]
| and #[code Token._] proxies to add your own custom attributes, spaCy
| offers a few exceptions to allow #[strong customising the built-in methods]
| like #[+api("doc#similarity") #[code Doc.similarity]] or
| #[+api("doc#vector") #[code Doc.vector]]. with your own hooks, which can
| rely on statistical models you train yourself. For instance, you can
| provide your own on-the-fly sentence segmentation algorithm or document
| similarity method.
p
| Hooks let you customize some of the behaviours of the #[code Doc],
| #[code Span] or #[code Token] objects by adding a component to the
| pipeline. For instance, to customize the
| #[+api("doc#similarity") #[code Doc.similarity]] method, you can add a
| component that sets a custom function to
| #[code doc.user_hooks['similarity']]. The built-in #[code Doc.similarity]
| method will check the #[code user_hooks] dict, and delegate to your
| function if you've set one. Similar results can be achieved by setting
| functions to #[code Doc.user_span_hooks] and #[code Doc.user_token_hooks].
+aside("Implementation note")
| The hooks live on the #[code Doc] object because the #[code Span] and
| #[code Token] objects are created lazily, and don't own any data. They
| just proxy to their parent #[code Doc]. This turns out to be convenient
| here — we only have to worry about installing hooks in one place.
+table(["Name", "Customises"])
+row
+cell #[code user_hooks]
+cell
+api("doc#vector") #[code Doc.vector]
+api("doc#has_vector") #[code Doc.has_vector]
+api("doc#vector_norm") #[code Doc.vector_norm]
+api("doc#sents") #[code Doc.sents]
+row
+cell #[code user_token_hooks]
+cell
+api("token#similarity") #[code Token.similarity]
+api("token#vector") #[code Token.vector]
+api("token#has_vector") #[code Token.has_vector]
+api("token#vector_norm") #[code Token.vector_norm]
+api("token#conjuncts") #[code Token.conjuncts]
+row
+cell #[code user_span_hooks]
+cell
+api("span#similarity") #[code Span.similarity]
+api("span#vector") #[code Span.vector]
+api("span#has_vector") #[code Span.has_vector]
+api("span#vector_norm") #[code Span.vector_norm]
+api("span#root") #[code Span.root]
+code("Add custom similarity hooks").
class SimilarityModel(object):
def __init__(self, model):
self._model = model
def __call__(self, doc):
doc.user_hooks['similarity'] = self.similarity
doc.user_span_hooks['similarity'] = self.similarity
doc.user_token_hooks['similarity'] = self.similarity
def similarity(self, obj1, obj2):
y = self._model([obj1.vector, obj2.vector])
return float(y[0])
+h(3, "extensions") Developing spaCy extensions
p
| We're very excited about all the new possibilities for community
| extensions and plugins in spaCy v2.0, and we can't wait to see what
| you build with it! To get you started, here are a few tips, tricks and
| best practices. #[+a("/universe/?category=pipeline") See here] for
| examples of other spaCy extensions.
+list
+item
| Make sure to choose a #[strong descriptive and specific name] for
| your pipeline component class, and set it as its #[code name]
| attribute. Avoid names that are too common or likely to clash with
| built-in or a user's other custom components. While it's fine to call
| your package "spacy_my_extension", avoid component names including
| "spacy", since this can easily lead to confusion.
+code-wrapper
+code-new name = 'myapp_lemmatizer'
+code-old name = 'lemmatizer'
+item
| When writing to #[code Doc], #[code Token] or #[code Span] objects,
| #[strong use getter functions] wherever possible, and avoid setting
| values explicitly. Tokens and spans don't own any data themselves,
| so you should provide a function that allows them to compute the
| values instead of writing static properties to individual objects.
+code-wrapper
+code-new.
is_fruit = lambda token: token.text in ('apple', 'orange')
Token.set_extension('is_fruit', getter=is_fruit)
+code-old.
token._.set_extension('is_fruit', default=False)
if token.text in ('apple', 'orange'):
token._.set('is_fruit', True)
+item
| When using #[strong mutable values] like dictionaries or lists as
| the #[code default] argument, keep in mind that they behave just like
| mutable default arguments
| #[+a("http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments") in Python functions].
| This can easily cause unintended results, like the same value being
| set on #[em all] objects instead of only one particular instance.
| In most cases, it's better to use #[strong getters and setters], and
| only set the #[code default] for boolean or string values.
+code-wrapper
+code-new.
Doc.set_extension('fruits', getter=get_fruits, setter=set_fruits)
+code-old.
Doc.set_extension('fruits', default={})
doc._.fruits['apple'] = u'🍎' # all docs now have {'apple': u'🍎'}
+item
| Always add your custom attributes to the #[strong global] #[code Doc]
| #[code Token] or #[code Span] objects, not a particular instance of
| them. Add the attributes #[strong as early as possible], e.g. in
| your extension's #[code __init__] method or in the global scope of
| your module. This means that in the case of namespace collisions,
| the user will see an error immediately, not just when they run their
| pipeline.
+code-wrapper
+code-new.
from spacy.tokens import Doc
def __init__(attr='my_attr'):
Doc.set_extension(attr, getter=self.get_doc_attr)
+code-old.
def __call__(doc):
doc.set_extension('my_attr', getter=self.get_doc_attr)
+item
| If your extension is setting properties on the #[code Doc],
| #[code Token] or #[code Span], include an option to
| #[strong let the user to change those attribute names]. This makes
| it easier to avoid namespace collisions and accommodate users with
| different naming preferences. We recommend adding an #[code attrs]
| argument to the #[code __init__] method of your class so you can
| write the names to class attributes and reuse them across your
| component.
+code-wrapper
+code-new Doc.set_extension(self.doc_attr, default='some value')
+code-old Doc.set_extension('my_doc_attr', default='some value')
+item
| Ideally, extensions should be #[strong standalone packages] with
| spaCy and optionally, other packages specified as a dependency. They
| can freely assign to their own #[code ._] namespace, but should stick
| to that. If your extension's only job is to provide a better
| #[code .similarity] implementation, and your docs state this
| explicitly, there's no problem with writing to the
| #[+a("#custom-components-user-hooks") #[code user_hooks]], and
| overwriting spaCy's built-in method. However, a third-party
| extension should #[strong never silently overwrite built-ins], or
| attributes set by other extensions.
+item
| If you're looking to publish a model that depends on a custom
| pipeline component, you can either #[strong require it] in the model
| package's dependencies, or if the component is specific and
| lightweight choose to #[strong ship it with your model package]
| and add it to the #[code Language] instance returned by the
| model's #[code load()] method. For examples of this, check out the
| implementations of spaCy's
| #[+api("util#load_model_from_init_py") #[code load_model_from_init_py]]
| and #[+api("util#load_model_from_path") #[code load_model_from_path]]
| utility functions.
+code-wrapper
+code-new.
nlp.add_pipe(my_custom_component)
return nlp.from_disk(model_path)
+item
| Once you're ready to share your extension with others, make sure to
| #[strong add docs and installation instructions] (you can
| always link to this page for more info). Make it easy for others to
| install and use your extension, for example by uploading it to
| #[+a("https://pypi.python.org") PyPi]. If you're sharing your code on
| GitHub, don't forget to tag it
| with #[+a("https://github.com/topics/spacy?o=desc&s=stars") #[code spacy]]
| and #[+a("https://github.com/topics/spacy-extension?o=desc&s=stars") #[code spacy-extension]]
| to help people find it. If you post it on Twitter, feel free to tag
| #[+a("https://twitter.com/" + SOCIAL.twitter) @#{SOCIAL.twitter}]
| so we can check it out.