spaCy/website/docs/usage/entity-recognition.jade

240 lines
9.2 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

//- 💫 DOCS > USAGE > NAMED ENTITY RECOGNITION
include ../../_includes/_mixins
p
| spaCy features an extremely fast statistical entity recognition system,
| that assigns labels to contiguous spans of tokens. The default model
| identifies a variety of named and numeric entities, including companies,
| locations, organizations and products. You can add arbitrary classes to
| the entity recognition system, and update the model with new examples.
+aside-code("Example").
import spacy
nlp = spacy.load('en')
doc = nlp(u'London is a big city in the United Kingdom.')
for ent in doc.ents:
print(ent.label_, ent.text)
# GPE London
# GPE United Kingdom
p
| The standard way to access entity annotations is the
| #[+api("doc#ents") #[code doc.ents]] property, which produces a sequence
| of #[+api("span") #[code Span]] objects. The entity type is accessible
| either as an integer ID or as a string, using the attributes
| #[code ent.label] and #[code ent.label_]. The #[code Span] object acts
| as a sequence of tokens, so you can iterate over the entity or index into
| it. You can also get the text form of the whole entity, as though it were
| a single token. See the #[+api("span") API reference] for more details.
p
| You can access token entity annotations using the #[code token.ent_iob]
| and #[code token.ent_type] attributes. The #[code token.ent_iob]
| attribute indicates whether an entity starts, continues or ends on the
| tag (In, Begin, Out).
+code("Example").
doc = nlp(u'London is a big city in the United Kingdom.')
print(doc[0].text, doc[0].ent_iob, doc[0].ent_type_)
# (u'London', 2, u'GPE')
print(doc[1].text, doc[1].ent_iob, doc[1].ent_type_)
# (u'is', 3, u'')
+h(2, "setting") Setting entity annotations
p
| To ensure that the sequence of token annotations remains consistent, you
| have to set entity annotations at the document level — you can't write
| directly to the #[code token.ent_iob] or #[code token.ent_type]
| attributes. The easiest way to set entities is to assign to the
| #[code doc.ents] attribute.
+code("Example").
doc = nlp(u'London is a big city in the United Kingdom.')
doc.ents = []
assert doc[0].ent_type_ == ''
doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings['GPE'])]
assert doc[0].ent_type_ == 'GPE'
doc.ents = []
doc.ents = [(u'LondonCity', doc.vocab.strings['GPE'], 0, 1)]
p
| The value you assign should be a sequence, the values of which
| can either be #[code Span] objects, or #[code (ent_id, ent_type, start, end)]
| tuples, where #[code start] and #[code end] are token offsets that
| describe the slice of the document that should be annotated.
p
| You can also assign entity annotations using the #[code doc.from_array()]
| method. To do this, you should include both the #[code ENT_TYPE] and the
| #[code ENT_IOB] attributes in the array you're importing from.
+code("Example").
from spacy.attrs import ENT_IOB, ENT_TYPE
import numpy
doc = nlp.make_doc(u'London is a big city in the United Kingdom.')
assert list(doc.ents) == []
header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)))
attr_array[0, 0] = 2 # B
attr_array[0, 1] = doc.vocab.strings[u'GPE']
doc.from_array(header, attr_array)
assert list(doc.ents)[0].text == u'London'
p
| Finally, you can always write to the underlying struct, if you compile
| a Cython function. This is easy to do, and allows you to write efficient
| native code.
+code("Example").
# cython: infer_types=True
from spacy.tokens.doc cimport Doc
cpdef set_entity(Doc doc, int start, int end, int ent_type):
for i in range(start, end):
doc.c[i].ent_type = ent_type
doc.c[start].ent_iob = 3
for i in range(start+1, end):
doc.c[i].ent_iob = 2
p
| Obviously, if you write directly to the array of #[code TokenC*] structs,
| you'll have responsibility for ensuring that the data is left in a
| consistent state.
+h(2, "displacy") Visualizing named entities
p
| The #[+a(DEMOS_URL + "/displacy-ent/") displaCy #[sup ENT] visualizer]
| lets you explore an entity recognition model's behaviour interactively.
| If you're training a model, it's very useful to run the visualization
| yourself. To help you do that, spaCy v2.0+ comes with a visualization
| module. Simply pass a #[code Doc] or a list of #[code Doc] objects to
| displaCy and run #[+api("displacy#serve") #[code displacy.serve]] to
| run the web server, or #[+api("displacy#render") #[code displacy.render]]
| to generate the raw markup.
p
| For more details and examples, see the
| #[+a("/docs/usage/visualizers") usage workflow on visualizing spaCy].
+code("Named Entity example").
import spacy
from spacy import displacy
text = """But Google is starting from behind. The company made a late push
into hardware, and Apples Siri, available on iPhones, and Amazons Alexa
software, which runs on its Echo and Dot devices, have clear leads in
consumer adoption."""
nlp = spacy.load('custom_ner_model')
doc = nlp(text)
displacy.serve(doc, style='ent')
+codepen("a73f8b68f9af3157855962b283b364e4", 345)
+h(2, "entity-types") Built-in entity types
include ../api/_annotation/_named-entities
+aside("Install")
| The #[+api("load") spacy.load()] function configures a pipeline that
| includes all of the available annotators for the given ID. In the example
| above, the #[code 'en'] ID tells spaCy to load the default English
| pipeline. If you have installed the data with
| #[code python -m spacy.en.download] this will include the entity
| recognition model.
+h(2, "updating") Training and updating
p
| To provide training examples to the entity recogniser, you'll first need
| to create an instance of the #[code GoldParse] class. You can specify
| your annotations in a stand-off format or as token tags.
+code.
import spacy
import random
from spacy.gold import GoldParse
from spacy.language import EntityRecognizer
train_data = [
('Who is Chaka Khan?', [(7, 17, 'PERSON')]),
('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])
]
nlp = spacy.load('en', entity=False, parser=False)
ner = EntityRecognizer(nlp.vocab, entity_types=['PERSON', 'LOC'])
for itn in range(5):
random.shuffle(train_data)
for raw_text, entity_offsets in train_data:
doc = nlp.make_doc(raw_text)
gold = GoldParse(doc, entities=entity_offsets)
nlp.tagger(doc)
ner.update(doc, gold)
ner.model.end_training()
p
| If a character offset in your entity annotations don't fall on a token
| boundary, the #[code GoldParse] class will treat that annotation as a
| missing value. This allows for more realistic training, because the
| entity recogniser is allowed to learn from examples that may feature
| tokenizer errors.
+aside-code("Example").
doc = Doc(nlp.vocab, [u'rats', u'make', u'good', u'pets'])
gold = GoldParse(doc, [u'U-ANIMAL', u'O', u'O', u'O'])
ner = EntityRecognizer(nlp.vocab, entity_types=['ANIMAL'])
ner.update(doc, gold)
p
| You can also provide token-level entity annotation, using the
| following tagging scheme to describe the entity boundaries:
+table([ "Tag", "Description" ])
+row
+cell #[code #[span.u-color-theme B] EGIN]
+cell The first token of a multi-token entity.
+row
+cell #[code #[span.u-color-theme I] N]
+cell An inner token of a multi-token entity.
+row
+cell #[code #[span.u-color-theme L] AST]
+cell The final token of a multi-token entity.
+row
+cell #[code #[span.u-color-theme U] NIT]
+cell A single-token entity.
+row
+cell #[code #[span.u-color-theme O] UT]
+cell A non-entity token.
+aside("Why BILUO, not IOB?")
| There are several coding schemes for encoding entity annotations as
| token tags. These coding schemes are equally expressive, but not
| necessarily equally learnable.
| #[+a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth]
| showed that the minimal #[strong Begin], #[strong In], #[strong Out]
| scheme was more difficult to learn than the #[strong BILUO] scheme that
| we use, which explicitly marks boundary tokens.
p
| spaCy translates the character offsets into this scheme, in order to
| decide the cost of each action given the current state of the entity
| recogniser. The costs are then used to calculate the gradient of the
| loss, to train the model. The exact algorithm is a pastiche of
| well-known methods, and is not currently described in any single
| publication. The model is a greedy transition-based parser guided by a
| linear model whose weights are learned using the averaged perceptron
| loss, via the #[+a("http://www.aclweb.org/anthology/C12-1059") dynamic oracle]
| imitation learning strategy. The transition system is equivalent to the
| BILOU tagging scheme.