spaCy/website/docs/_api-span.jade

306 lines
11 KiB
Plaintext
Raw Normal View History

2016-10-03 18:19:13 +00:00
//- ----------------------------------
//- 💫 DOCS > API > SPAN
//- ----------------------------------
+section("span")
+h(2, "span", "https://github.com/" + SOCIAL.github + "/spaCy/blob/master/spacy/tokens/span.pyx")
| #[+tag class] Span
p.
A slice of a #[code Doc] object, consisting of zero or
more tokens. Spans are usually used to represent sentences, named entities,
phrases.
+aside("Implementation").
#[code Span] objects are views — that is, they do not copy the
underlying C data. This makes them cheap to construct, as internally are
simply a reference to the #[code Doc] object, a start position, an end
2016-03-31 14:24:48 +00:00
position, and a label ID.
2016-10-03 18:19:13 +00:00
+code("python", "Overview").
2016-03-31 14:24:48 +00:00
class Span:
doc = Doc
start = int
end = int
label = int
def __init__(self, doc, start, end, label=0, vector=None, vector_norm=None):
return self
def __len__(self):
return int
def __getitem__(self, i):
return Token()
def __iter__(self):
yield Token()
def similarity(self, other):
return float
def merge(self, tag, lemma, ent_type):
return None
2016-10-03 18:19:13 +00:00
2016-03-31 14:24:48 +00:00
@property
def label_(self):
return unicode
@property
def vector(self):
2016-10-03 18:19:13 +00:00
return numpy.ndarray(dtype="float64")
2016-03-31 14:24:48 +00:00
@property
def vector_norm(self):
return float
@property
def text(self):
return unicode
@property
def text_with_ws(self):
return unicode
@property
def orth_(self):
return unicode
@property
def lemma_(self):
return unicode
@property
def root(self):
return Token()
@property
def lefts(self):
yield Token()
@property
def rights(self):
yield Token()
@property
def subtree(self):
yield Token()
2016-10-03 18:19:13 +00:00
+section("span-create")
+h(3, "span-init")
| #[+tag Section] Create a Span
p.
Span instances are usually created via the #[code Doc] object.
2016-03-31 14:24:48 +00:00
2016-10-03 18:19:13 +00:00
+table(["Example", "Description"])
2016-03-31 14:24:48 +00:00
+row
+cell #[code.lang-python span = doc[4 : 7]]
+cell Produce a span with tokens 4, 5 and 6.
+row
+cell #[code.lang-python span = Span(doc, start, end, label=spacy.symbols.PERSON)]
+cell Calling #[code Span.__init__] directly allows you to set a label.
+row
+cell #[code.lang-python for entity in doc.ents]
+cell See #[a(href="/docs#doc-spans-ents") Doc.ents]
+row
+cell #[code.lang-python for sentence in doc.sents]
+cell See #[a(href="/docs#doc-spans-sents") Doc.sents]
+row
+cell #[code.lang-python for noun_phrase in doc.noun_chunks]
+cell See #[a(href="/docs#doc-spans-nounchunks") Doc.noun_chunks]
2016-10-03 18:19:13 +00:00
+code("python", "Definition").
2016-03-31 14:24:48 +00:00
def __init__(self, doc, start, end, label=0, vector=None, vector_norm=None):
return Span()
2016-10-03 18:19:13 +00:00
+table(["Name", "Type", "Description"])
2016-03-31 14:24:48 +00:00
+row
+cell doc
+cell Doc
+cell The parent doc object, to slice from.
+row
+cell start
+cell int
+cell The index of the first token in the slice.
+row
+cell end
+cell int
+cell The index of the first token #[em outside] the slice (since ranges are exclusive in Python).
+row
+cell label
+cell int or unicode
2016-10-03 18:19:13 +00:00
+cell A label for the span. Either a string, or an integer ID, that should refer to a string mapped by the #[code Doc] object"s #[code StringStore].
2016-03-31 14:24:48 +00:00
+row
+cell vector
2016-10-03 18:19:13 +00:00
+cell
2016-03-31 14:24:48 +00:00
+cell
+row
+cell vector_norm
+cell
2016-10-03 18:19:13 +00:00
+cell
+section("span-merge")
+h(3, "span-merge")
| #[+tag method] Span.merge
p.
Merge the span into a single token, modifying the underlying
#[code.lang-python Doc] object in place.
+aside("Caveat").
2016-03-31 14:24:48 +00:00
Magic is done to allow you to call #[code.lang-python merge()]
without invalidating other #[code.lang-python Span] objects.
2016-10-03 18:19:13 +00:00
However, it"s difficult to ensure all indices are recomputed
correctly. Please report any errors encountered on the issue
2016-03-31 14:24:48 +00:00
tracker.
2016-10-03 18:19:13 +00:00
+code("python", "Example").
2016-03-31 14:24:48 +00:00
for ent in doc.ents:
ent.merge(ent.root.tag_, ent.text, ent.label_)
for np in doc.noun_chunks:
while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
np = np[1:]
np.merge(np.root.tag_, np.text, np.root.ent_type_)
2016-10-03 18:19:13 +00:00
+code("python", "Definition").
2016-03-31 14:24:48 +00:00
def merge(self, tag, lemma, ent_type):
return None
2016-10-03 18:19:13 +00:00
+table(["Name", "Type", "Description"])
2016-03-31 14:24:48 +00:00
+row
+cell tag
+cell unicode
+cell The fine-grained part-of-speech tag to assign to the new token.
+row
+cell lemma
+cell unicode
+cell The lemma string for the new token.
+row
+cell ent_type
+cell unicode
+cell The named entity type to assign to the new token.
2016-10-03 18:19:13 +00:00
+section("span-similarity")
+h(3, "span-similarity")
| #[+tag method] Span.similarity
2016-03-31 14:24:48 +00:00
2016-10-03 18:19:13 +00:00
p Estimate the semantic similarity between the span and another #[code Span], #[code Doc], #[code Token] or #[code Lexeme].
+aside("Algorithm").
2016-03-31 14:24:48 +00:00
Similarity is estimated
using the cosine metric, between #[code Span.vector] and #[code other.vector].
By default, #[code Span.vector] is computed by averaging the vectors
of its tokens.
2016-10-03 18:19:13 +00:00
+code("python", "Example").
2016-03-31 14:24:48 +00:00
doc = nlp("Apples and oranges are similar. Boots and hippos aren't.")
apples_sent, boots_sent = doc.sents
fruit = doc.vocab[u'fruit']
assert apples_sent.similarity(fruit) > boot_sent.similarity(fruit)
2016-10-03 18:19:13 +00:00
+code("python", "Definition").
2016-03-31 14:24:48 +00:00
def similarity(self, other):
return float
2016-10-03 18:19:13 +00:00
+table(["Name", "Type", "Description"])
2016-03-31 14:24:48 +00:00
+row
+cell other
+cell Token, Span, Doc or Lexeme
+cell The other object to judge similarity with.
2016-10-03 18:19:13 +00:00
+section("span-sequence")
+h(3, "span-sequence")
| #[+tag section] Span as a Sequence
2016-03-31 14:24:48 +00:00
p.
#[code Span] objects act as a sequence of #[code Token] objects. In
this way they mirror the API of the #[code Doc] object.
2016-10-03 18:19:13 +00:00
+table(["Name", "Description"], "params")
2016-03-31 14:24:48 +00:00
+row
+cell #[code.lang-python token = span[i]]
2016-10-03 18:19:13 +00:00
+cell.
Get the #[code Token] object at position #[em i], where
#[code i] is an offset within the #[code Span], not the
document. That is, if you have #[code.lang-python span = doc[4:6]],
then #[code.lang-python span[0].i == 4]
2016-03-31 14:24:48 +00:00
+row
+cell #[code.lang-python for token in span]
+cell.
Iterate over the #[code Token] objects in the span.
+row
+cell __len__
2016-10-03 18:19:13 +00:00
+cell Number of tokens in the span.
2016-03-31 14:24:48 +00:00
+row
+cell text
+cell.
2016-10-03 18:19:13 +00:00
The text content of the span, obtained from
2016-03-31 14:24:48 +00:00
#[code ''.join(token.text_with_ws for token in span)].
+row
+cell start
+cell.
The start offset of the span, i.e. #[code span[0].i].
+row
+cell end
+cell.
The end offset of the span, i.e. #[code span[-1].i + 1].
2016-10-03 18:19:13 +00:00
+section("span-navigating-parse")
+h(3, "span-navigativing-parse")
| #[+tag Section] Span and the Syntactic Parse
2016-03-31 14:24:48 +00:00
p.
Span objects allow similar access to the syntactic parse as individual
tokens.
2016-10-03 18:19:13 +00:00
+table(["Name", "Type", "Description"])
2016-03-31 14:24:48 +00:00
+row
+cell root
+cell #[code.lang-python Token]
2016-10-03 18:19:13 +00:00
+cell.
The word with the shortest path to the root of the sentence is
the root of the span.
2016-03-31 14:24:48 +00:00
+row
+cell lefts
+cell #[code.lang-python yield Token]
+cell Tokens that are to the left of the span, whose head is within it.
+row
+cell rights
+cell #[code.lang-python yield Token]
+cell Tokens that are to the right of the span, whose head is within it.
+row
+cell subtree
+cell #[code.lang-python yield Token]
2016-10-03 18:19:13 +00:00
+cell.
Tokens in the range #[code (start, end+1)], where #[code start]
is the index of the leftmost word descended from a token in the
span, and #[code end] is the index of the rightmost token descended
from a token in the span.
2016-03-31 14:24:48 +00:00
2016-10-03 18:19:13 +00:00
+section("span-strings")
+h(3, "span-strings")
| #[+tag section] Span"s Strings API
2016-03-31 14:24:48 +00:00
p.
You can access the textual content of the span, and different view of
it, with the following properties.
2016-10-03 18:19:13 +00:00
+table(["Name", "Type", "Description"])
2016-03-31 14:24:48 +00:00
+row
+cell text_with_ws
+cell unicode
+cell.
2016-10-03 18:19:13 +00:00
The form of the span as it appears in the string, including
trailing whitespace. This is useful when you need to use linguistic
2016-03-31 14:24:48 +00:00
features to add inline mark-up to the string.
+row
+cell lemma / lemma_
+cell int / unicode
+cell.
Whitespace-concatenated lemmas of each token in the span.
+row
+cell label / label_
+cell int / unicode
+cell.
The span label, used particularly for named entities.