//- ----------------------------------
//- 💫 DOCS > API > LANGUAGE
//- ----------------------------------
+section("language")
+h(2, "language", "https://github.com/" + SOCIAL.github + "/spaCy/blob/master/spacy/language.py")
| #[+tag class] Language
p.
A pipeline that transforms text strings into annotated spaCy Doc objects. Usually you'll load the Language pipeline once and pass the instance around your program.
+code("python", "Overview").
    class Language:
        Defaults = BaseDefaults

        def __init__(self, path=True, **overrides):
            self.vocab = Vocab()
            self.tokenizer = Tokenizer()
            self.tagger = Tagger()
            self.parser = DependencyParser()
            self.entity = EntityRecognizer()
            self.make_doc = lambda text: Doc()
            self.pipeline = [self.tagger, self.parser, self.entity]

        def __call__(self, text, **toggle):
            doc = self.make_doc(text)
            # Run each component unless it is switched off via a keyword.
            for process in self.pipeline:
                if toggle.get(process.name, True):
                    process(doc)
            return doc

        def pipe(self, texts_iterator, batch_size=1000, n_threads=2, **toggle):
            docs = (self.make_doc(text) for text in texts_iterator)
            for process in self.pipeline:
                if toggle.get(process.name, True):
                    docs = process.pipe(docs, batch_size=batch_size, n_threads=n_threads)
            for doc in docs:
                yield doc

        def end_training(self, path=None):
            return None

    class English(Language):
        class Defaults(BaseDefaults):
            pass

    class German(Language):
        class Defaults(BaseDefaults):
            pass
+section("english-init")
+h(3, "english-init")
| #[+tag method] Language.__init__
p
| Load the pipeline. You can disable components by passing #[code None] as a value,
| e.g. pass #[code parser=None, vectors=None] to save memory if you're not using
| those components. You can also pass your own object as the value.
| Pass a function as #[code create_pipeline] to use a custom pipeline; see
| the custom pipeline tutorial.
+aside("Efficiency").
Loading takes 10-20 seconds, and the instance consumes 2 to 3
gigabytes of memory. The intended use is to create one instance
for each language per process, but you can create more
if you're doing something unusual. You may wish to make the
instance a global variable or "singleton".
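p
| One way to follow the "load once" advice is a cached loader function.
| The sketch below is an illustration only: #[code load_pipeline] and
| #[code ExpensivePipeline] are hypothetical stand-ins, not part of spaCy.
| In real code the class would be a pipeline such as #[code English].

```python
from functools import lru_cache


class ExpensivePipeline:
    """Hypothetical stand-in for a slow-to-load pipeline."""

    instances_created = 0

    def __init__(self, lang):
        ExpensivePipeline.instances_created += 1
        self.lang = lang


@lru_cache(maxsize=None)
def load_pipeline(lang):
    # The first call per language pays the loading cost; later calls
    # with the same argument return the cached instance.
    return ExpensivePipeline(lang)


nlp_a = load_pipeline("en")
nlp_b = load_pipeline("en")
```

p
| Both names refer to the same object, so the loading cost is paid once per process.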
+table(["Example", "Description"])
+row
+cell #[code nlp = English()]
+cell Load everything, from the default path.
+row
+cell #[code nlp = English(path='my_data')]
+cell Load everything, from the specified path.
+row
+cell #[code nlp = English(path=path_obj)]
+cell Load everything, from an object that follows the #[code pathlib.Path] protocol.
+row
+cell #[code nlp = English(parser=False, vectors=False)]
+cell Load everything except the parser and the word vectors.
+row
+cell #[code nlp = English(parser=my_parser)]
+cell Load everything, and use a custom parser.
+row
+cell #[code nlp = English(create_pipeline=my_pipeline)]
+cell Load everything, and use a custom pipeline.
+code("python", "Definition").
    def __init__(self, path=True, **overrides):
        D = self.Defaults
        self.vocab = Vocab(path=path, parent=self, **D.vocab) \
            if 'vocab' not in overrides \
            else overrides['vocab']
        self.tokenizer = Tokenizer(self.vocab, path=path, **D.tokenizer) \
            if 'tokenizer' not in overrides \
            else overrides['tokenizer']
        self.tagger = Tagger(self.vocab, path=path, **D.tagger) \
            if 'tagger' not in overrides \
            else overrides['tagger']
        self.parser = DependencyParser(self.vocab, path=path, **D.parser) \
            if 'parser' not in overrides \
            else overrides['parser']
        self.entity = EntityRecognizer(self.vocab, path=path, **D.entity) \
            if 'entity' not in overrides \
            else overrides['entity']
        self.matcher = Matcher(self.vocab, path=path, **D.matcher) \
            if 'matcher' not in overrides \
            else overrides['matcher']
        if 'make_doc' in overrides:
            self.make_doc = overrides['make_doc']
        elif 'create_make_doc' in overrides:
            self.make_doc = overrides['create_make_doc'](self)
        else:
            self.make_doc = lambda text: self.tokenizer(text)
        if 'pipeline' in overrides:
            self.pipeline = overrides['pipeline']
        elif 'create_pipeline' in overrides:
            self.pipeline = overrides['create_pipeline'](self)
        else:
            self.pipeline = [self.tagger, self.parser, self.matcher, self.entity]
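p
| The override logic in the definition can be tried out without loading any
| models. The following is a toy sketch, not spaCy's implementation:
| #[code MiniLanguage], #[code tagger_only] and the string "components" are
| hypothetical stand-ins that only mirror the #[code overrides] checks.

```python
class MiniLanguage:
    """Toy class that mirrors the override checks with placeholder components."""

    def __init__(self, **overrides):
        # A key that is present wins, even if its value is None or False;
        # otherwise the default component is built.
        self.tagger = overrides["tagger"] if "tagger" in overrides else "default-tagger"
        self.parser = overrides["parser"] if "parser" in overrides else "default-parser"
        if "pipeline" in overrides:
            self.pipeline = overrides["pipeline"]
        elif "create_pipeline" in overrides:
            # The factory receives the partly built instance.
            self.pipeline = overrides["create_pipeline"](self)
        else:
            self.pipeline = [self.tagger, self.parser]


def tagger_only(nlp):
    # Hypothetical pipeline factory: keep only the tagger.
    return [nlp.tagger]


default = MiniLanguage()
no_parser = MiniLanguage(parser=None)
custom = MiniLanguage(create_pipeline=tagger_only)
```

p
| Here #[code no_parser.parser] is #[code None] because the key was supplied,
| and #[code custom.pipeline] contains only what the factory returned.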
+section("language-call")
+h(3, "language-call")
| #[+tag method] Language.__call__
p
| The main entry point to spaCy. Takes raw unicode text, and returns
| a #[code Doc] object, which can be iterated to access #[code Token]
| and #[code Span] objects.
+aside("Efficiency").
spaCy's algorithms are all linear-time, so you can supply
documents of arbitrary length, e.g. whole novels.
+table(["Example", "Description"], "code")
+row
+cell #[ doc = nlp(u'Some text.')]
+cell Apply the full pipeline.
+row
+cell #[ doc = nlp(u'Some text.', parse=False)]
+cell Apply the tagger and entity recognizer, but not the parser.
+row
+cell #[ doc = nlp(u'Some text.', entity=False)]
+cell Apply the tagger and parser, but not the entity recognizer.
+row
+cell #[ doc = nlp(u'Some text.', tag=False)]
+cell Do not apply the tagger, entity recognizer or parser.
+row
+cell #[ doc = nlp(u'')]
+cell Empty input yields a #[code Doc] with no tokens; not an error.
+row
+cell #[ doc = nlp(b'Some text')]
+cell Error: the input must be unicode.
+row
+cell #[ doc = nlp(b'Some text'.decode('utf8'))]
+cell Decode bytes into unicode first.
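p
| If your input may arrive as either bytes or unicode, a small helper can
| normalize it before it reaches the pipeline. This is a minimal sketch:
| #[code ensure_text] is a hypothetical helper, not a spaCy function, and
| UTF-8 is an assumed default encoding.

```python
def ensure_text(data, encoding="utf8"):
    """Return a unicode string, decoding byte strings first."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


decoded = ensure_text(b"Some text")
unchanged = ensure_text(u"Some text")
```

p
| The result can then be passed to the pipeline in place of the raw input.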
+code("python", "Definition").
    def __call__(self, text, tag=True, parse=True, entity=True, matcher=True):
        return Doc()
+table(["Name", "Type", "Description"])
+row
+cell text
+cell #[+a(link_unicode) unicode]
+cell.
The text to be processed. spaCy expects raw unicode text, so
you don't necessarily need to, say, split it into paragraphs.
However, depending on your documents, you might be better
off applying custom pre-processing. Non-text formatting,
e.g. from HTML mark-up, should be removed before sending
the document to spaCy. If your documents have a consistent
format, you may be able to improve accuracy by pre-processing.
For instance, if the first word of each document is always
in upper-case, it may be helpful to normalize it before
supplying the document to spaCy.
+row
+cell tag
+cell #[+a(link_bool) bool]
+cell.
Whether to apply the part-of-speech tagger. Required for
parsing and entity recognition.
+row
+cell parse
+cell #[+a(link_bool) bool]
+cell.
Whether to apply the syntactic dependency parser.
+row
+cell entity
+cell #[+a(link_bool) bool]
+cell.
Whether to apply the named entity recognizer.
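p
| The effect of the boolean flags can be sketched with stand-in components.
| This is an illustration under stated assumptions, not spaCy code:
| #[code Component] and #[code run_pipeline] are hypothetical, and the
| component names are assumed to match the keyword names in the table.

```python
class Component:
    """Hypothetical stand-in for a pipeline component such as the tagger."""

    def __init__(self, name):
        self.name = name

    def __call__(self, doc):
        # Record that this component ran; a real component would annotate the doc.
        doc.append(self.name)


def run_pipeline(pipeline, doc, **toggle):
    for process in pipeline:
        # Each component runs unless the caller passes <name>=False.
        if toggle.get(process.name, True):
            process(doc)
    return doc


pipeline = [Component("tag"), Component("parse"), Component("entity")]
full = run_pipeline(pipeline, [])
no_parse = run_pipeline(pipeline, [], parse=False)
```

p
| Here a plain list stands in for the #[code Doc], so the result shows which components ran.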
+section("english-pipe")
+h(3, "english-pipe")
| #[+tag method] Language.pipe
p
| Parse a sequence of texts into a sequence of #[code Doc] objects.
| Accepts a generator as input, and produces a generator as output.
| Internally, it accumulates a buffer of #[code batch_size]
| texts, works on them with #[code n_threads] workers in parallel,
| and then yields the #[code Doc] objects one by one.
+aside("Efficiency").
spaCy releases the global interpreter lock around the parser and
named entity recognizer, allowing shared-memory parallelism via
OpenMP. However, OpenMP is not supported on OSX, so multiple
threads will only be used on Linux and Windows.
+table(["Example", "Description"], "usage")
+row
+cell #[+a("https://github.com/" + SOCIAL.github + "/spaCy/blob/master/examples/parallel_parse.py") parallel_parse.py]
+cell Parse comments from Reddit in parallel.
+code("python", "Definition").
    def pipe(self, texts, n_threads=2, batch_size=1000):
        yield Doc()
+table(["Arg", "Type", "Description"])
+row
+cell texts
+cell
+cell.
A sequence of unicode objects. Usually you will want this
to be a generator, so that you don't need to have all of
your texts in memory.
+row
+cell n_threads
+cell #[+a(link_int) int]
+cell.
The number of worker threads to use. If -1, OpenMP will
decide how many to use at run time. Default is 2.
+row
+cell batch_size
+cell #[+a(link_int) int]
+cell.
The number of texts to buffer. Let's say you have a
#[code batch_size] of 1,000. The input, #[code texts], is
a generator that yields the texts one-by-one. We want to
operate on them in parallel. So, we accumulate a work queue.
Instead of taking one document from #[code texts] and
operating on it, we buffer #[code batch_size] documents,
work on them in parallel, and then yield them one-by-one.
Higher #[code batch_size] therefore often results in better
parallelism, up to a point.
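p
| The buffer-then-yield pattern described for #[code batch_size] can be
| sketched in plain Python. This is a rough analogy only: spaCy uses OpenMP
| with the global interpreter lock released, whereas this sketch uses a
| standard-library thread pool. #[code minibatch], the #[code process]
| argument and this #[code pipe] function are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def minibatch(items, size):
    """Yield lists of up to `size` items from any iterable."""
    items = iter(items)
    while True:
        batch = list(islice(items, size))
        if not batch:
            return
        yield batch


def pipe(texts, process, batch_size=1000, n_threads=2):
    # `process` is a hypothetical per-text function standing in for the pipeline.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for batch in minibatch(texts, batch_size):
            # Work on the buffered batch with the pool, then yield the
            # results one by one, in input order.
            for result in pool.map(process, batch):
                yield result


texts = (u"text %d" % i for i in range(5))
lengths = list(pipe(texts, len, batch_size=2, n_threads=2))
batches = list(minibatch(range(5), 2))
```

p
| Only #[code batch_size] texts are held in the buffer at a time, so the input generator is never fully materialized.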