spaCy/website/usage/_processing-pipelines/_multithreading.jade

//- 💫 DOCS > USAGE > PROCESSING PIPELINES > MULTI-THREADING

p
    |  If you have a sequence of documents to process, you should use the
    |  #[+api("language#pipe") #[code Language.pipe()]] method. The method takes
    |  an iterator of texts and accumulates them in an internal buffer,
    |  which it works on in parallel. It then yields the documents in order,
    |  one by one. After a long and bitter struggle, the global interpreter
    |  lock was released around spaCy's main parsing loop in v0.100.3. This means
    |  that #[code .pipe()] will be significantly faster in most
    |  practical situations, because it allows shared-memory parallelism.

+code.
    for doc in nlp.pipe(texts, batch_size=10000, n_threads=3):
        pass
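
p
    |  To make the buffering behaviour concrete, here is a minimal pure-Python
    |  sketch of the idea: read a batch of texts from the stream, process the
    |  batch, and yield the results in their original order. It is an
    |  illustration only, not spaCy's implementation. The names
    |  #[code minibatch], #[code process_batch] and #[code pipe_sketch] are
    |  hypothetical, and the batch is processed serially here rather than in
    |  parallel.

```python
from itertools import islice


def minibatch(items, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    # Hypothetical stand-in for the real work done on a batch of texts.
    return [text.upper() for text in batch]


def pipe_sketch(texts, batch_size=2):
    """Buffer the input stream in batches and yield results in input order."""
    for batch in minibatch(texts, batch_size):
        for result in process_batch(batch):
            yield result


texts = (line for line in ["one", "two", "three"])  # any iterator works
results = list(pipe_sketch(texts, batch_size=2))
```

p
    |  Because the input is consumed lazily, only one batch needs to be held in
    |  memory at a time, which is why passing a generator rather than a list
    |  can pay off for large collections.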

p
    |  To make full use of the #[code .pipe()] function, you might want to
    |  brush up on #[strong Python generators]. Here are a few quick hints:

+list
    +item
        |  Generator comprehensions can be written as
        |  #[code (item for item in sequence)].

    +item
        |  The
        |  #[+a("https://docs.python.org/2/library/itertools.html") #[code itertools] standard library module]
        |  and the
        |  #[+a("https://github.com/pytoolz/cytoolz") #[code cytoolz] package]
        |  provide a lot of handy #[strong generator tools].

    +item
        |  Often you'll have an input stream that pairs text with some
        |  important metadata, e.g. a JSON document. To
        |  #[strong pair up the metadata] with the processed #[code Doc]
        |  object, you should use the #[code itertools.tee] function to split
        |  the generator in two, and then #[code izip] (plain #[code zip] on
        |  Python 3) the extra stream to the document stream. Here's
        |  #[+a(gh("spacy") + "/issues/172#issuecomment-183963403") an example].
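
p
    |  The last hint can be sketched roughly as follows. To keep the snippet
    |  runnable without a loaded model, #[code fake_pipe] is a hypothetical
    |  stand-in for #[code nlp.pipe()] that simply splits each text on
    |  whitespace. #[code pair_with_metadata] and the sample records are also
    |  made up for illustration.

```python
from itertools import tee


def fake_pipe(texts):
    # Hypothetical stand-in for nlp.pipe(): consumes a stream of texts and
    # yields one processed object per text, in the same order.
    for text in texts:
        yield text.split()


def pair_with_metadata(records):
    """Yield (processed, metadata) pairs from a stream of (text, metadata)."""
    # Split the single input stream into two independent iterators.
    stream_a, stream_b = tee(records)
    texts = (text for text, _ in stream_a)
    metadata = (meta for _, meta in stream_b)
    # zip is lazy on Python 3; on Python 2 use itertools.izip instead.
    return zip(fake_pipe(texts), metadata)


records = [
    ("This is a text", {"id": 1}),
    ("And another one", {"id": 2}),
]
pairs = list(pair_with_metadata(iter(records)))
```

p
    |  Note that #[code tee] buffers every item that one branch has consumed
    |  and the other has not, so memory use grows with how far the document
    |  stream runs ahead of the metadata stream.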

+h(3, "multi-processing-example") Example: Multi-processing with Joblib

p
    |  This example shows how to use multiple cores to process text using
    |  spaCy and #[+a("https://joblib.readthedocs.io/en/latest/parallel.html") Joblib]. We're
    |  exporting part-of-speech-tagged, true-cased, (very roughly)
    |  sentence-separated text, with each "sentence" on a newline, and
    |  spaces between tokens. The data comes from the IMDB movie reviews
    |  dataset and is loaded automatically via Thinc's built-in dataset
    |  loader.

+github("spacy", "examples/pipeline/multi_processing.py", 500)