Update processing-pipelines.md to mention method for doc metadata (#7480)

* Update processing-pipelines.md

Under "things to try," inform users they can save metadata when using nlp.pipe(foobar, as_tuples=True)

Link to a new example on the attributes page detailing the following:

> ```
> data = [
>   ("Some text to process", {"meta": "foo"}),
>   ("And more text...", {"meta": "bar"})
> ]
> 
> for doc, context in nlp.pipe(data, as_tuples=True):
>     # Let's assume you have a "meta" extension registered on the Doc
>     doc._.meta = context["meta"]
> ```

from https://stackoverflow.com/questions/57058798/make-spacy-nlp-pipe-process-tuples-of-text-and-additional-information-to-add-as

* Updating the attributes section

Update the attributes section with example of how extensions can be used to store metadata.

* Update processing-pipelines.md

* Update processing-pipelines.md

Made as_tuples example executable and relocated to the end of the "Processing Text" section.

* Update processing-pipelines.md

* Update processing-pipelines.md

Removed extra line

* Reformat and rephrase

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
This commit is contained in:
langdonholmes 2021-04-19 02:58:12 -07:00 committed by GitHub
parent 0e7f94b247
commit df541c6b5e
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed files with 33 additions and 0 deletions

View File

@ -91,6 +91,37 @@ have to call `list()` on it first:
</Infobox>
You can use the `as_tuples` option to pass additional context along with each
doc when using [`nlp.pipe`](/api/language#pipe). If `as_tuples` is `True`, then
the input should be a sequence of `(text, context)` tuples and the output will
be a sequence of `(doc, context)` tuples. For example, you can pass metadata in
the context and save it in a [custom attribute](#custom-components-attributes):
```python
### {executable="true"}
import spacy
from spacy.tokens import Doc
if not Doc.has_extension("text_id"):
Doc.set_extension("text_id", default=None)
text_tuples = [
("This is the first text.", {"text_id": "text1"}),
("This is the second text.", {"text_id": "text2"})
]
nlp = spacy.load("en_core_web_sm")
doc_tuples = nlp.pipe(text_tuples, as_tuples=True)
docs = []
for doc, context in doc_tuples:
doc._.text_id = context["text_id"]
docs.append(doc)
for doc in docs:
print(f"{doc._.text_id}: {doc.text}")
```
### Multiprocessing {#multiprocessing}
spaCy includes built-in support for multiprocessing with
@ -1373,6 +1404,8 @@ There are three main types of extensions, which can be defined using the
[`Span.set_extension`](/api/span#set_extension) and
[`Token.set_extension`](/api/token#set_extension) methods.
## Description
1. **Attribute extensions.** Set a default value for an attribute, which can be
overwritten manually at any time. Attribute extensions work like "normal"
variables and are the quickest way to store arbitrary information on a `Doc`,