From 28726c25a19248b06b59c5ca759410b84b70668c Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Wed, 10 Mar 2021 11:42:02 +0100 Subject: [PATCH 1/2] Update docs for convert CLI and NER examples --- extra/example_data/ner_example_data/README.md | 20 ++++++++++++++++- website/docs/api/cli.md | 22 +++++++++---------- 2 files changed, 30 insertions(+), 12 deletions(-) diff --git a/extra/example_data/ner_example_data/README.md b/extra/example_data/ner_example_data/README.md index af70694f5..3c6a4a86b 100644 --- a/extra/example_data/ner_example_data/README.md +++ b/extra/example_data/ner_example_data/README.md @@ -1,7 +1,25 @@ ## Examples of NER/IOB data that can be converted with `spacy convert` -spacy JSON training files were generated with: +To convert an IOB file to `.spacy` ([`DocBin`](https://spacy.io/api/docbin)) +for spaCy v3: +```bash +python -m spacy convert -c iob -s -n 10 -b en_core_web_sm file.iob . ``` + +See all the `spacy convert` options: https://spacy.io/api/cli#convert + +--- + +The spaCy v2 JSON training files were generated using **spaCy v2** with: + +```bash python -m spacy convert -c iob -s -n 10 -b en file.iob ``` + +To convert an existing JSON training file to `.spacy` for spaCy v3, convert +with **spaCy v3**: + +```bash +python -m spacy convert file.json . +``` diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index e8be0f79c..fd149b285 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -261,24 +261,24 @@ $ python -m spacy convert [input_file] [output_dir] [--converter] [--file-type] | `output_dir` | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. ~~Optional[Path] \(positional)~~ | | `--converter`, `-c` 2 | Name of converter to use (see below). ~~str (option)~~ | | `--file-type`, `-t` 2.1 | Type of file to create. Either `spacy` (default) for binary [`DocBin`](/api/docbin) data or `json` for v2.x JSON format. 
~~str (option)~~ | -| `--n-sents`, `-n` | Number of sentences per document. ~~int (option)~~ | -| `--seg-sents`, `-s` 2.2 | Segment sentences (for `--converter ner`). ~~bool (flag)~~ | +| `--n-sents`, `-n` | Number of sentences per document. Supported for: `conll`, `conllu`, `iob`, `ner` ~~int (option)~~ | +| `--seg-sents`, `-s` 2.2 | Segment sentences. Supported for: `conll`, `ner` ~~bool (flag)~~ | | `--base`, `-b` | Trained spaCy pipeline for sentence segmentation to use as base (for `--seg-sents`). ~~Optional[str](option)~~ | -| `--morphology`, `-m` | Enable appending morphology to tags. ~~bool (flag)~~ | -| `--ner-map`, `-nm` | NER tag mapping (as JSON-encoded dict of entity types). ~~Optional[Path](option)~~ | +| `--morphology`, `-m` | Enable appending morphology to tags. Supported for: `conllu` ~~bool (flag)~~ | +| `--ner-map`, `-nm` | NER tag mapping (as JSON-encoded dict of entity types). Supported for: `conllu` ~~Optional[Path](option)~~ | | `--lang`, `-l` 2.1 | Language code (if tokenizer required). ~~Optional[str] \(option)~~ | | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | | **CREATES** | Binary [`DocBin`](/api/docbin) training data that can be used with [`spacy train`](/api/cli#train). | ### Converters {#converters} -| ID | Description | -| ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `auto` | Automatically pick converter based on file extension and file content (default). | -| `json` | JSON-formatted training data used in spaCy v2.x. | -| `conll` | Universal Dependencies `.conllu` or `.conll` format. | -| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. 
The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). | -| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `\|`, either `word\|B-ENT`or`word\|POS\|B-ENT`. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). | +| ID | Description | +| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `auto` | Automatically pick converter based on file extension and file content (default). | +| `json` | JSON-formatted training data used in spaCy v2.x. | +| `conllu` | Universal Dependencies `.conllu` format. | +| `ner` / `conll` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). | +| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `\|`, either `word\|B-ENT`or`word\|POS\|B-ENT`. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). 
| ## debug {#debug new="3"} From 84470d9b9e65bd1843dd250e5d94bb44fd87469e Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Thu, 11 Mar 2021 10:10:58 +0100 Subject: [PATCH 2/2] Incorporate BILUO note from #7407 --- website/docs/api/cli.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index fd149b285..44a8e2fc2 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -272,13 +272,13 @@ $ python -m spacy convert [input_file] [output_dir] [--converter] [--file-type] ### Converters {#converters} -| ID | Description | -| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `auto` | Automatically pick converter based on file extension and file content (default). | -| `json` | JSON-formatted training data used in spaCy v2.x. | -| `conllu` | Universal Dependencies `.conllu` format. | -| `ner` / `conll` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). | -| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `\|`, either `word\|B-ENT`or`word\|POS\|B-ENT`. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). 
| +| ID | Description | +| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `auto` | Automatically pick converter based on file extension and file content (default). | +| `json` | JSON-formatted training data used in spaCy v2.x. | +| `conllu` | Universal Dependencies `.conllu` format. | +| `ner` / `conll` | NER with IOB/IOB2/BILUO tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the NER tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). | +| `iob` | NER with IOB/IOB2/BILUO tags, one sentence per line with tokens separated by whitespace and annotation separated by `\|`, either `word\|B-ENT`or`word\|POS\|B-ENT`. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). | ## debug {#debug new="3"}
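
Reviewer note on the `iob` row: the layout it describes (one sentence per line, whitespace-separated tokens, annotations joined by `|`, either `word|B-ENT` or `word|POS|B-ENT`) can be sketched in plain Python. This is an illustration only, not spaCy's converter; `parse_iob_line` and the sample sentence are hypothetical, and the sketch assumes the separator in the data file is a bare `|` (the backslash in the table being Markdown escaping).

```python
from typing import List, Tuple


def parse_iob_line(line: str) -> List[Tuple[str, str]]:
    """Split one sentence line into (word, NER tag) pairs.

    Hypothetical helper for illustration; not part of spaCy.
    """
    pairs = []
    for token in line.split():
        fields = token.split("|")
        # Accept word|TAG or word|POS|TAG, as described in the table.
        if len(fields) not in (2, 3):
            raise ValueError(f"expected word|TAG or word|POS|TAG, got {token!r}")
        # The word is the first field and the NER tag is the last one.
        pairs.append((fields[0], fields[-1]))
    return pairs


# Made-up sample sentence in the two-field form.
sample = "Alex|B-PERSON is|O visiting|O New|B-GPE York|I-GPE"
pairs = parse_iob_line(sample)
```

The three-field form is handled the same way: the middle POS field is simply skipped, so `parse_iob_line("New|NNP|B-GPE")` yields the word and its NER tag.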