mirror of https://github.com/explosion/spaCy.git
Updates/bugfixes for NER/IOB converters (#4186)
* Updates/bugfixes for NER/IOB converters
* Converter formats `ner` and `iob` use autodetect to choose a converter if possible
* `iob2json` is reverted to handle sentence-per-line data like `word1|pos1|ent1 word2|pos2|ent2`
* Fix bug in `merge_sentences()` so the second sentence in each batch isn't skipped
* `conll_ner2json` is made more general so it can handle more formats with whitespace-separated columns
* Supports all formats where the first column is the token and the final column is the IOB tag; if present, the second column is the POS tag
* As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O` separates documents
* Add option for segmenting sentences (new flag `-s`)
* Parser-based sentence segmentation with a provided model, otherwise with sentencizer (new option `-b` to specify model)
* Can group sentences into documents with `n_sents` as long as sentence segmentation is available
* Only applies automatic segmentation when there are no existing delimiters in the data
* Provide info about settings applied during conversion, with warnings and suggestions if settings conflict or might not be optimal
* Add tests for common formats
* Add '(default)' back to docs for `-c auto`
* Add document count back to output
* Revert changes to converter output message
* Use explicit tabs in convert CLI test data
* Adjust/add messages for n_sents=1 default
* Add sample NER data to training examples
* Update README
* Add links in docs to example NER data
* Define msg within converters
Commit 82159b5c19 (parent 5feb342f5e)

@@ -0,0 +1,7 @@
## Examples of NER/IOB data that can be converted with `spacy convert`

The spaCy JSON training files were generated with:

```
python -m spacy convert -c iob -s -n 10 -b en file.iob
```
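Not part of this commit, but an illustrative next step: the converted JSON can be passed straight to the training CLI. The command below is a rough sketch that trains an NER-only pipeline on the generated file, reusing it as development data purely for a quick smoke test; the output directory and iteration count are placeholders.

```
python -m spacy train en /tmp/ner-model file.json file.json --pipeline ner --n-iter 10
```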
@@ -0,0 +1,2 @@
When|WRB|O Sebastian|NNP|B-PERSON Thrun|NNP|I-PERSON started|VBD|O working|VBG|O on|IN|O self|NN|O -|HYPH|O driving|VBG|O cars|NNS|O at|IN|O Google|NNP|B-ORG in|IN|O 2007|CD|B-DATE ,|,|O few|JJ|O people|NNS|O outside|RB|O of|IN|O the|DT|O company|NN|O took|VBD|O him|PRP|O seriously|RB|O .|.|O
“|''|O I|PRP|O can|MD|O tell|VB|O you|PRP|O very|RB|O senior|JJ|O CEOs|NNS|O of|IN|O major|JJ|O American|JJ|B-NORP car|NN|O companies|NNS|O would|MD|O shake|VB|O my|PRP$|O hand|NN|O and|CC|O turn|VB|O away|RB|O because|IN|O I|PRP|O was|VBD|O n’t|RB|O worth|JJ|O talking|VBG|O to|IN|O ,|,|O ”|''|O said|VBD|O Thrun|NNP|B-PERSON ,|,|O in|IN|O an|DT|O interview|NN|O with|IN|O Recode|NNP|B-ORG earlier|RBR|B-DATE this|DT|I-DATE week|NN|I-DATE .|.|O
@@ -0,0 +1,349 @@
[
|
||||
{
|
||||
"id":0,
|
||||
"paragraphs":[
|
||||
{
|
||||
"sentences":[
|
||||
{
|
||||
"tokens":[
|
||||
{
|
||||
"orth":"When",
|
||||
"tag":"WRB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Sebastian",
|
||||
"tag":"NNP",
|
||||
"ner":"B-PERSON"
|
||||
},
|
||||
{
|
||||
"orth":"Thrun",
|
||||
"tag":"NNP",
|
||||
"ner":"L-PERSON"
|
||||
},
|
||||
{
|
||||
"orth":"started",
|
||||
"tag":"VBD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"working",
|
||||
"tag":"VBG",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"on",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"self",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"-",
|
||||
"tag":"HYPH",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"driving",
|
||||
"tag":"VBG",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"cars",
|
||||
"tag":"NNS",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"at",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Google",
|
||||
"tag":"NNP",
|
||||
"ner":"U-ORG"
|
||||
},
|
||||
{
|
||||
"orth":"in",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"2007",
|
||||
"tag":"CD",
|
||||
"ner":"U-DATE"
|
||||
},
|
||||
{
|
||||
"orth":",",
|
||||
"tag":",",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"few",
|
||||
"tag":"JJ",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"people",
|
||||
"tag":"NNS",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"outside",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"of",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"the",
|
||||
"tag":"DT",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"company",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"took",
|
||||
"tag":"VBD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"him",
|
||||
"tag":"PRP",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"seriously",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":".",
|
||||
"tag":".",
|
||||
"ner":"O"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"tokens":[
|
||||
{
|
||||
"orth":"\u201c",
|
||||
"tag":"''",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"I",
|
||||
"tag":"PRP",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"can",
|
||||
"tag":"MD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"tell",
|
||||
"tag":"VB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"you",
|
||||
"tag":"PRP",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"very",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"senior",
|
||||
"tag":"JJ",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"CEOs",
|
||||
"tag":"NNS",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"of",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"major",
|
||||
"tag":"JJ",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"American",
|
||||
"tag":"JJ",
|
||||
"ner":"U-NORP"
|
||||
},
|
||||
{
|
||||
"orth":"car",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"companies",
|
||||
"tag":"NNS",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"would",
|
||||
"tag":"MD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"shake",
|
||||
"tag":"VB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"my",
|
||||
"tag":"PRP$",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"hand",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"and",
|
||||
"tag":"CC",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"turn",
|
||||
"tag":"VB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"away",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"because",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"I",
|
||||
"tag":"PRP",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"was",
|
||||
"tag":"VBD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"n\u2019t",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"worth",
|
||||
"tag":"JJ",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"talking",
|
||||
"tag":"VBG",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"to",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":",",
|
||||
"tag":",",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"\u201d",
|
||||
"tag":"''",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"said",
|
||||
"tag":"VBD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Thrun",
|
||||
"tag":"NNP",
|
||||
"ner":"U-PERSON"
|
||||
},
|
||||
{
|
||||
"orth":",",
|
||||
"tag":",",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"in",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"an",
|
||||
"tag":"DT",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"interview",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"with",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Recode",
|
||||
"tag":"NNP",
|
||||
"ner":"U-ORG"
|
||||
},
|
||||
{
|
||||
"orth":"earlier",
|
||||
"tag":"RBR",
|
||||
"ner":"B-DATE"
|
||||
},
|
||||
{
|
||||
"orth":"this",
|
||||
"tag":"DT",
|
||||
"ner":"I-DATE"
|
||||
},
|
||||
{
|
||||
"orth":"week",
|
||||
"tag":"NN",
|
||||
"ner":"L-DATE"
|
||||
},
|
||||
{
|
||||
"orth":".",
|
||||
"tag":".",
|
||||
"ner":"O"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
|
@@ -0,0 +1,70 @@
-DOCSTART- -X- O O
|
||||
|
||||
When WRB _ O
|
||||
Sebastian NNP _ B-PERSON
|
||||
Thrun NNP _ I-PERSON
|
||||
started VBD _ O
|
||||
working VBG _ O
|
||||
on IN _ O
|
||||
self NN _ O
|
||||
- HYPH _ O
|
||||
driving VBG _ O
|
||||
cars NNS _ O
|
||||
at IN _ O
|
||||
Google NNP _ B-ORG
|
||||
in IN _ O
|
||||
2007 CD _ B-DATE
|
||||
, , _ O
|
||||
few JJ _ O
|
||||
people NNS _ O
|
||||
outside RB _ O
|
||||
of IN _ O
|
||||
the DT _ O
|
||||
company NN _ O
|
||||
took VBD _ O
|
||||
him PRP _ O
|
||||
seriously RB _ O
|
||||
. . _ O
|
||||
|
||||
“ '' _ O
|
||||
I PRP _ O
|
||||
can MD _ O
|
||||
tell VB _ O
|
||||
you PRP _ O
|
||||
very RB _ O
|
||||
senior JJ _ O
|
||||
CEOs NNS _ O
|
||||
of IN _ O
|
||||
major JJ _ O
|
||||
American JJ _ B-NORP
|
||||
car NN _ O
|
||||
companies NNS _ O
|
||||
would MD _ O
|
||||
shake VB _ O
|
||||
my PRP$ _ O
|
||||
hand NN _ O
|
||||
and CC _ O
|
||||
turn VB _ O
|
||||
away RB _ O
|
||||
because IN _ O
|
||||
I PRP _ O
|
||||
was VBD _ O
|
||||
n’t RB _ O
|
||||
worth JJ _ O
|
||||
talking VBG _ O
|
||||
to IN _ O
|
||||
, , _ O
|
||||
” '' _ O
|
||||
said VBD _ O
|
||||
Thrun NNP _ B-PERSON
|
||||
, , _ O
|
||||
in IN _ O
|
||||
an DT _ O
|
||||
interview NN _ O
|
||||
with IN _ O
|
||||
Recode NNP _ B-ORG
|
||||
earlier RBR _ B-DATE
|
||||
this DT _ I-DATE
|
||||
week NN _ I-DATE
|
||||
. . _ O
|
||||
|
|
@@ -0,0 +1,349 @@
[
|
||||
{
|
||||
"id":0,
|
||||
"paragraphs":[
|
||||
{
|
||||
"sentences":[
|
||||
{
|
||||
"tokens":[
|
||||
{
|
||||
"orth":"When",
|
||||
"tag":"WRB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Sebastian",
|
||||
"tag":"NNP",
|
||||
"ner":"B-PERSON"
|
||||
},
|
||||
{
|
||||
"orth":"Thrun",
|
||||
"tag":"NNP",
|
||||
"ner":"L-PERSON"
|
||||
},
|
||||
{
|
||||
"orth":"started",
|
||||
"tag":"VBD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"working",
|
||||
"tag":"VBG",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"on",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"self",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"-",
|
||||
"tag":"HYPH",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"driving",
|
||||
"tag":"VBG",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"cars",
|
||||
"tag":"NNS",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"at",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Google",
|
||||
"tag":"NNP",
|
||||
"ner":"U-ORG"
|
||||
},
|
||||
{
|
||||
"orth":"in",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"2007",
|
||||
"tag":"CD",
|
||||
"ner":"U-DATE"
|
||||
},
|
||||
{
|
||||
"orth":",",
|
||||
"tag":",",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"few",
|
||||
"tag":"JJ",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"people",
|
||||
"tag":"NNS",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"outside",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"of",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"the",
|
||||
"tag":"DT",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"company",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"took",
|
||||
"tag":"VBD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"him",
|
||||
"tag":"PRP",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"seriously",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":".",
|
||||
"tag":".",
|
||||
"ner":"O"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"tokens":[
|
||||
{
|
||||
"orth":"\u201c",
|
||||
"tag":"''",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"I",
|
||||
"tag":"PRP",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"can",
|
||||
"tag":"MD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"tell",
|
||||
"tag":"VB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"you",
|
||||
"tag":"PRP",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"very",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"senior",
|
||||
"tag":"JJ",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"CEOs",
|
||||
"tag":"NNS",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"of",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"major",
|
||||
"tag":"JJ",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"American",
|
||||
"tag":"JJ",
|
||||
"ner":"U-NORP"
|
||||
},
|
||||
{
|
||||
"orth":"car",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"companies",
|
||||
"tag":"NNS",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"would",
|
||||
"tag":"MD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"shake",
|
||||
"tag":"VB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"my",
|
||||
"tag":"PRP$",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"hand",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"and",
|
||||
"tag":"CC",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"turn",
|
||||
"tag":"VB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"away",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"because",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"I",
|
||||
"tag":"PRP",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"was",
|
||||
"tag":"VBD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"n\u2019t",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"worth",
|
||||
"tag":"JJ",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"talking",
|
||||
"tag":"VBG",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"to",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":",",
|
||||
"tag":",",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"\u201d",
|
||||
"tag":"''",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"said",
|
||||
"tag":"VBD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Thrun",
|
||||
"tag":"NNP",
|
||||
"ner":"U-PERSON"
|
||||
},
|
||||
{
|
||||
"orth":",",
|
||||
"tag":",",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"in",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"an",
|
||||
"tag":"DT",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"interview",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"with",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Recode",
|
||||
"tag":"NNP",
|
||||
"ner":"U-ORG"
|
||||
},
|
||||
{
|
||||
"orth":"earlier",
|
||||
"tag":"RBR",
|
||||
"ner":"B-DATE"
|
||||
},
|
||||
{
|
||||
"orth":"this",
|
||||
"tag":"DT",
|
||||
"ner":"I-DATE"
|
||||
},
|
||||
{
|
||||
"orth":"week",
|
||||
"tag":"NN",
|
||||
"ner":"L-DATE"
|
||||
},
|
||||
{
|
||||
"orth":".",
|
||||
"tag":".",
|
||||
"ner":"O"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
|
@@ -0,0 +1,66 @@
When WRB O
|
||||
Sebastian NNP B-PERSON
|
||||
Thrun NNP I-PERSON
|
||||
started VBD O
|
||||
working VBG O
|
||||
on IN O
|
||||
self NN O
|
||||
- HYPH O
|
||||
driving VBG O
|
||||
cars NNS O
|
||||
at IN O
|
||||
Google NNP B-ORG
|
||||
in IN O
|
||||
2007 CD B-DATE
|
||||
, , O
|
||||
few JJ O
|
||||
people NNS O
|
||||
outside RB O
|
||||
of IN O
|
||||
the DT O
|
||||
company NN O
|
||||
took VBD O
|
||||
him PRP O
|
||||
seriously RB O
|
||||
. . O
|
||||
“ '' O
|
||||
I PRP O
|
||||
can MD O
|
||||
tell VB O
|
||||
you PRP O
|
||||
very RB O
|
||||
senior JJ O
|
||||
CEOs NNS O
|
||||
of IN O
|
||||
major JJ O
|
||||
American JJ B-NORP
|
||||
car NN O
|
||||
companies NNS O
|
||||
would MD O
|
||||
shake VB O
|
||||
my PRP$ O
|
||||
hand NN O
|
||||
and CC O
|
||||
turn VB O
|
||||
away RB O
|
||||
because IN O
|
||||
I PRP O
|
||||
was VBD O
|
||||
n’t RB O
|
||||
worth JJ O
|
||||
talking VBG O
|
||||
to IN O
|
||||
, , O
|
||||
” '' O
|
||||
said VBD O
|
||||
Thrun NNP B-PERSON
|
||||
, , O
|
||||
in IN O
|
||||
an DT O
|
||||
interview NN O
|
||||
with IN O
|
||||
Recode NNP B-ORG
|
||||
earlier RBR B-DATE
|
||||
this DT I-DATE
|
||||
week NN I-DATE
|
||||
. . O
|
|
@@ -0,0 +1,353 @@
[
|
||||
{
|
||||
"id":0,
|
||||
"paragraphs":[
|
||||
{
|
||||
"sentences":[
|
||||
{
|
||||
"tokens":[
|
||||
{
|
||||
"orth":"When",
|
||||
"tag":"WRB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Sebastian",
|
||||
"tag":"NNP",
|
||||
"ner":"B-PERSON"
|
||||
},
|
||||
{
|
||||
"orth":"Thrun",
|
||||
"tag":"NNP",
|
||||
"ner":"L-PERSON"
|
||||
},
|
||||
{
|
||||
"orth":"started",
|
||||
"tag":"VBD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"working",
|
||||
"tag":"VBG",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"on",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"self",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"-",
|
||||
"tag":"HYPH",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"driving",
|
||||
"tag":"VBG",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"cars",
|
||||
"tag":"NNS",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"at",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Google",
|
||||
"tag":"NNP",
|
||||
"ner":"U-ORG"
|
||||
},
|
||||
{
|
||||
"orth":"in",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"2007",
|
||||
"tag":"CD",
|
||||
"ner":"U-DATE"
|
||||
},
|
||||
{
|
||||
"orth":",",
|
||||
"tag":",",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"few",
|
||||
"tag":"JJ",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"people",
|
||||
"tag":"NNS",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"outside",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"of",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"the",
|
||||
"tag":"DT",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"company",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"took",
|
||||
"tag":"VBD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"him",
|
||||
"tag":"PRP",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"seriously",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":".",
|
||||
"tag":".",
|
||||
"ner":"O"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"tokens":[
|
||||
{
|
||||
"orth":"\u201c",
|
||||
"tag":"''",
|
||||
"ner":"O"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"tokens":[
|
||||
{
|
||||
"orth":"I",
|
||||
"tag":"PRP",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"can",
|
||||
"tag":"MD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"tell",
|
||||
"tag":"VB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"you",
|
||||
"tag":"PRP",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"very",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"senior",
|
||||
"tag":"JJ",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"CEOs",
|
||||
"tag":"NNS",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"of",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"major",
|
||||
"tag":"JJ",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"American",
|
||||
"tag":"JJ",
|
||||
"ner":"U-NORP"
|
||||
},
|
||||
{
|
||||
"orth":"car",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"companies",
|
||||
"tag":"NNS",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"would",
|
||||
"tag":"MD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"shake",
|
||||
"tag":"VB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"my",
|
||||
"tag":"PRP$",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"hand",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"and",
|
||||
"tag":"CC",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"turn",
|
||||
"tag":"VB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"away",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"because",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"I",
|
||||
"tag":"PRP",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"was",
|
||||
"tag":"VBD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"n\u2019t",
|
||||
"tag":"RB",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"worth",
|
||||
"tag":"JJ",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"talking",
|
||||
"tag":"VBG",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"to",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":",",
|
||||
"tag":",",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"\u201d",
|
||||
"tag":"''",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"said",
|
||||
"tag":"VBD",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Thrun",
|
||||
"tag":"NNP",
|
||||
"ner":"U-PERSON"
|
||||
},
|
||||
{
|
||||
"orth":",",
|
||||
"tag":",",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"in",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"an",
|
||||
"tag":"DT",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"interview",
|
||||
"tag":"NN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"with",
|
||||
"tag":"IN",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Recode",
|
||||
"tag":"NNP",
|
||||
"ner":"U-ORG"
|
||||
},
|
||||
{
|
||||
"orth":"earlier",
|
||||
"tag":"RBR",
|
||||
"ner":"B-DATE"
|
||||
},
|
||||
{
|
||||
"orth":"this",
|
||||
"tag":"DT",
|
||||
"ner":"I-DATE"
|
||||
},
|
||||
{
|
||||
"orth":"week",
|
||||
"tag":"NN",
|
||||
"ner":"L-DATE"
|
||||
},
|
||||
{
|
||||
"orth":".",
|
||||
"tag":".",
|
||||
"ner":"O"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
|
@@ -0,0 +1,66 @@
When O
|
||||
Sebastian B-PERSON
|
||||
Thrun I-PERSON
|
||||
started O
|
||||
working O
|
||||
on O
|
||||
self O
|
||||
- O
|
||||
driving O
|
||||
cars O
|
||||
at O
|
||||
Google B-ORG
|
||||
in O
|
||||
2007 B-DATE
|
||||
, O
|
||||
few O
|
||||
people O
|
||||
outside O
|
||||
of O
|
||||
the O
|
||||
company O
|
||||
took O
|
||||
him O
|
||||
seriously O
|
||||
. O
|
||||
“ O
|
||||
I O
|
||||
can O
|
||||
tell O
|
||||
you O
|
||||
very O
|
||||
senior O
|
||||
CEOs O
|
||||
of O
|
||||
major O
|
||||
American B-NORP
|
||||
car O
|
||||
companies O
|
||||
would O
|
||||
shake O
|
||||
my O
|
||||
hand O
|
||||
and O
|
||||
turn O
|
||||
away O
|
||||
because O
|
||||
I O
|
||||
was O
|
||||
n’t O
|
||||
worth O
|
||||
talking O
|
||||
to O
|
||||
, O
|
||||
” O
|
||||
said O
|
||||
Thrun B-PERSON
|
||||
, O
|
||||
in O
|
||||
an O
|
||||
interview O
|
||||
with O
|
||||
Recode B-ORG
|
||||
earlier B-DATE
|
||||
this I-DATE
|
||||
week I-DATE
|
||||
. O
|
|
@@ -0,0 +1,353 @@
[
|
||||
{
|
||||
"id":0,
|
||||
"paragraphs":[
|
||||
{
|
||||
"sentences":[
|
||||
{
|
||||
"tokens":[
|
||||
{
|
||||
"orth":"When",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Sebastian",
|
||||
"tag":"-",
|
||||
"ner":"B-PERSON"
|
||||
},
|
||||
{
|
||||
"orth":"Thrun",
|
||||
"tag":"-",
|
||||
"ner":"L-PERSON"
|
||||
},
|
||||
{
|
||||
"orth":"started",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"working",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"on",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"self",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"-",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"driving",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"cars",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"at",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Google",
|
||||
"tag":"-",
|
||||
"ner":"U-ORG"
|
||||
},
|
||||
{
|
||||
"orth":"in",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"2007",
|
||||
"tag":"-",
|
||||
"ner":"U-DATE"
|
||||
},
|
||||
{
|
||||
"orth":",",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"few",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"people",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"outside",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"of",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"the",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"company",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"took",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"him",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"seriously",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":".",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"tokens":[
|
||||
{
|
||||
"orth":"\u201c",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"tokens":[
|
||||
{
|
||||
"orth":"I",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"can",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"tell",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"you",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"very",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"senior",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"CEOs",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"of",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"major",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"American",
|
||||
"tag":"-",
|
||||
"ner":"U-NORP"
|
||||
},
|
||||
{
|
||||
"orth":"car",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"companies",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"would",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"shake",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"my",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"hand",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"and",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"turn",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"away",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"because",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"I",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"was",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"n\u2019t",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"worth",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"talking",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"to",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":",",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"\u201d",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"said",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Thrun",
|
||||
"tag":"-",
|
||||
"ner":"U-PERSON"
|
||||
},
|
||||
{
|
||||
"orth":",",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"in",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"an",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"interview",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"with",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
},
|
||||
{
|
||||
"orth":"Recode",
|
||||
"tag":"-",
|
||||
"ner":"U-ORG"
|
||||
},
|
||||
{
|
||||
"orth":"earlier",
|
||||
"tag":"-",
|
||||
"ner":"B-DATE"
|
||||
},
|
||||
{
|
||||
"orth":"this",
|
||||
"tag":"-",
|
||||
"ner":"I-DATE"
|
||||
},
|
||||
{
|
||||
"orth":"week",
|
||||
"tag":"-",
|
||||
"ner":"L-DATE"
|
||||
},
|
||||
{
|
||||
"orth":".",
|
||||
"tag":"-",
|
||||
"ner":"O"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
|
@@ -5,12 +5,14 @@ import plac
from pathlib import Path
from wasabi import Printer
import srsly
import re

from .converters import conllu2json, iob2json, conll_ner2json
from .converters import ner_jsonl2json


# Converters are matched by file extension. To add a converter, add a new
# Converters are matched by file extension except for ner/iob, which are
# matched by file extension and content. To add a converter, add a new
# entry to this dict with the file extension mapped to the converter function
# imported from /converters.
CONVERTERS = {

@@ -31,7 +33,9 @@ FILE_TYPES_STDOUT = ("json", "jsonl")
    input_file=("Input file", "positional", None, str),
    output_dir=("Output directory. '-' for stdout.", "positional", None, str),
    file_type=("Type of data to produce: {}".format(FILE_TYPES), "option", "t", str),
    n_sents=("Number of sentences per doc", "option", "n", int),
    n_sents=("Number of sentences per doc (0 to disable)", "option", "n", int),
    seg_sents=("Segment sentences (for -c ner)", "flag", "s"),
    model=("Model for sentence segmentation (for -s)", "option", "b", str),
    converter=("Converter: {}".format(tuple(CONVERTERS.keys())), "option", "c", str),
    lang=("Language (if tokenizer required)", "option", "l", str),
    morphology=("Enable appending morphology to tags", "flag", "m", bool),

@@ -41,6 +45,8 @@ def convert(
    output_dir="-",
    file_type="json",
    n_sents=1,
    seg_sents=False,
    model=None,
    morphology=False,
    converter="auto",
    lang=None,

@@ -70,14 +76,24 @@ def convert(
        msg.fail("Input file not found", input_path, exits=1)
    if output_dir != "-" and not Path(output_dir).exists():
        msg.fail("Output directory not found", output_dir, exits=1)
    input_data = input_path.open("r", encoding="utf-8").read()
    if converter == "auto":
        converter = input_path.suffix[1:]
    if converter == "ner" or converter == "iob":
        converter_autodetect = autodetect_ner_format(input_data)
        if converter_autodetect == "ner":
            msg.info("Auto-detected token-per-line NER format")
            converter = converter_autodetect
        elif converter_autodetect == "iob":
            msg.info("Auto-detected sentence-per-line NER format")
            converter = converter_autodetect
        else:
            msg.warn("Can't automatically detect NER format. Conversion may not succeed. See https://spacy.io/api/cli#convert")
    if converter not in CONVERTERS:
        msg.fail("Can't find converter for {}".format(converter), exits=1)
    # Use converter function to convert data
    func = CONVERTERS[converter]
    input_data = input_path.open("r", encoding="utf-8").read()
    data = func(input_data, n_sents=n_sents, use_morphology=morphology, lang=lang)
    data = func(input_data, n_sents=n_sents, seg_sents=seg_sents, use_morphology=morphology, lang=lang, model=model)
    if output_dir != "-":
        # Export data to a file
        suffix = ".{}".format(file_type)

@@ -88,10 +104,29 @@ def convert(
            srsly.write_jsonl(output_file, data)
        elif file_type == "msg":
            srsly.write_msgpack(output_file, data)
        msg.good("Generated output file ({} documents)".format(len(data)), output_file)
        msg.good("Generated output file ({} documents): {}".format(len(data), output_file))
    else:
        # Print to stdout
        if file_type == "json":
            srsly.write_json("-", data)
        elif file_type == "jsonl":
            srsly.write_jsonl("-", data)


def autodetect_ner_format(input_data):
    # guess format from the first 20 lines
    lines = input_data.split("\n")[:20]
    format_guesses = {"ner": 0, "iob": 0}
    iob_re = re.compile(r"\S+\|(O|[IB]-\S+)")
    ner_re = re.compile(r"\S+\s+(O|[IB]-\S+)$")
    for line in lines:
        line = line.strip()
        if iob_re.search(line):
            format_guesses["iob"] += 1
        if ner_re.search(line):
            format_guesses["ner"] += 1
    if format_guesses["iob"] == 0 and format_guesses["ner"] > 0:
        return "ner"
    if format_guesses["ner"] == 0 and format_guesses["iob"] > 0:
        return "iob"
    return None
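The two regular expressions above are the entire detection heuristic: pipe-joined annotations such as `London|NNP|B-GPE` count towards the sentence-per-line `iob` format, while whitespace-separated columns ending in an IOB tag, such as `London NNP B-GPE`, count towards the token-per-line `ner` format. A small standalone sketch of the same check, for illustration only:

```python
import re

# same patterns as autodetect_ner_format above
iob_re = re.compile(r"\S+\|(O|[IB]-\S+)")    # word|TAG or word|POS|TAG, pipe-separated
ner_re = re.compile(r"\S+\s+(O|[IB]-\S+)$")  # whitespace columns, IOB tag in the last column

print(bool(iob_re.search("London|NNP|B-GPE")))  # True  -> "iob"
print(bool(ner_re.search("London|NNP|B-GPE")))  # False
print(bool(ner_re.search("London NNP B-GPE")))  # True  -> "ner"
print(bool(iob_re.search("London NNP B-GPE")))  # False
```

A file is only classified when one pattern matches at least one of the sampled lines and the other matches none; otherwise `autodetect_ner_format` returns `None` and the CLI warns that it can't detect the format.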
@@ -1,17 +1,71 @@
# coding: utf8
from __future__ import unicode_literals

from wasabi import Printer

from ...gold import iob_to_biluo
from ...lang.xx import MultiLanguage
from ...tokens.doc import Doc
from ...util import load_model


def conll_ner2json(input_data, **kwargs):
def conll_ner2json(input_data, n_sents=10, seg_sents=False, model=None, **kwargs):
    """
    Convert files in the CoNLL-2003 NER format into JSON format for use with
    train cli.
    Convert files in the CoNLL-2003 NER format and similar
    whitespace-separated columns into JSON format for use with train cli.

    The first column is the tokens, the final column is the IOB tags. If an
    additional second column is present, the second column is the tags.

    Sentences are separated with whitespace and documents can be separated
    using the line "-DOCSTART- -X- O O".

    Sample format:

    -DOCSTART- -X- O O

    I O
    like O
    London B-GPE
    and O
    New B-GPE
    York I-GPE
    City I-GPE
    . O

    """
    delimit_docs = "-DOCSTART- -X- O O"
    msg = Printer()
    doc_delimiter = "-DOCSTART- -X- O O"
    # check for existing delimiters, which should be preserved
    if "\n\n" in input_data and seg_sents:
        msg.warn("Sentence boundaries found, automatic sentence segmentation with `-s` disabled.")
        seg_sents = False
    if doc_delimiter in input_data and n_sents:
        msg.warn("Document delimiters found, automatic document segmentation with `-n` disabled.")
        n_sents = 0
    # do document segmentation with existing sentences
    if "\n\n" in input_data and not doc_delimiter in input_data and n_sents:
        n_sents_info(msg, n_sents)
        input_data = segment_docs(input_data, n_sents, doc_delimiter)
    # do sentence segmentation with existing documents
    if not "\n\n" in input_data and doc_delimiter in input_data and seg_sents:
        input_data = segment_sents_and_docs(input_data, 0, "", model=model, msg=msg)
    # do both sentence segmentation and document segmentation according
    # to options
    if not "\n\n" in input_data and not doc_delimiter in input_data:
        # sentence segmentation required for document segmentation
        if n_sents > 0 and not seg_sents:
            msg.warn("No sentence boundaries found to use with option `-n {}`. Use `-s` to automatically segment sentences or `-n 0` to disable.".format(n_sents))
        else:
            n_sents_info(msg, n_sents)
            input_data = segment_sents_and_docs(input_data, n_sents, doc_delimiter, model=model, msg=msg)
    # provide warnings for problematic data
    if not "\n\n" in input_data:
        msg.warn("No sentence boundaries found. Use `-s` to automatically segment sentences.")
    if not doc_delimiter in input_data:
        msg.warn("No document delimiters found. Use `-n` to automatically group sentences into documents.")
    output_docs = []
    for doc in input_data.strip().split(delimit_docs):
    for doc in input_data.strip().split(doc_delimiter):
        doc = doc.strip()
        if not doc:
            continue

@@ -21,7 +75,17 @@ def conll_ner2json(input_data, **kwargs):
            if not sent:
                continue
            lines = [line.strip() for line in sent.split("\n") if line.strip()]
            words, tags, chunks, iob_ents = zip(*[line.split() for line in lines])
            cols = list(zip(*[line.split() for line in lines]))
            if len(cols) < 2:
                raise ValueError(
                    "The token-per-line NER file is not formatted correctly. Try checking whitespace and delimiters. See https://spacy.io/api/cli#convert"
                )
            words = cols[0]
            iob_ents = cols[-1]
            if len(cols) > 2:
                tags = cols[1]
            else:
                tags = ["-"] * len(words)
            biluo_ents = iob_to_biluo(iob_ents)
            output_doc.append(
                {

@@ -36,3 +100,47 @@ def conll_ner2json(input_data, **kwargs):
        )
        output_doc = []
    return output_docs


def segment_sents_and_docs(doc, n_sents, doc_delimiter, model=None, msg=None):
    sentencizer = None
    if model:
        nlp = load_model(model)
        if "parser" in nlp.pipe_names:
            msg.info("Segmenting sentences with parser from model '{}'.".format(model))
            sentencizer = nlp.get_pipe("parser")
    if not sentencizer:
        msg.info("Segmenting sentences with sentencizer. (Use `-b model` for improved parser-based sentence segmentation.)")
        nlp = MultiLanguage()
        sentencizer = nlp.create_pipe("sentencizer")
    lines = doc.strip().split("\n")
    words = [line.strip().split()[0] for line in lines]
    nlpdoc = Doc(nlp.vocab, words=words)
    sentencizer(nlpdoc)
    lines_with_segs = []
    sent_count = 0
    for i, token in enumerate(nlpdoc):
        if token.is_sent_start:
            if n_sents and sent_count % n_sents == 0:
                lines_with_segs.append(doc_delimiter)
            lines_with_segs.append("")
            sent_count += 1
        lines_with_segs.append(lines[i])
    return "\n".join(lines_with_segs)


def segment_docs(input_data, n_sents, doc_delimiter):
    sent_delimiter = "\n\n"
    sents = input_data.split(sent_delimiter)
    docs = [sents[i:i+n_sents] for i in range(0, len(sents), n_sents)]
    input_data = ""
    for doc in docs:
        input_data += sent_delimiter + doc_delimiter
        input_data += sent_delimiter.join(doc)
    return input_data


def n_sents_info(msg, n_sents):
    msg.info("Grouping every {} sentences into a document.".format(n_sents))
    if n_sents == 1:
        msg.warn("To generate better training data, you may want to group sentences into documents with `-n 10`.")
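To make the document-grouping behaviour of `segment_docs` concrete, the sketch below (illustrative data only) runs the same logic on five blank-line-separated single-token sentences with `n_sents=2`, which yields three documents of 2, 2 and 1 sentences:

```python
def segment_docs(input_data, n_sents, doc_delimiter):
    # identical to the helper above: prepend a document delimiter before
    # every group of n_sents blank-line-separated sentences
    sent_delimiter = "\n\n"
    sents = input_data.split(sent_delimiter)
    docs = [sents[i:i + n_sents] for i in range(0, len(sents), n_sents)]
    input_data = ""
    for doc in docs:
        input_data += sent_delimiter + doc_delimiter
        input_data += sent_delimiter.join(doc)
    return input_data


sample = "\n\n".join("token{} O".format(i) for i in range(5))  # five one-token sentences
out = segment_docs(sample, 2, "-DOCSTART- -X- O O")
print(out.count("-DOCSTART- -X- O O"))  # 3
```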
@@ -2,17 +2,30 @@
from __future__ import unicode_literals

import re
from wasabi import Printer

from ...gold import iob_to_biluo
from ...util import minibatch
from .conll_ner2json import n_sents_info


def iob2json(input_data, n_sents=10, *args, **kwargs):
    """
    Convert IOB files into JSON format for use with train cli.
    Convert IOB files with one sentence per line and tags separated with '|'
    into JSON format for use with train cli. IOB and IOB2 are accepted.

    Sample formats:

    I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
    I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
    I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
    I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
    """
    sentences = read_iob(input_data.split("\n"))
    docs = merge_sentences(sentences, n_sents)
    msg = Printer()
    docs = read_iob(input_data.split("\n"))
    if n_sents > 0:
        n_sents_info(msg, n_sents)
        docs = merge_sentences(docs, n_sents)
    return docs


@@ -21,7 +34,7 @@ def read_iob(raw_sents):
    for line in raw_sents:
        if not line.strip():
            continue
        tokens = [re.split("[^\w\-]", line.strip())]
        tokens = [t.split('|') for t in line.split()]
        if len(tokens[0]) == 3:
            words, pos, iob = zip(*tokens)
        elif len(tokens[0]) == 2:

@@ -29,7 +42,7 @@ def read_iob(raw_sents):
            pos = ["-"] * len(words)
        else:
            raise ValueError(
                "The iob/iob2 file is not formatted correctly. Try checking whitespace and delimiters."
                "The sentence-per-line IOB/IOB2 file is not formatted correctly. Try checking whitespace and delimiters. See https://spacy.io/api/cli#convert"
            )
        biluo = iob_to_biluo(iob)
        sentences.append(

@@ -40,7 +53,7 @@ def read_iob(raw_sents):
        )
    sentences = [{"tokens": sent} for sent in sentences]
    paragraphs = [{"sentences": [sent]} for sent in sentences]
    docs = [{"id": 0, "paragraphs": [para]} for para in paragraphs]
    docs = [{"id": i, "paragraphs": [para]} for i, para in enumerate(paragraphs)]
    return docs


@@ -50,7 +63,7 @@ def merge_sentences(docs, n_sents):
        group = list(group)
        first = group.pop(0)
        to_extend = first["paragraphs"][0]["sentences"]
        for sent in group[1:]:
        for sent in group:
            to_extend.extend(sent["paragraphs"][0]["sentences"])
        merged.append(first)
    return merged
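Both converters pass the raw IOB tags through `iob_to_biluo` before writing JSON, which is why the example JSON files above contain `U-` and `L-` tags that never appear in the IOB input. A minimal check, mirroring the assertions in the new tests (assumes spaCy is installed):

```python
from spacy.gold import iob_to_biluo

# "I like London and New York City ." with IOB entity tags
iob = ["O", "O", "B-GPE", "O", "B-GPE", "I-GPE", "I-GPE", "O"]
print(iob_to_biluo(iob))
# ['O', 'O', 'U-GPE', 'O', 'B-GPE', 'I-GPE', 'L-GPE', 'O']
```

Note also why the `merge_sentences` fix above matters: `group.pop(0)` already removes the first sentence from the batch, so the old loop over `group[1:]` silently dropped the second sentence of every batch, while iterating over the remaining `group` keeps all of them.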
@ -4,7 +4,7 @@ from __future__ import unicode_literals
|
|||
import pytest
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.cli.converters import conllu2json
|
||||
from spacy.cli.converters import conllu2json, iob2json, conll_ner2json
|
||||
from spacy.cli.pretrain import make_docs
|
||||
|
||||
|
||||
|
@ -32,6 +32,91 @@ def test_cli_converters_conllu2json():
|
|||
assert [t["ner"] for t in tokens] == ["O", "B-PER", "L-PER", "O"]
|
||||
|
||||
|
||||
def test_cli_converters_iob2json():
|
||||
lines = [
|
||||
"I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O",
|
||||
"I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O",
|
||||
"I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O",
|
||||
"I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O",
|
||||
]
|
||||
input_data = "\n".join(lines)
|
||||
converted = iob2json(input_data, n_sents=10)
|
||||
assert len(converted) == 1
|
||||
assert converted[0]["id"] == 0
|
||||
assert len(converted[0]["paragraphs"]) == 1
|
||||
assert len(converted[0]["paragraphs"][0]["sentences"]) == 4
|
||||
for i in range(0, 4):
|
||||
sent = converted[0]["paragraphs"][0]["sentences"][i]
|
||||
assert len(sent["tokens"]) == 8
|
||||
tokens = sent["tokens"]
|
||||
assert [t["orth"] for t in tokens] == ["I", "like", "London", "and", "New", "York", "City", "."]
|
||||
assert [t["ner"] for t in tokens] == ["O", "O", "U-GPE", "O", "B-GPE", "I-GPE", "L-GPE", "O"]
|
||||
|
||||
|
||||
def test_cli_converters_conll_ner2json():
|
||||
lines = [
|
||||
"-DOCSTART- -X- O O",
|
||||
"",
|
||||
"I\tO",
|
||||
"like\tO",
|
||||
"London\tB-GPE",
|
||||
"and\tO",
|
||||
"New\tB-GPE",
|
||||
"York\tI-GPE",
|
||||
"City\tI-GPE",
|
||||
".\tO",
|
||||
"",
|
||||
"I O",
|
||||
"like O",
|
||||
"London B-GPE",
|
||||
"and O",
|
||||
"New B-GPE",
|
||||
"York I-GPE",
|
||||
"City I-GPE",
|
||||
". O",
|
||||
"",
|
||||
"I PRP O",
|
||||
"like VBP O",
|
||||
"London NNP B-GPE",
|
||||
"and CC O",
|
||||
"New NNP B-GPE",
|
||||
"York NNP I-GPE",
|
||||
"City NNP I-GPE",
|
||||
". . O",
|
||||
"",
|
||||
"I PRP _ O",
|
||||
"like VBP _ O",
|
||||
"London NNP _ B-GPE",
|
||||
"and CC _ O",
|
||||
"New NNP _ B-GPE",
|
||||
"York NNP _ I-GPE",
|
||||
"City NNP _ I-GPE",
|
||||
". . _ O",
|
||||
"",
|
||||
"I\tPRP\t_\tO",
|
||||
"like\tVBP\t_\tO",
|
||||
"London\tNNP\t_\tB-GPE",
|
||||
"and\tCC\t_\tO",
|
||||
"New\tNNP\t_\tB-GPE",
|
||||
"York\tNNP\t_\tI-GPE",
|
||||
"City\tNNP\t_\tI-GPE",
|
||||
".\t.\t_\tO",
|
||||
]
|
||||
input_data = "\n".join(lines)
|
||||
converted = conll_ner2json(input_data, n_sents=10)
|
||||
print(converted)
|
||||
assert len(converted) == 1
|
||||
assert converted[0]["id"] == 0
|
||||
assert len(converted[0]["paragraphs"]) == 1
|
||||
assert len(converted[0]["paragraphs"][0]["sentences"]) == 5
|
||||
for i in range(0, 5):
|
||||
sent = converted[0]["paragraphs"][0]["sentences"][i]
|
||||
assert len(sent["tokens"]) == 8
|
||||
tokens = sent["tokens"]
|
||||
assert [t["orth"] for t in tokens] == ["I", "like", "London", "and", "New", "York", "City", "."]
|
||||
assert [t["ner"] for t in tokens] == ["O", "O", "U-GPE", "O", "B-GPE", "I-GPE", "L-GPE", "O"]
|
||||
|
||||
|
||||
def test_pretrain_make_docs():
|
||||
nlp = English()
|
||||
|
||||
|
|
|
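The converter tests added above can be run on their own with pytest; the test-module path below is an assumption (it is not shown in this diff), based on where spaCy keeps its CLI tests:

```
python -m pytest spacy/tests/test_cli.py -k "converters"
```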
@@ -145,6 +145,8 @@ $ python -m spacy convert [input_file] [output_dir] [--file-type] [--converter]
| `--file-type`, `-t` <Tag variant="new">2.1</Tag> | option | Type of file to create (see below). |
| `--converter`, `-c` <Tag variant="new">2</Tag> | option | Name of converter to use (see below). |
| `--n-sents`, `-n` | option | Number of sentences per document. |
| `--seg-sents`, `-s` <Tag variant="new">2.2</Tag> | flag | Segment sentences (for `-c ner`). |
| `--model`, `-b` <Tag variant="new">2.2</Tag> | option | Model for parser-based sentence segmentation (for `-s`). |
| `--morphology`, `-m` | option | Enable appending morphology to tags. |
| `--lang`, `-l` <Tag variant="new">2.1</Tag> | option | Language code (if tokenizer required). |
| `--help`, `-h` | flag | Show help message and available arguments. |
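For example, to convert token-per-line NER data, segment sentences with the parser of the `en` model, and group every ten sentences into a document (file and directory names are placeholders):

```
$ python -m spacy convert ./train.conll ./output -c ner -s -n 10 -b en
```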
@@ -174,10 +176,10 @@ All output files generated by this command are compatible with

| ID | Description |
| ------------------------------ | --------------------------------------------------------------- |
| `auto` | Automatically pick converter based on file extension (default). |
| `auto` | Automatically pick converter based on file extension and file content (default). |
| `conll`, `conllu`, `conllubio` | Universal Dependencies `.conllu` or `.conll` format. |
| `ner` | Tab-based named entity recognition format. |
| `iob` | IOB or IOB2 named entity recognition format. |
| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `|`, either `word|B-ENT` or `word|POS|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |

## Train {#train}