spaCy/extra/example_data/ner_example_data/ner-token-per-line-conll200...

Updates/bugfixes for NER/IOB converters (#4186)

* Updates/bugfixes for NER/IOB converters
* Converter formats `ner` and `iob` use autodetect to choose a converter if possible
* `iob2json` is reverted to handle sentence-per-line data like `word1|pos1|ent1 word2|pos2|ent2`
  * Fix bug in `merge_sentences()` so the second sentence in each batch isn't skipped
* `conll_ner2json` is made more general so it can handle more formats with whitespace-separated columns
  * Supports all formats where the first column is the token and the final column is the IOB tag; if present, the second column is the POS tag
  * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O` separates documents
  * Add option for segmenting sentences (new flag `-s`)
  * Parser-based sentence segmentation with a provided model, otherwise with sentencizer (new option `-b` to specify model)
  * Can group sentences into documents with `n_sents` as long as sentence segmentation is available
  * Only applies automatic segmentation when there are no existing delimiters in the data
* Provide info about settings applied during conversion with warnings and suggestions if settings conflict or might not be optimal
* Add tests for common formats
* Add '(default)' back to docs for -c auto
* Add document count back to output
* Revert changes to converter output message
* Use explicit tabs in convert CLI test data
* Adjust/add messages for n_sents=1 default
* Add sample NER data to training examples
* Update README
* Add links in docs to example NER data
* Define msg within converters
2019-08-29 10:04:01 +00:00
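
The column rules stated in the commit message (first column is the token, final column is the IOB tag, second column is the POS tag if present, blank lines separate sentences, `-DOCSTART-` lines separate documents) can be sketched in plain Python. This is only an illustration of those rules, not spaCy's `conll_ner2json`; the function name `read_conll_ner` and the `Token` tuple layout are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical token layout for this sketch: (text, POS tag or None, IOB tag)
Token = Tuple[str, Optional[str], str]


def read_conll_ner(text: str) -> List[List[List[Token]]]:
    """Return documents -> sentences -> tokens for token-per-line input."""
    docs: List[List[List[Token]]] = []
    sents: List[List[Token]] = []
    sent: List[Token] = []

    def close_sentence() -> None:
        nonlocal sent
        if sent:
            sents.append(sent)
            sent = []

    def close_document() -> None:
        nonlocal sents
        close_sentence()
        if sents:
            docs.append(sents)
            sents = []

    for line in text.splitlines():
        cols = line.split()
        if not cols:
            # Blank line: sentence boundary.
            close_sentence()
        elif cols[0] == "-DOCSTART-":
            # Document delimiter such as "-DOCSTART- -X- O O".
            close_document()
        else:
            # First column is the token, last column is the IOB tag.
            # The second column counts as a POS tag only when the line
            # has more than two columns.
            pos = cols[1] if len(cols) > 2 else None
            sent.append((cols[0], pos, cols[-1]))
    close_document()
    return docs


sample = (
    "-DOCSTART- -X- O O\n"
    "\n"
    "Google NNP _ B-ORG\n"
    "in IN _ O\n"
    "\n"
    "2007 CD _ B-DATE\n"
)
docs = read_conll_ner(sample)
print(len(docs), len(docs[0]))
```

Because only the first, second, and last columns are read, the placeholder `_` column in the data below is ignored. Input without blank lines is returned as a single sentence per document; this sketch does not attempt the automatic segmentation that the commit message describes.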
-DOCSTART- -X- O O
When WRB _ O
Sebastian NNP _ B-PERSON
Thrun NNP _ I-PERSON
started VBD _ O
working VBG _ O
on IN _ O
self NN _ O
- HYPH _ O
driving VBG _ O
cars NNS _ O
at IN _ O
Google NNP _ B-ORG
in IN _ O
2007 CD _ B-DATE
, , _ O
few JJ _ O
people NNS _ O
outside RB _ O
of IN _ O
the DT _ O
company NN _ O
took VBD _ O
him PRP _ O
seriously RB _ O
. . _ O
“ '' _ O
I PRP _ O
can MD _ O
tell VB _ O
you PRP _ O
very RB _ O
senior JJ _ O
CEOs NNS _ O
of IN _ O
major JJ _ O
American JJ _ B-NORP
car NN _ O
companies NNS _ O
would MD _ O
shake VB _ O
my PRP$ _ O
hand NN _ O
and CC _ O
turn VB _ O
away RB _ O
because IN _ O
I PRP _ O
was VBD _ O
nt RB _ O
worth JJ _ O
talking VBG _ O
to IN _ O
, , _ O
” '' _ O
said VBD _ O
Thrun NNP _ B-PERSON
, , _ O
in IN _ O
an DT _ O
interview NN _ O
with IN _ O
Recode NNP _ B-ORG
earlier RBR _ B-DATE
this DT _ I-DATE
week NN _ I-DATE
. . _ O
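
The final column of the listing above holds IOB tags: `B-` opens an entity, `I-` continues it, and `O` marks a token outside any entity. A minimal sketch of grouping such tags into labelled spans follows. The helper name `iob_to_spans` is hypothetical and not part of spaCy, and treating a stray `I-` tag as the start of a new span is an assumption of this sketch.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group IOB tags into (start, end, label) spans; end is exclusive."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        # An I- tag extends the open span only if the label matches.
        continues = prefix == "I" and label == name
        if not continues:
            if label is not None:
                spans.append((start, i, label))
            if prefix in ("B", "I"):
                # Assumption: an I- tag with no matching open span
                # starts a new span instead of raising an error.
                start, label = i, name
            else:
                start, label = None, None
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Tokens and tags taken from the end of the listing above.
tokens = ["with", "Recode", "earlier", "this", "week", "."]
tags = ["O", "B-ORG", "B-DATE", "I-DATE", "I-DATE", "O"]
for start, end, label in iob_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
# prints:
# ORG Recode
# DATE earlier this week
```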