spaCy/spacy/cli
adrianeboyd a365359b36
Add convert CLI option to merge CoNLL-U subtokens (#4722)
* Add convert CLI option to merge CoNLL-U subtokens

Add `-T` option to convert CLI that merges CoNLL-U subtokens into one
token in the converted data. Each CoNLL-U sentence is read into a `Doc`
and the `Retokenizer` is used to merge subtokens with features as
follows:

* `orth` is the merged token orth (should correspond to raw text and
`# text`)

* `tag` is all subtoken tags concatenated with `_`, e.g. `ADP_DET`

* `pos` is the POS of the syntactic root of the span (as determined by
the Retokenizer)

* `morph` is all morphological features merged

* `lemma` is all subtoken lemmas concatenated with ` `, e.g. `de o`

* with `-m` all morphological features are combined with the tag using
the separator `__`, e.g.
`ADP_DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art`

* `dep` is the dependency relation for the syntactic root of the span
(as determined by the Retokenizer)

A concatenated tag such as `ADP_DET` will be mapped to the UD POS of the
syntactic root (e.g., `ADP`), and its morphological features will be the
combined features of all subtokens.
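
The merge rules above can be sketched in plain Python. This is only an
illustration of the rules as listed, not the converter's code: `SubToken`,
`merge_morph` and `merge_subtokens` are hypothetical names, and the root
selection (first subtoken whose head lies outside the span, or that is its
own head) is a simplifying assumption standing in for what the
`Retokenizer` decides.

```python
from dataclasses import dataclass


@dataclass
class SubToken:
    """Hypothetical container for one CoNLL-U subtoken."""
    lemma: str
    tag: str
    pos: str
    dep: str
    head: int        # sentence-level index of this subtoken's head
    morph: str = ""  # UD FEATS string; empty when there are no features


def merge_morph(feats_list):
    """Combine UD FEATS strings into one feature string sorted by name."""
    values = {}
    for feats in feats_list:
        if not feats or feats == "_":
            continue
        for feature in feats.split("|"):
            name, value = feature.split("=", 1)
            # Assumption: conflicting values are kept as a comma-separated set.
            values.setdefault(name, set()).update(value.split(","))
    return "|".join(
        f"{name}={','.join(sorted(values[name]))}"
        for name in sorted(values, key=str.lower)
    )


def merge_subtokens(orth, subtokens, start, append_morph=False):
    """Build the attributes of one merged token.

    `start` is the sentence-level index of the first subtoken;
    `append_morph` mimics the effect described for `-m`.
    """
    end = start + len(subtokens)
    # Assumption: the root is the first subtoken attached outside the span.
    root = next(
        (t for i, t in enumerate(subtokens, start)
         if t.head == i or not start <= t.head < end),
        subtokens[0],
    )
    tag = "_".join(t.tag for t in subtokens)
    morph = merge_morph(t.morph for t in subtokens)
    if append_morph and morph:
        tag = f"{tag}__{morph}"
    return {
        "orth": orth,
        "tag": tag,
        "pos": root.pos,
        "morph": morph,
        "lemma": " ".join(t.lemma for t in subtokens),
        "dep": root.dep,
    }


# Portuguese "do" = "de" + "o", both attached to a noun at index 2.
de = SubToken(lemma="de", tag="ADP", pos="ADP", dep="case", head=2)
o = SubToken(lemma="o", tag="DET", pos="DET", dep="det", head=2,
             morph="Definite=Def|Gender=Masc|Number=Sing|PronType=Art")
print(merge_subtokens("do", [de, o], start=0, append_morph=True))
```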

In many cases, the original UD subtokens can be reconstructed from the
available features given a language-specific lookup table. For example,
Portuguese `do / ADP_DET /
Definite=Def|Gender=Masc|Number=Sing|PronType=Art` splits into `de /
ADP` and `o / DET / Definite=Def|Gender=Masc|Number=Sing|PronType=Art`.
Forms containing open class words, like Spanish `hablarlo / VERB_PRON /
Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|VerbForm=Inf`,
can be handled with lookup rules instead.
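
A lookup-based reconstruction for the Portuguese example could look roughly
like the sketch below. Nothing like it is part of this change: `SPLIT_TABLE`,
its layout (form, POS, and the feature names each subtoken owns) and
`split_merged_token` are hypothetical and only show the idea.

```python
# Hypothetical language-specific table:
# (merged orth, merged tag) -> [(subtoken form, POS, feature names it owns)].
SPLIT_TABLE = {
    ("do", "ADP_DET"): [
        ("de", "ADP", ()),
        ("o", "DET", ("Definite", "Gender", "Number", "PronType")),
    ],
}


def split_merged_token(orth, tag, morph, table=SPLIT_TABLE):
    """Return [(form, pos, feats)] for a merged token, or None if unknown."""
    entry = table.get((orth.lower(), tag))
    if entry is None:
        return None
    # Parse the combined FEATS string into a name -> value mapping.
    feats = dict(f.split("=", 1) for f in morph.split("|")) if morph else {}
    subtokens = []
    for form, pos, names in entry:
        # Hand each subtoken only the features the table assigns to it.
        own = "|".join(f"{n}={feats[n]}" for n in names if n in feats)
        subtokens.append((form, pos, own))
    return subtokens


print(split_merged_token(
    "do", "ADP_DET", "Definite=Def|Gender=Masc|Number=Sing|PronType=Art"
))
```

Open class forms such as `hablarlo` would not fit a fixed table like this
and would need rules that split off the clitic from an arbitrary verb.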

* Clean up imports
2020-01-29 17:44:25 +01:00
converters Add convert CLI option to merge CoNLL-U subtokens (#4722) 2020-01-29 17:44:25 +01:00
__init__.py Update spaCy for thinc 8.0.0 (#4920) 2020-01-29 17:06:46 +01:00
convert.py Add convert CLI option to merge CoNLL-U subtokens (#4722) 2020-01-29 17:44:25 +01:00
debug_data.py Report length of dev dataset correctly (#4891) 2020-01-08 16:51:51 +01:00
download.py Modernize plac commands for Python 3 (#4836) 2020-01-01 13:15:46 +01:00
evaluate.py Modernize plac commands for Python 3 (#4836) 2020-01-01 13:15:46 +01:00
info.py Modernize plac commands for Python 3 (#4836) 2020-01-01 13:15:46 +01:00
init_model.py Modernize plac commands for Python 3 (#4836) 2020-01-01 13:15:46 +01:00
link.py Modernize plac commands for Python 3 (#4836) 2020-01-01 13:15:46 +01:00
package.py Modernize plac commands for Python 3 (#4836) 2020-01-01 13:15:46 +01:00
pretrain.py Update spaCy for thinc 8.0.0 (#4920) 2020-01-29 17:06:46 +01:00
profile.py Update spaCy for thinc 8.0.0 (#4920) 2020-01-29 17:06:46 +01:00
train.py Update spaCy for thinc 8.0.0 (#4920) 2020-01-29 17:06:46 +01:00
train_from_config.py Update spaCy for thinc 8.0.0 (#4920) 2020-01-29 17:06:46 +01:00
validate.py Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00