spaCy


adrianeboyd a365359b36 Add convert CLI option to merge CoNLL-U subtokens (#4722)	2020-01-29 17:44:25 +01:00

* Add convert CLI option to merge CoNLL-U subtokens

Add `-T` option to convert CLI that merges CoNLL-U subtokens into one token in the converted data. Each CoNLL-U sentence is read into a `Doc` and the `Retokenizer` is used to merge subtokens with features as follows:

* `orth` is the merged token orth (should correspond to raw text and `# text`)
* `tag` is all subtoken tags concatenated with `_`, e.g. `ADP_DET`
* `pos` is the POS of the syntactic root of the span (as determined by the Retokenizer)
* `morph` is all morphological features merged
* `lemma` is all subtoken lemmas concatenated with ` `, e.g. `de o`
* with `-m` all morphological features are combined with the tag using the separator `__`, e.g. `ADP_DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art`
* `dep` is the dependency relation for the syntactic root of the span (as determined by the Retokenizer)

Concatenated tags will be mapped to the UD POS of the syntactic root (e.g., `ADP`) and the morphological features will be the combined features.

In many cases, the original UD subtokens can be reconstructed from the available features given a language-specific lookup table, e.g., Portuguese `do / ADP_DET / Definite=Def|Gender=Masc|Number=Sing|PronType=Art` is `de / ADP`, `o / DET / Definite=Def|Gender=Masc|Number=Sing|PronType=Art` or lookup rules for forms containing open class words like Spanish `hablarlo / VERB_PRON / Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|VerbForm=Inf`.

* Clean up imports
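
The merge rules listed in the commit message can be illustrated with a small standalone sketch. It does not use spaCy or its `Retokenizer`; the `Token` class, `merge_feats`, `merge_subtokens`, and the `root` and `morph_in_tag` parameters are hypothetical names introduced only for illustration. The index of the syntactic root is passed in by hand, and the choice that a later subtoken wins when two subtokens carry the same feature is an assumption, not something the commit message states.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for one CoNLL-U token; not a spaCy class."""
    orth: str
    lemma: str
    tag: str
    pos: str
    morph: str  # UD FEATS string such as "Gender=Masc|Number=Sing"; "" if none
    dep: str


def merge_feats(feats_strings):
    """Combine several FEATS strings into one, sorted by feature name."""
    merged = {}
    for feats in feats_strings:
        if not feats or feats == "_":
            continue
        for pair in feats.split("|"):
            name, _, value = pair.partition("=")
            merged[name] = value  # assumption: a later subtoken wins on conflict
    ordered = sorted(merged.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


def merge_subtokens(surface, subtokens, root, morph_in_tag=False):
    """Merge subtokens into one token following the rules in the commit message.

    `root` is the index of the syntactic root among the subtokens; here it is
    supplied by the caller rather than computed from a parse.
    """
    morph = merge_feats(t.morph for t in subtokens)
    tag = "_".join(t.tag for t in subtokens)
    if morph_in_tag and morph:
        tag = f"{tag}__{morph}"  # mirrors the `-m` behaviour described above
    head = subtokens[root]
    return Token(
        orth=surface,
        lemma=" ".join(t.lemma for t in subtokens),
        tag=tag,
        pos=head.pos,
        morph=morph,
        dep=head.dep,
    )


# Portuguese "do" = "de" + "o", as in the commit message example.
de = Token("de", "de", "ADP", "ADP", "", "case")
o = Token("o", "o", "DET", "DET",
          "Definite=Def|Gender=Masc|Number=Sing|PronType=Art", "det")

merged = merge_subtokens("do", [de, o], root=0)
merged_m = merge_subtokens("do", [de, o], root=0, morph_in_tag=True)
print(merged.tag, merged.pos, merged.lemma)
print(merged_m.tag)
```

With the Portuguese example, the merged token gets the tag `ADP_DET`, the POS `ADP`, and the lemma `de o`; with `morph_in_tag=True` the tag becomes `ADP_DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art`, matching the strings quoted in the commit message.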
converters	Add convert CLI option to merge CoNLL-U subtokens (#4722)	2020-01-29 17:44:25 +01:00
__init__.py	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
convert.py	Add convert CLI option to merge CoNLL-U subtokens (#4722)	2020-01-29 17:44:25 +01:00
debug_data.py	Report length of dev dataset correctly (#4891)	2020-01-08 16:51:51 +01:00
download.py	Modernize plac commands for Python 3 (#4836)	2020-01-01 13:15:46 +01:00
evaluate.py	Modernize plac commands for Python 3 (#4836)	2020-01-01 13:15:46 +01:00
info.py	Modernize plac commands for Python 3 (#4836)	2020-01-01 13:15:46 +01:00
init_model.py	Modernize plac commands for Python 3 (#4836)	2020-01-01 13:15:46 +01:00
link.py	Modernize plac commands for Python 3 (#4836)	2020-01-01 13:15:46 +01:00
package.py	Modernize plac commands for Python 3 (#4836)	2020-01-01 13:15:46 +01:00
pretrain.py	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
profile.py	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
train.py	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
train_from_config.py	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
validate.py	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00