spaCy/examples/training
Matthew Honnibal 6c783f8045 Bug fixes and options for TextCategorizer (#3472)
* Fix code for bag-of-words feature extraction

The _ml.py module had a redundant copy of a function to extract unigram
bag-of-words features, except one had a bug that set values to 0.
Another function allowed extraction of bigram features. Replace all three
with a new function that supports arbitrary ngram sizes and also allows
control of which attribute is used (e.g. ORTH, LOWER, etc).

* Support 'bow' architecture for TextCategorizer

This allows efficient ngram bag-of-words models, which are better when
the classifier needs to run quickly, especially when the texts are long.
Pass architecture="bow" to use it. The extra arguments ngram_size and
attr are also available, e.g. ngram_size=2 means unigram and bigram
features will be extracted.

* Fix size limits in train_textcat example

* Explain architectures better in docs
2019-03-23 16:44:44 +01:00
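
The n-gram bag-of-words idea described in the commit message above can be sketched in plain Python. This is an illustrative sketch only, not the function in spaCy's _ml.py: the name extract_ngrams, its signature, and the use of a callable for attr are assumptions made for the example. It shows how ngram_size=2 yields both unigram and bigram features, and how the choice of attribute (here lowercasing, loosely analogous to LOWER) changes which tokens count as the same feature.

```python
from collections import Counter


def extract_ngrams(tokens, ngram_size=1, attr=str.lower):
    """Count every n-gram of length 1..ngram_size (hypothetical helper).

    `attr` is a callable applied to each token before counting; it stands in
    for the choice of token attribute (e.g. the exact text vs. the lowercase
    form). This is an assumption for illustration, not spaCy's API.
    """
    keys = [attr(token) for token in tokens]
    counts = Counter()
    for n in range(1, ngram_size + 1):
        # Slide a window of width n across the normalized tokens.
        for start in range(len(keys) - n + 1):
            counts[tuple(keys[start:start + n])] += 1
    return counts


# ngram_size=2 extracts unigrams and bigrams together.
features = extract_ngrams(["The", "cat", "saw", "the", "dog"], ngram_size=2)
```

With lowercasing, "The" and "the" fall into the same unigram feature; passing attr=lambda t: t instead keeps them distinct, which roughly mirrors the difference between a LOWER-style and an ORTH-style attribute.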
..
conllu.py Remove unused cytoolz / itertools imports 2018-12-03 02:12:07 +01:00
ner_multitask_objective.py Auto-format examples 2018-12-02 04:26:26 +01:00
pretrain_textcat.py Auto-format examples 2018-12-02 04:26:26 +01:00
rehearsal.py Update rehearsal example 2019-02-24 16:17:41 +01:00
train_intent_parser.py Auto-format examples 2018-12-02 04:26:26 +01:00
train_ner.py Test and update examples [ci skip] 2019-03-16 14:15:49 +01:00
train_new_entity_type.py Test and update examples [ci skip] 2019-03-16 14:15:49 +01:00
train_parser.py Test and update examples [ci skip] 2019-03-16 14:15:49 +01:00
train_tagger.py Test and update examples [ci skip] 2019-03-16 14:15:49 +01:00
train_textcat.py Bug fixes and options for TextCategorizer (#3472) 2019-03-23 16:44:44 +01:00
training-data.json Update Example input JSON file to adhere to specification. (#3243) 2019-02-07 16:18:01 +01:00
vocab-data.jsonl Use even smaller example size 2017-10-30 19:46:45 +01:00