spaCy

Matthew Honnibal 6c783f8045 Bug fixes and options for TextCategorizer (#3472)		2019-03-23 16:44:44 +01:00

* Fix code for bag-of-words feature extraction

The _ml.py module had a redundant copy of a function to extract unigram bag-of-words features, except one had a bug that set values to 0. Another function allowed extraction of bigram features. Replace all three with a new function that supports arbitrary ngram sizes and also allows control of which attribute is used (e.g. ORTH, LOWER, etc).

* Support 'bow' architecture for TextCategorizer

This allows efficient ngram bag-of-words models, which are better when the classifier needs to run quickly, especially when the texts are long. Pass architecture="bow" to use it. The extra arguments ngram_size and attr are also available, e.g. ngram_size=2 means unigram and bigram features will be extracted.

* Fix size limits in train_textcat example

* Explain architectures better in docs
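
To illustrate what "ngram_size=2 means unigram and bigram features" amounts to, here is a minimal pure-Python sketch. It is not spaCy's implementation: the function name `bow_features` and the `lowercase` flag (a rough stand-in for choosing LOWER rather than ORTH as the attribute) are hypothetical and exist only for this example.

```python
from collections import Counter


def bow_features(tokens, ngram_size=1, lowercase=True):
    """Count every n-gram of length 1..ngram_size in a list of token strings.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    if ngram_size < 1:
        raise ValueError("ngram_size must be at least 1")
    if lowercase:
        # Rough analogue of using the LOWER attribute instead of ORTH.
        tokens = [token.lower() for token in tokens]
    counts = Counter()
    for n in range(1, ngram_size + 1):
        # Slide a window of width n over the token sequence.
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


# With ngram_size=2, both unigrams and bigrams are counted.
features = bow_features(["The", "cat", "sat"], ngram_size=2)
```

Counting every size up to `ngram_size`, rather than only the largest, matches the commit's description that a size of 2 yields both unigram and bigram features.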
annotation.md	Don't auto-slugify accordion links [ci skip]	2019-03-12 15:30:49 +01:00
cli.md	Expose batch size and length caps on CLI for pretrain (#3417)	2019-03-16 21:38:45 +01:00
cython-classes.md	💫 Update website (#3285)	2019-02-17 19:31:19 +01:00
cython-structs.md	💫 Update website (#3285)	2019-02-17 19:31:19 +01:00
cython.md	💫 Update website (#3285)	2019-02-17 19:31:19 +01:00
dependencyparser.md	💫 Make serialization methods consistent (#3385)	2019-03-10 19:16:45 +01:00
doc.md	Add Doc.lang and Doc.lang_	2019-03-11 14:21:40 +01:00
entityrecognizer.md	💫 Make serialization methods consistent (#3385)	2019-03-10 19:16:45 +01:00
entityruler.md	Tidy up and improve docs and docstrings (#3370)	2019-03-08 11:42:26 +01:00
goldcorpus.md	💫 Update website (#3285)	2019-02-17 19:31:19 +01:00
goldparse.md	Auto-format [ci skip]	2019-02-27 12:07:35 +01:00
index.md	💫 Update website (#3285)	2019-02-17 19:31:19 +01:00
language.md	💫 Allow passing of config parameters to specific pipeline components (#3386)	2019-03-10 23:36:47 +01:00
lemmatizer.md	💫 Update website (#3285)	2019-02-17 19:31:19 +01:00
lexeme.md	💫 Update website (#3285)	2019-02-17 19:31:19 +01:00
matcher.md	Remove n_threads	2019-02-17 22:25:42 +01:00
phrasematcher.md	Remove n_threads	2019-02-17 22:25:42 +01:00
pipeline-functions.md	Tidy up and improve docs and docstrings (#3370)	2019-03-08 11:42:26 +01:00
sentencizer.md	💫 Add better and serializable sentencizer (#3471)	2019-03-23 15:45:02 +01:00
span.md	Update Span.__init__ docs (see #3445) [ci skip]	2019-03-20 17:24:17 +01:00
stringstore.md	💫 Make serialization methods consistent (#3385)	2019-03-10 19:16:45 +01:00
tagger.md	💫 Make serialization methods consistent (#3385)	2019-03-10 19:16:45 +01:00
textcategorizer.md	Bug fixes and options for TextCategorizer (#3472)	2019-03-23 16:44:44 +01:00
token.md	Auto-format [ci skip]	2019-03-11 17:10:50 +01:00
tokenizer.md	💫 Make serialization methods consistent (#3385)	2019-03-10 19:16:45 +01:00
top-level.md	Document new API [ci skip]	2019-03-11 15:23:53 +01:00
vectors.md	Update Vectors.find docs [ci skip]	2019-03-16 17:10:57 +01:00
vocab.md	Document new API [ci skip]	2019-03-11 15:23:53 +01:00