Commit Graph

27 Commits

Author SHA1 Message Date
Adriane Boyd c053f158c5
Add support for floret vectors (#8909)
* Add support for fasttext-bloom hash-only vectors

Overview:

* Extend `Vectors` to have two modes: `default` and `ngram`
  * `default` is the default mode and equivalent to the current
    `Vectors`
  * `ngram` supports the hash-only ngram tables from `fasttext-bloom`
* Extend `spacy.StaticVectors.v2` to handle both modes with no changes
  for `default` vectors
* Extend `spacy init vectors` to support ngram tables

The `ngram` mode **only** supports vector tables produced by this
fork of fastText, which adds an option to represent all vectors using
only the ngram buckets table and which uses the exact same ngram
generation algorithm and hash function (`MurmurHash3_x64_128`).
`fasttext-bloom` produces an additional `.hashvec` table, which can be
loaded by `spacy init vectors --fasttext-bloom-vectors`.

https://github.com/adrianeboyd/fastText/tree/feature/bloom

Implementation details:

* `Vectors` now includes the `StringStore` as `Vectors.strings` so that
  the API can stay consistent for both `default` (which can look up from
  `str` or `int`) and `ngram` (which requires `str` to calculate the
  ngrams).

* In ngram mode `Vectors` uses a default `Vectors` object as a cache
  since the ngram vectors lookups are relatively expensive.

  * The default cache size is the same size as the provided ngram vector
    table.

  * Once the cache is full, no more entries are added. The user is
    responsible for managing the cache in cases where the initial
    documents are not representative of the texts.

  * The cache can be resized by setting `Vectors.ngram_cache_size` or
    cleared with `vectors._ngram_cache.clear()`.

* The API ends up a bit split between methods for `default` and for
  `ngram`, so functions that only make sense for `default` or `ngram`
  include warnings with custom messages suggesting alternatives where
  possible.

* `Vocab.vectors` becomes a property so that the string stores can be
  synced when assigning vectors to a vocab.

* `Vectors` serializes its own config settings as `vectors.cfg`.

* The `Vectors` serialization methods have added support for `exclude`
  so that the `Vocab` can exclude the `Vectors` strings while serializing.

Removed:

* The `minn` and `maxn` options and related code from
  `Vocab.get_vector`, which does not work in a meaningful way for default
  vector tables.

* The unused `GlobalRegistry` in `Vectors`.

* Refactor to use reduce_mean

Refactor to use reduce_mean and remove the ngram vectors cache.

* Rename to floret

* Rename to floret in error messages

* Use --vectors-mode in CLI, vector init

* Fix vectors mode in init

* Remove unused var

* Minor API and docstrings adjustments

* Rename `--vectors-mode` to `--mode` in `init vectors` CLI
* Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support
  both modes.
* Minor updates to Vectors docstrings.

* Update API docs for Vectors and init vectors CLI

* Update types for StaticVectors
2021-10-27 14:08:31 +02:00
Sofie Van Landeghem 733e8ceea9
fix spancat initialize with labels (#8620) 2021-07-06 19:08:25 +02:00
Ines Montani c0926c9088
WIP: Various small training changes (#6818)
* Allow output_path to be None during training

* Fix cat scoring (?)

* Improve error message for weighted None score

* Improve messages

So we can call this in other places etc.

* FIx output path check

* Use latest wasabi

* Revert "Improve error message for weighted None score"

This reverts commit 7059926763.

* Exclude None scores from final score by default

It's otherwise very difficult to keep track of the score weights if we modify a config programmatically, source components etc.

* Update warnings and use logger.warning
2021-01-26 14:51:52 +11:00
Ines Montani 94a5a9814f Update argument handling and documentation 2020-12-08 20:41:18 +11:00
Ines Montani 2c9804038d Fix success message [ci skip] 2020-10-23 16:11:54 +02:00
Sofie Van Landeghem 75a202ce65
TextCat updates and fixes (#6263)
* small fix in example imports

* throw error when train_corpus or dev_corpus is not a string

* small fix in custom logger example

* limit macro_auc to labels with 2 annotations

* fix typo

* also create parents of output_dir if need be

* update documentation of textcat scores

* refactor TextCatEnsemble

* fix tests for new AUC definition

* bump to 3.0.0a42

* update docs

* rename to spacy.TextCatEnsemble.v2

* spacy.TextCatEnsemble.v1 in legacy

* cleanup

* small fix

* update to 3.0.0rc2

* fix import that got lost in merge

* cursed IDE

* fix two typos
2020-10-18 14:50:41 +02:00
Ines Montani f2627157c8 Update docs [ci skip] 2020-10-01 17:38:17 +02:00
Ines Montani 7f68f4bd92 Hide jsonl_loc on init vectors and tidy up [ci skip] 2020-10-01 16:44:17 +02:00
Ines Montani 0a8a124a6e Update docs [ci skip] 2020-10-01 12:15:53 +02:00
Matthew Honnibal 59294e91aa Restore the 'jsonl' arg for init vectors
The lexemes.jsonl file is still used in our English vectors, and it may
be required by users as well. I think it's worth supporting the option.
2020-09-30 19:06:50 +02:00
Ines Montani a5debb356d Tidy up and adjust logging [ci skip] 2020-09-30 01:22:08 +02:00
Ines Montani 43c92ec8c9 Resolve dir for better output [ci skip] 2020-09-29 22:01:04 +02:00
Ines Montani 71a0ee274a Move init labels to init pipeline module 2020-09-29 18:09:33 +02:00
Ines Montani aa2a6882d0 Fix logging 2020-09-29 16:08:39 +02:00
Ines Montani 4925ad760a Add init vectors 2020-09-29 10:58:50 +02:00
Ines Montani a139fe672b Fix typos and refactor CLI logging 2020-09-28 21:17:10 +02:00
Ines Montani 822ea4ef61 Refactor CLI 2020-09-28 15:09:59 +02:00
Ines Montani a89e0ff7cb Fix typo 2020-09-28 12:55:21 +02:00
Ines Montani a62337b3f3 Tidy up vocab init 2020-09-28 12:53:06 +02:00
Ines Montani c22ecc66bb Don't support init path for now 2020-09-28 12:46:28 +02:00
Ines Montani a5f2cc0509 Tidy up and remove raw text (rehearsal) for now 2020-09-28 12:30:13 +02:00
Ines Montani 1590de11b1 Update config 2020-09-28 12:05:23 +02:00
Ines Montani e44a7519cd Update CLI and add [initialize] block 2020-09-28 11:56:14 +02:00
Ines Montani d5155376fd Update vocab init 2020-09-28 11:30:18 +02:00
Ines Montani 8b74fd19df init pipeline -> init nlp 2020-09-28 11:13:38 +02:00
Ines Montani 553bfea641 Fix commands 2020-09-28 10:53:17 +02:00
Matthew Honnibal 44bad1474c Add init_pipeline file 2020-09-28 09:47:34 +02:00