Add save after `--save-every` batches for `spacy pretrain` (#3510)

<!--- Provide a general summary of your changes in the title. -->

When using `spacy pretrain`, the model is saved only after every epoch. But each epoch can be very big since `pretrain` is used for language modeling tasks. So I added a `--save-every` option in the CLI to save after every `--save-every` batches.

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

To test...

Save this file to `sample_sents.jsonl`

```
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
{"text": "hello there."}
```

Then run `--save-every 2` when pretraining.

```bash
spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2
```

And it should save the model to the `here/` folder after every 2 batches. The models that are saved during an epoch will have a `.temp` appended to the save name.

At the end the training, you should see these files (`ls here/`):

```bash
config.json     model2.bin      model5.bin      model8.bin
log.jsonl       model2.temp.bin model5.temp.bin model8.temp.bin
model0.bin      model3.bin      model6.bin      model9.bin
model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin
model1.bin      model4.bin      model7.bin
model1.temp.bin model4.temp.bin model7.temp.bin
```

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

This is a new feature to `spacy pretrain`.

🌵 **Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error).** 

```
Processing matcher.pyx
[Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx'
Traceback (most recent call last):
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module>
    run(args.root)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run
    process(base, filename, db)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process
    preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd
    func(*args)
  File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx
    raise Exception("Cython failed")
Exception: Cython failed
Traceback (most recent call last):
  File "setup.py", line 276, in <module>
    setup_package()
  File "setup.py", line 209, in setup_package
    generate_cython(root, "spacy")
  File "setup.py", line 132, in generate_cython
    raise RuntimeError("Running cythonize failed")
RuntimeError: Running cythonize failed
```

Edit: Fixed! after deleting all `.cpp` files: `find spacy -name "*.cpp" | xargs rm`

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
This commit is contained in:
Motoki Wu 2019-04-22 05:10:16 -07:00 committed by Ines Montani
parent 189c90743c
commit 8e2cef49f3
2 changed files with 29 additions and 16 deletions

View File

@ -34,7 +34,8 @@ from .. import util
max_length=("Max words per example.", "option", "xw", int), max_length=("Max words per example.", "option", "xw", int),
min_length=("Min words per example.", "option", "nw", int), min_length=("Min words per example.", "option", "nw", int),
seed=("Seed for random number generators", "option", "s", float), seed=("Seed for random number generators", "option", "s", float),
nr_iter=("Number of iterations to pretrain", "option", "i", int), n_iter=("Number of iterations to pretrain", "option", "i", int),
n_save_every=("Save model every X batches.", "option", "se", int),
) )
def pretrain( def pretrain(
texts_loc, texts_loc,
@ -46,11 +47,12 @@ def pretrain(
loss_func="cosine", loss_func="cosine",
use_vectors=False, use_vectors=False,
dropout=0.2, dropout=0.2,
nr_iter=1000, n_iter=1000,
batch_size=3000, batch_size=3000,
max_length=500, max_length=500,
min_length=5, min_length=5,
seed=0, seed=0,
n_save_every=None,
): ):
""" """
Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components, Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
@ -115,9 +117,26 @@ def pretrain(
msg.divider("Pre-training tok2vec layer") msg.divider("Pre-training tok2vec layer")
row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")} row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")}
msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings) msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings)
for epoch in range(nr_iter):
for batch in util.minibatch_by_words( def _save_model(epoch, is_temp=False):
((text, None) for text in texts), size=batch_size is_temp_str = ".temp" if is_temp else ""
with model.use_params(optimizer.averages):
with (output_dir / ("model%d%s.bin" % (epoch, is_temp_str))).open(
"wb"
) as file_:
file_.write(model.tok2vec.to_bytes())
log = {
"nr_word": tracker.nr_word,
"loss": tracker.loss,
"epoch_loss": tracker.epoch_loss,
"epoch": epoch,
}
with (output_dir / "log.jsonl").open("a") as file_:
file_.write(srsly.json_dumps(log) + "\n")
for epoch in range(n_iter):
for batch_id, batch in enumerate(
util.minibatch_by_words(((text, None) for text in texts), size=batch_size)
): ):
docs = make_docs( docs = make_docs(
nlp, nlp,
@ -133,17 +152,9 @@ def pretrain(
msg.row(progress, **row_settings) msg.row(progress, **row_settings)
if texts_loc == "-" and tracker.words_per_epoch[epoch] >= 10 ** 7: if texts_loc == "-" and tracker.words_per_epoch[epoch] >= 10 ** 7:
break break
with model.use_params(optimizer.averages): if n_save_every and (batch_id % n_save_every == 0):
with (output_dir / ("model%d.bin" % epoch)).open("wb") as file_: _save_model(epoch, is_temp=True)
file_.write(model.tok2vec.to_bytes()) _save_model(epoch)
log = {
"nr_word": tracker.nr_word,
"loss": tracker.loss,
"epoch_loss": tracker.epoch_loss,
"epoch": epoch,
}
with (output_dir / "log.jsonl").open("a") as file_:
file_.write(srsly.json_dumps(log) + "\n")
tracker.epoch_loss = 0.0 tracker.epoch_loss = 0.0
if texts_loc != "-": if texts_loc != "-":
# Reshuffle the texts if texts were loaded from a file # Reshuffle the texts if texts were loaded from a file

View File

@ -285,6 +285,7 @@ improvement.
```bash ```bash
$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width] $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width]
[--depth] [--embed-rows] [--dropout] [--seed] [--n-iter] [--use-vectors] [--depth] [--embed-rows] [--dropout] [--seed] [--n-iter] [--use-vectors]
[--n-save_every]
``` ```
| Argument | Type | Description | | Argument | Type | Description |
@ -302,6 +303,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width]
| `--seed`, `-s` | option | Seed for random number generators. | | `--seed`, `-s` | option | Seed for random number generators. |
| `--n-iter`, `-i` | option | Number of iterations to pretrain. | | `--n-iter`, `-i` | option | Number of iterations to pretrain. |
| `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. | | `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. |
| `--n-save_every`, `-se` | option | Save model every X batches. |
| **CREATES** | weights | The pre-trained weights that can be used to initialize `spacy train`. | | **CREATES** | weights | The pre-trained weights that can be used to initialize `spacy train`. |
### JSONL format for raw text {#pretrain-jsonl} ### JSONL format for raw text {#pretrain-jsonl}