added val loop options

William Falcon 2019-06-27 13:58:13 -04:00
parent e9fca35039
commit e44644e4ba
3 changed files with 71 additions and 47 deletions


@@ -9,15 +9,6 @@ trainer = Trainer(progress_bar=True)
```
---
#### Print which gradients are nan
This option prints a list of tensors with nan gradients.
``` {.python}
# DEFAULT
trainer = Trainer(print_nan_grads=False)
```
---
#### Process position
When running multiple models on the same machine, this option decides which position each model's progress bar occupies.
@@ -29,3 +20,33 @@ trainer = Trainer(process_position=0)
# if this is the second model on the node, show the second progress bar below
trainer = Trainer(process_position=1)
```
---
#### Print which gradients are nan
This option prints a list of tensors with nan gradients.
``` {.python}
# DEFAULT
trainer = Trainer(print_nan_grads=False)
```
---
#### Save a snapshot of all hyperparameters
Whenever you call .save() on the test-tube experiment, it logs all the hyperparameters currently in use.
Give lightning a test-tube Experiment object and it will automate this for you.
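A minimal sketch of what this looks like (assuming the test-tube `Experiment` API and this version's `experiment` Trainer argument; the save path is illustrative):
``` {.python}
from test_tube import Experiment

# create a test-tube Experiment; save_dir is where logs
# and hyperparameter snapshots are written
exp = Experiment(save_dir='/some/path')

# hand the experiment to lightning so it can call exp.save()
# and snapshot all hyperparameters for you
trainer = Trainer(experiment=exp)
```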
---
#### Log metric row every k batches
Every k batches, lightning will make an entry in the metrics log.
``` {.python}
# DEFAULT (ie: add a metrics row every 10 batches)
trainer = Trainer(add_log_row_interval=10)
```
---
#### Write logs file to csv every k batches
Every k batches, lightning will write the new logs to disk.
``` {.python}
# DEFAULT (ie: save a .csv log file every 100 batches)
trainer = Trainer(log_save_interval=100)
```


@@ -16,6 +16,46 @@ trainer.fit(model)
But of course the fun is in all the advanced things it can do:
**Checkpointing**
- Model saving
- Model loading
**Computing cluster (SLURM)**
- Automatic checkpointing
- Automatic saving, loading
- Running grid search on a cluster
- Walltime auto-resubmit
**Debugging**
- [Fast dev run](Debugging/#fast-dev-run)
- [Inspect gradient norms](Debugging/#inspect-gradient-norms)
- [Log GPU usage](Debugging/#log-gpu-usage)
- [Make model overfit on subset of data](Debugging/#make-model-overfit-on-subset-of-data)
- [Print the parameter count by layer](Debugging/#print-the-parameter-count-by-layer)
- [Print which gradients are nan](Debugging/#print-which-gradients-are-nan)
**Distributed training**
- 16-bit mixed precision
- Single-gpu
- Multi-gpu
- Multi-node
**Experiment Logging**
- [Display metrics in progress bar](Logging/#display-metrics-in-progress-bar)
- Log arbitrary metrics
- [Process position](Logging/#process-position)
- [Write logs file to csv every k batches](Logging/#write-logs-file-to-csv-every-k-batches)
- [Log metric row every k batches](Logging/#log-metric-row-every-k-batches)
- Save a snapshot of all hyperparameters
- Save a snapshot of the code for a particular model run
**Training loop**
- [Accumulate gradients](Training%20Loop/#accumulated-gradients)
@@ -32,40 +72,3 @@ But of course the fun is in all the advanced things it can do:
- [Set how much of the test set to check](Validation%20Loop/#set-how-much-of-the-test-set-to-check)
- [Set validation check frequency within 1 training epoch](Validation%20Loop/#set-validation-check-frequency-within-1-training-epoch)
- [Set the number of validation sanity steps](Validation%20Loop/#set-the-number-of-validation-sanity-steps)
**Debugging**
- [Fast dev run](Debugging/#fast-dev-run)
- [Inspect gradient norms](Debugging/#inspect-gradient-norms)
- [Log GPU usage](Debugging/#log-gpu-usage)
- [Make model overfit on subset of data](Debugging/#make-model-overfit-on-subset-of-data)
- [Print the parameter count by layer](Debugging/#print-the-parameter-count-by-layer)
- [Print which gradients are nan](Debugging/#print-which-gradients-are-nan)
**Experiment Logging**
- [Display metrics in progress bar](Logging/#display-metrics-in-progress-bar)
- Log arbitrary metrics
- [Process position](Logging/#process-position)
- Save a snapshot of all hyperparameters
- Save a snapshot of the code for a particular model run
**Distributed training**
- 16-bit mixed precision
- Single-gpu
- Multi-gpu
- Multi-node
**Checkpointing**
- Model saving
- Model loading
**Computing cluster (SLURM)**
- Automatic checkpointing
- Automatic saving, loading
- Running grid search on a cluster
- Walltime auto-resubmit


@@ -45,7 +45,7 @@ class Trainer(TrainerIO):
 accumulate_grad_batches=1,
 enable_early_stop=True, max_nb_epochs=1000, min_nb_epochs=1,
 train_percent_check=1.0, val_percent_check=1.0, test_percent_check=1.0, val_check_interval=0.95,
-log_save_interval=1, add_log_row_interval=1,
+log_save_interval=100, add_log_row_interval=10,
 lr_scheduler_milestones=None,
 use_amp=False,
 print_nan_grads=False,
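For context, a hedged sketch of how the validation-loop options and the new logging defaults from this signature combine (all parameter names and values are taken from the signature above):
``` {.python}
# sketch: validation-loop options plus the new logging defaults
trainer = Trainer(
    val_percent_check=1.0,    # check 100% of the validation set
    val_check_interval=0.95,  # run validation after 95% of each training epoch
    log_save_interval=100,    # write the .csv log to disk every 100 batches
    add_log_row_interval=10,  # add a metrics row every 10 batches
)
```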