18 KiB
18 KiB
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
[Unreleased]
Added
- Updated governance docs
- Added a check to ensure that the metric used for early stopping exists before training commences (#542)
- Added
optimizer_idx
argument tobackward
hook (#733) - Added
entity
argument toWandbLogger
to be passed towandb.init
(#783) - Added a tool for profiling training runs (#782)
- Improved flexibility for naming of TensorBoard logs, can now set
version
to astr
to just save to that directory, and usename=''
to prevent experiment-name directory (#804) - Added option to specify
step
key when logging metrics (#808) - Added
train_dataloader
,val_dataloader
andtest_dataloader
arguments toTrainer.fit()
, for alternative data parsing (#759)
Changed
- Changed default TQDM to use
tqdm.auto
for prettier outputs in IPython notebooks (#752) - Changed
pytorch_lightning.logging
topytorch_lightning.loggers
(#767) - Moved the default
tqdm_dict
definition from Trainer toLightningModule
, so it can be overridden by the user (#749)
Deprecated
- None
Removed
- Removed dependency on pandas (#736)
- Removed dependency on torchvision (#797)
- Removed dependency on scikit-learn (#801)
Fixed
- Fixed a bug where early stopping
on_end_epoch
would be called inconsistently whencheck_val_every_n_epoch == 0
(#743) - Fixed a bug where the model checkpointer didn't write to the same directory as the logger (#771)
- Fixed a bug where the
TensorBoardLogger
class would create an additional empty log file during fitting (#777) - Fixed a bug where
global_step
was advanced incorrectly when usingaccumulate_grad_batches > 1
(#832)
[0.6.0] - 2020-01-21
Added
- Added support for resuming from a specific checkpoint via
resume_from_checkpoint
argument (#516) - Added support for
ReduceLROnPlateau
scheduler (#320) - Added support for Apex mode
O2
in conjunction with Data Parallel (#493) - Added option (
save_top_k
) to save the top k models in theModelCheckpoint
class (#128) - Added
on_train_start
andon_train_end
hooks toModelHooks
(#598) - Added
TensorBoardLogger
(#607) - Added support for weight summary of model with multiple inputs (#543)
- Added
map_location
argument toload_from_metrics
andload_from_checkpoint
(#625) - Added option to disable validation by setting
val_percent_check=0
(#649) - Added
NeptuneLogger
class (#648) - Added
WandbLogger
class (#627)
Changed
- Changed the default progress bar to print to stdout instead of stderr (#531)
- Renamed
step_idx
tostep
,epoch_idx
toepoch
,max_num_epochs
tomax_epochs
andmin_num_epochs
tomin_epochs
(#589) - Renamed
total_batch_nb
tototal_batches
,nb_val_batches
tonum_val_batches
,nb_training_batches
tonum_training_batches
,max_nb_epochs
tomax_epochs
,min_nb_epochs
tomin_epochs
,nb_test_batches
tonum_test_batches
, andnb_val_batches
tonum_val_batches
(#567) - Changed gradient logging to use parameter names instead of indexes (#660)
- Changed the default logger to
TensorBoardLogger
(#609) - Changed the directory for tensorboard logging to be the same as model checkpointing (#706)
Deprecated
- Deprecated
max_nb_epochs
andmin_nb_epochs
(#567) - Deprecated the
on_sanity_check_start
hook inModelHooks
(#598)
Removed
- Removed the
save_best_only
argument fromModelCheckpoint
, usesave_top_k=1
instead (#128)
Fixed
- Fixed a bug which ocurred when using Adagrad with cuda (#554)
- Fixed a bug where training would be on the GPU despite setting
gpus=0
orgpus=[]
(#561) - Fixed an error with
print_nan_gradients
when some parameters do not require gradient (#579) - Fixed a bug where the progress bar would show an incorrect number of total steps during the validation sanity check when using multiple validation data loaders (#597)
- Fixed support for PyTorch 1.1.0 (#552)
- Fixed an issue with early stopping when using a
val_check_interval < 1.0
inTrainer
(#492) - Fixed bugs relating to the
CometLogger
object that would cause it to not work properly (#481) - Fixed a bug that would occur when returning
-1
fromon_batch_start
following an early exit or when the batch wasNone
(#509) - Fixed a potential race condition with several processes trying to create checkpoint directories (#530)
- Fixed a bug where batch 'segments' would remain on the GPU when using
truncated_bptt > 1
(#532) - Fixed a bug when using
IterableDataset
(#547) - Fixed a bug where
.item
was called on non-tensor objects (#602) - Fixed a bug where
Trainer.train
would crash on an uninitialized variable if the trainer was run after resuming from a checkpoint that was already atmax_epochs
(#608) - Fixed a bug where early stopping would begin two epochs early (#617)
- Fixed a bug where
num_training_batches
andnum_test_batches
would sometimes be rounded down to zero (#649) - Fixed a bug where an additional batch would be processed when manually setting
num_training_batches
(#653) - Fixed a bug when batches did not have a
.copy
method (#701) - Fixed a bug when using
log_gpu_memory=True
in Python 3.6 (#715) - Fixed a bug where checkpoint writing could exit before completion, giving incomplete checkpoints (#689)
- Fixed a bug where
on_train_end
was not called when ealy stopping (#723)
[0.5.3] - 2019-11-06
Added
- Added option to disable default logger, checkpointer, and early stopping by passing
logger=False
,checkpoint_callback=False
andearly_stop_callback=False
respectively - Added
CometLogger
for use with Comet.ml - Added
val_check_interval
argument toTrainer
allowing validition to be performed at every given number of batches - Added functionality to save and load hyperparameters using the standard checkpoint mechanism
- Added call to
torch.cuda.empty_cache
before training starts - Added option for user to override the call t
backward
- Added support for truncated backprop through time via the
truncated_bptt_steps
argument inTrainer
- Added option to operate on all outputs from
training_step
in DDP2 - Added a hook for modifying DDP init
- Added a hook for modifying Apex
Changed
- Changed experiment version to be padded with zeros (e.g.
/dir/version_9
becomes/dir/version_0009
) - Changed callback metrics to include any metrics given in logs or progress bar
- Changed the default for
save_best_only
inModelCheckpoint
toTrue
- Added
tng_data_loader
for backwards compatibility - Renamed
MLFlowLogger.client
toMLFlowLogger.experiment
for consistency - Moved
global_step
increment to happen after the batch has been processed - Changed weights restore to first attempt HPC weights before restoring normally, preventing both weights being restored and running out of memory
- Changed progress bar functionality to add multiple progress bars for train/val/test
- Changed calls to
print
to uselogging
instead
Deprecated
- Deprecated
tng_dataloader
Removed
- None
Fixed
- Fixed an issue where the number of batches was off by one during training
- Fixed a bug that occured when setting a ckeckpoint callback and
early_stop_callback=False
- Fixed an error when importing CometLogger
- Fixed a bug where the
gpus
argument had some unexpected behaviour - Fixed a bug where the computed total number of batches was sometimes incorrect
- Fixed a bug where the progress bar would sometimes not show the total number of batches in test mode
- Fixed a bug when using the
log_gpu_memory='min_max'
option inTrainer
- Fixed a bug where checkpointing would sometimes erase the current directory
[0.5.2] - 2019-10-10
Added
- Added
weights_summary
argument toTrainer
to be set tofull
(full summary),top
(just top level modules) or other - Added
tags
argument toMLFlowLogger
Changed
- Changed default for
amp_level
toO1
Deprecated
- None
Removed
- Removed the
print_weights_summary
argument fromTrainer
Fixed
- Fixed a bug where logs were not written properly
- Fixed a bug where
logger.finalize
wasn't called after training is complete - Fixed callback metric errors in DDP
- Fixed a bug where
TestTubeLogger
didn't log to the correct directory
[0.5.1] - 2019-10-05
Added
- Added the
LightningLoggerBase
class for experiment loggers - Added
MLFlowLogger
for logging withmlflow
- Added
TestTubeLogger
for logging withtest_tube
- Added a different implementation of DDP (
distributed_backed='ddp2'
) where every node has one model using all GPUs - Added support for optimisers which require a closure (e.g. LBFGS)
- Added automatic
MASTER_PORT
defualt for DDP when not set manually - Added new GPU memory logging options
'min_max'
(log only the min/max utilization) and'all'
(log all the GPU memory)
Changed
- Changed schedulers to always be called with the current epoch
- Changed
test_tube
to an optional dependency - Changed data loaders to internally use a getter instead of a python property
- Disabled auto GPU loading when restoring weights to prevent out of memory errors
- Changed logging, early stopping and checkpointing to occur by default
Deprecated
- None
Removed
- None
Fixed
- Fixed a bug with samplers that do not specify
set_epoch
- Fixed a bug when using the
MLFlowLogger
with unsupported data types, this will now raise a warning - Fixed a bug where gradient norms were alwasy zero using
track_grad_norm
- Fixed a bug which causes a crash when logging memory
[0.5.0] - 2019-09-26
Added
- None
Changed
- Changed
data_batch
argument tobatch
throughout - Changed
batch_i
argument tobatch_idx
throughout - Changed
tng_dataloader
method totrain_dataloader
- Changed
on_tng_metrics
method toon_training_metrics
- Changed
gradient_clip
argument togradient_clip_val
- Changed
add_log_row_interval
torow_log_interval
Deprecated
- None
Removed
- None
Fixed
- Fixed a bug with tensorboard logging in multi-gpu setup
[0.4.9] - 2019-09-16
Added
- Added the flag
log_gpu_memory
toTrainer
to deactivate logging of GPU memory utilization - Added SLURM resubmit functionality (port from test-tube)
- Added optional weight_save_path to trainer to remove the need for a checkpoint_callback when using cluster training
- Added option to use single gpu per node with
DistributedDataParallel
Changed
- Changed functionality of
validation_end
andtest_end
with multiple dataloaders to be given all of the dataloaders at once rather than in seperate calls - Changed print_nan_grads to only print the parameter value and gradients when they contain NaN
- Changed gpu API to take integers as well (e.g.
gpus=2
instead ofgpus=[0, 1]
) - All models now loaded on to CPU to avoid device and out of memory issues in PyTorch
Deprecated
- None
Removed
- None
Fixed
- Fixed a bug where data types that implement
.to
but not.cuda
would not be properly moved onto the GPU - Fixed a bug where data would not be re-shuffled every epoch when using a
DistributedSampler
[0.4.8] - 2019-08-31
Added
- Added
test_step
andtest_end
methods, used whenTrainer.test
is called - Added
GradientAccumulationScheduler
callback which can be used to schedule changes to the number of accumulation batches - Added option to skip the validation sanity check by setting
nb_sanity_val_steps = 0
Changed
- None
Deprecated
- None
Removed
- None
Fixed
- Fixed a bug when setting
nb_sanity_val_steps = 0
[0.4.7] - 2019-08-24
Added
- None
Changed
- Changed the default
val_check_interval
to1.0
- Changed defaults for
nb_val_batches
,nb_tng_batches
andnb_test_batches
to 0
Deprecated
- None
Removed
- None
Fixed
- Fixed a bug where the full validation set as used despite setting
val_percent_check
- Fixed a bug where an
Exception
was thrown when using a data set containing a single batch - Fixed a bug where an
Exception
was thrown if noval_dataloader
was given - Fixed a bug where tuples were not properly transfered to the GPU
- Fixed a bug where data of a non standard type was not properly handled by the trainer
- Fixed a bug when loading data as a tuple
- Fixed a bug where
AttributeError
could be suppressed by theTrainer
[0.4.6] - 2019-08-15
Added
- Added support for data to be given as a
dict
orlist
with a single gpu - Added support for
configure_optimizers
to return a single optimizer, two list (optimizers and schedulers), or a single list
Changed
- None
Deprecated
- None
Removed
- None
Fixed
- Fixed a bug where returning just an optimizer list (i.e. without schedulers) from
configure_optimizers
would throw anException
[0.4.5] - 2019-08-13
Added
- Added
optimizer_step
method that can be overridden to change the standard optimizer behaviour
Changed
- None
Deprecated
- None
Removed
- None
Fixed
- None
[0.4.4] - 2019-08-12
Added
- Added supoort for multiple validation dataloaders
- Added support for latest test-tube logger (optimised for
torch==1.2.0
)
Changed
validation_step
andval_dataloader
are now optionallr_scheduler
is now activated after epoch
Deprecated
- None
Removed
- None
Fixed
- Fixed a bug where a warning would show when using
lr_scheduler
intorch>1.1.0
- Fixed a bug where an
Exception
would be thrown if usingtorch.DistributedDataParallel
without using aDistributedSampler
, this now throws aWarning
instead
[0.4.3] - 2019-08-10
Added
- None
Changed
- None
Deprecated
- None
Removed
- None
Fixed
- Fixed a bug where accumulate gradients would scale the loss incorrectly
[0.4.2] - 2019-08-08
Added
- None
Changed
- Changed install requirement to
torch==1.2.0
Deprecated
- None
Removed
- None
Fixed
- None
[0.4.1] - 2019-08-08
Added
- None
Changed
- Changed install requirement to
torch==1.1.0
Deprecated
- None
Removed
- None
Fixed
- None
[0.4.0] - 2019-08-08
Added
- Added 16-bit support for a single GPU
- Added support for training continuation (preserves epoch, global step etc.)
Changed
- Changed
training_step
andvalidation_step
, outputs will no longer be automatically reduced
Deprecated
- None
Removed
- Removed need for
Experiment
object inTrainer
Fixed
- Fixed issues with reducing outputs from generative models (such as images and text)
[0.3.6.1] - 2019-07-27
Added
- None
Changed
- None
Deprecated
- None
Removed
- None
Fixed
- Fixed a bug where
Experiment
object was not process safe, potentially causing logs to be overwritten
[0.3.6] - 2019-07-25
Added
- Added a decorator to do lazy data loading internally
Changed
- None
Deprecated
- None
Removed
- None
Fixed
- None