Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
[unreleased] - YYYY-MM-DD
Added
Changed
- Changed epoch indexing to start from 0 instead of 1 (#2289)
Deprecated
Removed
Fixed
- Fixed parsing TPU arguments and TPU tests (#2094)
- Fixed number of batches in case of multiple dataloaders and limit_{*}_batches (#1920, #2226)
- Fixed an issue with forward hooks not being removed after model summary (#2298)
- Fixed ROC metric for CUDA tensors (#2304)
- Fixed average_precision metric (#2319)
[0.8.1] - 2020-06-19
Fixed
- Fixed the load_from_checkpoint path detected as URL bug (#2244)
- Fixed hooks - added barrier (#2245, #2257, #2260)
- Fixed hparams - remove frame inspection on self.hparams (#2253)
- Fixed setup and on fit calls (#2252)
- Fixed GPU template (#2255)
[0.8.0] - 2020-06-18
Added
- Added overfit_batches, limit_{val|test}_batches flags (overfit now uses training set for all three) (#2213)
- Added metrics
- Added type hints in Trainer.fit() and Trainer.test() to reflect that also a list of dataloaders can be passed in (#1723)
- Allow dataloaders without sampler field present (#1907)
- Added option save_last to save the model at the end of every epoch in ModelCheckpoint (#1908)
- Early stopping checks on_validation_end (#1458)
- Attribute best_model_path to ModelCheckpoint for storing and later retrieving the path to the best saved model file (#1799)
- Speed up single-core TPU training by loading data using ParallelLoader (#2033)
- Added a model hook transfer_batch_to_device that enables moving custom data structures to the target device (#1756)
- Added black formatter for the code with code-checker on pull (#1610)
- Added back the slow spawn ddp implementation as ddp_spawn (#2115)
- Added loading checkpoints from URLs (#1667)
- Added a callback method on_keyboard_interrupt for handling KeyboardInterrupt events during training (#2134)
- Added a decorator auto_move_data that moves data to the correct device when using the LightningModule for inference (#1905)
- Added ckpt_path option to LightningModule.test(...) to load particular checkpoint (#2190)
- Added setup and teardown hooks for model (#2229)
Changed
- Allow user to select individual TPU core to train on (#1729)
- Removed non-finite values from loss in LRFinder (#1862)
- Allow passing model hyperparameters as complete kwarg list (#1896)
- Renamed ModelCheckpoint's attributes best to best_model_score and kth_best_model to kth_best_model_path (#1799)
- Re-Enable Logger's ImportErrors (#1938)
- Changed the default value of the Trainer argument weights_summary from full to top (#2029)
- Raise an error when lightning replaces an existing sampler (#2020)
- Enabled prepare_data from correct processes - clarify local vs global rank (#2166)
- Remove explicit flush from tensorboard logger (#2126)
- Changed epoch indexing to start from 1 instead of 0 (#2206)
Deprecated
- Deprecated flags: (#2213)
  - overfit_pct in favour of overfit_batches
  - val_percent_check in favour of limit_val_batches
  - test_percent_check in favour of limit_test_batches
- Deprecated ModelCheckpoint's attributes best and kth_best_model (#1799)
- Dropped official support/testing for older PyTorch versions <1.3 (#1917)
- Deprecated Trainer proc_rank in favour of global_rank (#2166, #2269)
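The flag deprecations in this release amount to a rename table. The following is a minimal sketch of applying that table to a dictionary of Trainer keyword arguments. `migrate_trainer_kwargs` and `DEPRECATED_FLAGS` are hypothetical names introduced for illustration, not part of the library, and the sketch assumes the old values can be passed unchanged under the new names.

```python
import warnings

# Old flag name -> replacement, as listed under the deprecated flags (#2213).
DEPRECATED_FLAGS = {
    "overfit_pct": "overfit_batches",
    "val_percent_check": "limit_val_batches",
    "test_percent_check": "limit_test_batches",
}


def migrate_trainer_kwargs(kwargs):
    """Return a copy of kwargs with deprecated flag names replaced (hypothetical helper)."""
    migrated = {}
    for name, value in kwargs.items():
        new_name = DEPRECATED_FLAGS.get(name, name)
        if new_name != name:
            warnings.warn(f"{name} is deprecated, use {new_name}", DeprecationWarning)
        if new_name in migrated:
            # Both the old and the new spelling were supplied; refuse to guess.
            raise ValueError(f"{new_name} was given more than once")
        migrated[new_name] = value
    return migrated
```

The result can then be unpacked into the constructor, for example `Trainer(**migrate_trainer_kwargs(old_kwargs))`. Unknown keys pass through untouched.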
Removed
- Removed unintended Trainer argument progress_bar_callback, the callback should be passed in by Trainer(callbacks=[...]) instead (#1855)
- Removed obsolete self._device in Trainer (#1849)
- Removed deprecated API (#2073)
  - Packages: pytorch_lightning.pt_overrides, pytorch_lightning.root_module
  - Modules: pytorch_lightning.logging.comet_logger, pytorch_lightning.logging.mlflow_logger, pytorch_lightning.logging.test_tube_logger, pytorch_lightning.overrides.override_data_parallel, pytorch_lightning.core.model_saving, pytorch_lightning.core.root_module
  - Trainer arguments: add_row_log_interval, default_save_path, gradient_clip, nb_gpu_nodes, max_nb_epochs, min_nb_epochs, nb_sanity_val_steps
  - Trainer attributes: nb_gpu_nodes, num_gpu_nodes, gradient_clip, max_nb_epochs, min_nb_epochs, nb_sanity_val_steps, default_save_path, tng_tqdm_dic
Fixed
- Run graceful training teardown on interpreter exit (#1631)
- Fixed user warning when apex was used together with learning rate schedulers (#1873)
- Fixed multiple calls of EarlyStopping callback (#1863)
- Fixed an issue with Trainer.from_argparse_args when passing in unknown Trainer args (#1932)
- Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
- Fixed root node resolution for SLURM cluster with dash in host name (#1954)
- Fixed LearningRateLogger in multi-scheduler setting (#1944)
- Fixed test configuration check and testing (#1804)
- Fixed an issue with Trainer constructor silently ignoring unknown/misspelled arguments (#1820)
- Fixed save_weights_only in ModelCheckpoint (#1780)
- Allow use of same WandbLogger instance for multiple training loops (#2055)
- Fixed an issue with _auto_collect_arguments collecting local variables that are not constructor arguments and not working for signatures that have the instance not named self (#2048)
- Fixed mistake in parameters' grad norm tracking (#2012)
- Fixed CPU and hanging GPU crash (#2118)
- Fixed an issue with the model summary and example_input_array depending on a specific ordering of the submodules in a LightningModule (#1773)
- Fixed TPU logging (#2230)
- Fixed Pid port + duplicate rank_zero logging (#2140, #2231)
[0.7.6] - 2020-05-16
Added
- Added callback for logging learning rates (#1498)
- Added transfer learning example (for a binary classification task in computer vision) (#1564)
- Added type hints in Trainer.fit() and Trainer.test() to reflect that also a list of dataloaders can be passed in (#1723)
- Added auto scaling of batch size (#1638)
- The progress bar metrics now also get updated in training_epoch_end (#1724)
- Enable NeptuneLogger to work with distributed_backend=ddp (#1753)
- Added option to provide seed to random generators to ensure reproducibility (#1572)
- Added override for hparams in load_from_ckpt (#1797)
- Added support multi-node distributed execution under torchelastic (#1811, #1818)
- Added using store_true for bool args (#1822, #1842)
- Added dummy logger for internally disabling logging for some features (#1836)
Changed
- Enable non-blocking for device transfers to GPU (#1843)
- Replace meta_tags.csv with hparams.yaml (#1271)
- Reduction when batch_size < num_gpus (#1609)
- Updated LightningTemplateModel to look more like Colab example (#1577)
- Don't convert namedtuple to tuple when transferring the batch to target device (#1589)
- Allow passing hparams as keyword argument to LightningModule when loading from checkpoint (#1639)
- Args should come after the last positional argument (#1807)
- Made ddp the default if no backend specified with multiple GPUs (#1789)
Deprecated
- Deprecated tags_csv in favor of hparams_file (#1271)
- Deprecated amp_level in favor of native AMP (#1561)
Fixed
- Fixed broken link in PR template (#1675)
- Fixed ModelCheckpoint not None checking filepath (#1654)
- Trainer now calls on_load_checkpoint() when resuming from a checkpoint (#1666)
- Fixed sampler logic for ddp with iterable dataset (#1734)
- Fixed _reset_eval_dataloader() for IterableDataset (#1560)
- Fixed Horovod distributed backend to set the root_gpu property (#1669)
- Fixed wandb logger global_step affects other loggers (#1492)
- Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
- Fixed bugs that prevent lr finder to be used together with early stopping and validation dataloaders (#1676)
- Fixed a bug in Trainer that prepended the checkpoint path with version_ when it shouldn't (#1748)
- Fixed lr key name in case of param groups in LearningRateLogger (#1719)
- Fixed saving native AMP scaler state (introduced in #1561)
- Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
- Fixed num processes wasn't being set properly and auto sampler was ddp failing (#1819)
- Fixed bugs in semantic segmentation example (#1824)
- Fixed saving native AMP scaler state (#1561, #1777)
- Fixed native amp + ddp (#1788)
- Fixed hparam logging with metrics (#1647)
[0.7.5] - 2020-04-27
Changed
- Allow logging of metrics together with hparams (#1630)
Removed
- Removed Warning from trainer loop (#1634)
Fixed
- Fixed ModelCheckpoint not being fixable (#1632)
- Fixed CPU DDP breaking change and DDP change (#1635)
- Tested pickling (#1636)
[0.7.4] - 2020-04-26
Added
- Added flag replace_sampler_ddp to manually disable sampler replacement in DDP (#1513)
- Added speed parity tests (max 1 sec difference per epoch) (#1482)
- Added auto_select_gpus flag to trainer that enables automatic selection of available GPUs on exclusive mode systems.
- Added learning rate finder (#1347)
- Added support for ddp mode in clusters without SLURM (#1387)
- Added test_dataloaders parameter to Trainer.test() (#1434)
- Added terminate_on_nan flag to trainer that performs a NaN check with each training iteration when set to True (#1475)
- Added ddp_cpu backend for testing ddp without GPUs (#1158)
- Added Horovod support as a distributed backend Trainer(distributed_backend='horovod') (#1529)
- Added support for 8 core distributed training on Kaggle TPU's (#1568)
- Added support for native AMP (#1561, #1580)
Changed
- Changed the default behaviour to no longer include a NaN check with each training iteration. (#1475)
- Decoupled the progress bar from the trainer; it is a callback now and can be customized or even replaced entirely (#1450).
- Changed lr schedule step interval behavior to update every backwards pass instead of every forwards pass (#1477)
- Defines shared proc. rank, remove rank from instances (e.g. loggers) (#1408)
- Updated semantic segmentation example with custom U-Net and logging (#1371)
- Disabled val and test shuffling (#1600)
Deprecated
- Deprecated training_tqdm_dict in favor of progress_bar_dict (#1450).
Removed
- Removed test_dataloaders parameter from Trainer.fit() (#1434)
Fixed
- Added the possibility to pass nested metrics dictionaries to loggers (#1582)
- Fixed memory leak from opt return (#1528)
- Fixed saving checkpoint before deleting old ones (#1453)
- Fixed loggers - flushing last logged metrics even before continue, e.g. trainer.test() results (#1459)
- Fixed optimizer configuration when configure_optimizers returns dict without lr_scheduler (#1443)
- Fixed LightningModule - mixing hparams and arguments in LightningModule.__init__() crashes load_from_checkpoint() (#1505)
- Added a missing call to the on_before_zero_grad model hook (#1493).
- Allow use of sweeps with WandbLogger (#1512)
- Fixed a bug that caused the callbacks Trainer argument to reference a global variable (#1534).
- Fixed a bug that set all boolean CLI arguments from Trainer.add_argparse_args always to True (#1571)
- Fixed do not copy the batch when training on a single GPU (#1576, #1579)
- Fixed soft checkpoint removing on DDP (#1408)
- Fixed automatic parser bug (#1585)
- Fixed bool conversion from string (#1606)
[0.7.3] - 2020-04-09
Added
- Added rank_zero_warn for warning only in rank 0 (#1428)
Fixed
- Fixed default DistributedSampler for DDP training (#1425)
- Fixed workers warning not on windows (#1430)
- Fixed returning tuple from run_training_batch (#1431)
- Fixed gradient clipping (#1438)
- Fixed pretty print (#1441)
[0.7.2] - 2020-04-07
Added
- Added same step loggers' metrics aggregation (#1278)
- Added parity test between a vanilla MNIST model and lightning model (#1284)
- Added parity test between a vanilla RNN model and lightning model (#1351)
- Added Reinforcement Learning - Deep Q-network (DQN) lightning example (#1232)
- Added support for hierarchical dict (#1152)
- Added TrainsLogger class (#1122)
- Added type hints to pytorch_lightning.core (#946)
- Added support for IterableDataset in validation and testing (#1104)
- Added support for non-primitive types in hparams for TensorboardLogger (#1130)
- Added a check that stops the training when loss or weights contain NaN or inf values. (#1097)
- Added support for IterableDataset when val_check_interval=1.0 (default), this will trigger validation at the end of each epoch. (#1283)
- Added summary method to Profilers. (#1259)
- Added informative errors if user defined dataloader has zero length (#1280)
- Added testing for python 3.8 (#915)
- Added a training_epoch_end method which is the mirror of validation_epoch_end. (#1357)
- Added model configuration checking (#1199)
- Added support for optimizer frequencies through LightningModule.configure_optimizers() (#1269)
- Added option to run without an optimizer by returning None from configure_optimizers. (#1279)
- Added a warning when the number of data loader workers is small. (#1378)
Changed
- Changed (renamed and refactored) TensorRunningMean -> TensorRunningAccum: running accumulations were generalized. (#1278)
- Changed progress_bar_refresh_rate trainer flag to disable progress bar when set to 0. (#1108)
- Enhanced load_from_checkpoint to also forward params to the model (#1307)
- Updated references to self.forward() to instead use the __call__ interface. (#1211)
- Changed default behaviour of configure_optimizers to use no optimizer rather than Adam. (#1279)
- Allow to upload models on W&B (#1339)
- On DP and DDP2 unsqueeze is automated now (#1319)
- Did not always create a DataLoader during reinstantiation, but the same type as before (if subclass of DataLoader) (#1346)
- Did not interfere with a default sampler (#1318)
- Remove default Adam optimizer (#1317)
- Give warnings for unimplemented required lightning methods (#1317)
- Made evaluate method private >> Trainer._evaluate(...). (#1260)
- Simplify the PL examples structure (shallower and more readable) (#1247)
- Changed min max gpu memory to be on their own plots (#1358)
- Remove .item which causes sync issues (#1254)
- Changed smoothing in TQDM to decrease variability of time remaining between training / eval (#1194)
- Change default logger to dedicated one (#1064)
Deprecated
- Deprecated Trainer argument print_nan_grads (#1097)
- Deprecated Trainer argument show_progress_bar (#1108)
Removed
- Removed test for no test dataloader in .fit (#1495)
- Removed duplicated module pytorch_lightning.utilities.arg_parse for loading CLI arguments (#1167)
- Removed wandb logger's finalize method (#1193)
- Dropped torchvision dependency in tests and added own MNIST dataset class instead (#986)
Fixed
- Fixed model_checkpoint when saving all models (#1359)
- Trainer.add_argparse_args classmethod fixed. Now it adds a type for the arguments (#1147)
- Fixed bug related to type checking of ReduceLROnPlateau lr schedulers (#1126)
- Fixed a bug to ensure lightning checkpoints to be backward compatible (#1132)
- Fixed a bug that created an extra dataloader with active reload_dataloaders_every_epoch (#1196)
- Fixed all warnings and errors in the docs build process (#1191)
- Fixed an issue where val_percent_check=0 would not disable validation (#1251)
- Fixed average of incomplete TensorRunningMean (#1309)
- Fixed WandbLogger.watch with wandb.init() (#1311)
- Fixed an issue with early stopping that would prevent it from monitoring training metrics when validation is disabled / not implemented (#1235).
- Fixed a bug that would cause trainer.test() to run on the validation set when overloading validation_epoch_end and test_end (#1353)
- Fixed WandbLogger.watch - use of the watch method without importing wandb (#1311)
- Fixed WandbLogger to be used with 'ddp' - allow reinits in sub-processes (#1149, #1360)
- Made training_epoch_end behave like validation_epoch_end (#1357)
- Fixed fast_dev_run running validation twice (#1365)
- Fixed pickle error from quick patch __code__ (#1352)
- Fixed memory leak on GPU0 (#1094, #1349)
- Fixed checkpointing interval (#1272)
- Fixed validation and training loops run the partial dataset (#1192)
- Fixed running on_validation_end only on main process in DDP (#1125)
- Fixed load_spawn_weights only in proc rank 0 (#1385)
- Fixes use_amp issue (#1145)
- Fixes using deprecated use_amp attribute (#1145)
- Fixed Tensorboard logger error: lightning_logs directory not exists in multi-node DDP on nodes with rank != 0 (#1377)
- Fixed Unimplemented backend XLA error on TPU (#1387)
[0.7.1] - 2020-03-07
Fixed
- Fixes print issues and data_loader (#1080)
[0.7.0] - 2020-03-06
Added
- Added automatic sampler setup. Depending on DDP or TPU, lightning configures the sampler correctly (user needs to do nothing) (#926)
- Added reload_dataloaders_every_epoch=False flag for trainer. Some users require reloading data every epoch (#926)
- Added progress_bar_refresh_rate=50 flag for trainer. Throttle refresh rate on notebooks (#926)
- Updated governance docs
- Added a check to ensure that the metric used for early stopping exists before training commences (#542)
- Added optimizer_idx argument to backward hook (#733)
- Added entity argument to WandbLogger to be passed to wandb.init (#783)
- Added a tool for profiling training runs (#782)
- Improved flexibility for naming of TensorBoard logs, can now set version to a str to just save to that directory, and use name='' to prevent experiment-name directory (#804)
- Added option to specify step key when logging metrics (#808)
- Added train_dataloader, val_dataloader and test_dataloader arguments to Trainer.fit(), for alternative data parsing (#759)
- Added Tensor Processing Unit (TPU) support (#868)
- Added semantic segmentation example (#751, #876, #881)
- Split callbacks in multiple files (#849)
- Support for user defined callbacks (#889 and #950)
- Added support for multiple loggers to be passed to Trainer as an iterable (e.g. list, tuple, etc.) (#903)
- Added support for step-based learning rate scheduling (#941)
- Added support for logging hparams as dict (#1029)
- Checkpoint and early stopping now work without val. step (#1041)
- Support graceful training cleanup after Keyboard Interrupt (#856, #1019)
- Added type hints for function arguments (#912, )
- Added default argparser for Trainer (#952, #1023)
- Added TPU gradient clipping (#963)
- Added max/min number of steps in Trainer (#728)
Changed
- Improved NeptuneLogger by adding close_after_fit argument to allow logging after training (#908)
- Changed default TQDM to use tqdm.auto for prettier outputs in IPython notebooks (#752)
- Changed pytorch_lightning.logging to pytorch_lightning.loggers (#767)
- Moved the default tqdm_dict definition from Trainer to LightningModule, so it can be overridden by the user (#749)
- Moved functionality of LightningModule.load_from_metrics into LightningModule.load_from_checkpoint (#995)
- Changed Checkpoint path parameter from filepath to dirpath (#1016)
- Freezed models hparams as Namespace property (#1029)
- Dropped logging config in package init (#1015)
- Renames model steps (#1051)
  - training_end >> training_epoch_end
  - validation_end >> validation_epoch_end
  - test_end >> test_epoch_end
- Refactor dataloading, supports infinite dataloader (#955)
- Create single file in TensorBoardLogger (#777)
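When upgrading a model across the step renames (#1051), it can help to list which old-style hooks a class still defines. The sketch below does that with plain Python introspection. `RENAMED_STEPS`, `find_renamed_steps` and `OldStyleModel` are hypothetical names used only for illustration. The check looks at the class's own namespace only, so hooks inherited from a base class are not reported.

```python
# Old hook name -> new name, taken from the rename list (#1051).
RENAMED_STEPS = {
    "training_end": "training_epoch_end",
    "validation_end": "validation_epoch_end",
    "test_end": "test_epoch_end",
}


def find_renamed_steps(cls):
    """Return (old, new) pairs for old-style hooks defined directly on cls."""
    return [(old, new) for old, new in RENAMED_STEPS.items() if old in vars(cls)]


class OldStyleModel:
    # Stand-in class; a real model would subclass LightningModule.
    def validation_end(self, outputs):
        return {}


print(find_renamed_steps(OldStyleModel))
```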
Deprecated
- Deprecated pytorch_lightning.logging (#767)
- Deprecated LightningModule.load_from_metrics in favour of LightningModule.load_from_checkpoint (#995, #1079)
- Deprecated @data_loader decorator (#926)
- Deprecated model steps training_end, validation_end and test_end (#1051, #1056)
Removed
- Removed dependency on pandas (#736)
- Removed dependency on torchvision (#797)
- Removed dependency on scikit-learn (#801)
Fixed
- Fixed a bug where early stopping on_end_epoch would be called inconsistently when check_val_every_n_epoch == 0 (#743)
- Fixed a bug where the model checkpointer didn't write to the same directory as the logger (#771)
- Fixed a bug where the TensorBoardLogger class would create an additional empty log file during fitting (#777)
- Fixed a bug where global_step was advanced incorrectly when using accumulate_grad_batches > 1 (#832)
- Fixed a bug when calling self.logger.experiment with multiple loggers (#1009)
- Fixed a bug when calling logger.append_tags on a NeptuneLogger with a single tag (#1009)
- Fixed sending back data from .spawn by saving and loading the trained model in/out of the process (#1017)
- Fixed port collision on DDP (#1010)
- Fixed/tested pass overrides (#918)
- Fixed comet logger to log after train (#892)
- Remove deprecated args to learning rate step function (#890)
[0.6.0] - 2020-01-21
Added
- Added support for resuming from a specific checkpoint via resume_from_checkpoint argument (#516)
- Added support for ReduceLROnPlateau scheduler (#320)
- Added support for Apex mode O2 in conjunction with Data Parallel (#493)
- Added option (save_top_k) to save the top k models in the ModelCheckpoint class (#128)
- Added on_train_start and on_train_end hooks to ModelHooks (#598)
- Added TensorBoardLogger (#607)
- Added support for weight summary of model with multiple inputs (#543)
- Added map_location argument to load_from_metrics and load_from_checkpoint (#625)
- Added option to disable validation by setting val_percent_check=0 (#649)
- Added NeptuneLogger class (#648)
- Added WandbLogger class (#627)
Changed
- Changed the default progress bar to print to stdout instead of stderr (#531)
- Renamed step_idx to step, epoch_idx to epoch, max_num_epochs to max_epochs and min_num_epochs to min_epochs (#589)
- Renamed total_batch_nb to total_batches, nb_val_batches to num_val_batches, nb_training_batches to num_training_batches, max_nb_epochs to max_epochs, min_nb_epochs to min_epochs, nb_test_batches to num_test_batches, and nb_val_batches to num_val_batches (#567)
- Changed gradient logging to use parameter names instead of indexes (#660)
- Changed the default logger to TensorBoardLogger (#609)
- Changed the directory for tensorboard logging to be the same as model checkpointing (#706)
Deprecated
- Deprecated max_nb_epochs and min_nb_epochs (#567)
- Deprecated the on_sanity_check_start hook in ModelHooks (#598)
Removed
- Removed the save_best_only argument from ModelCheckpoint, use save_top_k=1 instead (#128)
Fixed
- Fixed a bug which occurred when using Adagrad with cuda (#554)
- Fixed a bug where training would be on the GPU despite setting gpus=0 or gpus=[] (#561)
- Fixed an error with print_nan_gradients when some parameters do not require gradient (#579)
- Fixed a bug where the progress bar would show an incorrect number of total steps during the validation sanity check when using multiple validation data loaders (#597)
- Fixed support for PyTorch 1.1.0 (#552)
- Fixed an issue with early stopping when using a val_check_interval < 1.0 in Trainer (#492)
- Fixed bugs relating to the CometLogger object that would cause it to not work properly (#481)
- Fixed a bug that would occur when returning -1 from on_batch_start following an early exit or when the batch was None (#509)
- Fixed a potential race condition with several processes trying to create checkpoint directories (#530)
- Fixed a bug where batch 'segments' would remain on the GPU when using truncated_bptt > 1 (#532)
- Fixed a bug when using IterableDataset (#547)
- Fixed a bug where .item was called on non-tensor objects (#602)
- Fixed a bug where Trainer.train would crash on an uninitialized variable if the trainer was run after resuming from a checkpoint that was already at max_epochs (#608)
- Fixed a bug where early stopping would begin two epochs early (#617)
- Fixed a bug where num_training_batches and num_test_batches would sometimes be rounded down to zero (#649)
- Fixed a bug where an additional batch would be processed when manually setting num_training_batches (#653)
- Fixed a bug when batches did not have a .copy method (#701)
- Fixed a bug when using log_gpu_memory=True in Python 3.6 (#715)
- Fixed a bug where checkpoint writing could exit before completion, giving incomplete checkpoints (#689)
- Fixed a bug where on_train_end was not called when early stopping (#723)
[0.5.3] - 2019-11-06
Added
- Added option to disable default logger, checkpointer, and early stopping by passing logger=False, checkpoint_callback=False and early_stop_callback=False respectively
- Added CometLogger for use with Comet.ml
- Added val_check_interval argument to Trainer allowing validation to be performed at every given number of batches
- Added functionality to save and load hyperparameters using the standard checkpoint mechanism
- Added call to torch.cuda.empty_cache before training starts
- Added option for user to override the call to backward
- Added support for truncated backprop through time via the truncated_bptt_steps argument in Trainer
- Added option to operate on all outputs from training_step in DDP2
- Added a hook for modifying DDP init
- Added a hook for modifying Apex
Changed
- Changed experiment version to be padded with zeros (e.g. /dir/version_9 becomes /dir/version_0009)
- Changed callback metrics to include any metrics given in logs or progress bar
- Changed the default for save_best_only in ModelCheckpoint to True
- Added tng_data_loader for backwards compatibility
- Renamed MLFlowLogger.client to MLFlowLogger.experiment for consistency
- Moved global_step increment to happen after the batch has been processed
- Changed weights restore to first attempt HPC weights before restoring normally, preventing both weights being restored and running out of memory
- Changed progress bar functionality to add multiple progress bars for train/val/test
- Changed calls to print to use logging instead
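The zero-padded version directory from this release can be illustrated with ordinary string formatting. This is a sketch only: `version_dir` is a hypothetical helper, and the width of 4 is inferred from the single example given in the entry (version_9 becoming version_0009). One practical effect of padding, shown in the last lines, is that directory names sort in numeric order. The changelog does not state that this was the motivation.

```python
def version_dir(root, version, width=4):
    """Build a padded version path; width=4 is assumed from the changelog example."""
    return f"{root}/version_{version:0{width}d}"


print(version_dir("/dir", 9))

# Unpadded names sort lexicographically, so version_10 lands before version_9.
unpadded = sorted(f"version_{n}" for n in (9, 10))
padded = sorted(version_dir("/dir", n) for n in (9, 10))
print(unpadded)
print(padded)
```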
Deprecated
- Deprecated tng_dataloader
Fixed
- Fixed an issue where the number of batches was off by one during training
- Fixed a bug that occurred when setting a checkpoint callback and early_stop_callback=False
- Fixed an error when importing CometLogger
- Fixed a bug where the gpus argument had some unexpected behaviour
- Fixed a bug where the computed total number of batches was sometimes incorrect
- Fixed a bug where the progress bar would sometimes not show the total number of batches in test mode
- Fixed a bug when using the log_gpu_memory='min_max' option in Trainer
- Fixed a bug where checkpointing would sometimes erase the current directory
[0.5.2] - 2019-10-10
Added
- Added weights_summary argument to Trainer to be set to full (full summary), top (just top level modules) or other
- Added tags argument to MLFlowLogger
Changed
- Changed default for amp_level to O1
Removed
- Removed the print_weights_summary argument from Trainer
Fixed
- Fixed a bug where logs were not written properly
- Fixed a bug where logger.finalize wasn't called after training is complete
- Fixed callback metric errors in DDP
- Fixed a bug where TestTubeLogger didn't log to the correct directory
[0.5.1] - 2019-10-05
Added
- Added the LightningLoggerBase class for experiment loggers
- Added MLFlowLogger for logging with mlflow
- Added TestTubeLogger for logging with test_tube
- Added a different implementation of DDP (distributed_backend='ddp2') where every node has one model using all GPUs
- Added support for optimisers which require a closure (e.g. LBFGS)
- Added automatic MASTER_PORT default for DDP when not set manually
- Added new GPU memory logging options 'min_max' (log only the min/max utilization) and 'all' (log all the GPU memory)
Changed
- Changed schedulers to always be called with the current epoch
- Changed test_tube to an optional dependency
- Changed data loaders to internally use a getter instead of a python property
- Disabled auto GPU loading when restoring weights to prevent out of memory errors
- Changed logging, early stopping and checkpointing to occur by default
Fixed
- Fixed a bug with samplers that do not specify set_epoch
- Fixed a bug when using the MLFlowLogger with unsupported data types, this will now raise a warning
- Fixed a bug where gradient norms were always zero using track_grad_norm
- Fixed a bug which causes a crash when logging memory
[0.5.0] - 2019-09-26
Changed
- Changed data_batch argument to batch throughout
- Changed batch_i argument to batch_idx throughout
- Changed tng_dataloader method to train_dataloader
- Changed on_tng_metrics method to on_training_metrics
- Changed gradient_clip argument to gradient_clip_val
- Changed add_log_row_interval to row_log_interval
Fixed
- Fixed a bug with tensorboard logging in multi-gpu setup
[0.4.9] - 2019-09-16
Added
- Added the flag log_gpu_memory to Trainer to deactivate logging of GPU memory utilization
- Added SLURM resubmit functionality (port from test-tube)
- Added optional weight_save_path to trainer to remove the need for a checkpoint_callback when using cluster training
- Added option to use single gpu per node with DistributedDataParallel
Changed
- Changed functionality of validation_end and test_end with multiple dataloaders to be given all of the dataloaders at once rather than in separate calls
- Changed print_nan_grads to only print the parameter value and gradients when they contain NaN
- Changed gpu API to take integers as well (e.g. gpus=2 instead of gpus=[0, 1])
- All models now loaded on to CPU to avoid device and out of memory issues in PyTorch
Fixed
- Fixed a bug where data types that implement .to but not .cuda would not be properly moved onto the GPU
- Fixed a bug where data would not be re-shuffled every epoch when using a DistributedSampler
[0.4.8] - 2019-08-31
Added
- Added test_step and test_end methods, used when Trainer.test is called
- Added GradientAccumulationScheduler callback which can be used to schedule changes to the number of accumulation batches
- Added option to skip the validation sanity check by setting nb_sanity_val_steps = 0
Fixed
- Fixed a bug when setting nb_sanity_val_steps = 0
[0.4.7] - 2019-08-24
Changed
- Changed the default val_check_interval to 1.0
- Changed defaults for nb_val_batches, nb_tng_batches and nb_test_batches to 0
Fixed
- Fixed a bug where the full validation set was used despite setting val_percent_check
- Fixed a bug where an Exception was thrown when using a data set containing a single batch
- Fixed a bug where an Exception was thrown if no val_dataloader was given
- Fixed a bug where tuples were not properly transferred to the GPU
- Fixed a bug where data of a non standard type was not properly handled by the trainer
- Fixed a bug when loading data as a tuple
- Fixed a bug where AttributeError could be suppressed by the Trainer
[0.4.6] - 2019-08-15
Added
- Added support for data to be given as a dict or list with a single gpu
- Added support for configure_optimizers to return a single optimizer, two lists (optimizers and schedulers), or a single list
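The three return shapes accepted from configure_optimizers can be pictured as being normalised into one pair of lists. The sketch below is an illustration under stated assumptions, not the library's implementation: `split_optimizers` is a hypothetical helper, and plain strings stand in for optimizer and scheduler objects so that the example runs without PyTorch.

```python
def split_optimizers(result):
    """Normalise a configure_optimizers-style return value into (optimizers, schedulers)."""
    if isinstance(result, (list, tuple)):
        # Two lists: ([optimizers...], [schedulers...])
        if len(result) == 2 and all(isinstance(part, (list, tuple)) for part in result):
            return list(result[0]), list(result[1])
        # A single list of optimizers, with no schedulers
        return list(result), []
    # A single optimizer object
    return [result], []


print(split_optimizers("adam"))
print(split_optimizers(["adam", "sgd"]))
print(split_optimizers((["adam"], ["step_lr"])))
```

The two-list case is detected by checking that both elements are themselves sequences, which works here because an optimizer object is never a list or tuple.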
Fixed
- Fixed a bug where returning just an optimizer list (i.e. without schedulers) from configure_optimizers would throw an Exception
[0.4.5] - 2019-08-13
Added
- Added optimizer_step method that can be overridden to change the standard optimizer behaviour
[0.4.4] - 2019-08-12
Added
- Added support for multiple validation dataloaders
- Added support for latest test-tube logger (optimised for torch==1.2.0)
Changed
- validation_step and val_dataloader are now optional
- lr_scheduler is now activated after epoch
Fixed
- Fixed a bug where a warning would show when using lr_scheduler in torch>1.1.0
- Fixed a bug where an Exception would be thrown if using torch.DistributedDataParallel without using a DistributedSampler, this now throws a Warning instead
[0.4.3] - 2019-08-10
Fixed
- Fixed a bug where accumulate gradients would scale the loss incorrectly
[0.4.2] - 2019-08-08
Changed
- Changed install requirement to torch==1.2.0
[0.4.1] - 2019-08-08
Changed
- Changed install requirement to torch==1.1.0
[0.4.0] - 2019-08-08
Added
- Added 16-bit support for a single GPU
- Added support for training continuation (preserves epoch, global step etc.)
Changed
- Changed training_step and validation_step, outputs will no longer be automatically reduced
Removed
- Removed need for Experiment object in Trainer
Fixed
- Fixed issues with reducing outputs from generative models (such as images and text)
[0.3.6] - 2019-07-25
Added
- Added a decorator to do lazy data loading internally
Fixed
- Fixed a bug where Experiment object was not process safe, potentially causing logs to be overwritten