Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog.

[unreleased] - YYYY-MM-DD

Added

Added auto_select_gpus flag to trainer that enables automatic selection of available GPUs on exclusive mode systems.
Added learning rate finder (#1347)
Added terminate_on_nan flag to trainer that performs a NaN check with each training iteration when set to True. (#1475)

Changed

Changed the default behaviour to no longer include a NaN check with each training iteration. (#1475)

Deprecated

Removed

Fixed

[0.7.3] - 2020-04-09

Added

Added rank_zero_warn for warning only in rank 0 (#1428)
Added test_dataloaders parameter to Trainer.test() (#1393)

Changed

Removed test_dataloaders parameter from Trainer.fit() (#1393)

Fixed

Fixed optimizer configuration when configure_optimizers returns dict without lr_scheduler (#1443)
Fixed default DistributedSampler for DDP training (#1425)
Fixed workers warning not on windows (#1430)
Fixed returning tuple from run_training_batch (#1431)
Fixed gradient clipping (#1438)
Fixed pretty print (#1441)

[0.7.2] - 2020-04-07

Added

Added same step loggers' metrics aggregation (#1278)
Added parity test between a vanilla MNIST model and lightning model (#1284)
Added parity test between a vanilla RNN model and lightning model (#1351)
Added Reinforcement Learning - Deep Q-network (DQN) lightning example (#1232)
Added support for hierarchical dict (#1152)
Added TrainsLogger class (#1122)
Added type hints to pytorch_lightning.core (#946)
Added support for IterableDataset in validation and testing (#1104)
Added support for non-primitive types in hparams for TensorboardLogger (#1130)
Added a check that stops the training when loss or weights contain NaN or inf values. (#1097)
Added support for IterableDataset when val_check_interval=1.0 (default), this will trigger validation at the end of each epoch. (#1283)
Added summary method to Profilers. (#1259)
Added informative errors if user defined dataloader has zero length (#1280)
Added testing for python 3.8 (#915)
Added a training_epoch_end method which is the mirror of validation_epoch_end. (#1357)
Added model configuration checking (#1199)
Added support for optimizer frequencies through LightningModule.configure_optimizers() (#1269)
Added option to run without an optimizer by returning None from configure_optimizers. (#1279)
Added a warning when the number of data loader workers is small. (#1378)

Changed

Changed (renamed and refactored) TensorRunningMean -> TensorRunningAccum: running accumulations were generalized. (#1278)
Changed progress_bar_refresh_rate trainer flag to disable progress bar when set to 0. (#1108)
Enhanced load_from_checkpoint to also forward params to the model (#1307)
Updated references to self.forward() to instead use the __call__ interface. (#1211)
Changed default behaviour of configure_optimizers to use no optimizer rather than Adam. (#1279)
Allowed uploading models to W&B (#1339)
Automated unsqueeze on DP and DDP2 (#1319)
Reinstantiating a dataloader now keeps its original type (if it is a subclass of DataLoader) instead of always creating a DataLoader (#1346)
No longer interferes with a default sampler (#1318)
Removed default Adam optimizer (#1317)
Now gives warnings for unimplemented required lightning methods (#1317)
Made evaluate method private >> Trainer._evaluate(...). (#1260)
Simplified the PL examples structure (shallower and more readable) (#1247)
Changed min max gpu memory to be on their own plots (#1358)
Removed .item which causes sync issues (#1254)
Changed smoothing in TQDM to decrease variability of time remaining between training / eval (#1194)
Changed default logger to dedicated one (#1064)

Deprecated

Deprecated Trainer argument print_nan_grads (#1097)
Deprecated Trainer argument show_progress_bar (#1108)

Removed

Removed duplicated module pytorch_lightning.utilities.arg_parse for loading CLI arguments (#1167)
Removed wandb logger's finalize method (#1193)
Dropped torchvision dependency in tests and added own MNIST dataset class instead (#986)

Fixed

Fixed model_checkpoint when saving all models (#1359)
Trainer.add_argparse_args classmethod fixed. Now it adds a type for the arguments (#1147)
Fixed bug related to type checking of ReduceLROnPlateau lr schedulers (#1114)
Fixed a bug to ensure lightning checkpoints to be backward compatible (#1132)
Fixed a bug that created an extra dataloader with active reload_dataloaders_every_epoch (#1181)
Fixed all warnings and errors in the docs build process (#1191)
Fixed an issue where val_percent_check=0 would not disable validation (#1251)
Fixed average of incomplete TensorRunningMean (#1309)
Fixed WandbLogger.watch with wandb.init() (#1311)
Fixed an issue with early stopping that would prevent it from monitoring training metrics when validation is disabled / not implemented (#1235).
Fixed a bug that would cause trainer.test() to run on the validation set when overloading validation_epoch_end and test_end (#1353)
Fixed WandbLogger.watch - use of the watch method without importing wandb (#1311)
Fixed WandbLogger to be used with 'ddp' - allow reinits in sub-processes (#1149, #1360)
Made training_epoch_end behave like validation_epoch_end (#1357)
Fixed fast_dev_run running validation twice (#1365)
Fixed pickle error from quick patch __code__ (#1352)
Fixed memory leak on GPU0 (#1094, #1349)
Fixed checkpointing interval (#1272)
Fixed validation and training loops running on a partial dataset (#1192)
Fixed running on_validation_end only on main process in DDP (#1125)
Fixed load_spawn_weights only in proc rank 0 (#1385)
Fixed use_amp issue (#1145)
Fixed use of the deprecated use_amp attribute (#1145)
Fixed Tensorboard logger error: lightning_logs directory does not exist in multi-node DDP on nodes with rank != 0 (#1375)
Fixed Unimplemented backend XLA error on TPU (#1387)

[0.7.1] - 2020-03-07

Fixed

Fixed print issues and data_loader (#1080)

[0.7.0] - 2020-03-06

Added

Added automatic sampler setup. Depending on DDP or TPU, lightning configures the sampler correctly (user needs to do nothing) (#926)
Added reload_dataloaders_every_epoch=False flag for trainer. Some users require reloading data every epoch (#926)
Added progress_bar_refresh_rate=50 flag for trainer. Throttle refresh rate on notebooks (#926)
Updated governance docs
Added a check to ensure that the metric used for early stopping exists before training commences (#542)
Added optimizer_idx argument to backward hook (#733)
Added entity argument to WandbLogger to be passed to wandb.init (#783)
Added a tool for profiling training runs (#782)
Improved flexibility for naming of TensorBoard logs, can now set version to a str to just save to that directory, and use name='' to prevent experiment-name directory (#804)
Added option to specify step key when logging metrics (#808)
Added train_dataloader, val_dataloader and test_dataloader arguments to Trainer.fit(), for alternative data parsing (#759)
Added Tensor Processing Unit (TPU) support (#868)
Added semantic segmentation example (#751, #876, #881)
Split callbacks in multiple files (#849)
Support for user defined callbacks (#889 and #950)
Added support for multiple loggers to be passed to Trainer as an iterable (e.g. list, tuple, etc.) (#903)
Added support for step-based learning rate scheduling (#941)
Added support for logging hparams as dict (#1029)
Checkpoint and early stopping now work without val. step (#1041)
Support graceful training cleanup after Keyboard Interrupt (#856, #1019)
Added type hints for function arguments (#912, )
Added default argparser for Trainer (#952, #1023)
Added TPU gradient clipping (#963)
Added max/min number of steps in Trainer (#728)

Changed

Improved NeptuneLogger by adding close_after_fit argument to allow logging after training (#908)
Changed default TQDM to use tqdm.auto for prettier outputs in IPython notebooks (#752)
Changed pytorch_lightning.logging to pytorch_lightning.loggers (#767)
Moved the default tqdm_dict definition from Trainer to LightningModule, so it can be overridden by the user (#749)
Moved functionality of LightningModule.load_from_metrics into LightningModule.load_from_checkpoint (#995)
Changed Checkpoint path parameter from filepath to dirpath (#1016)
Froze model hparams as a Namespace property (#1029)
Dropped logging config in package init (#1015)
Renamed model steps (#1051)
- training_end >> training_epoch_end
- validation_end >> validation_epoch_end
- test_end >> test_epoch_end
Refactor dataloading, supports infinite dataloader (#955)
Create single file in TensorBoardLogger (#777)

Deprecated

Deprecated pytorch_lightning.logging (#767)
Deprecated LightningModule.load_from_metrics in favour of LightningModule.load_from_checkpoint (#995, #1079)
Deprecated @data_loader decorator (#926)
Deprecated model steps training_end, validation_end and test_end (#1051, #1056)

Removed

Removed dependency on pandas (#736)
Removed dependency on torchvision (#797)
Removed dependency on scikit-learn (#801)

Fixed

Fixed a bug where early stopping on_end_epoch would be called inconsistently when check_val_every_n_epoch == 0 (#743)
Fixed a bug where the model checkpointer didn't write to the same directory as the logger (#771)
Fixed a bug where the TensorBoardLogger class would create an additional empty log file during fitting (#777)
Fixed a bug where global_step was advanced incorrectly when using accumulate_grad_batches > 1 (#832)
Fixed a bug when calling self.logger.experiment with multiple loggers (#1009)
Fixed a bug when calling logger.append_tags on a NeptuneLogger with a single tag (#1009)
Fixed sending back data from .spawn by saving and loading the trained model in/out of the process (#1017)
Fixed port collision on DDP (#1010)
Fixed/tested pass overrides (#918)
Fixed comet logger to log after train (#892)
Removed deprecated args to learning rate step function (#890)

[0.6.0] - 2020-01-21

Added

Added support for resuming from a specific checkpoint via resume_from_checkpoint argument (#516)
Added support for ReduceLROnPlateau scheduler (#320)
Added support for Apex mode O2 in conjunction with Data Parallel (#493)
Added option (save_top_k) to save the top k models in the ModelCheckpoint class (#128)
Added on_train_start and on_train_end hooks to ModelHooks (#598)
Added TensorBoardLogger (#607)
Added support for weight summary of model with multiple inputs (#543)
Added map_location argument to load_from_metrics and load_from_checkpoint (#625)
Added option to disable validation by setting val_percent_check=0 (#649)
Added NeptuneLogger class (#648)
Added WandbLogger class (#627)

Changed

Changed the default progress bar to print to stdout instead of stderr (#531)
Renamed step_idx to step, epoch_idx to epoch, max_num_epochs to max_epochs and min_num_epochs to min_epochs (#589)
Renamed total_batch_nb to total_batches, nb_val_batches to num_val_batches, nb_training_batches to num_training_batches, max_nb_epochs to max_epochs, min_nb_epochs to min_epochs, and nb_test_batches to num_test_batches (#567)
Changed gradient logging to use parameter names instead of indexes (#660)
Changed the default logger to TensorBoardLogger (#609)
Changed the directory for tensorboard logging to be the same as model checkpointing (#706)

Deprecated

Deprecated max_nb_epochs and min_nb_epochs (#567)
Deprecated the on_sanity_check_start hook in ModelHooks (#598)

Removed

Removed the save_best_only argument from ModelCheckpoint, use save_top_k=1 instead (#128)

Fixed

Fixed a bug which occurred when using Adagrad with cuda (#554)
Fixed a bug where training would be on the GPU despite setting gpus=0 or gpus=[] (#561)
Fixed an error with print_nan_gradients when some parameters do not require gradient (#579)
Fixed a bug where the progress bar would show an incorrect number of total steps during the validation sanity check when using multiple validation data loaders (#597)
Fixed support for PyTorch 1.1.0 (#552)
Fixed an issue with early stopping when using a val_check_interval < 1.0 in Trainer (#492)
Fixed bugs relating to the CometLogger object that would cause it to not work properly (#481)
Fixed a bug that would occur when returning -1 from on_batch_start following an early exit or when the batch was None (#509)
Fixed a potential race condition with several processes trying to create checkpoint directories (#530)
Fixed a bug where batch 'segments' would remain on the GPU when using truncated_bptt > 1 (#532)
Fixed a bug when using IterableDataset (#547)
Fixed a bug where .item was called on non-tensor objects (#602)
Fixed a bug where Trainer.train would crash on an uninitialized variable if the trainer was run after resuming from a checkpoint that was already at max_epochs (#608)
Fixed a bug where early stopping would begin two epochs early (#617)
Fixed a bug where num_training_batches and num_test_batches would sometimes be rounded down to zero (#649)
Fixed a bug where an additional batch would be processed when manually setting num_training_batches (#653)
Fixed a bug when batches did not have a .copy method (#701)
Fixed a bug when using log_gpu_memory=True in Python 3.6 (#715)
Fixed a bug where checkpoint writing could exit before completion, giving incomplete checkpoints (#689)
Fixed a bug where on_train_end was not called when early stopping (#723)

[0.5.3] - 2019-11-06

Added

Added option to disable default logger, checkpointer, and early stopping by passing logger=False, checkpoint_callback=False and early_stop_callback=False respectively
Added CometLogger for use with Comet.ml
Added val_check_interval argument to Trainer allowing validation to be performed at every given number of batches
Added functionality to save and load hyperparameters using the standard checkpoint mechanism
Added call to torch.cuda.empty_cache before training starts
Added option for user to override the call to backward
Added support for truncated backprop through time via the truncated_bptt_steps argument in Trainer
Added option to operate on all outputs from training_step in DDP2
Added a hook for modifying DDP init
Added a hook for modifying Apex

Changed

Changed experiment version to be padded with zeros (e.g. /dir/version_9 becomes /dir/version_0009)
Changed callback metrics to include any metrics given in logs or progress bar
Changed the default for save_best_only in ModelCheckpoint to True
Added tng_data_loader for backwards compatibility
Renamed MLFlowLogger.client to MLFlowLogger.experiment for consistency
Moved global_step increment to happen after the batch has been processed
Changed weights restore to first attempt HPC weights before restoring normally, preventing both weights being restored and running out of memory
Changed progress bar functionality to add multiple progress bars for train/val/test
Changed calls to print to use logging instead
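The zero-padded experiment version entry above can be illustrated with a small sketch. The helper name `version_dir` and the `width` parameter are hypothetical and are not taken from the library; only the `/dir/version_9` to `/dir/version_0009` example comes from the changelog.

```python
def version_dir(root: str, version: int, width: int = 4) -> str:
    """Return a version directory path with a zero-padded number.

    Hypothetical helper for illustration only; not the library's code.
    """
    # The nested format spec pads the integer with leading zeros to `width` digits.
    return f"{root}/version_{version:0{width}d}"


padded = version_dir("/dir", 9)
print(padded)
```

With a fixed width, sorting the directory names as strings gives the same order as sorting them by version number, which may be one reason for the padding.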

Deprecated

Deprecated tng_dataloader

Fixed

Fixed an issue where the number of batches was off by one during training
Fixed a bug that occurred when setting a checkpoint callback and early_stop_callback=False
Fixed an error when importing CometLogger
Fixed a bug where the gpus argument had some unexpected behaviour
Fixed a bug where the computed total number of batches was sometimes incorrect
Fixed a bug where the progress bar would sometimes not show the total number of batches in test mode
Fixed a bug when using the log_gpu_memory='min_max' option in Trainer
Fixed a bug where checkpointing would sometimes erase the current directory

[0.5.2] - 2019-10-10

Added

Added weights_summary argument to Trainer to be set to full (full summary), top (just top level modules) or other
Added tags argument to MLFlowLogger

Changed

Changed default for amp_level to O1

Removed

Removed the print_weights_summary argument from Trainer

Fixed

Fixed a bug where logs were not written properly
Fixed a bug where logger.finalize wasn't called after training is complete
Fixed callback metric errors in DDP
Fixed a bug where TestTubeLogger didn't log to the correct directory

[0.5.1] - 2019-10-05

Added

Added the LightningLoggerBase class for experiment loggers
Added MLFlowLogger for logging with mlflow
Added TestTubeLogger for logging with test_tube
Added a different implementation of DDP (distributed_backend='ddp2') where every node has one model using all GPUs
Added support for optimisers which require a closure (e.g. LBFGS)
Added automatic MASTER_PORT default for DDP when not set manually
Added new GPU memory logging options 'min_max' (log only the min/max utilization) and 'all' (log all the GPU memory)

Changed

Changed schedulers to always be called with the current epoch
Changed test_tube to an optional dependency
Changed data loaders to internally use a getter instead of a python property
Disabled auto GPU loading when restoring weights to prevent out of memory errors
Changed logging, early stopping and checkpointing to occur by default

Fixed

Fixed a bug with samplers that do not specify set_epoch
Fixed a bug when using the MLFlowLogger with unsupported data types; this now raises a warning
Fixed a bug where gradient norms were always zero using track_grad_norm
Fixed a bug which causes a crash when logging memory

[0.5.0] - 2019-09-26

Changed

Changed data_batch argument to batch throughout
Changed batch_i argument to batch_idx throughout
Changed tng_dataloader method to train_dataloader
Changed on_tng_metrics method to on_training_metrics
Changed gradient_clip argument to gradient_clip_val
Changed add_log_row_interval to row_log_interval

Fixed

Fixed a bug with tensorboard logging in multi-gpu setup

[0.4.9] - 2019-09-16

Added

Added the flag log_gpu_memory to Trainer to deactivate logging of GPU memory utilization
Added SLURM resubmit functionality (port from test-tube)
Added optional weight_save_path to trainer to remove the need for a checkpoint_callback when using cluster training
Added option to use single gpu per node with DistributedDataParallel

Changed

Changed functionality of validation_end and test_end with multiple dataloaders to be given all of the dataloaders at once rather than in separate calls
Changed print_nan_grads to only print the parameter value and gradients when they contain NaN
Changed gpu API to take integers as well (e.g. gpus=2 instead of gpus=[0, 1])
All models now loaded on to CPU to avoid device and out of memory issues in PyTorch
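As a rough illustration of the gpu API entry above, where `gpus=2` is accepted in place of `gpus=[0, 1]`, the following sketch maps both forms onto a list of device indices. The function `normalize_gpus` is hypothetical and is not the Trainer's actual parsing logic; it covers only non-negative integers and explicit lists.

```python
from typing import List, Optional, Union


def normalize_gpus(gpus: Union[int, List[int], None]) -> Optional[List[int]]:
    """Map an int or a list of GPU ids to an explicit list of device indices.

    Hypothetical helper for illustration; assumes an int n means devices 0..n-1.
    """
    if gpus is None:
        return None
    if isinstance(gpus, int):
        if gpus < 0:
            raise ValueError("this sketch only handles non-negative counts")
        return list(range(gpus))
    # Already an explicit list of ids: copy it so the caller's list is untouched.
    return list(gpus)


print(normalize_gpus(2))
print(normalize_gpus([0, 1]))
```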

Fixed

Fixed a bug where data types that implement .to but not .cuda would not be properly moved onto the GPU
Fixed a bug where data would not be re-shuffled every epoch when using a DistributedSampler

[0.4.8] - 2019-08-31

Added

Added test_step and test_end methods, used when Trainer.test is called
Added GradientAccumulationScheduler callback which can be used to schedule changes to the number of accumulation batches
Added option to skip the validation sanity check by setting nb_sanity_val_steps = 0

Fixed

Fixed a bug when setting nb_sanity_val_steps = 0

[0.4.7] - 2019-08-24

Changed

Changed the default val_check_interval to 1.0
Changed defaults for nb_val_batches, nb_tng_batches and nb_test_batches to 0

Fixed

Fixed a bug where the full validation set was used despite setting val_percent_check
Fixed a bug where an Exception was thrown when using a data set containing a single batch
Fixed a bug where an Exception was thrown if no val_dataloader was given
Fixed a bug where tuples were not properly transferred to the GPU
Fixed a bug where data of a non standard type was not properly handled by the trainer
Fixed a bug when loading data as a tuple
Fixed a bug where AttributeError could be suppressed by the Trainer

[0.4.6] - 2019-08-15

Added

Added support for data to be given as a dict or list with a single gpu
Added support for configure_optimizers to return a single optimizer, two lists (optimizers and schedulers), or a single list
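The three return forms listed above could be reduced to one shape along the following lines. This is a minimal sketch under stated assumptions: `normalize_optimizers` is a hypothetical name, plain strings stand in for real optimizer and scheduler objects, and the library's own handling may differ.

```python
from typing import Any, List, Tuple


def normalize_optimizers(result: Any) -> Tuple[List[Any], List[Any]]:
    """Return (optimizers, schedulers) for any of the three return forms.

    Hypothetical helper for illustration; not the library's implementation.
    """
    # Form 1: two lists, (optimizers, schedulers).
    if (
        isinstance(result, (list, tuple))
        and len(result) == 2
        and isinstance(result[0], list)
        and isinstance(result[1], list)
    ):
        return list(result[0]), list(result[1])
    # Form 2: a single list of optimizers, with no schedulers.
    if isinstance(result, (list, tuple)):
        return list(result), []
    # Form 3: a single optimizer object.
    return [result], []


# Strings are placeholders for optimizer and scheduler objects.
print(normalize_optimizers("opt"))
print(normalize_optimizers(["opt_a", "opt_b"]))
print(normalize_optimizers((["opt_a"], ["sched_a"])))
```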

Fixed

Fixed a bug where returning just an optimizer list (i.e. without schedulers) from configure_optimizers would throw an Exception

[0.4.5] - 2019-08-13

Added

Added optimizer_step method that can be overridden to change the standard optimizer behaviour

[0.4.4] - 2019-08-12

Added

Added support for multiple validation dataloaders
Added support for latest test-tube logger (optimised for torch==1.2.0)

Changed

validation_step and val_dataloader are now optional
lr_scheduler is now activated after epoch

Fixed

Fixed a bug where a warning would show when using lr_scheduler in torch>1.1.0
Fixed a bug where an Exception would be thrown if using torch.DistributedDataParallel without using a DistributedSampler, this now throws a Warning instead

[0.4.3] - 2019-08-10

Fixed

Fixed a bug where accumulate gradients would scale the loss incorrectly

[0.4.2] - 2019-08-08

Changed

Changed install requirement to torch==1.2.0

[0.4.1] - 2019-08-08

Changed

Changed install requirement to torch==1.1.0

[0.4.0] - 2019-08-08

Added

Added 16-bit support for a single GPU
Added support for training continuation (preserves epoch, global step etc.)

Changed

Changed training_step and validation_step, outputs will no longer be automatically reduced

Removed

Removed need for Experiment object in Trainer

Fixed

Fixed issues with reducing outputs from generative models (such as images and text)

[0.3.6] - 2019-07-25

Added

Added a decorator to do lazy data loading internally

Fixed

Fixed a bug where Experiment object was not process safe, potentially causing logs to be overwritten
