Bug/4319 ddp checkpoint (#4323)

* Broadcast best model path to ensure we sync with main process + wait for main process to save

* Add barrier call to ensure all processes are in sync

* Added changelog commit

* Move sync of best model path/score to model checkpoint, keep barrier to ensure all processes complete

* Ensure we broadcast as tuple

* Add init check

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Removed model checkpoint code, added barrier to trainer to ensure we synchronize and wait for all processes to finish before completing training (see the sketch below)

* Added barrier within the teardown call; removed the Horovod teardown so it inherits from the base accelerator

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Sean Naren 2020-10-24 21:55:49 +01:00 committed by GitHub
parent 207ff728c9
commit 5641b266d5
4 changed files with 4 additions and 4 deletions
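
As a rough illustration of the approach described in the commit message (a minimal sketch, not Lightning's own code): the main process writes the checkpoint, and every process then meets at a barrier, so no worker tears down before the best model file exists on disk. It assumes `torch.distributed` is already initialised; the helper name `save_checkpoint_in_sync` is invented for the example.

```python
import torch
import torch.distributed as dist


def save_checkpoint_in_sync(model: torch.nn.Module, path: str) -> None:
    """Sketch: save on the main process, block the others until the save is done."""
    if dist.get_rank() == 0:
        # Only the main process writes the checkpoint to disk.
        torch.save(model.state_dict(), path)
    # Every rank waits here; without the barrier a non-zero rank could finish
    # training (or try to read the checkpoint) before rank 0 has written it.
    dist.barrier()
```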

@@ -31,6 +31,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 
 ### Fixed
 
+- Fixed synchronization of best model path in `ddp_accelerator` ([#4323](https://github.com/PyTorchLightning/pytorch-lightning/pull/4323))
 
 ## [1.0.3] - 2020-10-20

@@ -52,7 +52,8 @@ class Accelerator(object):
         pass
 
     def teardown(self):
-        pass
+        # Ensure if necessary all processes are finished
+        self.barrier()
 
     def barrier(self, name: Optional[str] = None):
         pass
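
The base `barrier` above is a no-op, so calling it from `teardown` costs nothing for single-process accelerators; distributed subclasses are expected to override it with a real collective. A minimal sketch of what such an override can look like, written here as a free function rather than the actual Lightning subclass, and guarded so it degrades to a no-op when no process group has been initialised:

```python
from typing import Optional

import torch.distributed as torch_distrib


def barrier(name: Optional[str] = None) -> None:
    """Sketch of a distributed barrier override (illustrative, not Lightning's code)."""
    # Only synchronise when a default process group actually exists;
    # otherwise behave like the base class and do nothing.
    if torch_distrib.is_available() and torch_distrib.is_initialized():
        torch_distrib.barrier()
```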

@@ -101,6 +101,7 @@ class DataParallelAccelerator(Accelerator):
     def teardown(self):
         # replace the original fwd function
         self.trainer.model.forward = self.model_autocast_original_forward
+        self.barrier()
 
     def training_step(self, args):
         if self.trainer.amp_backend == AMPType.NATIVE:

@@ -107,9 +107,6 @@ class HorovodAccelerator(Accelerator):
         hvd.join()
         return results
 
-    def teardown(self):
-        pass
-
     def training_step(self, args):
         if self.trainer.on_gpu:
             batch = args[0]
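
With its no-op `teardown` removed, the Horovod accelerator now inherits the base `teardown` shown earlier, which calls `self.barrier()`. In Horovod that barrier is naturally expressed with `hvd.join()`, already used above; a hedged sketch of such an override, not necessarily the exact Lightning implementation:

```python
from typing import Optional

import horovod.torch as hvd


def barrier(name: Optional[str] = None) -> None:
    """Sketch: hvd.join() blocks until every Horovod worker has reached this point."""
    hvd.join()
```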