Bug/4319 ddp checkpoint (#4323)
* Broadcast best model path to ensure we sync with main process + wait for main process to save
* Add barrier call to ensure all processes are in sync
* Added changelog commit
* Move sync of best model path/score to model checkpoint, keep barrier to ensure all processes complete
* Ensure we broadcast as tuple
* Add init check
* Update pytorch_lightning/callbacks/model_checkpoint.py (Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>)
* Update pytorch_lightning/callbacks/model_checkpoint.py (Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>)
* Removed model checkpoint code, added barrier to trainer to enforce we synchronize and wait for all processes to finish before completing training
* Add barrier within teardown call, removed horovod teardown to inherit from base accelerator

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
parent 207ff728c9 / commit 5641b266d5
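The commit boils down to a rank-0-saves / everyone-waits / broadcast-the-result pattern. Below is a minimal sketch of that pattern in plain torch.distributed, assuming a process group has already been initialised; the function name sync_best_model_path and its arguments are hypothetical and not part of the Lightning API.

# Minimal sketch of the pattern this PR enforces; illustrative only, not the
# actual Lightning implementation. Assumes torch.distributed is initialised.
from typing import Optional

import torch
import torch.distributed as dist


def sync_best_model_path(model: torch.nn.Module, ckpt_path: Optional[str]) -> str:
    if dist.get_rank() == 0:
        # only the main process writes the checkpoint file
        torch.save(model.state_dict(), ckpt_path)

    # every process waits here until the main process has finished saving
    dist.barrier()

    # broadcast the path from rank 0 so all ranks report the same best model
    payload = [ckpt_path if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(payload, src=0)
    return payload[0]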
@@ -31,6 +31,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 ### Fixed

+- Fixed synchronization of best model path in `ddp_accelerator` ([#4323](https://github.com/PyTorchLightning/pytorch-lightning/pull/4323))

 ## [1.0.3] - 2020-10-20
@@ -52,7 +52,8 @@ class Accelerator(object):
         pass

     def teardown(self):
-        pass
+        # Ensure if necessary all processes are finished
+        self.barrier()

     def barrier(self, name: Optional[str] = None):
         pass
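For the barrier added to the base teardown to synchronise anything, the distributed accelerators must override it. A hedged sketch of what such an override typically looks like for a DDP-style backend, with the init check the commit message mentions, is below; the class name is illustrative and the code is not copied from the Lightning source.

# Illustrative sketch of a DDP-style barrier override; not the Lightning class.
from typing import Optional

import torch.distributed as dist


class DDPLikeAccelerator:
    def barrier(self, name: Optional[str] = None) -> None:
        # only synchronise when a process group has actually been initialised
        if dist.is_available() and dist.is_initialized():
            dist.barrier()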
@@ -101,6 +101,7 @@ class DataParallelAccelerator(Accelerator):
     def teardown(self):
         # replace the original fwd function
         self.trainer.model.forward = self.model_autocast_original_forward
+        self.barrier()

     def training_step(self, args):
         if self.trainer.amp_backend == AMPType.NATIVE:
@@ -107,9 +107,6 @@ class HorovodAccelerator(Accelerator):
         hvd.join()
         return results

-    def teardown(self):
-        pass
-
     def training_step(self, args):
         if self.trainer.on_gpu:
             batch = args[0]
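Removing HorovodAccelerator.teardown means the Horovod backend now falls through to Accelerator.teardown and therefore hits the same end-of-training barrier as the other backends. A toy illustration of that inheritance follows; the class bodies are heavily simplified and are not the real accelerators.

# Toy illustration of the inheritance this hunk relies on; simplified, not the
# real accelerator classes.
class Accelerator:
    def teardown(self):
        # Ensure if necessary all processes are finished
        self.barrier()

    def barrier(self, name=None):
        pass  # no-op for single-process accelerators


class HorovodAccelerator(Accelerator):
    # no teardown override any more, so Accelerator.teardown (and its barrier) runs
    def barrier(self, name=None):
        pass  # stand-in; the real backend synchronises via Horovod here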