multi processing warnings (#1602)

* multi processing warnings
William Falcon 2020-04-25 10:03:02 -04:00 committed by GitHub
parent f531ab957b
commit cbd088bd13
2 changed files with 52 additions and 5 deletions


@@ -73,6 +73,47 @@ when needed.
.. note:: For iterable datasets, we don't do this automatically.
Make Model Picklable
^^^^^^^^^^^^^^^^^^^^
It's very likely your code is already `picklable <https://docs.python.org/3/library/pickle.html>`_,
so you won't have to change anything.
However, if you run in a distributed mode and see an error like this:
.. code-block::
self._launch(process_obj)
File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47,
in _launch reduction.dump(process_obj, fp)
File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function <lambda> at 0x2b599e088ae8>:
attribute lookup <lambda> on __main__ failed
This means you have something in your model definition, transforms, optimizer, dataloader or callbacks
that cannot be pickled. By pickled, we mean that the following would fail:
.. code-block:: python
import pickle
pickle.dumps(some_object)
This is a limitation of using multiple processes for distributed training within PyTorch.
To fix this issue, find the piece of code that cannot be pickled. The end of the stacktrace
is usually helpful.
.. code-block::
self._launch(process_obj)
File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47,
in _launch reduction.dump(process_obj, fp)
File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle [THIS IS THE THING TO FIND AND DELETE]:
attribute lookup <lambda> on __main__ failed
i.e., in the stacktrace example above, there seems to be a lambda function somewhere in the user code
which cannot be pickled.
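A common fix is to replace the lambda with a named function defined at module level, since top-level functions can be pickled by name. A minimal sketch (the ``double`` transform is purely illustrative):
.. code-block:: python
# BAD: a lambda exists only as an anonymous attribute of __main__ and cannot be pickled
# transform = lambda x: x * 2
# GOOD: a module-level (top-level) function can be pickled by its qualified name
def double(x):
    return x * 2
transform = double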
Distributed modes
-----------------
Lightning allows multiple ways of training
@@ -88,6 +129,8 @@ Data Parallel (dp)
`DataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.DataParallel>`_ splits a batch across k GPUs. That is, if you have a batch of 32 and use dp with 2 GPUs,
each GPU will process 16 samples, after which the root node will aggregate the results.
.. warning:: DP use is discouraged by PyTorch and Lightning. Use ddp, which is more stable and at least 3x faster.
.. code-block:: python
# train on 1 GPU (using dp mode)
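As a minimal sketch of selecting this mode (assuming the 0.7.x-era ``Trainer`` signature, where the backend is chosen via ``distributed_backend``):
.. code-block:: python
from pytorch_lightning import Trainer
# assumption: 0.7.x-era API; the string 'dp' selects DataParallel
trainer = Trainer(gpus=2, distributed_backend='dp')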
@@ -167,7 +210,7 @@ Horovod can be configured in the training script to run with any number of GPUs
When starting the training job, the driver application will then be used to specify the total
number of worker processes:
-.. code-block:: bash
+.. code-block::
# run training with 4 GPUs on a single machine
horovodrun -np 4 python train.py
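Horovod can also span multiple machines by passing a host list to ``horovodrun``. A sketch, assuming two hosts named ``host1`` and ``host2`` with 4 GPUs each:
.. code-block::
# run training with 8 GPUs across 2 machines (4 GPUs each)
horovodrun -np 8 -H host1:4,host2:4 python train.py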
@@ -193,9 +236,9 @@ DP and ddp2 roughly do the following:
gpu_3_batch = batch[24:]
y_0 = model_copy_gpu_0(gpu_0_batch)
-y_1 = model_copy_gpu_0(gpu_1_batch)
-y_2 = model_copy_gpu_0(gpu_2_batch)
-y_3 = model_copy_gpu_0(gpu_3_batch)
+y_1 = model_copy_gpu_1(gpu_1_batch)
+y_2 = model_copy_gpu_2(gpu_2_batch)
+y_3 = model_copy_gpu_3(gpu_3_batch)
return [y_0, y_1, y_2, y_3]
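The final aggregation on the root node can be pictured as concatenating the per-GPU outputs back into a single batch. A sketch, assuming tensor outputs (DataParallel's gather concatenates along the batch dimension):
.. code-block:: python
import torch
# the root node gathers the per-GPU results back into one tensor
y = torch.cat([y_0, y_1, y_2, y_3], dim=0)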


@@ -70,7 +70,11 @@ Call the logger anywhere except ``__init__`` in your
... def training_step(self, batch, batch_idx):
... # example
... self.logger.experiment.whatever_method_summary_writer_supports(...)
...
... # example if logger is a tensorboard logger
... self.logger.experiment.add_image('images', grid, 0)
... self.logger.experiment.add_graph(model, images)
... def any_lightning_module_function_or_hook(self):
... self.logger.experiment.add_histogram(...)
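For instance, with the TensorBoard logger the ``experiment`` attribute is a ``SummaryWriter``, so ``add_histogram`` takes a tag, the values and a global step. A sketch (``self.layer`` is a hypothetical module attribute):
.. code-block:: python
def any_lightning_module_function_or_hook(self):
    # SummaryWriter.add_histogram(tag, values, global_step)
    self.logger.experiment.add_histogram('layer_weights', self.layer.weight, self.global_step)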