:orphan:
.. _logging_advanced:
##########################################
Track and Visualize Experiments (advanced)
##########################################
**Audience:** Users who want to do advanced speed optimizations by customizing the logging behavior.
----
****************************
Change progress bar defaults
****************************
To change the default values (i.e., the version number) shown in the progress bar, override the :meth:`~pytorch_lightning.callbacks.progress.base.ProgressBarBase.get_metrics` method in your progress bar callback.
.. code-block:: python

    from pytorch_lightning.callbacks.progress import TQDMProgressBar


    class CustomProgressBar(TQDMProgressBar):
        def get_metrics(self, *args, **kwargs):
            # don't show the version number
            items = super().get_metrics(*args, **kwargs)
            items.pop("v_num", None)
            return items

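To use the customized bar, pass an instance of it to the Trainer ``callbacks`` (a minimal sketch):

.. code-block:: python

    from pytorch_lightning import Trainer

    # the custom progress bar replaces the default one
    trainer = Trainer(callbacks=[CustomProgressBar()])
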
----
************************************
Customize tracking to speed up model
************************************
Modify logging frequency
========================
Logging a metric on every single batch can slow down training. By default, Lightning logs a row of metrics every 50 training steps.
To change this frequency, set the *log_every_n_steps* :class:`~pytorch_lightning.trainer.trainer.Trainer` flag.
.. testcode::

    k = 10
    trainer = Trainer(log_every_n_steps=k)

----
Modify flushing frequency
=========================
Metrics are kept in memory for N steps to improve training efficiency. Every N steps, metrics flush to disk. To change the frequency of this flushing, use the *flush_logs_every_n_steps* Trainer argument.
.. code-block:: python

    # faster training, high memory
    Trainer(flush_logs_every_n_steps=500)

    # slower training, low memory
    Trainer(flush_logs_every_n_steps=100)

The higher *flush_logs_every_n_steps* is, the faster the model trains, but memory builds up until the next flush.
The smaller *flush_logs_every_n_steps* is, the slower the model trains, but memory use is kept to a minimum.
----
******************
Customize self.log
******************
The LightningModule *self.log* method offers many configurations to customize its behavior.
----
add_dataloader_idx
==================
**Default:** True
If True, appends the index of the current dataloader to the metric name (when using multiple dataloaders). If False, you must give each dataloader a unique metric name, otherwise the values will be mixed together.
.. code-block:: python

    self.log(add_dataloader_idx=True)

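For example, with two validation dataloaders, logging the same name from both produces one entry per dataloader (a sketch; the metric name and loss helper are placeholders):

.. code-block:: python

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        loss = self._shared_step(batch)  # hypothetical helper
        # recorded as "val_loss/dataloader_idx_0" and "val_loss/dataloader_idx_1"
        self.log("val_loss", loss, add_dataloader_idx=True)
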
----
batch_size
==========
**Default:** None
Current batch size used for accumulating logs logged with ``on_epoch=True``. This will be directly inferred from the loaded batch, but for some data structures you might need to explicitly provide it.
.. code-block:: python

    self.log(batch_size=32)

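For example, if the batch is a custom structure such as a dict, Lightning may not be able to infer the batch size, so pass it explicitly (a sketch; the keys and loss helper are placeholders):

.. code-block:: python

    def training_step(self, batch, batch_idx):
        images, labels = batch["images"], batch["labels"]  # hypothetical keys
        loss = self.compute_loss(images, labels)  # hypothetical helper
        self.log("train_loss", loss, on_epoch=True, batch_size=images.shape[0])
        return loss
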
----
enable_graph
============
**Default:** False
If True, the computational graph of the logged value will not be automatically detached.
.. code-block:: python

    self.log(enable_graph=True)

----
logger
======
**Default:** True
If True, sends the value to the logger, such as ``TensorBoard`` or any custom logger passed to the :class:`~pytorch_lightning.trainer.trainer.Trainer`.
.. code-block:: python

    self.log(logger=True)

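For example, a value that should only appear in the progress bar and not be written to the experiment logger can be logged like this (a sketch; ``current_lr`` is a placeholder computed elsewhere):

.. code-block:: python

    # show in the progress bar, but keep it out of the experiment logs
    self.log("lr", current_lr, prog_bar=True, logger=False)
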
----
on_epoch
========
**Default:** It varies
If True, that specific *self.log* call accumulates the values over the epoch and reduces them at the end of the epoch.
.. code-block:: python

    self.log(on_epoch=True)

The default value depends on which function *self.log* is called from:
.. code-block:: python

    def training_step(self, batch, batch_idx):
        # Default: False
        self.log(on_epoch=False)


    def validation_step(self, batch, batch_idx):
        # Default: True
        self.log(on_epoch=True)


    def test_step(self, batch, batch_idx):
        # Default: True
        self.log(on_epoch=True)

----
on_step
=======
**Default:** It varies
If this is True, that specific *self.log* call will NOT accumulate metrics. Instead it will generate a timeseries across steps.
.. code-block:: python

    self.log(on_step=True)

The default value depends on which function *self.log* is called from:
.. code-block:: python

    def training_step(self, batch, batch_idx):
        # Default: True
        self.log(on_step=True)


    def validation_step(self, batch, batch_idx):
        # Default: False
        self.log(on_step=False)


    def test_step(self, batch, batch_idx):
        # Default: False
        self.log(on_step=False)

----
prog_bar
========
**Default:** False
If True, the logged value will be shown in the progress bar.
.. code-block:: python

    self.log(prog_bar=True)

----
rank_zero_only
==============
**Default:** False
If True, the value will be logged only on rank 0. This prevents the synchronization that would otherwise deadlock, since not all processes would perform this log call.
.. code-block:: python

    self.log(rank_zero_only=True)

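For example, if a log call only happens on the main process, mark it with *rank_zero_only=True* so the other ranks are not asked to synchronize (a sketch; the metric name and loss helper are placeholders):

.. code-block:: python

    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical helper
        if self.trainer.is_global_zero:
            # only rank 0 reaches this call, so it must not try to sync across ranks
            self.log("rank_zero_loss", loss, rank_zero_only=True)
        return loss
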
----
reduce_fx
=========
**Default:** :meth:`torch.mean`
Reduction function over step values for end of epoch. Uses :meth:`torch.mean` by default.
.. code-block:: python

    self.log(reduce_fx=torch.mean)

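For example, to track the worst step value over an epoch instead of the mean, a different reduction can be used (a sketch; ``loss`` stands for a value computed earlier in the step):

.. code-block:: python

    # keep the maximum step value for the epoch instead of averaging
    self.log("max_batch_loss", loss, on_step=False, on_epoch=True, reduce_fx="max")
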
----
sync_dist
=========
**Default:** False
If True, reduces the metric across devices. Use with care as this may lead to a significant communication overhead.
.. code-block:: python

    self.log(sync_dist=True)

----
sync_dist_group
===============
**Default:** None
The DDP group to sync across.
.. code-block:: python

    import torch.distributed as dist

    # the default process group is already initialized by Lightning;
    # create a subgroup containing only the ranks to sync across
    group = dist.new_group(ranks=[0, 1])
    self.log(sync_dist=True, sync_dist_group=group)

----
***************************************
Enable metrics for distributed training
***************************************
For metrics that need complex aggregation, we recommend building your metric with TorchMetrics, which handles the complexities of metric aggregation in distributed environments for you.
First, implement your metric:
.. code-block:: python

    import torch
    from torchmetrics import Metric


    class MyAccuracy(Metric):
        def __init__(self, dist_sync_on_step=False):
            # call `self.add_state` for every internal state that is needed for the metric's computations
            # dist_reduce_fx indicates the function that should be used to reduce
            # state from multiple processes
            super().__init__(dist_sync_on_step=dist_sync_on_step)

            self.add_state("correct", default=torch.tensor(0), dist_reduce_fx="sum")
            self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")

        def update(self, preds: torch.Tensor, target: torch.Tensor):
            # update metric states
            preds, target = self._input_format(preds, target)
            assert preds.shape == target.shape

            self.correct += torch.sum(preds == target)
            self.total += target.numel()

        def compute(self):
            # compute final result
            return self.correct.float() / self.total

To use the metric inside Lightning, 1) initialize it in ``__init__``, 2) compute the metric, and 3) pass it to *self.log*:
.. code-block:: python

    class LitModel(LightningModule):
        def __init__(self):
            super().__init__()

            # 1. initialize the metric
            self.accuracy = MyAccuracy()

        def training_step(self, batch, batch_idx):
            x, y = batch
            preds = self(x)

            # 2. compute the metric
            self.accuracy(preds, y)

            # 3. log it
            self.log("train_acc_step", self.accuracy)

----
********************************
Log to a custom cloud filesystem
********************************
Lightning is integrated with the major remote file systems including local filesystems and several cloud storage providers such as
`S3 <https://aws.amazon.com/s3/>`_ on `AWS <https://aws.amazon.com/>`_, `GCS <https://cloud.google.com/storage>`_ on `Google Cloud <https://cloud.google.com/>`_,
or `ADL <https://azure.microsoft.com/solutions/data-lake/>`_ on `Azure <https://azure.microsoft.com/>`_.
PyTorch Lightning uses `fsspec <https://filesystem-spec.readthedocs.io/>`_ internally to handle all filesystem operations.
To save logs to a remote filesystem, prepend a protocol like ``"s3://"`` to the save path used for writing and reading model data.
.. code-block:: python

    from pytorch_lightning.loggers import TensorBoardLogger

    logger = TensorBoardLogger(save_dir="s3://my_bucket/logs/")

    trainer = Trainer(logger=logger)
    trainer.fit(model)

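Any path the Trainer writes to can point at a remote filesystem in the same way. As a sketch (the bucket name is a placeholder), checkpoints and logs can be written directly to a bucket by pointing *default_root_dir* at it:

.. code-block:: python

    from pytorch_lightning import Trainer

    # "my_bucket" is a placeholder bucket name
    trainer = Trainer(default_root_dir="s3://my_bucket/data/")
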
----
*********************************
Track both step and epoch metrics
*********************************
To track the timeseries over steps (*on_step*) as well as the accumulated epoch metric (*on_epoch*), set both to True:
.. code-block:: python

    self.log(on_step=True, on_epoch=True)

Setting both to True will generate two graphs with *_step* for the timeseries over steps and *_epoch* for the epoch metric.
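As a concrete sketch (the metric name and loss helper are placeholders), logging ``train_loss`` this way produces both ``train_loss_step`` and ``train_loss_epoch`` in the logger:

.. code-block:: python

    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical helper
        # logged as "train_loss_step" every step and "train_loss_epoch" at the end of the epoch
        self.log("train_loss", loss, on_step=True, on_epoch=True)
        return loss
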
----
**************************************
Understand self.log automatic behavior
**************************************
The following tables show the default values of *on_step* and *on_epoch* depending on the *LightningModule* or *Callback* method from which *self.log* is called.
----
In LightningModule
==================
.. list-table:: Default behavior of logging in LightningModule
   :widths: 50 25 25
   :header-rows: 1

   * - Method
     - on_step
     - on_epoch
   * - on_after_backward, on_before_backward, on_before_optimizer_step, on_before_zero_grad, training_step, training_step_end
     - True
     - False
   * - training_epoch_end, test_epoch_end, test_step, test_step_end, validation_epoch_end, validation_step, validation_step_end
     - False
     - True

----
In Callback
===========
.. list-table:: Default behavior of logging in Callback
   :widths: 50 25 25
   :header-rows: 1

   * - Method
     - on_step
     - on_epoch
   * - on_after_backward, on_before_backward, on_before_optimizer_step, on_before_zero_grad, on_train_batch_start, on_train_batch_end
     - True
     - False
   * - on_train_epoch_start, on_train_epoch_end, on_train_start, on_validation_batch_start, on_validation_batch_end, on_validation_start, on_validation_epoch_start, on_validation_epoch_end
     - False
     - True

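A Callback logs through the LightningModule it receives in its hooks. As a minimal sketch (the callback and metric name are made up for illustration):

.. code-block:: python

    from pytorch_lightning import Callback


    class EpochCounter(Callback):
        def on_train_epoch_end(self, trainer, pl_module):
            # log via the LightningModule; the Callback defaults above apply (on_epoch=True here)
            pl_module.log("epochs_completed", float(trainer.current_epoch + 1))
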
.. note:: To add logging to an unsupported method, please open an issue with a clear description of why it is blocking you.