:orphan:

.. _grid_cloud_run_intermediate:

#################################
Train on the cloud (intermediate)
#################################
**Audience**: Users looking to run many models at once

----

****************
What is a sweep?
****************
A sweep is the term given to running the same model multiple times with different hyperparameters to find the configuration that performs best (according to your definition of performance).

Let's say I have a Python script that trains a Lightning model to classify images. I run this file like so:

.. code:: bash

      grid run file.py --batch_size 8

With such a model, I would be interested in knowing how it performs with different batch sizes. In this case, I'm going to train many versions of this model:

.. code:: bash

      # run 4 models in parallel
      grid run file.py --batch_size 8
      grid run file.py --batch_size 16
      grid run file.py --batch_size 32
      grid run file.py --batch_size 64

Now I can see how my model performs for each batch size, and based on accuracy, time, and cost I can pick my "best" model:

.. list-table:: Training speed vs cost
   :widths: 10 40 15 15
   :header-rows: 1

   * - Batch size
     - Classification accuracy
     - Training time
     - Cost
   * - 8
     - 0.80
     - 5 minutes
     - $0.15
   * - 16
     - 0.85
     - 10 minutes
     - $0.30
   * - 32
     - 0.90
     - 30 minutes
     - $0.50
   * - 64
     - 0.95
     - 60 minutes
     - $1.01
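
One way to make the choice of a "best" model explicit is to fix a budget and take the most accurate run that fits it. The following is a minimal sketch in plain Python and is not part of Grid: the ``results`` list copies the table above, while ``best_within_budget`` and the budget value are hypothetical names chosen for illustration.

.. code:: python

    # Results copied from the table above: (batch_size, accuracy, minutes, cost_usd).
    results = [
        (8, 0.80, 5, 0.15),
        (16, 0.85, 10, 0.30),
        (32, 0.90, 30, 0.50),
        (64, 0.95, 60, 1.01),
    ]


    def best_within_budget(runs, max_cost_usd):
        """Return the most accurate run whose cost fits the budget, or None."""
        affordable = [run for run in runs if run[3] <= max_cost_usd]
        if not affordable:
            return None
        return max(affordable, key=lambda run: run[1])


    # Hypothetical budget of $0.50: the batch-size-32 run is the best affordable one.
    print(best_within_budget(results, max_cost_usd=0.50))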

----

*************
Start a Sweep
*************
First, recall that in the `previous tutorial <run_basic.rst>`_ we ran a single model using this command:

.. code:: bash

    grid run --datastore_name cifar5 cifar5.py --data_dir /datastores/cifar5

Now we're going to run that same model 4 different times, each with a different batch size:

.. code:: bash

    grid run --datastore_name cifar5 cifar5.py --data_dir /datastores/cifar5 --batch_size 8
    grid run --datastore_name cifar5 cifar5.py --data_dir /datastores/cifar5 --batch_size 16
    grid run --datastore_name cifar5 cifar5.py --data_dir /datastores/cifar5 --batch_size 32
    grid run --datastore_name cifar5 cifar5.py --data_dir /datastores/cifar5 --batch_size 64

Grid has a special syntax based on Python that gives you shortcuts for sweeps. The shortcut for the above commands is:

.. code:: bash

    grid run --datastore_name cifar5 cifar5.py --data_dir /datastores/cifar5 --batch_size "[8, 16, 32, 64]"

----

****************
Syntax Shortcuts
****************

List
====

.. code:: bash

    grid run file.py --batch_size "[8, 16, 32, 64]"

is equivalent to:

.. code:: bash

    grid run file.py --batch_size 8
    grid run file.py --batch_size 16
    grid run file.py --batch_size 32
    grid run file.py --batch_size 64
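
To make the expansion concrete, here is a minimal sketch in plain Python of how a quoted list could be turned into one command per value. It only illustrates the semantics shown above and is not Grid's implementation; ``expand_list`` is a hypothetical helper.

.. code:: python

    import ast


    def expand_list(script, flag, spec):
        """Turn a quoted list such as "[8, 16]" into one command per value."""
        values = ast.literal_eval(spec)  # safely parse the Python list literal
        return [f"grid run {script} --{flag} {value}" for value in values]


    for command in expand_list("file.py", "batch_size", "[8, 16, 32, 64]"):
        print(command)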

----

Range
=====

.. code:: bash

    grid run file.py --batch_size "range(1, 10, 2)"

is equivalent to:

.. code:: bash

    grid run file.py --batch_size 1
    grid run file.py --batch_size 3
    grid run file.py --batch_size 5
    grid run file.py --batch_size 7
    grid run file.py --batch_size 9
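
The values listed above match what Python's built-in ``range(start, stop, step)`` produces, where the stop value is excluded. The short sketch below assumes Grid follows the same semantics, which the example is consistent with but which is not confirmed here.

.. code:: python

    # Python's built-in range(start, stop, step) excludes the stop value.
    batch_sizes = list(range(1, 10, 2))
    print(batch_sizes)  # [1, 3, 5, 7, 9]

    # One command per generated value.
    commands = [f"grid run file.py --batch_size {b}" for b in batch_sizes]
    for command in commands:
        print(command)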

----

String list
===========

.. code:: bash

    grid run file.py --model_backbone "['resnet18', 'transformer', 'resnet50']"

is equivalent to:

.. code:: bash

  grid run file.py --model_backbone 'resnet18'
  grid run file.py --model_backbone 'transformer'
  grid run file.py --model_backbone 'resnet50'

----

Sampling
========

.. code:: bash

    grid run file.py --learning_rate "uniform(1e-5, 1e-1, 3)"

is equivalent to something like the following (the values are sampled at random, so they differ from run to run):

.. code:: bash

    grid run file.py --learning_rate 0.03977392
    grid run file.py --learning_rate 0.04835479
    grid run file.py --learning_rate 0.05200016
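
As a rough illustration, the sketch below draws values the way the example suggests. It assumes ``uniform(low, high, n)`` means "draw ``n`` values uniformly at random between ``low`` and ``high``", which may not match how Grid samples. ``sample_uniform`` and its ``seed`` argument are hypothetical and exist only to make the sketch reproducible.

.. code:: python

    import random


    def sample_uniform(low, high, count, seed=None):
        """Draw `count` values uniformly at random between low and high."""
        rng = random.Random(seed)
        return [rng.uniform(low, high) for _ in range(count)]


    learning_rates = sample_uniform(1e-5, 1e-1, 3, seed=0)
    for lr in learning_rates:
        print(f"grid run file.py --learning_rate {lr:.8f}")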

----

****************
Sweep strategies
****************
Models often have dozens of hyperparameters, and running every combination is often prohibitively expensive. Grid supports two strategies:

----

Grid search
===========
Grid search is a common approach that tries all combinations of hyperparameters. Grid will automatically compute the combinations when it detects the special syntax:

.. code:: bash

    grid run file.py --batch_size "[1, 2]" --layers "[3, 5]"

is equivalent to:

.. code:: bash

    grid run file.py --batch_size 1 --layers 3
    grid run file.py --batch_size 2 --layers 3
    grid run file.py --batch_size 1 --layers 5
    grid run file.py --batch_size 2 --layers 5
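
The set of combinations is the Cartesian product of the value lists. The following minimal sketch computes it with ``itertools.product`` from the standard library. It is an illustration rather than Grid's own code, the ``search_space`` dictionary is a hypothetical way of writing down the flags, and the order of the generated commands may differ from the listing above.

.. code:: python

    import itertools

    # Hypothetical representation of the two swept flags.
    search_space = {"batch_size": [1, 2], "layers": [3, 5]}

    names = list(search_space)
    combinations = [
        dict(zip(names, values))
        for values in itertools.product(*search_space.values())
    ]

    for combo in combinations:
        flags = " ".join(f"--{name} {value}" for name, value in combo.items())
        print(f"grid run file.py {flags}")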

----

Random search
=============
With random search, we run only a randomly chosen subset of the hyperparameter combinations. The larger the number of trials (*num_trials*), the more likely we are to find a well-performing model without needing to try all possible combinations.

.. code:: bash

    grid run --strategy random_search --num_trials 2 file.py --batch_size "[1, 2]" --layers "[3, 5]"

The above command generates the 4 combinations and runs only 2 of them, chosen at random. For example:

.. code:: bash

    grid run file.py --batch_size 2 --layers 3
    grid run file.py --batch_size 1 --layers 5
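
Random search can be sketched as building every combination and then drawing ``num_trials`` of them without replacement. This is a plain-Python illustration under that assumption, not Grid's implementation; the ``search_space`` dictionary and the fixed seed are hypothetical choices made so the sketch is reproducible.

.. code:: python

    import itertools
    import random

    # Hypothetical representation of the two swept flags.
    search_space = {"batch_size": [1, 2], "layers": [3, 5]}
    num_trials = 2

    all_combinations = list(itertools.product(*search_space.values()))
    rng = random.Random(0)  # fixed seed only so the sketch is reproducible
    trials = rng.sample(all_combinations, k=num_trials)  # without replacement

    for batch_size, layers in trials:
        print(f"grid run file.py --batch_size {batch_size} --layers {layers}")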

----

**********
Next Steps
**********
Here are the recommended next steps depending on your workflow.

.. raw:: html

    <div class="display-card-container">
        <div class="row">

.. Add callout items below this line

.. displayitem::
   :header: Run with your own cloud credentials
   :description: Learn how to use Grid products with your Company or University cloud account.
   :col_css: col-md-4
   :button_link: run_expert.html
   :height: 180
   :tag: expert

.. raw:: html

        </div>
    </div>