Template model definition
In 99% of cases you will simply copy this template to start a new LightningModule and change the core of what your model is actually trying to do.
# get a copy of the module template
wget https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/lightning_module_template.py
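To give a sense of what "the core of your model" means, here is a minimal, illustrative sketch of a LightningModule rather than the actual template: the class name, toy layer, loss, random data, and exact hook names are assumptions and may differ slightly from the template file and from your installed lightning version.

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl


class ExampleModel(pl.LightningModule):
    # toy module: swap the layer, loss, and data for whatever your model actually does
    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams  # template-era convention for storing the argparse namespace
        self.l1 = nn.Linear(32, 2)

    def forward(self, x):
        return self.l1(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

    def train_dataloader(self):
        # random tensors just to keep the sketch self-contained
        x = torch.randn(64, 32)
        y = torch.randint(0, 2, (64,))
        return DataLoader(TensorDataset(x, y), batch_size=8)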
Trainer Example
**__main__ function**
Normally, we want to let the __main__ function start the training. Inside __main__ we parse the training arguments with whatever hyperparameters we want. Your LightningModule will have a chance to add its own hyperparameters.
import os
import sys

from test_tube import HyperOptArgumentParser

if __name__ == '__main__':
    # use the default args given by lightning
    root_dir = os.path.split(os.path.dirname(sys.modules['__main__'].__file__))[0]
    parent_parser = HyperOptArgumentParser(strategy='random_search', add_help=False)
    add_default_args(parent_parser, root_dir)

    # allow the model to overwrite or extend the args
    parser = ExampleModel.add_model_specific_args(parent_parser)
    hyperparams = parser.parse_args()

    # train model
    main(hyperparams)
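The add_model_specific_args call above is the hook that lets the model extend or overwrite the parser it receives. Below is a rough sketch of what that method can look like; the argument names, defaults, and option values are invented for illustration, and one common pattern is to build a child HyperOptArgumentParser from the parent. test_tube's opt_list marks an argument as tunable, which gives the hyperparameter-search modes further down something to sample.

from test_tube import HyperOptArgumentParser

import pytorch_lightning as pl


class ExampleModel(pl.LightningModule):
    # continuing the sketch from above; only the argument-parsing method is shown

    @staticmethod
    def add_model_specific_args(parent_parser):
        # extend the parser we were handed instead of creating a fresh one
        parser = HyperOptArgumentParser(strategy=parent_parser.strategy,
                                        parents=[parent_parser])

        # plain argument: the same value is used for every trial
        parser.add_argument('--batch_size', default=32, type=int)

        # tunable argument: the search strategies sample from `options`
        parser.opt_list('--learning_rate', default=0.001, type=float,
                        options=[0.0001, 0.001, 0.01], tunable=True)
        return parser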
Main Function
The main function is your entry point into the program. This is where you initialize your model, set up the checkpoint directory, and launch the training. The main function should take 3 arguments:
- hparams: a configuration of hyperparameters.
- cluster: the SlurmCluster manager object (can be None when not running on a cluster).
- results_dict: a dict for returning any values you want (useful in meta-learning, otherwise set to _).
import os

from test_tube import Experiment
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint


def main(hparams, cluster=None, results_dict=None):
    """
    Main training routine specific for this project.

    :param hparams: hyperparameters from the argument parser
    :param cluster: SlurmCluster manager (None when not running on a cluster)
    :param results_dict: dict used to return values to the caller
    """
    # init experiment
    log_dir = os.path.dirname(os.path.realpath(__file__))
    exp = Experiment(
        name='test_tube_exp',
        debug=True,
        save_dir=log_dir,
        version=0,
        autosave=False,
        description='test demo'
    )

    # set the hparams for the experiment
    exp.argparse(hparams)
    exp.save()

    # build model
    model = MyLightningModule(hparams)

    # callbacks
    early_stop = EarlyStopping(
        monitor=hparams.early_stop_metric,
        patience=hparams.early_stop_patience,
        verbose=True,
        mode=hparams.early_stop_mode
    )

    model_save_path = '{}/{}/{}'.format(hparams.model_save_path, exp.name, exp.version)
    checkpoint = ModelCheckpoint(
        filepath=model_save_path,
        save_function=None,
        save_best_only=True,
        verbose=True,
        monitor=hparams.model_save_monitor_value,
        mode=hparams.model_save_monitor_mode
    )

    # configure trainer
    trainer = Trainer(
        experiment=exp,
        cluster=cluster,
        checkpoint_callback=checkpoint,
        early_stop_callback=early_stop,
    )

    # train model
    trainer.fit(model)
The main function starts the training for your model. If you use the HyperOptArgumentParser in hyperparameter-optimization mode, main will be called with one set of hyperparameters per trial. If you use it as a simple argument parser, you get the default arguments from the argument parser.
So, calling main(hyperparams) runs the model with the default argparse arguments.
main(hyperparams)
CPU hyperparameter search
# run 20 trials of hyperparameter search on the CPU
# (the parser above was built with the random_search strategy)
hyperparams.optimize_parallel_cpu(
    main_local,
    nb_trials=20,
    nb_workers=1
)
Hyperparameter search on a single or multiple GPUs
# run 20 trials of hyperparameter search across 4 GPUs
hyperparams.optimize_parallel_gpu(
    main_local,
    nb_trials=20,
    nb_workers=1,
    gpus=[0, 1, 2, 3]
)
Hyperparameter search on a SLURM HPC cluster
from test_tube import SlurmCluster


def optimize_on_cluster(hyperparams):
    # enable cluster training
    cluster = SlurmCluster(
        hyperparam_optimizer=hyperparams,
        log_path=hyperparams.tt_save_path,
        test_tube_exp_name=hyperparams.tt_name
    )

    # email for cluster comms
    cluster.notify_job_status(email='add_email_here', on_done=True, on_fail=True)

    # configure cluster
    cluster.per_experiment_nb_gpus = hyperparams.per_experiment_nb_gpus
    cluster.job_time = '48:00:00'
    cluster.gpu_type = '1080ti'
    cluster.memory_mb_per_node = 48000

    # any modules your code needs in the job environment
    cluster.add_command('source activate pytorch_lightning')

    # short display name for the SLURM job
    job_display_name = hyperparams.tt_name.split('_')[0]
    job_display_name = job_display_name[0:3]

    # run hyperparameter optimization
    print('submitting jobs...')
    cluster.optimize_parallel_cluster_gpu(
        main,
        nb_trials=hyperparams.nb_hopt_trials,
        job_name=job_display_name
    )
# run cluster hyperparameter search
optimize_on_cluster(hyperparams)