* Basic wandb support
* refactor(wandb): remove unused variables and document logger
* docs(wandb): explain how to use WandbLogger
* test(wandb): add tests for WandbLogger
* feat(wandb): add save_dir
* fix(wandb): allow pickle of logger
* fix(wandb): save logs in custom directory
* test(wandb): test import
* docs(wandb): simplify docstring and use doctest
* test: increase number of epochs for satisfactory accuracy
* test(test_load_model_from_checkpoint): ensure we load last checkpoint
Co-authored-by: Chris Van Pelt <vanpelt@wandb.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
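A minimal sketch of using the WandbLogger added above. The import path and the argument names (`project`, `save_dir`, `offline`) are assumptions based on the commit messages and may differ between versions:

```python
# Hedged sketch: attach the W&B logger to a Trainer.
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger  # path may differ by version

wandb_logger = WandbLogger(
    project="my-project",    # hypothetical project name
    save_dir="logs/wandb",   # custom directory for the local wandb files
    offline=True,            # log locally without syncing to wandb.ai
)
trainer = Trainer(logger=wandb_logger, max_epochs=5)
# trainer.fit(model)  # model is any LightningModule
```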
* added neptune integration
* added tests for NeptuneLogger, added neptune to docs
* updated link to neptune support
* fixed docstrings, fixed try/except in tests, changed append_tags input
* fixed docstring line length
* bumped epoch nr in model restore tests
* added tags support for single strings
* fixed passing neptune token to backend
* fixed project name in offline mode
* added save_top_k=-1 to checkpoint callback
* reformatted initialization of neptune in online mode
* bumped epoch nr to 4 in test_load_model_from_checkpoint
* bumped epoch nr to 5
Co-authored-by: William Falcon <waf2107@columbia.edu>
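A minimal sketch of the NeptuneLogger described above. The import path, the constructor arguments, and the `append_tags` helper are taken from the commit messages and may differ between versions:

```python
# Hedged sketch: attach the Neptune logger to a Trainer.
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import NeptuneLogger  # path may differ by version

neptune_logger = NeptuneLogger(
    api_key="ANONYMOUS",               # token is passed through to the backend
    project_name="shared/my-project",  # hypothetical project
    offline_mode=False,                # offline mode skips the remote project
    tags=["lightning", "example"],
)
# Tags can also be appended later, as a list or a single string.
neptune_logger.append_tags("resnet")

trainer = Trainer(logger=neptune_logger, max_epochs=5)
# trainer.fit(model)
```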
* fix dangling gradients
Make sure only the gradients of the current optimizer's parameters are calculated in the training step.
* add note about multiple optimizer gradient update
* Update training_loop.py
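A sketch (assumed, not the actual `training_loop.py` change) of the idea behind the fix: with multiple optimizers, each `optimizer.step()` should only see gradients computed for its own parameters, so stale ("dangling") gradients from the other optimizer's backward pass must be cleared first.

```python
import torch
import torch.nn as nn

net_a = nn.Linear(4, 1)  # stand-in for e.g. a generator
net_b = nn.Linear(4, 1)  # stand-in for e.g. a discriminator
opt_a = torch.optim.SGD(net_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(net_b.parameters(), lr=0.1)

x = torch.randn(8, 4)

# Step for optimizer A: zero its grads, backprop only A's loss, step A.
opt_a.zero_grad()
loss_a = net_a(x).mean()
loss_a.backward()
opt_a.step()

# Step for optimizer B: zero B's grads so nothing left over from the A step
# (or from a shared graph) is accumulated into B's update.
opt_b.zero_grad()
loss_b = net_b(x).mean()
loss_b.backward()
opt_b.step()
```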
* Renamed `on_sanity_check_start` to `on_train_start` and added `on_train_end` to `ModelHooks`
* changed tests to use `on_train_start` instead of `on_sanity_check_start`
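A minimal sketch of the renamed hooks in a user model; the exact base-class layout (`ModelHooks`) is internal and may differ between versions:

```python
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def on_train_start(self):
        # runs once when real training begins (previously on_sanity_check_start)
        print("training is starting")

    def on_train_end(self):
        # new hook: runs once after training finishes
        print("training is done")
```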
* type: debug
Calculate the correct number of steps to run during the sanity check.
This fixes the bug when there are two or more validation dataloaders.
- Before: total=self.num_sanity_val_steps
- After: total=self.num_sanity_val_steps*len(self.get_val_dataloaders())
* type: refactor
Put total=... in the next line
* type: refactor
run flake8
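A sketch of the corrected sanity-check progress-bar length, assuming `tqdm` and the attribute names used in the commit message above (`num_sanity_val_steps`, the list of validation dataloaders):

```python
from tqdm import tqdm

num_sanity_val_steps = 5
val_dataloaders = [range(10), range(10)]  # two hypothetical validation loaders

# Before: total=num_sanity_val_steps undercounted with 2+ loaders.
# After: each loader contributes num_sanity_val_steps steps to the bar.
pbar = tqdm(
    desc="Validation sanity check",
    total=num_sanity_val_steps * len(val_dataloaders),
)
for loader in val_dataloaders:
    for i, _ in enumerate(loader):
        if i >= num_sanity_val_steps:
            break
        pbar.update(1)
pbar.close()
```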
* use print for INFO and lower levels in summarize()
* use logging.INFO instead of magic number
* bring logging.info back for other cases
* move logging config to __init__.py
* prepend the model summary with a newline
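A sketch of the logging behaviour described above, under the assumption that `summarize()` checks the logger's effective level and falls back to `logging.info` otherwise:

```python
import logging

logging.basicConfig(level=logging.INFO)  # configured once, e.g. in __init__.py
logger = logging.getLogger(__name__)

def summarize(summary_text: str) -> None:
    # Print the summary at INFO and lower (more verbose) levels so the table
    # keeps its multi-line layout; otherwise go through the logger.
    if logger.getEffectiveLevel() <= logging.INFO:   # named constant, no magic number
        print("\n" + summary_text)                   # newline prepended to the summary
    else:
        logger.info("\n" + summary_text)

summarize("Name | Type | Params\n----------------------")
```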
* feat: add reducelronplateau callback
* feat: use reducelronplateau callback in trainer
* feat: only on unsupported lr schedulers
* feat: last but not least, merge of master
* feat: merge master
* feat: support only one scheduler in reduceLrOnPlateauScheduler
* refactor: code style
* Update pt_callbacks.py
* Update trainer.py
* Update train_loop_mixin.py
* Update trainer.py
* Update train_loop_mixin.py
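A plain-PyTorch sketch of the scheduler the commits above wire into the Trainer: `ReduceLROnPlateau` steps on a monitored metric (e.g. a validation loss) rather than on the epoch index, which is why it needed special handling among the "unsupported" lr schedulers.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=2
)

for epoch in range(10):
    # ... training and validation steps would go here ...
    val_loss = 1.0 / (epoch + 1)   # stand-in for a real validation loss
    scheduler.step(val_loss)       # note: takes the metric, not the epoch
    print(epoch, optimizer.param_groups[0]["lr"])
```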
* Avoid race condition in creating checkpoint directories
In multi-GPU training, several processes run the code that creates checkpoint dirs. This fix avoids a probably rare situation (but it happened to me) where another process created a dir between the `exists` check and the `makedirs` call.
* Remove the now unneeded check for dir existence
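A sketch of the race-free directory creation: `os.makedirs` with `exist_ok=True` tolerates the directory being created concurrently by another process, so the separate existence check is no longer needed.

```python
import os

checkpoint_dir = "checkpoints/run_0"  # hypothetical path

# Before (racy across DDP processes):
#   if not os.path.exists(checkpoint_dir):
#       os.makedirs(checkpoint_dir)
# After:
os.makedirs(checkpoint_dir, exist_ok=True)
```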