Commit Graph

12 Commits

Author SHA1 Message Date
William Falcon cf182e80fc
Finish Allow on_save_checkpoint... (#3688)
* Finish #3562

* Apply suggestions from code review

* Apply suggestions from code review

* fix tests

* Finish #3562

* Apply suggestions from code review

* Apply suggestions from code review

* fix tests

* fix structure

* fix structure

* make save_last test pass

* unnecessary global rank check

* fix test

* update test

* update test

* test

* test

* run save on all

* remove assert

* tracking saves

* check if fails

* test

* clean up

* adjust horovod test

* clean up

* remove unnecessary makdirs

* change

* undo

* debug

* debug

* debug

* debug

* mock

* undo debug code

* add extra assertions

* test

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <adrian.waelchli@inf.unibe.ch>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-09-30 16:15:29 -04:00
Jirka Borovec 7b64472ced
fix lib paths after Wandb 0.10 (#3520)
* try

* try

* drop 0.20

* drop 0.19.5

* -U

* Fixed Horovod in CI due to wandb==0.10.0 sys.path modifications (#3525)

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* format

* wb freeze

* types

Co-authored-by: Travis Addair <taddair@uber.com>
2020-09-17 08:37:49 -04:00
William Falcon cd16aa9854
ref: checkpoint connector methods 4/n (#3474)
* ref: checkpoint connector methods 4/n

* ref: checkpoint connector methods 4/n

* ref: checkpoint connector methods 4/n

* ref: checkpoint connector methods 4/n

* ref: checkpoint connector methods 4/n

* ref: checkpoint connector methods 4/n

* ref: checkpoint connector methods 4/n

* ref: checkpoint connector methods 4/n

* ref: checkpoint connector methods 4/n
2020-09-12 08:42:27 -04:00
Adrian Wälchli d03953260d
Fix weights_save_path when logger is used + simplify path handling + better docs (#2681)
* fix weights_save path and drop ckpt_path

* add tests

* unused import

* update docs

* changelog

* pep8

* fix horovod test

* make backward compatible

* perform same test for all loggers

* fix for when logger=False and weights_save_path is set

* update changelog

* update docs

* update tests

* do not set save dir dynamically

* remove duplicate test

* remove duplicated tests

* update tests

* update tests

* remove remaining ckpt_path references

* move defaults to init as suggested by @Borda

* test deprecation
2020-07-27 12:53:11 -04:00
William Falcon f35337adba
Fixes .test() for ddp (#2570)
* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint

* enable none checkpoint
2020-07-09 18:36:36 -04:00
Adrian Wälchli 78db847e42
Fixed skipped horovod tests (#2514)
* skip ckpt test on rank  > 0

* fx test

* add extra assert

* code factor

* add back removed

* add old loading code

* add back old

* unused import

* add same skip to run_model_without_loggers

* test if horovod now works with python 3.8

* test remove all 3.8 skips

* remove spawn

* fix

* fix test

* move load check up

* fix test multigpu

* rename

* fix gpu mode

* on gpu fix when on cpu

* move
2020-07-07 14:54:07 -04:00
Jirka Borovec f1c96930b1
repair CI for Win (#2358)
* no cov

* no cov

* ReduceOp

* group

* reduce_op.sum

* Update sklearns.py

* formatting

* horovod

* Apply suggestions from code review

* horovod

* horovod

* horovod

* horovod

* ci

* print

* ci

* timeout

* timeout

* time

* fix

* distributed cpu

* pipes

* time

* cpu

* spawn

* spawn

* spawn

* tp

* separate

* os

* os

* npm

* Fix load_from_checkpoint() not working with URL on Windows

* Update CHANGELOG

* Update CHANGELOG.md

Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>

* Apply suggestions from code review

* fix

* fix meta tags creating empty lines

* pyright

* node

* fix httpserver address

* drop tutils.default_trainer_options

* imports

* Better fix for load_from_checkpoint() not working with absolute path on Windows (#2294)

* Fix load_from_checkpoint() not working with URL on Windows

* Update CHANGELOG

* Update CHANGELOG.md

Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>

* drop duplicate

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: airium <airium@outlook.com>
Co-authored-by: Peter Yu <2057325+yukw777@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: AIRIUM <38249940+airium@users.noreply.github.com>
2020-06-26 21:38:25 -04:00
Jirka Borovec 9d2df24d6b
RC & Docs/changelog (#1776)
* missing

* RC

* tol

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* test

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-05-11 21:57:53 -04:00
Jirka Borovec 134eb61e1a
Tests: refactor cleanup (#1744)
* wip

* cleaning

* optim imports

* -

* default hparams

* fix restore

* fix imports
2020-05-10 13:15:28 -04:00
Jirka Borovec f380027951
refactor default model (#1652)
* refactor default model

* drop redundant seeds

* formatting

* path

* formatting

* rename
2020-05-02 08:38:22 -04:00
Travis Addair 2950f66983
Fix Horovod distributed backend to set the root_gpu property (#1669)
* params

* drop acc

* Fix Horovod distributed backend to set the root_gpu

* Fixed test

* Fixed tests

* Fixed lint

* Set root_gpu during initialization

* chlog

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
2020-05-01 14:13:35 -04:00
Travis Addair 7024177f7d
Added Horovod distributed backend (#1529)
* Initial commit of Horovod distributed backend implementation

* Update distrib_data_parallel.py

* Update distrib_data_parallel.py

* Update tests/models/test_horovod.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/models/test_horovod.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Fixed tests

* Added six

* tests

* Install tox for GitHub CI

* Retry tests

* Catch all exceptions

* Skip cache

* Remove tox

* Restore pip cache

* Remove the cache

* Restore pip cache

* Remove AMP

Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
2020-04-22 17:39:08 -04:00