24 Commits

Author SHA1 Message Date
Jirka Borovec 448be60701
update GPU to PT 1.5 (#2779)
* update gpu PT 1.6

* fix docker

* use PT 1.5

* Update tests/install_AMP.sh

Co-authored-by: Nathan Raw <nxr9266@g.rit.edu>
2020-08-02 08:14:53 -04:00
Jirka Borovec 3772601cd6
update CI testing with pip upgrade (#2380)
* try pt1.5

* cpu

* upgrade

* tpu

* user

* [blocked by #2380] freeze GPU PT 1.4 (#2780)

* freeze

* user
2020-07-31 14:50:06 -04:00
Jirka Borovec bc7a08fbe0
test dockers & add AMP in pt-1.6 (#1584)
* exist images

* names

* images

* args

* pt 1.6 dev

* circleci

* update

* refactor

* build

* fix

* MKL
2020-07-31 08:23:13 -04:00
Adrian Wälchli 7ef73f242a
try remove pr (#2543)
2020-07-07 15:26:58 -04:00
Jirka Borovec 977df6ed31
Docker: building XLA base image (#2494)
* refactor

* add TPU base

* wip

* builds

* typo

* extras

* simple

* unzip

* rename
2020-07-06 14:21:36 -04:00
Jirka Borovec 39a6435726
Revert "Revert "join coverage (#2460)" (#2499)" (#2500)
This reverts commit 355918af8d.
2020-07-04 11:31:12 -04:00
William Falcon 355918af8d
Revert "join coverage (#2460)" (#2499)
This reverts commit 944ffba305.
2020-07-04 10:29:50 -04:00
Jirka Borovec 944ffba305
join coverage (#2460)
* join coverage

* full TPU test

* codecov

* typo

* report

* docker

* timeout

* base

* show

* cd dir

* req

* docker

* docker

* docker

* coverage

* upload

* drop main

* report

* report

* python

* upload

* drone

* drone

* drone

* drone

* drone

* drone

* drone

* drone

* drone
2020-07-04 10:22:58 -04:00
zcain117 1a40963d1d
Add Github Action to run TPU tests. (#2376)
* Add Github Action to run TPU tests.

* Trigger new Github Actions run.

* Clean up more comments.

* Use different fixed version of ml-testing-accelerators and update config to match.

* use cluster in us-central1-a

* Run 'gcloud logging read' directly without 'echo' to preserve newlines.

* cat coverage.xml on the TPU VM side and upload xml on the Github Action side

* Use new commit on ml-testing-accelerators so command runs fully.

* Preserve newlines in the xml and use if: always() temporarily to upload codecov

* Use pytorch_lightning for coverage instead of pytorch-lightning

* Remove the debug cat of coverage xml

* Apply suggestions from code review

* jsonnet rename

* name

* add codecov flags

* add codecov flags

* codecov

* codecov

* revert codecov

* Clean up after apt-get and remove old TODOs.

* More codefactor cleanups.

* drone

* drone

* disable codecov

* cleaning

* docker py versions

* docker py 3.7

* readme

* bash

* docker

* freeze conda

* py3.6

* Stop using apt-get clean.

* Don't rm pytorch-lightning

* Update docker/tpu/Dockerfile

* Longer timeout in the Github Action to wait for GKE to finish.

* job1

* job2

* job3

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
2020-07-01 21:44:19 -04:00
Jirka Borovec 4e13e419ea
add CLI test for examples (#2285)
* cli examples

* ddp

* CI

* CI

* req

* tests

* skip DDP

Co-authored-by: William Falcon <waf2107@columbia.edu>
2020-06-27 09:13:29 -04:00
Jirka Borovec bfaabd7b7f
clean requirements (#2128)
* clean requirements

* missing

* missing

* req

* min

* default >> base

* base.txt
2020-06-13 10:15:22 -04:00
Jirka Borovec 2674976f2c
remove deprecated API for v0.8 (#2073)
* remove deprecated API

* chlog

* times

* missed

* formatting check

* missing

* missing

* miss

* fix docs build error

* fix pep whitespace error

* docs

* wip

* amp_level

* amp_level

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2020-06-12 14:37:52 -04:00
Jirka Borovec c438d0dd90
increase acc (#2039)
* increase acc

* try 0.45

* @pytest

* @pytest

* try .50

* duration

* pytest
2020-06-03 08:28:19 -04:00
Adrian Wälchli a6de1b8d75
doctest for .rst files (#1511)
* add doctest to circleci

* Revert "add doctest to circleci"

This reverts commit c45b34ea911a81f87989f6c3a832b1e8d8c471c6.

* Revert "Revert "add doctest to circleci""

This reverts commit 41fca97fdcfe1cf4f6bdb3bbba75d25fa3b11f70.

* doctest docs rst files

* Revert "doctest docs rst files"

This reverts commit b4a2e83e3da5ed1909de500ec14b6b614527c07f.

* doctest only rst

* doctest debugging.rst

* doctest apex

* doctest callbacks

* doctest early stopping

* doctest for child modules

* doctest experiment reporting

* indentation

* doctest fast training

* doctest for hyperparams

* doctests for lr_finder

* doctests multi-gpu

* more doctest

* make doctest drone

* fix label build error

* update fast training

* update invalid imports

* fix problem with int device count

* rebase stuff

* wip

* wip

* wip

* intro guide

* add missing code block

* circleci

* logger import for doctest

* test if doctest runs on drone

* fix mnist download

* also run install deps for building docs

* install cmake

* try sudo

* hide output

* try pip stuff

* try to mock horovod

* Tranfer -> Transfer

* add torchvision to extras

* revert pip stuff

* mlflow file location

* do not mock torch

* torchvision

* drone extra req.

* try higher sphinx version

* Revert "try higher sphinx version"

This reverts commit 490ac28e46d6fd52352640dfdf0d765befa56988.

* try coverage command

* try coverage command

* try undoc flag

* newline

* undo drone

* report coverage

* review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* remove torchvision from extras

* skip tests only if torchvision not available

* fix testoutput torchvision

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-05-04 22:16:54 -04:00
William Falcon 29ebe92208
support for native amp (#1561)
* adding native amp support

* adding native amp support

* adding native amp support

* adding native amp support

* autocast

* autocast

* autocast

* autocast

* autocast

* autocast

* removed comments

* removed comments

* added state saving

* added state saving

* try install amp again

* added state saving

* drop Apex reinstall

Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
2020-04-23 14:47:08 -04:00
Jirka Borovec 0b22b64a10
Tests/docker (#1573)
* devel image

* try parallel

* new image
2020-04-23 12:52:59 -04:00
Travis Addair 7024177f7d
Added Horovod distributed backend (#1529)
* Initial commit of Horovod distributed backend implementation

* Update distrib_data_parallel.py

* Update distrib_data_parallel.py

* Update tests/models/test_horovod.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/models/test_horovod.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Fixed tests

* Added six

* tests

* Install tox for GitHub CI

* Retry tests

* Catch all exceptions

* Skip cache

* Remove tox

* Restore pip cache

* Remove the cache

* Restore pip cache

* Remove AMP

Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
2020-04-22 17:39:08 -04:00
Jirka Borovec 724b787cd1
faster CI testing (#1323)
* MNIST digits

* increase test acc

* smaller parity

* drone builds

* increase GH action timeout

* drone format

* fix paths

* drone cache

* circle cache

* fix test

* lower nb epochs

* circleCI

* user orb

* fix test

* fix test

* circle cache

* circle cache

* circle cache

* comment caches

* benchmark batch size

* cache dataset

* smaller dataset

* smaller dataset

* fix nb samples

* batch size

* fix test
2020-04-02 12:28:44 -04:00
William Falcon 18d055a390
Parity test (#1284)
* adding test

* adding test

* added base parity model

* added base parity model

* added parity test

* added parity test

* added parity test

* added parity test

* added parity test

* added parity test

* added parity test

* added parity test

* added parity test

* added parity test

* added parity test

* added parity test

* added parity test

* added parity test

* added parity test

* move parity to benchmark

* formatting

* fixed gradient acc sched

* move parity to benchmark

* formatting

* fixed gradient acc sched

* skip for CPU

* call last

Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
2020-03-30 18:16:32 -04:00
Jirka Borovec 61177cd1c8
system info (#1234)
* system info

* update big info

* test script

* update config

* rename script

* import path
2020-03-27 08:45:52 -04:00
Jirka Borovec 45d671a4a8
CI: split tests-examples (#990)
* CI: split tests-examples

* tests without template

* comment depends

* CircleCI typo

* add doctest

* update test req.

* CI tests

* setup macOS

* longer train

* lower pred acc

* fix model

* rename default model

* lower tests acc

* typo

* imports

* fix test optimizer

* update calls

* fix Win

* lower Drone image

* fix call

* pytorch image

* fix test

* add dev image

* add dev image

* update image

* drone volume

* lint

* update test notes

* rename tests/models >> tests/base

* group models

* conftest

* optim imports

* typos

* fix import

* fix tests

* install AMP

* tests

* fix import
2020-03-25 07:46:27 -04:00
Jirka Borovec 22a7264e9a
improve partial Codecov (#1172)
* ignore in setup

* show report

* abs imports

* abstract pass

* cover loggers

* doctest trains

* locals

* pass

* revert tensorboard

* use tensorboardX

* revert tensorboardX

* fix trains

* Add TrainsLogger.set_credentials (#1179)

* Add TrainsLogger.set_credentials to control trains server configuration and authentication from code. Sync trains package version.
Fix CI Trains tests

* Add global TrainsLogger set_bypass_mode (#1187)

* Add global TrainsLogger set_bypass_mode skips all external communication

Co-authored-by: bmartinn <>

* rm some no-cov

Co-authored-by: Martin.B <51887611+bmartinn@users.noreply.github.com>
2020-03-19 09:14:29 -04:00
Jirka Borovec f6a7a5278a
enable Codecov (#1133)
* update config

* try Drone cache

* drop Drone cache

* move import

* remove token
2020-03-14 13:01:57 -04:00
Jirka Borovec 5691ffb160
add Drone CI (#1115)
* add Drone config

* update Drone config

* add Drone config

* list GPUs

* add type

* native torch

* native torch

* fix image

* update

* SLURM_LOCALID

* add badge

* simple test
2020-03-11 15:39:59 -04:00