lightning

Commit Graph

Author	SHA1	Message	Date
Ethan Harris	1c2ecbf70c	TPUSpawn + IterableDataset error message (#6875 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-04-08 19:57:48 +05:30
Ethan Harris	87f0aeac25	Fix DDP_SPAWN compatibility with bug_report_model.py (#6892 )	2021-04-08 19:57:18 +05:30
Oleg	3007872d01	Update mlflow with using resolve_tags (#6746 ) * Update mlflow.py #6745 adds additional info about the run, as in the native API * Update mlflow.py trying to fix some backward compatibility issues with `resolve_tags` * wip on backward compatibility added a default for `getattr` in case the `registry` object exists, but has no proper attribute (weird case but who knows...) * fix pep * impoert * fix registry import * try fix failing tests removed the first if statement, so that `resolve_tags` would be defined either case * fix formatting Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-04-08 10:45:23 +01:00
scart97	eb15abcd82	Fix finetuning complex models correctly unfreezes. (#6880 ) Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-04-08 12:59:06 +05:30
ananthsub	968ac091c0	Remove hardcoding of rank_zero_only.rank in accelerator connector (#6878 )	2021-04-08 12:56:59 +05:30
Carlos Mocholí	128f6ab508	Add separators to performance docs (#6882 )	2021-04-08 08:22:50 +01:00
sk	01b9cf8fdc	Fix csv extension check (#6436 ) Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-04-08 01:16:31 +00:00
Kaushik B	9fbe724b2b	Update Changelog for v1.2.7 (#6874 ) * Update Changelog for v1.2.7 * legacy Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-04-07 22:58:41 +00:00
Carlos Mocholí	19e67d18c4	Docs fixes (#6870 )	2021-04-07 16:57:22 +01:00
shuyingsunshine21	313e81638d	Supporting Adding DDP Communication Hooks (#6736 ) * Fix some test errors Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * checkpoint consolidation * Update ddp_spawn.py * Update test_metric_result_integration.py * Update test_results.py * Update utils.py * Update utils.py * Update test_all_gather_grad.py * Update test_all_gather_grad.py * Update test_results.py * Revert "Update test_results.py" This reverts commit `9d4a2b891d`. * Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate" This reverts commit `c5053da789`, reversing changes made to `0d23d75bc9`. * Revert "Update test_all_gather_grad.py" This reverts commit `0d23d75bc9`. * Revert "Update utils.py" This reverts commit `70fe5da9c6`. * Revert "Update utils.py" This reverts commit `a9aae99f6e`. * Revert "Update test_results.py" This reverts commit `ea74906878`. * Revert "Update test_metric_result_integration.py" This reverts commit `bf70e431b3`. * Revert "Update ddp_spawn.py" This reverts commit `f17210183b`. * Revert "checkpoint consolidation" This reverts commit `536c1323b0`. * Revert "Revert "checkpoint consolidation"" This reverts commit `3a9fde915a`. * Revert "Revert "Revert "checkpoint consolidation""" This reverts commit `7a369f47e1`. * Revert "Revert "Update ddp_spawn.py"" This reverts commit `8222dc98ea`. * Revert "Revert "Update test_metric_result_integration.py"" This reverts commit `6c095b2370`. * Revert "Revert "Update test_results.py"" This reverts commit `250d0aaaa2`. * Revert "Revert "Update utils.py"" This reverts commit `8651d54d79`. * Revert "Revert "Update test_all_gather_grad.py"" This reverts commit `dcdcd29731`. * modify distributed environment to make test pass * add DDP communication hook * remove test related setting * remove more test related setting * fix ddp comm hook util import issue * comments * one more fix for test_custom_plugin * fix ddp spwan * fix sgd * address comments and add tests * 1. add is gpu checking 2. modify test a bit 3. formatting * formatting nit * fix conda 3.7 1.7 issue for no torch.distributed.algorithms module * need at least 1.8.0 * minor fix * modify changelog * changelog should link to PR number instead of issue number * refine a bit on doc for register_ddp_comm_hook function, like ddp_comm_wrapper explanation and add hyperparameter for power sgd states in example usge * move single device checking before call register_ddp_comm_hook * formatting * comments * typo * pre-commit formatting	2021-04-07 12:35:57 +01:00
ananthsub	86e1d9f759	[fix] Better support for rank_zero_only setting for SLURM and torchelastic (#6802 ) Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2021-04-07 12:25:13 +01:00
Roger Shieh	a2c605785a	Update seed_everything() (#6843 ) * Update seed.py * Update pytorch_lightning/utilities/seed.py Co-authored-by: thomas chaton <thomas@grid.ai> * Update seed.py * Update seed.py * Update seed.py Co-authored-by: thomas chaton <thomas@grid.ai>	2021-04-07 13:17:48 +02:00
Adrian Wälchli	b7a22ba046	CI: fixture for global rank variable reset (#6839 )	2021-04-06 09:37:17 -07:00
Kaushik B	a17c027ea1	Update sync_dist warning for multiple processes (#6790 )	2021-04-06 16:57:43 +02:00
Anthony Kim	7f6154fcad	Add `Trainer(gradient_clip_algorithm='value'|'norm')` (#6123 ) * add changelog * add clip by value * fix bug in training tricks.rst * fix bug in trainer.rst * Update trainer.rst * Update trainer.rst * Update CHANGELOG.md Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/plugins/precision/deepspeed_precision.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/utilities/enums.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * yapf formatting * update training tricks * update based on comment * update based on comment * Update pytorch_lightning/trainer/trainer.py Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> * update based on comment * pep8 * mypy * mypy * Update docs/source/advanced/training_tricks.rst Co-authored-by: thomas chaton <thomas@grid.ai> * Update sharded_native_amp.py * Update test_sharded_parity.py * update test codes * Update test_tpu.py * Update pytorch_lightning/trainer/connectors/training_trick_connector.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Update test_trainer.py * Update enums.py * Update enums.py * add super-class initialization to precision plugins. * add clip_grad horovod cpu test * add clip_grad horovod cpu test * use subprocess check_call * change order of horovod tests * set max_epochs 2 in horovod test * remove clip_grad_val test from horovod-cpu * remove "type: ignore" * divide clip grad val test in horovod * update based on comments * add super-class initialization to precision plugins. * bugfix * bugfix * revert some changes * revert some changes * Update tests/models/test_horovod.py * merge master * Delete signature test No point in testing a signature Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: thomas chaton <thomas@grid.ai> Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-04-06 08:27:37 -05:00
Mauricio Villegas	b7f3a3c421	Simple reproducibility with minimum boilerplate CLI training with `LightningCLI` (#4492 ) Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-04-06 14:19:11 +01:00
Adrian Wälchli	127c52af74	Fix EarlyStopping logic when min_epochs not met (#6705 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-04-06 12:41:07 +01:00
Tharindu Hasthika	f581411210	Fixed missing arguments in `lr_find` call (#6784 ) There seem to be 3 arguments missing in the `lr_find` call in the tunining.py file.	2021-04-06 11:37:15 +02:00
Ethan Harris	89b5326ca5	Fix support for symlink save_dir in TensorBoardLogger (#6730 ) * Add test for symlink support and initial fix * Respond to comment and add docstring * Update CHANGELOG.md * Simplify * Update pytorch_lightning/utilities/cloud_io.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Make `LightningLocalFileSystem` protected Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-04-06 11:36:25 +02:00
Kaushik B	cf8e828559	[Fix] TPU Training Type Plugin (#6816 )	2021-04-06 15:02:44 +05:30
Eugene Khvedchenya	eafec7d425	Fix DPP + SyncBN (#6838 ) * Fix DPP + SyncBN Ensure that model is already on correct GPU before applying SyncBN conversion * Fix order of SyncBN for ddp_spawn	2021-04-06 08:40:29 +01:00
Michael Baumgartner	6dc1078822	Enforce an epoch scheduler interval when using SWA (#6588 ) Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-04-06 02:57:33 +00:00
Sadiq Jaffer	7f91c5ebbc	Fix `unfreeze_and_add_param_group` expects `modules` rather than `module` (#6822 )	2021-04-06 01:50:42 +02:00
Karthik Prasad	c3da7f50bb	Sanitize `None` params during pruning (#6836 ) * sanitize none params during pruning * amend	2021-04-06 01:47:59 +02:00
Adrian Wälchli	264aa689de	fix boolean check on iterable dataset when len not defined (#6828 ) * fix iterable dataset len check * update predict and validate * add validate to test * add changelog * add predict	2021-04-05 17:47:21 +01:00
Kaushik B	22a266d8b8	Update TPU docs for installation (#6794 )	2021-04-04 00:19:43 +05:30
ananthsub	bb9ace4333	[typing] Add typehint for broadcast in training type plugin (#6777 ) * Update training_type_plugin.py * Update accelerator.py * Update pytorch_lightning/plugins/training_type/training_type_plugin.py Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>	2021-04-02 20:55:34 +02:00
Elizaveta Logacheva	f8a379830d	Remove extinct parameters from lightning_module.rst (#6801 ) Fixes #6800	2021-04-02 20:49:20 +02:00
Yuan-Hang Zhang	1bd5f36a5b	Fix validation progress counter with check_val_every_n_epoch > 1 (#5952 ) Co-authored-by: rohitgr7 <rohitgr1998@gmail.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-04-02 17:40:41 +09:00
Jirka Borovec	0b843848b6	less IDE complain about unused args (#6786 ) * less IDE complain about unused args * ...	2021-04-01 18:19:00 +02:00
thomas chaton	3e3175d074	resolve bug (#6781 )	2021-04-01 11:43:23 +01:00
Kaushik B	13f67ad313	Update logic for checking TPUs availability (#6767 ) * Update logic for checking TPUs availability * fix flake8 * add fix	2021-04-01 03:04:33 +05:30
Kaushik B	a72a7992a2	Update clip gradients signature for precision plugins (#6764 )	2021-03-31 17:06:48 +05:30
Carlos Mocholí	495c385a54	Add 1.2.6 section to CHANGELOG (#6732 ) * Add 1.2.6 sections to CHANGELOG * Update CHANGELOG.md * legacy Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-03-30 18:25:22 -07:00
Carlos Mocholí	0dd2deebea	Remove legacy support for the magic `log`/`progress_bar` keys in dict returns (#6734 )	2021-03-31 00:28:04 +02:00
Sean Naren	f9bb7c641a	DeepSpeed ZeRO Docs update (#6752 ) * Added base docs * Add more information * Apply suggestions from code review Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-03-30 21:52:02 +00:00
thomas chaton	1302766f83	DeepSpeed ZeRO Update (#6546 ) * Add context to call hook to handle all modules defined within the hook * Expose some additional parameters * Added docs, exposed parameters * Make sure we only configure if necessary * Setup activation checkpointing regardless, saves the user having to do it manually * Add some tests that fail currently * update * update * update * add tests * change docstring * resolve accumulate_grad_batches * resolve flake8 * Update DeepSpeed to use latest version, add some comments * add metrics * update * Small formatting fixes, clean up some code * Few cleanups * No need for default state * Fix tests, add some boilerplate that should move eventually * Add hook removal * Add a context manager to handle hook * Small naming cleanup * wip * move save_checkpoint responsability to accelerator * resolve flake8 * add BC * Change recommended scale to 16 * resolve flake8 * update test * update install * update * update test * update * update * update test * resolve flake8 * update * update * update on comments * Push * pull * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * update * Apply suggestions from code review * Swap to using world size defined by plugin * update * update todo * Remove deepspeed from extra, keep it in the base cuda docker install * Push * pull * update * update * update * update * Minor changes * duplicate * format * format2 Co-authored-by: SeanNaren <sean@grid.ai> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-03-30 13:39:02 -04:00
Akihiro Nitta	9876df16a2	[docs] Update Bolts link (#6743 ) * Update Bolts link * Update Bolts link * formt Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-03-30 22:52:59 +05:30
thomas chaton	bb92754119	[bugfix] Add support for omegaconf and tpu (#6741 ) * fix_hydra * update changelog Co-authored-by: Your Name <you@example.com>	2021-03-30 16:21:25 +01:00
Jirka Borovec	583fcf281c	update chlog v1.2.5 (#6742 ) * update chlog v1.2.5 * legacy	2021-03-30 12:45:07 +02:00
Carlos Mocholí	90444706b2	Remove logger_connector legacy code (#6733 )	2021-03-30 12:33:33 +02:00
Jirka Borovec	3c86193de0	update readme by v1.2.x (#6728 )	2021-03-29 18:06:24 -04:00
Kaushik B	f79a13e495	[Model Parallel] Add configure sharded model hook (#6679 ) * Add base hook for model parallel * fix callback signature * Simplify hook * Add hook logic * add tests * add property setter * add logic for being called once * Update changelog * Fix * fix return type * fix lambda callback test * Fix tests * Apply code suggestions * add logic for setup_optimizers_predispatch * add common dummy model * Swap call order * Remove test that isn't needed anymore * Update tests * Add a bit more doc * Few code review fixes * Update pytorch_lightning/accelerators/accelerator.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Change hook name * Fix test * Test setup hook, refactor names * Swap call order of callbacks and model initialization * Change name of context manager Co-authored-by: SeanNaren <sean@grid.ai> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-03-29 14:50:51 -06:00
thomas chaton	646cf2f7d4	[refactor] Move save_function to accelerator 1/n [DeepSpeed] (#6689 ) * move save_checkpoint responsability to accelerator * update	2021-03-29 21:02:37 +02:00
thomas chaton	3a4c4246ee	[TPU] update is_tpu_exists utils internal logic to rely on xmp.spawn (#6719 ) * update_logic * update * Update tests/utilities/test_xla_device_utils.py * Update pytorch_lightning/utilities/xla_device.py Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> * Update pytorch_lightning/utilities/xla_device.py Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> * update test * Update tests/utilities/test_xla_device_utils.py * update * Apply fix * Docstring * flake8 * update Co-authored-by: Your Name <you@example.com> Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-03-29 18:59:20 +01:00
Jirka Borovec	5b5a5cc80b	support python 3.9 (#4944 ) * support python 3.9 * update CI * onnxruntime * . * . * onnxruntime * t 55 * t 75 * add script * use * onnx * onnx * onnx * whl * np * find * 21 * Apply suggestions from code review * Apply suggestions from code review * onnx * CI * req * ~ dockers * min * . * drop horovod * drop horovod * drop horovod * fix * fix * .	2021-03-29 12:20:13 -04:00
Łukasz Zalewski	cca0eca5f3	More explicit exception message when testing with fast_dev_run=True (#6667 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-03-29 13:29:54 +00:00
Jirka Borovec	dcf6e4e310	remake nvidia docker (#6686 ) * use latest * remake * examples	2021-03-29 09:39:06 +01:00
Carlos Mocholí	f0c5479de9	Remove legacy `Result` parameters (#6016 )	2021-03-28 11:55:08 +02:00
thomas chaton	0e45220263	[warning] Add warning when values are not being reduced (#6417 ) * add warning non reduced * add test * update test * update changelog * Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> * update Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>	2021-03-26 18:33:11 +00:00

4668 Commits