lightning

Commit Graph

Author	SHA1	Message	Date
ananthsub	851f9e3997	Move NaN/Inf detection to a separate utilities file (#6834 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-04-09 01:47:02 +02:00
Sean Naren	742c48e994	[Fix] Ensure we set the eval/train flag correctly on accelerator model (#6877 ) * Ensure we move the model to eval mode before running evaluation * Ensure we set the flag appropriately across all stages * Add test, move hooks logic * Apply same fix to the validate loop * Update pytorch_lightning/trainer/trainer.py * Fix function name * Fix order, add predict * Shorten the name * Fix input dm, drop duplicate on predict start hook call, as it's called in the setup function * Use hook, remove double call	2021-04-08 14:04:26 -04:00
Ethan Harris	1c2ecbf70c	TPUSpawn + IterableDataset error message (#6875 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-04-08 19:57:48 +05:30
scart97	eb15abcd82	Fix finetuning complex models correctly unfreezes. (#6880 ) Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-04-08 12:59:06 +05:30
Kaushik B	9fbe724b2b	Update Changelog for v1.2.7 (#6874 ) * Update Changelog for v1.2.7 * legacy Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-04-07 22:58:41 +00:00
shuyingsunshine21	313e81638d	Supporting Adding DDP Communication Hooks (#6736 ) * Fix some test errors Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * checkpoint consolidation * Update ddp_spawn.py * Update test_metric_result_integration.py * Update test_results.py * Update utils.py * Update utils.py * Update test_all_gather_grad.py * Update test_all_gather_grad.py * Update test_results.py * Revert "Update test_results.py" This reverts commit `9d4a2b891d`. * Revert "Merge pull request #1 from shuyingsunshine21/shuyingsunshine21-checkpoint_consolidate" This reverts commit `c5053da789`, reversing changes made to `0d23d75bc9`. * Revert "Update test_all_gather_grad.py" This reverts commit `0d23d75bc9`. * Revert "Update utils.py" This reverts commit `70fe5da9c6`. * Revert "Update utils.py" This reverts commit `a9aae99f6e`. * Revert "Update test_results.py" This reverts commit `ea74906878`. * Revert "Update test_metric_result_integration.py" This reverts commit `bf70e431b3`. * Revert "Update ddp_spawn.py" This reverts commit `f17210183b`. * Revert "checkpoint consolidation" This reverts commit `536c1323b0`. * Revert "Revert "checkpoint consolidation"" This reverts commit `3a9fde915a`. * Revert "Revert "Revert "checkpoint consolidation""" This reverts commit `7a369f47e1`. * Revert "Revert "Update ddp_spawn.py"" This reverts commit `8222dc98ea`. * Revert "Revert "Update test_metric_result_integration.py"" This reverts commit `6c095b2370`. * Revert "Revert "Update test_results.py"" This reverts commit `250d0aaaa2`. * Revert "Revert "Update utils.py"" This reverts commit `8651d54d79`. * Revert "Revert "Update test_all_gather_grad.py"" This reverts commit `dcdcd29731`. * modify distributed environment to make test pass * add DDP communication hook * remove test related setting * remove more test related setting * fix ddp comm hook util import issue * comments * one more fix for test_custom_plugin * fix ddp spwan * fix sgd * address comments and add tests * 1. add is gpu checking 2. modify test a bit 3. formatting * formatting nit * fix conda 3.7 1.7 issue for no torch.distributed.algorithms module * need at least 1.8.0 * minor fix * modify changelog * changelog should link to PR number instead of issue number * refine a bit on doc for register_ddp_comm_hook function, like ddp_comm_wrapper explanation and add hyperparameter for power sgd states in example usge * move single device checking before call register_ddp_comm_hook * formatting * comments * typo * pre-commit formatting	2021-04-07 12:35:57 +01:00
ananthsub	86e1d9f759	[fix] Better support for rank_zero_only setting for SLURM and torchelastic (#6802 ) Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2021-04-07 12:25:13 +01:00
Adrian Wälchli	b7a22ba046	CI: fixture for global rank variable reset (#6839 )	2021-04-06 09:37:17 -07:00
Anthony Kim	7f6154fcad	Add `Trainer(gradient_clip_algorithm='value'|'norm')` (#6123 ) * add changelog * add clip by value * fix bug in training tricks.rst * fix bug in trainer.rst * Update trainer.rst * Update trainer.rst * Update CHANGELOG.md Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/plugins/precision/deepspeed_precision.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * Update pytorch_lightning/utilities/enums.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * yapf formatting * update training tricks * update based on comment * update based on comment * Update pytorch_lightning/trainer/trainer.py Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> * update based on comment * pep8 * mypy * mypy * Update docs/source/advanced/training_tricks.rst Co-authored-by: thomas chaton <thomas@grid.ai> * Update sharded_native_amp.py * Update test_sharded_parity.py * update test codes * Update test_tpu.py * Update pytorch_lightning/trainer/connectors/training_trick_connector.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Update test_trainer.py * Update enums.py * Update enums.py * add super-class initialization to precision plugins. * add clip_grad horovod cpu test * add clip_grad horovod cpu test * use subprocess check_call * change order of horovod tests * set max_epochs 2 in horovod test * remove clip_grad_val test from horovod-cpu * remove "type: ignore" * divide clip grad val test in horovod * update based on comments * add super-class initialization to precision plugins. * bugfix * bugfix * revert some changes * revert some changes * Update tests/models/test_horovod.py * merge master * Delete signature test No point in testing a signature Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: thomas chaton <thomas@grid.ai> Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-04-06 08:27:37 -05:00
Mauricio Villegas	b7f3a3c421	Simple reproducibility with minimum boilerplate CLI training with `LightningCLI` (#4492 ) Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-04-06 14:19:11 +01:00
Adrian Wälchli	127c52af74	Fix EarlyStopping logic when min_epochs not met (#6705 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-04-06 12:41:07 +01:00
Ethan Harris	89b5326ca5	Fix support for symlink save_dir in TensorBoardLogger (#6730 ) * Add test for symlink support and initial fix * Respond to comment and add docstring * Update CHANGELOG.md * Simplify * Update pytorch_lightning/utilities/cloud_io.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Make `LightningLocalFileSystem` protected Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-04-06 11:36:25 +02:00
Kaushik B	cf8e828559	[Fix] TPU Training Type Plugin (#6816 )	2021-04-06 15:02:44 +05:30
Michael Baumgartner	6dc1078822	Enforce an epoch scheduler interval when using SWA (#6588 ) Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-04-06 02:57:33 +00:00
Karthik Prasad	c3da7f50bb	Sanitize `None` params during pruning (#6836 ) * sanitize none params during pruning * amend	2021-04-06 01:47:59 +02:00
Adrian Wälchli	264aa689de	fix boolean check on iterable dataset when len not defined (#6828 ) * fix iterable dataset len check * update predict and validate * add validate to test * add changelog * add predict	2021-04-05 17:47:21 +01:00
Yuan-Hang Zhang	1bd5f36a5b	Fix validation progress counter with check_val_every_n_epoch > 1 (#5952 ) Co-authored-by: rohitgr7 <rohitgr1998@gmail.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-04-02 17:40:41 +09:00
Kaushik B	a72a7992a2	Update clip gradients signature for precision plugins (#6764 )	2021-03-31 17:06:48 +05:30
Carlos Mocholí	495c385a54	Add 1.2.6 section to CHANGELOG (#6732 ) * Add 1.2.6 sections to CHANGELOG * Update CHANGELOG.md * legacy Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-03-30 18:25:22 -07:00
Carlos Mocholí	0dd2deebea	Remove legacy support for the magic `log`/`progress_bar` keys in dict returns (#6734 )	2021-03-31 00:28:04 +02:00
thomas chaton	1302766f83	DeepSpeed ZeRO Update (#6546 ) * Add context to call hook to handle all modules defined within the hook * Expose some additional parameters * Added docs, exposed parameters * Make sure we only configure if necessary * Setup activation checkpointing regardless, saves the user having to do it manually * Add some tests that fail currently * update * update * update * add tests * change docstring * resolve accumulate_grad_batches * resolve flake8 * Update DeepSpeed to use latest version, add some comments * add metrics * update * Small formatting fixes, clean up some code * Few cleanups * No need for default state * Fix tests, add some boilerplate that should move eventually * Add hook removal * Add a context manager to handle hook * Small naming cleanup * wip * move save_checkpoint responsability to accelerator * resolve flake8 * add BC * Change recommended scale to 16 * resolve flake8 * update test * update install * update * update test * update * update * update test * resolve flake8 * update * update * update on comments * Push * pull * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * Update pytorch_lightning/plugins/training_type/deepspeed.py Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * update * Apply suggestions from code review * Swap to using world size defined by plugin * update * update todo * Remove deepspeed from extra, keep it in the base cuda docker install * Push * pull * update * update * update * update * Minor changes * duplicate * format * format2 Co-authored-by: SeanNaren <sean@grid.ai> Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>	2021-03-30 13:39:02 -04:00
Jirka Borovec	583fcf281c	update chlog v1.2.5 (#6742 ) * update chlog v1.2.5 * legacy	2021-03-30 12:45:07 +02:00
Carlos Mocholí	90444706b2	Remove logger_connector legacy code (#6733 )	2021-03-30 12:33:33 +02:00
Kaushik B	f79a13e495	[Model Parallel] Add configure sharded model hook (#6679 ) * Add base hook for model parallel * fix callback signature * Simplify hook * Add hook logic * add tests * add property setter * add logic for being called once * Update changelog * Fix * fix return type * fix lambda callback test * Fix tests * Apply code suggestions * add logic for setup_optimizers_predispatch * add common dummy model * Swap call order * Remove test that isn't needed anymore * Update tests * Add a bit more doc * Few code review fixes * Update pytorch_lightning/accelerators/accelerator.py Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Change hook name * Fix test * Test setup hook, refactor names * Swap call order of callbacks and model initialization * Change name of context manager Co-authored-by: SeanNaren <sean@grid.ai> Co-authored-by: Sean Naren <sean.narenthiran@gmail.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-03-29 14:50:51 -06:00
thomas chaton	3a4c4246ee	[TPU] update is_tpu_exists utils internal logic to rely on xmp.spawn (#6719 ) * update_logic * update * Update tests/utilities/test_xla_device_utils.py * Update pytorch_lightning/utilities/xla_device.py Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> * Update pytorch_lightning/utilities/xla_device.py Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> * update test * Update tests/utilities/test_xla_device_utils.py * update * Apply fix * Docstring * flake8 * update Co-authored-by: Your Name <you@example.com> Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-03-29 18:59:20 +01:00
Jirka Borovec	5b5a5cc80b	support python 3.9 (#4944 ) * support python 3.9 * update CI * onnxruntime * . * . * onnxruntime * t 55 * t 75 * add script * use * onnx * onnx * onnx * whl * np * find * 21 * Apply suggestions from code review * Apply suggestions from code review * onnx * CI * req * ~ dockers * min * . * drop horovod * drop horovod * drop horovod * fix * fix * .	2021-03-29 12:20:13 -04:00
Łukasz Zalewski	cca0eca5f3	More explicit exception message when testing with fast_dev_run=True (#6667 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>	2021-03-29 13:29:54 +00:00
Carlos Mocholí	f0c5479de9	Remove legacy `Result` parameters (#6016 )	2021-03-28 11:55:08 +02:00
thomas chaton	0e45220263	[warning] Add warning when values are not being reduced (#6417 ) * add warning non reduced * add test * update test * update changelog * Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> * update Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com> Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>	2021-03-26 18:33:11 +00:00
Carlos Mocholí	21fc5eb21e	Automatically find and run special tests (#6669 )	2021-03-26 17:04:59 +00:00
Carlos Mocholí	bc613611e2	Do not add return dict items to callback_metrics (#6682 )	2021-03-26 14:05:20 +01:00
Ethan Harris	6b990f3fa5	Add artifcact_location arg to MLFlow logger (#6677 ) * Add artifcact_location arg to MLFlow logger * Add CHANGELOG URL * Update test	2021-03-26 00:12:03 +01:00
Jirka Borovec	217c12a4e7	Simplify deprecations (#6620 ) * use external deprecate * simplify * simplify * simplify * flake8 * . * others * .	2021-03-25 15:26:38 +01:00
Rohit Gupta	9be092dbdb	Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test (#6498 ) * update docs * add hook and update docs * update tests * chlog * Update CHANGELOG.md Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> * chlog Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>	2021-03-25 14:20:49 +01:00
ananthsub	40976e4eba	Support teardown hook on DataModule (#4673 ) Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: chaton <thomas@grid.ai>	2021-03-25 07:51:55 -05:00
Kaushik B	2cbdc01256	Fix checkpoint callback & Trainer.test(_) issue for TPUs (#6654 ) * Fix checkpoint callback issue for TPUs * update changelog * add barrier * apply code suggestions * update trainer test * remove spaces * fix tpu tests * Apply suggestions from code review * add comment Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-03-25 10:37:37 +00:00
Shengyao Zhuang	b8ef52baa1	Match the number of outputs of backward with forward for AllGatherGrad (#6625 )	2021-03-25 15:07:58 +05:30
Carlos Mocholí	2dd6f9e09d	`MetricsHolder` clean-up + typing (#6645 ) * Metrics holder cleanup and better error message * Update pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py * _VALUE -> _METRIC_TYPE	2021-03-24 20:34:46 +01:00
Ethan Harris	d02fe342c1	Feature/double precision (#6595 ) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com> Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>	2021-03-24 15:47:58 +05:30
Jirka Borovec	70beddfc13	Prune metrics: others 11/DoNe (#6659 ) * classif * grad_img * nlp * ssl * format	2021-03-24 09:16:28 +01:00
Ethan Harris	741c452551	Fix disabled grads after call to predict (#6657 )	2021-03-23 23:07:48 +01:00
Jirka Borovec	64d0fa4472	update coverage config (#6524 ) * update coverage config * parallel * parallel * Apply suggestions from code review * Apply suggestions from code review * paralel * paralel * paralel * combine * combine * . * .. * .. * .. * rev * cb * cb * drop * drop * . * .. * ... * ... * ... * .	2021-03-23 23:05:04 +01:00
thomas chaton	fd5cb7fcc3	Add PyTorch 1.8 Profiler 5/5 (#6618 ) * Refactor profilers * Update PassThrough * WIP - This is broken and will change * Update pytorch_lightning/profiler/pytorch.py Co-authored-by: thomas chaton <thomas@grid.ai> * resolve tests * resolve tests * find output * try something * update * add support for test and predict * update * update * use getattr * test * test * update * tests * update * update * update * update * update * remove file * update * update * update * update * update * test * update# * update * update tests * update * add suport for 1.8 * rename records * add support for 1.8 * update * resolve flake8 * resolve test * Refactor basic profilers * Fixes * Unused import * Introduce setup * Profile on all ranks. Print to stdout on 0 * Introduce dirpath + filename * CHANGELOG * Add tests. Address comments * add `on_run_stage_setup` * add on_run_stage_setup function * update * add test for RegisterRecordFunction * update lightnng flow direction * move variable to private * remove trace * Undo code that should be in 3/4 * Multi-stage multi-rank * 2/5 changes * Pass stage in __del__ * Remove TODOs * Describe on_evaluation_end. Add tests * Typo * Address comments * deepcopy tests * Advanced teardown * Fix teardown test * Fix tests * Minor change * Update CHANGELOG.md * Fix test * Quick fixes * Fix 6522 * resolve ddp tests * resolve tests * resolve some tests * update tests * resolve tests * update * resolve tests * resolve some tests * Missed fixes from 3/5 * Fixes * resolve some tests * resolve test for 1.7.1 * Broken refactor * Missed stage * Minor changes * resolve tests * Update CHANGELOG * resolve bug * remove print * Typo * Cleanup * resolve ddp test * remove barrier * update profiler * update * Smaller model * update * resolve tests * update * Minor changes. CHANGELOG * Minimize diff * update to 1.8.1 * RunIf. Extra code. Check segfault * resolve tests * Typo. Bad merge * Fixing a bad merge * replace for kineto * Update pytorch_lightning/profiler/pytorch.py Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> * Update pytorch_lightning/profiler/pytorch.py Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> * Minor changes * Bad merge * Use lists for flexibility * Use sets * predict_step * Ananth's suggestion * update * Docs * Update pl_examples/basic_examples/profiler_example.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> * update example * update example Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com> Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-03-23 20:43:21 +00:00
Carlos Mocholí	51b10f78f4	Refactor PyTorch profiler 4/5 (#6349 ) Co-authored-by: thomas chaton <thomas@grid.ai>	2021-03-23 18:13:29 +01:00
thomas chaton	0995d30fab	Flash predict step (#6577 ) * add predict_step * Update predict_loop.py * Update trainer.py * Update trainer.py * resolve bugs * update * update * update * resolve bug * resolve some failing tests * udpate tests * update * resolve tests * add a test * remove typo * add a test for attachement * update * changed to on_train_dataloader * remove __flash_special_attr__ * resolve tests * update * update * update * update on comments * Update pytorch_lightning/trainer/data_loading.py Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com> Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com> Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>	2021-03-23 11:13:13 -04:00
Jirka Borovec	a74909affa	prune metrics: info retrieval (#6649 )	2021-03-23 15:05:32 +00:00
Carlos Mocholí	36d180e532	Refactor base profilers 3/5 (#6621 ) Co-authored-by: tchaton <thomas@grid.ai>	2021-03-23 10:07:35 +00:00
Jirka Borovec	f93414d085	Prune metyrics: regression 9/n (#6637 ) * psnr * r2score * ssim * chlog	2021-03-23 10:01:25 +00:00
Jirka Borovec	efce2b7777	Prune metrics: regression 8/n (#6636 ) * explained_variance * tests * mean_absolute_error * mean_squared_error * mean_relative_error * mean_squared_log_error * chlog	2021-03-23 09:35:51 +01:00
thomas chaton	2064ece582	[refactor] Add setup to profilers + _run_stage_setup to trainer 2/5 (#6633 ) * add setup * update * updates on comment * Minor changes * Extra import * Docs Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>	2021-03-22 14:32:31 -04:00

1394 Commits