# Tensor Parallel and 2D Parallel

This example shows how to apply tensor parallelism to your model (here Llama 3 8B) with the `ModelParallelStrategy`, and how to combine it with FSDP for 2D parallelism. Running this example requires PyTorch 2.3+ and a machine with at least 4 GPUs, each with 24 GB of memory.
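For orientation, here is a minimal sketch of how such a strategy is passed to the Trainer. The parallel sizes below are illustrative assumptions for a 4-GPU machine, not necessarily the values train.py uses.

```python
# Sketch only: wiring ModelParallelStrategy into the Trainer for 2D parallelism.
import lightning as L
from lightning.pytorch.strategies import ModelParallelStrategy

strategy = ModelParallelStrategy(
    data_parallel_size=2,    # number of FSDP (data-parallel) groups (assumed)
    tensor_parallel_size=2,  # GPUs that jointly shard each layer's weights (assumed)
)

trainer = L.Trainer(
    accelerator="cuda",
    devices=4,
    strategy=strategy,
    max_epochs=1,
)
# trainer.fit(model, dataloader)  # the model applies its parallel plan in configure_model()
```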

pip install 'torch>=2.3'

Navigate to this example folder and run the training script:

cd examples/pytorch/tensor_parallel
python train.py
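Under the hood, the model applies its tensor-parallel plan inside the LightningModule's `configure_model` hook (see model.py and parallelism.py in this folder). The sketch below illustrates the idea under assumed names; the class, submodule paths, and mesh dimension names are placeholders, not the example's exact code.

```python
# Hedged sketch: shard each transformer block's feed-forward weights across the
# tensor-parallel mesh dimension provided by ModelParallelStrategy.
import lightning as L
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class ParallelLLM(L.LightningModule):  # hypothetical module name
    def configure_model(self):
        # Device mesh is set up by the strategy before this hook runs.
        tp_mesh = self.device_mesh["tensor_parallel"]
        for block in self.model.layers:  # assumed attribute layout
            plan = {
                "feed_forward.w1": ColwiseParallel(),  # shard columns across TP ranks
                "feed_forward.w2": RowwiseParallel(),  # shard rows, all-reduce the output
            }
            parallelize_module(block, tp_mesh, plan)
```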

You should see an output like this:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

Number of model parameters: 6.7 B
Starting training ...

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

Epoch 0: 100%|█████████████████████████████████████████████| 10/10 [01:49<00:00, 0.09it/s, v_num=2]
`Trainer.fit` stopped: `max_epochs=1` reached.                                      
Saving a (distributed) checkpoint ...
Training successfully completed!
Peak memory usage: 36.73 GB
> [!NOTE]
> The `ModelParallelStrategy` is experimental and subject to change. Report issues on GitHub.