Commit Graph

77 Commits

Author SHA1 Message Date
thomas chaton 71f44775c9
Resolve boto3 being unable to locate credentials (#19472) 2024-02-14 10:54:01 +00:00
thomas chaton b097a4df3f
Improve data processing to enable downloading LAION 400M (#19452) 2024-02-13 13:23:39 +00:00
Xinyu Yang 47c8f4cba0
bugfix: skip writing index.json if no data is written. (#19439) 2024-02-09 17:08:28 +00:00
Xinyu Yang 7b867c7d91
bugfix: correct node rank (#19437) 2024-02-09 15:21:28 +00:00
thomas chaton 4c2fc3b0cb
Add DNS optimize support (#19429)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2024-02-08 11:14:57 +00:00
thomas chaton ac9d63f4eb
Lightning Data: Refactor files (#19424) 2024-02-08 08:02:08 +00:00
thomas chaton 28a80238a4
Add support for tif (#19421) 2024-02-06 15:23:40 +00:00
thomas chaton 7dfc279b3f
Add support for parallelizing processing parquet files across workers and nodes. (#19400) 2024-02-05 23:21:25 +00:00
thomas chaton af7e79a84a
Data Processing: Tiny optimization (#19389) 2024-02-01 18:21:54 +00:00
thomas chaton 8280519642
Data Processor: Add is_last argument to know when the last item for the current worker is being processed (#19383) 2024-02-01 12:09:06 +00:00
thomas chaton 5a0d2eff8c
map operator: Add support for non-absolute input_dir and output_dir (#19378) 2024-02-01 08:25:47 +00:00
thomas chaton 28b380610f
StreamingDataloader: Resolve typo (#19370) 2024-01-30 16:52:47 +00:00
thomas chaton 322f474978
JPEGSerializer: Fix serializer io.bytes image (#19369) 2024-01-30 16:52:25 +00:00
thomas chaton 10c3a71dbd
Bump Lightning Cloud 0.5.64 (#19372) 2024-01-30 14:57:11 +00:00
thomas chaton b0e1ee2469
map operator: Add support for nested folders (#19366) 2024-01-29 19:17:28 +00:00
thomas chaton 37a521cad2
map operator: Add weights to evenly distribute work among workers (#19365) 2024-01-29 18:27:37 +00:00
thomas chaton c10fd22c74
BC: Switch map operator arguments order (#19345)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2024-01-25 09:37:28 +00:00
thomas chaton 012f68dcfd
StreamingDataloader: Add profiling support (#19338) 2024-01-24 20:30:55 +00:00
thomas chaton 0a75d3b7e6
tiny improvement (#19341) 2024-01-24 17:58:30 +00:00
Andy McSherry 577bd85654
Allow any AWS authentication method in studios (#19336) 2024-01-24 16:20:53 +00:00
thomas chaton ed367ca675
StreamingDataLoader: Resolve fault tolerance with the CombinedStreamingDataset and multiple workers (#19326) 2024-01-23 17:54:10 +00:00
thomas chaton d08e6cd916
Add walk operator (#19333) 2024-01-23 14:21:08 +00:00
thomas chaton 75510dd9f8
StreamingDataset: Add intra node shuffling to accelerate second epoch (#19296) 2024-01-19 17:08:32 +00:00
thomas chaton 97d71aba0b
Data Processor: Resolve several bugs found while publishing a Studio (#19309) 2024-01-18 20:46:06 +00:00
thomas chaton 19d9eabbc5
Enable map over inputs without files input (#19285) 2024-01-16 12:19:01 +00:00
thomas chaton 564be3b521
Streaming Dataset: Resolve chunks eviction (#19214)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2024-01-01 18:58:58 -05:00
thomas chaton 91ef1902ec
StreamingDataset: Fault Tolerance v2 2/n (#19201)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-12-24 09:43:58 +09:00
thomas chaton c989a97aa1
feat(fr) StreamingDataset: Fault Tolerance v2 1/n (#19196)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-12-21 17:01:23 +00:00
thomas chaton 12847132b1
lightning.data: Remove torch distributed for the Dataset Optimizer (#19182)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-12-20 13:57:07 +00:00
thomas chaton 0a5cca6711
StreamingDataset: Cleanup chunks right away if the dataset doesn't fit within the cache (#19168)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-12-18 23:01:55 +00:00
thomas chaton 7bd75778a6
Fix: Resolve checkpointing for the Streaming Dataset (#19123)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-12-08 11:10:07 +00:00
thomas chaton e6b79d984d
StreamingDataset improve deletion strategy (#19118)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-12-07 08:48:08 -05:00
thomas chaton 4d15468555
Improve StreamingDataset Speed (#19114)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-12-05 19:50:27 +00:00
thomas chaton 08c9e51335
Resolve path for StreamingDataset (#19094)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-11-30 16:13:06 +00:00
thomas chaton a6da1e3351
Add fault tolerance Streaming Dataset 2/n (#19052)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-11-23 17:40:04 +00:00
thomas chaton 7eca9c1642
Add numpy support for the StreamingDataset 1/2 (#19050)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-11-22 18:00:15 +00:00
thomas chaton 1073276a58
Add fault tolerance for the StreamingDataset 1/n (#19049)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-11-22 17:22:00 +00:00
thomas chaton bc1658039f
Add direct s3 support to the streaming dataset (#19044)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-11-22 01:17:49 +00:00
thomas chaton d3df1273b6
Add disk usage check before downloading files (#19041)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-11-21 20:10:18 +00:00
thomas chaton 6e517bd55b
Resolve Item Loader bugs (#19017)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2023-11-16 18:06:58 -05:00
thomas chaton 792cb73fc6
Remove the LightningDataset relying on un-maintained torchdata (#19019)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-11-16 16:08:15 -05:00
thomas chaton 7288302186
Add multiple uploaders to the map, optimize (#18989)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-11-13 14:27:50 -05:00
thomas chaton 1c86011dab
Add Video/Audio support (#18977)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-11-09 18:37:37 +00:00
thomas chaton 1b3a3fbaad
Prevent downloading more chunks than needed (#18964)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-11-07 19:40:21 +00:00
thomas chaton 20f58f63ef
Bump Lightning Cloud to 0.5.51 (#18962)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-11-07 17:30:39 +00:00
Adrian Wälchli 8a5d3423a7
Cache directory per worker to avoid collisions (#18957) 2023-11-07 10:19:03 -05:00
thomas chaton 529f07f254
Add support for deleting chunks (#18959)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
2023-11-07 09:46:13 +00:00
Adrian Wälchli 62771f3932
Greedily select files for data processor workers based on size (#18907)
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
2023-11-06 19:33:50 -05:00
thomas chaton e79ac21415
Add the input_dir in the cache_dir to avoid overlapping downloads (#18960)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
2023-11-06 19:01:37 -05:00
Adrian Wälchli c4af18b2c5
Create cache dir if it doesn't exist (#18955) 2023-11-06 11:02:05 -05:00