Add support for custom cloud compute configurations for Flows (#14831)

* use more recent lightning cloud launcher

* allow LightningApp to use custom cloud compute for flows

* feedback from adrian

* adjust other cloud tests

* update

* update

* update comments

* Update src/lightning_app/core/app.py

Co-authored-by: Sherin Thomas <sherin@grid.ai>

* Close profiler when `StopIteration` is raised (#14945)

* Find last checkpoints on restart (#14907)


Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Remove unused gcsfs dependency (#14962)

* Update hpu mixed precision link (#14974)

Signed-off-by: Jerome <janand@habana.ai>

* Bump version of fsspec (#14975)

fsspec verbump

* Fix TPU test CI (#14926)

* Fix TPU test CI

* +x first

* Lite first to uncover errors faster

* Fixes

* One more

* Simplify XLALauncher wrapping to avoid pickle error

* debug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Debug commit successful. Trying local definitions

* Require tpu for mock test

* ValueError: The number of devices must be either 1 or 8, got 4 instead

* Fix mock test

* Simplify call, rely on defaults

* Skip OSError for now. Maybe upgrading will help

* Simplify launch tests, move some to lite

* Stricter typing

* RuntimeError: Accessing the XLA device before processes have spawned is not allowed.

* Revert "RuntimeError: Accessing the XLA device before processes have spawned is not allowed."

This reverts commit f65107ebf3.

* Alternative boring solution to the reverted commit

* Fix failing test on CUDA machine

* Workarounds

* Try latest mkl

* Revert "Try latest mkl"

This reverts commit d06813aa67.

* Wrong exception

* xfail

* Mypy

* Comment change

* Spawn launch refactor

* Accept that we cannot lazy init now

* Fix mypy and launch test failures

* The base dockerfile already includes mkl-2022.1.0 - what if we use it?

* try a different mkl version

* Revert mkl version changes

Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

* Trainer: fix support for non-distributed PyTorch (#14971)

* Trainer: fix non-distributed use
* Update CHANGELOG

* fixes typing errors in rich_progress.py (#14963)

* revert default cloud compute rename

* allow LightningApp to use custom cloud compute for flows

* feedback from adrian

* update

* resolve merge with master conflict

* remove preemptible

* update CHANGELOG

* add basic flow cloud compute documentation

* fix docs build

* add missing symlink

* try to fix sphinx

* another attempt for docs

* fix new test

Signed-off-by: Jerome <janand@habana.ai>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Sherin Thomas <sherin@grid.ai>
Co-authored-by: Ziyad Sheebaelhamd <47150407+ziyadsheeba@users.noreply.github.com>
Co-authored-by: otaj <6065855+otaj@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
Co-authored-by: DP <10988155+donlapark@users.noreply.github.com>
Raphael Randschau 2022-10-25 11:29:15 -07:00 committed by GitHub
parent 53d2c0684e
commit 13baad56e4
7 changed files with 113 additions and 2 deletions

View File

@ -0,0 +1,40 @@
:orphan:
***************************
Customize my Flow resources
***************************
In the cloud, you can configure which machine your flow runs on by passing
a :class:`~lightning_app.utilities.packaging.cloud_compute.CloudCompute` to the ``flow_cloud_compute`` argument of ``LightningApp``:
.. code-block:: python
import lightning as L
# Run on a small, shared CPU machine. This is the default for every LightningFlow.
app = L.LightningApp(L.Flow(), flow_cloud_compute=L.CloudCompute())
Here is the full list of supported machine names:
.. list-table:: Hardware by Accelerator Type
:widths: 25 25 25
:header-rows: 1
* - Name
- # of CPUs
- Memory
* - flow-lite
- 0.3
- 4 GB
The up-to-date prices for these instances can be found `here <https://lightning.ai/pages/pricing>`_.
----
************
CloudCompute
************
.. autoclass:: lightning_app.utilities.packaging.cloud_compute.CloudCompute
:noindex:

View File

@ -39,6 +39,14 @@ Peek under the hood
:height: 180
:tag: Intermediate
.. displayitem::
:header: Customize Flow compute resources
:description: Learn more about Flow customizations.
:col_css: col-md-4
:button_link: compute_content.html
:height: 180
:tag: Intermediate
.. displayitem::
:header: Dynamically create, execute and stop Work
:description: Learn more about components creation.

View File

@ -0,0 +1 @@
../../../source-app/core_api/lightning_app/compute_content.rst

View File

@ -13,10 +13,11 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added a `--secret` option to CLI to allow binding secrets to app environment variables when running in the cloud ([#14612](https://github.com/Lightning-AI/lightning/pull/14612))
- Added support for running the works without cloud compute in the default container ([#14819](https://github.com/Lightning-AI/lightning/pull/14819))
- Added an HTTPQueue as an optional replacement for the default redis queue ([#14978](https://github.com/Lightning-AI/lightning/pull/14978)
- Added authentication to HTTP queue ([#15202](https://github.com/Lightning-AI/lightning/pull/15202))
- Added support for configuring flow cloud compute ([#14831](https://github.com/Lightning-AI/lightning/pull/14831))
- Added support for adding descriptions to commands either through a docstring or the `DESCRIPTION` attribute ([#15193](https://github.com/Lightning-AI/lightning/pull/15193)
- Added a try / catch mechanism around request processing to avoid killing the flow ([#15187](https://github.com/Lightning-AI/lightning/pull/15187)
- Added a Database Component ([#14995](https://github.com/Lightning-AI/lightning/pull/14995)
- Added an Database Component ([#14995](https://github.com/Lightning-AI/lightning/pull/14995)
- Added authentication to HTTP queue ([#15202](https://github.com/Lightning-AI/lightning/pull/15202))
- Added support to pass a `LightningWork` to the `LightningApp` ([#15215](https://github.com/Lightning-AI/lightning/pull/15215)
- Added support getting CLI help for connected apps even if the app isn't running ([#15196](https://github.com/Lightning-AI/lightning/pull/15196)
- Added support for adding requirements to commands and installing them when missing when running an app command ([#15198](https://github.com/Lightning-AI/lightning/pull/15198)

View File

@ -11,6 +11,7 @@ from typing import Dict, List, Optional, Tuple, TYPE_CHECKING, Union
from deepdiff import DeepDiff, Delta
from lightning_utilities.core.apply_func import apply_to_collection
import lightning_app
from lightning_app import _console
from lightning_app.api.request_types import APIRequest, CommandRequest, DeltaRequest
from lightning_app.core.constants import (
@ -50,6 +51,7 @@ class LightningApp:
def __init__(
self,
root: Union["LightningFlow", "LightningWork"],
flow_cloud_compute: Optional["lightning_app.CloudCompute"] = None,
debug: bool = False,
info: frontend.AppInfo = None,
root_path: str = "",
@ -67,6 +69,7 @@ class LightningApp:
Arguments:
root: The root ``LightningFlow`` or ``LightningWork`` component, that defines all the app's nested
components, running infinitely. It must define a `run()` method that the app can call.
flow_cloud_compute: The default Cloud Compute used for the flow, the REST API and the frontends.
debug: Whether to activate the Lightning Logger debug mode.
This can be helpful when reporting bugs on Lightning repo.
info: Provide additional info about the app which will be used to update html title,
@ -100,6 +103,7 @@ class LightningApp:
_validate_root_flow(root)
self._root = root
self.flow_cloud_compute = flow_cloud_compute or lightning_app.CloudCompute()
# queues definition.
self.delta_queue: Optional[BaseQueue] = None
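Note on the hunk above: the new argument is optional, and the app falls back to a default `CloudCompute` when nothing is passed. A minimal sketch of that fallback, using hypothetical stand-in classes (`CloudCompute` and `App` below are simplified illustrations, not the real `lightning_app` classes, and the `"default"` name is an assumption):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloudCompute:
    """Hypothetical stand-in for lightning_app.CloudCompute (fields assumed)."""

    name: str = "default"  # assumed default machine name
    shm_size: int = 0


class App:
    """Hypothetical stand-in for LightningApp, reduced to the new argument."""

    def __init__(self, flow_cloud_compute: Optional[CloudCompute] = None) -> None:
        # Same pattern as in the hunk: use the caller's value if given,
        # otherwise construct a default CloudCompute.
        self.flow_cloud_compute = flow_cloud_compute or CloudCompute()


default_app = App()
custom_app = App(flow_cloud_compute=CloudCompute(name="t2.medium"))

print(default_app.flow_cloud_compute.name)
print(custom_app.flow_cloud_compute.name)
```

Because the attribute is always populated, downstream code such as the cloud runtime can read `app.flow_cloud_compute.name` without a `None` check.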

View File

@ -36,6 +36,7 @@ from lightning_cloud.openapi import (
V1QueueServerType,
V1SourceType,
V1UserRequestedComputeConfig,
V1UserRequestedFlowComputeConfig,
V1Work,
)
from lightning_cloud.openapi.rest import ApiException
@ -206,6 +207,11 @@ class CloudRuntime(Runtime):
flow_servers=frontend_specs,
desired_state=V1LightningappInstanceState.RUNNING,
env=v1_env_vars,
user_requested_flow_compute_config=V1UserRequestedFlowComputeConfig(
name=self.app.flow_cloud_compute.name,
shm_size=self.app.flow_cloud_compute.shm_size,
preemptible=False,
),
)
# if requirements file at the root of the repository is present,
@ -242,6 +248,7 @@ class CloudRuntime(Runtime):
works=[V1Work(name=work_req.name, spec=work_req.spec) for work_req in work_reqs],
local_source=True,
dependency_cache_key=app_spec.dependency_cache_key,
user_requested_flow_compute_config=app_spec.user_requested_flow_compute_config,
)
if ENABLE_MULTIPLE_WORKS_IN_DEFAULT_CONTAINER:
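Note on the two hunks above: the runtime copies `name` and `shm_size` from the app's flow compute into the request config and always sends `preemptible=False`. A rough sketch of that mapping, assuming simplified stand-ins (`FlowComputeConfig` and `to_flow_compute_config` are hypothetical names; the real model is `V1UserRequestedFlowComputeConfig` from `lightning_cloud.openapi`):

```python
from dataclasses import dataclass


@dataclass
class CloudCompute:
    """Hypothetical stand-in for lightning_app.CloudCompute (fields assumed)."""

    name: str = "default"  # assumed default machine name
    shm_size: int = 0


@dataclass
class FlowComputeConfig:
    """Hypothetical stand-in for V1UserRequestedFlowComputeConfig."""

    name: str
    shm_size: int
    preemptible: bool


def to_flow_compute_config(compute: CloudCompute) -> FlowComputeConfig:
    # Mirrors the hunk: name and shm_size come from the app's flow compute,
    # while preemptible is hard-coded to False.
    return FlowComputeConfig(
        name=compute.name,
        shm_size=compute.shm_size,
        preemptible=False,
    )


config = to_flow_compute_config(CloudCompute(name="t2.medium"))
print(config)
```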

View File

@ -29,6 +29,7 @@ from lightning_cloud.openapi import (
V1QueueServerType,
V1SourceType,
V1UserRequestedComputeConfig,
V1UserRequestedFlowComputeConfig,
V1Work,
)
@ -37,6 +38,7 @@ from lightning_app.runners import backends, cloud
from lightning_app.storage import Drive, Mount
from lightning_app.utilities.cloud import _get_project
from lightning_app.utilities.dependency_caching import get_hash
from lightning_app.utilities.packaging.cloud_compute import CloudCompute
class MyWork(LightningWork):
@ -66,6 +68,47 @@ class WorkWithTwoDrives(LightningWork):
class TestAppCreationClient:
"""Testing the calls made using GridRestClient to create the app."""
@mock.patch("lightning_app.runners.backends.cloud.LightningClient", mock.MagicMock())
def test_run_with_custom_flow_compute_config(self, monkeypatch):
mock_client = mock.MagicMock()
mock_client.projects_service_list_memberships.return_value = V1ListMembershipsResponse(
memberships=[V1Membership(name="test-project", project_id="test-project-id")]
)
mock_client.lightningapp_instance_service_list_lightningapp_instances.return_value = (
V1ListLightningappInstancesResponse(lightningapps=[])
)
cloud_backend = mock.MagicMock()
cloud_backend.client = mock_client
monkeypatch.setattr(backends, "CloudBackend", mock.MagicMock(return_value=cloud_backend))
monkeypatch.setattr(cloud, "LocalSourceCodeDir", mock.MagicMock())
app = mock.MagicMock()
app.flows = []
app.frontend = {}
app.flow_cloud_compute = CloudCompute(name="t2.medium")
cloud_runtime = cloud.CloudRuntime(app=app, entrypoint_file="entrypoint.py")
cloud_runtime._check_uploaded_folder = mock.MagicMock()
monkeypatch.setattr(Path, "is_file", lambda *args, **kwargs: False)
monkeypatch.setattr(cloud, "Path", Path)
cloud_runtime.dispatch()
body = Body8(
app_entrypoint_file=mock.ANY,
enable_app_server=True,
flow_servers=[],
image_spec=None,
works=[],
local_source=True,
dependency_cache_key=mock.ANY,
user_requested_flow_compute_config=V1UserRequestedFlowComputeConfig(
name="t2.medium",
preemptible=False,
shm_size=0,
),
)
cloud_runtime.backend.client.lightningapp_v2_service_create_lightningapp_release.assert_called_once_with(
project_id="test-project-id", app_id=mock.ANY, body=body
)
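Note on the new test above: it pins only the fields it cares about and uses `mock.ANY` for the rest. A self-contained sketch of that assertion style with the standard library's `unittest.mock` (`ReleaseBody` and `create_release` are hypothetical stand-ins for `Body8` and the generated client method):

```python
from dataclasses import dataclass
from typing import Any
from unittest import mock


@dataclass
class ReleaseBody:
    """Hypothetical stand-in for the generated Body8 request model."""

    app_entrypoint_file: Any
    flow_compute_name: Any


client = mock.MagicMock()

# The code under test would normally make this call; it is made directly here.
client.create_release(
    project_id="test-project-id",
    app_id="some-generated-id",
    body=ReleaseBody(app_entrypoint_file="entrypoint.py", flow_compute_name="t2.medium"),
)

# mock.ANY compares equal to any value, so only the pinned fields are checked.
expected = ReleaseBody(app_entrypoint_file=mock.ANY, flow_compute_name="t2.medium")
client.create_release.assert_called_once_with(
    project_id="test-project-id", app_id=mock.ANY, body=expected
)
```

This is also why the other tests in this file could be adjusted by adding `user_requested_flow_compute_config=mock.ANY`: the field must be present for equality to hold, but its value is not what those tests verify.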
@mock.patch("lightning_app.runners.backends.cloud.LightningClient", mock.MagicMock())
def test_run_on_byoc_cluster(self, monkeypatch):
mock_client = mock.MagicMock()
@ -100,6 +143,7 @@ class TestAppCreationClient:
works=[],
local_source=True,
dependency_cache_key=mock.ANY,
user_requested_flow_compute_config=mock.ANY,
)
cloud_runtime.backend.client.lightningapp_v2_service_create_lightningapp_release.assert_called_once_with(
project_id="default-project-id", app_id=mock.ANY, body=body
@ -142,6 +186,7 @@ class TestAppCreationClient:
works=[],
local_source=True,
dependency_cache_key=mock.ANY,
user_requested_flow_compute_config=mock.ANY,
)
cloud_runtime.backend.client.lightningapp_v2_service_create_lightningapp_release.assert_called_once_with(
project_id="test-project-id", app_id=mock.ANY, body=body
@ -264,6 +309,7 @@ class TestAppCreationClient:
enable_app_server=True,
flow_servers=[],
dependency_cache_key=get_hash(requirements_file),
user_requested_flow_compute_config=mock.ANY,
image_spec=Gridv1ImageSpec(
dependency_file_info=V1DependencyFileInfo(
package_manager=V1PackageManager.PIP, path="requirements.txt"
@ -431,6 +477,7 @@ class TestAppCreationClient:
enable_app_server=True,
flow_servers=[],
dependency_cache_key=get_hash(requirements_file),
user_requested_flow_compute_config=mock.ANY,
image_spec=Gridv1ImageSpec(
dependency_file_info=V1DependencyFileInfo(
package_manager=V1PackageManager.PIP, path="requirements.txt"
@ -590,6 +637,7 @@ class TestAppCreationClient:
enable_app_server=True,
flow_servers=[],
dependency_cache_key=get_hash(requirements_file),
user_requested_flow_compute_config=mock.ANY,
image_spec=Gridv1ImageSpec(
dependency_file_info=V1DependencyFileInfo(
package_manager=V1PackageManager.PIP, path="requirements.txt"
@ -623,6 +671,7 @@ class TestAppCreationClient:
enable_app_server=True,
flow_servers=[],
dependency_cache_key=get_hash(requirements_file),
user_requested_flow_compute_config=mock.ANY,
image_spec=Gridv1ImageSpec(
dependency_file_info=V1DependencyFileInfo(
package_manager=V1PackageManager.PIP, path="requirements.txt"
@ -756,6 +805,7 @@ class TestAppCreationClient:
package_manager=V1PackageManager.PIP, path="requirements.txt"
)
),
user_requested_flow_compute_config=mock.ANY,
works=[
V1Work(
name="test-work",