## Ray Clusters

Geneva can run jobs on three kinds of Ray clusters:

- Local Ray: auto-created on your machine, for prototyping.
- KubeRay: create a cluster on demand in your Kubernetes cluster.
- Existing Ray cluster: execute jobs against a cluster you already run.

### Local Ray
To execute jobs without an external Ray cluster, you can simply trigger the `Table.backfill` method. This auto-creates a Ray cluster on your machine. Because it runs on your laptop or desktop, this is only suitable for prototyping on small datasets, but it is the easiest way to get started. Simply define the UDF, add a column, and trigger the job:
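A minimal sketch of that flow. The connection URI, table name, and column names are illustrative, and the `connect`/`open_table`/`add_columns` calls are assumptions about the surrounding API rather than a verified listing:

```python
import geneva
from geneva import udf

# A simple scalar UDF; the return annotation describes the output type.
@udf
def area(width: int, height: int) -> int:
    return width * height

db = geneva.connect("db://my-feature-store")  # illustrative URI
tbl = db.open_table("images")                 # illustrative table name

# Add a column backed by the UDF, then materialize it.
# With no cluster configured, backfill spins up a local Ray cluster.
tbl.add_columns({"area": area})
tbl.backfill("area")
```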
### KubeRay
If you have a Kubernetes cluster running the `kuberay-operator`, Geneva can automatically provision RayClusters for you. To do so, define a Geneva cluster describing the resource needs, Docker images, and other Ray configuration necessary to run your job. Make sure your Kubernetes cluster has adequate compute resources to provision the RayCluster. Here is an example Geneva cluster definition:
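The sketch below shows the general shape of such a definition. The import path, class names, and field names are all assumptions for illustration, not the real Geneva API; consult the Geneva reference for the actual entry points:

```python
from geneva.cluster import GenevaCluster, WorkerGroup  # assumed import path

# Hypothetical cluster definition: resource needs, images, and Ray config
# that Geneva would use to provision a RayCluster via kuberay-operator.
cluster = GenevaCluster(
    name="my-kuberay-cluster",
    namespace="geneva",                     # Kubernetes namespace for the RayCluster
    head_image="rayproject/ray:2.9.0",      # illustrative images
    worker_image="rayproject/ray:2.9.0-gpu",
    worker_groups=[
        WorkerGroup(num_workers=4, num_cpus=8, num_gpus=1, memory="32Gi"),
    ],
)
```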
### External Ray cluster

If you already have a Ray cluster, Geneva can execute jobs against it too. You do so by defining a Geneva cluster that holds the address of the existing cluster. Here's an example:
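As above, this is an illustrative sketch: the class name, import path, and `ray_address` field are assumptions, with only the "cluster definition that holds the address" shape taken from the text:

```python
from geneva.cluster import GenevaCluster  # assumed import path

# Point Geneva at the head node of a cluster you already operate.
cluster = GenevaCluster(
    name="shared-ray",
    ray_address="ray://ray-head.my-cluster.internal:10001",  # illustrative address
)
```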
## Dependencies

Most UDFs require some dependencies: helper libraries like `pillow` for image processing, pre-trained models like `open-clip` to calculate embeddings, or even small config files. We have two ways to get them to workers:
- Use defaults
- Define a manifest
### Use Defaults
By default, LanceDB packages your local environment and sends it to the Ray workers. This includes your local Python `site-packages` (as reported by `site.getsitepackages()`) and either the current workspace root (if you're in a Python repo) or the current working directory (if you're not). If you don't explicitly define a manifest, this is what happens.
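To see which directories this default behavior would package, you can inspect the same values yourself with the standard library:

```python
import os
import site

# Directories whose contents would be uploaded as site-packages.
print(site.getsitepackages())

# The directory that would be uploaded if you are not inside a Python repo.
print(os.getcwd())
```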
### Define a Manifest
Sometimes you need more control over what the workers get access to. For example:

- you might need to include files from another directory, or another Python package
- you might not want to send all your local dependencies (if your repo has many dependencies but your UDF only needs a few)
- you might need packages built separately for the worker's architecture (for example, you can't build `pyarrow` on a Mac and run it on a Linux Ray worker)
- you might want to reuse dependencies between two backfill jobs, so you know they are running in the same environment
`define_manifest()` packages files from the local environment and stores the manifest metadata and files in persistent storage. The manifest can then be referenced by name, shared, and reused.

Manifests are defined with a builder API (`GenevaManifestBuilder`):
| Contents | How you can define it |
|---|---|
| Local Python packages | Uploaded automatically, unless you set `.skip_site_packages(True)`. |
| Local working directory (or workspace root, if in a Python repo) | Uploaded automatically. |
| Python packages to be installed with pip | Use `.pip(packages: list[str])` or `.add_pip(package: str)`. See Ray's RuntimeEnv docs for details. |
| Local Python packages outside of `site_packages` | Use `.py_modules(modules: list[str])` or `.add_py_module(module: str)`. See Ray's RuntimeEnv docs for details. |
| Container image for head node | Use `.head_image(head_image: str)`, or `default_head_image()` for the default. Note that if an image is also defined in the GenevaCluster, the image set here in the manifest takes priority. |
| Container image for worker nodes | Use `.worker_image(worker_image: str)`, or `default_worker_image()` for the default on the current platform. As with the head image, this takes priority over any image set in the cluster. |
To inspect exactly what would be uploaded, set `.delete_local_zips(False)` and `.local_zip_output_dir(path)`, then examine the zip files written to `path`.
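Putting the builder options above together, a manifest definition might look roughly like the following. The builder method names come from the table above, but the import path, the `.build()` finalizer, and the `define_manifest` signature are assumptions; check the Geneva API reference for the real entry points:

```python
from geneva.manifest import GenevaManifestBuilder, define_manifest  # assumed import path

# Build a manifest that pins the worker environment.
manifest = (
    GenevaManifestBuilder()
    .pip(["pillow", "open-clip-torch"])       # installed on workers via pip
    .add_py_module("/opt/shared/my_helpers")  # a package outside site-packages
    .skip_site_packages(True)                 # don't upload the whole local env
    .build()                                  # hypothetical finalizer
)

# Store it under a name so backfill jobs can share and reuse it.
define_manifest("clip-embed-env", manifest)   # hypothetical signature
```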
## Putting it all together: Execution Contexts
An execution context represents the concrete execution environment (cluster and manifest) used to execute a distributed job. Calling `context` enters a context manager that provisions an execution cluster and executes the job using the Cluster and Manifest definitions provided. Because you have already defined the cluster and manifest, you can simply reference them by name; providing a manifest is optional. Once the job completes, the context manager automatically de-provisions the cluster.
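A sketch of a backfill run inside an execution context, assuming cluster and manifest definitions were previously stored under the names shown. The `geneva.context` entry point, its parameter names, and the connection calls are illustrative assumptions, not a verified API:

```python
import geneva

db = geneva.connect("db://my-feature-store")  # illustrative URI
tbl = db.open_table("images")

# Provision the named cluster, push the named manifest to its workers,
# run the job, then tear the cluster down on exit.
with geneva.context(cluster="my-kuberay-cluster", manifest="clip-embed-env"):  # hypothetical entry point
    tbl.backfill("embedding")
```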