Ray on GKE using DraNet
To get started, follow the instructions to create a GKE cluster with DRA support and DraNet. It is important to follow those instructions closely, since there are multiple dependencies between the Kubernetes API version, the RDMA NCCL installer, and the DraNet component.
The worker nodes in this configuration are a4-highgpu-8g instances, each equipped with eight NVIDIA B200 GPUs and eight RDMA-capable RoCE NICs.
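Before deploying anything, it is worth confirming that the DRA API group is served and that DraNet is publishing the RDMA devices as ResourceSlices. A minimal sketch of such a check, assuming the standard node.kubernetes.io/instance-type label is populated with the machine type:
# Confirm the DRA resources are served by the API server
kubectl api-resources --api-group=resource.k8s.io
# DraNet publishes the node devices as ResourceSlices
kubectl get resourceslices
# List the a4-highgpu-8g worker nodes
kubectl get nodes -l node.kubernetes.io/instance-type=a4-highgpu-8g -o wide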
Deploy RayCluster
Install Ray CRDs and the KubeRay operator:
kubectl create -k "github.com/ray-project/kuberay/ray-operator/config/default?ref=v1.4.1"
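You can confirm the operator is up before moving on. The CRD name below is the one KubeRay registers; the grep is just a convenience to find the operator Deployment:
# The RayCluster CRD should be registered
kubectl get crd rayclusters.ray.io
# The KubeRay operator Deployment should be available
kubectl get deployments -A | grep -i kuberay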
We create one ResourceClaimTemplate for the RDMA devices on the node, along with a DeviceClass for the RDMA devices.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: dranet
spec:
  selectors:
  - cel:
      expression: device.driver == "dra.net"
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: all-nic
spec:
  spec:
    devices:
      requests:
      - name: nic
        deviceClassName: dranet
        count: 8
        selectors:
        - cel:
            expression: device.attributes["dra.net"].rdma == true
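Apply the manifest (the file name below is just an example) and confirm that both objects were created:
kubectl apply -f dranet-rdma.yaml
kubectl get deviceclass dranet
kubectl get resourceclaimtemplate all-nic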
Until the official Ray images support NVIDIA B200 GPUs (CUDA capability sm_100), you need to build a custom image:
FROM rayproject/ray:2.47.1-py39-cu128
USER root
RUN python -m pip install --upgrade pip
# Swap the prebuilt cupy-cuda12x wheel for the conda-forge cupy build
RUN pip uninstall cupy-cuda12x -y && conda install -y -c conda-forge cupy
RUN pip install --no-cache-dir --force-reinstall numpy==1.26.4
RUN pip install --no-cache-dir --force-reinstall scipy==1.11.4
# Nightly PyTorch built against CUDA 12.8 includes sm_100 (B200) kernels
RUN pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
# libnl is required by the RDMA user-space libraries
RUN apt-get update && apt-get -y install libnl-3-200 libnl-route-3-200
USER 1000
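Build and push the image to a registry your cluster can pull from, then reference it in the RayCluster manifest below. The Artifact Registry path and tag here are placeholders; substitute your own:
docker build -t us-docker.pkg.dev/PROJECT_ID/ray/ray:2.47.1-py39-cu128-b200 .
docker push us-docker.pkg.dev/PROJECT_ID/ray/ray:2.47.1-py39-cu128-b200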
Install a RayCluster that uses the RDMA NICs on the worker nodes. You need to specify some NCCL environment variables for optimal performance on the Google Cloud RDMA network:
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: a4-ray-cluster
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
        - name: ray-head
          image: aojea/ray:2.44.1-py39-cu128
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
  workerGroupSpecs:
  - replicas: 2
    minReplicas: 0
    maxReplicas: 4
    groupName: gpu-group
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: aojea/ray:2.44.1-py39-cu128
          resources:
            limits:
              cpu: "200"
              memory: "1600Gi"
              nvidia.com/gpu: "8"
            requests:
              cpu: "120"
              memory: "1600Gi"
              nvidia.com/gpu: "8"
          env:
          - name: LD_LIBRARY_PATH
            value: /usr/local/nvidia/lib64
          - name: TORCH_DISTRIBUTED_DEBUG
            value: "INFO"
          - name: NCCL_DEBUG
            value: INFO # Or "WARN", "DEBUG", "TRACE" for more verbosity
          - name: NCCL_DEBUG_SUBSYS
            value: INIT,NET,ENV,COLL,GRAPH
          - name: NCCL_NET
            value: gIB
          - name: NCCL_CROSS_NIC
            value: "0"
          - name: NCCL_NET_GDR_LEVEL
            value: "PIX"
          - name: NCCL_P2P_NET_CHUNKSIZE
            value: "131072"
          - name: NCCL_NVLS_CHUNKSIZE
            value: "524288"
          - name: NCCL_IB_ADAPTIVE_ROUTING
            value: "1"
          - name: NCCL_IB_QPS_PER_CONNECTION
            value: "4"
          - name: NCCL_IB_TC
            value: "52"
          - name: NCCL_IB_FIFO_TC
            value: "84"
          - name: NCCL_TUNER_CONFIG_PATH
            value: "/usr/local/gib/configs/tuner_config_a4.txtpb"
          volumeMounts:
          - name: library-dir-host
            mountPath: /usr/local/nvidia
          - name: gib
            mountPath: /usr/local/gib
          - name: shared-memory
            mountPath: /dev/shm
        resourceClaims:
        - name: nics
          resourceClaimTemplateName: all-nic
        tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
        volumes:
        - name: library-dir-host
          hostPath:
            path: /home/kubernetes/bin/nvidia
        - name: gib
          hostPath:
            path: /home/kubernetes/bin/gib
        - name: shared-memory
          emptyDir:
            medium: "Memory"
            sizeLimit: 250Gi
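Save the manifest (the file name here is an example) and apply it, then wait for the head and the two worker Pods to reach Running. KubeRay labels the Pods it creates with ray.io/cluster, which makes them easy to watch:
kubectl apply -f a4-ray-cluster.yaml
kubectl get raycluster a4-ray-cluster
kubectl get pods -l ray.io/cluster=a4-ray-cluster -w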
If in the future we want to create smaller workers that use only a subset of the GPUs on a node, we should also use the NVIDIA GPU DRA Driver to ensure that the GPUs and NICs allocated on the node are aligned for optimal performance.
Validate that the deployment is working by checking the Pod status:
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
a4-ray-cluster-gpu-group-worker-gzzt6 1/1 Running 0 8m11s 10.48.4.6 gke-dranet-aojea-dranet-a4-54bd557d-1blr <none> <none>
a4-ray-cluster-gpu-group-worker-hnsvx 1/1 Running 0 8m11s 10.48.3.6 gke-dranet-aojea-dranet-a4-54bd557d-5w4l <none> <none>
a4-ray-cluster-head 1/1 Running 0 8m11s 10.48.2.6 gke-dranet-aojea-default-pool-7abaddc3-n287 <none> <none>
Check that the a4-ray-cluster-head-svc Service has been created successfully:
kubectl get services a4-ray-cluster-head-svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
a4-ray-cluster-head-svc ClusterIP None <none> 10001/TCP,8265/TCP,6379/TCP,8080/TCP 13m
Identify your RayCluster’s head pod:
$ export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
$ echo $HEAD_POD
a4-ray-cluster-head
Print the cluster resources:
$ kubectl exec -it $HEAD_POD -- python -c "import pprint; import ray; ray.init(); pprint.pprint(ray.cluster_resources(), sort_dicts=True)"
2025-07-14 10:44:41,326 INFO worker.py:1520 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2025-07-14 10:44:41,327 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.48.2.6:6379...
2025-07-14 10:44:41,343 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at 10.48.2.6:8265
{'CPU': 402.0,
'GPU': 16.0,
'accelerator_type:B200': 2.0,
'memory': 3438653071770.0,
'node:10.48.2.6': 1.0,
'node:10.48.3.6': 1.0,
'node:10.48.4.6': 1.0,
'node:__internal_head__': 1.0,
'object_store_memory': 401148243558.0}
Forward the port and check Ray dashboard:
kubectl port-forward svc/a4-ray-cluster-head-svc 8265:8265
Forwarding from 127.0.0.1:8265 -> 8265
Forwarding from [::1]:8265 -> 8265
Handling connection for 8265
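With the port-forward running, you can also do a quick sanity check of the dashboard API from the command line. This is a hedged example assuming the /api/version endpoint served by the Ray dashboard/job submission server:
curl http://localhost:8265/api/version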
GPU-to-GPU communication using the Ray Collective Communication Library
Create a Python file named nccl_allreduce_multigpu.py with the following code:
import ray
import torch
import os
import ray.util.collective as collective


@ray.remote(num_gpus=8)
class Worker:
    def __init__(self):
        # One tensor per local GPU; even GPUs hold ones, odd GPUs hold twos
        self.send_tensors = []
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:0'))
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:1') * 2)
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:2'))
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:3') * 2)
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:4'))
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:5') * 2)
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:6'))
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:7') * 2)
        self.recv = torch.zeros((4,), dtype=torch.float32, device='cuda:0')

    def setup(self, world_size, rank):
        # Join the NCCL collective group named "177"
        collective.init_collective_group(world_size, rank, "nccl", "177")
        return True

    def compute(self):
        # Allreduce in place across all 16 GPUs (2 workers x 8 GPUs)
        collective.allreduce_multigpu(self.send_tensors, "177")
        cpu_tensors = [t.cpu() for t in self.send_tensors]
        return (
            cpu_tensors,
            self.send_tensors[0].device,
            self.send_tensors[1].device,
            self.send_tensors[2].device,
            self.send_tensors[3].device,
            self.send_tensors[4].device,
            self.send_tensors[5].device,
            self.send_tensors[6].device,
            self.send_tensors[7].device,
        )

    def destroy(self):
        collective.destroy_collective_group("177")


if __name__ == "__main__":
    ray.init(address="auto")
    num_workers = 2
    workers = []
    init_rets = []
    for i in range(num_workers):
        w = Worker.remote()
        workers.append(w)
        init_rets.append(w.setup.remote(num_workers, i))
    ray.get(init_rets)
    print("Collective groups initialized.")

    results = ray.get([w.compute.remote() for w in workers])

    print("\n--- Allreduce Results ---")
    for i, (tensors_list, *devices) in enumerate(results):
        print(f"Worker {i} results:")
        for j, tensor in enumerate(tensors_list):
            print(f"  Tensor {j} (originally on {devices[j]}): {tensor}")

    ray.get([w.destroy.remote() for w in workers])
    print("\nCollective groups destroyed.")
    ray.shutdown()
Create the Ray job, submitting it through the previously port-forwarded address (in this case, http://localhost:8265):
$ ray job submit --address="http://localhost:8265" --runtime-env-json='{"working_dir": ".", "pip": ["torch"]}' -- python nccl_allreduce_multigpu.py
Job submission server address: http://localhost:8265
2025-07-14 17:32:08,731 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_ec361f13f7b82502.zip.
2025-07-14 17:32:08,733 INFO packaging.py:588 -- Creating a file package for local module '.'.
-------------------------------------------------------
Job 'raysubmit_QQTKZQDTDA3ifPMW' submitted successfully
-------------------------------------------------------
Next steps
Query the logs of the job:
ray job logs raysubmit_QQTKZQDTDA3ifPMW
Query the status of the job:
ray job status raysubmit_QQTKZQDTDA3ifPMW
Request the job to be stopped:
ray job stop raysubmit_QQTKZQDTDA3ifPMW
Tailing logs until the job exits (disable with --no-wait):
<snipped>
--- Allreduce Results ---
Worker 0 results:
(Worker pid=3590, ip=10.48.4.17) id=0x15b3, options=0x0, comp_mask=0x0}
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3778 [6] NCCL INFO NET/gIB: IbDev 6 Port 1 qpn 2440 se
Tensor 0 (originally on cuda:0): tensor([24., 24., 24., 24.])
Tensor 1 (originally on cuda:1): tensor([24., 24., 24., 24.])
Tensor 2 (originally on cuda:2): tensor([24., 24., 24., 24.])
Tensor 3 (originally on cuda:3): tensor([24., 24., 24., 24.])
Tensor 4 (originally on cuda:4): tensor([24., 24., 24., 24.])
Tensor 5 (originally on cuda:5): tensor([24., 24., 24., 24.])
Tensor 6 (originally on cuda:6): tensor([24., 24., 24., 24.])
Tensor 7 (originally on cuda:7): tensor([24., 24., 24., 24.])
Worker 1 results:
Tensor 0 (originally on cuda:0): tensor([24., 24., 24., 24.])
Tensor 1 (originally on cuda:1): tensor([24., 24., 24., 24.])
Tensor 2 (originally on cuda:2): tensor([24., 24., 24., 24.])
Tensor 3 (originally on cuda:3): tensor([24., 24., 24., 24.])
Tensor 4 (originally on cuda:4): tensor([24., 24., 24., 24.])
Tensor 5 (originally on cuda:5): tensor([24., 24., 24., 24.])
Tensor 6 (originally on cuda:6): tensor([24., 24., 24., 24.])
Tensor 7 (originally on cuda:7): tensor([24., 24., 24., 24.])
<snipped>
Since we are setting the informational NCCL environment variables NCCL_DEBUG and NCCL_DEBUG_SUBSYS, we can verify in the logs that GPUDirect RDMA is being used:
# [... snipped ...]
# The gIB (InfiniBand) plugin is initialized
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3753 [2] NCCL INFO NET/gIB : Initializing gIB v1.0.6
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3753 [2] NCCL INFO Initialized NET plugin gIB
# Environment variable for GPU Direct RDMA level is detected
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3754 [3] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to PIX
# NCCL confirms that GPU Direct RDMA is enabled for each HCA (NIC) and GPU pairing
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3758 [7] NCCL INFO NET/gIB : GPU Direct RDMA Enabled for HCA 0 'mlx5_0'
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3754 [3] NCCL INFO GPU Direct RDMA Enabled for GPU 7 / HCA 0 (distance 4 <= 4), read 0 mode Default
# Finally, communication channels are established using GDRDMA
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3799 [2] NCCL INFO Channel 02/0 : 10[2] -> 2[2] [receive] via NET/gIB/2/GDRDMA
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3799 [2] NCCL INFO Channel 02/0 : 2[2] -> 10[2] [send] via NET/gIB/2/GDRDMA
# [... snipped ...]
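To pull just these lines out of a long job log, you can filter the logs by job ID (using the ID returned by the submission above):
ray job logs --address http://localhost:8265 raysubmit_QQTKZQDTDA3ifPMW | grep -E "GPU Direct RDMA|GDRDMA"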