Running the LLaMA 3.1 405B Model in Kubernetes using Ollama and OpenWebUI
This tutorial will guide you through deploying the LLaMA 3.1 405B model in a Kubernetes cluster using the Ollama service and OpenWebUI. LLaMA 3.1 405B is a large language model with 405 billion parameters, which we serve here in its 4-bit quantised form to reduce memory requirements. We will deploy Ollama and OpenWebUI in the llama405b namespace, leveraging H100SXM-80 GPUs to handle the heavy computation requirements of the model.
Prerequisites
- A Kubernetes cluster with H100SXM-80 GPU nodes available.
- A configured kubectl to manage your Kubernetes cluster.
- Permissions to create namespaces, deployments, and services in the cluster.
- The Ollama Docker image: ollama/ollama:latest.
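Before starting, you may want to confirm that suitable GPU nodes are actually available. Assuming your nodes carry the gpu.nvidia.com/class label used later in the node selector, a quick check is:
kubectl get nodes -l gpu.nvidia.com/class=H100SXM-80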
Step 1: Create the llama405b Namespace
First, create the llama405b namespace in your Kubernetes cluster to isolate the resources for this model.
kubectl create namespace llama405b
Step 2: Create the Entrypoint Script ConfigMap
Create a ConfigMap that holds the entrypoint.sh script to control how Ollama pulls and serves the model.
# entrypoint-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-entrypoint
  namespace: llama405b # Specify the namespace
data:
  entrypoint.sh: |
    #!/bin/bash
    # Start Ollama in the background.
    /bin/ollama serve &
    # Record the process ID.
    pid=$!
    # Wait for Ollama to start.
    sleep 10
    # Pull the specific LLaMA 3.1 405B model.
    ollama pull llama3.1:405b
    # Wait for the Ollama process to finish.
    wait $pid
Apply this ConfigMap to your cluster:
kubectl apply -f entrypoint-configmap.yaml
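To confirm the ConfigMap exists in the right namespace, you can run:
kubectl get configmap ollama-entrypoint -n llama405b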
Step 3: Deploy Ollama Service
The following ollama-deployment.yaml manifest creates a Deployment for the Ollama service, which pulls the LLaMA 3.1 405B model and serves it on Ollama's default port, 11434 (exposed externally on port 80 via the service in Step 4):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
  namespace: llama405b
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434 # Ollama listens on 11434 by default
          volumeMounts:
            - name: entrypoint-script
              mountPath: /entrypoint.sh
              subPath: entrypoint.sh
          command: ["/usr/bin/bash", "/entrypoint.sh"]
          resources:
            limits:
              nvidia.com/gpu: 3 # Requesting 3 NVIDIA GPUs for the container
      volumes:
        - name: entrypoint-script
          configMap:
            name: ollama-entrypoint # Reference to the ConfigMap created earlier
            defaultMode: 0755
      nodeSelector:
        gpu.nvidia.com/class: H100SXM-80 # Use nodes with H100SXM-80 GPUs
      restartPolicy: Always
We must allocate at least 3 H100SXM-80 GPUs to run this very large model: the 4-bit quantised weights alone exceed 200 GB, so the combined 240 GB of GPU memory across three cards is needed to hold the model. Now apply the deployment to the cluster:
kubectl apply -f ollama-deployment.yaml
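Scheduling onto a GPU node and pulling the Ollama image can take a few minutes. One way to follow the rollout is:
kubectl rollout status deployment/ollama-deployment -n llama405b
kubectl get pods -n llama405b -w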
Step 4: Expose the Ollama Service
Create a LoadBalancer service to expose Ollama externally using the following manifest, ollama-service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: llama405b
  labels:
    type: external
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 11434
      protocol: TCP
  selector:
    app: ollama
Apply the service manifest:
kubectl apply -f ollama-service.yaml
Note: Ensure the label type: external is added to the service to request an external IP address. The targetPort is set to 11434, which is the default port Ollama uses.
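Once the load balancer has been provisioned, the external IP assigned to the Ollama service can be read from:
kubectl get service ollama-service -n llama405b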
Step 5: Deploy OpenWebUI Service
OpenWebUI provides a UI to interact with the deployed model. Use the following deployment manifest, openwebui-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openwebui-deployment
  namespace: llama405b
  labels:
    app: openwebui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openwebui
  template:
    metadata:
      labels:
        app: openwebui
    spec:
      containers:
        - name: openwebui
          image: ghcr.io/open-webui/open-webui:main
          ports:
            - containerPort: 8080
          env:
            - name: OLLAMA_BASE_URL
              value: "http://ollama-service:80" # Point to the Ollama service within the cluster
Apply the deployment manifest:
kubectl apply -f openwebui-deployment.yaml
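As with the Ollama deployment, you can wait for the OpenWebUI pod to become ready:
kubectl rollout status deployment/openwebui-deployment -n llama405b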
Step 6: Expose OpenWebUI Service
Create a LoadBalancer service to expose OpenWebUI externally by deploying the manifest openwebui-service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: openwebui-service
  namespace: llama405b
  labels:
    type: external
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
  selector:
    app: openwebui
Apply the service manifest:
kubectl apply -f openwebui-service.yaml
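The external IP used in Step 8 below can then be retrieved with:
kubectl get service openwebui-service -n llama405b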
Step 7: Verifying the Deployment
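First, look up the name of the Ollama pod; the commands below refer to it as <ollama-pod-name>. Since the deployment labels its pods with app: ollama, you can list them with:
kubectl get pods -n llama405b -l app=ollama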
- Check the Logs: Monitor the logs of the ollama pod to verify that the model is being pulled successfully:
kubectl logs <ollama-pod-name> -n llama405b
- Access the Ollama pod and list the model:
kubectl exec -it <ollama-pod-name> -n llama405b -- /bin/bash
ollama list
This command should show the llama3.1:405b model and its size, confirming the model has been downloaded inside the pod.
Step 8: Accessing the OpenWebUI Service
Once the external IP for the openwebui-service is assigned, navigate to the OpenWebUI interface by visiting the provided external IP address: http://<openwebui-external-ip>
You should see the OpenWebUI dashboard. Select the LLaMA 3.1 405B model from the model dropdown to start interacting with it.
Note: Because of the large size of the LLaMA 405B model, the service will be available only after the model has been downloaded inside the pod. The initial loading into GPU memory can take up to 6-7 minutes, resulting in a delayed response in OpenWebUI. Once the model is fully loaded into GPU memory, the response time is minimal. However, if there's a period of inactivity, Ollama offloads the model from GPU memory. This means the response will be delayed again, as the model needs to be reloaded into memory.
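Note: If the unload-on-idle behaviour is undesirable, Ollama honours the OLLAMA_KEEP_ALIVE environment variable, which controls how long a model stays resident after the last request (five minutes by default). As a sketch, with 24h being an arbitrary example value, you could set it on the deployment:
kubectl set env deployment/ollama-deployment -n llama405b OLLAMA_KEEP_ALIVE=24h
Bear in mind this restarts the pod, and because the deployment has no persistent volume the model will be pulled again on startup, so it is best added to the env section of ollama-deployment.yaml before the first apply.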
- Check GPU Usage after sending a request via OpenWebUI:
kubectl exec -it <ollama-pod-name> -n llama405b -- nvidia-smi
The output will display the GPU usage details, confirming the model is loaded into GPU memory.
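You can also ask Ollama itself which models are currently loaded and whether they are resident on the GPU:
kubectl exec -it <ollama-pod-name> -n llama405b -- ollama ps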
As an alternative to OpenWebUI, you can also use curl in your terminal to send requests to the model, as follows:
curl -X POST http://<EXTERNAL-IP>:80/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:405b",
"prompt": "Tell me about yourself in pirate speak."
}'
Here EXTERNAL-IP is the IP address of the Ollama service load balancer, which listens on port 80. Just like with OpenWebUI, there will be a cold start while the model is loaded into GPU memory, so please allow some time for the response.
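If you prefer Ollama's native API over the OpenAI-compatible endpoint shown above, a similar request can be sent to /api/generate (here with "stream": false so a single JSON response is returned instead of a token stream):
curl -X POST http://<EXTERNAL-IP>:80/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:405b",
"prompt": "Tell me about yourself in pirate speak.",
"stream": false
}'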
By following this tutorial, you should be able to successfully deploy the LLaMA 3.1 405B model using Ollama and OpenWebUI and interact with it through the OpenWebUI interface.