
Running the LLaMA 3.1 405B Model in Kubernetes using Ollama and OpenWebUI

This tutorial will guide you through deploying the LLaMA 3.1 405B model in a Kubernetes cluster using the Ollama service and OpenWebUI. LLaMA 3.1 405B is a large language model with 405 billion parameters; Ollama serves it with 4-bit quantisation to reduce its memory footprint. We will deploy Ollama and OpenWebUI in the llama405b namespace, leveraging H100SXM-80 GPUs to handle the heavy computation requirements of the model.

Prerequisites

  • A Kubernetes cluster with H100SXM-80 GPU nodes available.
  • kubectl configured to manage your Kubernetes cluster.
  • Permissions to create namespaces, deployments, and services in the cluster.
  • The Ollama Docker image: ollama/ollama:latest.
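
Before you begin, it can help to confirm that GPU nodes carrying the expected label are present. The gpu.nvidia.com/class label used below matches the nodeSelector in Step 3; adjust it if your cluster labels its GPU nodes differently.

# List nodes carrying the H100SXM-80 label used later in the nodeSelector.
kubectl get nodes -l gpu.nvidia.com/class=H100SXM-80

# Confirm the nodes advertise allocatable NVIDIA GPUs.
kubectl describe nodes -l gpu.nvidia.com/class=H100SXM-80 | grep -A 1 "nvidia.com/gpu"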

Step 1: Create the llama405b Namespace

Firstly, create the llama405b namespace in your Kubernetes cluster to isolate the resources for this model.

kubectl create namespace llama405b
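
You can verify the namespace exists before continuing:

kubectl get namespace llama405b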

Step 2: Create the Entrypoint Script ConfigMap

Create a ConfigMap that holds the entrypoint.sh script to control how Ollama pulls and serves the model.

# entrypoint-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-entrypoint
  namespace: llama405b # Specify the namespace
data:
  entrypoint.sh: |
    #!/bin/bash

    # Start Ollama in the background.
    /bin/ollama serve &

    # Record Process ID.
    pid=$!

    # Wait for Ollama to start.
    sleep 10

    # Pull the specific LLaMA 405B model
    ollama pull llama3.1:405b

    # Wait for the Ollama process to finish.
    wait $pid

Apply this ConfigMap to your cluster:

kubectl apply -f entrypoint-configmap.yaml
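
To confirm the script landed in the ConfigMap as expected, you can inspect its rendered contents:

# Show the stored entrypoint.sh script.
kubectl get configmap ollama-entrypoint -n llama405b -o yaml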

Step 3: Deploy Ollama Service

The following ollama-deployment.yaml file creates a deployment for the Ollama service, which pulls the LLaMA 3.1 405B model and serves it on Ollama's default port, 11434:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
  namespace: llama405b
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434 # Default port Ollama listens on
        volumeMounts:
        - name: entrypoint-script
          mountPath: /entrypoint.sh
          subPath: entrypoint.sh
        command: ["/usr/bin/bash", "/entrypoint.sh"]
        resources:
          limits:
            nvidia.com/gpu: 3 # Requesting 3 NVIDIA GPUs for the container
      volumes:
      - name: entrypoint-script
        configMap:
          name: ollama-entrypoint # Reference to the ConfigMap created earlier
          defaultMode: 0755
      nodeSelector:
        gpu.nvidia.com/class: H100SXM-80 # Use nodes with H100SXM-80 GPUs
      restartPolicy: Always

We must allocate at least 3 H100SXM-80 GPUs, since the 4-bit quantised 405B weights alone exceed the combined 160 GB of memory of two 80 GB GPUs. Now apply the deployment to the cluster:

kubectl apply -f ollama-deployment.yaml
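
The pod will only become ready once a node with enough free H100SXM-80 GPUs is available, so it is worth watching the rollout:

# Wait for the deployment to become available.
kubectl rollout status deployment/ollama-deployment -n llama405b

# Watch the pod status (Pending usually means no node satisfies the GPU request or nodeSelector).
kubectl get pods -n llama405b -w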

Step 4: Expose the Ollama Service

Create a LoadBalancer service to expose Ollama externally using the following manifest, ollama-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: llama405b
  labels:
    type: external
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 11434
    protocol: TCP
  selector:
    app: ollama

Apply the service manifest:

kubectl apply -f ollama-service.yaml

Note: Ensure the label type: external is added to the service to request an external IP address. The targetPort is set to 11434, which is the default port Ollama uses.
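
Once the load balancer has been provisioned, the assigned address appears in the EXTERNAL-IP column:

kubectl get service ollama-service -n llama405b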

Step 5: Deploy OpenWebUI Service

OpenWebUI provides a UI to interact with the deployed model. Use the following deployment manifest, openwebui-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: openwebui-deployment
  namespace: llama405b
  labels:
    app: openwebui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openwebui
  template:
    metadata:
      labels:
        app: openwebui
    spec:
      containers:
      - name: openwebui
        image: ghcr.io/open-webui/open-webui:main
        ports:
        - containerPort: 8080
        env:
        - name: OLLAMA_BASE_URL
          value: "http://ollama-service:80" # Point to the Ollama service within the cluster

Apply the deployment manifest:

kubectl apply -f openwebui-deployment.yaml
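
If OpenWebUI later shows no models, a quick way to check in-cluster connectivity is to hit the same URL configured in OLLAMA_BASE_URL from a temporary pod. This sketch assumes the public curlimages/curl image is pullable from your cluster; any image containing curl will do.

# One-off pod that queries Ollama's root endpoint; it should answer "Ollama is running".
kubectl run curl-test --rm -it --restart=Never \
  --image=curlimages/curl -n llama405b \
  --command -- curl -s http://ollama-service:80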

Step 6: Expose OpenWebUI Service

Create a LoadBalancer service to expose OpenWebUI externally by deploying the manifest, openwebui-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: openwebui-service
  namespace: llama405b
  labels:
    type: external
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
  selector:
    app: openwebui

Apply the service manifest:

kubectl apply -f openwebui-service.yaml
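
As with the Ollama service, retrieve the external IP once it has been assigned:

kubectl get service openwebui-service -n llama405b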

Step 7: Verifying the Deployment

  • Check the logs: Monitor the logs of the Ollama pod to verify that the model is being pulled successfully:
kubectl logs <ollama-pod-name> -n llama405b
  • Access the Ollama pod and list the downloaded models:
kubectl exec -it <ollama-pod-name> -n llama405b -- /bin/bash
ollama list

The ollama list command should show the llama3.1:405b model and its size, confirming that the model has been downloaded inside the pod.
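
You can also run the same check in one step, without opening an interactive shell in the pod:

# List the models directly inside the Ollama pod.
kubectl exec <ollama-pod-name> -n llama405b -- ollama list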

Step 8: Accessing the OpenWebUI Service

Once the external IP for the openwebui-service is assigned, navigate to the OpenWebUI interface by visiting the provided external IP address: http://<openwebui-external-ip>

You should see the OpenWebUI dashboard. Select the LLaMA 3.1 405B model from the model dropdown menu to start interacting with it.

Note: Because of the large size of the LLaMA 3.1 405B model, the service becomes available only after the model has been downloaded inside the pod. The initial load into GPU memory can take 6-7 minutes, so the first response in OpenWebUI is delayed. Once the model is fully loaded into GPU memory, responses are fast. However, after a period of inactivity Ollama offloads the model from GPU memory, so the next response is delayed again while the model is reloaded.
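
If the reload delay after idle periods is a problem, Ollama's OLLAMA_KEEP_ALIVE environment variable controls how long a model stays in memory after the last request. One way to set it, shown here as a sketch using kubectl set env rather than editing the manifest:

# Keep the model resident for 1 hour after the last request (use -1 to keep it loaded indefinitely).
kubectl set env deployment/ollama-deployment -n llama405b OLLAMA_KEEP_ALIVE=1h

Be aware that changing the environment restarts the pod, and since this deployment has no persistent volume for the model files, the 405B model will be downloaded again on restart.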

  • Check GPU Usage after sending a request via OpenWebUI:
kubectl exec -it <ollama-pod-name> -n llama405b -- nvidia-smi

The output will display the GPU usage details, confirming the model is loaded into GPU memory.

As an alternative to OpenWebUI, you can also use curl in your terminal to send requests to the model through Ollama's OpenAI-compatible endpoint, as follows:

curl -X POST http://<EXTERNAL-IP>:80/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:405b",
    "prompt": "Tell me about yourself in pirate speak."
  }'

Here, EXTERNAL-IP is the IP address of the Ollama service load balancer, which listens on port 80. Just as with OpenWebUI, there will be a cold start while the model is loaded into GPU memory, so please allow some time for the response.
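
Ollama's native API is available on the same address as well; for example, a non-streaming request to the /api/generate endpoint looks like this:

curl http://<EXTERNAL-IP>:80/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:405b",
    "prompt": "Tell me about yourself in pirate speak.",
    "stream": false
  }'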

By following this tutorial, you should be able to successfully deploy the LLaMA 3.1 405B model using Ollama and interact with it through the OpenWebUI interface or directly via curl.