KEDA Autoscaling for an Ollama Application

This tutorial will walk you through the setup of KEDA-based autoscaling for an Ollama deployment using GPU usage as a metric. We'll cover the following:

  • Installing KEDA using Helm
  • Creating and Deploying the GPU Usage Metrics API
  • Creating a Service for the Metrics API
  • Configuring the KEDA ScaledObject for Autoscaling
  • Deploying Ollama with Autoscaling Enabled
  • Monitoring the Autoscaling Behavior

Step 1: Installing KEDA with Helm

First, install KEDA on your Kubernetes cluster using Helm:

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

Verify the installation:

kubectl get pods -n keda

You should see the KEDA operator, metrics API server, and admission webhook pods in a Running state.

Step 2: Creating and Deploying the GPU Usage Metrics API

First, create a namespace where you'll apply all the manifests:

kubectl create namespace gpu-usage-autoscale

We'll create a FastAPI application that exposes a simple endpoint returning GPU usage metrics, which KEDA will use for its scaling decisions. You can substitute any custom metrics application to trigger KEDA autoscaling.

app.py
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

class GPUUsageManager:
    def __init__(self):
        self.usage_by_model = {
            "ollama": self.get_gpu_usage()
        }

    def get_gpu_usage(self):
        return 80  # Simulating 80% GPU usage

gpu_usage_manager = GPUUsageManager()

@app.get("/metrics")
async def metrics():
    metrics = {
        "gpu_usage": {
            "ollama": gpu_usage_manager.usage_by_model["ollama"]
        }
    }
    return JSONResponse(status_code=200, content=metrics)
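
The get_gpu_usage method above returns a hard-coded value to keep the tutorial simple. If you want the API to report real utilisation, a minimal sketch of a drop-in replacement is shown below (add import subprocess at the top of app.py); it assumes nvidia-smi is available inside the container and that the pod can actually see a GPU, for example via the NVIDIA device plugin, and it falls back to the simulated value otherwise:

    def get_gpu_usage(self):
        # Query the driver for utilisation; assumes nvidia-smi is on PATH
        # inside the container and a GPU device is visible to the pod.
        try:
            output = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=utilization.gpu",
                 "--format=csv,noheader,nounits"],
                text=True,
            )
            # If several GPUs are visible, report the busiest one.
            return max(int(line) for line in output.strip().splitlines())
        except (OSError, subprocess.CalledProcessError, ValueError):
            return 80  # Fall back to the simulated value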

And here is the Dockerfile used to containerise the application:

# Use the official TensorFlow image with GPU support
FROM tensorflow/tensorflow:2.10.0-gpu

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install the required Python packages
RUN pip install --no-cache-dir fastapi uvicorn

# Expose port 8080 for the metrics endpoint
EXPOSE 8080

# Run the application
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

Next, build and push the Docker image:

docker build -t <your-docker-username>/ollama-metrics-api:latest .
docker push <your-docker-username>/ollama-metrics-api:latest

Step 3: Deploying the GPU Usage Metrics API in Kubernetes

Now create the Kubernetes Deployment for the Metrics API, gpu-usage-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-metrics-api-deployment
  namespace: gpu-usage-autoscale
  labels:
    app: ollama-metrics-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-metrics-api
  template:
    metadata:
      labels:
        app: ollama-metrics-api
    spec:
      containers:
        - name: ollama-metrics-api
          image: <your-docker-username>/ollama-metrics-api:latest
          ports:
            - containerPort: 8080
      nodeSelector:
        gpu.nvidia.com/class: L40S # Ensure this matches the GPU nodes in your cluster

Then expose it with a Service, gpu-usage-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: ollama-metrics-api-service
  namespace: gpu-usage-autoscale
  labels:
    app: ollama-metrics-api
spec:
  selector:
    app: ollama-metrics-api
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
  type: ClusterIP

Apply the deployment and service manifests:

kubectl apply -f gpu-usage-deployment.yaml -n gpu-usage-autoscale
kubectl apply -f gpu-usage-service.yaml -n gpu-usage-autoscale
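
Before wiring up KEDA, it's worth confirming that the metrics endpoint returns the expected JSON. One quick way (a sketch, assuming you run kubectl port-forward svc/ollama-metrics-api-service 8080:8080 -n gpu-usage-autoscale in a separate terminal) is a short Python check:

import json
import urllib.request

# The port-forward makes the in-cluster service reachable on localhost.
with urllib.request.urlopen("http://localhost:8080/metrics") as resp:
    print(json.load(resp))  # expected: {'gpu_usage': {'ollama': 80}}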

Step 4: Configuring the KEDA ScaledObject for Autoscaling

Once the Metrics API is deployed, you can configure KEDA to autoscale the Ollama deployment based on GPU usage. Create a ScaledObject manifest, gpu-usage-scaledobject.yaml:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama-scaledobj
  namespace: gpu-usage-autoscale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-deployment   # The name of your existing Ollama deployment
  pollingInterval: 5          # How often KEDA checks the metric endpoint (in seconds)
  cooldownPeriod: 120         # Wait time before scaling down (in seconds)
  minReplicaCount: 1          # Minimum of 1 replica
  maxReplicaCount: 10         # Maximum number of replicas
  fallback:
    failureThreshold: 3
    replicas: 1               # Fallback replica count if the metrics endpoint fails
  advanced:
    restoreToOriginalReplicaCount: true   # Restore the original replica count when the ScaledObject is removed
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 30
          policies:
            - type: Percent
              value: 100
              periodSeconds: 15
          selectPolicy: Max
        scaleUp:
          stabilizationWindowSeconds: 0
          policies:
            - type: Pods
              value: 4
              periodSeconds: 10
          selectPolicy: Max
  triggers:
    - type: metrics-api
      metadata:
        targetValue: "60"               # The GPU usage threshold against which KEDA scales
        format: "json"
        activationTargetValue: "50"     # The trigger activates above this value
        url: "http://ollama-metrics-api-service.gpu-usage-autoscale.svc.cluster.local:8080/metrics"
        valueLocation: "gpu_usage.ollama"
        method: GET
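
A note on the trigger: valueLocation is a dot-separated path into the JSON body returned by the metrics endpoint. For the payload our API produces, it resolves like this (an illustrative snippet, not something you deploy):

payload = {"gpu_usage": {"ollama": 80}}    # JSON body from /metrics

value = payload["gpu_usage"]["ollama"]     # valueLocation "gpu_usage.ollama" -> 80

# KEDA activates the trigger once this value exceeds activationTargetValue (50)
# and scales the Ollama deployment against targetValue (60).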

Now apply the ScaledObject manifest:

kubectl apply -f gpu-usage-scaledobject.yaml -n gpu-usage-autoscale

Step 5: Deploying Ollama

Your Ollama deployment may already be running; if not, here's an example manifest, ollama-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
  namespace: gpu-usage-autoscale
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434 # Ollama listens on 11434 by default
          resources:
            limits:
              nvidia.com/gpu: 1
      nodeSelector:
        gpu.nvidia.com/class: L40S # Ensure this matches the GPU nodes in your cluster

Now apply the Ollama deployment manifest:

kubectl apply -f ollama-deployment.yaml -n gpu-usage-autoscale

Step 6: Monitoring Autoscaling

You can monitor the autoscaling behavior of your deployment and check that KEDA is set up correctly:

kubectl get scaledobjects -n gpu-usage-autoscale

This command shows the status of the deployed KEDA ScaledObject; you should see output similar to the following:

NAME               SCALETARGETKIND      SCALETARGETNAME     MIN   MAX   TRIGGERS      AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
ollama-scaledobj   apps/v1.Deployment   ollama-deployment   1     10    metrics-api                    True    True     False      Unknown   2m35s
  • READY: True indicates that the scaled object is correctly configured and ready to make scaling decisions.
  • ACTIVE: True indicates that the trigger is currently active, i.e. the reported GPU usage is above the activation threshold, so KEDA will scale the workload.
  • FALLBACK: False means KEDA is not relying on the fallback configuration and is successfully fetching metrics.

If you see ACTIVE as False, the reported metric probably hasn't crossed the activation threshold yet; running kubectl describe scaledobject ollama-scaledobj -n gpu-usage-autoscale will show the scaler's status conditions and any errors.

Now check whether the pods are autoscaling. We started with a single replica, and because the simulated GPU usage (80) is above the target value (60), you should now see multiple pods running, up to the maximum of 10 defined in the ScaledObject. Check the Ollama deployment pods:

kubectl get pods -n gpu-usage-autoscale -l app=ollama -w

This command watches the Ollama pods as KEDA triggers scaling actions, so you can see the number of replicas change in response to the reported GPU usage:

NAME                                 READY   STATUS    RESTARTS   AGE
ollama-deployment-658cc67646-9xpxq   1/1     Running   0          3m
ollama-deployment-658cc67646-qwrfz   1/1     Running   0          3m
ollama-deployment-658cc67646-mzbcv   1/1     Running   0          2m
ollama-deployment-658cc67646-lkfqx   1/1     Running   0          2m
ollama-deployment-658cc67646-7kqxs   1/1     Running   0          1m

As scaling occurs, you will see new pods being created (going from Pending to Running) or removed (Terminating) in real time.
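
Because the sample metrics API always reports 80% (above the 60 target), the deployment will scale up and stay there. To watch scale-down as well, you can make the simulated value configurable and redeploy the metrics API with a lower value. A minimal sketch (the SIMULATED_GPU_USAGE environment variable is an assumption of this example, not part of the original app; add import os at the top of app.py):

    def get_gpu_usage(self):
        # Read the simulated utilisation from an environment variable so it can
        # be lowered (e.g. to 30, below activationTargetValue) to watch KEDA
        # scale the Ollama deployment back down to minReplicaCount.
        return int(os.getenv("SIMULATED_GPU_USAGE", "80"))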

In this tutorial, you've learned how to set up KEDA-based autoscaling for an Ollama deployment using GPU usage as the metric. We covered installing KEDA, deploying a metrics API, and creating a ScaledObject that scales Ollama based on GPU usage.

Feel free to adjust the threshold values and configurations to suit your requirements.