KEDA Autoscaling for an Ollama Application

This tutorial will walk you through the setup of KEDA-based autoscaling for an Ollama deployment using GPU usage as a metric. We'll cover the following:

  • Installing KEDA using Helm
  • Creating and Deploying the GPU Usage Metrics API
  • Creating a Service for the Metrics API
  • Configuring the KEDA ScaledObject for Autoscaling
  • Deploying Ollama with Autoscaling Enabled
  • Monitoring the Autoscaling Behavior

Step 1: Installing KEDA with Helm

First, install KEDA on your Kubernetes cluster using Helm:

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

Verify the installation:

kubectl get pods -n keda

You should see the KEDA operator, metrics API server, and admission webhook pods in a Running state.

Step 2: Creating and Deploying the GPU Usage Metrics API

First, create a namespace where you'll apply all the manifests:

kubectl create namespace gpu-usage-autoscale

We'll create a FastAPI application that exposes a simple endpoint returning GPU usage metrics, which KEDA will use for its scaling decisions. You can substitute any custom metrics application to trigger KEDA autoscaling.

app.py
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

class GPUUsageManager:
    def __init__(self):
        self.usage_by_model = {
            "ollama": self.get_gpu_usage()
        }

    def get_gpu_usage(self):
        return 80  # Simulating 80% GPU usage

gpu_usage_manager = GPUUsageManager()

@app.get("/metrics")
async def metrics():
    metrics = {
        "gpu_usage": {
            "ollama": gpu_usage_manager.usage_by_model["ollama"]
        }
    }
    return JSONResponse(status_code=200, content=metrics)
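
The get_gpu_usage method above returns a hard-coded value to keep the tutorial simple. If you want the API to report real utilisation, a minimal sketch of a drop-in replacement is shown below (add import subprocess at the top of app.py); it assumes nvidia-smi is available inside the container and that the pod can actually see a GPU, for example via the NVIDIA device plugin, and it falls back to the simulated value otherwise:

    def get_gpu_usage(self):
        # Query the driver for utilisation; assumes nvidia-smi is on PATH
        # inside the container and a GPU device is visible to the pod.
        try:
            output = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=utilization.gpu",
                 "--format=csv,noheader,nounits"],
                text=True,
            )
            # If several GPUs are visible, report the busiest one.
            return max(int(line) for line in output.strip().splitlines())
        except (OSError, subprocess.CalledProcessError, ValueError):
            return 80  # Fall back to the simulated value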

And here is the Dockerfile used to containerise the application:

# Use the official TensorFlow image with GPU support
FROM tensorflow/tensorflow:2.10.0-gpu

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install the required Python packages
RUN pip install --no-cache-dir fastapi uvicorn

# Expose port 8080 for the metrics endpoint
EXPOSE 8080

# Run the application
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

Next, build and push the Docker image:

docker build -t <your-docker-username>/ollama-metrics-api:latest .
docker push <your-docker-username>/ollama-metrics-api:latest

Step 3: Deploying the GPU Usage Metrics API in Kubernetes

Now create the Kubernetes Deployment for the Metrics API, gpu-usage-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-metrics-api-deployment
  namespace: gpu-usage-autoscale
  labels:
    app: ollama-metrics-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-metrics-api
  template:
    metadata:
      labels:
        app: ollama-metrics-api
    spec:
      containers:
        - name: ollama-metrics-api
          image: <your-docker-username>/ollama-metrics-api:latest
          ports:
            - containerPort: 8080
      nodeSelector:
        gpu.nvidia.com/class: L40S # Ensure this matches the GPU nodes in your cluster

Then expose it with a Service, gpu-usage-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: ollama-metrics-api-service
  namespace: gpu-usage-autoscale
  labels:
    app: ollama-metrics-api
spec:
  selector:
    app: ollama-metrics-api
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
  type: ClusterIP

Apply the deployment and service manifests:

kubectl apply -f gpu-usage-deployment.yaml -n gpu-usage-autoscale
kubectl apply -f gpu-usage-service.yaml -n gpu-usage-autoscale
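
Before wiring up KEDA, it's worth confirming that the metrics endpoint returns the expected JSON. One quick way (a sketch, assuming you run kubectl port-forward svc/ollama-metrics-api-service 8080:8080 -n gpu-usage-autoscale in a separate terminal) is a short Python check:

import json
import urllib.request

# The port-forward makes the in-cluster service reachable on localhost.
with urllib.request.urlopen("http://localhost:8080/metrics") as resp:
    print(json.load(resp))  # expected: {'gpu_usage': {'ollama': 80}}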

Step 4: Configuring the KEDA ScaledObject for Autoscaling

Once the Metrics API is deployed, you can configure KEDA to autoscale the Ollama deployment based on GPU usage. Create a ScaledObject manifest, gpu-usage-scaledobject.yaml:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama-scaledobj
  namespace: gpu-usage-autoscale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-deployment   # The name of your existing Ollama deployment
  pollingInterval: 5          # How often KEDA checks the metric endpoint (in seconds)
  cooldownPeriod: 120         # Wait time before scaling down (in seconds)
  minReplicaCount: 1          # Minimum of 1 replica
  maxReplicaCount: 10         # Maximum number of replicas
  fallback:
    failureThreshold: 3
    replicas: 1               # Fallback replica count if the metrics endpoint fails
  advanced:
    restoreToOriginalReplicaCount: true   # Restore the original replica count when the ScaledObject is removed
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 30
          policies:
            - type: Percent
              value: 100
              periodSeconds: 15
          selectPolicy: Max
        scaleUp:
          stabilizationWindowSeconds: 0
          policies:
            - type: Pods
              value: 4
              periodSeconds: 10
          selectPolicy: Max
  triggers:
    - type: metrics-api
      metadata:
        targetValue: "60"               # The GPU usage threshold against which KEDA scales
        format: "json"
        activationTargetValue: "50"     # The trigger activates above this value
        url: "http://ollama-metrics-api-service.gpu-usage-autoscale.svc.cluster.local:8080/metrics"
        valueLocation: "gpu_usage.ollama"
        method: GET
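
A note on the trigger: valueLocation is a dot-separated path into the JSON body returned by the metrics endpoint. For the payload our API produces, it resolves like this (an illustrative snippet, not something you deploy):

payload = {"gpu_usage": {"ollama": 80}}    # JSON body from /metrics

value = payload["gpu_usage"]["ollama"]     # valueLocation "gpu_usage.ollama" -> 80

# KEDA activates the trigger once this value exceeds activationTargetValue (50)
# and scales the Ollama deployment against targetValue (60).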

Now apply the ScaledObject manifest:

kubectl apply -f gpu-usage-scaledobject.yaml -n gpu-usage-autoscale

Step 5: Deploying Ollama

Your Ollama deployment may already be running; if not, here's an example manifest, ollama-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
  namespace: gpu-usage-autoscale
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434 # Ollama listens on 11434 by default
          resources:
            limits:
              nvidia.com/gpu: 1
      nodeSelector:
        gpu.nvidia.com/class: L40S # Ensure this matches the GPU nodes in your cluster

Now apply the Ollama deployment manifest:

kubectl apply -f ollama-deployment.yaml -n gpu-usage-autoscale

Step 6: Monitoring Autoscaling

You can monitor the autoscaling behavior of your deployment and check that KEDA is set up correctly:

kubectl get scaledobjects -n gpu-usage-autoscale

This command shows the status of the deployed KEDA ScaledObject; you should see output similar to the following:

NAME               SCALETARGETKIND      SCALETARGETNAME     MIN   MAX   TRIGGERS      AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
ollama-scaledobj   apps/v1.Deployment   ollama-deployment   1     10    metrics-api                    True    True     False      Unknown   2m35s
  • READY: True indicates that the scaled object is correctly configured and ready to make scaling decisions.
  • ACTIVE: True indicates that the trigger is currently active, i.e. the reported GPU usage is above the activation threshold, so KEDA will scale the workload.
  • FALLBACK: False means KEDA is not relying on the fallback configuration and is successfully fetching metrics.

If you see ACTIVE as False, the reported metric probably hasn't crossed the activation threshold yet; running kubectl describe scaledobject ollama-scaledobj -n gpu-usage-autoscale will show the scaler's status conditions and any errors.

Now check whether the pods are autoscaling. We started with a single replica, and because the simulated GPU usage (80) is above the target value (60), you should now see multiple pods running, up to the maximum of 10 defined in the ScaledObject. Check the Ollama deployment pods:

kubectl get pods -n gpu-usage-autoscale -l app=ollama -w

This command watches the Ollama pods as KEDA triggers scaling actions, so you can see the number of replicas change in response to the reported GPU usage:

NAME                                 READY   STATUS    RESTARTS   AGE
ollama-deployment-658cc67646-9xpxq   1/1     Running   0          3m
ollama-deployment-658cc67646-qwrfz   1/1     Running   0          3m
ollama-deployment-658cc67646-mzbcv   1/1     Running   0          2m
ollama-deployment-658cc67646-lkfqx   1/1     Running   0          2m
ollama-deployment-658cc67646-7kqxs   1/1     Running   0          1m

As scaling occurs, you will see new pods being created (going from Pending to Running) or removed (Terminating) in real time.
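
Because the sample metrics API always reports 80% (above the 60 target), the deployment will scale up and stay there. To watch scale-down as well, you can make the simulated value configurable and redeploy the metrics API with a lower value. A minimal sketch (the SIMULATED_GPU_USAGE environment variable is an assumption of this example, not part of the original app; add import os at the top of app.py):

    def get_gpu_usage(self):
        # Read the simulated utilisation from an environment variable so it can
        # be lowered (e.g. to 30, below activationTargetValue) to watch KEDA
        # scale the Ollama deployment back down to minReplicaCount.
        return int(os.getenv("SIMULATED_GPU_USAGE", "80"))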

In this tutorial, you've learned how to set up KEDA-based autoscaling for an Ollama deployment using GPU usage as the metric. We covered installing KEDA, deploying a metrics API, and creating a ScaledObject that scales Ollama based on GPU usage.

Feel free to adjust the threshold values and configurations to suit your requirements.