KEDA Autoscaling for an Ollama Application
This tutorial will walk you through the setup of KEDA-based autoscaling for an Ollama deployment using GPU usage as a metric. We'll cover the following:
- Installing KEDA using Helm
- Creating the GPU Usage Metrics API
- Deploying the Metrics API and its Service in Kubernetes
- Configuring the KEDA ScaledObject for Autoscaling
- Deploying Ollama with Autoscaling Enabled
- Monitoring the Autoscaling Behavior
Step 1: Installing KEDA with Helm
First, install KEDA on your Kubernetes cluster using Helm:
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
Verify the installation:
kubectl get pods -n keda
You should see KEDA operator pods running.
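You can also confirm that the KEDA custom resource definitions, including the ScaledObject used later in this tutorial, were registered:
kubectl get crd | grep keda.sh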
Step 2: Creating the GPU Usage Metrics API
First, create the namespace where you'll apply all the manifests:
kubectl create namespace gpu-usage-autoscale
We'll create a FastAPI application that exposes a simple endpoint returning GPU usage metrics, which KEDA will use for scaling decisions. You can, however, substitute any custom metrics application to trigger KEDA autoscaling.
from fastapi import FastAPI
from fastapi.responses import JSONResponse
app = FastAPI()
class GPUUsageManager:
def __init__(self):
self.usage_by_model = {
"ollama": self.get_gpu_usage()
}
def get_gpu_usage(self):
return 80 # Simulating 80% GPU usage
gpu_usage_manager = GPUUsageManager()
@app.get("/metrics")
async def metrics():
metrics = {
"gpu_usage": {
"ollama": gpu_usage_manager.usage_by_model["ollama"]
}
}
return JSONResponse(status_code=200, content=metrics)
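The get_gpu_usage method above returns a hard-coded value of 80 to simulate load. In a real setup you would query the GPU driver instead; here is a minimal, hypothetical sketch using the nvidia-ml-py bindings (imported as pynvml), assuming the NVIDIA driver is accessible from inside the container:
import pynvml

def get_gpu_usage(self):
    # Sketch only: read real GPU utilization via NVML instead of the simulated value
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on the node
        return int(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)  # utilization in percent (0-100)
    finally:
        pynvml.nvmlShutdown()
If you go this route, remember to install the bindings in the Dockerfile as well (pip install nvidia-ml-py).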
And here is the Dockerfile used to containerize the application:
# Use the official TensorFlow image with GPU support
FROM tensorflow/tensorflow:2.10.0-gpu
# Set the working directory in the container
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
# Install the required Python packages
RUN pip install --no-cache-dir fastapi uvicorn
# Expose port 8080 for the metrics endpoint
EXPOSE 8080
# Run the application
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
Next, build and push the Docker image:
docker build -t <your-docker-username>/ollama-metrics-api:latest .
docker push <your-docker-username>/ollama-metrics-api:latest
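Before deploying, you can optionally sanity-check the image locally; the port mapping below mirrors the EXPOSE directive in the Dockerfile:
docker run --rm -p 8080:8080 <your-docker-username>/ollama-metrics-api:latest
curl http://localhost:8080/metrics
The curl call should return the simulated payload, {"gpu_usage": {"ollama": 80}}.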
Step 3: Deploying the GPU Usage Metrics API in Kubernetes
Now, create the Kubernetes deployment manifest for this metrics API, gpu-usage-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama-metrics-api-deployment
namespace: gpu-usage-autoscale
labels:
app: ollama-metrics-api
spec:
replicas: 1
selector:
matchLabels:
app: ollama-metrics-api
template:
metadata:
labels:
app: ollama-metrics-api
spec:
containers:
- name: ollama-metrics-api
image: <your-docker-username>/ollama-metrics-api:latest
ports:
- containerPort: 8080
nodeSelector:
gpu.nvidia.com/class: L40S # Ensure this matches the GPU nodes in your cluster
And expose it with a Service, gpu-usage-service.yaml:
apiVersion: v1
kind: Service
metadata:
name: ollama-metrics-api-service
namespace: gpu-usage-autoscale
labels:
app: ollama-metrics-api
spec:
selector:
app: ollama-metrics-api
ports:
- protocol: TCP
port: 8080
targetPort: 8080
type: ClusterIP
Apply the deployment and service manifests:
kubectl apply -f gpu-usage-deployment.yaml -n gpu-usage-autoscale
kubectl apply -f gpu-usage-service.yaml -n gpu-usage-autoscale
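To confirm the metrics endpoint is reachable before wiring it into KEDA, you can temporarily port-forward the Service and query it:
kubectl port-forward svc/ollama-metrics-api-service 8080:8080 -n gpu-usage-autoscale
curl http://localhost:8080/metrics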
Step 4: Configuring the KEDA ScaledObject for Autoscaling
Once the metrics API is deployed, you can configure KEDA to autoscale the Ollama deployment based on GPU usage. Create a ScaledObject manifest, gpu-usage-scaledobject.yaml:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: ollama-scaledobj
namespace: gpu-usage-autoscale
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: ollama-deployment # The name of your existing Ollama deployment
pollingInterval: 5 # How often KEDA checks the metric endpoint (in seconds)
cooldownPeriod: 120 # Wait time before scaling down (in seconds)
minReplicaCount: 1 # Minimum of 1 replica
maxReplicaCount: 10 # Maximum number of replicas
fallback:
failureThreshold: 3
replicas: 1 # Fallback replica count if metrics endpoint fails
advanced:
    restoreToOriginalReplicaCount: true # Restore the target's original replica count when this ScaledObject is deleted
horizontalPodAutoscalerConfig:
behavior:
scaleDown:
stabilizationWindowSeconds: 30
policies:
- type: Percent
value: 100
periodSeconds: 15
selectPolicy: Max
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Pods
value: 4
periodSeconds: 10
selectPolicy: Max
triggers:
- type: metrics-api
metadata:
targetValue: "60" # The GPU usage threshold at which scaling should happen
format: "json"
activationTargetValue: "50" # Activate scaling at this value
url: "http://ollama-metrics-api-service.gpu-usage-autoscale.svc.cluster.local:8080/metrics"
      valueLocation: "gpu_usage.ollama" # GJSON path to the value in the JSON returned by /metrics
method: GET
Now apply the scaled object manifest:
kubectl apply -f gpu-usage-scaledobject.yaml -n gpu-usage-autoscale
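Under the hood, KEDA creates and manages a Horizontal Pod Autoscaler for the target deployment (named with a keda-hpa- prefix by default), which you can inspect with:
kubectl get hpa -n gpu-usage-autoscale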
Step 5: Deploying Ollama
You may already have your Ollama deployment ready; if not, here's an example manifest, ollama-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama-deployment
namespace: gpu-usage-autoscale
labels:
app: ollama
spec:
replicas: 1
selector:
matchLabels:
app: ollama
template:
metadata:
labels:
app: ollama
spec:
containers:
- name: ollama
image: ollama/ollama:latest
ports:
        - containerPort: 11434 # Ollama listens on port 11434 by default
resources:
limits:
nvidia.com/gpu: 1
nodeSelector:
gpu.nvidia.com/class: L40S # Ensure this matches the GPU nodes in your cluster
Now apply the Ollama deployment manifest:
kubectl apply -f ollama-deployment.yaml -n gpu-usage-autoscale
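Wait for the rollout to complete before moving on:
kubectl rollout status deployment/ollama-deployment -n gpu-usage-autoscale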
Step 6: Monitoring Autoscaling
You can monitor the autoscaling behavior of your deployment and check that KEDA is set up correctly:
kubectl get scaledobjects -n gpu-usage-autoscale
The above command checks the status of the deployed KEDA ScaledObject; you should see output similar to the following:
NAME SCALETARGETKIND SCALETARGETNAME MIN MAX TRIGGERS AUTHENTICATION READY ACTIVE FALLBACK PAUSED AGE
ollama-scaledobj apps/v1.Deployment ollama-deployment 1 10 metrics-api True True False Unknown 2m35s
- READY: True indicates that the ScaledObject is correctly configured and ready to make scaling decisions.
- ACTIVE: True indicates that scaling is currently active (based on the current GPU usage metrics).
- FALLBACK: False means KEDA is not relying on the fallback configuration and is successfully fetching metrics.
If you see ACTIVE as False, it may indicate that the reported metric hasn't reached the activation threshold (activationTargetValue) yet.
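If READY is False, or you want to see why the ScaledObject is (or isn't) scaling, the events attached to the object usually explain the cause:
kubectl describe scaledobject ollama-scaledobj -n gpu-usage-autoscale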
Now check whether the pods are autoscaling. Recall that we deployed a single replica; you should now see multiple pods running, but no more than the maximum of 10 defined in the ScaledObject. Check the Ollama deployment pods:
kubectl get pods -n gpu-usage-autoscale -l app=ollama -w
This command monitors the state of the Ollama pods as KEDA autoscaling triggers the scaling actions. The expected output will show the number of replicas scaling up or down based on GPU usage:
NAME READY STATUS RESTARTS AGE
ollama-deployment-658cc67646-9xpxq 1/1 Running 0 3m
ollama-deployment-658cc67646-qwrfz 1/1 Running 0 3m
ollama-deployment-658cc67646-mzbcv 1/1 Running 0 2m
ollama-deployment-658cc67646-lkfqx 1/1 Running 0 2m
ollama-deployment-658cc67646-7kqxs 1/1 Running 0 1m
As scaling occurs, you will observe new pods being created (Pending to Running) or removed (Terminating) in real time.
In this tutorial, you've learned how to set up KEDA-based autoscaling for the Ollama deployment using GPU usage as the metric. We covered the installation of KEDA, the deployment of a metrics API, the creation of a ScaledObject, and the configuration for autoscaling based on GPU usage.
Feel free to adjust the threshold values and configurations to suit your requirements.