Auto Scaling

Auto Scaling lets your Endpoint handle varying load efficiently by adjusting the number of replicas between the minimum and maximum set for the Endpoint. Scaling is triggered by the number of concurrent requests sent to the Endpoint and by GPU utilisation.

Note: Auto Scaling can take some time to provision or de-provision replicas.
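
The scaling behaviour described above can be pictured with a small sketch. This is only an illustration, not the platform's actual algorithm: the names `ScalingConfig`, `target_concurrency_per_replica`, `target_gpu_utilisation`, and `desired_replicas` are hypothetical, and the rule for combining the two signals (take the larger estimate) is an assumption. The only part taken from the text is that the result stays between the configured minimum and maximum.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingConfig:
    min_replicas: int                      # lower bound set for the Endpoint
    max_replicas: int                      # upper bound set for the Endpoint
    target_concurrency_per_replica: int    # hypothetical tuning value
    target_gpu_utilisation: float          # hypothetical tuning value, in (0, 1]


def desired_replicas(cfg: ScalingConfig, current_replicas: int,
                     concurrent_requests: int, gpu_utilisation: float) -> int:
    """Estimate a replica count from both signals, then clamp to [min, max]."""
    # Replicas needed so that each one serves at most the target concurrency.
    by_requests = math.ceil(concurrent_requests / cfg.target_concurrency_per_replica)
    # Replicas needed to bring average GPU utilisation down to the target.
    by_gpu = math.ceil(current_replicas * gpu_utilisation / cfg.target_gpu_utilisation)
    # Assumption: whichever signal asks for more capacity wins.
    wanted = max(by_requests, by_gpu)
    # Never go outside the min/max configured for the Endpoint.
    return max(cfg.min_replicas, min(cfg.max_replicas, wanted))


cfg = ScalingConfig(min_replicas=1, max_replicas=4,
                    target_concurrency_per_replica=10,
                    target_gpu_utilisation=0.8)
print(desired_replicas(cfg, current_replicas=2, concurrent_requests=35, gpu_utilisation=0.5))
```

With these made-up values, a burst of 100 concurrent requests would still be capped at 4 replicas, and an idle Endpoint would stay at 1 replica.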

Scale to Zero Policy

The Scale to Zero policy allows the Endpoint to scale down to zero replicas after a set period with no incoming requests.
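
As a rough sketch of the idea, the check below decides whether an Endpoint has been idle long enough to drop to zero replicas. The class `ScaleToZeroPolicy`, the field `idle_timeout_s`, and the 15-minute value are hypothetical and chosen only for illustration. The actual idle period and how it is configured are not stated here.

```python
from dataclasses import dataclass


@dataclass
class ScaleToZeroPolicy:
    idle_timeout_s: float  # hypothetical: seconds without requests before scaling to zero

    def should_scale_to_zero(self, last_request_ts: float, now: float) -> bool:
        """Return True once no request has arrived for at least idle_timeout_s."""
        return (now - last_request_ts) >= self.idle_timeout_s


policy = ScaleToZeroPolicy(idle_timeout_s=15 * 60)  # assumed 15-minute idle window
print(policy.should_scale_to_zero(last_request_ts=0.0, now=600.0))
print(policy.should_scale_to_zero(last_request_ts=0.0, now=900.0))
```

Timestamps are passed in explicitly so the check is easy to test. A real caller would typically supply values from a monotonic clock.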
