Auto Scaling

Auto Scaling lets your Endpoint handle varying load efficiently by adjusting the number of replicas between the minimum and maximum set for the Endpoint. Scaling is triggered by the number of concurrent requests sent to the Endpoint and by GPU utilisation.
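To make the min/max bounds and the two triggers concrete, here is a minimal sketch of how a replica count could be derived from them. It is an illustration only, not the platform's actual algorithm: the names `ScalingConfig` and `desired_replicas`, both target thresholds, and the proportional rule (borrowed from the style of the Kubernetes Horizontal Pod Autoscaler) are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingConfig:
    """Hypothetical settings; the field names are illustrative assumptions."""
    min_replicas: int
    max_replicas: int
    target_concurrency_per_replica: float  # assumed threshold, requests per replica
    target_gpu_utilisation: float          # assumed threshold, between 0.0 and 1.0


def desired_replicas(current_replicas: int,
                     concurrent_requests: int,
                     gpu_utilisation: float,
                     config: ScalingConfig) -> int:
    """Return a replica count clamped to the configured min/max."""
    # Replicas needed so each one stays at or below the concurrency target.
    by_concurrency = math.ceil(
        concurrent_requests / config.target_concurrency_per_replica
    )
    # Proportional rule: scale the current count by observed/target utilisation.
    by_gpu = math.ceil(
        current_replicas * gpu_utilisation / config.target_gpu_utilisation
    )
    # Take the more demanding of the two signals, then clamp to the bounds.
    wanted = max(by_concurrency, by_gpu)
    return max(config.min_replicas, min(config.max_replicas, wanted))
```

Whatever the signals suggest, the result never leaves the configured range: a burst of traffic cannot push the count above the maximum, and an idle period cannot push it below the minimum.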


Provisioning and de-provisioning replicas can take some time, so a change in load is not reflected in the replica count immediately.

Scale to Zero Policy

The Scale to Zero policy allows the Endpoint to go to zero replicas after a certain period with no incoming requests.
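The idle check behind such a policy can be sketched as follows. This is a hedged illustration under stated assumptions: `ScaleToZeroPolicy`, `idle_timeout_seconds`, and `should_scale_to_zero` are hypothetical names, and the actual idle period is whatever is configured for the Endpoint.

```python
from dataclasses import dataclass


@dataclass
class ScaleToZeroPolicy:
    """Hypothetical policy object; the field name is an assumption."""
    idle_timeout_seconds: float


def should_scale_to_zero(last_request_at: float,
                         now: float,
                         policy: ScaleToZeroPolicy) -> bool:
    """Return True once no request has arrived for the full idle period.

    Both timestamps are in seconds on the same clock (for example,
    values returned by time.monotonic()).
    """
    idle_for = now - last_request_at
    return idle_for >= policy.idle_timeout_seconds
```

For example, with an assumed 15-minute timeout, `should_scale_to_zero(0.0, 900.0, ScaleToZeroPolicy(900.0))` evaluates to `True`, while a request seen one second more recently keeps the Endpoint running.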