Auto Scaling
Auto Scaling ensures that your endpoint can handle varying loads efficiently by adjusting the number of replicas between the minimum and maximum values set for the Endpoint. Scaling is triggered by the number of concurrent requests sent to the endpoint and by GPU utilisation.
Info: Auto Scaling can take some time to provision or de-provision replicas.
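To make the scaling behaviour concrete, here is a minimal sketch of how a replica count could be derived from the two signals and then clamped to the configured range. It is an illustration only, not the platform's actual algorithm: the function name `desired_replicas`, the per-replica request target, and the GPU utilisation target are all hypothetical assumptions.

```python
import math


def desired_replicas(
    concurrent_requests: int,
    gpu_utilisation: float,
    current_replicas: int,
    min_replicas: int,
    max_replicas: int,
    target_requests_per_replica: int = 4,   # assumed target, not a documented default
    target_gpu_utilisation: float = 0.8,    # assumed target, not a documented default
) -> int:
    """Return a replica count clamped to [min_replicas, max_replicas]."""
    # Replicas needed so each one handles at most the target number of concurrent requests.
    by_requests = math.ceil(concurrent_requests / target_requests_per_replica)
    # Replicas needed to bring average GPU utilisation back to the target.
    by_gpu = math.ceil(current_replicas * gpu_utilisation / target_gpu_utilisation)
    # Take whichever signal asks for more capacity, then clamp to the configured range.
    wanted = max(by_requests, by_gpu)
    return max(min_replicas, min(max_replicas, wanted))


# 10 concurrent requests on 2 replicas at 50% GPU, with limits 1 to 5.
print(desired_replicas(10, 0.5, current_replicas=2, min_replicas=1, max_replicas=5))
```

Whatever the real thresholds are, the result never leaves the min/max range set for the Endpoint, and a change in the count only takes effect once replicas have been provisioned or de-provisioned.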
Scale to Zero Policy
The Scale to Zero policy allows the Endpoint to scale down to zero replicas after a certain period with no incoming requests.
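The idle check behind such a policy can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `should_scale_to_zero` and the `idle_timeout_seconds` parameter are hypothetical, and the actual idle period is whatever is configured for the Endpoint.

```python
def should_scale_to_zero(
    seconds_since_last_request: float,
    idle_timeout_seconds: float,  # hypothetical name for the configured idle period
    current_replicas: int,
) -> bool:
    """Return True when a running Endpoint has been idle for at least the configured period."""
    # Nothing to do if the Endpoint is already at zero replicas.
    if current_replicas == 0:
        return False
    # Scale down only once the idle period has fully elapsed.
    return seconds_since_last_request >= idle_timeout_seconds


# Idle for 15 minutes with an assumed 10-minute timeout.
print(should_scale_to_zero(900, 600, current_replicas=1))
```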