Get Started

Step 1: Create an endpoint

  1. Select a Model: Choose from our list of popular, pre-trained models.

  2. Configure Compute Resources:

    • GPU Selection: Select the GPU type based on your performance and cost requirements. OGC may recommend a GPU that offers the best cost/performance balance for the selected model.
    • Location: Choose the geographical location where your endpoint will be hosted.
  3. Autoscaling (Optional):

    • Replicas: Define the minimum and maximum number of replicas your endpoint can scale between.
    • Zero to Scale Policy: If the minimum number of replicas is set to zero, your endpoint scales down to zero replicas according to the defined policy (see the sketch after this list).
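
For orientation, here is a minimal sketch of what these settings amount to, expressed as a request to a hypothetical management API. Everything in it is illustrative: the URL, route, and field names are assumptions made for the example, not OGC's actual API; in practice you configure all of this in the console.

export OGC_API_TOKEN='<your-account-token>'   # hypothetical account-level token, distinct from the endpoint token in Step 2

# Hypothetical route and fields, for illustration only
curl -X POST "https://api.example.com/v1/endpoints" \
  -H "Authorization: Bearer $OGC_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.1-8b",
        "gpu": "<gpu-type>",
        "location": "<region>",
        "autoscaling": { "min_replicas": 0, "max_replicas": 3 }
      }'

With "min_replicas" set to 0, the endpoint is eligible to scale to zero under the Zero to Scale Policy; raising the minimum to 1 keeps at least one replica warm at all times.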

Step 2: Access your endpoint

After the endpoint is created, you are provided with an authorization token. Every request to an inference endpoint must include this token in the Authorization header to ensure secure access.

info

The authorization token is only shown once, upon creation. You can generate a new one using the Request Access Token option for your endpoint.
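
Before sending a full inference request, you can check that a freshly issued token is accepted with a lightweight call. This is a minimal sketch, assuming the endpoint also exposes the OpenAI-compatible model listing route (an assumption inferred from the chat completions path in the example below, not something shown in the endpoint details):

export ACCESS_TOKEN='<authorization-token>'
export ACCESS_URL='<endpoint-url>'

# A 200 response listing the served model indicates the token is valid
curl -H "Authorization: Bearer $ACCESS_TOKEN" \
  "$ACCESS_URL/openai/v1/models"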

You can find a sample cURL command under the endpoint details. Here is an example for a Llama 3.1 8B model:

export ACCESS_TOKEN='<authorization-token>'
export ACCESS_URL='<endpoint-url>'

curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -d '{
        "model": "model",
        "messages": [
          {"role": "system", "content": "You are an assistant that speaks in pirate speak."},
          {"role": "user", "content": "Write a poem about colors"}
        ],
        "max_tokens": 50,
        "stream": false
      }' \
  "$ACCESS_URL/openai/v1/chat/completions"