
Get Started

This guide will help you get started with Fine-Tuning models. Our Fine-Tuning feature enables you to train popular LLMs on any publicly available dataset hosted on Hugging Face. Follow the steps below to launch and manage a fine-tuning job.

Step 1: Select Base Model

You can choose from a number of open-source base models to fine-tune.

Step 2: Add Dataset

Specify the Hugging Face dataset path you’d like to use for fine-tuning. There are two types of datasets:

  • Training: Dataset used to train your model.
  • Validation (optional): Dataset used to evaluate the model on unseen data held out of the training loop.

Requirements

  • File Type: .jsonl (JSON Lines) or .csv
  • Columns: Must include a column named text (case-sensitive) with the training data

Source

  • Hugging Face dataset: Full path (for example, imdatta0/ultrachat_1k)
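
For a .jsonl file, each line is a JSON object with a text field, such as {"text": "..."}. If you want to sanity-check a Hugging Face dataset before launching the job, a minimal sketch might look like the following; the dataset path is the example above, and the train split name is an assumption about that dataset, not a platform requirement.

```python
# Optional pre-flight check: confirm the dataset exposes a "text" column.
# The dataset path is the example from this guide; the "train" split name is an assumption.
from datasets import load_dataset

ds = load_dataset("imdatta0/ultrachat_1k", split="train")
assert "text" in ds.column_names, "Fine-tuning requires a column named 'text'"
print(ds[0]["text"][:200])  # preview the first training record
```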

Step 3: Output Job Details

Provide the following:

  • Suffix: A custom identifier added to your Job ID for easier tracking.
  • Registry: Choose the model registry where the fine-tuned model will be saved for future deployment.
  • Seed: Set a random seed for reproducibility (defaults to 42).
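
Conceptually, these three fields form a small job record. The sketch below is purely illustrative; the field and registry names are hypothetical, not the platform's API.

```python
# Hypothetical job-details record (field names and values are illustrative only).
job_details = {
    "suffix": "support-bot-v1",       # custom identifier appended to the Job ID
    "registry": "my-model-registry",  # hypothetical registry where the weights will be saved
    "seed": 42,                       # random seed for reproducibility (defaults to 42)
}
```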

Step 4: Configure Hyperparameters

Each hyperparameter comes with a default value, which you can adjust:

  • Batch size: Number of training samples used in one forward/backward pass.
  • Learning Rate: Controls the step size during optimization.
  • Number of Epochs: Total number of times the model will iterate over the full training set.
  • Warmup Ratio: Proportion of training steps used to gradually increase the learning rate.
  • Weight Decay: Regularization to prevent overfitting by penalizing large weights.
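
If it helps to see how these knobs map onto common training code, the sketch below expresses them with the Hugging Face transformers TrainingArguments class. This is for orientation only: the platform runs training for you, and the values shown are assumptions, not the platform's defaults.

```python
# Illustrative mapping of the hyperparameters above onto transformers.TrainingArguments.
# Not the platform's API; values are placeholders, not recommended defaults.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="finetune-output",
    per_device_train_batch_size=8,  # Batch size
    learning_rate=2e-4,             # Learning Rate
    num_train_epochs=3,             # Number of Epochs
    warmup_ratio=0.03,              # Warmup Ratio
    weight_decay=0.01,              # Weight Decay
    seed=42,                        # Seed from Step 3
)
```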

Step 5: Set LoRA Parameters

  • LoRA Rank: Dimension of the low-rank decomposition.
  • LoRA Alpha: Scaling factor for LoRA updates.
  • LoRA Dropout: Dropout rate applied to LoRA layers.
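
These are the standard LoRA knobs; for example, the PEFT library expresses them as shown below. The snippet is a sketch for orientation only, not the platform's API, and the values are assumptions. In standard LoRA the update is scaled by alpha divided by rank, which is why alpha is often set to the same order as (or a small multiple of) the rank.

```python
# Illustrative only: the same three parameters expressed with PEFT's LoraConfig.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,               # LoRA Rank: dimension of the low-rank decomposition
    lora_alpha=32,      # LoRA Alpha: scaling factor for LoRA updates
    lora_dropout=0.05,  # LoRA Dropout: dropout applied to LoRA layers
)
```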

Step 6: Launch the Job

Once a job is launched, the training run starts automatically. Run time varies depending on the configuration you set.

You can track the progress of your job by monitoring its status as it transitions from Initializing to Running to Completed.
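
If you script around the platform, monitoring typically reduces to polling until the job leaves the active states. The helper below is a hypothetical sketch: the client object and its get_job method are stand-ins, not a documented API; only the status names come from this guide.

```python
import time

def wait_for_completion(client, job_id, poll_seconds=60):
    """Poll a fine-tuning job until it is no longer Initializing or Running.

    `client` and `client.get_job` are hypothetical stand-ins for however you
    query job status; the status names match the transitions described above.
    """
    while True:
        status = client.get_job(job_id).status
        if status not in ("Initializing", "Running"):
            return status  # e.g. "Completed"
        time.sleep(poll_seconds)
```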

Step 7: Post-Job Artifacts

Once the job is completed:

  • Navigate to the job's details view to see the list of checkpoints created at each epoch.
  • For each checkpoint, you’ll see:
    • Training Loss: The error (or loss) the model incurred while learning from the training dataset. A decreasing training loss generally indicates the model is learning, but if it’s very low compared to the validation loss, the model might be overfitting.
    • Validation Loss: Visible only if a validation dataset is provided. The error computed on the validation dataset (unseen during training) indicates how well the model is likely to perform on real-world or unseen data (see the sketch after this list).
  • From the selected checkpoint, you can deploy the fine-tuned model by either:
    • Navigating to the Model Registry, where the model weights become Available in your chosen registry and location, or
    • Deploying the model directly to an Endpoint.
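
As a toy illustration of reading these per-checkpoint metrics, the snippet below picks the checkpoint with the lowest validation loss; the numbers are made up purely to show the overfitting pattern described above.

```python
# Made-up checkpoint metrics, for illustration only.
checkpoints = [
    {"epoch": 1, "train_loss": 1.20, "val_loss": 1.25},
    {"epoch": 2, "train_loss": 0.80, "val_loss": 1.05},
    {"epoch": 3, "train_loss": 0.35, "val_loss": 1.12},  # train loss keeps falling while val loss rises: overfitting
]
best = min(checkpoints, key=lambda c: c["val_loss"])
print(f"Deploy the epoch {best['epoch']} checkpoint (lowest validation loss).")
```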

Registered model weights appear under Model Registry > Model > Versions. Learn more about the Model Registry. From there, you can deploy the fine-tuned model to Endpoints for inference.
