Dataset Requirements
The structure of your input datasets has a major impact on how well a fine-tuned model performs. This guide explains how to prepare and format datasets for fine-tuning Large Language Models (LLMs) with OGC's Fine-Tuning Service.
Supported Formats
OGC currently supports datasets from Hugging Face Datasets only. They must be one of the following:
- Structured datasets (recommended)
- CSV (.csv)
- JSONL (.jsonl)
- Plain text (.txt)
Coming soon: Support for uploading your own datasets.
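As a minimal sketch, here is how datasets in these formats are typically loaded with the Hugging Face datasets library; the repository name and file paths below are placeholders, not part of the service.

```python
from datasets import load_dataset

# A structured dataset pulled straight from the Hugging Face Hub
# (the repository name is only an example).
ds = load_dataset("yahma/alpaca-cleaned", split="train")

# Local files in the supported formats load the same way
# (file paths are placeholders):
ds_csv = load_dataset("csv", data_files="train.csv", split="train")
ds_jsonl = load_dataset("json", data_files="train.jsonl", split="train")  # JSONL
ds_txt = load_dataset("text", data_files="train.txt", split="train")
```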
Dataset Structure
Your dataset should consist of pairs of prompts (instructions/questions) and corresponding responses (answers). Each pair will guide the LLM to generate relevant, high-quality answers based on given inputs.
Structured datasets should contain a single column or field that holds the full training text, and that column/field must be named "text". For example:
Dataset({
    features: ['text'],
    num_rows: 3000
})
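As an illustration, here is one way to merge prompt/response pairs into the single "text" column described above, using the Hugging Face datasets library; the column names prompt and response and the joining template are assumptions of this sketch.

```python
from datasets import Dataset

# Hypothetical prompt/response pairs (illustrative content only).
pairs = [
    {"prompt": "What formats are supported?", "response": "CSV, JSONL, and plain text."},
    {"prompt": "What should the column be called?", "response": "text"},
]

def to_text(example):
    # Collapse each pair into the single "text" field the service expects;
    # the exact joining template is up to you.
    return {"text": f"{example['prompt']}\n{example['response']}"}

ds = Dataset.from_list(pairs).map(to_text, remove_columns=["prompt", "response"])
print(ds)  # Dataset({features: ['text'], num_rows: 2})
```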
Validation Dataset
A validation dataset is used to evaluate model performance during training and to catch overfitting early. It must follow the same format as the training dataset (CSV, JSONL, or plain text).
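If you don't already have a separate validation file, one common approach is to hold out a slice of the training data with train_test_split; the file path, 10% ratio, and seed below are assumptions of this sketch.

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder path

# Hold out 10% of the rows for validation.
splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
```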
Best Practices for Preparing Your Dataset
- Consistency: Keep prompt and response formatting consistent across all data points.
- Clear Delimitation: For good results, always use the special tokens of your chosen base model so that prompts and completions are clearly separated. For example, if you're using the Qwen/Qwen2.5-1.5B base model, use the following structure (a sketch of how to render it automatically appears after this list):
<|im_start|>system
Your system prompt here.<|im_end|>
<|im_start|>user
Your user prompt here.<|im_end|>
<|im_start|>assistant
Assistant's completion here.<|im_end|>
- Balanced Dataset: Aim for a varied and balanced dataset to enhance the model's generalization capabilities.
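Rather than hand-writing the special tokens, you can let the tokenizer render them. The sketch below uses transformers' apply_chat_template; note it loads the Qwen/Qwen2.5-1.5B-Instruct tokenizer, since the instruct variant ships with the ChatML template shown above (choosing it is an assumption of this sketch, not a requirement of the service).

```python
from transformers import AutoTokenizer

# The Instruct tokenizer ships with the ChatML chat template.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

messages = [
    {"role": "system", "content": "Your system prompt here."},
    {"role": "user", "content": "Your user prompt here."},
    {"role": "assistant", "content": "Assistant's completion here."},
]

# Renders the <|im_start|>...<|im_end|> structure shown above
# without hand-writing the special tokens.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```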
Following these guidelines gives your fine-tuning run the best chance of producing a model that returns accurate, relevant, and context-aware responses.