Dataset Requirements
The structure of your training dataset has a direct effect on the quality of the fine-tuned model. This guide explains which file formats and dataset structures OGC accepts for datasets from Hugging Face.
Using Hugging Face Datasets
OGC supports datasets available on the Hugging Face Hub. When selecting a dataset from Hugging Face, check that it meets the format and structure requirements below.
Supported Formats
- CSV
- Parquet
Dataset Structures
You can use either of the following structures for Hugging Face datasets:
1. Single Column Format
- The dataset should have one column containing the entire training text.
- Column name can vary (commonly "text").
- Use special tokens to differentiate between instructions and responses (e.g., <s>[INST]...[/INST]</s>).
- Recommended batch size: 32.
Example dataset structure:
Dataset({
    features: ['text'],
    num_rows: 3000
})
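The single-column layout can be produced from instruction/response pairs with a small formatting step. The following is a minimal sketch using only the Python standard library, not an OGC tool. The helper name `to_text`, the sample pairs, and the choice to place the response between `[/INST]` and `</s>` are assumptions made for illustration; check the chat template your base model expects before reusing it.

```python
import csv
import io


def to_text(instruction: str, response: str) -> str:
    """Join one instruction/response pair into a single training string.

    Assumption: the response sits between [/INST] and </s>. The guide only
    shows the token skeleton, so adjust this to your model's template.
    """
    return f"<s>[INST] {instruction.strip()} [/INST] {response.strip()}</s>"


# Hypothetical sample data, for illustration only.
pairs = [
    ("Translate 'hello' to French.", "Bonjour."),
    ("What is 2 + 2?", "4"),
]

# One dict per row, with a single "text" column.
rows = [{"text": to_text(instruction, response)} for instruction, response in pairs]

# Serialize to CSV in memory; write csv_text to a file to get an uploadable dataset.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Writing `csv_text` to a `.csv` file gives a file with a single `text` column, which matches the structure shown above.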
2. Two Column Format
The dataset must include exactly two columns:
- "prompt": The instruction or input.
- "completion": The expected model output.
Example dataset structure:
Dataset({
    features: ['prompt', 'completion'],
    num_rows: 3000
})
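Before uploading a two-column CSV, it can help to confirm that the header contains exactly the two expected columns. The sketch below is one way to do that with the standard library. The function name `check_two_column_csv` is hypothetical. Treating column order as irrelevant and rejecting empty cells are assumptions added as sanity checks; they are not requirements stated by OGC.

```python
import csv
import io

REQUIRED_COLUMNS = ("prompt", "completion")


def check_two_column_csv(csv_text: str) -> int:
    """Return the number of data rows, or raise ValueError if the CSV is malformed.

    Assumptions (not stated OGC rules): column order does not matter,
    and every cell must be non-empty.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    if sorted(header) != sorted(REQUIRED_COLUMNS):
        raise ValueError(f"expected columns {REQUIRED_COLUMNS}, found {header}")

    count = 0
    for row in reader:
        # DictReader stores surplus cells under the key None
        # and fills missing cells with None.
        if None in row:
            raise ValueError(f"line {reader.line_num}: too many cells")
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                raise ValueError(f"line {reader.line_num}: empty '{column}'")
        count += 1
    return count


# Hypothetical sample data, for illustration only.
sample = "prompt,completion\r\nWhat is 2 + 2?,4\r\nName a primary color.,Red\r\n"
row_count = check_two_column_csv(sample)
```

A file with an extra column, a misspelled header, or a blank completion raises `ValueError` with the offending line number, so the problem can be fixed before a training job is started.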
Validation Dataset
A validation dataset is optional but recommended, as it helps monitor model performance and reduce the risk of overfitting.
- Must follow the same structure and format as the training dataset.
- You can either select a validation dataset directly from Hugging Face or upload one manually to OGC.
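If no separate validation dataset exists, one option is to hold out part of the training rows and upload that portion manually. The sketch below assumes rows are held as a list of dicts. The names `split_rows` and `same_columns`, the 10% hold-out fraction, and the fixed seed are illustrative choices, not OGC defaults.

```python
import random


def split_rows(rows, validation_fraction=0.1, seed=42):
    """Shuffle rows reproducibly and return (train_rows, validation_rows)."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def same_columns(train_rows, validation_rows):
    """Check that both splits use an identical set of column names."""
    train_columns = {tuple(sorted(row)) for row in train_rows}
    validation_columns = {tuple(sorted(row)) for row in validation_rows}
    return train_columns == validation_columns


# Hypothetical sample data, for illustration only.
rows = [{"prompt": f"Question {i}", "completion": f"Answer {i}"} for i in range(3000)]
train_rows, validation_rows = split_rows(rows)
structure_matches = same_columns(train_rows, validation_rows)
```

Because both splits come from the same rows, they share the same columns by construction. The `same_columns` check is more useful when the validation dataset comes from a different source than the training dataset.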
Following these guidelines gives your fine-tuning run a well-formed dataset to learn from, which helps the resulting model produce accurate, relevant, and context-aware responses.