
Dataset Requirements

The structure of the input datasets you use is essential for training performant models. This guide explains how to select a dataset, whether from the Hugging Face Hub or via a custom upload, and the format it must follow.

1. Selecting a Dataset

You can train with datasets from either of the following sources:

  • Hugging Face Hub
  • Custom upload via the OGC interface

For custom uploads, ensure your data is properly formatted and use one of the following file types:

  • .csv
  • .jsonl

All datasets must conform to the schema described in section 2.

2. Dataset Schema

2.1 Two‑column format

Provide exactly two columns:

  • "prompt": The instruction or input.
  • "completion": The expected model output.

Example dataset structure:

{"prompt": "The sky is",
"completion": " blue."}

This format is useful for fine-tuning models to complete a sentence. In our next iteration, we'll add other formats, including language-modelling and conversational prompt-completion datasets, to fine-tune models able to hold elaborate conversations.
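
As a minimal sketch of how such a file can be inspected locally, the Hugging Face datasets library loads this format directly. The file name train.jsonl is an assumption for illustration and is not part of the OGC upload flow:

from datasets import load_dataset

# Load a local prompt/completion dataset stored as JSONL.
# "train.jsonl" is an assumed file name used here for illustration only.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Every record should expose exactly the two expected columns.
assert set(dataset.column_names) == {"prompt", "completion"}
print(dataset[0])  # e.g. {'prompt': 'The sky is', 'completion': ' blue.'}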

3. Validation Dataset

A validation dataset is optional but recommended, as it helps monitor model performance and reduce the risk of overfitting.

  • Must follow the same structure and format as the training dataset.

  • You can either select a validation dataset directly from Hugging Face or upload one manually to OGC.

  • Automatic Splitting: If you do not provide a separate validation dataset, OGC will automatically split your uploaded dataset into training and validation subsets, reserving 10% of the data for validation purposes, as sketched below.
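
For illustration, a comparable 90/10 split can be reproduced locally with the datasets library. The exact mechanism and seed OGC uses internally are not documented here, so both are assumptions in this sketch:

from datasets import load_dataset

dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Hold out 10% of the records for validation, mirroring OGC's default split.
# The seed value is illustrative only.
split = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = split["train"], split["test"]
print(len(train_ds), len(val_ds))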

4. Under the Hood: Dataset Preparation

OGC's Fine-Tuning Service includes advanced data preparation to optimise model performance:

4.1 Packing

  • OGC automatically applies a technique called packing, where multiple prompt-response pairs are concatenated into fixed-length blocks (e.g., 1024 tokens per block), as sketched after this list.
  • Packing ensures efficient training by maximising the use of the model's context window, eliminating unnecessary padding, and allowing each token in the batch to contribute meaningfully to training.
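
A minimal sketch of the packing step, assuming already-tokenized examples and the 1024-token block size mentioned above; the block size and helper name are illustrative, not OGC's internal API:

from typing import List

def pack(sequences: List[List[int]], block_size: int = 1024) -> List[List[int]]:
    """Concatenate tokenized examples and cut them into fixed-length blocks."""
    ids: List[int] = []
    for seq in sequences:
        ids.extend(seq)
    # Drop the trailing remainder so every block is exactly block_size tokens,
    # which removes the need for padding.
    n_blocks = len(ids) // block_size
    return [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]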

4.2 Masking the Prompt

  • OGC employs prompt masking, a technique that ensures the model is penalised during training only for its response accuracy, not for reproducing the original prompt.
  • Specifically, the prompt tokens (everything before and including the final delimiter for prompt instructions, e.g., [/INST]) are masked with a special ignore value (-100) during loss computation. This directs the model's learning exclusively toward improving response quality; a sketch of the masking step follows this list.
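
A minimal sketch of that masking step, operating on token-id lists and a delimiter token sequence such as the ids for [/INST]; the helper name and signature are illustrative assumptions, not OGC's internal API:

from typing import List

IGNORE_INDEX = -100  # ignored by the cross-entropy loss during training

def mask_prompt(input_ids: List[int], delimiter_ids: List[int]) -> List[int]:
    """Return labels with every token up to and including the final delimiter masked."""
    labels = list(input_ids)
    end = -1
    # Locate the end of the last occurrence of the delimiter token sequence.
    for i in range(len(input_ids) - len(delimiter_ids) + 1):
        if input_ids[i:i + len(delimiter_ids)] == delimiter_ids:
            end = i + len(delimiter_ids)
    if end > 0:
        labels[:end] = [IGNORE_INDEX] * end
    return labels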

4.3 Applying the Tokenizer Chat Template

Our service constructs chat prompts using the tokenizer’s chat template internally. The service accepts a list of messages and invokes tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False) to generate a single prompt string in the format required by the target model.
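
A minimal sketch of that call; the model name and message contents are illustrative assumptions, and any instruction-tuned model with a chat template would behave similarly:

from transformers import AutoTokenizer

# The model name below is illustrative only.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "The sky is"},
    {"role": "assistant", "content": "blue."},
]

# Render the message list into a single prompt string in the model's expected format.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
print(text)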

The chat template used during training is saved with the resulting model artefact. During evaluation and serving, the same template is applied automatically to reproduce the exact prompt format. This design minimises formatting errors, improves consistency across runs, and simplifies dataset preparation by converting message lists into a single text field that is passed directly to training.