Llama 3
Welcome to Meta's Llama 3, the latest iteration of our cutting-edge Large Language Model (LLM) integration for GPU infrastructure platforms. Llama 3 represents a significant advancement in leveraging GPU acceleration for fine-tuning and serving LLMs, enabling users to achieve strong performance and scalability in natural language processing tasks.
Llama 3 comes in two sizes, 8B and 70B parameters, each available in pre-trained and instruction-tuned variants. The models accept text-only input and generate text and code as output.
Snippet for usage with Transformers
import transformers
import torch
model_id = "meta-llama/Meta-Llama-3-8B"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
pipeline("Hey how are you doing today?")
Leveraging GPU Acceleration for LLMs
Llama 3 builds upon the success of its predecessors by introducing enhanced features and optimizations tailored to the evolving needs of modern AI practitioners. With Llama 3, you can seamlessly integrate LLMs into your GPU infrastructure to unlock new levels of efficiency and productivity in natural language understanding, generation, and analysis.
For inferencing
- 2x NVIDIA H100 80GB (approx. $40K each)
- 3x NVIDIA A100 40GB (approx. $10K each)
For fine-tuning
Budget roughly 10-20 GB of additional GPU memory when performing a LoRA fine-tune.
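As a rough illustration of where these figures come from, the memory needed just to hold the model weights can be estimated from the parameter count and the bytes per parameter (a back-of-the-envelope sketch; it ignores the KV cache, activations, and framework overhead):
# Back-of-the-envelope estimate of GPU memory needed for model weights alone
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # 1e9 parameters * bytes-per-parameter / 1e9 bytes-per-GB
    return params_billion * bytes_per_param

for name, params in [("Llama-3-8B", 8), ("Llama-3-70B", 70)]:
    for dtype, nbytes in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name} weights in {dtype}: ~{weight_memory_gb(params, nbytes):.0f} GB")
For example, the 70B model in bf16 needs roughly 140 GB just for its weights, which is why two 80 GB H100s (or several A100s, typically with quantization) are recommended for inference.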
- Fine Tuning
- Inference
- Performance
Fine-tuning Llama 3 with advanced techniques such as PyTorch Fully Sharded Data Parallel (FSDP) and QLoRA can significantly improve training efficiency, especially on distributed systems or constrained hardware. FSDP enables efficient distributed training across multiple GPUs, while QLoRA combines a quantized base model with low-rank adapters so that large models can be fine-tuned with far less GPU memory and little loss in accuracy.
- PyTorch FSDP: PyTorch Fully Sharded Data Parallel (FSDP) is a distributed training technique that shards model parameters, gradients, and optimizer states across multiple devices while handling communication and synchronization efficiently. This approach can significantly accelerate training on systems with multiple GPUs, improving scalability and reducing time to convergence (see the sketch after this list).
- QLoRA: QLoRA (Quantized Low-Rank Adaptation) fine-tunes a base model whose weights have been quantized, typically to 4-bit, while training only small low-rank adapter matrices on top of it. This dramatically reduces the GPU memory required for fine-tuning while keeping quality close to full-precision fine-tuning, which makes it particularly useful on resource-constrained hardware.
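To make the FSDP bullet above concrete, here is a minimal sketch of wrapping a model with FSDP in plain PyTorch, launched with torchrun (the model name and setup are illustrative; a fuller Lightning-based example follows in the next subsection):
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Launch with: torchrun --nproc_per_node=2 fsdp_wrap.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Load the model once per rank, then let FSDP shard its parameters,
# gradients, and optimizer state across all ranks.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)
model = FSDP(model, device_id=local_rank)

# Training then proceeds as usual; FSDP all-gathers parameter shards on demand.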
To fine-tune Llama 3 effectively with PyTorch FSDP and QLoRA, you will need hardware with the following specifications:
Multi-GPU setup: PyTorch FSDP is designed for distributed training across multiple GPUs, so you will need a system with multiple high-performance GPUs, such as NVIDIA A100s or H100s, to get the full benefit of FSDP.
High RAM: Since Llama 3 is a large model, you need sufficient system RAM to hold both the model and the data being processed, especially when using distributed training techniques like FSDP.
Storage: Adequate storage space is needed for the pre-trained Llama 3 weights, fine-tuning scripts, and any additional data required for fine-tuning and distributed training.
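Before launching a distributed fine-tune, it can help to sanity-check the visible GPUs and their memory from Python (a small sketch):
import torch

# Quick check of the GPUs available to the fine-tuning job
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")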
Below is a basic sketch of how Llama 3 could be fine-tuned with PyTorch FSDP and QLoRA using the Hugging Face Transformers and PEFT libraries together with PyTorch Lightning. Treat it as an outline rather than a drop-in recipe: combining FSDP with 4-bit quantization requires recent versions of bitsandbytes and PEFT, and you need to supply your own tokenized train_dataloader.
import torch
import pytorch_lightning as pl
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B"

# Load the Llama 3 tokenizer and a 4-bit quantized base model (the "Q" in QLoRA)
tokenizer = AutoTokenizer.from_pretrained(model_id)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, quantization_config=bnb_config
)
tokenizer.pad_token = tokenizer.eos_token          # Llama 3 has no dedicated pad token
model.config.pad_token_id = tokenizer.pad_token_id

# Attach LoRA adapters so only a small set of low-rank weights is trained
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)

# Define a PyTorch Lightning module for fine-tuning
class LlamaFineTuning(pl.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        return outputs.loss, outputs.logits

    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        loss, _ = self(input_ids, attention_mask, labels)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return AdamW(self.model.parameters(), lr=2e-5)

# Define the PyTorch Lightning Trainer with FSDP sharding across two GPUs
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="fsdp",             # shard parameters, gradients, and optimizer state
    precision="bf16-mixed",      # mixed precision for better memory usage
    accumulate_grad_batches=8,   # accumulate gradients to effectively increase batch size
)

# Fine-tune Llama 3 using the PyTorch Lightning Trainer
# (train_dataloader is assumed to yield (input_ids, attention_mask, labels) batches)
fine_tuner = LlamaFineTuning(model)
trainer.fit(fine_tuner, train_dataloader)
Inferencing with Meta's Llama 3 enables developers to deploy and use fine-tuned LLMs in production environments for real-time text generation, analysis, and interaction. Llama 3 inference workflows leverage GPU acceleration to maximize throughput and minimize latency, ensuring responsive and scalable NLP applications.
Depending on the use case, you can start by installing TensorRT-LLM, an open-source library designed to accelerate inference of the latest LLMs on NVIDIA GPUs. NeMo, an end-to-end framework for building, customizing, and deploying generative AI applications, uses both TensorRT-LLM and the NVIDIA Triton Inference Server to deploy generative AI solutions.
Ori GPU VMs come with pre-defined init scripts for installations such as Python 3.10 and TensorRT.
The following steps show how to build and run Llama 3 on a single GPU, on a single node with multiple GPUs, and across multiple nodes with multiple GPUs:
- Download weights from Hugging Face. This example setup uses Meta-Llama-3-8B-Instruct; download the model weights from Hugging Face:
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
- Convert the weights from Hugging Face Transformers format to TensorRT-LLM format. The convert_checkpoint.py script converts HF weights into TensorRT-LLM checkpoints. The number of checkpoint files (in .safetensors format) matches the number of GPUs used to run inference.
# Build the LLaMA-3-8B model using a single GPU and BF16.
python3.10 convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
--output_dir ./tllm_checkpoint_1gpu_bf16 \
--dtype bfloat16
OR
# Build the LLaMA-3-8B model using 2-way tensor parallelism and 2-way pipeline parallelism.
python3.10 convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
--output_dir ./tllm_checkpoint_4gpu_tp2_pp2 \
--dtype bfloat16 \
--tp_size 2 \
--pp_size 2
- Build the TensorRT engine. The trtllm-build command builds TensorRT-LLM engines from TensorRT-LLM checkpoints. The number of engine files also matches the number of GPUs used to run inference.
# Using single GPU
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
--output_dir ./tmp/llama/8B/trt_engines/bf16/1-gpu \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16
OR
# Using 2-way tensor parallelism and 2-way pipeline parallelism
trtllm-build --checkpoint_dir ./tllm_checkpoint_4gpu_tp2_pp2 \
--output_dir ./tmp/llama/8B/trt_engines/bf16/4-gpu \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16
Refer to the TensorRT-LLM examples repository to run other popular LLMs in the same way.
- Run the Llama-3-8B-Instruct model. Use the following command to run the 8B engine built above on a single GPU:
python3 run.py --max_output_len=100 \
    --tokenizer_dir ./Meta-Llama-3-8B-Instruct \
    --engine_dir=./tmp/llama/8B/trt_engines/bf16/1-gpu \
    --max_attention_window_size=4096 \
    --input_text "How do I count to nine in French?"
BeFOri, our benchmarking framework, provides benchmarks for the Llama family on NVIDIA GPUs such as the A100, V100, and H100.
Check out our GitHub repo, which helps you select hardware to run models based on your application requirements. The BeFOri framework lets you measure four key metrics for GPU performance when running inference with LLMs (a minimal measurement sketch follows the list):
- Time to First Token (TTFT)
- Inter-Token Latency (ITL)
- End-to-End Latency (EL)
- Token Throughput (TT)
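As an illustration of what these metrics capture, they can be derived from timestamps collected around a streaming generation loop. A sketch, where stream_tokens is a hypothetical generator yielding one token at a time from the model:
import time

def measure(stream_tokens, prompt):
    # Record a timestamp for every generated token
    start = time.perf_counter()
    token_times = [time.perf_counter() for _ in stream_tokens(prompt)]

    ttft = token_times[0] - start                        # Time to First Token
    e2e = token_times[-1] - start                        # End-to-End Latency
    itl = (e2e - ttft) / max(len(token_times) - 1, 1)    # mean Inter-Token Latency
    throughput = len(token_times) / e2e                  # Token Throughput (tokens/s)
    return ttft, itl, e2e, throughput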
For now, BeFOri supports the Llama family as well as models accessed through popular providers and frameworks such as OpenAI, Anthropic, Together AI, Hugging Face, and LiteLLM.
Blog
Read our blog on how BeFOri works.