Llama 3
Welcome to Meta's Llama 3, the latest iteration of our cutting-edge Large Language Model (LLM) integration for GPU infrastructure platforms. Llama 3 represents a significant advancement in leveraging GPU acceleration for fine-tuning and serving LLMs, enabling users to achieve strong performance and scalability in natural language processing tasks.
Llama 3 comes in two sizes, 8B and 70B parameters, each available in pre-trained and instruction-tuned variants. The models accept text-only input and generate text and code as output.
Snippet for usage with Transformers
import transformers
import torch
model_id = "meta-llama/Meta-Llama-3-8B"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
pipeline("Hey how are you doing today?")
Leveraging GPU Acceleration for LLMs
Llama 3 builds upon the success of its predecessors by introducing enhanced features and optimizations tailored to the evolving needs of modern AI practitioners. With Llama 3, you can seamlessly integrate LLMs into your GPU infrastructure to unlock new levels of efficiency and productivity in natural language understanding, generation, and analysis.
For inferencing
- 2x NVIDIA H100 80GB (approx. $40K each)
- 3x NVIDIA A100 40GB (approx. $10K each)
For fine-tuning
Budget roughly 10-20 GB of additional GPU memory when performing a LoRA fine-tune.
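As a rough illustration of where these figures come from, the memory needed just to hold the model weights can be estimated from the parameter count and the bytes per parameter (a back-of-the-envelope sketch; it ignores the KV cache, activations, and framework overhead):
# Back-of-the-envelope estimate of GPU memory needed for model weights alone
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # 1e9 parameters * bytes-per-parameter / 1e9 bytes-per-GB
    return params_billion * bytes_per_param

for name, params in [("Llama-3-8B", 8), ("Llama-3-70B", 70)]:
    for dtype, nbytes in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name} weights in {dtype}: ~{weight_memory_gb(params, nbytes):.0f} GB")
For example, the 70B model in bf16 needs roughly 140 GB just for its weights, which is why two 80 GB H100s (or several A100s, typically with quantization) are recommended for inference.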
- Fine Tuning
- Inference
- Performance
Fine-tuning Llama 3 with advanced techniques such as PyTorch Fully Sharded Data Parallel (FSDP) and QLoRA can significantly improve training efficiency, especially on distributed systems or constrained hardware. FSDP enables efficient distributed training across multiple GPUs, while QLoRA combines a quantized base model with low-rank adapters so that large models can be fine-tuned with far less GPU memory and little loss in accuracy.
- PyTorch FSDP: PyTorch Fully Sharded Data Parallel (FSDP) is a distributed training technique that shards model parameters, gradients, and optimizer states across multiple devices while handling communication and synchronization efficiently. This approach can significantly accelerate training on systems with multiple GPUs, improving scalability and reducing time to convergence (see the sketch after this list).
- QLoRA: QLoRA (Quantized Low-Rank Adaptation) fine-tunes a base model whose weights have been quantized, typically to 4-bit, while training only small low-rank adapter matrices on top of it. This dramatically reduces the GPU memory required for fine-tuning while keeping quality close to full-precision fine-tuning, which makes it particularly useful on resource-constrained hardware.
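To make the FSDP bullet above concrete, here is a minimal sketch of wrapping a model with FSDP in plain PyTorch, launched with torchrun (the model name and setup are illustrative; a fuller Lightning-based example follows in the next subsection):
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Launch with: torchrun --nproc_per_node=2 fsdp_wrap.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Load the model once per rank, then let FSDP shard its parameters,
# gradients, and optimizer state across all ranks.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)
model = FSDP(model, device_id=local_rank)

# Training then proceeds as usual; FSDP all-gathers parameter shards on demand.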
To fine-tune Llama 3 effectively with PyTorch FSDP and QLoRA, you will need hardware with the following specifications:
Multi-GPU setup: PyTorch FSDP is designed for distributed training across multiple GPUs, so you will need a system with multiple high-performance GPUs, such as NVIDIA A100s or H100s, to get the full benefit of FSDP.
High RAM: Since Llama 3 is a large model, you need sufficient system RAM to hold both the model and the data being processed, especially when using distributed training techniques like FSDP.
Storage: Adequate storage space is needed for the pre-trained Llama 3 weights, fine-tuning scripts, and any additional data required for fine-tuning and distributed training.
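Before launching a distributed fine-tune, it can help to sanity-check the visible GPUs and their memory from Python (a small sketch):
import torch

# Quick check of the GPUs available to the fine-tuning job
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")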
Below is a basic sketch of how Llama 3 could be fine-tuned with PyTorch FSDP and QLoRA using the Hugging Face Transformers and PEFT libraries together with PyTorch Lightning. Treat it as an outline rather than a drop-in recipe: combining FSDP with 4-bit quantization requires recent versions of bitsandbytes and PEFT, and you need to supply your own tokenized train_dataloader.
import torch
import pytorch_lightning as pl
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B"

# Load the Llama 3 tokenizer and a 4-bit quantized base model (the "Q" in QLoRA)
tokenizer = AutoTokenizer.from_pretrained(model_id)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, quantization_config=bnb_config
)
tokenizer.pad_token = tokenizer.eos_token          # Llama 3 has no dedicated pad token
model.config.pad_token_id = tokenizer.pad_token_id

# Attach LoRA adapters so only a small set of low-rank weights is trained
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)

# Define a PyTorch Lightning module for fine-tuning
class LlamaFineTuning(pl.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        return outputs.loss, outputs.logits

    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        loss, _ = self(input_ids, attention_mask, labels)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return AdamW(self.model.parameters(), lr=2e-5)

# Define the PyTorch Lightning Trainer with FSDP sharding across two GPUs
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="fsdp",             # shard parameters, gradients, and optimizer state
    precision="bf16-mixed",      # mixed precision for better memory usage
    accumulate_grad_batches=8,   # accumulate gradients to effectively increase batch size
)

# Fine-tune Llama 3 using the PyTorch Lightning Trainer
# (train_dataloader is assumed to yield (input_ids, attention_mask, labels) batches)
fine_tuner = LlamaFineTuning(model)
trainer.fit(fine_tuner, train_dataloader)
Inferencing with Meta's Llama 3 enables developers to deploy and use fine-tuned LLMs in production environments for real-time text generation, analysis, and interaction. Llama 3 inference workflows leverage GPU acceleration to maximize throughput and minimize latency, ensuring responsive and scalable NLP applications.
Depending on the use case, you can start by installing TensorRT-LLM, an open-source library designed to accelerate inference of the latest LLMs on NVIDIA GPUs. NeMo, an end-to-end framework for building, customizing, and deploying generative AI applications, uses both TensorRT-LLM and the NVIDIA Triton Inference Server to deploy generative AI solutions.
Ori GPU VMs come with pre-defined init scripts for installations such as Python 3.10 and TensorRT.
The following steps show how to build and run Llama 3 on a single GPU, on a single node with multiple GPUs, and across multiple nodes with multiple GPUs:
- Download weights from Hugging Face. This example setup uses Meta-Llama-3-8B-Instruct; download the model weights from Hugging Face:
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
- Convert the weights from Hugging Face Transformers format to TensorRT-LLM format. The convert_checkpoint.py script converts HF weights into TensorRT-LLM checkpoints. The number of checkpoint files (in .safetensors format) matches the number of GPUs used to run inference.
# Build the LLaMA-3-8B model using a single GPU and BF16.
python3.10 convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
--output_dir ./tllm_checkpoint_1gpu_bf16 \
--dtype bfloat16
OR
# Build the LLaMA-3-8B model using 2-way tensor parallelism and 2-way pipeline parallelism.
python3.10 convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
--output_dir ./tllm_checkpoint_4gpu_tp2_pp2 \
--dtype bfloat16 \
--tp_size 2 \
--pp_size 2
- Build the TensorRT engine. The trtllm-build command builds TensorRT-LLM engines from TensorRT-LLM checkpoints. The number of engine files also matches the number of GPUs used to run inference.
# Using single GPU
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
--output_dir ./tmp/llama/8B/trt_engines/bf16/1-gpu \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16
OR
# Using 2-way tensor parallelism and 2-way pipeline parallelism
trtllm-build --checkpoint_dir ./tllm_checkpoint_4gpu_tp2_pp2 \
--output_dir ./tmp/llama/8B/trt_engines/bf16/4-gpu \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16
Refer to the TensorRT-LLM examples repository to run other popular LLMs in the same way.
- Run the Llama-3-8B-Instruct model. Use the following command to run the 8B engine built above on a single GPU:
python3 run.py --max_output_len=100 \
    --tokenizer_dir ./Meta-Llama-3-8B-Instruct \
    --engine_dir=./tmp/llama/8B/trt_engines/bf16/1-gpu \
    --max_attention_window_size=4096 \
    --input_text "How do I count to nine in French?"
BeFOri, our benchmarking framework, provides benchmarks for the Llama family on NVIDIA GPUs such as the A100, V100, and H100.
Check out our GitHub repo, which helps you select hardware to run models based on your application requirements. The BeFOri framework lets you measure four key metrics for GPU performance when running inference with LLMs (a minimal measurement sketch follows the list):
- Time to First Token (TTFT)
- Inter-Token Latency (ITL)
- End-to-End Latency (EL)
- Token Throughput (TT)
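As an illustration of what these metrics capture, they can be derived from timestamps collected around a streaming generation loop. A sketch, where stream_tokens is a hypothetical generator yielding one token at a time from the model:
import time

def measure(stream_tokens, prompt):
    # Record a timestamp for every generated token
    start = time.perf_counter()
    token_times = [time.perf_counter() for _ in stream_tokens(prompt)]

    ttft = token_times[0] - start                        # Time to First Token
    e2e = token_times[-1] - start                        # End-to-End Latency
    itl = (e2e - ttft) / max(len(token_times) - 1, 1)    # mean Inter-Token Latency
    throughput = len(token_times) / e2e                  # Token Throughput (tokens/s)
    return ttft, itl, e2e, throughput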
For now, BeFOri supports the Llama family as well as models accessed through popular providers and frameworks such as OpenAI, Anthropic, Together AI, Hugging Face, and LiteLLM.
Blog
Read our blog on how BeFOri works.