
Quick Guide: How to Fine-tune and Evaluate a Large Language Model


Key Summary

1 Environment Setup

  • Training environment: NVIDIA A100
  • Frameworks: Hugging Face Transformers, PyTorch
  • Fine-tuning library: TRL (Transformer Reinforcement Learning)
  • Accelerator: FlashAttention (see the loading sketch after this list)
  • Evaluate open LLMs on MT-Bench: MT-Bench is a benchmark designed by LMSYS to test the conversation and instruction-following capabilities of large language models (LLMs). It evaluates LLMs through multi-turn conversations, focusing on their ability to engage in coherent, informative, and engaging exchanges. Because human evaluation is expensive and time-consuming, LMSYS uses GPT-4-Turbo to grade the model responses. MT-Bench is part of the FastChat repository.
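
A minimal sketch of the model-loading side of this setup, assuming Transformers with the flash-attn package installed; the model id and dtype are illustrative choices, not taken from the original notebooks:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model id; any causal LM with FlashAttention 2 support works the same way.
model_id = "google/gemma-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# bfloat16 plus FlashAttention 2 is a common combination on A100 GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)
```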

2 Dataset

Important Datasets for fine-tuning LLMs:

The choice of dataset and format depends on the specific scenario and use case. Note, however, that preference datasets can inherit biases from the humans or AI systems that created them; to ensure broader applicability and fairness, incorporate diverse feedback when constructing these datasets. A formatting sketch follows.
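
As a rough illustration (the column names below come from databricks-dolly-15k; the mapping itself is an assumption, not the original notebook's code), an instruction-style record can be converted into the conversational "messages" format that SFTTrainer understands:

```python
from datasets import load_dataset

# databricks-dolly-15k ships "instruction", "context", and "response" columns.
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def to_conversational(sample):
    """Map an instruction-style record to the chat "messages" format."""
    prompt = sample["instruction"]
    if sample["context"]:
        prompt += f"\n\nContext:\n{sample['context']}"
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": sample["response"]},
        ]
    }

dataset = dataset.map(to_conversational, remove_columns=dataset.column_names)
```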

3 TRL and SFTTrainer for Fine-Tuning

SFTTrainer builds upon the robust foundation of the Trainer class from Transformers, offering all the essential features: logging, evaluation, and checkpointing. But SFTTrainer goes a step further by adding a suite of features that streamline the fine-tuning process:

  • Dataset Versatility: SFTTrainer handles conversational and instruction formats seamlessly and prepares them automatically for training.
  • Completion Focus: SFTTrainer can compute the loss on completions only, ignoring prompt tokens and making training more efficient.
  • Dataset Packing: SFTTrainer packs multiple short examples into a single sequence for faster, more efficient training runs.
  • Fine-Tuning Finesse: SFTTrainer integrates PEFT (Parameter-Efficient Fine-Tuning) techniques such as QLoRA, which reduces memory usage during fine-tuning through quantization without sacrificing performance (a configuration sketch follows this list).
  • Conversational Readiness: SFTTrainer prepares the model and tokenizer for conversational fine-tuning by adding the required special tokens.
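
A minimal sketch of how these pieces fit together for a 4-bit QLoRA run; the hyperparameters are illustrative and the exact SFTTrainer argument names vary between TRL versions:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_id = "google/gemma-7b"

# 4-bit quantization config used by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA adapters: only these low-rank matrices are trained.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Dolly 15k; in practice convert it first to the conversational
# "messages" format shown in the dataset section above.
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=1024,
    packing=True,  # pack short samples into full-length sequences
    args=TrainingArguments(
        output_dir="gemma-7b-dolly15k-chatml",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
)
trainer.train()
```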

4 Evaluation

MT-Bench offers two evaluation strategies:

  • Single-Answer Grading: a judge LLM (e.g. GPT-4) scores each model answer directly on a 10-point scale.
  • Pair-Wise Comparison: a judge LLM compares the responses of two models to the same prompt and picks the better one (or declares a tie), producing a win rate.

For our evaluation, we will use the pair-wise comparison method to compare the fine-tuned gemma-7b-Dolly15k-chatml with the Mistral-7B-Instruct-v0.2 model. Since MT-Bench currently only supports OpenAI or Anthropic models as judges, we will use the FastChat repository and incorporate reference answers from GPT-4 Turbo (gpt-4-1106-preview). This allows us to maintain a high-quality evaluation process without incurring significant costs.

  • Generate responses using gemma-7b-Dolly15k-chatml and Mistral-7B-Instruct-v0.2 (see the generation sketch after this list)
  • Evaluate the responses using pair-wise comparison with GPT-4-Turbo as the judge
  • Plot and compare the results
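
As a rough sketch of the response-generation step, assuming the merged model has a chat template (the model id and prompt are illustrative; for the actual benchmark, FastChat's llm_judge scripts drive answer generation and judging):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical Hub id of the merged, fine-tuned model.
model_id = "gemma-7b-Dolly15k-chatml"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A single illustrative MT-Bench-style turn.
messages = [{"role": "user", "content": "Explain the difference between a list and a tuple in Python."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```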

Notebooks

1 Fine-tune Google's Gemma 7B on the databricks-dolly-15k dataset

  • When working with QLoRA, training updates only the adapter parameters, not the entire model, so only the adapter weights are saved during training. To facilitate text-generation inference, you can save the full model by merging the adapter weights into the base model weights with the merge_and_unload method and then saving the complete model with save_pretrained. This produces a standard model suitable for inference tasks (see the sketch after this list).
  • Fine-tuned Gemma 7B Model
  • Fine-tuned Gemma 7B Full Model
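
A minimal sketch of that merge step, assuming the QLoRA adapter was saved to a local directory (paths and output names below are illustrative):

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Illustrative path to the QLoRA adapter checkpoint produced by SFTTrainer.
adapter_dir = "gemma-7b-dolly15k-chatml"

# Load the base model with the adapter attached, then merge the LoRA weights in.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_dir,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
merged_model = model.merge_and_unload()

# Save the merged, standalone model (plus tokenizer) for standard inference.
merged_model.save_pretrained("gemma-7b-dolly15k-chatml-merged", safe_serialization=True)
tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
tokenizer.save_pretrained("gemma-7b-dolly15k-chatml-merged")
```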

2 Evaluate the fine-tuned models

3 Fine-tune Code Llama for Text-to-SQL

4 Deployment Demo

Simple demo on Hugging Face Spaces: HuggingFace Space
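
As a minimal sketch of what such a Space can look like, assuming Gradio and a recent Transformers version that accepts chat-formatted pipeline inputs (the model id is a placeholder for the merged checkpoint on the Hub):

```python
import gradio as gr
import torch
from transformers import pipeline

# Placeholder Hub id; point this at the merged, fine-tuned checkpoint.
generator = pipeline(
    "text-generation",
    model="gemma-7b-Dolly15k-chatml",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def respond(message, history):
    """Answer a single user message with the fine-tuned model."""
    messages = [{"role": "user", "content": message}]
    output = generator(messages, max_new_tokens=256, do_sample=False)
    return output[0]["generated_text"][-1]["content"]

demo = gr.ChatInterface(respond, title="gemma-7b-Dolly15k-chatml demo")
demo.launch()
```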