5 minute read

Stanford Alpaca

Introduction

Stanford Alpaca GitHub Repo

  • Stanford Alpaca is an “instruction-following” LLaMA model.
  • The repo aims to build and share an instruction-following LLaMA model. It contains:
    • The 52K instruction-following data used for fine-tuning the model.
    • The code for generating the data.
    • The code for fine-tuning the model.
    • The code for recovering Alpaca-7B weights from our released weight diff.

Overview

  • The current “Alpaca 7B model” is fine-tuned from a “7B LLaMA” model on 52K instruction-following data generated by the techniques in the Self-Instruct paper.
  • Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite.
  • Alpaca is still under development, and there are many limitations that have to be addressed.
  • Alpaca is not yet fine-tuned to be safe and harmless.
  • Initial release contains the data generation procedure, dataset, and training recipe.
  • Model weights can be released only if the creators of LLaMA give permission.
  • A live demo is available to help readers better understand the capabilities and limits of Alpaca.
  • Based on the following papers:
    • LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)
    • Self-Instruct: Aligning Language Models with Self-Generated Instructions (Wang et al., 2022)
  • Data Release
    • alpaca_data.json contains the 52K instruction-following examples used for fine-tuning the Alpaca model. This JSON file is a list of dictionaries; each dictionary contains the following fields: instruction, input, and output (the answer generated by text-davinci-003). A sketch of the file format appears below.
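
For illustration, the file can be loaded with plain Python. The field names follow the description above; the example record in the comments is made up, not copied from the dataset.

import json

# Load the released 52K instruction-following dataset.
with open("alpaca_data.json") as f:
    records = json.load(f)  # a list of dicts with keys: instruction, input, output

# A record looks roughly like this (illustrative example only):
# {
#     "instruction": "Rewrite the sentence in the passive voice.",
#     "input": "The cat chased the mouse.",
#     "output": "The mouse was chased by the cat."
# }
print(len(records), list(records[0].keys()))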

High-level Activities of the Alpaca Project

High-level activities done by the Stanford Alpaca team, and the project output

  1. Data Generation: The team used OpenAI’s text-davinci-003 model (a GPT-3.5-series model) to generate a dataset of 52,000 instruction-response pairs, by prompting it with seed instructions and asking it to produce new instructions and corresponding responses.

  2. Fine-Tuning: They used this generated dataset to fine-tune Meta’s LLaMA model, making it better at following instructions.

  3. Evaluation: The fine-tuned Alpaca model was then evaluated for its ability to follow instructions effectively, comparing its performance to more advanced models.

Output: The project resulted in a fine-tuned version of the LLaMA model, called Alpaca, which is smaller, more efficient, and capable of following instructions well.

Capabilities

This model can perform a wide range of instruction-following tasks; see the live demo for examples.

Data Generation

Output

  • This process produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500).
  • The 52K generated examples are much more diverse than the data released by Self-Instruct.

Process

  • Built on the data generation pipeline from Self-Instruct with the following modifications (a sketch of the batched generation call appears after this list):
    • Used text-davinci-003 to generate the instruction data instead of davinci.
    • Wrote a new prompt (prompt.txt) that explicitly gives the requirements of instruction generation to text-davinci-003.
    • Adopted much more aggressive batch decoding, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
    • Simplified the data generation pipeline by discarding the difference between classification and non-classification instructions.
    • Only generated a single instance for each instruction.
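
The snippet below is a minimal sketch of the batched decoding idea, not the repo’s actual generate_instruction.py. It assumes the legacy OpenAI Completions API that served text-davinci-003 at the time, and a prompt.txt-style prompt that ends with a numbered list; the parsing is simplified.

import openai

def generate_instruction_batch(seed_prompt, batch_size=20):
    """Request `batch_size` new instructions from text-davinci-003 in a single call."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=seed_prompt,                 # prompt ends mid numbered list, e.g. "... 1."
        max_tokens=3072,
        temperature=1.0,
        top_p=1.0,
        stop=[f"\n{batch_size + 1}."],      # stop once the requested number of items is produced
    )
    text = response["choices"][0]["text"]
    # Split the numbered list back into individual instructions.
    return [line.split(".", 1)[1].strip() for line in text.splitlines() if "." in line]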

Fine-tuning

The team created fine-tuned models using standard Hugging Face training code, fine-tuning LLaMA-7B and LLaMA-13B with the following hyperparameters:

Hyperparameter | LLaMA-7B | LLaMA-13B
---------------|----------|----------
Batch size     | 128      | 128
Learning rate  | 2e-5     | 1e-5
Epochs         | 3        | 5
Max length     | 512      | 512
Weight decay   | 0        | 0

Fine-tuning Dependencies and LLaMA Installation

To reproduce the fine-tuned model, first install the requirements:

pip install -r requirements.txt

Command to fine-tune

  • Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP full_shard mode.
  • We were able to reproduce a model of similar quality as the one we hosted in our demo with the following command using Python 3.10.
  • Replace <your_random_port> with a port of your own, <your_path_to_hf_converted_llama_ckpt_and_tokenizer> with the path to your converted checkpoint and tokenizer (following the instructions in the PR), and <your_output_dir> with where you want to store your outputs.
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True
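
Note that the effective batch size of this command matches the table above: 4 (per-device batch) x 8 (gradient-accumulation steps) x 4 (GPUs) = 128.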

The same script also works for OPT fine-tuning. Here’s an example for fine-tuning OPT-6.7B:

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path "facebook/opt-6.7b" \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer' \
    --tf32 True

Addressing OOM

  • Naively, fine-tuning a 7B model requires about 7 x 4 x 4 = 112 GB of VRAM (roughly 7B parameters x 4 bytes each x 4 copies to hold the weights, gradients, and Adam optimizer states).
  • Commands given above enable parameter sharding, so no redundant model copy is stored on any GPU. If you’d like to further reduce the memory footprint, here are some options:

  • Turn on CPU offload for FSDP with --fsdp "full_shard auto_wrap offload". This saves VRAM at the cost of longer runtime.
  • DeepSpeed stage-3 (with offload) can at times be more memory efficient than FSDP with offload. Here’s an example to use DeepSpeed stage-3 with 4 GPUs with both parameter and optimizer offload:
pip install deepspeed
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_offload_opt_param.json" \
    --tf32 True

The DeepSpeed library also provides some helpful functions to estimate memory usage.
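
For example, the ZeRO stage-3 estimator can be run on a loaded model before launching training. This is a minimal sketch; the checkpoint path placeholder is the same one used in the commands above, and no GPU is needed for the estimate.

from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Load the model on CPU just to count parameters.
model = AutoModelForCausalLM.from_pretrained("<your_path_to_hf_converted_llama_ckpt_and_tokenizer>")

# Prints estimated per-GPU / per-CPU memory needs for ZeRO stage-3, with and without offload.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)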

  • LoRA fine-tunes low-rank slices of the query, key, and value projections. This can reduce the total memory footprint from 112 GB to about 7 x 4 = 28 GB; a sketch of a LoRA setup appears below.
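
The Alpaca repo itself does not ship LoRA training code; the following is a minimal sketch of what a LoRA setup could look like with the Hugging Face peft library. The library choice and the hyperparameters (rank, alpha, dropout) are assumptions for illustration, not part of the repo.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("<your_path_to_hf_converted_llama_ckpt_and_tokenizer>")

# Train low-rank adapters on the attention projections only; the base weights stay frozen.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj"],  # query/key/value projections of LLaMA
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters remain trainable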

Recovering Alpaca Weights

  1. Convert Meta’s released weights into Hugging Face format. Follow this guide: https://huggingface.co/docs/transformers/main/model_doc/llama
  2. Make sure you have cloned the released weight diff onto your local machine. The weight diff is located at: https://huggingface.co/tatsu-lab/alpaca-7b/tree/main
  3. Run the recovery script with the correct paths, e.g.:

python weight_diff.py recover --path_raw <path_to_step_1_dir> --path_diff <path_to_step_2_dir> --path_tuned <path_to_store_recovered_weights>

Once step 3 completes, you should have a directory with the recovered weights, from which you can load the model like the following:
import transformers

# Load the recovered Alpaca weights and tokenizer from the directory produced in step 3.
alpaca_model = transformers.AutoModelForCausalLM.from_pretrained("<path_to_store_recovered_weights>")
alpaca_tokenizer = transformers.AutoTokenizer.from_pretrained("<path_to_store_recovered_weights>")
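
To sanity-check the recovered weights, you can prompt the model with the instruction template used during fine-tuning. This is a minimal sketch; the example instruction and the generation settings are illustrative.

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nGive three tips for staying healthy.\n\n### Response:"
)
inputs = alpaca_tokenizer(prompt, return_tensors="pt")
output_ids = alpaca_model.generate(**inputs, max_new_tokens=256)
print(alpaca_tokenizer.decode(output_ids[0], skip_special_tokens=True))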
