Stanford Alpaca
Introduction
- Stanford Alpaca is an instruction-following LLaMA model.
- This repo aims to build and share an instruction-following LLaMA model. It contains:
- The 52K instruction-following data used for fine-tuning the model.
- The code for generating the data.
- The code for fine-tuning the model.
- The code for recovering Alpaca-7B weights from our released weight diff.
Overview
- The current Alpaca 7B model is fine-tuned from a 7B LLaMA model on 52K instruction-following examples generated with the techniques described in the Self-Instruct paper.
- Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite.
- Alpaca is still under development, and there are many limitations that have to be addressed.
- Alpaca is not yet fine-tuned to be safe and harmless.
- Initial release contains the data generation procedure, dataset, and training recipe.
- Model weights may be released if the creators of LLaMA give permission.
- A live demo is available to help readers better understand the capabilities and limits of Alpaca.
- Based on the following papers:
- LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)
- Self-Instruct: Aligning Language Models with Self-Generated Instructions (Wang et al., 2022)
Data Release
- alpaca_data.json contains the 52K instruction-following examples we used for fine-tuning the Alpaca model. This JSON file is a list of dictionaries; each dictionary contains the following fields: instruction, input, and output (the answer generated by text-davinci-003).
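For illustration, here is a minimal sketch of loading and inspecting the dataset with plain Python (it assumes alpaca_data.json is in the working directory):
import json

# Load the 52K instruction-following examples (a list of dictionaries).
with open("alpaca_data.json") as f:
    data = json.load(f)

print(len(data))               # roughly 52,000 examples
example = data[0]
print(example["instruction"])  # the task description
print(example["input"])        # optional context; empty string when unused
print(example["output"])       # the answer generated by text-davinci-003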
High-level Activities of the Alpaca Project
High-level activities carried out by the Stanford Alpaca team, and the project output:
- Data Generation: The team used OpenAI's GPT-3.5 model (text-davinci-003) to generate a dataset of 52,000 instruction-response pairs. They did this by providing the model with a variety of instructions and asking it to produce corresponding responses.
- Fine-Tuning: They used this generated dataset to fine-tune Meta's LLaMA model, making it better at following instructions.
- Evaluation: The fine-tuned Alpaca model was then evaluated on its ability to follow instructions effectively, comparing its performance to more advanced models.
- Output: The project resulted in a fine-tuned version of the LLaMA model, called Alpaca, which is smaller, more efficient, and capable of following instructions well.
Capabilities
This model can perform the following tasks.
Data Generation
Output
- This process produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500).
- The 52K generated examples are much more diverse than the data released by Self-Instruct.
Process
- Built on the data generation pipeline from Self-Instruct and made the following modifications:
- Used text-davinci-003 to generate the instruction data instead of davinci.
- Wrote a new prompt (prompt.txt) that explicitly gave the requirement of instruction generation to text-davinci-003.
- Adopted much more aggressive batch decoding, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
- Simplified the data generation pipeline by discarding the difference between classification and non-classification instructions.
- Only generated a single instance for each instruction. (A sketch of the batch-decoding step follows this list.)
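For illustration only, a hedged sketch of what aggressive batch decoding with the legacy OpenAI completions SDK (openai<1.0) might look like; the prompt file name, batch size, and request parameters are assumptions rather than the project's exact code:
import openai  # legacy SDK (openai<1.0), which exposes text-davinci-003 via Completion

# Assumed prompt file containing the instruction-generation requirements (cf. prompt.txt).
prompt = open("prompt.txt").read()

# Batch decoding: send several prompts per request, each asking for many instructions at once,
# which cuts the number of API calls and therefore the cost.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=[prompt] * 4,   # hypothetical batch of 4 prompts per request
    max_tokens=3072,
    temperature=1.0,
    top_p=1.0,
)
for choice in response["choices"]:
    print(choice["text"])  # each completion is then parsed into individual instructions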
Fine-tuning
The team fine-tuned LLaMA-7B and LLaMA-13B using standard Hugging Face training code, with the following hyperparameters:
- Batch size: 128 (LLaMA-7B), 128 (LLaMA-13B)
- Learning rate: 2e-5 (LLaMA-7B), 1e-5 (LLaMA-13B)
- Epochs: 3 (LLaMA-7B), 5 (LLaMA-13B)
- Max length: 512 (LLaMA-7B), 512 (LLaMA-13B)
- Weight decay: 0 (LLaMA-7B), 0 (LLaMA-13B)
Fine-tuning Dependencies and LLaMA Installation
To reproduce the fine-tuned model, first install the requirements:
pip install -r requirements.txt
Command to fine-tune
- Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP full_shard mode.
- We were able to reproduce a model of similar quality as the one we hosted in our demo with the following command using Python 3.10.
- Replace <your_random_port> with a port of your own, <your_path_to_hf_converted_llama_ckpt_and_tokenizer> with the path to your converted checkpoint and tokenizer (following the instructions in the PR), and <your_output_dir> with where you want to store your outputs.
- Note that the effective batch size of 128 comes from 4 GPUs x a per-device batch size of 4 x 8 gradient accumulation steps.
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
--model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir <your_output_dir> \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True
The same script also works for OPT fine-tuning. Here's an example for fine-tuning OPT-6.7B:
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
--model_name_or_path "facebook/opt-6.7b" \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir <your_output_dir> \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer' \
--tf32 True
Addressing OOM
- Naively, fine-tuning a 7B model requires about 7 x 4 x 4 = 112 GB of VRAM: roughly 7B parameters, 4 bytes per parameter, and a factor of 4 for the weights, gradients, and the two Adam optimizer states.
- The commands given above enable parameter sharding, so no redundant model copy is stored on any GPU. If you'd like to further reduce the memory footprint, here are some options:
- Turn on CPU offload for FSDP with --fsdp "full_shard auto_wrap offload". This saves VRAM at the cost of a longer runtime.
- DeepSpeed stage-3 (with offload) can at times be more memory efficient than FSDP with offload. Here’s an example to use DeepSpeed stage-3 with 4 GPUs with both parameter and optimizer offload:
pip install deepspeed
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
--model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir <your_output_dir> \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--deepspeed "./configs/default_offload_opt_param.json" \
--tf32 True
The DeepSpeed library also provides some helpful functions to estimate memory usage.
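For example, a minimal sketch (not the project's own tooling) using DeepSpeed's ZeRO stage-3 memory estimator; the checkpoint path placeholder matches the one used above:
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Load the model on CPU only to count parameters; no training happens here.
model = AutoModelForCausalLM.from_pretrained("<your_path_to_hf_converted_llama_ckpt_and_tokenizer>")

# Prints estimated per-GPU and per-CPU memory needs for ZeRO stage-3,
# with and without offload, for a single node with 4 GPUs.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)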
- LoRA fine-tunes low-rank slices of the query, key, and value embedding heads. This can reduce the total memory footprint from 112 GB to about 7 x 4 = 28 GB. (A sketch using the peft library follows.)
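This repo does not include LoRA training code; as an illustration only, here is a hedged sketch of applying LoRA with the Hugging Face peft library (the target module names q_proj/k_proj/v_proj and the hyperparameters below are assumptions, not settings from this project):
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("<your_path_to_hf_converted_llama_ckpt_and_tokenizer>")

# Attach low-rank adapters to the query/key/value projections; the base weights stay frozen,
# so optimizer state is only kept for the small adapter matrices.
lora_config = LoraConfig(
    r=8,                  # assumed rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable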
Recovering Alpaca Weights
1. Convert Meta's released weights into Hugging Face format by following this guide: https://huggingface.co/docs/transformers/main/model_doc/llama
2. Make sure you have cloned the released weight diff to your local machine. The weight diff is located at: https://huggingface.co/tatsu-lab/alpaca-7b/tree/main
3. Run the recovery script with the correct paths, e.g. (placeholder names are illustrative):
python weight_diff.py recover --path_raw <path_to_converted_llama_weights> --path_diff <path_to_cloned_weight_diff> --path_tuned <path_to_store_recovered_weights>
Once step 3 completes, you should have a directory with the recovered weights, from which you can load the model like the following:
import transformers

# Load the recovered Alpaca-7B weights and tokenizer from the directory produced in step 3.
alpaca_model = transformers.AutoModelForCausalLM.from_pretrained("<path_to_store_recovered_weights>")
alpaca_tokenizer = transformers.AutoTokenizer.from_pretrained("<path_to_store_recovered_weights>")
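As a quick, illustrative sanity check (the prompt template below follows the Alpaca training format, but verify it against the repo's code before relying on it):
# Build a prompt in the assumed Alpaca instruction format and generate a response.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nGive three tips for staying healthy.\n\n### Response:"
)
inputs = alpaca_tokenizer(prompt, return_tensors="pt")
output_ids = alpaca_model.generate(**inputs, max_new_tokens=128)
print(alpaca_tokenizer.decode(output_ids[0], skip_special_tokens=True))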