What is TRL?¶
A quote from TRL’s README:
TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Built on top of the 🤗 Transformers ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled-up across various hardware setups.
In short, TRL is a library developed by the Hugging Face team. TRL offers a foundational layer for fine-tuning, while frameworks like LlamaFactory offer additional features and a potentially friendlier user interface.
Environment setup¶
# create environment
conda create -n trl python=3.11
conda activate trl
# Install PyTorch with CUDA support:
module load cuda-cudnn/12.1-8.9.3
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install trl
# Install llmkit-data for data processing and evaluation
git clone http://62.234.201.16/llm-kit/Data.git
pip install -e Data
# Install flash-attention2 (Optional)
# curl -O https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.2.post1/flash_attn-2.7.2.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
# pip install ./flash_attn-2.7.2.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl --no-build-isolation
# Install deepspeed (Optional, required only for speeding up multi-GPU training)
pip install deepspeed
# Install vllm for inference
pip install vllm
pip install tensorboardX
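To verify that the CUDA-enabled PyTorch build was picked up, a quick optional sanity check:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import trl; print(trl.__version__)"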
Then download the APPS dataset and create a directory named result to store your training outputs, and point the following variables at them:
APPS=/path/to/apps
result=/path/to/result
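The steps below write into subdirectories of ${result}. Creating them up front is a harmless precaution in case any of the tools does not create its output directory itself:
mkdir -p ${result}/dataset ${result}/trl_dataset ${result}/model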
Data processing¶
This step prepares the dataset in a standard format using the llmkit_data library.
## Process Train set
python -m llmkit_data.cli.prep_apps \
--apps ${APPS}/train.jsonl \
--out ${result}/dataset/train.jsonl \
--type SFT
## Process Test set
python -m llmkit_data.cli.prep_apps \
--apps ${APPS}/test.jsonl \
--out ${result}/dataset/test.jsonl \
--type SFT \
--prompt_only
python -m llmkit_data.cli.convert_to_trl \
--dataset ${result}/dataset/train.jsonl \
--out ${result}/trl_dataset/train.jsonl
This converter only supports single-turn instruction datasets. By default, TRL masks the user prompt and trains only on the response. For multi-turn conversations, you need to find out how TRL handles them and write your own converter.
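Before training, it is worth inspecting the first converted record to sanity-check the output; the exact field names depend on llmkit-data's converter, so treat this as a quick look rather than a schema reference:
head -n 1 ${result}/trl_dataset/train.jsonl | python -m json.tool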
Training¶
This section details the training process using TRL.
TRL provides numerous examples on their webpage and in their repository.
Our configuration is based on deepspeed_zero3. To minimize configuration changes, we use TRL’s example configuration and override its settings via command-line arguments. As a result, some fields in the configuration file may not reflect the actual settings used during training.
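TRL keeps its example accelerate configurations in its repository; assuming the current repository layout (verify the path against the TRL version you installed), the ZeRO-3 example can be fetched with:
curl -o ${result}/deepspeed_zero3.yaml \
    https://raw.githubusercontent.com/huggingface/trl/main/examples/accelerate_configs/deepspeed_zero3.yaml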
The script launched by accelerate is a copy of TRL's sft.py with a minor modification.
from transformers import AutoTokenizer  # as in TRL's sft.py

tokenizer = AutoTokenizer.from_pretrained(
    model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code,
    use_fast=True, padding_side="right",  # padding_side added to avoid a warning
)
The key change is adding padding_side="right" to avoid a warning; however, since packing is enabled, its practical impact may be limited.
Before running the training commands, download the configuration file and training script, place them in $result, and set the following environment variables so the commands below can be executed directly:
base_model=/lustre/S/huangdi/open_for_out/models/deepseek-coder-6.7b-instruct
config=${result}/deepspeed_zero3.yaml
script=${result}/sft.py
For a single node with 8 GPUs, you should run:
LAUNCHER="\
accelerate launch \
--config_file ${config} \
--num_processes 8 \
"
PROGRAM="\
${script} \
--model_name_or_path ${base_model} \
--dataset_name ${result}/trl_dataset \
--bf16 \
--seed 42 \
--packing \
--max_seq_length 2048 \
--num_train_epochs 2 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16 \
--gradient_checkpointing \
--learning_rate 2.0e-5 \
--lr_scheduler_type cosine \
--warmup_steps 25 \
--torch_dtype bfloat16 \
--save_strategy no \
--output_dir ${result}/model \
--report_to tensorboard \
--logging_steps=1 \
"
bash -c "$LAUNCHER $PROGRAM"
To use flash attention for acceleration, you should add the following parameter:
--attn_implementation flash_attention_2 \
However, in the versions I tested (transformers==4.47.1, flash_attn==2.7.2.post1), they appear to be incompatible, resulting in the following error:
TypeError: _flash_attention_forward() got an unexpected keyword argument 'num_items_in_batch'
To launch a multi-node task, refer to the accelerate documentation. There is also a Slurm example, and a multi-node training walkthrough can be found in Finetuning Llama2 70B.
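The launcher below assumes MASTER_ADDR, MASTER_PORT, and USER_GPUS_PER_NODE are already set. Inside a Slurm batch script, a common way to derive them is the following sketch (the port is an arbitrary free choice):
USER_GPUS_PER_NODE=4
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500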
LAUNCHER="\
accelerate launch \
--config_file ${config} \
--num_processes $((SLURM_NNODES * USER_GPUS_PER_NODE)) \
--num_machines $SLURM_NNODES \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--machine_rank \$SLURM_PROCID \
"
PROGRAM="\
${script} \
--model_name_or_path ${base_model} \
--dataset_name ${result}/trl_dataset \
--bf16 \
--seed 42 \
--packing \
--max_seq_length 2048 \
--num_train_epochs 2 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16 \
--gradient_checkpointing \
--learning_rate 2.0e-5 \
--lr_scheduler_type cosine \
--warmup_steps 25 \
--torch_dtype bfloat16 \
--save_strategy no \
--output_dir ${result}/model \
--report_to tensorboard \
--logging_steps=1 \
"
bash -c "$LAUNCHER $PROGRAM"
When submitting a multi-node Slurm job, you may want certain steps, such as data processing and evaluation, to run only once. Here is one way to achieve this:
flag=${result}/finished.flag
if [[ $SLURM_PROCID -eq 0 ]]; then
# Main process logic
echo "Main process, preprocessing dataset."
python -m llmkit_data.cli.prep_apps \
--apps ${APPS}/train.jsonl \
--out ${result}/dataset/train.jsonl \
--type SFT
python -m llmkit_data.cli.prep_apps \
--apps ${APPS}/test.jsonl \
--out ${result}/dataset/test.jsonl \
--type SFT \
--prompt_only
python -m llmkit_data.cli.convert_to_trl \
--dataset ${result}/dataset/train.jsonl \
--out ${result}/trl_dataset/train.jsonl
# Signal other processes to continue
touch ${flag}
else
# Other processes wait for the main process to finish
while [[ ! -f ${flag} ]]; do
sleep 1
done
fi
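One caveat: a finished.flag left over from a previous run would let the waiting ranks fall through immediately. Deleting it before the job step starts avoids this; run the cleanup once in the batch script prologue, not inside the per-rank logic above:
rm -f ${result}/finished.flag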
The TensorBoard logs can be found in ${result}/model/runs.
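To view them, point TensorBoard at that directory (this assumes the tensorboard CLI is available in your environment; tensorboardX only provides the writer):
tensorboard --logdir ${result}/model/runs --port 6006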
Evaluation¶
llmkit-data offers tools for performing inference and evaluation.
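The ${samples} and ${evals} output paths are not defined elsewhere in this guide; any writable locations work, for example (the names are an assumption, adjust them to your layout):
samples=${result}/samples
evals=${result}/evals
With these set, run: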
python -m llmkit_data.cli.sample \
--prompts ${result}/dataset/test.jsonl \
--out ${samples} \
--model ${result}/model \
--gpu_per_model 1
python -m llmkit_data.cli.eval_apps \
--samples ${samples} \
--out ${evals} \
--apps ${APPS}
In a multi-node job, we can again use the Slurm environment variable SLURM_PROCID to run inference and evaluation only once:
if [[ $SLURM_PROCID -eq 0 ]]; then
# Main process logic
echo "Main process, evaluating result."
python -m llmkit_data.cli.sample \
--prompts ${result}/dataset/test.jsonl \
--out ${samples} \
--model ${result}/model \
--gpu_per_model 1
python -m llmkit_data.cli.eval_apps \
--samples ${samples} \
--out ${evals} \
--apps ${APPS}
rm ${flag}
fi
This script ensures that only the main process (SLURM_PROCID -eq 0) performs inference and evaluation, avoiding redundant computation on other nodes.
Results¶
Performance Comparison: 1-Node vs. 2-Node (A100-40G)
| Metric | 1 Node (8 GPUs) | 2 Nodes (4 GPUs/node) |
| --- | --- | --- |
| Train Runtime (seconds) | 19750.8989 | 13263.2354 |
| Train Samples/Second | 3.688 | 5.492 |
| Train Steps/Second | 0.014 | 0.021 |
| Train Loss | 0.3478 | 0.3483 |
As the table shows, the 2-node configuration outperforms the 1-node configuration in runtime and throughput (samples/second and steps/second), despite using the same total number of GPUs. The difference in loss is negligible.
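The tables below report pass@k per APPS difficulty split. Assuming llmkit-data uses the standard unbiased estimator from the Codex paper (an assumption about its implementation), pass@k is computed as E[1 - C(n-c, k) / C(n, k)], where n samples are generated per problem and c of them pass all tests.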
1 Node * 8 A100-40G
| Difficulty | pass@1 | pass@5 | pass@10 |
| --- | --- | --- | --- |
| total | 0.1154 | 0.2113 | 0.2558 |
| interview | 0.0699 | 0.1519 | 0.1951 |
| competition | 0.0068 | 0.0280 | 0.0452 |
| introductory | 0.3323 | 0.5116 | 0.5728 |
2 Nodes * 4 A100-40G
| Difficulty | pass@1 | pass@5 | pass@10 |
| --- | --- | --- | --- |
| total | 0.1128 | 0.2052 | 0.2515 |
| interview | 0.0673 | 0.1476 | 0.1929 |
| competition | 0.0071 | 0.0228 | 0.0323 |
| introductory | 0.3286 | 0.4987 | 0.5646 |
Base model (no fine-tuning)
| Difficulty | pass@1 | pass@5 | pass@10 |
| --- | --- | --- | --- |
| total | 0.1172 | 0.2071 | 0.2483 |
| interview | 0.0742 | 0.1581 | 0.1980 |
| competition | 0.0094 | 0.0264 | 0.0387 |
| introductory | 0.3245 | 0.4672 | 0.5261 |
Loss curves: 1 Node * 8 A100-40G and 2 Nodes * 4 A100-40G (see the TensorBoard logs under ${result}/model/runs).
While the provided scripts offer a starting point for training large language models with TRL, further investigation is needed to optimize model performance: the evaluation benchmarks show no significant improvement over the base model after training.
Download¶
The resources included in this guide are the configuration file (deepspeed_zero3.yaml) and the training script (sft.py) referenced above.
For Slurm scripts, copy the following templates: /tools/template.gpu.slurm for single-node jobs, and /tools/template.multi-gpus.slurm together with /tools/template.multi-gpus-task.sh for multi-node jobs.
Follow the instructions within the Slurm scripts to configure them, and execute your script in the job step.