
What is TRL?

Quote from TRL’s readme:

TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Built on top of the 🤗 Transformers ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled-up across various hardware setups.

In short, TRL is a library developed by the Hugging Face team. TRL offers a foundational layer for fine-tuning, while frameworks like LlamaFactory offer additional features and a potentially better user interface.

Environment setup

# create environment
conda create -n trl python=3.11
conda activate trl

# Install PyTorch with CUDA support:
module load cuda-cudnn/12.1-8.9.3
pip install torch --index-url https://download.pytorch.org/whl/cu121

pip install trl

# Install llmkit-data for data processing and evaluation
git clone http://62.234.201.16/llm-kit/Data.git
pip install -e Data

# Install flash-attention2 (Optional)
# curl -O https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.2.post1/flash_attn-2.7.2.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
# pip install ./flash_attn-2.7.2.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl --no-build-isolation

# Install deepspeed (Optional, required only for speeding up multi-GPU training)
pip install deepspeed

# Install vllm for inference
pip install vllm

pip install tensorboardX
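
# Optional: a quick sanity check (not part of the original setup) that the main
# packages import correctly and CUDA is visible
python -c "import torch, trl; print(torch.__version__, torch.cuda.is_available())"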

Then download the APPS dataset and create a directory named result to store your training results.

APPS=/path/to/apps
result=/path/to/result
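
# Create the result directory that will hold datasets, checkpoints, and logs
mkdir -p ${result}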

Data processing

This step prepares the dataset in a standard format using the llmkit_data library.

## Process Train set
python -m llmkit_data.cli.prep_apps \
    --apps ${APPS}/train.jsonl \
    --out ${result}/dataset/train.jsonl \
    --type SFT

## Process Test set
python -m llmkit_data.cli.prep_apps \
    --apps ${APPS}/test.jsonl \
    --out ${result}/dataset/test.jsonl \
    --type SFT \
    --prompt_only
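
## Convert the Train set to TRL's dataset format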
python -m llmkit_data.cli.convert_to_trl \
    --dataset ${result}/dataset/train.jsonl \
    --out ${result}/trl_dataset/train.jsonl

This converter can only be used for single-turn instruction datasets. By default, TRL should mask the user prompt and compute the loss only on the response. For multi-turn conversations, you need to find out how TRL handles them and write your own converter.
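
For reference, a converted single-turn record should look roughly like the following (an illustration of TRL's conversational "messages" format; the exact fields written by convert_to_trl may differ):

{"messages": [{"role": "user", "content": "<problem statement>"}, {"role": "assistant", "content": "<reference solution>"}]}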

Training

This section details the training process using TRL.

TRL provides numerous examples on their webpage and in their repository.

Our configuration is based on deepspeed_zero3. To minimize configuration changes, we will use TRL’s example configuration and modify its contents via command-line arguments. Therefore, some fields in the configuration file may not reflect the actual settings used during training.

The script launched by accelerate is a copy of sft.py with a minor modification.

    tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, use_fast=True, padding_side="right"
    )

The key change is adding padding_side="right" to avoid a warning. However, because packing concatenates examples into full-length sequences that need little or no padding, its practical impact may be limited.

Before running the training commands, download the configuration file and training script, place them in $result, and set the necessary environment variables so the commands below can be executed directly.

base_model=/lustre/S/huangdi/open_for_out/models/deepseek-coder-6.7b-instruct
config=${result}/deepspeed_zero3.yaml
script=${result}/sft.py

For a single node with 8 GPUs, you should run:

LAUNCHER="\
accelerate launch \
  --config_file ${config} \
  --num_processes 8 \
"

PROGRAM="\
  ${script} \
  --model_name_or_path ${base_model} \
  --dataset_name ${result}/trl_dataset \
  --bf16 \
  --seed 42 \
  --packing \
  --max_seq_length 2048 \
  --num_train_epochs 2 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 16 \
  --gradient_checkpointing \
  --learning_rate 2.0e-5 \
  --lr_scheduler_type cosine \
  --warmup_steps 25 \
  --torch_dtype bfloat16 \
  --save_strategy no \
  --output_dir ${result}/model \
  --report_to tensorboard \
  --logging_steps=1 \
"

bash -c "$LAUNCHER $PROGRAM"
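
For reference, the effective global batch size implied by these arguments is the per-device batch size × number of processes × gradient-accumulation steps; a quick check:

echo $((2 * 8 * 16))   # = 256 sequences per optimizer step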

To use flash attention for acceleration, you should add the following parameter:

  --attn_implementation flash_attention_2 \

However, in the versions I tested (transformers==4.47.1, flash_attn==2.7.2.post1), they appear to be incompatible, resulting in the following error:

TypeError: _flash_attention_forward() got an unexpected keyword argument 'num_items_in_batch'

To launch a multi-node task, refer to the accelerate documentation. There is also a Slurm example, and a multi-node training walkthrough can be found in Finetuning Llama2 70B.
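
The launcher below assumes that MASTER_ADDR, MASTER_PORT, and USER_GPUS_PER_NODE have already been exported in your Slurm script. One common way to set them (a sketch, not part of the provided templates) is:

# use the first node in the allocation as the rendezvous host
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500                 # any free port
USER_GPUS_PER_NODE=4              # GPUs requested per node
export MASTER_ADDR MASTER_PORT USER_GPUS_PER_NODE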

LAUNCHER="\
accelerate launch \
  --config_file ${config} \
  --num_processes $((SLURM_NNODES * USER_GPUS_PER_NODE)) \
  --num_machines $SLURM_NNODES \
  --main_process_ip $MASTER_ADDR \
  --main_process_port $MASTER_PORT \
  --machine_rank \$SLURM_PROCID \
"

PROGRAM="\
  ${script} \
  --model_name_or_path ${base_model} \
  --dataset_name ${result}/trl_dataset \
  --bf16 \
  --seed 42 \
  --packing \
  --max_seq_length 2048 \
  --num_train_epochs 2 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 16 \
  --gradient_checkpointing \
  --learning_rate 2.0e-5 \
  --lr_scheduler_type cosine \
  --warmup_steps 25 \
  --torch_dtype bfloat16 \
  --save_strategy no \
  --output_dir ${result}/model \
  --report_to tensorboard \
  --logging_steps=1 \
"

bash -c "$LAUNCHER $PROGRAM"

When submitting a task to multi-node Slurm, you may want certain tasks, such as data processing and evaluation, to be performed only once. Here is one way to achieve this:

flag=${result}/finished.flag
if [[ $SLURM_PROCID -eq 0 ]]; then
    # Main process logic
    echo "Main process, preprocessing dataset."
    python -m llmkit_data.cli.prep_apps \
        --apps ${APPS}/train.jsonl \
        --out ${result}/dataset/train.jsonl \
        --type SFT

    python -m llmkit_data.cli.prep_apps \
        --apps ${APPS}/test.jsonl \
        --out ${result}/dataset/test.jsonl \
        --type SFT \
        --prompt_only

    python -m llmkit_data.cli.convert_to_trl \
        --dataset ${result}/dataset/train.jsonl \
        --out ${result}/trl_dataset/train.jsonl

    # Signal other processes to continue
    touch ${flag}
else
    # Other processes wait for the main process to finish
    while [[ ! -f ${flag} ]]; do
        sleep 1
    done
fi

The TensorBoard logs can be found in ${result}/model/runs.
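
To inspect them, point TensorBoard at that directory (standard TensorBoard usage):

tensorboard --logdir ${result}/model/runs --port 6006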

Evaluation

llmkit-data offers tools for performing inference and evaluation. Here’s how to use them:
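
The commands below assume samples and evals point to output paths of your choosing, for example (hypothetical paths):

samples=${result}/samples.jsonl
evals=${result}/evals.jsonl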

python -m llmkit_data.cli.sample \
    --prompts ${result}/dataset/test.jsonl \
    --out ${samples} \
    --model ${result}/model \
    --gpu_per_model 1

python -m llmkit_data.cli.eval_apps \
    --samples ${samples} \
    --out ${evals} \
    --apps ${APPS}

Similar to single-node evaluation, we can perform multi-node evaluation using the Slurm environment variable SLURM_PROCID:

if [[ $SLURM_PROCID -eq 0 ]]; then
    # Main process logic
    echo "Main process, evaluating result."

    python -m llmkit_data.cli.sample \
        --prompts ${result}/dataset/test.jsonl \
        --out ${samples} \
        --model ${result}/model \
        --gpu_per_model 1

    python -m llmkit_data.cli.eval_apps \
        --samples ${samples} \
        --out ${evals} \
        --apps ${APPS}

    rm ${flag}
fi

This script ensures that only the main process (the one with SLURM_PROCID equal to 0) performs inference and evaluation, avoiding redundant computation on other nodes.

Results

Performance Comparison: 1-Node vs. 2-Node (A100-40G)

| Metric | 1 Node (8 GPUs) | 2 Nodes (4 GPUs/node) |
| --- | --- | --- |
| Train Runtime (seconds) | 19750.8989 | 13263.2354 |
| Train Samples/Second | 3.688 | 5.492 |
| Train Steps/Second | 0.014 | 0.021 |
| Train Loss | 0.3478 | 0.3483 |

As the table clearly shows, the 2-node configuration significantly outperforms the 1-node configuration in terms of runtime and throughput (samples/second and steps/second), despite using the same total number of GPUs. The difference in loss is negligible.

1 Node * 8 A100-40G

| Difficulty | pass@1 | pass@5 | pass@10 |
| --- | --- | --- | --- |
| total | 0.1154316069057105 | 0.2112660469234174 | 0.2557768924302789 |
| interview | 0.06989365603226988 | 0.15193159527592973 | 0.19508617528419508 |
| competition | 0.006774193548387097 | 0.027956989247311825 | 0.04516129032258064 |
| introductory | 0.33228021978021977 | 0.5115831589045875 | 0.5728021978021978 |

2 Nodes * 4 A100-40G

| Difficulty | pass@1 | pass@5 | pass@10 |
| --- | --- | --- | --- |
| total | 0.11282868525896414 | 0.20518771474946773 | 0.25152722443559095 |
| interview | 0.06725339200586725 | 0.1475602586713698 | 0.19288595526219288 |
| competition | 0.0070967741935483875 | 0.0228494623655914 | 0.03225806451612903 |
| introductory | 0.32857142857142857 | 0.49869723530437815 | 0.5645604395604396 |

base

| Difficulty | pass@1 | pass@5 | pass@10 |
| --- | --- | --- | --- |
| total | 0.11723771580345287 | 0.20706064630367418 | 0.24833997343957503 |
| interview | 0.07418408507517418 | 0.1581393589094359 | 0.19801980198019803 |
| competition | 0.009354838709677418 | 0.026420890937019968 | 0.03870967741935484 |
| introductory | 0.3244505494505494 | 0.46723465027036454 | 0.5260989010989011 |

1 Node * 8 A100-40G Loss

[Figure: training loss curve, 1 node × 8 GPUs]

2 Nodes * 4 A100-40G Loss

[Figure: training loss curve, 2 nodes × 4 GPUs per node]

While the provided script offers a starting point for training large language models using TRL, further investigation is needed to optimize model performance. The evaluation benchmarks haven’t shown a significant improvement after training.

Download

Here are the resources included in this guide:

For Slurm scripts, you should copy the following templates: /tools/template.gpu.slurm for single-node jobs, and /tools/template.multi-gpus.slurm together with /tools/template.multi-gpus-task.sh for multi-node jobs. Follow the instructions within the Slurm scripts to configure them, and execute your script in the job step.
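
A typical workflow with these templates is sketched below (filenames are illustrative; follow the templates' inline instructions for the actual configuration):

cp /tools/template.multi-gpus.slurm train.slurm
cp /tools/template.multi-gpus-task.sh task.sh
# edit both files as instructed in their comments, then submit:
sbatch train.slurm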