
Experiments on APPS dataset

To test the SFT training performance of OpenRLHF, we run experiments on the APPS dataset: we conduct SFT training on the APPS training split and evaluate the resulting model on the APPS test split. The configurations of our experiment are listed below.

Dataset source: https://huggingface.co/datasets/codeparrot/apps/tree/main

Model: deepseek-coder-6.7b-instruct

Package versions:

  • openrlhf==0.5.3
  • transformers==4.46.3
  • ray==2.12.0
  • flash-attn==2.7.0.post2
  • deepspeed==0.15.0
  • vllm==0.6.4.post1
  • torch==2.5.1+cu121
  • python 3.11.10

The full package list is provided here.

About package installation: the typical sequence is as follows. First, install torch. Next, download and install the appropriate version of flash-attn from the official releases page (installing flash-attn directly through pip may get stuck during the build). It appears that the wheels need cxx11abiFALSE to function properly. Finally, install the remaining packages, for which running pip install openrlhf vllm should suffice.
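A concrete sketch of that sequence follows. The flash-attn wheel filename is an assumption inferred from the version list above; pick the cxx11abiFALSE wheel matching your Python, CUDA, and torch versions from the releases page.

# 1. torch first (CUDA 12.1 build)
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# 2. flash-attn from a prebuilt wheel
#    (https://github.com/Dao-AILab/flash-attention/releases; filename is illustrative)
pip install flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl

# 3. the remaining packages resolve through pip
pip install openrlhf==0.5.3 vllm==0.6.4.post1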

Training procedure

I run the experiments on 8 A100-PCIE-40GB GPUs (r8a100-a) on r8nv-gpu-dist. I use 2 nodes (r8a100-a[02,03]) with 4 GPUs each. The main training script is as follows:

MODEL_PATH="/lustre/S/huangdi/open_for_out/models/deepseek-coder-6.7b-instruct"
OUTPUT_DIR="./checkpoint/dsc_6.7b_inst_sft_apps"
DATAFILE_PATHS="data/train_sft.jsonl"
HOSTFILE="${SLURM_SUBMIT_DIR}/hostfile"

deepspeed --hostfile $HOSTFILE --no_ssh --node_rank $SLURM_PROCID \
  --master_addr $MASTER_ADDR --master_port=$MASTER_PORT \
  --module openrlhf.cli.train_sft \
  --pretrain $MODEL_PATH \
  --dataset $DATAFILE_PATHS \
  --save_path $OUTPUT_DIR/model \
  --max_len 2048 \
  --input_key question \
  --output_key response \
  --train_batch_size 256 \
  --micro_train_batch_size 2 \
  --max_epochs 2 \
  --max_samples 5000000 \
  --save_steps 16000 \
  --logging_steps 1 \
  --eval_steps -1 \
  --zero_stage 3 \
  --seed 42 \
  --bf16 \
  --flash_attn \
  --learning_rate 2e-5 \
  --packing_samples \
  --use_tensorboard $OUTPUT_DIR/runs \
  --apply_chat_template \
  --gradient_checkpointing
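Given --input_key question and --output_key response (with --apply_chat_template wrapping the question into the model's chat format), each line of data/train_sft.jsonl is expected to be a JSON object of roughly the following shape; the field contents shown are illustrative:

# Inspect one training sample
head -n 1 data/train_sft.jsonl
# {"question": "<APPS problem statement>", "response": "<reference solution code>"}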

During multi-node training, I encountered an error of the following form:

torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
==================================================
finetune.py FAILED
--------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-29_22:03:17
  host      : finetune-job-96q2q
  rank      : 4 (local_rank: 4)
  exitcode  : -7 (pid: 10)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 10
==================================================

This occurs because the shared memory size is insufficient. Adding the following SBATCH options to the Slurm script resolves the error:

#SBATCH --mem=0
#SBATCH --exclusive
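For context, here is a minimal sketch of the surrounding Slurm submission script, assuming one launcher task per node; train_sft.sh stands for the deepspeed command above, and details such as the port number are placeholders:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --mem=0               # use all node memory (avoids the SIGBUS above)
#SBATCH --exclusive

# Build a DeepSpeed hostfile from the allocated nodes, 4 GPU slots per node
scontrol show hostnames "$SLURM_JOB_NODELIST" | \
  awk '{print $1" slots=4"}' > "${SLURM_SUBMIT_DIR}/hostfile"

# Rendezvous on the first allocated node
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# srun starts one copy per node; SLURM_PROCID (0 or 1) becomes --node_rank
srun bash train_sft.sh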

Please note that compared to other frameworks, OpenRLHF consumes more GPU memory. To mitigate this, I use --gradient_checkpointing, which trades extra recomputation for lower memory usage.

The training times for the two epochs are 5h07m54s and 5h07m30s, respectively, on the 8 GPUs from r8a100-a[02,03]. Training on r8nv-gpu-dist appears to be faster than on r8nv-gpu-hw.

I also run an experiment without packing under the same GPU configuration; the total time for the 2 epochs is 10h20m18s.

Training loss

Train loss with packing:

[Figure: training loss curve with packing]

The training loss is illustrated in the figure above, with a noticeable drop around step 400, the start of the second epoch. This drop appears to be due to the model having memorized the samples during the first epoch and beginning to overfit in the second. For further discussion, please visit this page on Zhihu.

For comparison, the training loss without packing:

[Figure: training loss curve without packing]

Evaluation

Currently, a total of 3,765 problems from the test set are evaluated, after filtering out problems that lack solutions. For each problem, 10 code samples are generated, with sampling temperature 0.6 and top_p 0.95.
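I assume pass@k here is the standard unbiased estimator from the HumanEval paper (whether llmkit_data.cli.eval_apps uses exactly this convention is an assumption): with n = 10 samples per problem, of which c pass all tests,

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]$$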

The inference stage is run on 4 GPUs from the node r8nv-gpu-hw-80g with the following command:

python -m llmkit_data.cli.sample --prompts $DATAFILE_PATH --out $SAMPLE_PATH --model $MODEL_PATH --gpu_per_model 4

The evaluation stage is run on the node r8cpu with 32 CPU cores using the following command, where $APPS_PATH is the folder containing the train and test JSONL files (take care not to put other data files in this folder):

python -m llmkit_data.cli.eval_apps --samples $SAMPLE_PATH --out $RESULT_PATH --apps $APPS_PATH

Inference is conducted with vLLM, taking 30m43s for code generation and approximately 39 minutes for evaluation.

For the model trained with packing, code generation takes 42m01s and evaluation takes around 39 minutes.

Results:

The pass@1, pass@5, and pass@10 statistics across different difficulty levels are presented in the table below:

Difficulty      pass@1    pass@5    pass@10
total           0.1278    0.2306    0.2789
introductory    0.3449    0.5100    0.5604
interview       0.0831    0.1776    0.2285
competition     0.0116    0.0409    0.0613

For comparison, the results without packing:

Difficulty      pass@1    pass@5    pass@10
total           0.1298    0.2335    0.2805
introductory    0.3459    0.5280    0.5934
interview       0.0853    0.1767    0.2219
competition     0.0139    0.0419    0.0613

For comparison, I also run an experiment with the original deepseek-coder-6.7b-instruct model. Its inference stage takes 47m50s and its evaluation stage takes around 53 minutes. Inference is slower because the model sometimes generates additional text; the reason for the slower evaluation stage is currently unknown. The pass@k statistics are shown in the following table:

Difficulty      pass@1    pass@5    pass@10
total           0.1172    0.2071    0.2483
introductory    0.3245    0.4672    0.5261
interview       0.0742    0.1581    0.1980
competition     0.0094    0.0264    0.0387