
Experiments on the APPS dataset

To test the SFT training performance of LlamaFactory, we run experiments on the APPS dataset: we perform SFT on the APPS training split and evaluate the resulting model on the APPS test split. The configuration of our experiment is listed below.

Dataset source: https://huggingface.co/datasets/codeparrot/apps/tree/main

Model: deepseek-coder-6.7b-instruct

Requirements: We use Python 3.10.13.

  1. Clone the source code:
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
# We use version 0.8.3.dev0
git checkout bda302fbfbdb114dee7782d405732600d2d73279
  2. Install the requirements: pip install -r assets/LlamaFactory/requirements.txt
  3. Download and install flash_attn: https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.0/flash_attn-2.5.0+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
  4. Install LlamaFactory:
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
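LlamaFactory expects the training data in one of its registered formats. Our exact preprocessing script is not shown here; the sketch below illustrates one plausible way to turn APPS problems and their reference solutions into an alpaca-style JSON file (the output path, instruction formatting, and one-solution-per-problem choice are assumptions, not our actual pipeline).

```python
# Sketch: convert the APPS training split into an alpaca-style JSON file
# that LlamaFactory can register as a custom SFT dataset.
# Assumption: one (problem, first reference solution) pair per training example.
import json
from datasets import load_dataset

apps_train = load_dataset("codeparrot/apps", "all", split="train")

records = []
for item in apps_train:
    # "solutions" is a JSON-encoded list; it may be empty for some problems
    solutions = json.loads(item["solutions"]) if item["solutions"] else []
    if not solutions:
        continue  # skip problems without reference solutions
    records.append({
        "instruction": item["question"],  # problem statement
        "input": "",
        "output": solutions[0],           # first reference solution
    })

with open("data/apps_train_sft.json", "w") as f:  # hypothetical output path
    json.dump(records, f)
```

The resulting file would then be registered in data/dataset_info.json so that it can be referenced by name from the training config.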

Training procedure

We run the experiments on 8 A100-PCIE-40GB GPUs (r8a100-a02). The detailed training configuration is in assets/LlamaFactory/dscoderinst_full_sft_ds3.yaml; copy it to examples/train_full in the LlamaFactory repo.
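The YAML itself is not reproduced on this page. For orientation, a full-parameter SFT config with DeepSpeed ZeRO-3 in LlamaFactory 0.8.x looks roughly like the sketch below; all paths and hyperparameter values here are illustrative assumptions, not the settings from dscoderinst_full_sft_ds3.yaml.

```yaml
# Sketch of a full-parameter SFT + DeepSpeed ZeRO-3 config for LlamaFactory 0.8.x.
# All paths and hyperparameter values are illustrative assumptions.
model_name_or_path: deepseek-ai/deepseek-coder-6.7b-instruct

stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

dataset: apps_train_sft          # key registered in data/dataset_info.json
template: deepseekcoder
cutoff_len: 4096
packing: true                    # set to false for the no-packing run

output_dir: saves/dscoderinst-full-sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 1.0e-5
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
logging_steps: 10
save_steps: 500
```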

The bash command for training:

llamafactory-cli train examples/train_full/dscoderinst_full_sft_ds3.yaml

Total training time for the two epochs:

Training with packing: 3h33min21s.

Training without packing: 8h58min33s.
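The speed-up comes from packing: multiple tokenized examples are concatenated into one fixed-length training sequence instead of padding every example up to the cutoff length, so far fewer tokens per batch are wasted on padding. The snippet below is a generic illustration of greedy packing, not LlamaFactory's actual implementation.

```python
# Generic illustration of greedy sequence packing (not LlamaFactory's code):
# concatenate tokenized examples until the cutoff length is reached,
# so the model sees far fewer padded tokens per batch.
from typing import List

def pack_sequences(examples: List[List[int]], cutoff_len: int) -> List[List[int]]:
    packed, current = [], []
    for tokens in examples:
        if current and len(current) + len(tokens) > cutoff_len:
            packed.append(current)
            current = []
        current.extend(tokens[:cutoff_len])  # truncate overly long examples
    if current:
        packed.append(current)
    return packed

# Example: three short examples packed into two sequences of length <= 8
print(pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], cutoff_len=8))
# -> [[1, 2, 3, 4, 5], [6, 7, 8, 9]]
```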

Training loss

Training with packing:

[Figure: train loss curve, with packing]

Last step loss: 0.3562

Training without packing:

[Figure: train loss curve, without packing]

Last step loss: 0.3207

Evaluation

A total of 3,765 problems from the test set are evaluated; problems lacking solutions are filtered out. For each problem, 10 code samples are generated with sampling temperature 0.6 and top_p 0.95.
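With n = 10 samples per problem, pass@k is typically computed with the unbiased estimator from the Codex paper (Chen et al., 2021); whether llmkit_data uses exactly this estimator is an assumption on our part. A minimal sketch:

```python
# Unbiased pass@k estimator (Chen et al., 2021): given n generated samples
# of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # fewer failures than k, so at least one success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 of them correct
print(pass_at_k(10, 3, 1))  # 0.30
print(pass_at_k(10, 3, 5))  # ~0.917
```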

The inference stage is run on 8 GPUs (r8a100-a02) with the following commands:

python -m llmkit_data.cli.sample --prompts $DATAFILE_PATH --out $SAMPLE_PATH --model $MODEL_PATH --gpu_per_model 1
python -m llmkit_data.cli.eval_apps --samples $SAMPLE_PATH --out $RESULT_PATH --apps $APPS_PATH

Inference is conducted with vLLM, taking 16min35s for code generation and approximately 20 minutes for evaluation.
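Sampling is wrapped by the llmkit_data CLI above; for reference, the same sampling settings expressed directly with the vLLM Python API would look roughly like the sketch below (the model path, prompt, and max_tokens are placeholders).

```python
# Rough sketch of the sampling settings with the vLLM Python API
# (model path, prompt, and max_tokens are placeholders, not the values used above).
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/deepseek-coder-6.7b-instruct", tensor_parallel_size=1)
params = SamplingParams(n=10, temperature=0.6, top_p=0.95, max_tokens=2048)

outputs = llm.generate(["<problem prompt here>"], params)
for completion in outputs[0].outputs:  # 10 completions for the first prompt
    print(completion.text)
```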

Results:

The pass@1, pass@5, and pass@10 statistics across the difficulty levels (rounded to four decimal places) are presented in the tables below.

Training with packing:

| Difficulty   | pass@1 | pass@5 | pass@10 |
|--------------|--------|--------|---------|
| total        | 0.1299 | 0.2350 | 0.2792  |
| introductory | 0.3532 | 0.5269 | 0.5755  |
| interview    | 0.0838 | 0.1788 | 0.2241  |
| competition  | 0.0113 | 0.0435 | 0.0677  |

Training without packing:

| Difficulty   | pass@1 | pass@5 | pass@10 |
|--------------|--------|--------|---------|
| total        | 0.1318 | 0.2379 | 0.2837  |
| introductory | 0.3541 | 0.5332 | 0.5948  |
| interview    | 0.0858 | 0.1811 | 0.2259  |
| competition  | 0.0139 | 0.0433 | 0.0613  |

For comparison, we also evaluated the original deepseek-coder-6.7b-instruct model:

| Difficulty   | pass@1 | pass@5 | pass@10 |
|--------------|--------|--------|---------|
| total        | 0.1172 | 0.2071 | 0.2483  |
| introductory | 0.3245 | 0.4672 | 0.5261  |
| interview    | 0.0742 | 0.1581 | 0.1980  |
| competition  | 0.0094 | 0.0264 | 0.0387  |