CodeV: Empowering LLMs for Verilog Generation through Multi-Level Summarization

1SKL of Processors, Institute of Computing Technology, CAS
2University of Science and Technology of China
3University of Chinese Academy of Sciences
4Cambricon Technologies

Abstract

Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthrough performance on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially to automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computational cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage "distill-then-RL" training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that reduces training cost by adaptively adjusting the sampling rate. The resulting model, CodeV-R1, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing the prior state-of-the-art by 12-21%, while matching or even exceeding the performance of the 671B DeepSeek-R1. We will release our model, training pipeline, and dataset to facilitate research in the EDA and LLM communities.

Overview

Our framework comprises an automated testbench generation framework and five stages. Stages 1-3 constitute the distillation phase, and stages 4-5 constitute the reinforcement learning phase.

1. Code-to-NL: Following prior work [codev, mgverilog], we collect Verilog code snippets from GitHub and use DeepSeek-V3 to produce corresponding natural-language summaries, creating an NL-code corpus of approximately 150K samples.
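The code-to-NL stage can be sketched as a simple pairing loop. In this minimal Python sketch, `summarize` is a stub standing in for a DeepSeek-V3 API call, and the snippet collection is a toy in-memory list; both are illustrative assumptions, not the actual pipeline code.

```python
def summarize(verilog_code: str) -> str:
    """Stub for an LLM call: a real pipeline would prompt DeepSeek-V3
    to produce a natural-language summary of the snippet."""
    first_line = verilog_code.strip().splitlines()[0]
    return f"Module described by: {first_line}"

def build_nl_code_corpus(snippets):
    """Pair each collected Verilog snippet with a generated NL summary."""
    return [{"nl": summarize(code), "code": code} for code in snippets]

corpus = build_nl_code_corpus([
    "module adder(input [3:0] a, b, output [4:0] sum);\n"
    "  assign sum = a + b;\n"
    "endmodule"
])
print(corpus[0]["nl"])
```

In the real pipeline, the resulting NL-code pairs are the raw corpus that later stages filter and verify.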

2. NL-to-Code: Using DeepSeek-R1, we take each NL description from stage 1 and generate a reasoning "thought" along with a Verilog code snippet, producing NL-thought-code triples.
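The triple structure from this stage can be illustrated as follows. Here `reason_and_code` is a hypothetical stub for a DeepSeek-R1 call that returns both the reasoning trace and the generated code; the real pipeline would parse these from the model's output.

```python
def reason_and_code(nl: str):
    """Stub for a DeepSeek-R1 call returning (thought, code).
    A real pipeline would parse the model's reasoning trace and code block."""
    thought = f"Plan a module implementing: {nl}"
    code = "module stub();\nendmodule"
    return thought, code

def make_triples(nl_descriptions):
    """Build NL-thought-code triples from stage-1 NL descriptions."""
    triples = []
    for nl in nl_descriptions:
        thought, code = reason_and_code(nl)
        triples.append({"nl": nl, "thought": thought, "code": code})
    return triples

triples = make_triples(["a 2-to-1 multiplexer with a select input"])
print(triples[0]["thought"])
```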

3. Difficulty Filtering and Supervised Fine-Tuning: We first filter the dataset by removing any examples for which base LLMs (e.g., Qwen2.5-Coder-Instruct-7B/32B) can generate correct code in any of 5 attempts (correctness is verified using our automatically generated testbench). We then perform SFT on the base LLM to bootstrap its reasoning ability, yielding the distilled model, CodeV-R1-Distill. This stage uses approximately 87K examples.
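The pass@5 difficulty filter described above amounts to dropping any problem the base model can already solve. This sketch stubs out both the model call and the testbench check; `passes_testbench` and `toy_generate` are hypothetical stand-ins for the simulator-based verification and the base LLM.

```python
def passes_testbench(code: str) -> bool:
    """Stub: the real pipeline runs the auto-generated testbench in a
    Verilog simulator and compares against the golden reference."""
    return "assign" in code  # hypothetical correctness proxy

def is_too_easy(problem: str, generate, k: int = 5) -> bool:
    """A problem is 'too easy' if any of k sampled attempts passes."""
    return any(passes_testbench(generate(problem)) for _ in range(k))

def difficulty_filter(problems, generate, k: int = 5):
    """Keep only problems the base model fails on all k attempts."""
    return [p for p in problems if not is_too_easy(p, generate, k)]

# Toy model: solves problems tagged "easy", fails the rest.
toy_generate = lambda p: "assign y = a;" if "easy" in p else "// gave up"
print(difficulty_filter(["easy mux", "hard fsm"], toy_generate))
```

The same filtering logic is reused in stage 5, with CodeV-R1-Distill in place of the base model.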

4. Equivalence Checking: We use our automated testbench to verify equivalence between the original snippets and the newly generated snippets. Any non-equivalent pairs are discarded, while equivalent pairs are retained as high-quality data for subsequent RL training. After this filtering, approximately 87K examples remain.
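Conceptually, the equivalence check drives both the golden and the regenerated module with the same stimuli and compares outputs. In this sketch, modules are modeled as Python functions purely for illustration; the actual flow compiles both Verilog snippets and runs the generated testbench in a simulator.

```python
def simulate(module_fn, inputs):
    """Stub: evaluate a 'module' (modeled here as a Python function) on a
    list of input vectors. A real flow would run a Verilog simulator."""
    return [module_fn(x) for x in inputs]

def equivalent(golden, candidate, test_vectors) -> bool:
    """Keep a pair only if outputs match on every generated stimulus."""
    return simulate(golden, test_vectors) == simulate(candidate, test_vectors)

# Toy example: two behaviorally identical 5-bit-output adders.
golden = lambda ab: (ab[0] + ab[1]) & 0x1F
candidate = lambda ab: (ab[1] + ab[0]) & 0x1F
vectors = [(a, b) for a in range(16) for b in range(16)]
print(equivalent(golden, candidate, vectors))
```

Note that stimulus-based checking is sound for rejection (any mismatch proves non-equivalence) but only probabilistic for acceptance, which is why the testbench generator aims for robust stimulus coverage.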

5. Difficulty Filtering and Reinforcement Learning: We again filter the retained set by removing any examples for which the distilled model CodeV-R1-Distill generates correct code in any of 5 attempts (as checked by the testbench). We then apply adaptive DAPO, our novel RLVR algorithm, to further improve Verilog-generation performance, resulting in the final model, CodeV-R1.
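One way to picture how an adaptive sampling rate can cut RL cost: in group-based RLVR methods like DAPO, a prompt whose rollouts all receive the same reward yields zero advantage and thus no gradient signal. The rule below is an illustrative guess at such an adaptation, not the actual adaptive DAPO update; the doubling/halving policy and the bounds are assumptions.

```python
def adjust_sample_count(n: int, rewards, n_min: int = 4, n_max: int = 16) -> int:
    """Hypothetical adaptive rule: if all rollouts for a prompt share the
    same reward (a degenerate group with zero advantage), sample more next
    time to find a learning signal; otherwise sample fewer to save compute.
    The exact rule used by adaptive DAPO may differ."""
    if len(set(rewards)) == 1:  # all-pass or all-fail: no gradient signal
        return min(n * 2, n_max)
    return max(n // 2, n_min)

print(adjust_sample_count(8, [0, 0, 0, 0]))  # degenerate group: sample more
print(adjust_sample_count(8, [0, 1, 0, 1]))  # mixed rewards: sample less
```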

Main Results

We compare the main results of CodeV-R1 with baseline methods on the VerilogEval v1/v2 and RTLLM v1.1/v2.0 benchmarks. Here we present the results of both our distilled model and our RL model. For baseline comparisons, we manually tested the GPT, DeepSeek, and Qwen models, while sourcing results for RTLCoder, BetterV, CodeV, and CraftRTL from their papers.

Our CodeV-R1 model achieves 68.6% and 72.9% pass@1 accuracy on VerilogEval v2 and RTLLM v1.1, respectively. On RTLLM v1.1, it outperforms previous state-of-the-art models by approximately 20%, and surpasses the 671B DeepSeek-R1 on both RTLLM v1.1 and v2.0.

The following two tables report results on VerilogEval v1 / RTLLM v1.1 and on VerilogEval v2 / RTLLM v2.0, respectively.

For VerilogEval v1 and RTLLM v1.1, results marked with * are from our own evaluation; the remaining results are sourced from the corresponding papers.

For VerilogEval v2 and RTLLM v2.0, we evaluate all models in the table ourselves. SR denotes Specification-to-RTL, and CC denotes Code Completion.

BibTex

                
[TODO]