CodeV: Empowering LLMs for Verilog Generation through Multi-Level Summarization

1SKL of Processors, Institute of Computing Technology, CAS
2University of Chinese Academy of Sciences
3University of Science and Technology of China
4Cambricon Technologies

Abstract

The increasing complexity and high costs of modern processor design have led to a surge in demand for processor design automation. Instruction-tuned large language models (LLMs) have demonstrated remarkable performance in automatically generating code for general-purpose programming languages like Python. However, these methods fail on hardware description languages (HDLs) like Verilog due to the scarcity of high-quality instruction-tuning data; even advanced LLMs like GPT-3.5 exhibit limited performance on Verilog generation. In response to this issue, we observe that (1) Verilog code collected from the real world is of higher quality than code generated by LLMs, and (2) LLMs like GPT-3.5 excel at summarizing Verilog code rather than generating it. Based on these observations, this paper introduces CodeV, a series of open-source instruction-tuned Verilog generation LLMs. Instead of generating descriptions first and then obtaining the corresponding code from advanced LLMs, we prompt the LLM with Verilog code and let it generate the corresponding natural language description through multi-level summarization. Experimental results show that CodeV relatively surpasses the previous open-source SOTA by 14.4% (BetterV on VerilogEval) and 11.3% (RTLCoder on RTLLM) respectively, and also relatively outperforms the previous commercial SOTA, GPT-4, by 22.1% on VerilogEval.

Overview

We first collect and filter high-quality Verilog modules from open-source codebases. The modules are then sent to GPT-3.5 to obtain multi-level summaries. By pairing each high-level description with its corresponding module, we build a high-quality dataset that is used to fine-tune base LLMs, yielding the CodeV models; a sketch of this pipeline is shown below.
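A minimal sketch of this pipeline follows. The filtering heuristic, the summarize_module callback, and the output file name are illustrative assumptions, not the paper's actual implementation.

import json
import re

def looks_high_quality(code: str) -> bool:
    """Toy filter (an assumption, not the paper's actual criteria):
    keep self-contained modules with balanced module/endmodule pairs."""
    n_mod = len(re.findall(r"\bmodule\b", code))
    n_end = len(re.findall(r"\bendmodule\b", code))
    return n_mod >= 1 and n_mod == n_end and len(code) < 20_000

def build_dataset(modules, summarize_module):
    """Pair each kept module with a GPT-3.5 summary of it,
    producing (instruction, output) fine-tuning records."""
    records = []
    for code in modules:
        if looks_high_quality(code):
            description = summarize_module(code)
            records.append({"instruction": description, "output": code})
    return records

if __name__ == "__main__":
    modules = ["module and2(input a, input b, output y); assign y = a & b; endmodule"]
    # summarize_module would call GPT-3.5 (see the next section); stubbed here.
    data = build_dataset(modules, lambda code: "Implement a 2-input AND gate.")
    with open("codev_sft.jsonl", "w") as f:
        for record in data:
            f.write(json.dumps(record) + "\n")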

Multi-level Code Summarization

Manual annotation is prohibitively time-consuming and costly. Hence, we employ GPT-3.5 to generate a high-level summary for each Verilog module as its requirement description. As analyzed in VerilogEval, when asked to summarize, LLMs often produce verbose descriptions, preferring line-by-line explanations over high-level summaries. To address this issue, we introduce a multi-level summarization method, employing few-shot learning to guide GPT-3.5 to first produce a detailed description and then abstract it into a high-level summary; a sketch of this prompt construction follows the figure caption below.

An actual example of the prompt for multi-level summarization. (a) The prompt provided to GPT-3.5. (b) An example of the demonstrations, with code, low-level descriptions, and high-level summaries. (c) Summaries returned by GPT-3.5 with and (d) without multi-level summarization.
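A rough sketch of how such a multi-level prompt could be assembled and sent to GPT-3.5 is shown below. The demonstration text, the message structure, and the summarize_module helper are illustrative assumptions, not the paper's exact prompt.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One few-shot demonstration (assumed content): code, then a detailed
# low-level description, then the abstracted high-level summary.
DEMONSTRATION = """\
Code:
module and2(input a, input b, output y);
  assign y = a & b;
endmodule

Low-level description:
The module declares two 1-bit inputs a and b and one output y,
and continuously assigns y the bitwise AND of a and b.

High-level summary:
Implement a 2-input AND gate.
"""

def summarize_module(code: str) -> str:
    """Ask GPT-3.5 for a detailed description first, then a
    high-level summary; return only the final summary."""
    prompt = (
        "Summarize the following Verilog module. First give a detailed "
        "line-by-line description, then abstract it into a concise "
        "high-level summary, following the example.\n\n"
        f"{DEMONSTRATION}\nCode:\n{code}\n"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    # Keep only the high-level summary as the instruction text.
    return text.split("High-level summary:")[-1].strip()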

Main Results

We compare the main results of our CodeV models with baseline methods on the VerilogEval and RTLLM benchmarks. We test CodeLlama, DeepSeek-Coder, and CodeQwen on RTLLM ourselves, while the other baseline results are taken from the RTLCoder or BetterV papers. For a fair comparison, we also evaluate our models trained on datasets of comparable size against RTLCoder.

Comparison of our CodeV models against various baseline models. Some data are missing because the corresponding models are closed-source and the numbers were not reported previously. The best results are highlighted in bold. On VerilogEval, CodeV outperforms all previous methods across all metrics, relatively surpassing the previous open-source SOTA BetterV by 14.4% and GPT-4 by 22.1% on average. On RTLLM, CodeV relatively surpasses the previous open-source SOTA RTLCoder by 11.3%.
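Both benchmarks report pass@k, the probability that at least one of k sampled completions passes the testbench. For reference, the standard unbiased estimator (introduced with Codex and also used by VerilogEval) can be computed as follows; this is a generic sketch, not code from the CodeV evaluation harness.

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples drawn
    from n generations (of which c are correct) passes.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # too few failures to fill all k samples: always passes
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 20 samples per problem, 5 of which pass.
print(pass_at_k(20, 5, 1))  # 0.25
print(pass_at_k(20, 5, 5))  # ~0.81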

LLM-generated Verilog code

We have collected existing Verilog-generation LLMs and demonstrate their performance on VerilogEval and RTLLM in the Chip Design LLM Zoo.

Quick Start

from transformers import pipeline
import torch

# Replace the placeholders with your question and a CodeV checkpoint path.
prompt = "FILL IN THE QUESTION"

generator = pipeline(
    model="CODEV",
    task="text-generation",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Greedy decoding: temperature=0.0 is not a valid sampling temperature,
# so deterministic output is requested with do_sample=False instead.
result = generator(prompt, max_length=2048, num_return_sequences=1, do_sample=False)
response = result[0]["generated_text"]
print("Response:", response)
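For example, a concrete request might look like the following (a hypothetical prompt; the exact instruction format expected by a given CodeV checkpoint may differ):

prompt = (
    "Please write a Verilog module named counter4 that implements "
    "a 4-bit synchronous counter with an active-high reset."
)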