
LLM-Based Code Generation Method for Golang Compiler Testing


Update:

🎉🎉 This paper was published at the ESEC/FSE 2023 conference in December 2023. It also won the Student Research Competition (Undergraduate Division) at the conference, held December 3 to 7, 2023, in San Francisco.


This is the official PyTorch implementation for our paper:

Title: LLM-Based Code Generation Method for Golang Compiler Testing [PDF]

Authors: Qiuhan Gu, Shicheng Yin and Yu Wang



Table of Contents

  1. Introduction

  2. Model

  3. Dataset

  4. Acknowledgments

Introduction

This repo provides the code for reproducing the experiments in LLM-Based Code Generation Method for Golang Compiler Testing. We present an LLM-based high-quality code generation method and apply it to Golang compiler testing. To summarize, our contributions in this work include:

  • An LLM-based high-quality code generation method.
  • An application of the method to the Golang compiler: the generated test cases achieve 3.38% average coverage, with only 2.79% of them containing syntax errors and none exhibiting undefined behavior.

How to reproduce the results?

Go to the scripts/goroot/src/codet5 folder and run ./script.sh to reproduce the experiments.
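For example, from the repository root (assuming script.sh is executable):

cd scripts/goroot/src/codet5
./script.sh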

Model

Our model is fine-tuned from the pre-trained CodeT5-small model. Given an input that provides the necessary class environment and an empty function, it generates the missing function body. You can find the model in the model folder or on the Hugging Face Hub.

See the example below for the expected input format.

How to use

Here is how to use this model:

from transformers import T5ForConditionalGeneration, RobertaTokenizer

# load model and tokenizer
model_path = "intm/codet5-small-go_generation"
tokenizer = RobertaTokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

# use model to generate code 
input_text = "package names\n\nimport \"knative.dev/pkg/kmeta\"\n\n\nfunc Deployment(rev kmeta.Accessor) string {\n\treturn kmeta.ChildName(rev.GetName(), \"-deployment\")\n}\n\n\nfunc ImageCache(rev kmeta.Accessor) string {\n\treturn kmeta.ChildName(rev.GetName(), \"-cache\")\n}\n\n\n\n\nfunc PA(rev kmeta.Accessor) string"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(input_ids=input_ids, max_new_tokens=256)  # max_new_tokens matches max_trg_len used when building the dataset

# decode the generated tokens into a string
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)


# expected output: return kmeta.ChildName(rev.GetName(), "-pa")

Dataset

We process and filter Go source files collected from the internet with the syntax analysis tool tree-sitter [18] to obtain Go files that meet our requirements. The resulting filtered dataset, which contains 1,839 Go programs, is used for model training and initial seed selection. You can find it in the dataset folder or on the Hugging Face Hub.
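As a minimal sketch of the syntactic filtering step (assuming the older py-tree-sitter bindings, before 0.22, and a local checkout of the tree-sitter-go grammar; the paper's actual filtering criteria may be richer):

from tree_sitter import Language, Parser

# build the Go grammar once (assumes a local clone of tree-sitter-go; path is illustrative)
Language.build_library("build/languages.so", ["vendor/tree-sitter-go"])
GO = Language("build/languages.so", "go")

parser = Parser()
parser.set_language(GO)

def parses_cleanly(source: bytes) -> bool:
    # keep only files that tree-sitter parses without any syntax error nodes
    tree = parser.parse(source)
    return not tree.root_node.has_error

with open("example.go", "rb") as f:
    print(parses_cleanly(f.read()))

Files that fail this check would be discarded before training and seed selection.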

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant Nos. 62232001 and 62202220.

About

State Key Laboratory for Novel Software Technology, Nanjing University, China
