A Prompt Learning Framework for Source Code Summarization
- python==3.8
- torch==2.1.0
- transformers==4.32.1
- deepspeed==0.12.2 (optional)
- openai==0.28.0 (optional)
We use the Java, JavaScript, and Python datasets from the CodeXGLUE code-to-text docstring generation task, which is built upon the CodeSearchNet corpus and excludes defective data samples.
We further process them to obtain two additional fields: 'clean_code' and 'clean_doc'.
```bash
unzip dataset.zip
cd dataset
wget https://zenodo.org/record/7857872/files/java.zip
wget https://zenodo.org/record/7857872/files/javascript.zip
wget https://zenodo.org/record/7857872/files/python.zip
unzip python.zip
unzip java.zip
unzip javascript.zip
python preprocess.py
rm *.pkl
rm -r */[^clean]*
cd ..
```
After preprocessing the dataset, you will obtain three .jsonl files for each language, i.e., clean_train.jsonl, clean_valid.jsonl, and clean_test.jsonl. In each file, every line represents one function. Here is an explanation of the fields:
- The fields contained in the original CodeXGLUE dataset:
  - repo: the owner/repo
  - path: the full path to the original file
  - func_name: the function or method name
  - original_string: the raw string before tokenization or parsing
  - language: the programming language
  - code/function: the part of original_string that is code
  - code_tokens/function_tokens: the tokenized version of code
  - docstring: the top-level comment or docstring, if it exists in the original string
  - docstring_tokens: the tokenized version of docstring
- The additional fields we added:
  - clean_code: the clean version of code, with any comments removed
  - clean_doc: the clean version of docstring, obtained by concatenating docstring_tokens
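The following is a minimal sketch for loading a preprocessed split and inspecting these fields; the path assumes the Java split produced by the steps above.

```python
import json

# Read the first record of a preprocessed split (one function per line).
with open("dataset/java/clean_train.jsonl") as f:
    sample = json.loads(next(f))

print(sample["func_name"])
print(sample["clean_code"][:200])  # code with comments removed
print(sample["clean_doc"])         # docstring rebuilt from docstring_tokens
```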
| Programming Language | Training | Dev | Test |
|---|---|---|---|
| Python | 251,820 | 13,914 | 14,918 |
| Java | 164,923 | 5,183 | 10,955 |
| JavaScript | 58,025 | 3,885 | 3,291 |
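Since each line of a .jsonl file is one function, you can sanity-check the preprocessed splits against these statistics by counting lines, as in the sketch below (assuming the dataset layout above):

```python
# Count functions per split; each line of a .jsonl file is one function.
for lang in ("python", "java", "javascript"):
    for split in ("train", "valid", "test"):
        with open(f"dataset/{lang}/clean_{split}.jsonl") as f:
            print(lang, split, sum(1 for _ in f))
```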
```bash
cd PromptCS
CUDA_VISIBLE_DEVICES=0 python run.py \
    --mode PromptCS \
    --prompt_encoder_type lstm \
    --template [0,100] \
    --model_name_or_path ../LLMs/codegen-350m \
    --train_filename ../dataset/java/clean_train.jsonl \
    --dev_filename ../dataset/java/clean_valid.jsonl \
    --test_filename ../dataset/java/clean_test.jsonl \
    --output_dir ./saved_models \
    --train_batch_size 16 \
    --eval_batch_size 16 \
    --learning_rate 5e-5
```
This reproduces our experimental results on a single A800. If your device has insufficient GPU memory, or you need multi-GPU training, please check out the DeepSpeed version of training PromptCS below.
We set the Zero Redundancy Optimizer (ZeRO) to stage 3 (ZeRO-3) and enable offloading of the optimizer computation to the CPU. Note, however, that the experimental results obtained by using DeepSpeed to train PromptCS or fine-tune LLMs have not been validated.
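For reference, the ZeRO-3 setup with CPU optimizer offloading described above corresponds to a DeepSpeed configuration along the lines of the sketch below; this is a hypothetical example, and the actual configuration shipped in PromptCS-DeepSpeed may differ.

```python
import json

# Hypothetical sketch of the DeepSpeed settings described above:
# ZeRO stage 3 with the optimizer computation offloaded to the CPU.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```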
```bash
cd PromptCS-DeepSpeed
deepspeed --num_gpus=2 run.py
```
The explanations of some of the arguments are as follows:
- model_name_or_path: Path to the pre-trained model
- mode: Operational mode. Choices=["PromptCS", "finetune"]
- prompt_encoder_type: Architecture of the prompt encoder. Choices=["lstm", "transformer"]
- template: The concatenation method of pseudo tokens and the code snippet. The default is the back-end mode [0,100], which places all pseudo tokens after the code snippet (see the sketch below)
- output_dir: The output directory where the model predictions and checkpoints will be written.

For a complete list of all argument settings, please refer to run.py.
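The sketch below illustrates how a template [x,y] arranges pseudo tokens around the code snippet. It is an illustration only, not the PromptCS implementation: in PromptCS the pseudo tokens are continuous embeddings produced by the prompt encoder rather than literal strings.

```python
# Illustration of the template argument: a template [x, y] places
# x pseudo tokens before the code snippet and y pseudo tokens after it.
def build_input(pseudo_tokens, code_tokens, template):
    x, y = template
    assert x + y == len(pseudo_tokens)
    return pseudo_tokens[:x] + code_tokens + pseudo_tokens[x:]

# Back-end mode [0, 100]: all 100 pseudo tokens follow the code.
pseudo = [f"<p{i}>" for i in range(100)]
code = ["public", "int", "add", "(", "int", "a", ",", "int", "b", ")"]
print(build_input(pseudo, code, (0, 100))[:12])
```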
```bash
cd PromptCS
python evaluate.py --predict_file_path ./saved_models/test_0.output --ground_truth_file_path ./saved_models/test_0.gold --SentenceBERT_model_path ../all-MiniLM-L6-v2
```
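For intuition, the SentenceBERT metric embeds the predicted summary and the ground truth and compares them by cosine similarity. Below is a minimal sketch using the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint referenced above; the example sentences are hypothetical.

```python
from sentence_transformers import SentenceTransformer, util

# Embed a prediction and its reference, then take cosine similarity.
# Assumes all-MiniLM-L6-v2 has been downloaded to ../all-MiniLM-L6-v2.
model = SentenceTransformer("../all-MiniLM-L6-v2")
pred = "Returns the sum of two integers."
gold = "Add two numbers and return the result."
emb = model.encode([pred, gold], convert_to_tensor=True)
print(f"SentenceBERT similarity: {util.cos_sim(emb[0], emb[1]).item():.4f}")
```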
To obtain METEOR and ROUGE-L, you need to activate an environment that contains Python 2.7:

```bash
conda activate py27
unzip evaluation
cd evaluation
python evaluate.py --predict_file_path ../PromptCS/saved_models/test_0.output --ground_truth_file_path ../PromptCS/saved_models/test_0.gold
```

Tip: the paths should only contain English characters.
```bash
cd zeroshot
python manual.py --model_name_or_path ../bigcode/starcoderbase-3b --test_filename ../dataset/java/clean_test.jsonl
python manual_gpt_3.5.py --test_filename ../dataset/java/clean_test.jsonl
```
We directly leverage the 10 Java examples provided by Ahmed et al. in their GitHub repository, since we use the same experimental dataset (i.e., the CodeSearchNet corpus).
```bash
cd fewshot
python fewshot.py --model_name_or_path ../bigcode/starcoderbase-3b --test_filename ../dataset/java/clean_test.jsonl
python fewshot_gpt_3.5.py --test_filename ../dataset/java/clean_test.jsonl
```
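For intuition, few-shot prompting concatenates demonstration (code, summary) pairs before the query code. The sketch below is a minimal illustration; the exact prompt format is defined in fewshot.py, and the demonstration pair here is hypothetical.

```python
# Minimal sketch of few-shot prompt construction. Each demonstration
# is a (code, summary) pair; the query code is appended last.
demos = [
    ("public int add(int a, int b) { return a + b; }",
     "Adds two integers and returns the sum."),
]
query = "public int sub(int a, int b) { return a - b; }"

prompt = ""
for code, summary in demos:
    prompt += f"Code:\n{code}\nSummary: {summary}\n\n"
prompt += f"Code:\n{query}\nSummary:"
print(prompt)
```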