
Commit 7d18493

Update README.md
1 parent eba4ef2 commit 7d18493


README.md

Lines changed: 32 additions & 19 deletions
@@ -4,11 +4,40 @@
 💫 StarCoder is a language model (LM) trained on source code and natural language text. Its training data incorporates more than 80 different programming languages as well as text extracted from GitHub issues and commits and from notebooks. This repository showcases how we get an overview of this LM's capabilities.
 
 # Table of Contents
-1. [Fine-tuning](#fine-tuning)
+1. Quickstart
+    - [Installation](#installation)
+    - [Code generation with StarCoder](#code-generation)
+2. [Fine-tuning](#fine-tuning)
     - [Step by step installation with conda](#step-by-step-installation-with-conda)
     - [Datasets](#datasets)
       - [Stack Exchange](#stack-exchange-se)
     - [Merging PEFT adapter layers](#merging-peft-adapter-layers)
+
+# Quickstart
+StarCoder was trained on GitHub code, so it can be used to perform text generation: completing the implementation of a function, or inferring the following characters in a line of code. This can be done with the help of the `transformers` library.
+
+## Installation
+Here we have to install all the libraries listed in `requirements.txt`:
+```bash
+pip install -r requirements.txt
+```
+## Code generation
+The code generation pipeline is as follows:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+checkpoint = "bigcode/starcoder"
+device = "cuda" # for GPU usage or "cpu" for CPU usage
+
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)
+
+inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
+outputs = model.generate(inputs)
+print(tokenizer.decode(outputs[0]))
+```
+
 # Fine-tuning
 
 Here, we showcase how we can fine-tune this LM on a specific downstream task.
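
Note on the quickstart snippet added above: `model.generate` falls back to a short default output length, so longer completions need an explicit budget. Continuing from the snippet's `tokenizer`, `model`, and `inputs`, a minimal example (the token count is illustrative, not a recommendation):

```python
# request a longer completion than the short default generation length
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```
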
@@ -72,7 +101,7 @@ Now that everything is done, you can clone the repository and get into the corre
 To execute the fine-tuning script, run the following command:
 ```bash
 python finetune/finetune.py \
-        --model_path="bigcode/large-model"\
+        --model_path="bigcode/starcoder"\
         --dataset_name="ArmelR/stack-exchange-instruction"\
         --subset="data/finetune"\
         --split="train"\
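
One practical caveat for this command and the quickstart alike: `bigcode/starcoder` is a gated checkpoint on the Hugging Face Hub, so downloads can fail with an authorization error until the model license has been accepted and you are logged in:

```bash
huggingface-cli login
```
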
@@ -95,7 +124,7 @@ The command is quite similar to what we use on the alpaca code. However, the siz
 ```bash
 python -m torch.distributed.launch \
         --nproc_per_node number_of_gpus finetune/finetune.py \
-        --model_path="bigcode/large-model"\
+        --model_path="bigcode/starcoder"\
         --dataset_name="ArmelR/stack-exchange-instruction"\
         --subset="data/finetune"\
         --split="train"\
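
A side note on this launcher: recent PyTorch releases deprecate `python -m torch.distributed.launch` in favor of `torchrun`. An equivalent invocation, reusing the flags shown in the diff (any trailing flags elided by the diff context still apply), would presumably be:

```bash
# torchrun replaces the deprecated torch.distributed.launch entry point
torchrun --nproc_per_node number_of_gpus finetune/finetune.py \
        --model_path="bigcode/starcoder" \
        --dataset_name="ArmelR/stack-exchange-instruction" \
        --subset="data/finetune" \
        --split="train"
```
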
@@ -127,19 +156,3 @@ For example
 python finetune/merge_peft_adapters.py --model_name_or_path bigcode/large-model --peft_model_path checkpoints/checkpoint-1000 --push_to_hub
 ```
 
-## How to do text-generation with StarCoder
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-checkpoint = "bigcode/large-model"
-device = "cuda" # for GPU usage or "cpu" for CPU usage
-
-tokenizer = AutoTokenizer.from_pretrained(checkpoint)
-model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)
-
-inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
-outputs = model.generate(inputs)
-print(tokenizer.decode(outputs[0]))
-```
-## Text-inference
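
For orientation, the merging step that `finetune/merge_peft_adapters.py` automates amounts to loading the trained adapter on top of the base checkpoint and folding its weights in. A minimal sketch with the `peft` library, reusing the paths from the example above (the output directory name is hypothetical):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# load the base model, then attach the trained PEFT adapter weights
base = AutoModelForCausalLM.from_pretrained("bigcode/large-model")
model = PeftModel.from_pretrained(base, "checkpoints/checkpoint-1000")

# fold the adapter into the base weights and drop the PEFT wrappers
merged = model.merge_and_unload()
merged.save_pretrained("merged-starcoder")  # hypothetical output directory
```
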
