Commit b153d04

Added examples

1 parent df70a08 commit b153d04

File tree

1 file changed: +324 -0 lines changed
@@ -0,0 +1,324 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6b0899a74456b1f6",
"metadata": {},
"source": [
"### Using the Reasoning Effort Parameter with o-series Models\n"
]
},
{
"cell_type": "markdown",
"id": "446a6db5",
"metadata": {},
"source": [
"## Choosing the right Reasoning model and Reasoning effort for your use case \n",
"\n",
"Reasoning models, such as OpenAI’s o1 and o3-mini, are advanced language models trained with reinforcement learning to enhance complex reasoning. They generate a detailed internal thought process before responding, making them highly effective at problem-solving, coding, scientific reasoning, and multi-step planning for agentic workflows.\n",
"\n",
"In this Cookbook, we will walk through an eval-based quantitative analysis to help you choose the right reasoning model and reasoning effort for your use case. \n",
"\n",
"This is a three-step process: \n",
"\n",
"1. Build Your Evaluation Dataset\n",
"2. Build a Pipeline to evaluate the reasoning model and capture metrics \n",
"3. Choose the model/parameter based on the cost/performance trade-off "
]
},
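{
"cell_type": "markdown",
"id": "1f2e3d4c",
"metadata": {},
"source": [
"Before building the eval, here is a minimal sketch (not executed in this notebook) of how the `reasoning_effort` parameter is passed to the Chat Completions API. It accepts `\"low\"`, `\"medium\"`, or `\"high\"`, and the snippet assumes the `openai` Python SDK is installed and `OPENAI_API_KEY` is set in your environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b3c4d5e",
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment\n",
"\n",
"# Minimal sketch: ask o3-mini a question with a specific reasoning effort level.\n",
"response = client.chat.completions.create(\n",
"    model=\"o3-mini\",\n",
"    reasoning_effort=\"medium\",  # one of \"low\", \"medium\", \"high\"\n",
"    messages=[{\"role\": \"user\", \"content\": \"What is 17 * 24?\"}],\n",
")\n",
"print(response.choices[0].message.content)"
]
},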
{
"cell_type": "markdown",
"id": "8474d01e",
"metadata": {},
"source": [
"### Step 1: Build Your Evaluation Dataset \n",
"\n",
"For this example, we will use the ARC-Challenge split of the AI2-ARC dataset. Each example contains:\n",
"\n",
"- `id`: a string feature\n",
"- `question`: a string feature\n",
"- `choices`: a dictionary feature containing:\n",
"  - `text`: a string feature\n",
"  - `label`: a string feature\n",
"- `answerKey`: a string feature"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "4b3867fc",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"# Download the ARC-Challenge test split and save it locally.\n",
"url = \"https://huggingface.co/datasets/allenai/ai2_arc/resolve/main/ARC-Challenge/test-00000-of-00001.parquet\"\n",
"response = requests.get(url)\n",
"with open(\"test-00000-of-00001.parquet\", \"wb\") as f:\n",
"    f.write(response.content)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f0ad995b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
"    \"id\": \"Mercury_7175875\",\n",
"    \"question\": \"An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?\",\n",
"    \"choices\": {\n",
"        \"text\": [\n",
"            \"Planetary density will decrease.\",\n",
"            \"Planetary years will become longer.\",\n",
"            \"Planetary days will become shorter.\",\n",
"            \"Planetary gravity will become stronger.\"\n",
"        ],\n",
"        \"label\": [\n",
"            \"A\",\n",
"            \"B\",\n",
"            \"C\",\n",
"            \"D\"\n",
"        ]\n",
"    },\n",
"    \"answerKey\": \"C\"\n",
"}\n"
]
}
],
"source": [
"import json\n",
"import pandas as pd\n",
"\n",
"# Set Pandas options to display full text in cells\n",
"pd.set_option('display.max_colwidth', None)\n",
"\n",
"# Reads the Parquet file into a DataFrame.\n",
"df = pd.read_parquet(\"test-00000-of-00001.parquet\")\n",
"\n",
"# Convert the first row to a dictionary.\n",
"row_dict = df.head(1).iloc[0].to_dict()\n",
"\n",
"# Pretty-print the row as a JSON string with an indentation of 4 spaces.\n",
"# The default lambda converts non-serializable objects (like numpy arrays) to lists.\n",
"print(json.dumps(row_dict, indent=4, default=lambda o: o.tolist() if hasattr(o, 'tolist') else o))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "116e476a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of rows in the dataset: 1172\n"
]
}
],
"source": [
"def display_total_rows(dataframe: pd.DataFrame):\n",
"    \"\"\"\n",
"    Display the total number of rows in the given DataFrame.\n",
"\n",
"    Parameters:\n",
"        dataframe (pd.DataFrame): The input DataFrame.\n",
"\n",
"    Returns:\n",
"        None\n",
"    \"\"\"\n",
"    total_rows = len(dataframe)\n",
"    print(f\"Total number of rows in the dataset: {total_rows}\")\n",
"\n",
"# Display the total number of rows in the dataset\n",
"display_total_rows(df)\n"
]
},
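{
"cell_type": "markdown",
"id": "f1a2b3c4",
"metadata": {},
"source": [
"The full test split has 1,172 questions, so scoring every question at every reasoning effort level can be slow and costly. As an optional step (a sketch, not executed here; `df_sample` and the sample size are illustrative), you can run the eval on a smaller random sample first."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a2b3c4d5",
"metadata": {},
"outputs": [],
"source": [
"# Optional: evaluate on a smaller random sample to keep runs fast and cheap.\n",
"# A fixed random_state makes the sample reproducible across runs.\n",
"df_sample = df.sample(n=50, random_state=42).reset_index(drop=True)\n",
"display_total_rows(df_sample)"
]
},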
{
"cell_type": "markdown",
"id": "380c8c4e",
"metadata": {},
"source": [
"### Step 2: Build a Pipeline to evaluate the reasoning model and capture metrics \n",
"\n",
"Let's write a Python helper that sends a question to the reasoning model and captures the answer, token usage, and latency. \n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "7a9b4bc6",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"from openai import OpenAI\n",
"\n",
"# Initialize the OpenAI client\n",
"client = OpenAI()\n",
"\n",
"\n",
"def response_with_reasoning_effort(model: str, question: str, reasoning_effort: str):\n",
"    \"\"\"\n",
"    Send a question to the OpenAI model with a given reasoning effort level.\n",
"\n",
"    Parameters:\n",
"        model (str): The name of the model.\n",
"        question (str): The input prompt.\n",
"        reasoning_effort (str): The reasoning effort level (\"low\", \"medium\", or \"high\").\n",
"\n",
"    Returns:\n",
"        answer (str): The model's answer.\n",
"        usage: The usage object containing token counts.\n",
"        duration (float): The wall-clock time of the API call in seconds.\n",
"    \"\"\"\n",
"    start_time = time.time()\n",
"\n",
"    # API call\n",
"    response = client.chat.completions.create(\n",
"        model=model,\n",
"        reasoning_effort=reasoning_effort,\n",
"        messages=[\n",
"            {\"role\": \"system\", \"content\": \"You are a helpful assistant that provides answers to multiple-choice questions. Reply only with the letter of the correct answer choice.\"},\n",
"            {\"role\": \"user\", \"content\": question}]\n",
"    )\n",
"\n",
"    end_time = time.time()\n",
"\n",
"    # Extract answer from response.\n",
"    answer = response.choices[0].message.content.strip()\n",
"    usage = response.usage  # Contains prompt_tokens, total_tokens, and (optionally) reasoning_tokens.\n",
"\n",
"    return answer, usage, (end_time - start_time)\n"
]
},
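{
"cell_type": "markdown",
"id": "b7c8d9e0",
"metadata": {},
"source": [
"As a quick sanity check before the full loop, you can call the helper on a single question (a sketch, not executed here; `sample_q` is an illustrative name)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8d9e0f1",
"metadata": {},
"outputs": [],
"source": [
"# Sanity-check the helper on the first question at low reasoning effort.\n",
"sample_q = str(df.iloc[0].question) + \"\\n\" + \"choices: \" + str(df.iloc[0].choices)\n",
"answer, usage, duration = response_with_reasoning_effort(\"o3-mini\", sample_q, \"low\")\n",
"print(answer, usage.total_tokens, round(duration, 2))"
]
},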
{
"cell_type": "markdown",
"id": "ca5ff73d",
"metadata": {},
"source": [
"Run the pipeline over the dataset. For a quick demonstration, the loop below only processes the first two questions at the \"low\" reasoning effort; extend the ranges to cover the full dataset and all three effort levels. "
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "9589221b",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Processing Questions: 100%|██████████| 2/2 [00:07<00:00, 3.96s/it]\n"
]
}
],
"source": [
"from tqdm import tqdm\n",
"\n",
"results = []  # to accumulate results for each question and level\n",
"\n",
"for item in tqdm(range(2), desc=\"Processing Questions\"):\n",
"    q_text = str(df.iloc[item].question) + \"\\n\" + \"choices: \" + str(df.iloc[item].choices)\n",
"    # print(\"question: \", q_text)\n",
"\n",
"    expected = df.iloc[item].answerKey\n",
"    # for reasoning_effort in [\"low\", \"medium\", \"high\"]:\n",
"    for reasoning_effort in [\"low\"]:\n",
"        try:\n",
"            answer, usage, duration = response_with_reasoning_effort('o3-mini', q_text, reasoning_effort)\n",
"            correct = False\n",
"            ans_norm = answer.lower().strip()\n",
"            exp_norm = str(expected).lower().strip()\n",
"            # print(\"answer: \", answer)\n",
"            # print(\"expected: \", expected)\n",
"            # print(\"usage: \", usage)\n",
"            if exp_norm in ans_norm or ans_norm in exp_norm:\n",
"                correct = True\n",
"            results.append({\n",
"                \"id\": df.iloc[item].id,\n",
"                # \"question\": q_text,\n",
"                \"level\": reasoning_effort,\n",
"                \"model_answer\": answer,\n",
"                \"correct\": correct,\n",
"                \"prompt_tokens\": usage.prompt_tokens,\n",
"                \"total_tokens\": usage.total_tokens,\n",
"                \"reasoning_tokens\": usage.completion_tokens_details.reasoning_tokens,\n",
"                \"duration\": duration\n",
"            })\n",
"        except Exception as e:\n",
"            # Skip questions that fail (e.g., transient API errors).\n",
"            print(f\"Error processing question: {e}\")\n",
"\n",
"# Convert results to DataFrame for analysis\n",
"df_results = pd.DataFrame(results)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "45d88cc5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"                  id level model_answer  correct  prompt_tokens  total_tokens  \\\n",
"0    Mercury_7175875   low            C     True            127           202   \n",
"1  Mercury_SC_409171   low            B     True            142           155   \n",
"\n",
"   reasoning_tokens  duration  \n",
"0                64  2.823480  \n",
"1                 0  5.100148  \n"
]
}
],
"source": [
"print(df_results.head())\n"
]
},
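{
"cell_type": "markdown",
"id": "d4e5f6a7",
"metadata": {},
"source": [
"### Step 3: Choose the model/parameter based on cost/performance trade-off \n",
"\n",
"With results captured for each question and reasoning effort level, you can weigh accuracy against token usage and latency. The aggregation below is a minimal sketch (not executed here; the `summary` name is illustrative); run it after evaluating more questions, all three effort levels, and any other candidate models to see the trade-off."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5f6a7b8",
"metadata": {},
"outputs": [],
"source": [
"# Summarize accuracy, token usage, and latency per reasoning effort level.\n",
"# The column names match the fields collected into df_results above.\n",
"summary = df_results.groupby(\"level\").agg(\n",
"    accuracy=(\"correct\", \"mean\"),\n",
"    avg_total_tokens=(\"total_tokens\", \"mean\"),\n",
"    avg_reasoning_tokens=(\"reasoning_tokens\", \"mean\"),\n",
"    avg_duration_s=(\"duration\", \"mean\"),\n",
")\n",
"print(summary)"
]
}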
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
