Added examples

msingh-openai · msingh-openai · commit b153d04033cd · 2025-02-18T11:12:34.000-08:00
diff --git a/examples/reasoning_models/using_reasoning_effort_parameter.ipynb b/examples/reasoning_models/using_reasoning_effort_parameter.ipynb
@@ -0,0 +1,324 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6b0899a74456b1f6",
+   "metadata": {},
+   "source": [
+    "### Using Reasoning Effort Parameter with o-series Models\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "446a6db5",
+   "metadata": {},
+   "source": [
+    "## Choosing the right Reasoning model and Reasoning effort for your use case \n",
+    "\n",
+    "Reasoning models, such as OpenAI’s o1 and o3-mini, are advanced language models trained with reinforcement learning to enhance complex reasoning. They generate a detailed internal thought process before responding, making them highly effective in problem-solving, coding, scientific reasoning, and multi-step planning for agentic workflows.\n",
+    "\n",
+    "In this Cookbook, we will walk through an eval-based quantitative analysis to help you choose the right reasoning model and reasoning effort for your use case.\n",
+    "\n",
+    "This is a three-step process:\n",
+    "\n",
+    "1. Build Your Evaluation Dataset\n",
+    "2. Build a Pipeline to evaluate the reasoning model and capture metrics \n",
+    "3. Choose the model/parameter based on cost/performance trade-off "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8474d01e",
+   "metadata": {},
+   "source": [
+    "### Step 1: Build Your Evaluation Dataset \n",
+    "\n",
+    "For this example, we will use the ARC-Challenge subset of the AI2-ARC dataset. Each record has the following fields:\n",
+    "\n",
+    "- `id`: a string feature.\n",
+    "- `question`: a string feature.\n",
+    "- `choices`: a dictionary feature containing:\n",
+    "  - `text`: a string feature.\n",
+    "  - `label`: a string feature.\n",
+    "- `answerKey`: a string feature."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "4b3867fc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "\n",
+    "url = \"https://huggingface.co/datasets/allenai/ai2_arc/resolve/main/ARC-Challenge/test-00000-of-00001.parquet\"\n",
+    "response = requests.get(url)\n",
+    "response.raise_for_status()  # fail early rather than writing an error page to disk\n",
+    "with open(\"test-00000-of-00001.parquet\", \"wb\") as f:\n",
+    "    f.write(response.content)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "f0ad995b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{\n",
+      "    \"id\": \"Mercury_7175875\",\n",
+      "    \"question\": \"An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?\",\n",
+      "    \"choices\": {\n",
+      "        \"text\": [\n",
+      "            \"Planetary density will decrease.\",\n",
+      "            \"Planetary years will become longer.\",\n",
+      "            \"Planetary days will become shorter.\",\n",
+      "            \"Planetary gravity will become stronger.\"\n",
+      "        ],\n",
+      "        \"label\": [\n",
+      "            \"A\",\n",
+      "            \"B\",\n",
+      "            \"C\",\n",
+      "            \"D\"\n",
+      "        ]\n",
+      "    },\n",
+      "    \"answerKey\": \"C\"\n",
+      "}\n"
+     ]
+    }
+   ],
+   "source": [
+    "import json\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Set Pandas options to display full text in cells\n",
+    "pd.set_option('display.max_colwidth', None)\n",
+    "\n",
+    "# Reads the Parquet file into a DataFrame.\n",
+    "df = pd.read_parquet(\"test-00000-of-00001.parquet\")\n",
+    "\n",
+    "# Convert the first row to a dictionary.\n",
+    "row_dict = df.head(1).iloc[0].to_dict()\n",
+    "\n",
+    "# Pretty-print the row as a JSON string with an indentation of 4 spaces.\n",
+    "# The default lambda converts non-serializable objects (like numpy arrays) to lists.\n",
+    "print(json.dumps(row_dict, indent=4, default=lambda o: o.tolist() if hasattr(o, 'tolist') else o))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "116e476a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Total number of rows in the dataset: 1172\n"
+     ]
+    }
+   ],
+   "source": [
+    "def display_total_rows(dataframe: pd.DataFrame):\n",
+    "    \"\"\"\n",
+    "    Display the total number of rows in the given DataFrame.\n",
+    "\n",
+    "    Parameters:\n",
+    "        dataframe (pd.DataFrame): The input DataFrame.\n",
+    "\n",
+    "    Returns:\n",
+    "        None\n",
+    "    \"\"\"\n",
+    "    total_rows = len(dataframe)\n",
+    "    print(f\"Total number of rows in the dataset: {total_rows}\")\n",
+    "\n",
+    "# Display the total number of rows in the dataset\n",
+    "display_total_rows(df)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "380c8c4e",
+   "metadata": {},
+   "source": [
+    "### Step 2: Build a Pipeline to evaluate the reasoning model and capture metrics \n",
+    "\n",
+    "Let's write a Python function that calls the reasoning model and captures the answer, token usage, and latency.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "7a9b4bc6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "from openai import OpenAI\n",
+    "\n",
+    "# Initialize the OpenAI client\n",
+    "client = OpenAI()\n",
+    "\n",
+    "\n",
+    "def response_with_reasoning_effort(model: str, question: str, reasoning_effort: str):\n",
+    "    \"\"\"\n",
+    "    Send a question to the OpenAI model with a given reasoning effort level.\n",
+    "\n",
+    "    Parameters:\n",
+    "        model (str): The name of the model.\n",
+    "        question (str): The input prompt.\n",
+    "        reasoning_effort (str): The reasoning effort level (\"low\", \"medium\", or \"high\").\n",
+    "\n",
+    "    Returns:\n",
+    "        answer (str): The model's answer.\n",
+    "        usage: The usage object containing token counts.\n",
+    "        duration (float): Wall-clock time of the API call, in seconds.\n",
+    "    \"\"\"\n",
+    "    \n",
+    "    start_time = time.time()\n",
+    "\n",
+    "    # API Call \n",
+    "    response = client.chat.completions.create(\n",
+    "        model=model,\n",
+    "        reasoning_effort=reasoning_effort,  # \"low\", \"medium\", or \"high\"\n",
+    "        messages=[\n",
+    "            {\"role\": \"system\", \"content\": \"You are a helpful assistant that provides answers to multiple choice questions. Reply only with the letter of the correct answer choice.\"},\n",
+    "            {\"role\": \"user\", \"content\": question}]\n",
+    "    )\n",
+    "    \n",
+    "    end_time = time.time()\n",
+    "    \n",
+    "    # Extract answer from response.\n",
+    "    answer = response.choices[0].message.content.strip()\n",
+    "    usage = response.usage  # Token counts: prompt_tokens, completion_tokens, total_tokens; reasoning tokens are under completion_tokens_details.\n",
+    "    \n",
+    "\n",
+    "    return answer, usage, (end_time - start_time)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ca5ff73d",
+   "metadata": {},
+   "source": [
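+    "Run the pipeline over the dataset. As written, the cell below processes only the first two questions at the \"low\" effort level as a quick check; widen the range and the effort list for a full run."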
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "id": "9589221b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Processing Questions: 100%|██████████| 2/2 [00:07<00:00,  3.96s/it]\n"
+     ]
+    }
+   ],
+   "source": [
+    "from tqdm import tqdm\n",
+    "\n",
+    "results = []  # to accumulate results for each question and level\n",
+    "\n",
+    "for item in tqdm(range(2), desc=\"Processing Questions\"):\n",
+    "    q_text = str(df.iloc[item].question) + \"\\n\" + \"choices: \" + str(df.iloc[item].choices)\n",
+    "\n",
+    "    # print(\"question: \", q_text)\n",
+    "\n",
+    "    expected = df.iloc[item].answerKey\n",
+    "    # for reasoning_effort in [\"low\", \"medium\", \"high\"]:\n",
+    "    for reasoning_effort in [\"low\"]:\n",
+    "        try:\n",
+    "            answer, usage, duration = response_with_reasoning_effort('o3-mini', q_text, reasoning_effort)\n",
+    "            # The system prompt asks for the letter only, so compare exactly after normalizing.\n",
+    "            # (A substring test would count an empty or verbose answer as correct.)\n",
+    "            ans_norm = answer.strip().strip(\".()\").lower()\n",
+    "            exp_norm = str(expected).strip().lower()\n",
+    "            correct = ans_norm == exp_norm\n",
+    "            results.append({\n",
+    "                \"id\": df.iloc[item].id,\n",
+    "                # \"question\": q_text,\n",
+    "                \"level\": reasoning_effort,\n",
+    "                \"model_answer\": answer,\n",
+    "                \"correct\": correct,\n",
+    "                \"prompt_tokens\": usage.prompt_tokens,\n",
+    "                \"total_tokens\": usage.total_tokens,\n",
+    "                \"reasoning_tokens\": getattr(usage.completion_tokens_details, \"reasoning_tokens\", None),  # typed object in recent openai SDK versions, not a dict\n",
+    "                \"duration\": duration\n",
+    "            })\n",
+    "        except TypeError as e:\n",
+    "            print(f\"Error processing question: {e}\")\n",
+    "            # skip \n",
+    "\n",
+    "# Convert results to DataFrame for analysis\n",
+    "df_results = pd.DataFrame(results)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "id": "45d88cc5",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "                  id level model_answer  correct  prompt_tokens  total_tokens  \\\n",
+      "0    Mercury_7175875   low            C     True            127           202   \n",
+      "1  Mercury_SC_409171   low            B     True            142           155   \n",
+      "\n",
+      "   reasoning_tokens  duration  \n",
+      "0                64  2.823480  \n",
+      "1                 0  5.100148  \n"
+     ]
+    }
+   ],
+   "source": [
+    "print(df_results.head())\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}