rapidsai-community · ncclementi · Apr 17, 2025
diff --git a/1.Intro_to_cuDF.ipynb b/1.Intro_to_cuDF.ipynb
@@ -72,13 +72,15 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "This `pageviews.csv` file contains just over 1M records of pageview counts from Wikipedia in various languages.\n",
+    "\n",
     "The data we will use in this tutorial is too small to really benefit from GPU acceleration, but we will explore it \n",
     "anyway."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -182,7 +184,19 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In descending order"
+    "**Exercise**: Get `grouped_pageviews` sorted in descending order.<br>\n",
+    "_Hint_: Check the [cuDF docs](https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.dataframe.sort_values/).\n",
+    "\n",
+    "<details>\n",
+    "  <summary>Solution (click dropdown) </summary>\n",
+    "  <p>\n",
+    "\n",
+    "```python\n",
+    "# To run this, type it in a code cell.\n",
+    "grouped_pageviews.sort_values('page', ascending=False)\n",
+    "```\n",
+    "  </p>\n",
+    "</details>"
    ]
   },
   {
@@ -191,7 +205,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "grouped_pageviews.sort_values('page', ascending=False)"
+    "# your solution here"
    ]
   },
   {

diff --git a/2.cudf_pandas.ipynb b/2.cudf_pandas.ipynb
@@ -244,7 +244,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": null,
+      "execution_count": 6,
       "metadata": {},
       "outputs": [],
       "source": [
@@ -310,6 +310,49 @@
         ")"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "**Exercise:** Find the top 5 most common parking violations for vehicles that are either SUVs (Vehicle Body Type = \"SUBN\")\n",
+        "or pickup trucks (Vehicle Body Type = \"PICK\"), but only for vehicles made after 2010, and show the count for each violation type.\n",
+        "\n",
+        "<details>\n",
+        "  <summary>Solution (click dropdown) </summary>\n",
+        "  <p>\n",
+        "\n",
+        "```python\n",
+        "# To run this, type it in a code cell.\n",
+        "\n",
+        "# Filter for SUVs and pickup trucks made after 2010\n",
+        "recent_suv_pickup = df[\n",
+        "    (df[\"Vehicle Body Type\"].isin([\"SUBN\", \"PICK\"])) &\n",
+        "    (df[\"Vehicle Year\"] > 2010)\n",
+        "]\n",
+        "\n",
+        "# Group by violation type and count, then get top 5\n",
+        "(\n",
+        "    recent_suv_pickup\n",
+        "    .groupby(\"Violation Description\")\n",
+        "    .size()\n",
+        "    .sort_values(ascending=False)\n",
+        "    .head(5)\n",
+        "    .rename(\"Number of Violations\")\n",
+        ")\n",
+        "```\n",
+        "  </p>\n",
+        "</details>"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# your solution here"
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {

diff --git a/3.cudf_polars_engine.ipynb b/3.cudf_polars_engine.ipynb
@@ -124,7 +124,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 3,
+      "execution_count": 2,
       "id": "a04f58fd-3df5-44ee-9a59-67d5c5146b08",
       "metadata": {
         "id": "a04f58fd-3df5-44ee-9a59-67d5c5146b08",
@@ -240,7 +240,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 7,
+      "execution_count": 6,
       "id": "308471b8-1086-470c-ac32-95e8af55fc9a",
       "metadata": {
         "id": "308471b8-1086-470c-ac32-95e8af55fc9a"
@@ -381,6 +381,43 @@
       },
       "source": [
         "Great! We see a nice performance gain when using the GPU engine!\n",
+        "\n",
+        "**Exercise:** Find the average transaction amount for each expense type.\n",
+        "\n",
+        "<details>\n",
+        "  <summary>Solution (click dropdown) </summary>\n",
+        "  <p>\n",
+        "\n",
+        "```python\n",
+        "# To run this, type it in a code cell.\n",
+        "ex_res_gpu = (\n",
+        "    transactions.group_by(\"EXP_TYPE\")\n",
+        "    .agg(pl.col(\"AMOUNT\").mean())\n",
+        "    .sort(by=\"AMOUNT\", descending=True)\n",
+        "    .collect(engine=gpu_engine)\n",
+        ")\n",
+        "ex_res_gpu\n",
+        "```\n",
+        "  </p>\n",
+        "</details>\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "b0380a1e",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# your solution here"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "9f185ac0",
+      "metadata": {},
+      "source": [
         "\n",
         "## What about Polars SQL? \n",
         "\n",
@@ -482,7 +519,58 @@
         "id": "7080def6-d4e5-408c-8c8b-83a0895b0c3b"
       },
       "source": [
-        "Again, we see a nice speedup using the GPU engine."
+        "Again, we see a nice speedup using the GPU engine.\n",
+        "\n",
+        "**Exercise:** Find the 10 busiest days by number of transactions, and show the transaction count and the average transaction amount for each of those days.\n",
+        "\n",
+        "_Hints_:\n",
+        "1. First, think about what columns you need to group by to get daily transactions\n",
+        "   - You'll need YEAR, MONTH, and DAY columns\n",
+        "   - Remember that these are separate columns in the dataset\n",
+        "\n",
+        "2. For each day, we want to know:\n",
+        "   - How many transactions occurred (len)\n",
+        "   - The average transaction amount (mean)\n",
+        "\n",
+        "3. The aggregation should include:\n",
+        "   - A count of all transactions\n",
+        "   - The mean of the AMOUNT column\n",
+        "   - Use .alias() to give meaningful names to the results\n",
+        "\n",
+        "4. After getting the results:\n",
+        "   - Sort by transaction count in descending order\n",
+        "   - Take only the top 10 busiest days\n",
+        "\n",
+        "<details>\n",
+        "  <summary>Solution (click dropdown) </summary>\n",
+        "  <p>\n",
+        "\n",
+        "```python\n",
+        "# To run this, type it in a code cell.\n",
+        "active_days_gpu = (\n",
+        "    transactions.group_by([\"YEAR\", \"MONTH\", \"DAY\"])\n",
+        "    .agg([\n",
+        "        pl.len().alias(\"transaction_count\"),\n",
+        "        pl.col(\"AMOUNT\").mean().alias(\"avg_amount\")\n",
+        "    ])\n",
+        "    .sort(by=\"transaction_count\", descending=True)\n",
+        "    .head(10)\n",
+        "    .collect(engine=gpu_engine)\n",
+        ")\n",
+        "active_days_gpu\n",
+        "```\n",
+        "  </p>\n",
+        "</details>\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 16,
+      "id": "d35a0534",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# your solution here"
       ]
     },
     {

diff --git a/5.cuml_accel.ipynb b/5.cuml_accel.ipynb
@@ -270,47 +270,38 @@
         "Having one model trained in such short time allow us to quickly iterate on the hyperparameter configuration and find a \n",
         "model that performs better with excellent speedups.\n",
         "\n",
-        "For example, let's see what happens with `max_depth=30`"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/",
-          "height": 115
-        },
-        "id": "mfZamg7FVoPe",
-        "outputId": "19d44cd9-9863-495b-ac8d-067118fd9c22"
-      },
-      "outputs": [],
-      "source": [
-        "%%time\n",
+        "For example, let's see what happens with a different `max_depth`.\n",
         "\n",
+        "**Exercise:** Train the `RandomForestClassifier` with a different set of hyperparameter values and analyze the results.\n",
+        "\n",
+        "<details>\n",
+        "  <summary>Solution (click dropdown) </summary>\n",
+        "  <p>\n",
+        "\n",
+        "```python\n",
+        "# To run this, type it in a code cell.\n",
         "clf = RandomForestClassifier(\n",
         "    n_estimators=100,\n",
         "    max_depth=30,\n",
         "    max_features=1.0,\n",
         "    n_jobs=-1,\n",
         ")\n",
-        "clf.fit(X_train, y_train)"
+        "clf.fit(X_train, y_train)\n",
+        "\n",
+        "y_pred = clf.predict(X_test)\n",
+        "print(classification_report(y_test, y_pred))\n",
+        "```\n",
+        "  </p>\n",
+        "</details>\n"
       ]
     },
     {
       "cell_type": "code",
       "execution_count": null,
-      "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "E97LObEYVocu",
-        "outputId": "e560b1a6-1b12-4cb5-bc66-09b3530af692"
-      },
+      "metadata": {},
       "outputs": [],
       "source": [
-        "y_pred = clf.predict(X_test)\n",
-        "print(classification_report(y_test, y_pred))"
+        "# your solution here"
       ]
     },
     {