20 changes: 17 additions & 3 deletions 1.Intro_to_cuDF.ipynb
@@ -72,13 +72,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This `pageviews.csv` file contains just over 1M records of pageview counts from Wikipedia in various languages.\n",
"\n",
"The data we will use in this tutorial is too small to really benefit from GPU acceleration, but we will explore it \n",
"anyway."
]
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -182,7 +184,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In descending order"
"**Exercise**: Get `grouped_pageviews` sorted in descending order.\n",
"_Hint_: Check the [cudf docs](https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.dataframe.sort_values/)\n",
"\n",
"<details>\n",
" <summary>Solution (click dropdown) </summary>\n",
" <p>\n",
"\n",
"```python\n",
"# to run this type it in a code cell\n",
"grouped_pageviews.sort_values('page', ascending=False)\n",
"```\n",
" </p>\n",
"</details>"
]
},
{
@@ -191,7 +205,7 @@
"metadata": {},
"outputs": [],
"source": [
"grouped_pageviews.sort_values('page', ascending=False)"
"# your solution here"
]
},
{
Expand Down
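A minimal CPU-only sketch of the sorting exercise in `1.Intro_to_cuDF.ipynb`. It uses pandas as a stand-in, on the assumption that `cudf.DataFrame.sort_values` follows the pandas signature as the linked docs describe. The frame contents and the `views` column are invented for illustration and are not taken from `pageviews.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's grouped_pageviews; values are invented.
grouped_pageviews = pd.DataFrame(
    {"page": ["Alpha", "Zeta", "Mu"], "views": [10, 30, 20]}
)

# ascending=False reverses the default ascending sort order.
by_page = grouped_pageviews.sort_values("page", ascending=False)
by_views = grouped_pageviews.sort_values("views", ascending=False)

print(by_page["page"].tolist())
print(by_views["views"].tolist())
```

Sorting on `page` orders the rows by page name, while sorting on a count column (here the hypothetical `views`) ranks pages by traffic. Which one the exercise intends depends on the columns actually present in `grouped_pageviews`.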
45 changes: 44 additions & 1 deletion 2.cudf_pandas.ipynb
@@ -244,7 +244,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
@@ -310,6 +310,49 @@
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise:** Find the top 5 most common parking violations for vehicles that are either SUVs (Vehicle Body Type = \"SUBN\")\n",
"or pickup trucks (Vehicle Body Type = \"PICK\"), but only for vehicles made after 2010, and show the count for each violation type.\n",
"\n",
"<details>\n",
" <summary>Solution (click dropdown) </summary>\n",
" <p>\n",
"\n",
"```python\n",
"# to run this type it in a code cell\n",
"\n",
"# Filter for SUVs and pickup trucks made after 2010\n",
"recent_suv_pickup = df[\n",
" (df[\"Vehicle Body Type\"].isin([\"SUBN\", \"PICK\"])) & \n",
" (df[\"Vehicle Year\"] > 2010)\n",
"]\n",
"\n",
"# Group by violation type and count, then get top 5\n",
"(\n",
" recent_suv_pickup\n",
" .groupby(\"Violation Description\")\n",
" .size()\n",
" .sort_values(ascending=False)\n",
" .head(5)\n",
" .rename(\"Number of Violations\")\n",
")\n",
"```\n",
" </p>\n",
"</details>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# your solution here"
]
},
{
"cell_type": "markdown",
"metadata": {
Expand Down
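A small sketch for sanity-checking the parking-violations solution in `2.cudf_pandas.ipynb` before running it on the full dataset. The rows below are invented; only the column names come from the exercise. It runs on plain pandas, and the assumption is that the same code behaves identically once `cudf.pandas` is loaded.

```python
import pandas as pd

# Invented rows that reuse the column names from the exercise.
df = pd.DataFrame(
    {
        "Vehicle Body Type": ["SUBN", "PICK", "SUBN", "SDN", "SUBN", "PICK"],
        "Vehicle Year": [2015, 2012, 2008, 2018, 2020, 2011],
        "Violation Description": [
            "NO PARKING", "NO PARKING", "NO PARKING",
            "EXPIRED METER", "EXPIRED METER", "NO PARKING",
        ],
    }
)

# Keep SUVs and pickups made after 2010; both conditions must hold.
recent_suv_pickup = df[
    df["Vehicle Body Type"].isin(["SUBN", "PICK"]) & (df["Vehicle Year"] > 2010)
]

# Count rows per violation type and keep the five most frequent.
top_violations = (
    recent_suv_pickup.groupby("Violation Description")
    .size()
    .sort_values(ascending=False)
    .head(5)
    .rename("Number of Violations")
)
print(top_violations)
```

With this toy data, the 2008 SUV and the sedan are filtered out, so four rows remain before grouping. Checking the filter on a handful of rows like this makes it easier to spot a misplaced parenthesis around the `&` conditions.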
94 changes: 91 additions & 3 deletions 3.cudf_polars_engine.ipynb
@@ -124,7 +124,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 2,
"id": "a04f58fd-3df5-44ee-9a59-67d5c5146b08",
"metadata": {
"id": "a04f58fd-3df5-44ee-9a59-67d5c5146b08",
@@ -240,7 +240,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 6,
"id": "308471b8-1086-470c-ac32-95e8af55fc9a",
"metadata": {
"id": "308471b8-1086-470c-ac32-95e8af55fc9a"
@@ -381,6 +381,43 @@
},
"source": [
"Great! We see a nice performance gain when using the GPU engine!\n",
"\n",
"**Exercise:** Find the average transaction amount for each expense type.\n",
"\n",
"<details>\n",
" <summary>Solution (click dropdown) </summary>\n",
" <p>\n",
"\n",
"```python\n",
"# to run this type it in a code cell\n",
"ex_res_gpu = (\n",
" transactions.group_by(\"EXP_TYPE\")\n",
" .agg(pl.col(\"AMOUNT\").mean())\n",
" .sort(by=\"AMOUNT\", descending=True)\n",
" .collect(engine=gpu_engine)\n",
")\n",
"ex_res_gpu\n",
"```\n",
" </p>\n",
"</details>\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b0380a1e",
"metadata": {},
"outputs": [],
"source": [
"# your solution here"
]
},
{
"cell_type": "markdown",
"id": "9f185ac0",
"metadata": {},
"source": [
"\n",
"## What about Polars SQL? \n",
"\n",
@@ -482,7 +519,58 @@
"id": "7080def6-d4e5-408c-8c8b-83a0895b0c3b"
},
"source": [
"Again, we see a nice speedup using the GPU engine."
"Again, we see a nice speedup using the GPU engine.\n",
"\n",
"**Exercise:** Find the 10 busiest days by number of transactions, and report the transaction count and the average transaction amount for each of those days.\n",
"\n",
"_Hints_:\n",
"1. First, think about what columns you need to group by to get daily transactions\n",
" - You'll need YEAR, MONTH, and DAY columns\n",
" - Remember that these are separate columns in the dataset\n",
"\n",
"2. For each day, we want to know:\n",
" - How many transactions occurred (len)\n",
" - The average transaction amount (mean)\n",
"\n",
"3. The aggregation should include:\n",
" - A count of all transactions\n",
" - The mean of the AMOUNT column\n",
" - Use .alias() to give meaningful names to the results\n",
"\n",
"4. After getting the results:\n",
" - Sort by transaction count in descending order\n",
" - Take only the top 10 busiest days\n",
"\n",
"<details>\n",
" <summary>Solution (click dropdown) </summary>\n",
" <p>\n",
"\n",
"```python\n",
"# to run this type it in a code cell\n",
"active_days_gpu = (\n",
" transactions.group_by([\"YEAR\", \"MONTH\", \"DAY\"])\n",
" .agg([\n",
" pl.len().alias(\"transaction_count\"),\n",
" pl.col(\"AMOUNT\").mean().alias(\"avg_amount\")\n",
" ])\n",
" .sort(by=\"transaction_count\", descending=True)\n",
" .head(10)\n",
" .collect(engine=gpu_engine)\n",
")\n",
"active_days_gpu\n",
"```\n",
" </p>\n",
"</details>\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "d35a0534",
"metadata": {},
"outputs": [],
"source": [
"# your solution here"
]
},
{
Expand Down
45 changes: 18 additions & 27 deletions 5.cuml_accel.ipynb
@@ -270,47 +270,38 @@
"Training a model in such a short time allows us to iterate quickly on the hyperparameter configuration and find a \n",
"model that performs better, while keeping excellent speedups.\n",
"\n",
"For example, let's see what happens with `max_depth=30`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 115
},
"id": "mfZamg7FVoPe",
"outputId": "19d44cd9-9863-495b-ac8d-067118fd9c22"
},
"outputs": [],
"source": [
"%%time\n",
"For example, let's see what happens with a different `max_depth`\n",
"\n",
"**Exercise:** Train the `RandomForestClassifier` with a different set of values and analyze the results. \n",
"\n",
"<details>\n",
" <summary>Solution (click dropdown) </summary>\n",
" <p>\n",
"\n",
"```python\n",
"# to run this type it in a code cell\n",
"clf = RandomForestClassifier(\n",
" n_estimators=100,\n",
" max_depth=30,\n",
" max_features=1.0,\n",
" n_jobs=-1,\n",
")\n",
"clf.fit(X_train, y_train)"
"clf.fit(X_train, y_train)\n",
"\n",
"y_pred = clf.predict(X_test)\n",
"print(classification_report(y_test, y_pred))\n",
"```\n",
" </p>\n",
"</details>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "E97LObEYVocu",
"outputId": "e560b1a6-1b12-4cb5-bc66-09b3530af692"
},
"metadata": {},
"outputs": [],
"source": [
"y_pred = clf.predict(X_test)\n",
"print(classification_report(y_test, y_pred))"
"# your solution here"
]
},
{
Expand Down
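One way to approach the `5.cuml_accel.ipynb` exercise is to loop over several `max_depth` values and compare a single metric. The sketch below is an illustration only: it uses a synthetic dataset from `make_classification` instead of the notebook's data, plain scikit-learn on the CPU, and accuracy in place of the full `classification_report`. The depth values and `random_state` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X_train / X_test split.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train one forest per candidate depth and record its test accuracy.
scores = {}
for max_depth in (2, 10, 30):
    clf = RandomForestClassifier(
        n_estimators=50,
        max_depth=max_depth,
        max_features=1.0,
        n_jobs=-1,
        random_state=0,
    )
    clf.fit(X_train, y_train)
    scores[max_depth] = accuracy_score(y_test, clf.predict(X_test))

for max_depth, acc in scores.items():
    print(f"max_depth={max_depth}: accuracy={acc:.3f}")
```

In the notebook, wrapping the loop in `%%time` would also show how training time changes with depth, which is the point of the fast-iteration argument above.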