docs: add llm kmeans notebook as an included example #177

Merged
merged 27 commits into from
Nov 9, 2023
Changes from 1 commit
some fixes
Henry J Solberg committed Nov 8, 2023
commit 392488cdf059f355bf770034004b8fbd4d0391f6
55 changes: 49 additions & 6 deletions notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb
@@ -61,7 +61,7 @@
 "\n",
 "1. Use PaLM2TextEmbeddingGenerator to [generate text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) for each of 10000 complaints sent to an online bank. If you're not familiar with what a text embedding is, it's a list of numbers that are like coordinates in an imaginary \"meaning space\" for sentences. (It's like [word embeddings](https://en.wikipedia.org/wiki/Word_embedding), but for more general text.) The important point for our purposes is that similar sentences are close to each other in this imaginary space.\n",
 "2. Use KMeans clustering to group together complaints whose text embeddings are near to eachother. This will give us sets of similar complaints, but we don't yet know _why_ these complaints are similar.\n",
-"3. Simply ask PaLM2TextGenerator in English what the difference is between the groups of complaints that we got. Thanks to the power of modern LLMs, the response might give us a very good idea of what these complaints are all about, but remember to [\"understand the limits of your dataset and model.\"](https://ai.google/responsibility/responsible-ai-practices/#:~:text=Understand%20the%20limitations%20of%20your%20dataset%20and%20model)\n",
+"3. Prompt PaLM2TextGenerator in English asking what the difference is between the groups of complaints that we got. Thanks to the power of modern LLMs, the response might give us a very good idea of what these complaints are all about, but remember to [\"understand the limits of your dataset and model.\"](https://ai.google/responsibility/responsible-ai-practices/#:~:text=Understand%20the%20limitations%20of%20your%20dataset%20and%20model)\n",
 "\n",
 "We will tie these pieces together in Python using BigQuery DataFrames. [Click here](https://cloud.google.com/bigquery/docs/dataframes-quickstart) to learn more about BigQuery DataFrames!"
 ]
@@ -87,13 +87,51 @@
 "\n",
 "* BigQuery (compute)\n",
 "* BigQuery ML\n",
+"* Generative AI support on Vertex AI\n",
 "\n",
-"Learn about [BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models),\n",
+"Learn about [BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models), [Generative AI support on Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing#generative_ai_models),\n",
 "and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),\n",
 "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n",
 "to generate a cost estimate based on your projected usage."
 ]
 },
+{
+"attachments": {},
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Before you begin\n",
+"\n",
+"Complete the tasks in this section to set up your environment."
+]
+},
+{
+"attachments": {},
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Set up your Google Cloud project\n",
+"\n",
+"**The following steps are required, regardless of your notebook environment.**\n",
+"\n",
+"1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.\n",
+"\n",
+"2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n",
+"\n",
+"3. [Click here](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com,bigqueryconnection.googleapis.com,cloudfunctions.googleapis.com,run.googleapis.com,artifactregistry.googleapis.com,cloudbuild.googleapis.com,cloudresourcemanager.googleapis.com) to enable the following APIs:\n",
+"\n",
+" * BigQuery API\n",
+" * BigQuery Connection API\n",
+" * Cloud Functions API\n",
+" * Cloud Run API\n",
+" * Artifact Registry API\n",
+" * Cloud Build API\n",
+" * Cloud Resource Manager API\n",
+" * Vertex AI API\n",
+"\n",
+"4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk)."
+]
+},
 {
 "attachments": {},
 "cell_type": "markdown",
@@ -120,10 +158,7 @@
 },
 "outputs": [],
 "source": [
-"import bigframes.pandas as bpd\n",
-"\n",
-"bpd.options.bigquery.project = \"bigframes-dev\"\n",
-"bpd.options.bigquery.location = \"us\""
+"import bigframes.pandas as bpd"
 ]
 },
 {
@@ -219,6 +254,14 @@
 "combined_df = downsampled_issues_df.join(predicted_embeddings)"
 ]
 },
+{
+"attachments": {},
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We now have the complaints and their text embeddings as two columns in our combined_df. Recall that complaints with numerically similar text embeddings should have similar meanings semantically. We will now group similar complaints together."
+]
+},
 {
 "attachments": {},
 "cell_type": "markdown",
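
The markdown cell added in this commit says that complaints with numerically similar text embeddings will be grouped together. As a minimal sketch of that idea only, here is k-means in plain Python on made-up 2-D vectors. This is not the notebook's code: the notebook runs KMeans through BigQuery DataFrames on real embeddings, and `toy_kmeans`, `toy_embeddings`, and the "first k points" initialization are hypothetical names and choices introduced for illustration.

```python
import math


def toy_kmeans(points, k, iterations=10):
    """Group points into k clusters and return one cluster label per point."""
    # Assumed initialization for this sketch: the first k points are the
    # starting centroids (real implementations pick them more carefully).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels


# Made-up 2-D stand-ins for embeddings: items 0 and 2 are close together,
# and items 1 and 3 are close together.
toy_embeddings = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels = toy_kmeans(toy_embeddings, k=2)
```

Vectors that are near each other receive the same label, which is the property the notebook relies on before prompting the text generator about each group.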