This tutorial shows you how to deploy and serve the gpt-oss-120b
language model by using the vLLM serving framework.
You deploy the model on a Google Kubernetes Engine (GKE) Autopilot cluster, and
the deployment consumes a single A4 virtual machine (VM), which has eight B200 GPUs.
This tutorial is intended for machine learning (ML) engineers, platform administrators and operators, and for data and AI specialists who are interested in using Kubernetes container orchestration capabilities to handle inference workloads.
Objectives
- Access gpt-oss-120b by using Hugging Face.
- Prepare your environment.
- Create a GKE cluster in Autopilot mode.
- Create a Kubernetes secret for Hugging Face credentials.
- Deploy a vLLM container to your GKE cluster.
- Interact with the gpt-oss language model by using curl.
- Clean up.
Costs
This tutorial uses billable components of Google Cloud, including GKE and the GPU resources that your cluster consumes.
To generate a cost estimate based on your projected usage, use the Pricing Calculator.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- Install the Google Cloud CLI.
- If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
- To initialize the gcloud CLI, run the following command:

  gcloud init

- Create or select a Google Cloud project.

  Roles required to select or create a project

  - Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

  - Create a Google Cloud project:

    gcloud projects create PROJECT_ID

    Replace PROJECT_ID with a name for the Google Cloud project that you're creating.
  - Select the Google Cloud project that you created:

    gcloud config set project PROJECT_ID

    Replace PROJECT_ID with your Google Cloud project name.
- Verify that billing is enabled for your Google Cloud project.
- Enable the required API:

  Roles required to enable APIs

  To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

  gcloud services enable container.googleapis.com
- Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/container.admin

  gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

  Replace the following:
  - PROJECT_ID: your project ID.
  - USER_IDENTIFIER: the identifier for your user account. For example, myemail@example.com.
  - ROLE: the IAM role that you grant to your user account. For a filled-in example, see the command after this list.
- Sign in to or create a Hugging Face account.
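For example, assuming a hypothetical project named my-gke-project and the user myemail@example.com, the command to grant the roles/container.admin role looks like the following:

# Hypothetical example; substitute your own project ID and user account.
gcloud projects add-iam-policy-binding my-gke-project \
    --member="user:myemail@example.com" \
    --role="roles/container.admin"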
Access gpt-oss by using Hugging Face
To use Hugging Face to access gpt-oss, do the following:
- Sign in to Hugging Face and explore the gpt-oss model.
- Create a Hugging Face read access token.
- Copy and save the read access token value. You use it later in this tutorial.
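Optionally, you can confirm that the token works before you continue. The following command is a convenience check against the Hugging Face whoami API, not a required step in this tutorial; replace HUGGING_FACE_TOKEN with your token value:

# Returns your Hugging Face account details if the token is valid.
curl -s -H "Authorization: Bearer HUGGING_FACE_TOKEN" \
    https://huggingface.co/api/whoami-v2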
Prepare your environment
To prepare your environment, set the default environment variables:
gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=RESERVATION_URL
export REGION=REGION
export CLUSTER_NAME=CLUSTER_NAME
export HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN
export NETWORK=NETWORK_NAME
export SUBNETWORK=SUBNETWORK_NAME
Replace the following:
- PROJECT_ID: the ID of the Google Cloud project where you want to create the GKE cluster.
- RESERVATION_URL: the URL of the reservation that you want to use to create your GKE cluster. Based on the project in which the reservation exists, specify one of the following values:
  - The reservation exists in your project: RESERVATION_NAME
  - The reservation exists in a different project, and your project can use the reservation: projects/RESERVATION_PROJECT_ID/reservations/RESERVATION_NAME
- REGION: the region where you want to create your GKE cluster. You can only create the cluster in the region where your reservation exists.
- CLUSTER_NAME: the name of the GKE cluster to create.
- HUGGING_FACE_TOKEN: the Hugging Face access token that you created in the previous section.
- NETWORK_NAME: the network that the GKE cluster uses. Specify one of the following values:
  - If you created a custom network, then specify the name of your network.
  - Otherwise, specify default.
- SUBNETWORK_NAME: the subnetwork that the GKE cluster uses. Specify one of the following values:
  - If you created a custom subnetwork, then specify the name of your subnetwork. You can only specify a subnetwork that exists in the same region as the reservation.
  - Otherwise, specify default.
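For reference, the following block shows what the configuration might look like. Every value here is a hypothetical placeholder; substitute your own project, reservation, region, cluster name, token, and network details:

# Hypothetical example values only.
gcloud config set project my-gke-project
gcloud config set billing/quota_project my-gke-project
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=b200-reservation
export REGION=us-central1
export CLUSTER_NAME=gpt-oss-cluster
export HUGGING_FACE_TOKEN=hf_xxxxxxxxxxxxxxxx
export NETWORK=default
export SUBNETWORK=default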
Create a GKE cluster in Autopilot mode
To create a GKE cluster in Autopilot mode, run the following command:
gcloud container clusters create-auto $CLUSTER_NAME \
--project=$PROJECT_ID \
--region=$REGION \
--release-channel=rapid \
--network=$NETWORK \
--subnetwork=$SUBNETWORK
Creating the GKE cluster might take some time to complete. To verify that Google Cloud has finished creating your cluster, go to the Kubernetes clusters page in the Google Cloud console.
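Alternatively, you can check the cluster status from the command line. The following command is an optional check that assumes the environment variables you set earlier in this tutorial:

# The cluster is ready when its status is RUNNING.
gcloud container clusters describe $CLUSTER_NAME \
    --region=$REGION \
    --format="value(status)"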
Create a Kubernetes secret for Hugging Face credentials
To create a Kubernetes secret for Hugging Face credentials, do the following:
- Configure kubectl to communicate with your GKE cluster:

  gcloud container clusters get-credentials $CLUSTER_NAME \
      --location=$REGION

- Create a Kubernetes secret to store your Hugging Face token:

  kubectl create secret generic hf-secret \
      --from-literal=hf_token=${HUGGING_FACE_TOKEN} \
      --dry-run=client -o yaml | kubectl apply -f -
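To optionally confirm that the secret was created, you can inspect it with kubectl. This check is not part of the original procedure:

# Shows the secret's metadata and key names without exposing the token value.
kubectl describe secret hf-secret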
Deploy a vLLM container to your GKE cluster
- Create a vllm-gpt-oss-120b.yaml file with your chosen vLLM deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gpt-oss-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpt-oss
  template:
    metadata:
      labels:
        app: gpt-oss
        ai.gke.io/model: gpt-oss-120b
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: vllm-inference
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250822_0916_RC01
        resources:
          requests:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "240Gi"
            nvidia.com/gpu: "8"
          limits:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "240Gi"
            nvidia.com/gpu: "8"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=2
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=8192
        - --max-num-seqs=4
        env:
        - name: MODEL_ID
          value: openai/gpt-oss-120b
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1200
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1200
          periodSeconds: 5
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-b200
        cloud.google.com/reservation-name: $RESERVATION_URL
        cloud.google.com/reservation-affinity: "specific"
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: oss-service
spec:
  selector:
    app: gpt-oss
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: vllm-gpt-oss-monitoring
spec:
  selector:
    matchLabels:
      app: gpt-oss
  endpoints:
  - port: 8000
    path: /metrics
    interval: 30s

- Apply the vllm-gpt-oss-120b.yaml file to your GKE cluster:

envsubst < vllm-gpt-oss-120b.yaml | kubectl apply -f -

During the deployment process, the container must download the gpt-oss-120b model from Hugging Face. For this reason, deployment of the container might take up to 20 minutes to complete.

- To see the completion status, run the following command:

kubectl wait \
    --for=condition=Available \
    --timeout=1200s deployment/vllm-gpt-oss-deployment

The --timeout=1200s flag allows the command to monitor the deployment for up to 20 minutes.
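While you wait, you can optionally watch the pod status and stream the server logs. These commands are a convenience and are not part of the original procedure:

# Watch the pod until it reports Running and 1/1 ready.
kubectl get pods -l app=gpt-oss -w

# Stream the vLLM server logs to follow the model download and startup.
kubectl logs -f deployment/vllm-gpt-oss-deployment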
Interact with the gpt-oss model by using curl
To verify the gpt-oss model that you deployed, do the following:
- Set up port forwarding to the gpt-oss model:

kubectl port-forward service/oss-service 8000:8000

- Open a new terminal window. You can then chat with your model by using curl:

curl http://127.0.0.1:8000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
      "model": "openai/gpt-oss-120b",
      "messages": [
        {
          "role": "user",
          "content": "Describe a sailboat in one short sentence?"
        }
      ]
    }'

The output that you see is similar to the following:
{ "id": "chatcmpl-2235c39759c040daae23ce2addc40c0a", "object": "chat.completion", "created": 1756831629, "model": "openai/gpt-oss-120b", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "A sleek vessel gliding on water, its cloth sails billowing like captured wind.", "refusal": null, "annotations": null, "audio": null, "function_call": null, "tool_calls": [], "reasoning_content": "User asks: \"Describe a sailboat in one short sentence?\" We need to produce a short sentence description. Should comply with policy. It's fine. Provide a short sentence." }, "logprobs": null, "finish_reason": "stop", "stop_reason": null } ], "service_tier": null, "system_fingerprint": null, "usage": { "prompt_tokens": 80, "total_tokens": 142, "completion_tokens": 62, "prompt_tokens_details": null }, "prompt_logprobs": null, "kv_transfer_params": null }Observe the performance of the model
Observe the performance of the model

To observe your model's performance, you can use the vLLM dashboard integration in Cloud Monitoring. This dashboard helps you view critical performance metrics for your model, such as token throughput, network latency, and error rates. For more information, see vLLM in the Monitoring documentation.
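The PodMonitoring resource that you deployed scrapes the /metrics path that vLLM exposes on port 8000. As an optional local check, you can read those Prometheus metrics directly through the same port-forwarding session:

# Print the first few Prometheus metrics that the vLLM server exposes.
curl -s http://127.0.0.1:8000/metrics | head -n 20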
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete your project
Delete a Google Cloud project:
gcloud projects delete PROJECT_ID
Delete the resources
- To delete the deployment and service in the vllm-gpt-oss-120b.yaml file and the Kubernetes secret from the GKE cluster, run the following commands:

kubectl delete -f vllm-gpt-oss-120b.yaml
kubectl delete secret hf-secret

- To delete your GKE cluster, run the following command:
gcloud container clusters delete $CLUSTER_NAME \
    --region=$REGION
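To optionally confirm that the cluster was deleted, you can list the clusters that remain in your project. This check is not part of the original procedure:

# After deletion completes, the cluster no longer appears in this list.
gcloud container clusters list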