A collection of AI-assisted development tools and examples for working with Databricks.
This repository contains practical examples and tools for developing with Databricks using modern AI coding assistants like Cursor and Claude Code. The focus is on demonstrating best practices for local development with Databricks Connect and PySpark.
```
databricks.dev/
├── ai-tools/
│   ├── cursor/
│   │   └── pyspark/
│   │       ├── .cursor/                    # Cursor IDE rules
│   │       └── dbconnect-nyc-example/      # NYC Taxi example with Databricks Connect
│   └── claude-code/
│       └── pyspark/
│           ├── dbconnect-nyc-example/      # NYC Taxi example with Claude Code
│           └── dbconnect-million-songs/    # Million Songs SPD pipeline example
├── LICENSE
└── README.md
```
This repository includes two implementations of the same NYC Taxi example project, each tailored for different AI coding assistants:
Location: `ai-tools/cursor/pyspark/dbconnect-nyc-example/`

A minimal PySpark application demonstrating Databricks Connect with Cursor IDE. Features:

- Cursor IDE rules (`.cursor/rules/`) for Python development, project structure, and testing
- 12 vibe coding prompts for generating NYC taxi analysis functions
- 3 implemented functions with comprehensive tests
- Complete documentation for AI-assisted development
📖 Read the Cursor example README
Location: `ai-tools/claude-code/pyspark/dbconnect-nyc-example/`

The same NYC Taxi example, optimized for Claude Code. Features:

- Claude Code configuration (`.claude/`) with project-specific rules
- 12 vibe coding prompts ready for use with Claude Code
- Same data analysis capabilities as the Cursor version
- Streamlined for a VS Code + Claude Code workflow
📖 Read the Claude Code example README
Quick Start (NYC Taxi):
```bash
# Choose your AI tool:
cd ai-tools/cursor/pyspark/dbconnect-nyc-example        # For Cursor
# OR
cd ai-tools/claude-code/pyspark/dbconnect-nyc-example   # For Claude Code

# 1. Authenticate with Databricks
databricks auth login --profile DEFAULT --host https://your-workspace.databricks.com

# 2. Install dependencies
uv sync

# 3. Run the application
uv run python src/main.py

# 4. Run tests
uv run pytest tests/ -v
```

Expected Output:
```
Starting NYC Taxi Data Analysis...
==================================================
✓ Connected to Databricks
✓ Loaded NYC taxi data: 22,699,369 records

Sample NYC Taxi Trips:
--------------------------------------------------
+--------------------+-------------+-----------+----------+-----------+
|tpep_pickup_datetime|trip_distance|fare_amount|pickup_zip|dropoff_zip|
+--------------------+-------------+-----------+----------+-----------+
|2016-02-14 16:52:13 |2.25         |9.0        |10282     |10171      |
|2016-02-04 18:44:19 |8.04         |26.0       |10110     |10023      |
|2016-02-17 17:13:57 |0.72         |5.5        |10103     |10022      |
...

Fare Per Mile Analysis (Top 10 by fare/mile):
--------------------------------------------------
+-------------+-----------+---------------------+----------+-----------+
|trip_distance|fare_amount|average_fare_per_mile|pickup_zip|dropoff_zip|
+-------------+-----------+---------------------+----------+-----------+
|0.01         |52.0       |5200.00              |10282     |10282      |
|0.03         |107.5      |3583.33              |10019     |10019      |
...

Analysis complete!
```
What the NYC Taxi Examples Demonstrate:
- Connect to Databricks using Databricks Connect
- Use serverless compute for data processing
- Query sample data (NYC taxi trips from `samples.nyctaxi.trips`)
- Work with DataFrames in PySpark
- Perform aggregations, filtering, and time-series analysis
- Develop with AI assistance using the pre-written prompts
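As a point of reference, the core pattern behind these examples fits in a few lines. The sketch below is illustrative rather than the repository's actual code, and assumes a `databricks-connect` version with serverless support plus a configured `DEFAULT` CLI profile:

```python
# Minimal sketch: connect over Databricks Connect and run one aggregation.
# Assumes databricks-connect with serverless support and a DEFAULT CLI profile.
from databricks.connect import DatabricksSession
from pyspark.sql import functions as F

spark = DatabricksSession.builder.profile("DEFAULT").serverless(True).getOrCreate()

# The sample table queried by both NYC Taxi examples.
trips = spark.read.table("samples.nyctaxi.trips")

# Average fare per mile for trips with a non-zero distance.
(
    trips.filter(F.col("trip_distance") > 0)
    .withColumn("fare_per_mile", F.col("fare_amount") / F.col("trip_distance"))
    .agg(F.round(F.avg("fare_per_mile"), 2).alias("average_fare_per_mile"))
    .show()
)
```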
Next Steps:
- Explore the 12 vibe coding prompts for generating new analysis functions
- Use your AI assistant to implement additional patterns (9 prompts ready to use)
- Run tests with `uv run pytest tests/ -v` to validate your code
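For orientation, a data validation test over Databricks Connect can be as small as the sketch below; the fixture and test names are hypothetical, not the repository's actual suite:

```python
# Illustrative pytest sketch for validating code against the sample table.
# Fixture and test names are hypothetical, not the repo's actual suite.
import pytest
from databricks.connect import DatabricksSession

@pytest.fixture(scope="session")
def spark():
    # One remote session shared across the whole test run.
    return DatabricksSession.builder.profile("DEFAULT").serverless(True).getOrCreate()

def test_trips_table_has_expected_columns(spark):
    trips = spark.read.table("samples.nyctaxi.trips")
    assert {"tpep_pickup_datetime", "trip_distance", "fare_amount"} <= set(trips.columns)

def test_trips_table_is_nonempty(spark):
    assert spark.read.table("samples.nyctaxi.trips").limit(1).count() == 1
```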
Location: `ai-tools/claude-code/pyspark/dbconnect-million-songs/`

A comprehensive example demonstrating Spark Declarative Pipelines (SPD) with Databricks Asset Bundles. Features:

- Spark Declarative Pipelines: Declarative ETL pipeline for bronze-layer data ingestion
- Databricks Asset Bundles: Infrastructure-as-code deployment with `databricks.yml`
- Auto Loader: Incremental CSV ingestion with schema inference
- Unity Catalog: Governed data storage in `catalog.schema.table` format
- Local Development: Query SPD-created tables using Databricks Connect (see the sketch below)
- Complete Testing: pytest suite with data quality validation
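To make the local-development bullet concrete: reading an SPD-created table locally is an ordinary three-level Unity Catalog read over Databricks Connect. In this sketch the catalog, schema, and table names are placeholders, not the example's real identifiers:

```python
# Sketch: read an SPD-created Unity Catalog table locally via Databricks Connect.
# The catalog.schema.table name is a placeholder, not the example's real table.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.profile("DEFAULT").serverless(True).getOrCreate()
bronze = spark.read.table("main.million_songs.songs_bronze")  # catalog.schema.table
print(f"Bronze row count: {bronze.count()}")
bronze.show(5)
```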
What This Example Demonstrates:
- Deploy SPD pipelines using `databricks bundle deploy`
- Ingest data from the Million Songs dataset into a bronze table (table definition sketched below)
- Use Auto Loader (cloudFiles) for incremental processing
- Query Unity Catalog tables from your local environment
- Test data quality and schema compliance
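For orientation, a bronze-layer Auto Loader table in a declarative pipeline typically looks like the sketch below. It uses the Databricks `dlt` Python API as one plausible implementation; the table name and source path are placeholders rather than values taken from the example:

```python
# Illustrative bronze-layer table definition using Auto Loader (cloudFiles).
# Runs inside a pipeline, not locally; names and paths are placeholders.
import dlt
from pyspark.sql import functions as F

@dlt.table(
    name="songs_bronze",
    comment="Raw Million Songs files ingested incrementally with Auto Loader.",
)
def songs_bronze():
    return (
        spark.readStream.format("cloudFiles")          # `spark` is provided by the pipeline runtime
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.inferColumnTypes", "true")
        .load("/databricks-datasets/songs/data-001/")  # placeholder source path
        .withColumn("ingested_at", F.current_timestamp())
    )
```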
📖 Read the Million Songs README
Quick Start:
```bash
cd ai-tools/claude-code/pyspark/dbconnect-million-songs

# Deploy the SPD pipeline
databricks bundle validate
databricks bundle deploy
databricks bundle run million_songs_spd

# Query the bronze table locally
uv sync
uv run src/main.py

# Run tests
uv run pytest tests/ -v
```

Prerequisites:

- Python 3.11+
- uv package manager
- Databricks CLI
- Access to a Databricks workspace
The examples use Databricks CLI authentication profiles. Set up your profile:
```bash
databricks auth login --profile DEFAULT --host https://your-workspace.databricks.com
```

Resources:

- Cursor with Databricks: AI Enhanced Development: Comprehensive guide by Dustin Vannoy on leveraging Cursor IDE with Databricks Connect, including setup, Cursor rules, and MCP integration
- Cursor Rules: Check out `ai-tools/cursor/pyspark/.cursor/rules/` for Python development and project structure rules
- Vibe Coding Prompts (Cursor): See `ai-tools/cursor/pyspark/dbconnect-nyc-example/docs/vibe_coding_nyc_taxi_prompts.md` for 12 interesting query patterns
- Claude Code Configuration: Check out `ai-tools/claude-code/pyspark/dbconnect-nyc-example/.claude/` for project-specific rules
- Vibe Coding Prompts (Claude Code): See `ai-tools/claude-code/pyspark/dbconnect-nyc-example/docs/vibe_coding_nyc_taxi_prompts.md` for 12 interesting query patterns
  - 3 implemented and tested (Average Fare Per Mile, Busiest Pickup Locations, Peak Hours Analysis)
  - 9 ready for AI-assisted development
- SPD Pipeline Example: See `ai-tools/claude-code/pyspark/dbconnect-million-songs/` for Spark Declarative Pipelines and Databricks Asset Bundles
- Databricks Connect Docs: https://docs.databricks.com/dev-tools/databricks-connect.html
- PySpark Documentation: https://spark.apache.org/docs/latest/api/python/
This repository is for educational and demonstration purposes. Feel free to fork and adapt the examples for your own use cases.
See LICENSE for details.