A hospital data management system that uses Neo4j graph database to store and query healthcare data, built for intelligent chatbot integration.
This project provides an ETL (Extract, Transform, Load) pipeline to import structured hospital data from CSV files into a Neo4j graph database. The data model includes hospitals, patients, physicians, payers, visits, and reviews - creating a comprehensive healthcare knowledge graph with relationships between all entities.
Hospital-Chatbot/
├── data/ # CSV data files (excluded from git)
│ ├── healthcare_dataset.csv
│ ├── hospitals.csv
│ ├── patients.csv
│ ├── physicians.csv
│ ├── payers.csv
│ ├── visits.csv
│ └── reviews.csv
├── hospital_neo4j_etl/ # Neo4j ETL package
│ ├── src/
│ │ ├── hospital_bulk_csv_write.py # Main ETL script
│ │ └── entrypoint.sh # Docker entrypoint
│ ├── Dockerfile # Docker image definition
│ └── pyproject.toml # Python package config
├── analysis.py # Data analysis using Polars
├── docker-compose.yml # Docker orchestration (ETL service)
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore rules
└── README.md
- Hospital: Healthcare facilities (ID, name, state)
- Patient: Patient records (ID, name, sex, DOB, blood type)
- Physician: Medical staff (ID, name, DOB, graduation year, medical school, salary)
- Payer: Insurance providers (ID, name)
- Visit: Patient hospital visits (ID, room number, admission type, dates, diagnosis, treatment)
- Review: Hospital reviews (ID, review text, patient/physician/hospital names)
- AT:
(Visit)-[:AT]->(Hospital)- Visit occurred at a hospital - WRITES:
(Visit)-[:WRITES]->(Review)- Visit resulted in a review - TREATS:
(Physician)-[:TREATS]->(Visit)- Physician treated a visit - COVERED_BY:
(Visit)-[:COVERED_BY {service_date, billing_amount}]->(Payer)- Visit covered by insurance - HAS:
(Patient)-[:HAS]->(Visit)- Patient had a visit - EMPLOYS:
(Hospital)-[:EMPLOYS]->(Physician)- Hospital employs physician
- Python 3.11+ (recommended: 3.11)
- Docker Desktop (for containerized deployment)
- Neo4j Database (optional if using Docker)
-
Clone the repository
git clone <your-repo-url> cd Hospital-Chatbot
-
Create and activate virtual environment
python -m venv venv .\venv\Scripts\Activate.ps1 # Windows PowerShell
-
Install dependencies
pip install -r requirements.txt -
Prepare your data
- Place your CSV files in the
data/directory - Ensure CSV files match the expected schema (see CSV File Requirements below)
- Place your CSV files in the
Note: The current docker-compose.yml includes only the ETL service. You'll need to run Neo4j separately or add it to the compose file.
-
Create
.envfile in project root:# Neo4j Configuration NEO4J_URI=bolt://localhost:7687 NEO4J_USERNAME=neo4j NEO4J_PASSWORD=your_password_here # CSV File Paths (use file:/// URI format for Neo4j) HOSPITALS_CSV_PATH=file:///data/hospitals.csv PAYERS_CSV_PATH=file:///data/payers.csv PHYSICIANS_CSV_PATH=file:///data/physicians.csv PATIENTS_CSV_PATH=file:///data/patients.csv VISITS_CSV_PATH=file:///data/visits.csv REVIEWS_CSV_PATH=file:///data/reviews.csv
-
Build and run:
docker compose up --build
-
Start Neo4j (install separately or use Docker)
# Using Docker for Neo4j only docker run -d \ --name neo4j \ -p 7474:7474 -p 7687:7687 \ -e NEO4J_AUTH=neo4j/password123 \ neo4j:5.14.0
-
Set environment variables and run ETL:
# Set environment variables (or use .env file with python-dotenv) $env:NEO4J_URI="bolt://localhost:7687" $env:NEO4J_USERNAME="neo4j" $env:NEO4J_PASSWORD="password123" # ... set CSV paths ... python hospital_neo4j_etl/src/hospital_bulk_csv_write.py
Analyze CSV data before loading into Neo4j:
python analysis.pyThis will display:
- Dataset dimensions (rows, columns)
- First 5 rows of hospital data
- First 5 rows of physician data
- ✅ No hardcoded credentials - All secrets use environment variables
- ✅
.envexcluded - Environment files not tracked in git - ✅ Patient data protected - CSV files in
data/folder excluded from version control - ✅ Virtual environment ignored -
venv/not tracked ⚠️ Change default passwords - Update Neo4j password in production
neo4j==5.14.1 # Neo4j Python driver
polars==0.19.19 # Fast DataFrame library
retry==0.9.2 # Retry decorator for fault tolerance
python-dotenv==1.0.0 # Environment variable management
ipython==8.18.1 # Enhanced Python shell
black # Code formatter
flake8 # Linting
- ✅ Comprehensive ETL Pipeline: Loads 6 node types and 6 relationship types
- ✅ Automatic Retry Logic: Retries up to 100 times with 10-second delays (handles Neo4j startup)
- ✅ Data Integrity: Uniqueness constraints on all node IDs
- ✅ Bulk CSV Import: Efficient loading using Neo4j's
LOAD CSV WITH HEADERS - ✅ Dockerized Deployment: Containerized ETL service
- ✅ Comprehensive Logging: Track pipeline progress with detailed logs
- ✅ Well-Documented Code: Extensive inline comments explaining each step
hospital_id(integer)hospital_name(string)hospital_state(string)
payer_id(integer)payer_name(string)
physician_id(integer)physician_name(string)physician_dob(date string)physician_grad_year(year)medical_school(string)salary(float)
patient_id(integer)patient_name(string)patient_sex(string)patient_dob(date string)patient_blood_type(string)
visit_id(integer)room_number(integer)admission_type(string)date_of_admission(date string)test_results(string)visit_status(string)chief_complaint(string)treatment_description(text)primary_diagnosis(string)discharge_date(date string)hospital_id(integer - foreign key)physician_id(integer - foreign key)patient_id(integer - foreign key)payer_id(integer - foreign key)billing_amount(float)
review_id(integer)review(text)patient_name(string)physician_name(string)hospital_name(string)visit_id(integer - foreign key)
Once data is loaded, you can query the graph using Cypher (Neo4j Browser at http://localhost:7474):
// Find all patients treated by a specific physician
MATCH (patient:Patient)-[:HAS]->(visit:Visit)<-[:TREATS]-(physician:Physician {name: "Dr. Smith"})
RETURN patient.name, visit.admission_date, visit.diagnosis
// Find hospitals with the most visits
MATCH (visit:Visit)-[:AT]->(hospital:Hospital)
RETURN hospital.name, COUNT(visit) AS total_visits
ORDER BY total_visits DESC
LIMIT 10
// Calculate average billing by insurance payer
MATCH (visit:Visit)-[coverage:COVERED_BY]->(payer:Payer)
RETURN payer.name,
AVG(coverage.billing_amount) AS avg_billing,
COUNT(visit) AS total_visits
ORDER BY avg_billing DESC
// Find patient visit history with full context
MATCH (patient:Patient {name: "John Doe"})-[:HAS]->(visit:Visit)
MATCH (visit)-[:AT]->(hospital:Hospital)
MATCH (visit)<-[:TREATS]-(physician:Physician)
MATCH (visit)-[:COVERED_BY]->(payer:Payer)
RETURN patient.name,
visit.admission_date,
hospital.name,
physician.name,
payer.name,
visit.diagnosis
ORDER BY visit.admission_date DESCError: docker compose: command not found
- Solution: Install Docker Desktop or use
docker-compose(older version)
Error: unexpected end of JSON input
- Solution: Clean Docker cache:
docker system prune -f - Ensure Docker Desktop is running
Error: Unable to connect to Neo4j
- Check Neo4j is running:
docker psor check Neo4j Desktop - Verify
NEO4J_URIin.envmatches your Neo4j instance - Default:
bolt://localhost:7687
Error: invalid pyproject.toml config
- ✅ Fixed: Typo corrected (
dependecies→dependencies)
Error: Module not found
pip install -r requirements.txt- Ensure CSV files exist in
data/directory - Check column names match requirements exactly
- Verify CSV encoding (UTF-8 recommended)
- Fork the repository
- Create a feature branch (
git checkout -b feature/AmazingFeature) - Commit your changes (
git commit -m 'Add some AmazingFeature') - Push to the branch (
git push origin feature/AmazingFeature) - Open a Pull Request
This project is for educational purposes.
For questions or issues, please open a GitHub issue.
Built with ❤️ for healthcare data management