KubERA (Kubernetes Error Root-cause Analysis) - AI-powered Kubernetes troubleshooting and monitoring.
Want to try KubERA immediately? Choose your setup method:

Option 1 - Interactive script:

./setup-playground.sh

Option 2 - Makefile:

# Set your OpenAI API key
export OPENAI_API_KEY="your-api-key-here"

# Set up complete testing environment (installs dependencies automatically)
make playground

# Start KubERA
make run
👉 See the complete playground guide →
The `make playground` command will automatically:
- ✅ Check and install required tools (kind, kubectl, Docker, etc.)
- 🔧 Create a local Kubernetes cluster
- 📊 Install Prometheus and ArgoCD
- 🚀 Deploy sample workloads for testing
- 🗄️ Set up the KubERA database
Access the dashboard at http://localhost:8501 after setup completes.
Prerequisites:
- macOS or Linux (Windows via WSL2)
- An OpenAI API key - get one from the OpenAI Platform
- 5-10 minutes for the initial setup
Clone the repository:

git clone <repository-url>
cd KubERA
# Set for current session
export OPENAI_API_KEY="your-api-key-here"
# Or add to your shell profile for persistence
echo 'export OPENAI_API_KEY="your-api-key-here"' >> ~/.zshrc
source ~/.zshrc
Option 1 - Interactive script:

./setup-playground.sh
- 🤝 Guided setup with prompts
- 🔍 Automatic dependency detection
- 💡 Helpful tips and troubleshooting
Option 2 - Makefile:

make playground
- 🚀 Automated one-command setup
- ⚡ Fastest path to running environment
# Start KubERA
make run
- Open the Dashboard: http://localhost:5000
- View the Timeline: See events from all sources
- Click on Broken Pods: Get AI-powered analysis
- Explore Filters: Filter by namespace, severity, source
- Check the Terminal: See real-time AI diagnosis
# Create a test pod with an image-pull issue (the registry doesn't exist)
kubectl run test-workload --image=internal-registry.local/app:latest
# Watch it appear in KubERA dashboard
# Click on it to see AI analysis
# View Kubernetes events
kubectl get events -A --sort-by=.lastTimestamp
# Check Prometheus metrics at http://localhost:9090
# Explore ArgoCD at http://localhost:8080
# Check what's running
kubectl get pods -A
# Reset database for fresh start
make reset-db
# Clean up everything
make destroy-all
# See all available commands
make help
# Docker not running? (macOS)
open /Applications/Docker.app
# Port already in use?
make destroy-all && make playground
# Need to reset everything?
make destroy-all
make playground
make run
KubERA includes built-in data anonymization to protect sensitive information when sending data to OpenAI's API for analysis.
KubERA automatically anonymizes the following sensitive data before sending to OpenAI:
| Data Type | Example Original | Example Anonymized |
|---|---|---|
| Pod Names | `my-app-12345-abcde` | `pod-001-xyz12` |
| Namespaces | `production` | `namespace-01` |
| Container Names | `web-server` | `container-01` |
| Docker Images | `myregistry.com/app:v1.2.3` | `registry.example.com/app-01:v1.0.0` |
| IP Addresses | `203.0.113.5` | `10.0.1.5` |
| URLs/Domains | `api.company.com` | `app-01.example.com` |
| Secrets/Tokens | `eyJhbGciOiJIUzI1...` | `***REDACTED-SECRET-01***` |
| File Paths | `/opt/myapp/config` | `/app/data/file-01` |
- Before AI Analysis: Sensitive data is replaced with consistent anonymous tokens
- AI Processing: OpenAI receives only anonymized data
- After Analysis: Original values are restored in the response
- User Notification: Terminal shows what was anonymized for transparency
- ✅ Consistent Mapping: Same values get same anonymous tokens (see the sketch after this list)
- ✅ Reversible: Original values restored in responses
- ✅ Transparent: Users see what was anonymized
- ✅ Configurable: Can be enabled/disabled per session
- ✅ No Storage: Mappings aren't persisted to disk
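To make those guarantees concrete, here is a minimal Python sketch of a consistent, reversible, in-memory mapping. The `Anonymizer` class and its method names are illustrative assumptions, not KubERA's actual implementation:

```python
class Anonymizer:
    """Toy reversible anonymizer: the same input always maps to the same token."""

    def __init__(self):
        self.forward = {}  # original value -> anonymous token
        self.reverse = {}  # anonymous token -> original value
        self.counter = 0

    def _token(self, prefix, value):
        # Reuse an existing token so the mapping stays consistent.
        if value not in self.forward:
            self.counter += 1
            token = f"{prefix}-{self.counter:02d}"
            self.forward[value] = token
            self.reverse[token] = value
        return self.forward[value]

    def anonymize(self, text, pod_names=(), namespaces=()):
        # Replace each sensitive value with its stable anonymous token.
        for name in pod_names:
            text = text.replace(name, self._token("pod", name))
        for ns in namespaces:
            text = text.replace(ns, self._token("namespace", ns))
        return text

    def deanonymize(self, text):
        # Restore original values in the AI response; nothing is written to disk.
        for token, original in self.reverse.items():
            text = text.replace(token, original)
        return text


anon = Anonymizer()
masked = anon.anonymize(
    "pod my-app-12345 in production crashed",
    pod_names=["my-app-12345"],
    namespaces=["production"],
)
print(masked)                    # pod pod-01 in namespace-02 crashed
print(anon.deanonymize(masked))  # pod my-app-12345 in production crashed
```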
When anonymization occurs, you'll see notices like:
🔒 PRIVACY & ANONYMIZATION
─────────────────────────────
The following data was anonymized before sending to AI:
• 2 pod name(s)
• 1 namespace(s)
• 3 container name(s)
• 1 Docker image(s)
Original values have been restored in this response.
# Preview what would be anonymized
curl -X POST http://localhost:8501/api/anonymization/preview \
-H "Content-Type: application/json" \
-d '{"pod_name": "my-app-123", "namespace": "production"}'
# Toggle anonymization on/off
curl -X POST http://localhost:8501/api/anonymization/toggle \
-H "Content-Type: application/json" \
-d '{"enabled": false}'
Anonymization is enabled by default but can be controlled:
# In your code
llm_agent = LlmAgent(enable_anonymization=True) # Default
# Disable for testing
llm_agent.set_anonymization(False)
# Preview anonymization
preview = llm_agent.preview_anonymization(metadata)
This ensures your sensitive Kubernetes data stays private while still getting powerful AI-driven insights.
KubERA uses SQLite as its local database (`kubera.db`) with SQLAlchemy ORM for operations.
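For a quick look at what that schema contains, you can open the database with Python's built-in sqlite3 module (the `kubera.db` path in the project root is an assumption; adjust if yours lives elsewhere):

```python
import sqlite3

conn = sqlite3.connect("kubera.db")

# List every table and view defined in the schema.
rows = conn.execute(
    "SELECT type, name FROM sqlite_master "
    "WHERE type IN ('table', 'view') ORDER BY name"
).fetchall()
for kind, name in rows:
    print(f"{kind}: {name}")

conn.close()
```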
CREATE TABLE k8s_alerts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
namespace TEXT NOT NULL,
pod_name TEXT NOT NULL,
issue_type TEXT NOT NULL, -- e.g., "CrashLoopBackOff", "ImagePullError"
severity TEXT NOT NULL, -- "high", "medium", "low"
first_seen TEXT NOT NULL, -- ISO timestamp string
last_seen TEXT, -- NULL means ongoing
event_hash TEXT UNIQUE, -- MD5 hash for deduplication
UNIQUE (namespace, pod_name, issue_type, first_seen)
);
CREATE TABLE prometheus_alerts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
namespace TEXT NOT NULL,
pod_name TEXT NOT NULL,
alert_name TEXT NOT NULL, -- e.g., "PodRestarting", "HighCPUUsage"
severity TEXT NOT NULL,
first_seen TEXT NOT NULL,
last_seen TEXT,
event_hash TEXT UNIQUE,
metric_value REAL, -- Associated metric value
UNIQUE (namespace, pod_name, alert_name, first_seen)
);
CREATE TABLE argocd_alerts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
application_name TEXT NOT NULL, -- ArgoCD application name
issue_type TEXT NOT NULL, -- e.g., "ArgoCDDegradedAlert"
severity TEXT NOT NULL,
first_seen TEXT NOT NULL,
last_seen TEXT,
event_hash TEXT UNIQUE,
sync_status TEXT, -- ArgoCD sync status
health_status TEXT, -- ArgoCD health status
UNIQUE (application_name, issue_type, first_seen)
);
A view that combines all alert sources:
CREATE VIEW all_alerts AS
SELECT
id, namespace, pod_name as name, issue_type, severity,
first_seen, last_seen, 'kubernetes' as source, event_hash
FROM k8s_alerts
UNION ALL
SELECT
id, namespace, pod_name as name, alert_name as issue_type, severity,
first_seen, last_seen, 'prometheus' as source, event_hash
FROM prometheus_alerts
UNION ALL
SELECT
id, NULL as namespace, application_name as name, issue_type, severity,
first_seen, last_seen, 'argocd' as source, event_hash
FROM argocd_alerts;
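With the view in place, ongoing incidents across all three sources can be pulled with a single query. Here is a sketch using Python's sqlite3, with column names taken from the view definition above:

```python
import sqlite3

conn = sqlite3.connect("kubera.db")

# last_seen IS NULL marks alerts that are still ongoing.
query = """
    SELECT source, namespace, name, issue_type, severity, first_seen
    FROM all_alerts
    WHERE last_seen IS NULL
    ORDER BY first_seen DESC
"""
for row in conn.execute(query):
    print(row)

conn.close()
```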
- Event Hash System: A unique MD5 hash of `namespace:name:issue_type:source` prevents duplicate entries (see the sketch after this list)
- Time-based Tracking: `first_seen` records when the issue first occurred; `last_seen` records when it resolved (NULL = still ongoing)
- Source Segregation: Data is separated by source (kubernetes, prometheus, argocd)
- Severity Classification: Three-tier system (high/medium/low)
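As a hypothetical illustration of the deduplication scheme, the hash could be computed like this; the `event_hash` function name is illustrative, not necessarily what `db.py` uses:

```python
import hashlib

def event_hash(namespace, name, issue_type, source):
    """Stable MD5 hex digest over the colon-joined alert key."""
    key = f"{namespace}:{name}:{issue_type}:{source}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# The same event always hashes to the same value, so re-inserting it trips
# the UNIQUE(event_hash) constraint instead of creating a duplicate row.
print(event_hash("default", "my-app-12345", "CrashLoopBackOff", "kubernetes"))
```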
Kubernetes issue types:
- `CrashLoopBackOff` (high severity)
- `ImagePullError` (medium severity)
- `PodOOMKilled` (high severity)
- `FailedScheduling` (medium severity)
- `FailingLiveness` (low severity)

Prometheus alert types:
- `PodRestarting` (medium severity)
- `PodNotReady` (medium severity)
- `HighCPUUsage` (medium severity)
- `KubeDeploymentReplicasMismatch` (medium severity)
- `TargetDown` (medium severity)

ArgoCD issue types:
- `ArgoCDDegradedAlert` (high severity)
- Sync status issues
- Health status problems
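Taken together, those severities form a simple lookup table. An illustrative sketch in code (the `SEVERITY_BY_ISSUE` name is an assumption, not KubERA's code):

```python
# Severity tiers per issue type, as listed above (illustrative layout).
SEVERITY_BY_ISSUE = {
    # Kubernetes
    "CrashLoopBackOff": "high",
    "ImagePullError": "medium",
    "PodOOMKilled": "high",
    "FailedScheduling": "medium",
    "FailingLiveness": "low",
    # Prometheus
    "PodRestarting": "medium",
    "PodNotReady": "medium",
    "HighCPUUsage": "medium",
    "KubeDeploymentReplicasMismatch": "medium",
    "TargetDown": "medium",
    # ArgoCD
    "ArgoCDDegradedAlert": "high",
}

print(SEVERITY_BY_ISSUE.get("CrashLoopBackOff", "medium"))  # high
```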
Key functions in `db.py` (a usage sketch follows the list):
- `record_k8s_failure()` - Insert Kubernetes alerts
- `record_prometheus_alert()` - Insert Prometheus alerts
- `record_argocd_alert()` - Insert ArgoCD alerts
- `get_all_alerts()` - Retrieve unified alert data
- `get_active_alerts()` - Get only ongoing alerts
- `cleanup_old_alerts()` - Remove old alerts
- `cleanup_stale_ongoing_alerts()` - Remove stale ongoing alerts
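A sketch of how these helpers might be called. The function names come from `db.py`, but the parameter names and signatures shown here are assumptions for illustration only:

```python
import db  # KubERA's database module

# Record a Kubernetes failure (parameter names mirror the k8s_alerts
# columns above, but are assumed rather than verified).
db.record_k8s_failure(
    namespace="default",
    pod_name="my-app-12345",
    issue_type="CrashLoopBackOff",
    severity="high",
)

# Fetch only ongoing alerts (last_seen IS NULL) across all sources.
for alert in db.get_active_alerts():
    print(alert)
```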