Skip to content

jonaskahn/azure-invoice

Repository files navigation

Azure Invoice Analyzer Pro - Complete Strategy Document

Executive Summary

Azure Invoice Analyzer Pro is an advanced Streamlit application that transforms Azure billing data into actionable business intelligence. The system provides comprehensive cost analysis through 9 business categories, interactive drill-down capabilities, and complete financial reconciliation with PDF export functionality.

Quick Chart Reference

Chart Calculation Summary

Chart Type Calculation Method Key Metrics
Cost Category Pie/Bar df.groupby('CostCategory')['Cost'].sum() Category percentages of total cost
Service Provider Analysis df.groupby('ConsumedService')['Cost'].sum() Total cost per Azure service
Resource Group Costs df.groupby('ResourceGroup')['Cost'].sum() Total cost per resource group
Machine Efficiency Cost / Quantity per resource Cost per unit/hour analysis
Interactive Drill-Down Multi-level grouping + pattern matching Resource Group → Machine → Category
Cost Reconciliation abs(original_total - categorized_total) Financial validation & accuracy

Core Calculations

Cost Reconciliation:

original_total = df['Cost'].sum()
categorized_total = classified_df['Cost'].sum()
coverage = (categorized_cost / original_total) × 100

Category Classification:

# 9 business categories based on Azure service patterns
if MeterSubcategory in ['Premium SSD Managed Disks']:
    return 'Managed Disks'
elif ConsumedService == 'Microsoft.Compute' and not disk_category:
    return 'VM Compute'

Machine Cost Breakdown:

# Includes related resources (disks, NICs, etc.)
machine_data = df[df['ResourceName'] == machine_name]
related_data = df[df['ResourceName'].str.contains(machine_name)]
total_cost = combined_data['Cost'].sum()

Chart Types and Business Value

Chart Purpose Key Insight Optimization Action
Cost Category Pie Budget allocation by business function Which category consumes most budget Focus optimization on largest segments
Service Provider Bar Azure service spending distribution Which Microsoft services cost most Negotiate better rates, optimize usage
Resource Group Costs Environment/team cost allocation Which groups spend most Budget allocation, cost accountability
Machine Efficiency Resource utilization analysis Cost per hour performance Right-size over/under-utilized resources
Drill-Down Analysis Granular cost investigation Machine-level cost breakdown Target specific machines for optimization
Reconciliation Validation Financial accuracy verification Data integrity and completeness Ensure 100% cost accountability

Core Architecture

Data Processing Pipeline

  1. Data Ingestion: CSV file upload with validation and error handling
  2. Cost Classification: 9-category business intelligence classification system
  3. Cost Reconciliation: Mathematical validation ensuring 100% cost accountability
  4. Interactive Analysis: Multi-level drill-down from resource groups to individual machines
  5. Export Generation: Professional PDF reports and individual chart downloads

Classification Engine

9 Business Cost Categories

  1. Managed Disks (typically 60-80% of costs)

    • Premium SSD Managed Disks
    • Standard HDD Managed Disks
    • Standard SSD Managed Disks
    • Ultra SSD Managed Disks
  2. VM Compute (10-25% of costs)

    • Virtual machine runtime costs
    • CPU and memory consumption
    • Excludes storage components
  3. CDN (5-15% of costs)

    • Content Delivery Network services
    • Data transfer and caching
  4. Network/IP (3-10% of costs)

    • Virtual Network services
    • Public IP addresses
    • Network security groups
  5. Backup (2-8% of costs)

    • Recovery Services vault
    • Backup storage and operations
  6. Load Balancer (1-5% of costs)

    • Standard and Basic load balancers
    • Application gateways
  7. Other Storage (<2% of costs)

    • Blob storage, File storage
    • Table storage, Queue storage
  8. Bandwidth (<1% of costs)

    • Data transfer charges
    • Inter-region communication
  9. Key Vault (<1% of costs)

    • Secret management services
    • Certificate operations

Interactive Drill-Down System

Three-Level Analysis Hierarchy

  1. Resource Group Level

    • Total cost, machine count, usage hours
    • High-level resource allocation view
  2. Machine Level

    • Individual machine costs and usage
    • Service provider breakdown
    • Related resource identification
  3. Category Level

    • Detailed cost breakdown by business category
    • Service-level attribution
    • Optimization recommendations

Enhanced Machine Analysis

Related Resource Detection

  • Pattern Matching: Identifies associated resources using naming conventions
  • Resource Relationships: Links VMs with their disks, NICs, and other components
  • Complete Cost Attribution: Ensures all machine-related costs are captured

Categories Include:

vm-name → Primary virtual machine
vm-name-disk → Managed disks
vm-name-nic → Network interfaces
vm-name_OsDisk → Operating system disks
vm-name-backup → Backup services

Chart Calculations and Analysis

1. Executive Summary & Validation

Cost Reconciliation Engine

Mathematical Foundation:

# Core reconciliation calculation
original_total = df['Cost'].sum()
categorized_total = classified_df['Cost'].sum()
difference = abs(original_total - categorized_total)
reconciliation_success = difference < 0.01

# Coverage calculation
categorized_cost = total_cost - uncategorized_cost
coverage_percentage = (categorized_cost / original_total) × 100

Key Metrics Explained:

  • Original Invoice Total: Sum of all Cost column values from uploaded CSV
  • Categorized Total: Sum of all costs after classification processing
  • Difference: Mathematical variance between original and processed totals
  • Coverage Percentage: (Categorized costs / Original total) × 100
  • Reconciliation Status: ✅ Success if difference < $0.01, ❌ Failed otherwise

Business Interpretation:

  • 100% Coverage: All costs properly classified, complete financial accountability
  • 95-99% Coverage: Good classification, minor uncategorized items need review
  • <95% Coverage: Significant classification gaps, immediate attention required

Executive Metrics Dashboard

Calculation Method:

total_cost = df['Cost'].sum()
total_quantity = df['Quantity'].sum()
unique_resource_groups = df['ResourceGroup'].nunique()
unique_machines = df['ResourceName'].nunique()

Metrics Meaning:

  • Total Cost: Complete Azure spending for the invoice period
  • Total Usage: Sum of all quantity values (typically hours)
  • Top Category: Highest cost category from 9-category classification
  • Resource Groups: Number of distinct Azure resource groups
  • Services: Number of unique Azure service providers (Microsoft.Compute, etc.)

2. Cost Category Analysis

Cost Category Pie Chart

Calculation Method:

category_summary = classified_df.groupby('CostCategory').agg({
    'Cost': 'sum',
    'Quantity': 'sum'
}).round(4)

category_percentage = (category_summary['Cost'] / total_cost * 100).round(2)

Chart Elements:

  • Pie Segments: Each represents one of 9 business categories
  • Percentage Labels: Category cost as percentage of total invoice
  • Dollar Values: Absolute cost amount for each category
  • Color Coding: Consistent colors across all charts for category identification

Business Interpretation:

  • Dominant Categories: Largest segments indicate primary cost drivers
  • Distribution Balance: Even distribution suggests diversified infrastructure
  • Anomalies: Unusually large percentages may indicate optimization opportunities

Cost Category Bar Chart (Detailed View)

Calculation Method:

category_breakdown = classified_df.groupby('CostCategory').agg({
    'Cost': ['sum', 'count', 'mean'],
    'Quantity': 'sum'
})

record_count = category_breakdown['Cost']['count']
avg_cost = category_breakdown['Cost']['mean']

Data Points Explained:

  • Horizontal Bars: Length represents total cost for each category
  • Cost Values: Dollar amount displayed on each bar
  • Percentage Labels: Category percentage of total costs
  • Record Count: Number of invoice line items in each category (hover data)

Key Insights:

  • Bar Length Comparison: Visual representation of spending distribution
  • High Record Count: Many small transactions vs few large transactions
  • Cost Concentration: Identifies which categories drive majority of spending

3. Service Provider Analysis

Service Provider Chart

Calculation Method:

provider_summary = df.groupby('ConsumedService').agg({
    'Cost': ['sum', 'count'],
    'Quantity': 'sum'
}).round(4)

provider_percentage = (provider_summary['Cost'] / total_cost * 100).round(2)

Chart Components:

  • X-Axis: Azure service providers (Microsoft.Compute, Microsoft.Storage, etc.)
  • Y-Axis: Total cost in USD for each provider
  • Bar Height: Proportional to spending on each service
  • Text Labels: Dollar amounts displayed above bars

Service Provider Breakdown:

  • Microsoft.Compute: Virtual machines, disks, compute resources
  • Microsoft.Network: Virtual networks, load balancers, IP addresses
  • Microsoft.Storage: Blob storage, file storage, queues
  • Microsoft.RecoveryServices: Backup and disaster recovery
  • Microsoft.Cdn: Content delivery network services

Business Analysis:

  • Compute Dominance: High Microsoft.Compute costs indicate VM-heavy workloads
  • Storage Intensity: High Microsoft.Storage suggests data-intensive applications
  • Network Costs: Significant Microsoft.Network indicates complex networking requirements

4. Interactive Drill-Down Analysis

Resource Group Selection Metrics

Calculation Method:

rg_data = df[df['ResourceGroup'] == selected_rg]
rg_cost = rg_data['Cost'].sum()
rg_machines = rg_data['ResourceName'].nunique()
rg_quantity = rg_data['Quantity'].sum()

Metrics Explanation:

  • Total Cost: Sum of all costs within selected resource group
  • Machines: Count of unique ResourceName values in the group
  • Total Usage: Sum of quantity values (usually compute hours)

Machine Analysis Table

Calculation Method:

machine_summary = rg_data.groupby('ResourceName').agg({
    'Cost': 'sum',
    'Quantity': 'sum',
    'ConsumedService': lambda x: ', '.join(x.unique()),
    'MeterCategory': lambda x: ', '.join(x.unique())
})

cost_percentage = (machine_summary['Cost'] / rg_cost * 100).round(2)

Table Columns:

  • Machine Name: ResourceName from Azure invoice
  • Total Cost: Sum of all costs for this specific machine
  • Total Usage: Sum of quantity values for this machine
  • Services Used: List of Azure services consumed by this machine
  • Meter Categories: Types of Azure meters associated with this machine
  • Cost %: This machine's percentage of total resource group costs

Individual Machine Analysis

Calculation Method:

# Enhanced resource detection
machine_data = classified_df[classified_df['ResourceName'] == selected_machine]
related_data = classified_df[
    (classified_df['ResourceName'].str.contains(selected_machine, case=False)) |
    (classified_df['ResourceName'].str.startswith(selected_machine + '-')) |
    (classified_df['ResourceName'].str.startswith(selected_machine + '_'))
]

combined_data = pd.concat([machine_data, related_data]).drop_duplicates()

Machine Metrics:

  • Total Cost: Sum of costs for machine and all related resources
  • Total Usage: Sum of quantity values across all related resources
  • Categories: Number of cost categories this machine uses
  • Cost/Hour: Total cost divided by total usage hours

Related Resources Detection:

  • Primary Resource: Exact match of machine name
  • Associated Disks: Resources with patterns like "vm-name-disk", "vm-name_OsDisk"
  • Network Components: Resources like "vm-name-nic" (network interfaces)
  • Backup Resources: Resources like "vm-name-backup"

Machine Cost Breakdown Charts

Pie Chart (Cost Distribution):

category_breakdown = combined_data.groupby('CostCategory').agg({
    'Cost': 'sum',
    'Quantity': 'sum'
})

machine_percentage = (category_breakdown['Cost'] / machine_cost * 100).round(2)

Elements:

  • Pie Segments: Proportional to category costs for this machine
  • Percentage Labels: Category percentage of machine's total cost
  • Dollar Values: Absolute cost for each category

Bar Chart (Category Details):

  • Horizontal Bars: Cost amount for each category
  • Category Colors: Consistent color scheme matching pie chart
  • Hover Data: Additional details including quantity used

5. Resource Efficiency Analysis

Efficiency Metrics Chart

Calculation Method:

efficiency_df = df.groupby('ResourceName').agg({
    'Cost': 'sum',
    'Quantity': 'sum'
})

efficiency_df['CostPerUnit'] = efficiency_df['Cost'] / efficiency_df['Quantity']
efficiency_df['EfficiencyScore'] = efficiency_df['Cost'] / efficiency_df['Quantity']

Dual-Axis Chart Components:

  • Left Y-Axis: Total cost in USD (bars)
  • Right Y-Axis: Cost per unit/hour (line)
  • X-Axis: Resource names (top 15 by cost)
  • Bar Height: Total spending on each resource
  • Line Points: Efficiency score (cost per hour)

Efficiency Interpretation:

  • High Cost + Low Efficiency: Expensive resources with high cost per hour
  • High Cost + High Efficiency: Expensive but efficient resources
  • Low Cost + Low Efficiency: Cheap resources but poor cost per hour ratio
  • Low Cost + High Efficiency: Cost-effective resources

Optimization Targets:

  • High Efficiency Score: Resources with cost per hour above average
  • Right-Sizing Candidates: Resources with consistently high efficiency scores
  • Optimization Opportunities: Resources showing poor cost per unit ratios

6. Traditional Resource Analysis

Cost by Resource Group Chart

Calculation Method:

cost_by_rg = df.groupby('ResourceGroup')['Cost'].sum().sort_values(ascending=False)

Chart Elements:

  • X-Axis: Resource group names
  • Y-Axis: Total cost in USD
  • Bar Height: Proportional to total spending per resource group
  • Text Labels: Dollar amounts displayed above each bar

Business Usage:

  • Environment Comparison: Compare prod, dev, test environment costs
  • Department Allocation: Understand which teams/projects drive costs
  • Budget Attribution: Assign costs to specific business units

Top Machines by Cost Chart

Calculation Method:

cost_by_machine = df.groupby('ResourceName')['Cost'].sum().sort_values(ascending=False)
top_machines = cost_by_machine.head(Config.TOP_ITEMS_COUNT)

Analysis Focus:

  • Top N Machines: Configurable from 5-50 most expensive resources
  • Cost Ranking: Machines ordered by total spending
  • Optimization Targets: Highest-cost machines for immediate attention

Cost vs Usage Comparison Chart

Calculation Method:

agg_data = df.groupby('ResourceGroup').agg({
    'Cost': 'sum',
    'Quantity': 'sum'
})

Dual-Axis Visualization:

  • Primary Y-Axis (Bars): Total cost per resource group
  • Secondary Y-Axis (Line): Total usage hours per resource group
  • Correlation Analysis: Relationship between cost and usage

Business Insights:

  • High Cost + Low Usage: Potentially over-provisioned resources
  • Low Cost + High Usage: Efficient resource utilization
  • Correlation Patterns: Expected vs unexpected cost-usage relationships

7. Uncategorized Items Analysis

Uncategorized Cost Metrics

Calculation Method:

uncategorized_items = classified_df[classified_df['CostCategory'] == 'Other']
uncategorized_cost = uncategorized_items['Cost'].sum()
uncategorized_percentage = (uncategorized_cost / total_cost * 100)

Key Indicators:

  • Uncategorized Cost: Dollar amount not classified into business categories
  • Percentage of Total: Uncategorized amount as percentage of total invoice
  • Number of Items: Count of line items that couldn't be classified
  • Status Assessment: ✅ Excellent (<1%), ⚠️ Good (1-5%), ❌ Needs Review (>5%)

Service Type Breakdown

Calculation Method:

service_breakdown = uncategorized_items.groupby([
    'ConsumedService', 'MeterCategory', 'MeterSubcategory'
]).agg({
    'Cost': 'sum',
    'ResourceName': 'count'
})

Table Analysis:

  • Service Provider: Azure service generating uncategorized costs
  • Meter Category: Type of Azure service meter
  • Meter Subcategory: Specific service subtype
  • Total Cost: Dollar amount for this service combination
  • Item Count: Number of invoice lines for this service type
  • % of Uncategorized: Percentage of uncategorized costs from this service

8. Enhanced Service Breakdown (Machine Level)

Comprehensive Service Analysis

Calculation Method:

enhanced_breakdown = all_machine_data.groupby([
    'ConsumedService', 'MeterCategory', 'MeterSubcategory'
]).agg({
    'Cost': 'sum',
    'Quantity': 'sum',
    'ResourceName': lambda x: ', '.join(x.unique())
})

Detailed Breakdown Components:

  • Service Provider: Azure service (Microsoft.Compute, Microsoft.Storage)
  • Meter Category: Service category (Virtual Machines, Storage, Network)
  • Meter Subcategory: Specific service type (Premium SSD, D2s v3, etc.)
  • Cost: Total spending for this service type
  • Quantity: Total usage amount (hours, GB, transactions)
  • Resource Names: Which specific resources use this service
  • Suggested Category: What business category this should be classified as

Storage Analysis for Machines

Calculation Method:

storage_data = all_machine_data[
    all_machine_data['MeterCategory'].str.contains('Storage|Disk', case=False)
]

storage_breakdown = storage_data.groupby(['MeterSubcategory', 'ResourceName']).agg({
    'Cost': 'sum',
    'Quantity': 'sum'
})

storage_percentage = (storage_cost / total_machine_cost * 100)

Storage Metrics:

  • Storage Type: Specific disk type (Premium SSD, Standard HDD)
  • Resource Name: Individual disk or storage resource
  • Cost: Dollar amount for this storage component
  • Quantity: Storage usage (typically GB-hours or disk-hours)
  • Storage Percentage: Storage costs as percentage of total machine costs

Optimization Insights:

  • Disk Tier Analysis: Premium vs Standard SSD usage
  • Storage Efficiency: Cost per GB analysis
  • Right-Sizing Opportunities: Over-provisioned storage identification

Technical Implementation

Required CSV Structure

Date,Cost,Quantity,ResourceGroup,ResourceName,ConsumedService,MeterCategory,MeterSubcategory
2024-01-01,45.67,720,prod-rg,web-server-01,Microsoft.Compute,Virtual Machines,D2s v3
2024-01-01,89.23,744,prod-rg,web-server-01-disk,Microsoft.Compute,Storage,Premium SSD Managed Disks

Classification Logic Implementation

# Managed Disks Classification
if MeterSubcategory in ['Premium SSD Managed Disks', 'Standard HDD Managed Disks']:
    return 'Managed Disks'

# VM Compute Classification
if ConsumedService == 'Microsoft.Compute' and MeterSubcategory not in disk_categories:
    return 'VM Compute'

# Network Classification
if MeterCategory == 'Virtual Network':
    return 'Network/IP'

Cost Reconciliation Engine

  • Mathematical Validation: Original total = Sum of categorized costs
  • Coverage Tracking: Percentage of costs successfully categorized
  • Gap Identification: Uncategorized items analysis and recommendations
  • Error Detection: Data quality issues and missing classifications

This comprehensive strategy document provides the technical foundation and business rationale for implementing Azure Invoice Analyzer Pro as a critical tool for Azure cost management and financial governance.

About

Simple app for simple life

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages