Skip to content

Simulate different performance calculation methodologies #5827

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 34 commits into
base: drazen/nmv2-metrics-fanout
Choose a base branch
from

Conversation

durch
Copy link
Contributor

@durch durch commented Jun 5, 2025

This change is Reviewable

durch added 6 commits June 3, 2025 12:39
Implement foundation for simulated reward calculations to compare old (24h cache-based)
vs new (1h route-based) methodologies without blockchain transactions.

Database Changes:
- Add migration with 4 new tables for simulation system
- simulated_reward_epochs: tracks each simulation run
- simulated_node_performance: stores performance calculations
- simulated_rewards: stores reward calculation results
- simulated_route_analysis: metadata for route analysis
- Add comprehensive indexes for efficient querying

Configuration Changes:
- Add simulation_mode flag to rewarding configuration
- Add CLI flag --simulate-rewarding with proper dependencies/conflicts
- Add validation for simulation-specific settings
- Add time window configuration for new method (default: 1 hour)

Storage Layer:
- Add model structs with SQLx FromRow derives for all simulation tables
- Add comprehensive CRUD methods for simulation data management
- Add proper type annotations to fix SQLx compile issues
- Maintain separation between simulation and real rewarding logic

The simulation mode is mutually exclusive with real rewarding and does not
require mnemonic since no blockchain transactions are performed.
Add complete simulation engine that compares old (24h cache-based) vs new (1h route-based)
reward calculation methodologies with full integration into epoch operations.

Core Simulation Engine:
- Add SimulationCoordinator with configurable time windows and comparison settings
- Implement dual calculation methods with proper Performance type conversions
- Add comprehensive error handling with DatabaseError variant in RewardingError
- Store simulation results in database with proper relationship constraints

Old Method Implementation (24h Cache-Based):
- Wrap existing reliability calculation using get_all_avg_mix_reliability_in_last_24hr()
- Convert reliability percentages to Performance types using from_percentage_value()
- Maintain exact same logic as production for accurate baseline comparison
- Generate simulation data structures with proper metadata

New Method Implementation (1h Route-Based):
- Leverage calculate_corrected_node_reliabilities_for_interval() for route analysis
- Support configurable time windows (default 1 hour vs 24 hours)
- Provide detailed route statistics including success rates and failure analysis
- Convert route reliability data to Performance types with naive_try_from_f64()

Epoch Operations Integration:
- Extend EpochAdvancer struct with optional SimulationConfig field
- Update constructor and start method to accept simulation configuration
- Add simulation trigger in perform_epoch_operations() before real rewarding
- Ensure simulation failures don't break epoch advancement process

CLI Integration:
- Update run.rs to handle both --enable-rewarding and --simulate-rewarding modes
- Create SimulationConfig from rewarding.debug configuration settings
- Implement mutual exclusivity between real rewarding and simulation mode
- Skip permission checks for simulation-only mode (no blockchain transactions)

The simulation system runs in parallel with epoch operations, storing comparative
data for analysis without affecting production reward distribution.
This completes Phase 3 of the simulation system implementation:

- Add comprehensive REST API endpoints for simulation data access
- Implement /v1/simulation/* routes with full CRUD operations
- Support JSON/CSV export for external analysis
- Add statistical comparison between old vs new methods
- Provide node performance history tracking
- Include proper error handling and response formatting
- Simplify simulation coordinator to remove unused complex return types
- Clean up dead code while maintaining all functionality
- Pass clippy with no warnings

The simulation API provides complete access to:
- Simulation epoch listing and details
- Method comparison analytics (old 24h vs new 1h)
- Node performance analysis across epochs
- Route reliability statistics
- Export capabilities for further analysis

All simulation data is persisted and accessible via REST endpoints.
Refactors the simulation system to focus on performance methodology comparison
rather than reward amounts, enabling robust analysis of old vs new calculation
methods. Key improvements:

- Replace simulated_rewards table with performance_comparisons for better metrics
- Add performance_rankings table for ranking analysis across methodologies
- Enhance database schema with additional performance tracking fields
- Update simulation coordinator to use performance-focused data structures
- Add comprehensive performance ranking calculations
- Improve API models and handlers for performance comparison workflows
- Update SQLx query cache with new database schema changes

This provides a foundation for data-driven performance methodology evaluation
while maintaining separation from actual reward calculations.
Copy link

vercel bot commented Jun 5, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
nym-explorer-v2 ❌ Failed (Inspect) Jun 25, 2025 1:06pm
2 Skipped Deployments
Name Status Preview Comments Updated (UTC)
docs-nextra ⬜️ Ignored (Inspect) Visit Preview Jun 25, 2025 1:06pm
nym-next-explorer ⬜️ Ignored (Inspect) Visit Preview Jun 25, 2025 1:06pm

durch added 2 commits June 5, 2025 10:45
…ystem

This commit addresses critical N+1 query performance issues identified in
the reward simulation system. The optimizations significantly reduce database
round trips and improve performance when processing large datasets.

**Key Optimizations:**

1. **Batch Identity Key Lookups**
   - Added `get_mixnode_identity_keys_batch()` and `get_gateway_identity_keys_batch()`
   - Updated simulation performance conversion to use batch operations
   - Reduced from N individual queries to 2 batch queries

2. **Batch Node Classification**
   - Added `classify_nodes_batch()` method for mixnode/gateway determination
   - Updated reliability calculation methods to use batch classification
   - Reduced from N individual lookups to 2 batch queries

3. **Batch Epoch Metadata Enhancement**
   - Added `count_simulated_node_performance_for_epochs_batch()`
   - Added `get_available_calculation_methods_for_epochs_batch()`
   - Updated API handlers to use batch operations for metadata enhancement
   - Reduced from 2N queries to 2 batch queries for epoch data

4. **Bulk Insert Optimizations**
   - Converted individual INSERT operations to use `sqlx::QueryBuilder::push_values()`
   - Optimized simulation data insertion methods
   - Eliminated transaction overhead from individual inserts

**Performance Impact:**
- Before: N+2N database queries for N nodes/epochs
- After: 2+2 batch queries regardless of dataset size
- Significant performance improvement for large simulation datasets

All changes maintain backward compatibility while providing substantial
performance benefits for the reward simulation system.
Implements automatic cleanup of route monitoring results to prevent
unbounded storage growth while maintaining data for performance analysis.

- Add purge_old_routes() method to StorageManager following existing patterns
- Integrate route cleanup into purge_old_statuses() wrapper function
- Route data now purged every epoch with 48-hour retention, to facilitate comparisons with legacy data
- Update logging to reflect cleanup of both node statuses and routes
Replace infinite busy-wait loop with timeout-based client cleanup to prevent potential system hangs during client lifecycle management.
Implements proper socket-level health checking for gateway websocket connections:

- Add is_connection_alive() method to GatewayTransceiver trait
- Implement socket-level health check using TcpStream::peek() on Unix systems
- Add is_gateway_connection_alive() method to MixnetClient for easy access
- Update network monitor to use connection health checks before sending
- Fix axum router state configuration in network monitor

The health check performs actual OS-level socket validation rather than just
checking file descriptor existence, providing reliable connection status.
@durch durch marked this pull request as ready for review June 9, 2025 09:28
…ing epoch

In simulation mode, the nym-api should be able to run performance
simulations regardless of which instance is handling epoch advancement,
since simulations don't perform any blockchain transactions.

This fix:
- Allows simulation mode to proceed when another API is advancing the epoch
- Skips epoch transition attempts in simulation mode
- Skips actual reward distribution blockchain operations in simulation mode
…mulation

The old method simulation was incorrectly treating reliability values (0-1 fractions)
as percentages, causing all performance scores to be truncated to 0 when cast to u64.

Changes:
- Old method: multiply reliability by 100 before passing to Performance::from_percentage_value()
- New method: remove division by 100 since Performance::naive_try_from_f64() expects fractions
- Route analysis: multiply average reliability by 100 for percentage display

This ensures consistent handling where internal calculations use fractions (0-1)
while storage and API responses use percentages (0-100).
durch and others added 4 commits June 24, 2025 21:28
* Check gateway supported versions

* Fix boxed errors

* Fmt

* feat(client-core): integrate gateway protocol validation into SDK

Enable protocol version checking for all SDK-based clients by using
gateways_for_init_with_protocol_validation in MixnetClientBuilder.
This ensures clients only connect to gateways with compatible protocol
versions, preventing potential communication issues.

- Replace gateways_for_init with gateways_for_init_with_protocol_validation
- Add import for the new validation function
- Protocol validation now active for network monitor and all SDK users

* refactor(client-core): allow gateways with newer protocol versions

Change protocol validation to be more permissive - instead of rejecting
gateways with newer protocol versions, now logs a warning and continues.
This enables graceful degradation when gateways upgrade, relying on their
backward compatibility while signaling users to update their clients.

Changes:
- Accept gateways with protocol version > client version
- Log warning about version mismatch suggesting client update
- Update log messages from "validation" to "check" for clarity
- Add trace logging showing both gateway and client versions

This prevents connectivity issues when gateways upgrade before clients,
improving overall network resilience.

* feat(nym-network-monitor): add diagnostic logging for ConnectionRefused debugging

- Add detailed logging for client rotation lifecycle with [CLIENT_ROTATION_*] tags
- Add request tracking with unique IDs using the random message as identifier
- Add HTTP connection acceptance logging with [HTTP_REQUEST] tags
- Improve locust script with connection error handling and backoff strategy
- Add TCP backlog configuration support (default 1024)
- Add socket configuration with SO_REUSEADDR and configurable backlog
- Add HTTP timeout and tracing layers for better observability

These changes help diagnose when and why ConnectionRefused errors occur during load testing,
particularly around client rotation periods.

* fix(nym-api): calculate route average reliability as node mean instead of route-based

- Changed route average reliability calculation to use mean of all node reliabilities
- Added median calculation alongside mean for better statistical representation
- Fixed null average reliability for old method which doesn't analyze routes
- Rounded all float values to 2 decimal places in API responses for cleaner display
- Store median values in analysis_parameters JSON field
…comparison

- Added ReliabilityDistribution struct with 6 categories: excellent (>95%), very_good (90-95%), good (75-90%), moderate (50-75%), poor (25-50%), very_poor (<25%)
- Updated ComparisonSummaryStats to include distribution breakdowns for both old and new methods
- Provides better insight into performance distribution beyond averages
- Added tests for distribution calculation
Since old method data is already available in production database,
simulation now only calculates new method performance. The API combines
production data with simulated data at query time.

Key changes:
- Remove old_method_simulation and run_both_methods config
- Add production_performance field to NodePerformanceData model
- Update API endpoints to fetch production data from node annotations cache
- Refactor compare_methods to use single performance dataset
- Fix route analysis comparison to handle missing old method data
- Update all tests to use new single-dataset approach

This simplifies the simulation code and ensures consistency with
production calculations.
durch added 2 commits June 25, 2025 00:40
…ures

- Remove old_method/new_method nested objects from NodeMethodComparison
- Use direct fields for production_performance and simulated_performance
- Update build_node_comparisons_from_single_dataset to create cleaner structure
- Fix calculate_summary_statistics to work with new data model
- Update all tests to match new structure

The API now returns a single dataset with production performance values
included directly, rather than separate old/new method objects.

- Deduplicate routes, to control for deterministic routing
durch and others added 2 commits June 25, 2025 10:50
- Add create_or_get_simulated_reward_epoch to check for existing simulations
- Skip simulation if already exists for the epoch and calculation method
- Add tests to verify duplicate prevention works correctly
- Update all test calls to use the new method

This fixes the issue where multiple simulations were created for the same
epoch when epoch operations were retried or the service restarted.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant