Simulate different performance calculation methodologies #5827
Open: durch wants to merge 34 commits into `drazen/nmv2-metrics-fanout` from `drazen/reward-simulator`
Conversation
Implement foundation for simulated reward calculations to compare the old (24h cache-based) vs new (1h route-based) methodologies without blockchain transactions.

Database Changes:
- Add a migration with 4 new tables for the simulation system:
  - simulated_reward_epochs: tracks each simulation run
  - simulated_node_performance: stores performance calculations
  - simulated_rewards: stores reward calculation results
  - simulated_route_analysis: metadata for route analysis
- Add comprehensive indexes for efficient querying

Configuration Changes:
- Add a simulation_mode flag to the rewarding configuration
- Add CLI flag --simulate-rewarding with proper dependencies/conflicts
- Add validation for simulation-specific settings
- Add time window configuration for the new method (default: 1 hour)

Storage Layer:
- Add model structs with SQLx FromRow derives for all simulation tables
- Add comprehensive CRUD methods for simulation data management
- Add proper type annotations to fix SQLx compile issues
- Maintain separation between simulation and real rewarding logic

Simulation mode is mutually exclusive with real rewarding and does not require a mnemonic, since no blockchain transactions are performed.
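The mutual-exclusivity and mnemonic rules described above can be sketched as a small validation step. This is a minimal illustration, not the actual nym-api CLI code; `RewardingMode` and `validate_flags` are hypothetical names:

```rust
// Hypothetical sketch of validating --enable-rewarding vs --simulate-rewarding.
#[derive(Debug, PartialEq)]
enum RewardingMode {
    Disabled,
    Real,      // --enable-rewarding: performs transactions, needs a mnemonic
    Simulated, // --simulate-rewarding: no blockchain access required
}

fn validate_flags(
    enable_rewarding: bool,
    simulate_rewarding: bool,
    has_mnemonic: bool,
) -> Result<RewardingMode, String> {
    match (enable_rewarding, simulate_rewarding) {
        (true, true) => Err("--enable-rewarding conflicts with --simulate-rewarding".into()),
        (true, false) if !has_mnemonic => Err("real rewarding requires a mnemonic".into()),
        (true, false) => Ok(RewardingMode::Real),
        // Simulation performs no transactions, so no mnemonic is needed.
        (false, true) => Ok(RewardingMode::Simulated),
        (false, false) => Ok(RewardingMode::Disabled),
    }
}

fn main() {
    assert_eq!(validate_flags(false, true, false), Ok(RewardingMode::Simulated));
    assert!(validate_flags(true, true, true).is_err());
}
```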
Add a complete simulation engine that compares the old (24h cache-based) vs new (1h route-based) reward calculation methodologies, with full integration into epoch operations.

Core Simulation Engine:
- Add SimulationCoordinator with configurable time windows and comparison settings
- Implement dual calculation methods with proper Performance type conversions
- Add comprehensive error handling with a DatabaseError variant in RewardingError
- Store simulation results in the database with proper relationship constraints

Old Method Implementation (24h Cache-Based):
- Wrap the existing reliability calculation using get_all_avg_mix_reliability_in_last_24hr()
- Convert reliability percentages to Performance types using from_percentage_value()
- Maintain exactly the same logic as production for an accurate baseline comparison
- Generate simulation data structures with proper metadata

New Method Implementation (1h Route-Based):
- Leverage calculate_corrected_node_reliabilities_for_interval() for route analysis
- Support configurable time windows (default 1 hour vs 24 hours)
- Provide detailed route statistics, including success rates and failure analysis
- Convert route reliability data to Performance types with naive_try_from_f64()

Epoch Operations Integration:
- Extend the EpochAdvancer struct with an optional SimulationConfig field
- Update the constructor and start method to accept simulation configuration
- Add a simulation trigger in perform_epoch_operations() before real rewarding
- Ensure simulation failures don't break the epoch advancement process

CLI Integration:
- Update run.rs to handle both --enable-rewarding and --simulate-rewarding modes
- Create SimulationConfig from rewarding.debug configuration settings
- Implement mutual exclusivity between real rewarding and simulation mode
- Skip permission checks for simulation-only mode (no blockchain transactions)

The simulation system runs in parallel with epoch operations, storing comparative data for analysis without affecting production reward distribution.
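The dual-methodology structure can be pictured as running interchangeable calculators over the same input. A minimal sketch, assuming a hypothetical `Methodology` trait; the real SimulationCoordinator wraps database-backed reliability queries rather than in-memory maps:

```rust
use std::collections::HashMap;

type NodeId = u32;

trait Methodology {
    fn name(&self) -> &'static str;
    /// Returns a performance score in [0.0, 1.0] per node.
    fn calculate(&self, reliabilities: &HashMap<NodeId, f64>) -> HashMap<NodeId, f64>;
}

/// Stand-in for the 24h cache-based method: passes stored averages through.
struct CacheBased;

impl Methodology for CacheBased {
    fn name(&self) -> &'static str {
        "old_24h_cache"
    }
    fn calculate(&self, r: &HashMap<NodeId, f64>) -> HashMap<NodeId, f64> {
        r.clone()
    }
}

/// Runs every methodology over the same input so results stay comparable.
fn run_simulation(
    methods: &[&dyn Methodology],
    reliabilities: &HashMap<NodeId, f64>,
) -> HashMap<&'static str, HashMap<NodeId, f64>> {
    methods
        .iter()
        .map(|m| (m.name(), m.calculate(reliabilities)))
        .collect()
}

fn main() {
    let input = HashMap::from([(1, 0.95), (2, 0.50)]);
    let results = run_simulation(&[&CacheBased], &input);
    assert_eq!(results["old_24h_cache"][&1], 0.95);
}
```

Keeping both methods behind one interface is what lets simulation failures stay isolated from real rewarding: the coordinator can log and skip a failed method without touching epoch advancement.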
This completes Phase 3 of the simulation system implementation:
- Add comprehensive REST API endpoints for simulation data access
- Implement /v1/simulation/* routes with full CRUD operations
- Support JSON/CSV export for external analysis
- Add statistical comparison between the old and new methods
- Provide node performance history tracking
- Include proper error handling and response formatting
- Simplify the simulation coordinator to remove unused complex return types
- Clean up dead code while maintaining all functionality
- Pass clippy with no warnings

The simulation API provides complete access to:
- Simulation epoch listing and details
- Method comparison analytics (old 24h vs new 1h)
- Node performance analysis across epochs
- Route reliability statistics
- Export capabilities for further analysis

All simulation data is persisted and accessible via REST endpoints.
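The CSV export path can be sketched as a simple row serializer. Field names here are assumptions for illustration, not the actual API schema:

```rust
// Hypothetical row shape for exported node performance data.
struct NodePerformanceRow {
    node_id: u32,
    epoch_id: u32,
    performance: f64,
}

/// Serializes rows to CSV with a header line; floats are formatted to
/// two decimal places for cleaner external analysis.
fn to_csv(rows: &[NodePerformanceRow]) -> String {
    let mut out = String::from("node_id,epoch_id,performance\n");
    for r in rows {
        out.push_str(&format!("{},{},{:.2}\n", r.node_id, r.epoch_id, r.performance));
    }
    out
}

fn main() {
    let rows = vec![NodePerformanceRow { node_id: 1, epoch_id: 3, performance: 0.9512 }];
    let csv = to_csv(&rows);
    assert!(csv.starts_with("node_id,epoch_id,performance\n"));
    assert!(csv.ends_with("1,3,0.95\n"));
}
```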
Refactors the simulation system to focus on performance methodology comparison rather than reward amounts, enabling robust analysis of the old vs new calculation methods.

Key improvements:
- Replace the simulated_rewards table with performance_comparisons for better metrics
- Add a performance_rankings table for ranking analysis across methodologies
- Enhance the database schema with additional performance tracking fields
- Update the simulation coordinator to use performance-focused data structures
- Add comprehensive performance ranking calculations
- Improve API models and handlers for performance comparison workflows
- Update the SQLx query cache with the new database schema changes

This provides a foundation for data-driven performance methodology evaluation while maintaining separation from actual reward calculations.
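A ranking calculation of the kind mentioned above can be sketched as a sort plus 1-based rank assignment. This is illustrative; the actual performance_rankings logic may handle ties differently:

```rust
/// Ranks nodes by score, highest first; ties keep input order (stable sort).
/// Returns (node_id, rank) pairs with 1-based ranks.
fn rank_nodes(scores: &[(u32, f64)]) -> Vec<(u32, usize)> {
    let mut sorted: Vec<(u32, f64)> = scores.to_vec();
    sorted.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    sorted
        .into_iter()
        .enumerate()
        .map(|(i, (node, _))| (node, i + 1))
        .collect()
}

fn main() {
    let ranks = rank_nodes(&[(1, 0.5), (2, 0.9)]);
    assert_eq!(ranks, vec![(2, 1), (1, 2)]);
}
```

Comparing per-node ranks across the two methodologies (rather than raw scores) is what makes the comparison robust to scale differences between the 24h and 1h calculations.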
…ystem

This commit addresses critical N+1 query performance issues identified in the reward simulation system. The optimizations significantly reduce database round trips and improve performance when processing large datasets.

**Key Optimizations:**

1. **Batch Identity Key Lookups**
   - Added `get_mixnode_identity_keys_batch()` and `get_gateway_identity_keys_batch()`
   - Updated simulation performance conversion to use batch operations
   - Reduced from N individual queries to 2 batch queries
2. **Batch Node Classification**
   - Added a `classify_nodes_batch()` method for mixnode/gateway determination
   - Updated reliability calculation methods to use batch classification
   - Reduced from N individual lookups to 2 batch queries
3. **Batch Epoch Metadata Enhancement**
   - Added `count_simulated_node_performance_for_epochs_batch()`
   - Added `get_available_calculation_methods_for_epochs_batch()`
   - Updated API handlers to use batch operations for metadata enhancement
   - Reduced from 2N queries to 2 batch queries for epoch data
4. **Bulk Insert Optimizations**
   - Converted individual INSERT operations to use `sqlx::QueryBuilder::push_values()`
   - Optimized simulation data insertion methods
   - Eliminated transaction overhead from individual inserts

**Performance Impact:**
- Before: N+2N database queries for N nodes/epochs
- After: 2+2 batch queries regardless of dataset size
- Significant performance improvement for large simulation datasets

All changes maintain backward compatibility while providing substantial performance benefits for the reward simulation system.
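The N+1 fix can be shown in miniature with an in-memory map standing in for the database: instead of one lookup per node, a single batched call (conceptually a `WHERE id IN (...)` query) fetches everything and the results are distributed from a HashMap. Function and key names are illustrative:

```rust
use std::collections::HashMap;

/// One call returning every requested row at once, replacing N per-node
/// queries. In the real code this is a single SQL query with an IN clause.
fn get_identity_keys_batch(db: &HashMap<u32, String>, ids: &[u32]) -> HashMap<u32, String> {
    ids.iter()
        .filter_map(|id| db.get(id).map(|k| (*id, k.clone())))
        .collect()
}

fn main() {
    let db = HashMap::from([(1, "keyA".to_string()), (2, "keyB".to_string())]);
    // Unknown ids (99) are simply absent from the result, so callers must
    // handle missing keys rather than assume a 1:1 mapping.
    let keys = get_identity_keys_batch(&db, &[1, 2, 99]);
    assert_eq!(keys.len(), 2);
    assert_eq!(keys[&1], "keyA");
}
```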
Implements automatic cleanup of route monitoring results to prevent unbounded storage growth while maintaining data for performance analysis.
- Add a purge_old_routes() method to StorageManager, following existing patterns
- Integrate route cleanup into the purge_old_statuses() wrapper function
- Route data is now purged every epoch with 48-hour retention, to facilitate comparisons with legacy data
- Update logging to reflect cleanup of both node statuses and routes
Replace an infinite busy-wait loop with timeout-based client cleanup to prevent potential system hangs during client lifecycle management.
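The shape of that fix can be sketched with std channels: rather than spinning until a client finishes tearing down, cleanup waits on a channel with a bounded timeout, so it neither burns CPU nor hangs forever. `ClientDone` and the helper name are illustrative:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

struct ClientDone;

/// Blocks for at most `timeout` waiting for the client's shutdown signal.
/// Returns false if the client never reported done, letting the caller
/// proceed with forced cleanup instead of hanging.
fn wait_for_client_shutdown(rx: &mpsc::Receiver<ClientDone>, timeout: Duration) -> bool {
    rx.recv_timeout(timeout).is_ok()
}

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Simulated client finishing its teardown.
        tx.send(ClientDone).unwrap();
    });
    assert!(wait_for_client_shutdown(&rx, Duration::from_secs(1)));
}
```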
Implements proper socket-level health checking for gateway websocket connections:
- Add an is_connection_alive() method to the GatewayTransceiver trait
- Implement a socket-level health check using TcpStream::peek() on Unix systems
- Add an is_gateway_connection_alive() method to MixnetClient for easy access
- Update the network monitor to use connection health checks before sending
- Fix the axum router state configuration in the network monitor

The health check performs actual OS-level socket validation rather than just checking file descriptor existence, providing reliable connection status.
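A peek-based liveness check along these lines can be written with std's `TcpStream::peek()`: a non-blocking peek that returns `Ok(0)` means the peer performed an orderly shutdown, while `WouldBlock` means the socket is open but idle. This is a sketch of the technique, not the project's trait implementation:

```rust
use std::io::ErrorKind;
use std::net::{TcpListener, TcpStream};

/// OS-level liveness check: peeks one byte without consuming it.
fn is_connection_alive(stream: &TcpStream) -> bool {
    stream.set_nonblocking(true).ok();
    let mut buf = [0u8; 1];
    let alive = match stream.peek(&mut buf) {
        Ok(0) => false,                                      // peer closed the connection
        Ok(_) => true,                                       // data is waiting to be read
        Err(e) if e.kind() == ErrorKind::WouldBlock => true, // open but idle
        Err(_) => false,                                     // reset or other error
    };
    stream.set_nonblocking(false).ok();
    alive
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let client = TcpStream::connect(listener.local_addr()?)?;
    let _server_side = listener.accept()?;
    assert!(is_connection_alive(&client));
    Ok(())
}
```

Note that TCP cannot detect a silently vanished peer without traffic; peek catches orderly closes and resets, which is why the commit pairs it with checks made immediately before sending.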
…ing epoch

In simulation mode, the nym-api should be able to run performance simulations regardless of which instance is handling epoch advancement, since simulations don't perform any blockchain transactions.

This fix:
- Allows simulation mode to proceed when another API is advancing the epoch
- Skips epoch transition attempts in simulation mode
- Skips actual reward distribution blockchain operations in simulation mode
…mulation

The old method simulation was incorrectly treating reliability values (0-1 fractions) as percentages, causing all performance scores to be truncated to 0 when cast to u64.

Changes:
- Old method: multiply reliability by 100 before passing it to Performance::from_percentage_value()
- New method: remove the division by 100, since Performance::naive_try_from_f64() expects fractions
- Route analysis: multiply average reliability by 100 for percentage display

This ensures consistent handling: internal calculations use fractions (0-1) while storage and API responses use percentages (0-100).
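The truncation bug is easy to reproduce in isolation (function names here are illustrative): casting a 0-1 fraction straight to `u64` floors almost every value to 0, while scaling to a percentage first preserves it:

```rust
/// The bug: treating a 0-1 fraction as if it were already a percentage.
fn buggy_percentage(reliability: f64) -> u64 {
    reliability as u64 // 0.97 truncates to 0
}

/// The fix: scale to 0-100 before the integer cast.
fn fixed_percentage(reliability: f64) -> u64 {
    (reliability * 100.0).round() as u64 // 0.97 becomes 97
}

fn main() {
    assert_eq!(buggy_percentage(0.97), 0);
    assert_eq!(fixed_percentage(0.97), 97);
}
```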
* Check gateway supported versions
* Fix boxed errors
* Fmt
* feat(client-core): integrate gateway protocol validation into SDK

  Enable protocol version checking for all SDK-based clients by using gateways_for_init_with_protocol_validation in MixnetClientBuilder. This ensures clients only connect to gateways with compatible protocol versions, preventing potential communication issues.
  - Replace gateways_for_init with gateways_for_init_with_protocol_validation
  - Add an import for the new validation function
  - Protocol validation is now active for the network monitor and all SDK users
* refactor(client-core): allow gateways with newer protocol versions

  Change protocol validation to be more permissive: instead of rejecting gateways with newer protocol versions, log a warning and continue. This enables graceful degradation when gateways upgrade, relying on their backward compatibility while signaling users to update their clients.

  Changes:
  - Accept gateways with protocol version > client version
  - Log a warning about the version mismatch, suggesting a client update
  - Update log messages from "validation" to "check" for clarity
  - Add trace logging showing both gateway and client versions

  This prevents connectivity issues when gateways upgrade before clients, improving overall network resilience.
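The permissive check can be sketched as follows. The version numbering scheme and function name are assumptions for illustration, not the client-core API:

```rust
/// Rejects only gateways older than the client's required protocol version;
/// newer gateways are accepted with a warning, relying on their backward
/// compatibility while nudging the user to update the client.
fn check_gateway_protocol(gateway_version: u32, client_version: u32) -> Result<(), String> {
    if gateway_version < client_version {
        return Err(format!(
            "gateway protocol {gateway_version} is older than required {client_version}"
        ));
    }
    if gateway_version > client_version {
        eprintln!(
            "warning: gateway protocol {gateway_version} is newer than client \
             {client_version}; consider updating the client"
        );
    }
    Ok(())
}

fn main() {
    assert!(check_gateway_protocol(2, 2).is_ok());
    assert!(check_gateway_protocol(1, 2).is_err());
    assert!(check_gateway_protocol(3, 2).is_ok()); // newer gateway accepted with warning
}
```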
* feat(nym-network-monitor): add diagnostic logging for ConnectionRefused debugging
  - Add detailed logging for the client rotation lifecycle with [CLIENT_ROTATION_*] tags
  - Add request tracking with unique IDs, using the random message as identifier
  - Add HTTP connection acceptance logging with [HTTP_REQUEST] tags
  - Improve the locust script with connection error handling and a backoff strategy
  - Add TCP backlog configuration support (default 1024)
  - Add socket configuration with SO_REUSEADDR and a configurable backlog
  - Add HTTP timeout and tracing layers for better observability

  These changes help diagnose when and why ConnectionRefused errors occur during load testing, particularly around client rotation periods.
* fix(nym-api): calculate route average reliability as node mean instead of route-based
  - Changed the route average reliability calculation to use the mean of all node reliabilities
  - Added a median calculation alongside the mean for better statistical representation
  - Fixed null average reliability for the old method, which doesn't analyze routes
  - Rounded all float values to 2 decimal places in API responses for cleaner display
  - Store median values in the analysis_parameters JSON field
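The mean/median statistics and 2-decimal rounding described in the second commit can be sketched as below; the real implementation lives in the nym-api storage layer, so these names are illustrative:

```rust
/// Rounds to two decimal places, matching the API response formatting.
fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}

/// Mean of all node reliabilities (assumes a non-empty slice).
fn mean(values: &[f64]) -> f64 {
    round2(values.iter().sum::<f64>() / values.len() as f64)
}

/// Median alongside the mean, for a statistic robust to outlier nodes.
fn median(values: &[f64]) -> f64 {
    let mut v = values.to_vec();
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mid = v.len() / 2;
    let m = if v.len() % 2 == 0 {
        (v[mid - 1] + v[mid]) / 2.0
    } else {
        v[mid]
    };
    round2(m)
}

fn main() {
    assert_eq!(mean(&[0.5, 1.0]), 0.75);
    assert_eq!(median(&[3.0, 1.0, 2.0]), 2.0);
    assert_eq!(median(&[1.0, 2.0, 3.0, 4.0]), 2.5);
}
```

Reporting the median next to the mean is useful here because a few badly failing nodes can drag the mean down without reflecting typical route quality.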
…comparison

- Added a ReliabilityDistribution struct with 6 categories: excellent (>95%), very_good (90-95%), good (75-90%), moderate (50-75%), poor (25-50%), very_poor (<25%)
- Updated ComparisonSummaryStats to include distribution breakdowns for both the old and new methods
- Provides better insight into performance distribution beyond averages
- Added tests for distribution calculation
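The six-bucket breakdown can be sketched as a simple counting pass. Which side of each boundary a bucket edge falls on is an assumption here, not confirmed by the commit:

```rust
#[derive(Default, Debug, PartialEq)]
struct ReliabilityDistribution {
    excellent: u32, // > 95%
    very_good: u32, // 90-95%
    good: u32,      // 75-90%
    moderate: u32,  // 50-75%
    poor: u32,      // 25-50%
    very_poor: u32, // < 25%
}

/// Counts how many reliability percentages fall into each category.
fn distribution(percentages: &[f64]) -> ReliabilityDistribution {
    let mut d = ReliabilityDistribution::default();
    for &p in percentages {
        match p {
            x if x > 95.0 => d.excellent += 1,
            x if x > 90.0 => d.very_good += 1,
            x if x > 75.0 => d.good += 1,
            x if x > 50.0 => d.moderate += 1,
            x if x > 25.0 => d.poor += 1,
            _ => d.very_poor += 1,
        }
    }
    d
}

fn main() {
    let d = distribution(&[99.0, 92.0, 80.0, 60.0, 30.0, 10.0]);
    assert_eq!(d.excellent, 1);
    assert_eq!(d.very_poor, 1);
}
```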
Since old method data is already available in the production database, the simulation now only calculates new method performance. The API combines production data with simulated data at query time.

Key changes:
- Remove the old_method_simulation and run_both_methods config options
- Add a production_performance field to the NodePerformanceData model
- Update API endpoints to fetch production data from the node annotations cache
- Refactor compare_methods to use a single performance dataset
- Fix route analysis comparison to handle missing old method data
- Update all tests to use the new single-dataset approach

This simplifies the simulation code and ensures consistency with production calculations.
…ures

- Remove the old_method/new_method nested objects from NodeMethodComparison
- Use direct fields for production_performance and simulated_performance
- Update build_node_comparisons_from_single_dataset to create a cleaner structure
- Fix calculate_summary_statistics to work with the new data model
- Update all tests to match the new structure
- Deduplicate routes, to control for deterministic routing

The API now returns a single dataset with production performance values included directly, rather than separate old/new method objects.
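Route deduplication matters here because deterministic routing can sample the same node sequence repeatedly, which would overweight those nodes in the statistics. A sketch with illustrative types:

```rust
use std::collections::HashSet;

type Route = Vec<u32>; // ordered node ids along a route

/// Collapses identical routes, keeping the first occurrence of each.
fn deduplicate_routes(routes: Vec<Route>) -> Vec<Route> {
    let mut seen = HashSet::new();
    routes
        .into_iter()
        .filter(|r| seen.insert(r.clone()))
        .collect()
}

fn main() {
    let routes = vec![vec![1, 2, 3], vec![1, 2, 3], vec![4, 5, 6]];
    assert_eq!(deduplicate_routes(routes), vec![vec![1, 2, 3], vec![4, 5, 6]]);
}
```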
- Add create_or_get_simulated_reward_epoch to check for existing simulations
- Skip the simulation if one already exists for the epoch and calculation method
- Add tests to verify duplicate prevention works correctly
- Update all test calls to use the new method

This fixes the issue where multiple simulations were created for the same epoch when epoch operations were retried or the service restarted.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
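The create-or-get idempotency pattern can be sketched with an in-memory map standing in for the database, keyed by (epoch, method) so retries return the existing simulation instead of creating a duplicate. Names are illustrative:

```rust
use std::collections::HashMap;

type Key = (u32, String); // (epoch_id, calculation_method)

/// Returns the existing simulation id for this epoch + method, or records
/// `next_id` as a new one. A second call with the same key is a no-op.
fn create_or_get(store: &mut HashMap<Key, u64>, epoch: u32, method: &str, next_id: u64) -> u64 {
    *store
        .entry((epoch, method.to_string()))
        .or_insert(next_id)
}

fn main() {
    let mut store = HashMap::new();
    let first = create_or_get(&mut store, 7, "new_1h", 1);
    // Retried epoch operation: same epoch + method yields the same id.
    let second = create_or_get(&mut store, 7, "new_1h", 2);
    assert_eq!(first, second);
}
```

In the database version the same guarantee typically comes from a unique constraint on (epoch_id, calculation_method) plus an insert-or-select, so concurrent retries also cannot race.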