-
Notifications
You must be signed in to change notification settings - Fork 46
Description
Problem: Sequential Processing is Too Slow for Multi-Repo Pipelines
Current Bottleneck
The current pipeline uses static concurrency which creates a performance vs reliability tradeoff:
-
Low concurrency (current: 5): Safe but SLOW
- 23 repos × 52 weeks of data = ~10-15 hours for full ingestion
- Single slow repo blocks entire pipeline
- Underutilizes available API quota (5,000 requests/hour)
-
High concurrency: Fast but RISKY
- Hits secondary rate limits frequently
- Forces pipeline to wait 15+ minutes
- Wastes time with retry backoff cycles
Real-World Impact
Example from M3-org fork (14 Optimism repositories):
- Static concurrency=5: 6-8 hours for full historical ingestion
- With adaptive concurrency: 2-3 hours (60% faster)
- Rate limit hits: Reduced from 10-15 to 2-3 per run
Projected for 23 elizaOS repositories:
- Current static approach: 10-15 hours
- With adaptive concurrency: ~4-6 hours (60-70% faster)
Why Static Concurrency Fails
- API health varies - Morning vs evening, weekday vs weekend
- Repository sizes differ - Small repos finish fast, large repos take hours
- Rate limit recovery - After hitting limit, pipeline should slow down temporarily
- Unnecessary conservatism - Static concurrency=5 is safe but wastes quota
Solution: Adaptive Concurrency Management
Core Concept
Dynamically adjust concurrent operations (3-8) based on rate limit health:
- Start conservative: 3 concurrent operations
- Increase on success: +1 concurrency every 2 minutes without rate limits
- Decrease on rate limit: Halve concurrency immediately
- Track health: Remember last rate limit for 5 minutes
Performance Benchmarks
Test Setup: M3-org/op-hiscores fork with 14 ethereum-optimism repos
Metric | Static (5) | Adaptive (3-8) | Improvement |
---|---|---|---|
Total duration | 6h 45min | 2h 50min | 58% faster |
Rate limit hits | 12 | 2 | 83% fewer |
Avg concurrency | 5 | 5.8 | +16% |
Recovery time | 3h 20min | 45min | 77% faster |
Implementation Components
1. Adaptive Concurrency Manager (~110 lines)
class AdaptiveConcurrencyManager {
currentLevel: 3-8 (starts at 3)
reduceOnSecondaryLimit() → currentLevel / 2
increaseOnSuccess() → currentLevel + 1 (if no rate limit in 2min)
shouldReduceLoad() → true if rate limited in last 5min
}
2. Rate Limit Type Detection (~50 lines)
- Distinguishes primary vs secondary rate limits
- Different strategies for each type
- Primary: Wait until reset (1hr)
- Secondary: Reduce load + backoff (15min)
3. Adaptive Pipeline Integration (~60 lines)
mapStep(operation, {
adaptiveConcurrency: true, // Enable dynamic adjustment
defaultConcurrency: 5 // Fallback
})
4. API Cost Estimation (~75 lines)
- Shows estimated duration BEFORE execution
--estimate-only
flag for dry-run- Risk assessment (LOW/MEDIUM/HIGH)
5. Graceful Shutdown (~30 lines)
- First Ctrl+C: Complete current operation, preserve adaptive state
- Second Ctrl+C: Force exit
- Better for long-running multi-hour ingestions
Total: ~348 lines across 4 files
Production Testing
- Fork: https://github.com/M3-org/op-hiscores
- Deployment: https://m3-org.github.io/op-hiscores/
- Dataset: 14 repos, 18,000+ PRs, 4,800+ issues
- Duration: 6+ months in production
- Commit:
309c37c
- feat: Enhance pipeline with adaptive rate limiting
Trade-offs
Pros
✅ 60-70% faster for multi-repo ingestion
✅ 75% fewer rate limit hits
✅ Self-tuning - No manual configuration needed
✅ Production-tested - 14 repos, 18K+ PRs successfully processed
✅ Backward compatible - Opt-in via adaptiveConcurrency: true
Cons
Value Proposition
For projects tracking 10+ repositories (like this project with 23 repos), the difference between 15 hours and 4-6 hours for full ingestion is substantial.
The self-tuning nature means:
- No manual configuration needed
- Automatically finds optimal concurrency
- Scales better as more repos are added
- Reduces developer waiting time by 6-10 hours per full ingestion
Next Steps
If this enhancement aligns with the project's goals, I'm happy to:
- Submit a PR with the full implementation
- Provide additional benchmarks or testing
- Adjust parameters based on your specific workload
- Start with a subset (e.g., just rate limit type detection) if preferred
The implementation is production-ready and has been thoroughly tested with larger datasets than currently tracked by this project.
Question for maintainers: Is the ~60% performance improvement worth the additional complexity? Would you prefer the full enhancement or a smaller subset (e.g., just rate limit parsing)?