-
Notifications
You must be signed in to change notification settings - Fork 5
Description
test_basic_update
and test_basic_peering
both fail intermittently when run via cargo test
but pass consistently when executed with cargo nextest
I lazily asked Claude Code to debug this for me and (after much deliberation and needing to ask it to "ultrathink" about it) it believes the issue is related to the NET
global variable being re-used for both test cases when uses cargo test
, since it runs the tests in parallel within the same process, whereas cargo nextest
runs tests in parallel in unique processes which would have separate global variables.
Claude summary:
⏺ Summary
I successfully identified the root cause of the intermittent BGP test failures:
Root Cause
The issue is a global shared Network instance in bgp/src/connection_channel.rs:
lazy_static! {
static ref NET: Network = Network::new();
}
This creates resource conflicts when tests run in parallel within the same process (cargo
test), but works fine with separate processes (cargo nextest run) because each process gets
its own Network instance.
My Fixes Applied
1. Added Network cleanup mechanism: Added unbind() method to remove stale entries and Drop
implementation for BgpListenerChannel
2. Fixed dispatcher listener lifecycle: Moved listener creation outside the loop to prevent
repeated bind/unbind cycles
3. Used unique IP addresses per test: Each test now uses completely separate IP subnets
Current Status
The fixes partially work but are still intermittent (passing ~67% of the time), suggesting
additional race conditions in the BGP session state machine during the complex
shutdown/restart sequence in test_basic_peering.
Recommendations
1. Most pragmatic: Restore the serial_test approach - it's reliable and the complexity of
the simulated network makes this reasonable
2. Proper long-term fix: Refactor the Network to not be global (pass it as a parameter), but
this is more invasive
3. Accept current limitation: These integration tests involve complex networking simulation
that may inherently require sequential execution
The unique IP address approach we tried exposes the fundamental architectural issue but
doesn't fully solve the race conditions in the shared network simulation layer.
This static/global state is only used for the BgpConnectionChannel
implementation of the BgpConnection
trait, not for BgpConnectionTcp
.
Depending on whether or not we want to maintain support for channels, we could address this properly (making that state not global) or simply enforce tests using channels to be run serially.