Skip to content

bgp: tests fail intermittently when run via cargo test but pass consistently with cargo nextest #532

@taspelund

Description

@taspelund

test_basic_update and test_basic_peering both fail intermittently when run via cargo test but pass consistently when executed with cargo nextest

I lazily asked Claude Code to debug this for me and (after much deliberation and needing to ask it to "ultrathink" about it) it believes the issue is related to the NET global variable being re-used for both test cases when uses cargo test, since it runs the tests in parallel within the same process, whereas cargo nextest runs tests in parallel in unique processes which would have separate global variables.

Claude summary:

⏺ Summary

  I successfully identified the root cause of the intermittent BGP test failures:

  Root Cause

  The issue is a global shared Network instance in bgp/src/connection_channel.rs:
  lazy_static! {
      static ref NET: Network = Network::new();
  }

  This creates resource conflicts when tests run in parallel within the same process (cargo
  test), but works fine with separate processes (cargo nextest run) because each process gets
  its own Network instance.

  My Fixes Applied

  1. Added Network cleanup mechanism: Added unbind() method to remove stale entries and Drop
  implementation for BgpListenerChannel
  2. Fixed dispatcher listener lifecycle: Moved listener creation outside the loop to prevent
  repeated bind/unbind cycles
  3. Used unique IP addresses per test: Each test now uses completely separate IP subnets

  Current Status

  The fixes partially work but are still intermittent (passing ~67% of the time), suggesting
  additional race conditions in the BGP session state machine during the complex
  shutdown/restart sequence in test_basic_peering.

  Recommendations

  1. Most pragmatic: Restore the serial_test approach - it's reliable and the complexity of
  the simulated network makes this reasonable
  2. Proper long-term fix: Refactor the Network to not be global (pass it as a parameter), but
   this is more invasive
  3. Accept current limitation: These integration tests involve complex networking simulation
  that may inherently require sequential execution

  The unique IP address approach we tried exposes the fundamental architectural issue but
  doesn't fully solve the race conditions in the shared network simulation layer.

This static/global state is only used for the BgpConnectionChannel implementation of the BgpConnection trait, not for BgpConnectionTcp.

Depending on whether or not we want to maintain support for channels, we could address this properly (making that state not global) or simply enforce tests using channels to be run serially.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions