@zeeshanlakhani zeeshanlakhani commented Aug 29, 2025

This change strengthens the multicast implementation with always-allocated group IDs, better API validation, and comprehensive test improvements for Omicron integration.

Multicast group IDs are no longer optional: they are always allocated during group creation, matching how multicast groups are configured in the Omicron control plane (CP).

In Omicron, multicast groups are created first, without members, and then members are added as instances are configured for a multicast group.

Replication configuration is only written to tables when members are added, but IDs are always generated for the 1:1 mapping between underlay and external (overlay) associated groups.
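
As a rough illustration of the lifecycle described above, the sketch below allocates an ID at creation time and only records replication state once a group has members. All names here (`Groups`, `Group`, `GroupId`, `set_members`, `has_replication`) are hypothetical and are not the types used in dpd; the single ID per group is a simplification of the underlay/external pairing.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a multicast group ID (not the dpd type).
type GroupId = u16;

/// Hypothetical group record: the ID is a plain field, not an `Option`.
struct Group {
    id: GroupId,
    members: BTreeSet<String>,
}

/// Hypothetical in-memory model of group state and replication tables.
#[derive(Default)]
struct Groups {
    next_id: GroupId,
    by_name: HashMap<String, Group>,
    /// IDs of groups that currently have replication configured.
    replication: BTreeSet<GroupId>,
}

impl Groups {
    /// Create a group with no members; an ID is allocated unconditionally.
    fn create(&mut self, name: &str) -> GroupId {
        let id = self.next_id;
        self.next_id += 1;
        let group = Group { id, members: BTreeSet::new() };
        self.by_name.insert(name.to_string(), group);
        id
    }

    /// Replace the member set; replication is configured only when the
    /// group is non-empty and removed again when it becomes empty.
    fn set_members(&mut self, name: &str, members: &[&str]) -> Option<()> {
        let group = self.by_name.get_mut(name)?;
        group.members = members.iter().map(|m| m.to_string()).collect();
        if group.members.is_empty() {
            self.replication.remove(&group.id);
        } else {
            self.replication.insert(group.id);
        }
        Some(())
    }

    fn has_replication(&self, id: GroupId) -> bool {
        self.replication.contains(&id)
    }
}
```

In this model a freshly created group already has a usable ID, but `has_replication` stays false until the first member is added.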

Includes:

  • Core ID Management Changes:

    • Remove Option<MulticastGroupId>: IDs are always allocated during
      group creation
    • Establish 1:1 mapping between underlay and external (overlay) groups
    • External groups now use IDs from corresponding NAT target (Omicron
      keeps the true relational mapping)
  • API Changes and Validation:

    • Remove sources field from internal group
      APIs (MulticastGroupCreateEntry, MulticastGroupUpdateEntry)
    • New response types for external and underlay groups (MulticastExternalGroupResponse and MulticastUnderlayGroupResponse), plus a unified MulticastGroupResponse for list calls
    • Internal groups cannot have sources or NAT targets - cleaner
      separation of concerns
    • External groups retain sources for proper SSM (Source-Specific
      Multicast) validation
    • Now fail outright on reset if cleanup is not used properly, which
      helps on the Omicron side
    • Rename API boundary structs for consistency
    • Integrate all the new types with the API trait that went in upstream
    • Add a new AdminScopedIpv6 type so that calls into dpd are properly typed for internal underlay groups
  • Rollback & Error Handling:

    • The addition of a rollback module (and trait) for a more
      functional approach to rollback on creation or updates involving
      tables, ports, etc
    • Improved error propagation in test cleanup to catch resource leaks early
    • Better validation of group ID relationships to match tables and
      allocation states
  • Test Infrastructure Improvements:

    • Enhanced cleanup_test_group() to fail explicitly on deletion errors
      (prevents test pollution), and ensures proper 1:1 deletion mapping
    • New tests for rollback, empty members upon multicast group creation/update
  • Replication Management:

    • Configure replication only when groups have members (change made
      expecting empty groups in Omicron CP initially)
    • Reconfigure replication tables when transitioning between empty/populated groups
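
The idea of a dedicated address type for underlay groups can be sketched as a validating newtype. This is an illustration only: the constructor name, the error type, and the rule used here (accept only IPv6 multicast addresses whose scope nibble is 4, admin-local per RFC 4291) are assumptions, and the real AdminScopedIpv6 in this PR may accept a different set of scopes.

```rust
use std::net::Ipv6Addr;

/// Hypothetical newtype; the real `AdminScopedIpv6` may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AdminScopedIpv6(Ipv6Addr);

/// Hypothetical error carrying the rejected address.
#[derive(Debug, PartialEq, Eq)]
pub struct NotAdminScoped(pub Ipv6Addr);

impl AdminScopedIpv6 {
    /// Accept only addresses of the form ffX4::/16, where X is the
    /// 4-bit flags field and 4 is the admin-local scope value.
    pub fn new(addr: Ipv6Addr) -> Result<Self, NotAdminScoped> {
        let first = addr.segments()[0];
        let is_multicast = (first & 0xff00) == 0xff00;
        let scope = first & 0x000f;
        if is_multicast && scope == 0x4 {
            Ok(AdminScopedIpv6(addr))
        } else {
            Err(NotAdminScoped(addr))
        }
    }

    /// Return the wrapped address.
    pub fn get(self) -> Ipv6Addr {
        self.0
    }
}
```

Validating once at construction means any function that takes the newtype can rely on the scope check having already happened.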

Key aspects this commit covers:

  1. ID Management to match expectations in Omicron's multicast impl
  2. Validation: Enhanced API validation, group ID relationship checks,
    SSM validation
  3. Rollback: Reset operations now fail explicitly, better error propagation
  4. Testing: Comprehensive test improvements, better error handling,
    standardized cleanup
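
A minimal sketch of snapshot-based rollback, in the spirit of the rollback module mentioned above. The trait name, its methods, the `with_rollback` helper, and the `Tables` type are all hypothetical and do not reflect the actual trait in this PR; the sketch only shows the general pattern of capturing the old state, attempting the change, and restoring on failure.

```rust
/// Hypothetical trait: capture state before a change and restore it later.
trait Rollback {
    type Snapshot;
    fn snapshot(&self) -> Self::Snapshot;
    fn restore(&mut self, snapshot: Self::Snapshot);
}

/// Run `op` against `target`; if it fails, restore the prior snapshot
/// and pass the original error through unchanged.
fn with_rollback<T, E, F>(target: &mut T, op: F) -> Result<(), E>
where
    T: Rollback,
    F: FnOnce(&mut T) -> Result<(), E>,
{
    let snapshot = target.snapshot();
    match op(target) {
        Ok(()) => Ok(()),
        Err(e) => {
            target.restore(snapshot);
            Err(e)
        }
    }
}

/// Hypothetical stand-in for table state touched by a group update.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
struct Tables {
    entries: Vec<u16>,
}

impl Rollback for Tables {
    type Snapshot = Tables;

    fn snapshot(&self) -> Tables {
        self.clone()
    }

    fn restore(&mut self, snapshot: Tables) {
        *self = snapshot;
    }
}
```

Cloning the whole state is the simplest possible snapshot; a real implementation would more likely record only the entries an operation touches.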

@zeeshanlakhani zeeshanlakhani force-pushed the zl/omicron-mcast-fallout branch 5 times, most recently from 15f31df to 076815a Compare September 3, 2025 03:02
@zeeshanlakhani zeeshanlakhani changed the title [mcast] updates for omicron changes [mcast] Lifecycle + API changes for Omicron impl Sep 3, 2025
@zeeshanlakhani zeeshanlakhani force-pushed the zl/omicron-mcast-fallout branch 2 times, most recently from 0fc844d to 55b4d32 Compare September 8, 2025 17:31
@zeeshanlakhani zeeshanlakhani marked this pull request as ready for review September 8, 2025 17:59

@FelixMcFelix FelixMcFelix left a comment


Thanks, Zeeshan. Full disclosure that I haven't yet looked at the integration tests. I think the new rollback machinery is pretty neat -- obviously it's geared toward just multicast, but I think the model of maintaining a snapshot of the old target state and moving back to it is pretty useful.

I think overall I'm a bit confused by the mention of External forwarding groups being used with instances/guests, but most things here are nits.

zeeshanlakhani and others added 2 commits September 9, 2025 20:49
…nsistentcy

This includes `MulticastUnderlayGroupResponse` and `MulticastExternalGroupResponse`, and a
unified response type, `MulticastGroupResponse`, for list calls that return mixed results.
We also added an AdminScoped type for underlay groups, renamed structs so that
naming is consistent throughout, and now handle rollback at the
boundary calls to internal fns.

This PR has been updated to accommodate the new API trait,
oxidecomputer/omicron#8922, so it changes
a lot relative to the previous code and commit.
@zeeshanlakhani

@FelixMcFelix Sorry for the additional changes, as 2daa552 went in after the review. With it changing all the type handling, I went ahead and just made the API more consistent (and properly restrictive) across the board.


@FelixMcFelix FelixMcFelix left a comment


Thanks for iterating here. Mainly a pile of raw-string-shaped nits, with one or two genuine questions in integration_tests and rollback.


@FelixMcFelix FelixMcFelix left a comment


Thanks for working through the changes!

@zeeshanlakhani zeeshanlakhani merged commit 40f9237 into main Sep 17, 2025
6 checks passed
@zeeshanlakhani zeeshanlakhani deleted the zl/omicron-mcast-fallout branch September 17, 2025 11:33
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request Sep 24, 2025
Introduce end-to-end multicast support across control plane and sled-agent, and integrate IP pool model extensions required
for supporting multicast workflows. This work enables project-scoped multicast groups with lifecycle-driven dataplane programming
and exposes an API for operating multicast groups over instances.

Highlights:
  - DB: new multicast_group tables; member lifecycle management; pool_type/mvlan/switch_port_uplinks
  - API: multicast group/member CRUD; source IP validation; VPC/project hierarchy integration with default VNI fallback
  - Control plane: RPW reconcilers for groups/members; sagas for dataplane updates atomically at the group level; instance lifecycle hooks and piggybacking
  - Dataplane: Dendrite DPD switch programming via trait abstraction; DPD client used in tests
  - Sled agent: multicast-aware instance management; network interface configuration for multicast traffic; cross-version testing; OPTE stubs present
  - Tests: comprehensive integration suites under nexus/tests/integration_tests/multicast/

Components:
  - Database schema: external and underlay multicast groups; member/instance association tables; IP pool enhancements (pool_type, mvlan, switch_port_uplinks)
  - Control plane modules: multicast group management, member lifecycle, dataplane abstraction; RPW reconcilers to ensure convergence
  - API layer: endpoints and validation; default-VNI semantics when VPC not provided
  - Sled agent: OPTE stubs and compatibility shims for older agents

Workflows Implemented:
  1. Instance lifecycle integration:
     - "Create" -> resolve VPC/VNI (or default), validate source IPs, create memberships, enqueue group ensure RPW
     - "Start" -> program dataplane via ensure/update sagas; activate member flows after switch ack
     - "Stop" -> deactivate dataplane membership; retain DB membership for fast restart
     - "Delete" -> remove instance memberships; group deletion is explicit
     - "Migrate" -> deactivate on source sled; activate on target; idempotent with ordering guarantees
     - Restart/recovery -> RPWs reconcile desired state; compensations clean up partial programming
  2. RPW reconciliation:
     - Ensure dataplane switches match database state
     - Handle sled migrations and state transitions
     - Eventual consistency with retry logic
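
The reconciliation step above can be pictured as a diff between desired (database) state and actual (dataplane) state. The sketch below is an assumption-laden illustration, not Omicron's RPW code: `Action`, `reconcile`, and `apply` are hypothetical names, and memberships are modeled as plain strings.

```rust
use std::collections::BTreeSet;

/// Hypothetical dataplane action produced by a reconciliation pass.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Action {
    Add(String),
    Remove(String),
}

/// Compute the actions needed to make `actual` match `desired`.
fn reconcile(desired: &BTreeSet<String>, actual: &BTreeSet<String>) -> Vec<Action> {
    let mut actions = Vec::new();
    // Members recorded in the database but missing from the dataplane.
    for member in desired.difference(actual) {
        actions.push(Action::Add(member.clone()));
    }
    // Members programmed in the dataplane but no longer desired.
    for member in actual.difference(desired) {
        actions.push(Action::Remove(member.clone()));
    }
    actions
}

/// Apply actions to the modeled dataplane state.
fn apply(actual: &mut BTreeSet<String>, actions: &[Action]) {
    for action in actions {
        match action {
            Action::Add(member) => {
                actual.insert(member.clone());
            }
            Action::Remove(member) => {
                actual.remove(member);
            }
        }
    }
}
```

Because each pass recomputes the diff from scratch, running it again after a successful pass yields no actions, which is what lets a retrying reconciler converge.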

Migrations:
  - Apply schema changes in schema/crdb/multicast-support/up01.sql (and update dbinit.sql)
  - Bump schema versions accordingly

API/Compatibility:
  - OpenAPI updated: openapi/nexus.json, openapi/sled-agent/sled-agent-5.0.0-89f1f7.json
  - Regenerate clients where applicable

References:
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14

Follow-ups include:
  - OPTE integration
  - commtest extension
  - omdb commands are tracked in issues
  - pool and group stats
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request Sep 24, 2025
Introduce end-to-end multicast support across control plane and sled-agent, and integrate IP pool model extensions required
for supporting multicast workflows. This work enables project-scoped multicast groups with lifecycle-driven dataplane programming
and exposes an API for operating multicast groups over instances.

Highlights:
  - DB: new multicast_group tables; member lifecycle management; pool_type/mvlan/switch_port_uplinks
  - API: multicast group/member CRUD; source IP validation; VPC/project hierarchy integration with default VNI fallback
  - Control plane: RPW reconcilers for groups/members; sagas for dataplane updates atomically at the group level; instance lifecycle hooks and piggybacking
  - Dataplane: Dendrite DPD switch programming via trait abstraction; DPD client used in tests
  - Sled agent: multicast-aware instance management; network interface configuration for multicast traffic; cross-version testing; OPTE stubs present
  - Tests: comprehensive integration suites under nexus/tests/integration_tests/multicast/

Components:
  - Database schema: external and underlay multicast groups; member/instance association tables; IP pool enhancements (pool_type, mvlan, switch_port_uplinks)
  - Control plane modules: multicast group management, member lifecycle, dataplane abstraction; RPW reconcilers to ensure convergence
  - API layer: endpoints and validation; default-VNI semantics when VPC not provided
  - Sled agent: OPTE stubs and compatibility shims for older agents

Workflows Implemented:
  1. Instance lifecycle integration:
     - "Create" -> resolve VPC/VNI (or default), validate source IPs, create memberships, enqueue group ensure RPW
     - "Start" -> program dataplane via ensure/update sagas; activate member flows after switch ack
     - "Stop" -> deactivate dataplane membership; retain DB membership for fast restart
     - "Delete" -> remove instance memberships; group deletion is explicit
     - "Migrate" -> deactivate on source sled; activate on target; idempotent with ordering guarantees
     - Restart/recovery -> RPWs reconcile desired state; compensations clean up partial programming
  2. RPW reconciliation:
     - ensure dataplane switches match database state
     - handle sled migrations and state transitions
     - Eventual consistency with retry logic

Migrations:
  - Apply schema changes in schema/crdb/multicast-support/up01.sql (and update dbinit.sql)
  - Bump schema versions accordingly

API/Compatibility:
  - OpenAPI updated: openapi/nexus.json, openapi/sled-agent/sled-agent-5.0.0-89f1f7.json
  - Regenerate clients where applicable

References:
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14

Follow-ups include:
  - OPTE integration
  - commtest extension
  - omdb commands are tracked in issues
  - pool and group stats
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request Sep 24, 2025
Introduce end-to-end multicast support across control plane and sled-agent, and integrate IP pool model extensions required
for supporting multicast workflows. This work enables project-scoped multicast groups with lifecycle-driven dataplane programming
and exposes an API for operating multicast groups over instances.

Highlights:
  - DB: new multicast_group tables; member lifecycle management; pool_type/mvlan/switch_port_uplinks
  - API: multicast group/member CRUD; source IP validation; VPC/project hierarchy integration with default VNI fallback
  - Control plane: RPW reconcilers for groups/members; sagas for dataplane updates atomically at the group level; instance lifecycle hooks and piggybacking
  - Dataplane: Dendrite DPD switch programming via trait abstraction; DPD client used in tests
  - Sled agent: multicast-aware instance management; network interface configuration for multicast traffic; cross-version testing; OPTE stubs present
  - Tests: comprehensive integration suites under nexus/tests/integration_tests/multicast/

Components:
  - Database schema: external and underlay multicast groups; member/instance association tables; IP pool enhancements (pool_type, mvlan, switch_port_uplinks)
  - Control plane modules: multicast group management, member lifecycle, dataplane abstraction; RPW reconcilers to ensure convergence
  - API layer: endpoints and validation; default-VNI semantics when VPC not provided
  - Sled agent: OPTE stubs and compatibility shims for older agents

Workflows Implemented:
  1. Instance lifecycle integration:

     - "Create" -> resolve VPC/VNI (or default), validate source IPs, create memberships, enqueue group ensure RPW
     - "Start" -> program dataplane via ensure/update sagas; activate member flows after switch ack
     - "Stop" -> deactivate dataplane membership; retain DB membership for fast restart
     - "Delete" -> remove instance memberships; group deletion is explicit
     - "Migrate" -> deactivate on source sled; activate on target; idempotent with ordering guarantees
     - Restart/recovery -> RPWs reconcile desired state; compensations clean up partial programming

  2. RPW reconciliation:

     - ensure dataplane switches match database state
     - handle sled migrations and state transitions
     - Eventual consistency with retry logic

Migrations:
  - Apply schema changes in schema/crdb/multicast-support/up01.sql (and update dbinit.sql)
  - Bump schema versions accordingly

API/Compatibility:
  - OpenAPI updated: openapi/nexus.json, openapi/sled-agent/sled-agent-5.0.0-89f1f7.json
  - Regenerate clients where applicable

References:
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14

Follow-ups include:
  - OPTE integration
  - commtest extension
  - omdb commands are tracked in issues
  - pool and group stats
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request Sep 24, 2025
Introduce end-to-end multicast support across control plane and sled-agent, and integrate IP pool model extensions required
for supporting multicast workflows. This work enables project-scoped multicast groups with lifecycle-driven dataplane programming
and exposes an API for operating multicast groups over instances.

Highlights:
  - DB: new multicast_group tables; member lifecycle management; pool_type/mvlan/switch_port_uplinks
  - API: multicast group/member CRUD; source IP validation; VPC/project hierarchy integration with default VNI fallback
  - Control plane: RPW reconcilers for groups/members; sagas for dataplane updates atomically at the group level; instance lifecycle hooks and piggybacking
  - Dataplane: Dendrite DPD switch programming via trait abstraction; DPD client used in tests
  - Sled agent: multicast-aware instance management; network interface configuration for multicast traffic; cross-version testing; OPTE stubs present
  - Tests: comprehensive integration suites under nexus/tests/integration_tests/multicast/

Components:
  - Database schema: external and underlay multicast groups; member/instance association tables; IP pool enhancements (pool_type, mvlan, switch_port_uplinks)
  - Control plane modules: multicast group management, member lifecycle, dataplane abstraction; RPW reconcilers to ensure convergence
  - API layer: endpoints and validation; default-VNI semantics when VPC not provided
  - Sled agent: OPTE stubs and compatibility shims for older agents

Workflows Implemented:
  1. Instance lifecycle integration:

     - "Create" -> resolve VPC/VNI (or default), validate source IPs, create memberships, enqueue group ensure RPW
     - "Start" -> program dataplane via ensure/update sagas; activate member flows after switch ack
     - "Stop" -> deactivate dataplane membership; retain DB membership for fast restart
     - "Delete" -> remove instance memberships; group deletion is explicit
     - "Migrate" -> deactivate on source sled; activate on target; idempotent with ordering guarantees
     - Restart/recovery -> RPWs reconcile desired state; compensations clean up partial programming

  2. RPW reconciliation:

     - ensure dataplane switches match database state
     - handle sled migrations and state transitions
     - Eventual consistency with retry logic

Migrations:
  - Apply schema changes in schema/crdb/multicast-support/up01.sql (and update dbinit.sql)
  - Bump schema versions accordingly

API/Compatibility:
  - OpenAPI updated: openapi/nexus.json, openapi/sled-agent/sled-agent-5.0.0-89f1f7.json
  - Regenerate clients where applicable

References:
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14

Follow-ups include:
  - OPTE integration
  - commtest extension
  - omdb commands are tracked in issues
  - pool and group stats
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request Sep 24, 2025
Introduce end-to-end multicast support across control plane and sled-agent, and integrate IP pool model extensions required
for supporting multicast workflows. This work enables project-scoped multicast groups with lifecycle-driven dataplane programming
and exposes an API for operating multicast groups over instances.

Highlights:
  - DB: new multicast_group tables; member lifecycle management; pool_type/mvlan/switch_port_uplinks
  - API: multicast group/member CRUD; source IP validation; VPC/project hierarchy integration with default VNI fallback
  - Control plane: RPW reconcilers for groups/members; sagas for dataplane updates atomically at the group level; instance lifecycle hooks and piggybacking
  - Dataplane: Dendrite DPD switch programming via trait abstraction; DPD client used in tests
  - Sled agent: multicast-aware instance management; network interface configuration for multicast traffic; cross-version testing; OPTE stubs present
  - Tests: comprehensive integration suites under nexus/tests/integration_tests/multicast/

Components:
  - Database schema: external and underlay multicast groups; member/instance association tables; IP pool enhancements (pool_type, mvlan, switch_port_uplinks)
  - Control plane modules: multicast group management, member lifecycle, dataplane abstraction; RPW reconcilers to ensure convergence
  - API layer: endpoints and validation; default-VNI semantics when VPC not provided
  - Sled agent: OPTE stubs and compatibility shims for older agents

Workflows Implemented:
  1. Instance lifecycle integration:

     - "Create" -> resolve VPC/VNI (or default), validate source IPs, create memberships, enqueue group ensure RPW
     - "Start" -> program dataplane via ensure/update sagas; activate member flows after switch ack
     - "Stop" -> deactivate dataplane membership; retain DB membership for fast restart
     - "Delete" -> remove instance memberships; group deletion is explicit
     - "Migrate" -> deactivate on source sled; activate on target; idempotent with ordering guarantees
     - Restart/recovery -> RPWs reconcile desired state; compensations clean up partial programming

  2. RPW reconciliation:

     - ensure dataplane switches match database state
     - handle sled migrations and state transitions
     - Eventual consistency with retry logic

Migrations:
  - Apply schema changes in schema/crdb/multicast-support/up01.sql (and update dbinit.sql)
  - Bump schema versions accordingly

API/Compatibility:
  - OpenAPI updated: openapi/nexus.json, openapi/sled-agent/sled-agent-5.0.0-89f1f7.json
  - Regenerate clients where applicable

References:
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14

Follow-ups include:
  - OPTE integration
  - commtest extension
  - omdb commands are tracked in issues
  - pool and group stats
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request Sep 24, 2025
Introduce end-to-end multicast support across control plane and sled-agent, and integrate IP pool model extensions required
for supporting multicast workflows. This work enables project-scoped multicast groups with lifecycle-driven dataplane programming
and exposes an API for operating multicast groups over instances.

Highlights:
  - DB: new multicast_group tables; member lifecycle management; pool_type/mvlan/switch_port_uplinks
  - API: multicast group/member CRUD; source IP validation; VPC/project hierarchy integration with default VNI fallback
  - Control plane: RPW reconcilers for groups/members; sagas for dataplane updates atomically at the group level; instance lifecycle hooks and piggybacking
  - Dataplane: Dendrite DPD switch programming via trait abstraction; DPD client used in tests
  - Sled agent: multicast-aware instance management; network interface configuration for multicast traffic; cross-version testing; OPTE stubs present
  - Tests: comprehensive integration suites under nexus/tests/integration_tests/multicast/

Components:
  - Database schema: external and underlay multicast groups; member/instance association tables; IP pool enhancements (pool_type, mvlan, switch_port_uplinks)
  - Control plane modules: multicast group management, member lifecycle, dataplane abstraction; RPW reconcilers to ensure convergence
  - API layer: endpoints and validation; default-VNI semantics when VPC not provided
  - Sled agent: OPTE stubs and compatibility shims for older agents

Workflows Implemented:
  1. Instance lifecycle integration:

     - "Create" -> resolve VPC/VNI (or default), validate source IPs, create memberships, enqueue group ensure RPW
     - "Start" -> program dataplane via ensure/update sagas; activate member flows after switch ack
     - "Stop" -> deactivate dataplane membership; retain DB membership for fast restart
     - "Delete" -> remove instance memberships; group deletion is explicit
     - "Migrate" -> deactivate on source sled; activate on target; idempotent with ordering guarantees
     - Restart/recovery -> RPWs reconcile desired state; compensations clean up partial programming

  2. RPW reconciliation:

     - ensure dataplane switches match database state
     - handle sled migrations and state transitions
     - Eventual consistency with retry logic

Migrations:
  - Apply schema changes in schema/crdb/multicast-support/up01.sql (and update dbinit.sql)
  - Bump schema versions accordingly

API/Compatibility:
  - OpenAPI updated: openapi/nexus.json, openapi/sled-agent/sled-agent-5.0.0-89f1f7.json
  - Regenerate clients where applicable

References:
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14

Follow-ups include:
  - OPTE integration
  - commtest extension
  - omdb commands are tracked in issues
  - pool and group stats
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request Sep 24, 2025
Introduce end-to-end multicast support across control plane and sled-agent, and integrate IP pool model extensions required
for supporting multicast workflows. This work enables project-scoped multicast groups with lifecycle-driven dataplane programming
and exposes an API for operating multicast groups over instances.

Highlights:
  - DB: new multicast_group tables; member lifecycle management; pool_type/mvlan/switch_port_uplinks
  - API: multicast group/member CRUD; source IP validation; VPC/project hierarchy integration with default VNI fallback
  - Control plane: RPW reconcilers for groups/members; sagas for dataplane updates atomically at the group level; instance lifecycle hooks and piggybacking
  - Dataplane: Dendrite DPD switch programming via trait abstraction; DPD client used in tests
  - Sled agent: multicast-aware instance management; network interface configuration for multicast traffic; cross-version testing; OPTE stubs present
  - Tests: comprehensive integration suites under nexus/tests/integration_tests/multicast/

Components:
  - Database schema: external and underlay multicast groups; member/instance association tables; IP pool enhancements (pool_type, mvlan, switch_port_uplinks)
  - Control plane modules: multicast group management, member lifecycle, dataplane abstraction; RPW reconcilers to ensure convergence
  - API layer: endpoints and validation; default-VNI semantics when VPC not provided
  - Sled agent: OPTE stubs and compatibility shims for older agents

Workflows Implemented:
  1. Instance lifecycle integration:

     - "Create" -> resolve VPC/VNI (or default), validate source IPs, create memberships, enqueue group ensure RPW
     - "Start" -> program dataplane via ensure/update sagas; activate member flows after switch ack
     - "Stop" -> deactivate dataplane membership; retain DB membership for fast restart
     - "Delete" -> remove instance memberships; group deletion is explicit
     - "Migrate" -> deactivate on source sled; activate on target; idempotent with ordering guarantees
     - Restart/recovery -> RPWs reconcile desired state; compensations clean up partial programming

  2. RPW reconciliation:

     - ensure dataplane switches match database state
     - handle sled migrations and state transitions
     - Eventual consistency with retry logic

Migrations:
  - Apply schema changes in schema/crdb/multicast-support/up01.sql (and update dbinit.sql)
  - Bump schema versions accordingly

API/Compatibility:
  - OpenAPI updated: openapi/nexus.json, openapi/sled-agent/sled-agent-5.0.0-89f1f7.json
  - Regenerate clients where applicable

References:
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14

Follow-ups include:
  - OPTE integration
  - commtest extension
  - omdb commands are tracked in issues
  - pool and group stats
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request Sep 24, 2025
Introduce end-to-end multicast support across control plane and sled-agent, and integrate IP pool model extensions required
for supporting multicast workflows. This work enables project-scoped multicast groups with lifecycle-driven dataplane programming
and exposes an API for operating multicast groups over instances.

Highlights:
  - DB: new multicast_group tables; member lifecycle management; pool_type/mvlan/switch_port_uplinks
  - API: multicast group/member CRUD; source IP validation; VPC/project hierarchy integration with default VNI fallback
  - Control plane: RPW reconcilers for groups/members; sagas for dataplane updates atomically at the group level; instance lifecycle hooks and piggybacking
  - Dataplane: Dendrite DPD switch programming via trait abstraction; DPD client used in tests
  - Sled agent: multicast-aware instance management; network interface configuration for multicast traffic; cross-version testing; OPTE stubs present
  - Tests: comprehensive integration suites under nexus/tests/integration_tests/multicast/

Components:
  - Database schema: external and underlay multicast groups; member/instance association tables; IP pool enhancements (pool_type, mvlan, switch_port_uplinks)
  - Control plane modules: multicast group management, member lifecycle, dataplane abstraction; RPW reconcilers to ensure convergence
  - API layer: endpoints and validation; default-VNI semantics when VPC not provided
  - Sled agent: OPTE stubs and compatibility shims for older agents

Workflows Implemented:
  1. Instance lifecycle integration:

     - "Create" -> resolve VPC/VNI (or default), validate source IPs, create memberships, enqueue group ensure RPW
     - "Start" -> program dataplane via ensure/update sagas; activate member flows after switch ack
     - "Stop" -> deactivate dataplane membership; retain DB membership for fast restart
     - "Delete" -> remove instance memberships; group deletion is explicit
     - "Migrate" -> deactivate on source sled; activate on target; idempotent with ordering guarantees
     - Restart/recovery -> RPWs reconcile desired state; compensations clean up partial programming

  2. RPW reconciliation:

     - Ensure dataplane switches match database state
     - Handle sled migrations and state transitions
     - Converge to eventual consistency with retry logic
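
The reconciliation step above can be sketched as a diff between desired state (the database) and actual state (the switch). This is an illustrative sketch only: `MemberId`, `Action`, `plan`, and `apply` are hypothetical names, and the real RPW reconcilers presumably track more state than a set of members.

```rust
use std::collections::BTreeSet;

/// Hypothetical identifier for a group member.
type MemberId = u32;

/// One change the reconciler would push to the switch.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Add(MemberId),
    Remove(MemberId),
}

/// Compute the actions that move the dataplane (`actual`) toward the
/// database (`desired`). Once the actions are applied, planning again
/// yields an empty list, which is what makes retries safe.
fn plan(desired: &BTreeSet<MemberId>, actual: &BTreeSet<MemberId>) -> Vec<Action> {
    let mut actions = Vec::new();
    // Members the database lists but the switch does not have yet.
    for id in desired.difference(actual) {
        actions.push(Action::Add(*id));
    }
    // Members programmed on the switch that the database no longer lists.
    for id in actual.difference(desired) {
        actions.push(Action::Remove(*id));
    }
    actions
}

/// Apply a plan to an in-memory stand-in for the switch state.
fn apply(actual: &mut BTreeSet<MemberId>, actions: &[Action]) {
    for action in actions {
        match action {
            Action::Add(id) => {
                actual.insert(*id);
            }
            Action::Remove(id) => {
                actual.remove(id);
            }
        }
    }
}
```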

Migrations:
  - Apply schema changes in schema/crdb/multicast-support/up01.sql (and update dbinit.sql)
  - Bump schema versions accordingly

API/Compatibility:
  - OpenAPI updated: openapi/nexus.json, openapi/sled-agent/sled-agent-5.0.0-89f1f7.json
  - Regenerate clients where applicable

References:
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14

Follow-ups include:
  - OPTE integration
  - commtest extension
  - omdb commands (tracked in issues)
  - pool and group stats
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request Sep 25, 2025
This work introduces multicast IP pool capabilities to support external
multicast traffic routing through the rack's switching infrastructure.

Includes:
  - Add IpPoolType enum (unicast/multicast) with unicast as default
  - Add multicast pool fields: switch_port_uplinks (UUID[]), mvlan (VLAN ID)
  - Add database migration (multicast-support/up01.sql) with new columns and indexes
  - Add ASM/SSM range validation for multicast pools to prevent mixing
  - Add pool type-aware resolution for IP allocation
  - Add custom deserializer for switch port uplinks with deduplication
  - Update external API params/views for multicast pool configuration
  - Add SSM constants (IPV4_SSM_SUBNET, IPV6_SSM_FLAG_FIELD) for validation
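
As a rough illustration of the ASM/SSM range check, the sketch below classifies an IPv4 pool range and rejects one that mixes the two. It assumes only the well-known IPv4 SSM block 232.0.0.0/8; `McastKind` and `classify_range` are hypothetical names, and the IPv6 flag-field check is not covered here.

```rust
use std::net::Ipv4Addr;

/// Whether a range is any-source (ASM) or source-specific (SSM) multicast.
#[derive(Debug, PartialEq, Eq)]
enum McastKind {
    Asm,
    Ssm,
}

// IPv4 SSM block 232.0.0.0/8, expressed as inclusive u32 bounds.
const SSM_FIRST: u32 = 0xE800_0000; // 232.0.0.0
const SSM_LAST: u32 = 0xE8FF_FFFF; // 232.255.255.255

/// Classify an inclusive IPv4 range, rejecting ranges that are not
/// multicast or that contain both ASM and SSM addresses.
fn classify_range(first: Ipv4Addr, last: Ipv4Addr) -> Result<McastKind, &'static str> {
    if !first.is_multicast() || !last.is_multicast() {
        return Err("range must lie within 224.0.0.0/4");
    }
    let (lo, hi) = (u32::from(first), u32::from(last));
    if lo > hi {
        return Err("range start must not exceed range end");
    }
    if lo >= SSM_FIRST && hi <= SSM_LAST {
        // Entirely inside the SSM block.
        Ok(McastKind::Ssm)
    } else if hi < SSM_FIRST || lo > SSM_LAST {
        // Entirely outside the SSM block.
        Ok(McastKind::Asm)
    } else {
        // Overlaps the SSM block without being contained in it.
        Err("range mixes ASM and SSM addresses")
    }
}
```

Comparing the numeric bounds, rather than only the two endpoints, also catches a range such as 231.0.0.0 to 233.0.0.0 whose endpoints are both ASM but which spans the SSM block.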

Database schema updates:
  - ip_pool table: pool_type, switch_port_uplinks, mvlan columns
  - Index on pool_type for efficient filtering
  - Migration preserves existing pools as unicast type by default

This provides the foundation for multicast group functionality while
maintaining full backward compatibility with existing unicast pools.
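
A minimal sketch of how a unicast default and uplink deduplication might look. The type and field names mirror the commit message, but the definitions are assumptions: `PoolParams` and `dedup_uplinks` are hypothetical, `u128` stands in for a UUID, and the order-preserving deduplication is a choice made for this example rather than confirmed behavior.

```rust
use std::collections::HashSet;

/// Pool type with unicast as the default, so pools that never set the
/// field keep their existing behavior.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum IpPoolType {
    #[default]
    Unicast,
    Multicast,
}

/// Hypothetical, simplified pool parameters.
#[derive(Debug, Default, PartialEq, Eq)]
struct PoolParams {
    pool_type: IpPoolType,
    /// `u128` stands in for a switch port UUID.
    switch_port_uplinks: Vec<u128>,
    /// Multicast VLAN ID, if any.
    mvlan: Option<u16>,
}

/// Drop repeated uplink IDs while keeping the first occurrence of each.
fn dedup_uplinks(uplinks: Vec<u128>) -> Vec<u128> {
    let mut seen = HashSet::new();
    // `insert` returns false for an ID that was already seen.
    uplinks.into_iter().filter(|id| seen.insert(*id)).collect()
}
```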

References (for review):
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request Sep 25, 2025
Introduce end-to-end multicast group support across control plane and sled-agent, integrated with IP pool extensions required
for supporting multicast workflows. This work enables project-scoped multicast groups with lifecycle-driven dataplane programming
and exposes an API for operating multicast groups over instances.

Highlights:
  - DB: new multicast_group tables; member lifecycle management
  - API: multicast group/member CRUD; source IP validation; VPC/project hierarchy integration with default VNI fallback
  - Control plane: RPW reconcilers for groups/members; sagas that apply dataplane updates atomically at the group level; instance lifecycle hooks and piggybacking
  - Dataplane: Dendrite DPD switch programming via trait abstraction; DPD client used in tests
  - Sled agent: multicast-aware instance management; network interface configuration for multicast traffic; cross-version testing; OPTE stubs present
  - Tests: comprehensive integration suites under nexus/tests/integration_tests/multicast/

Components:
  - Database schema: external and underlay multicast groups; member/instance association tables
  - Control plane modules: multicast group management, member lifecycle, dataplane abstraction; RPW reconcilers to ensure convergence
  - API layer: endpoints and validation; default-VNI semantics when VPC not provided
  - Sled agent: OPTE stubs and compatibility shims for older agents

Workflows Implemented:
  1. Instance lifecycle integration:

     - "Create" -> resolve VPC/VNI (or default), validate source IPs, create memberships, enqueue group ensure RPW
     - "Start" -> program dataplane via ensure/update sagas; activate member flows after switch ack
     - "Stop" -> deactivate dataplane membership; retain DB membership for fast restart
     - "Delete" -> remove instance memberships; group deletion is explicit
     - "Migrate" -> deactivate on source sled; activate on target; idempotent with ordering guarantees
     - Restart/recovery -> RPWs reconcile desired state; compensations clean up partial programming

  2. RPW reconciliation:

     - Ensure dataplane switches match database state
     - Handle sled migrations and state transitions
     - Converge to eventual consistency with retry logic
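
The instance lifecycle integration above can be read as a small state machine over a single membership. The sketch below is one possible reading, not the actual schema: `MemberState`, `Event`, and `step` are hypothetical, and sleds are represented by plain integers.

```rust
/// Hypothetical member states; the real schema may name or split these differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemberState {
    /// Membership exists in the database only; nothing is programmed on the switch.
    Inactive,
    /// Membership is programmed in the dataplane for the given sled.
    Active { sled: u32 },
    /// Membership has been removed.
    Deleted,
}

/// Instance lifecycle events that affect a membership.
enum Event {
    Start { sled: u32 },
    Stop,
    Migrate { target: u32 },
    Delete,
}

/// Pure transition function: the same event applied twice gives the same state.
fn step(state: MemberState, event: Event) -> MemberState {
    use MemberState::*;
    match (state, event) {
        // A deleted membership never comes back.
        (Deleted, _) => Deleted,
        (_, Event::Delete) => Deleted,
        // Stop deactivates the dataplane side but keeps the DB membership.
        (_, Event::Stop) => Inactive,
        // Start programs the dataplane for the sled hosting the instance.
        (_, Event::Start { sled }) => Active { sled },
        // Migrate moves an active membership from the source sled to the target.
        (Active { .. }, Event::Migrate { target }) => Active { sled: target },
        // Nothing is programmed, so there is nothing to move.
        (Inactive, Event::Migrate { .. }) => Inactive,
    }
}
```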

Migrations:
  - Apply schema changes in schema/crdb/multicast-support/up01.sql (and update dbinit.sql)
  - Bump schema versions accordingly

API/Compatibility:
  - OpenAPI updated: openapi/nexus.json, openapi/sled-agent/sled-agent-5.0.0-89f1f7.json
  - Regenerate clients where applicable

References:
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14

Follow-ups include:
  - OPTE integration
  - commtest extension
  - omdb commands (tracked in issues)
  - pool and group stats
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request Sep 25, 2025
Introduces end-to-end multicast group support across control plane and sled-agent, integrated with IP pool extensions required
for supporting multicast workflows. This work enables project-scoped multicast groups with lifecycle-driven dataplane programming
and exposes an API for operating multicast groups over instances.

Highlights:
  - DB: new multicast_group tables; member lifecycle management
  - API: multicast group/member CRUD; source IP validation; VPC/project hierarchy integration with default VNI fallback
  - Control plane: RPW reconcilers for groups/members; sagas that apply dataplane updates atomically at the group level; instance lifecycle hooks and piggybacking
  - Dataplane: Dendrite DPD switch programming via trait abstraction; DPD client used in tests
  - Sled agent: multicast-aware instance management; network interface configuration for multicast traffic; cross-version testing; OPTE stubs present
  - Tests: comprehensive integration suites under nexus/tests/integration_tests/multicast/

Components:
  - Database schema: external and underlay multicast groups; member/instance association tables
  - Control plane modules: multicast group management, member lifecycle, dataplane abstraction; RPW reconcilers to ensure convergence
  - API layer: endpoints and validation; default-VNI semantics when VPC not provided
  - Sled agent: OPTE stubs and compatibility shims for older agents

Workflows Implemented:
  1. Instance lifecycle integration:

     - "Create" -> resolve VPC/VNI (or default), validate source IPs, create memberships, enqueue group ensure RPW
     - "Start" -> program dataplane via ensure/update sagas; activate member flows after switch ack
     - "Stop" -> deactivate dataplane membership; retain DB membership for fast restart
     - "Delete" -> remove instance memberships; group deletion is explicit
     - "Migrate" -> deactivate on source sled; activate on target; idempotent with ordering guarantees
     - Restart/recovery -> RPWs reconcile desired state; compensations clean up partial programming

  2. RPW reconciliation:

     - Ensure dataplane switches match database state
     - Handle sled migrations and state transitions
     - Converge to eventual consistency with retry logic
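
The workflows mention sagas and compensations that clean up partial programming. The following is a generic sketch of that pattern under stated assumptions: `Step` and `run_saga` are hypothetical, the state is a plain list of programmed entries, and none of this reflects the saga framework Omicron actually uses.

```rust
/// A hypothetical saga step: a forward action plus the compensation that undoes it.
struct Step {
    name: &'static str,
    action: fn(&mut Vec<&'static str>) -> Result<(), String>,
    undo: fn(&mut Vec<&'static str>),
}

/// Run steps in order. If one fails, undo the completed steps in reverse
/// order so no partial programming is left behind.
fn run_saga(steps: &[Step], state: &mut Vec<&'static str>) -> Result<(), String> {
    for (i, step) in steps.iter().enumerate() {
        if let Err(e) = (step.action)(state) {
            for done in steps[..i].iter().rev() {
                (done.undo)(state);
            }
            return Err(format!("{} failed: {}", step.name, e));
        }
    }
    Ok(())
}

// Example actions and compensations operating on the stand-in state.
fn add_underlay(state: &mut Vec<&'static str>) -> Result<(), String> {
    state.push("underlay");
    Ok(())
}

fn remove_underlay(state: &mut Vec<&'static str>) {
    state.retain(|entry| *entry != "underlay");
}

fn add_external(state: &mut Vec<&'static str>) -> Result<(), String> {
    state.push("external");
    Ok(())
}

fn remove_external(state: &mut Vec<&'static str>) {
    state.retain(|entry| *entry != "external");
}

/// Simulates a switch that cannot be reached.
fn failing_step(_state: &mut Vec<&'static str>) -> Result<(), String> {
    Err("switch unreachable".to_string())
}

fn noop(_state: &mut Vec<&'static str>) {}
```

If the second step fails, the first step's compensation runs and the state ends up empty again, which is the group-level atomicity the commit message describes.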

Migrations:
  - Apply schema changes in schema/crdb/multicast-support/up01.sql (and update dbinit.sql)
  - Bump schema versions accordingly

API/Compatibility:
  - OpenAPI updated: openapi/nexus.json, openapi/sled-agent/sled-agent-5.0.0-89f1f7.json
  - Regenerate clients where applicable

References:
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - IP Pool extensions: #9084
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14

Follow-ups include:
  - OPTE integration
  - commtest extension
  - omdb commands (tracked in issues)
  - pool and group stats
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request Sep 25, 2025
Introduces end-to-end multicast group support across control plane and sled-agent, integrated with IP pool extensions required
for supporting multicast workflows. This work enables project-scoped multicast groups with lifecycle-driven dataplane programming
and exposes an API for operating multicast groups over instances.

Highlights:
  - DB: new multicast_group tables; member lifecycle management
  - API: multicast group/member CRUD; source IP validation; VPC/project hierarchy integration with default VNI fallback
  - Control plane: RPW reconcilers for groups/members; sagas that apply dataplane updates atomically at the group level; instance lifecycle hooks and piggybacking
  - Dataplane: Dendrite DPD switch programming via trait abstraction; DPD client used in tests
  - Sled agent: multicast-aware instance management; network interface configuration for multicast traffic; cross-version testing; OPTE stubs present
  - Tests: comprehensive integration suites under nexus/tests/integration_tests/multicast/

Components:
  - Database schema: external and underlay multicast groups; member/instance association tables
  - Control plane modules: multicast group management, member lifecycle, dataplane abstraction; RPW reconcilers to ensure convergence
  - API layer: endpoints and validation; default-VNI semantics when VPC not provided
  - Sled agent: OPTE stubs and compatibility shims for older agents

Workflows Implemented:
  1. Instance lifecycle integration:

     - "Create" -> resolve VPC/VNI (or default), validate source IPs, create memberships, enqueue group ensure RPW
     - "Start" -> program dataplane via ensure/update sagas; activate member flows after switch ack
     - "Stop" -> deactivate dataplane membership; retain DB membership for fast restart
     - "Delete" -> remove instance memberships; group deletion is explicit
     - "Migrate" -> deactivate on source sled; activate on target; idempotent with ordering guarantees
     - Restart/recovery -> RPWs reconcile desired state; compensations clean up partial programming

  2. RPW reconciliation:

     - Ensure dataplane switches match database state
     - Handle sled migrations and state transitions
     - Converge to eventual consistency with retry logic

Migrations:
  - Apply schema changes in schema/crdb/multicast-group-support/up01.sql (and update dbinit.sql)
  - Bump schema versions accordingly

API/Compatibility:
  - OpenAPI updated: openapi/nexus.json, openapi/sled-agent/sled-agent-5.0.0-89f1f7.json
  - Contains a version change (to v5) as InstanceEnsureBody has been modified to
    include multicast_groups associated with an instance in the
    underlying sled config
  - Regenerate clients where applicable
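
One way to picture the v5 change is an older request body being upgraded into the new shape with no multicast groups. This is a sketch under assumptions: the two structs below are hypothetical, heavily simplified stand-ins, and only the `multicast_groups` field name comes from the commit message.

```rust
/// Hypothetical, simplified pre-v5 request body: no multicast information.
#[derive(Debug, Clone, PartialEq, Eq)]
struct InstanceEnsureBodyV4 {
    instance_name: String,
}

/// Hypothetical, simplified v5 request body with the new field.
#[derive(Debug, Clone, PartialEq, Eq)]
struct InstanceEnsureBodyV5 {
    instance_name: String,
    /// Multicast groups associated with the instance (strings stand in for the real type).
    multicast_groups: Vec<String>,
}

/// An older body carries no groups, so it upgrades to an empty list.
impl From<InstanceEnsureBodyV4> for InstanceEnsureBodyV5 {
    fn from(old: InstanceEnsureBodyV4) -> Self {
        InstanceEnsureBodyV5 {
            instance_name: old.instance_name,
            multicast_groups: Vec::new(),
        }
    }
}
```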

References:
  - RFD 488: https://rfd.shared.oxide.computer/rfd/488
  - IP Pool extensions: #9084
  - Dendrite PRs (based on recency):
    * oxidecomputer/dendrite#132
    * oxidecomputer/dendrite#109
    * oxidecomputer/dendrite#14

Follow-ups include:
  - OPTE integration
  - commtest extension
  - omdb commands (tracked in issues)
  - pool and group stats
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request Sep 26, 2025