Skip to content

Conversation

FelixMcFelix
Copy link
Contributor

Instances could come up with inconsistently tagged Internet Gateway mappings between their nat and router OPTE port layers. This manifested in two ways.

  1. All ports were deliberately created without IGW tags in the router
    layer iff. an existing router was found. The expectation was that the
    control plane would send a router update with the correct values and
    update them if the port belonged to an instance. However it never did
    so -- the version would not have increased, as the baseline router
    was already complete and accurate.
  2. There are cases where a stopped and restarted instance may have an
    existing EIP<->IGW mapping that is not removed in time. Thus, no change
    is detected when Nexus informs sled-agent of its new NIC's state. We
    were not filling this information during creation under the
    assumption that Nexus's insert would always be accepted.

In both cases, it should be stressed that we cannot just insert either state every minute via task -- this constitutes a port lock on OPTE (holding up all packets and possibly invalidating UFT entries).

This PR closes both those holes by amending the first oversight and filling nat mpping information at create-time.

Fixes #7165.

Instances could come up with inconsistently tagged Internet Gateway
mappings between their `nat` and `router` OPTE port layers. This
manifested in two ways.

1) All ports were deliberately created without IGW tags in the router
   layer iff. an existing router was found. The expectation was that the
   control plane would send a router update with the correct values and
   update them if the port belonged to an instance. However it never did
   so -- the version would not have increased, as the baseline router
   was already complete and accurate.
2) There are cases where a stopped and restarted instance may have an
   existing EIP<->IGW mapping that is not removed in time. Thus, no change
   is detected when Nexus informs sled-agent of its new NIC's state. We
   were not filling this information during creation under the
   assumption that Nexus's insert would always be accepted.

In both cases, it should be stressed that we cannot just insert either
state every minute via task -- this constitutes a port lock on OPTE (holding
up all packets and possibly invalidating UFT entries).

This PR closes both those holes by amending the first oversight and
filling `nat` mpping information at create-time.

Fixes #7165.
@FelixMcFelix
Copy link
Contributor Author

I've tested this fix using a local standalone-omicron on my Helios workstation. This makes for one sled-agent, which has made it very easy to trigger the co-residency requirement of #7165.

Broadly, testing has revolevd around:

  • Creating a project with two instances, started during creation.
  • Verifying that NAT bindings are correctly installed (pfexec opteadm dump-layer -p opte9 nat), and that Internet Gateway router entries all have the correct tag (pfexec opteadm dump-layer -p opte9 router).
kyle@farme:~/gits/omicron$ pfexec opteadm dump-layer -p opte9 nat
Port opte9 - Layer nat
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO  SRC IP  SPORT  DST IP  DPORT  HITS  ACTION

Outbound Flows
----------------------------------------------------------------------
PROTO  SRC IP  SPORT  DST IP  DPORT  HITS  ACTION

Inbound Rules
----------------------------------------------------------------------
ID   PRI  HITS  PREDICATES                ACTION
2    10   0     inner.ip.dst=10.1.222.13  "Stateful: 172.30.0.6 <=> (external)"
DEF  --   0     --                        "allow"

Outbound Rules
----------------------------------------------------------------------
ID   PRI  HITS  PREDICATES                                                   ACTION
6    10   0     inner.ether.ether_type=IPv4                                  "Stateful: 172.30.0.6 <=> 10.1.222.13"
                meta: router-target=ig=edb88f36-1e3d-4e96-b9e3-7787516de959

7    100  0     inner.ether.ether_type=IPv4                                  "Stateful: 10.1.222.11:16384-32767"
                meta: router-target=ig=edb88f36-1e3d-4e96-b9e3-7787516de959

8    255  0     meta: router-target-class=ig                                 "Deny"
DEF  --   0     --                                                           "allow"

kyle@farme:~/gits/omicron$ pfexec opteadm dump-layer -p opte9 router
Port opte9 - Layer router
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO  SRC IP  SPORT  DST IP  DPORT  HITS  ACTION

Outbound Flows
----------------------------------------------------------------------
PROTO  SRC IP  SPORT  DST IP  DPORT  HITS  ACTION

Inbound Rules
----------------------------------------------------------------------
ID   PRI  HITS  PREDICATES  ACTION
DEF  --   0     --          "allow"

Outbound Rules
----------------------------------------------------------------------
ID   PRI  HITS  PREDICATES                         ACTION
0    31   0     inner.ip.dst=172.30.0.0/22         "Meta: Target = Subnet: 172.30.0.0/22"
1    75   0     inner.ip.dst=0.0.0.0/0             "Meta: Target = IG(Some(edb88f36-1e3d-4e96-b9e3-7787516de959))"
3    139  0     inner.ip6.dst=fdc3:ffbe:6e23::/64  "Meta: Target = Subnet: fdc3:ffbe:6e23::/64"
2    267  0     inner.ip6.dst=::/0                 "Meta: Target = IG(Some(edb88f36-1e3d-4e96-b9e3-7787516de959))"
DEF  --   0     --                                 "deny"

Prior to the fix, all IG routing entries would install a metadata of IG(None) -- by simulating a single sled, we're guaranteed to have up-to-date routes for use during instance creation (triggering the first bug). The cause is as described above -- the tags were being stripped for all router entries on a valid match, rather than only on service/probe NICs.

The second bug resulted in entries instead missing from the NAT table, long after the instance had started:

kyle@farme:~/gits/omicron$ pfexec opteadm dump-layer -p opte9 nat
Port opte9 - Layer nat
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO  SRC IP  SPORT  DST IP  DPORT  HITS  ACTION

Outbound Flows
----------------------------------------------------------------------
PROTO  SRC IP  SPORT  DST IP  DPORT  HITS  ACTION

Inbound Rules
----------------------------------------------------------------------
ID   PRI  HITS  PREDICATES                ACTION
1    10   0     inner.ip.dst=10.1.222.13  "Stateful: 172.30.0.6 <=> (external)"
DEF  --   0     --                        "allow"

Outbound Rules
----------------------------------------------------------------------
ID   PRI  HITS  PREDICATES                    ACTION
3    10   0     inner.ether.ether_type=IPv4   "Stateful: 172.30.0.6 <=> 10.1.222.13"
                meta: router-target=ig

4    100  0     inner.ether.ether_type=IPv4   "Stateful: 10.1.222.11:16384-32767"
                meta: router-target=ig

5    255  0     meta: router-target-class=ig  "Deny"
DEF  --   0     --                            "allow"

We expect meta: router-target=ig=edb88..., but instead have an untagged match. The cause in this case is, again, that sled-agent already has these mappings but was not installing them at create time at all. This led to later updates being ignored for this NIC -- no delta was reported! The trigger in this case was simply to start and stop an instance with high frequency -- we can now see via opteadm that existing mappings are now correctly used straight after an instance is stopped and restarted on the same sled.

@FelixMcFelix FelixMcFelix added this to the 12 milestone Nov 27, 2024
@FelixMcFelix FelixMcFelix self-assigned this Nov 27, 2024
@rcgoodfellow
Copy link
Contributor

In both cases, it should be stressed that we cannot just insert either state every minute via task -- this constitutes a port lock on OPTE (holding up all packets and possibly invalidating UFT entries).

Is this also true for reading OPTE state? If we can read without a data-plane lock, we could do a ground-truth reconciliation in nexus at regular intervals.

Copy link
Contributor

@rcgoodfellow rcgoodfellow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kyle. Changes LGTM

@FelixMcFelix
Copy link
Contributor Author

FelixMcFelix commented Nov 27, 2024

Thanks for the review Ry.

In both cases, it should be stressed that we cannot just insert either state every minute via task -- this constitutes a port lock on OPTE (holding up all packets and possibly invalidating UFT entries).

Is this also true for reading OPTE state? If we can read without a data-plane lock, we could do a ground-truth reconciliation in nexus at regular intervals.

Most port-specific reads currently require taking the port lock today (e.g., reading installed router rules). There are a few things we keep copied outside of the port that we can read back locklessly via OpteCmd::ListPorts (ditto for anything V2P/V2B-related -- those live in an Arc) -- the EIP<->IGW map isn't included here but it likely could be!

@FelixMcFelix FelixMcFelix merged commit 2a9f164 into main Nov 27, 2024
16 checks passed
@FelixMcFelix FelixMcFelix deleted the felixmcfelix/missing-igw-tags branch November 27, 2024 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Instances can start with inconsistent routing/ExtIP state
2 participants