-
Notifications
You must be signed in to change notification settings - Fork 58
[sled-agent] Correctly install nat/router IGW tags during create. #7172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Instances could come up with inconsistently tagged Internet Gateway mappings between their `nat` and `router` OPTE port layers. This manifested in two ways. 1) All ports were deliberately created without IGW tags in the router layer iff. an existing router was found. The expectation was that the control plane would send a router update with the correct values and update them if the port belonged to an instance. However it never did so -- the version would not have increased, as the baseline router was already complete and accurate. 2) There are cases where a stopped and restarted instance may have an existing EIP<->IGW mapping that is not removed in time. Thus, no change is detected when Nexus informs sled-agent of its new NIC's state. We were not filling this information during creation under the assumption that Nexus's insert would always be accepted. In both cases, it should be stressed that we cannot just insert either state every minute via task -- this constitutes a port lock on OPTE (holding up all packets and possibly invalidating UFT entries). This PR closes both those holes by amending the first oversight and filling `nat` mpping information at create-time. Fixes #7165.
I've tested this fix using a local standalone-omicron on my Helios workstation. This makes for one sled-agent, which has made it very easy to trigger the co-residency requirement of #7165. Broadly, testing has revolevd around:
kyle@farme:~/gits/omicron$ pfexec opteadm dump-layer -p opte9 nat
Port opte9 - Layer nat
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Outbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Inbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
2 10 0 inner.ip.dst=10.1.222.13 "Stateful: 172.30.0.6 <=> (external)"
DEF -- 0 -- "allow"
Outbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
6 10 0 inner.ether.ether_type=IPv4 "Stateful: 172.30.0.6 <=> 10.1.222.13"
meta: router-target=ig=edb88f36-1e3d-4e96-b9e3-7787516de959
7 100 0 inner.ether.ether_type=IPv4 "Stateful: 10.1.222.11:16384-32767"
meta: router-target=ig=edb88f36-1e3d-4e96-b9e3-7787516de959
8 255 0 meta: router-target-class=ig "Deny"
DEF -- 0 -- "allow"
kyle@farme:~/gits/omicron$ pfexec opteadm dump-layer -p opte9 router
Port opte9 - Layer router
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Outbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Inbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
DEF -- 0 -- "allow"
Outbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
0 31 0 inner.ip.dst=172.30.0.0/22 "Meta: Target = Subnet: 172.30.0.0/22"
1 75 0 inner.ip.dst=0.0.0.0/0 "Meta: Target = IG(Some(edb88f36-1e3d-4e96-b9e3-7787516de959))"
3 139 0 inner.ip6.dst=fdc3:ffbe:6e23::/64 "Meta: Target = Subnet: fdc3:ffbe:6e23::/64"
2 267 0 inner.ip6.dst=::/0 "Meta: Target = IG(Some(edb88f36-1e3d-4e96-b9e3-7787516de959))"
DEF -- 0 -- "deny" Prior to the fix, all IG routing entries would install a metadata of The second bug resulted in entries instead missing from the NAT table, long after the instance had started: kyle@farme:~/gits/omicron$ pfexec opteadm dump-layer -p opte9 nat
Port opte9 - Layer nat
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Outbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Inbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
1 10 0 inner.ip.dst=10.1.222.13 "Stateful: 172.30.0.6 <=> (external)"
DEF -- 0 -- "allow"
Outbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
3 10 0 inner.ether.ether_type=IPv4 "Stateful: 172.30.0.6 <=> 10.1.222.13"
meta: router-target=ig
4 100 0 inner.ether.ether_type=IPv4 "Stateful: 10.1.222.11:16384-32767"
meta: router-target=ig
5 255 0 meta: router-target-class=ig "Deny"
DEF -- 0 -- "allow" We expect |
Is this also true for reading OPTE state? If we can read without a data-plane lock, we could do a ground-truth reconciliation in nexus at regular intervals. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Kyle. Changes LGTM
Thanks for the review Ry.
Most port-specific reads currently require taking the port lock today (e.g., reading installed router rules). There are a few things we keep copied outside of the port that we can read back locklessly via |
Instances could come up with inconsistently tagged Internet Gateway mappings between their
nat
androuter
OPTE port layers. This manifested in two ways.layer iff. an existing router was found. The expectation was that the
control plane would send a router update with the correct values and
update them if the port belonged to an instance. However it never did
so -- the version would not have increased, as the baseline router
was already complete and accurate.
existing EIP<->IGW mapping that is not removed in time. Thus, no change
is detected when Nexus informs sled-agent of its new NIC's state. We
were not filling this information during creation under the
assumption that Nexus's insert would always be accepted.
In both cases, it should be stressed that we cannot just insert either state every minute via task -- this constitutes a port lock on OPTE (holding up all packets and possibly invalidating UFT entries).
This PR closes both those holes by amending the first oversight and filling
nat
mpping information at create-time.Fixes #7165.