Google 1014 Poutievski Introducing - Fail Open - Mode PDF

This document proposes a "fail open" approach to handling partial failures of the control plane network (CPN) in SDNs. It assumes the data plane is not affected by control plane failures. When the controller loses connectivity to a switch, it keeps the switch in its last programmed state rather than treating it as down. This avoids massive network disruption from rerouting traffic each time the CPN connection is lost or regained. The controller uses signals from neighboring switches and end-to-end data to distinguish disconnected from down nodes. The goal is to minimize network disruption during and after CPN failures.

Fail Open: A Way to Handle CPN Failures


Leon Poutievski

10/16/2014
Introduction
With SDN, the controller runs remotely from the switches. Many benefits:
● Complex routing moves to the powerful server
● Simpler switch
● Simpler upgrade
● Easier to introduce new features

[Figure: Out-of-band CPN]

A separate out-of-band control plane network (CPN) makes bootstrapping and troubleshooting simpler.
Problem: Partial CPN Failure
The controller loses connectivity to a switch.

If the controller treats a CPN disconnect as switch down:
● Controller reroutes traffic away from the node
● Massive churn on CPN disconnect
● Massive churn on CPN re-connect
resulting in congestion and traffic loss.
Controller Signals
How can the controller distinguish a disconnected node from a down node?
● Peer switches know about adjacent links
○ If neighboring switches report that ports to the disconnected node are down, then the node is likely down
● Controller can inject probes to determine whether paths through the CPN-disconnected node are still up
● End-to-end data might be available, e.g. hosts may see drops
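
The three signals above can be combined into a simple classifier. The following is a minimal sketch, not the controller's actual logic: the `Signals` fields, the `Verdict` names, and the rule that conflicting evidence yields `UNKNOWN` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    LIKELY_DOWN = "likely down"
    LIKELY_DISCONNECTED = "likely disconnected, data plane still up"
    UNKNOWN = "unknown"


@dataclass
class Signals:
    """Evidence about a switch the controller cannot reach (hypothetical shape)."""
    peer_ports_up: int                       # neighbor-reported ports to the node that are up
    peer_ports_down: int                     # neighbor-reported ports to the node that are down
    probe_ok: Optional[bool] = None          # injected probe crossed the node? None = not run
    host_sees_drops: Optional[bool] = None   # end-to-end signal; None = unavailable


def classify(s: Signals) -> Verdict:
    # Peers report every adjacent port down: the node is likely down.
    if s.peer_ports_up == 0 and s.peer_ports_down > 0:
        return Verdict.LIKELY_DOWN
    # A probe that fails to cross the node is negative data plane evidence.
    if s.probe_ok is False:
        return Verdict.LIKELY_DOWN
    # Positive evidence with no contradicting host drops: only the CPN is affected.
    positive = s.peer_ports_up > 0 or s.probe_ok is True
    if positive and s.host_sees_drops is not True:
        return Verdict.LIKELY_DISCONNECTED
    # Assumption: conflicting or missing evidence is left undecided.
    return Verdict.UNKNOWN
```

For instance, `classify(Signals(peer_ports_up=4, peer_ports_down=0))` returns `Verdict.LIKELY_DISCONNECTED`, while the same node with all four peer ports reported down returns `Verdict.LIKELY_DOWN`.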
Assumptions
Frequent switch updates are not required
● E.g. proactive approach: switch is updated based on topology changes, route advertisements, traffic changes
● Example: WAN

Switches can automatically react to local failures, e.g. prune down ports.
Fail-Open

Goal: Minimize the network disruption during and after CPN failures

Fail-Open - Optimistic Policy

Assume the data plane is not affected by control plane failures.

The controller knows the last programmed state and assumes that the node is “frozen” at that state.
Fail-Open Reaction: At Transit

Keep using the confirmed flows on failed-open nodes.

New routing solutions can use the confirmed flows on failed-open nodes.

[Figure: Encap, Transit, Decap nodes; CPN]
Fail-Open Reaction: Destination

Keep using the confirmed flows on failed-open nodes.

New routing solutions can use the confirmed flows on failed-open nodes.

[Figure: Encap, Transit, Decap nodes; CPN]
Fail-Open Reaction: At Source

Keep using the confirmed flows on failed-open nodes.

Since the source cannot be updated, the controller needs to maintain the existing tunnels.

[Figure: Encap, Transit, Decap nodes; CPN]
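
The reactions at transit, destination, and source can be sketched together. This is a minimal illustration under stated assumptions: the `Tunnel` type, the `confirmed` bookkeeping (tunnel names whose flows each switch has acknowledged), and both helper functions are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Tunnel:
    name: str
    path: Tuple[str, ...]  # (encap/source, transit..., decap/destination)


# Hypothetical bookkeeping: tunnel names whose flows each switch has confirmed.
Confirmed = Dict[str, Set[str]]


def usable_tunnels(tunnels: List[Tunnel], failed_open: Set[str],
                   confirmed: Confirmed) -> List[str]:
    """Tunnels a new routing solution may use without programming a failed-open node."""
    usable = []
    for t in tunnels:
        frozen_hops = [n for n in t.path if n in failed_open]
        # Every failed-open hop must already hold a confirmed flow for this tunnel.
        if all(t.name in confirmed.get(n, set()) for n in frozen_hops):
            usable.append(t.name)
    return usable


def tunnels_to_maintain(tunnels: List[Tunnel], failed_open: Set[str],
                        confirmed: Confirmed) -> List[str]:
    """Existing tunnels whose source is failed-open, so traffic cannot be moved off them."""
    return [t.name for t in tunnels
            if t.path[0] in failed_open and t.name in confirmed.get(t.path[0], set())]


t1 = Tunnel("t1", ("A", "B", "C"))   # already programmed end to end
t2 = Tunnel("t2", ("A", "D", "C"))   # candidate, not yet programmed anywhere
confirmed = {"A": {"t1"}, "B": {"t1"}, "C": {"t1"}}

print(usable_tunnels([t1, t2], {"B"}, confirmed))       # ['t1', 't2']
print(usable_tunnels([t1, t2], {"D"}, confirmed))       # ['t1']
print(tunnels_to_maintain([t1, t2], {"A"}, confirmed))  # ['t1']
```

With transit node `B` failed open, `t1` stays usable because its flow on `B` is confirmed, and `t2` avoids `B` entirely. With source `A` failed open, `t1` has to be kept alive on the remaining nodes because `A` will keep sending traffic into it.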
Detection and Recovery
Up → Fail-Open
● No control plane connection:
○ OpenFlow connection lost
○ No response to commands
● Peers report that data plane links to the node are still up
Fail-Open → Up
● Control connectivity has been restored
Fail-Open → Down
● Fail-Open for a long period of time
● Negative data plane signals from peers

[Figure: state diagram with Up, Fail-Open, and Down states]
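
The transitions above can be written as a small state machine. This is a sketch under assumptions: the timeout value `FAIL_OPEN_TIMEOUT_S` is invented, and two cases the slides do not cover are filled in and marked as such in the comments (an Up node that loses the CPN while peers report its links down, and any transition out of Down).

```python
from enum import Enum


class NodeState(Enum):
    UP = "up"
    FAIL_OPEN = "fail-open"
    DOWN = "down"


# Hypothetical value for "a long period of time"; the slides give no number.
FAIL_OPEN_TIMEOUT_S = 600.0


def next_state(state: NodeState, cpn_connected: bool, peers_report_links_up: bool,
               seconds_in_fail_open: float = 0.0) -> NodeState:
    if state is NodeState.UP:
        if cpn_connected:
            return NodeState.UP
        if peers_report_links_up:
            # Up -> Fail-Open: control connection lost, data plane links still up.
            return NodeState.FAIL_OPEN
        # Assumption: not covered by the slides; treated here as a real failure.
        return NodeState.DOWN
    if state is NodeState.FAIL_OPEN:
        if cpn_connected:
            # Fail-Open -> Up: control connectivity has been restored.
            return NodeState.UP
        if not peers_report_links_up or seconds_in_fail_open >= FAIL_OPEN_TIMEOUT_S:
            # Fail-Open -> Down: negative peer signals, or fail-open for too long.
            return NodeState.DOWN
        return NodeState.FAIL_OPEN
    # Assumption: the slides describe no transition out of Down, so it is left unchanged.
    return state
```

A controller loop would call `next_state` for each switch whenever a CPN or peer signal changes, and on a timer so that the fail-open timeout can fire.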
Massive CPN Failures
● Quick state transitions can be harmful
● Example:
○ 25% of nodes considered down, then
○ 50%, then
○ 75%, then
○ 100%
○ The network will be left at 25% capacity

● Coalescing helps
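
One possible reading of this example, sketched below, is that the controller reroutes onto the still-reachable nodes after each wave, and once every node is unreachable nothing can be reprogrammed, so the data plane stays on the last 25%. The model, the function `surviving_fraction`, and the equating of node count with capacity are assumptions for illustration, not details from the slides.

```python
from typing import List, Set


def surviving_fraction(waves: List[Set[int]], total: int, coalesce: bool) -> float:
    """Fraction of nodes still carrying traffic after all disconnect waves are processed."""
    all_nodes = set(range(total))
    # Coalescing merges every wave in the window into one batch before reacting.
    batches = [set().union(*waves)] if coalesce else waves
    carrying = set(all_nodes)   # nodes the last programmed state sends traffic through
    lost: Set[int] = set()      # nodes the controller can no longer reach
    for batch in batches:
        lost |= batch
        reachable = all_nodes - lost
        if reachable:
            # Lost nodes are treated as down; traffic is rerouted onto the rest.
            carrying = reachable
        # Otherwise nothing can be reprogrammed and the data plane stays as it is.
    return len(carrying) / total


# Four waves, each disconnecting another 25% of a 100-node network.
waves = [set(range(0, 25)), set(range(25, 50)), set(range(50, 75)), set(range(75, 100))]
print(surviving_fraction(waves, 100, coalesce=False))  # 0.25
print(surviving_fraction(waves, 100, coalesce=True))   # 1.0
```

In this model, reacting to each wave leaves traffic on a quarter of the nodes, whereas a coalescing window wide enough to cover all four waves lets the controller see one total CPN outage and leave the working data plane untouched.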
Thank You!
