The Oct. 4 outage that took Facebook, WhatsApp and Instagram down for six hours is indicative of the issues we are facing in modern network architecture. While there has been a lot of chatter about misconfiguration and DNS failures, the reality could have more to do with Facebook’s software-defined networks and the propensity for errors when the control plane is centralized and segregated from the network devices.
Facebook was an early adopter of SDN and pioneered methods to automate provisioning of its massive networks. Najam Ahmad, the director of technical operations who architected SDN for Facebook, said in 2014, “We want to deploy, manage, monitor and fix networks using software.” One of the company’s widely publicized use cases is software-defined backbone routing. Facebook uses SDN to augment BGP route selection with congestion and capacity analytics, compensating for the fact that BGP by itself does not take link load or capacity into account. The SDN controllers produce routes that are then sent to the edge routers, which advertise them to the outside world.
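To make the idea of capacity-aware route selection concrete, here is a minimal Python sketch. It is not Facebook's implementation; the `Candidate` fields, the `select_route` function and the 80% utilization threshold are all assumptions chosen for illustration. The sketch keeps a BGP-like preference for the shortest AS path but skips any egress that would be pushed past the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One possible egress for a prefix (all fields are illustrative)."""
    egress: str
    as_path_len: int
    capacity_gbps: float
    load_gbps: float


def projected_utilization(candidate, demand_gbps):
    """Utilization of the egress if the new demand were placed on it."""
    return (candidate.load_gbps + demand_gbps) / candidate.capacity_gbps


def select_route(candidates, demand_gbps, max_utilization=0.8):
    """Pick the shortest-AS-path egress that still has headroom.

    If every egress would exceed the threshold, fall back to the one
    with the lowest projected utilization.
    """
    if not candidates:
        raise ValueError("no candidate routes")
    # Plain BGP-style ordering: shortest AS path first.
    ranked = sorted(candidates, key=lambda c: c.as_path_len)
    for candidate in ranked:
        if projected_utilization(candidate, demand_gbps) <= max_utilization:
            return candidate
    return min(ranked, key=lambda c: projected_utilization(c, demand_gbps))
```

In this sketch a controller would call `select_route` per prefix and push the result to the edge routers; a real system would also have to account for policy, failure domains and route stability.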
Public speculation holds that this SDN-controlled BGP route selection encountered a glitch that made the Facebook edge routers unreachable. The situation was made worse by the fact that DNS is hosted internally and went into hibernation mode in response to the lost connections. The blog post by Facebook’s Santosh Janardhan further supports this hypothesis: “During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally.”
This type of disconnect between the control plane and the data plane also occurs in other SDN environments, including SD-WAN. Operators of SD-WAN deployments have seen episodes in which the controller issues a reroute command to an edge device and the device ignores it. There have also been cases in which SD-WAN controllers issue reroutes unnecessarily. It is becoming vital for the network operations center to monitor both the state and the actions of the controllers to ensure that software-defined changes are applied as intended.
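One way to picture this kind of monitoring is an audit that compares what the controller commanded with what the devices are actually doing. The Python sketch below is illustrative only: `RerouteCommand`, `audit_reroutes` and the `old_path_healthy` flag are hypothetical names, and the observed next hops are assumed to come from whatever telemetry the NOC already collects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerouteCommand:
    """A reroute issued by the controller (fields are illustrative)."""
    device: str
    prefix: str
    new_next_hop: str
    # Assumption: the monitor knows whether the old path was still
    # meeting its targets when the controller decided to reroute.
    old_path_healthy: bool


def audit_reroutes(commands, observed_next_hops):
    """Compare controller intent with device state.

    observed_next_hops maps (device, prefix) to the next hop actually
    seen on the device. Returns two lists: commands the device did not
    apply, and commands issued although the old path was healthy.
    """
    ignored = []
    unnecessary = []
    for cmd in commands:
        actual = observed_next_hops.get((cmd.device, cmd.prefix))
        if actual != cmd.new_next_hop:
            ignored.append(cmd)
        if cmd.old_path_healthy:
            unnecessary.append(cmd)
    return ignored, unnecessary
```

Both lists are worth alerting on: the first points to a device that is out of step with the controller, the second to a controller that may be acting on bad data.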
The important lesson here is that as we make networks more intelligent by centralizing the control plane, or more efficient through automation, we need to make sure that suitable network observability is in place. SDN and automation call for new NetOps capabilities, such as operational workflows that validate commands against network intelligence: discovered topologies, historical performance and flow data. Operators can also run preemptive active tests that simulate the traffic to and from the edge routers. SDN has delivered many benefits and promises more, but it has also introduced risks, so SDN-aware network observability is needed to provide checks and balances. Modern network observability monitors the network as well as the brain of the network.
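As a rough illustration of validating a command against a discovered topology, the Python sketch below checks whether disabling a set of links would leave any critical site cut off. The function names, the link representation and the notion of "critical sites" are assumptions made for this example; a production workflow would also weigh capacity and historical flow data, which this sketch ignores.

```python
from collections import deque


def normalize(link):
    """Treat links as undirected by ordering their endpoints."""
    a, b = link
    return (a, b) if a <= b else (b, a)


def reachable(links, start):
    """Return every node reachable from start over the given links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def validate_command(links, links_to_disable, critical_sites):
    """Simulate a change before it is pushed to the network.

    Returns (ok, stranded), where stranded is the set of critical
    sites that could no longer reach the first critical site.
    """
    sites = sorted(critical_sites)
    if not sites:
        return True, set()
    disabled = {normalize(link) for link in links_to_disable}
    remaining = [l for l in map(normalize, links) if l not in disabled]
    stranded = set(sites) - reachable(remaining, sites[0])
    return not stranded, stranded
```

A workflow built on this idea would reject, or at least escalate, any command for which `validate_command` reports stranded sites, before the command ever reaches a router.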