A cascade of mistakes made during maintenance on Facebook’s network caused the outage that took its services offline Monday, the company said in a blog post published Tuesday.
Facebook’s family of apps, which includes Instagram, WhatsApp and Messenger, was offline for more than five hours as employees scrambled to repair the damage. More than 3.5 billion people around the world use Facebook’s services.
The initial problem occurred in a network Facebook calls its “backbone,” which connects its data centers around the world, Santosh Janardhan, a vice president of infrastructure at Facebook, wrote in the blog post.
During maintenance of the network, a command was issued to assess how much capacity was available. But the command backfired, disconnecting the network and blocking Facebook’s data centers from communicating, Janardhan said. An audit tool designed to catch mistaken commands failed to detect the error, he added.
But that was just the beginning of the problems. “This change caused a complete disconnection of our server connections between our data centers and the internet,” Janardhan wrote. “And that total loss of connection caused a second issue that made things worse.”
With Facebook’s data centers offline, the company’s servers that manage its internet addresses were also unavailable. “This made it impossible for the rest of the internet to find our servers,” Janardhan said.
As the scope of the outage became clear, Facebook engineers struggled to restore access because the company’s data centers are heavily protected and employees could not gain immediate entry, the company said.
“We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity but an error of our own making,” Janardhan wrote.
Once the engineers were inside Facebook’s data centers and began to work, they were able to restore the network. But they had to bring servers back online gradually so as not to overwhelm the system, Janardhan said.
The company planned to study how the outage occurred and to create drills that would allow employees to practice fixing Facebook’s systems more quickly, he added.