Worldwide Microsoft cloud-service outage traced to rapid BGP router updates


Outages that left Microsoft Azure and multiple Microsoft cloud services widely unavailable for 90 minutes on Jan. 25 can be traced to the cascading effects of repeated, rapid readvertisements of BGP router prefixes, according to a ThousandEyes analysis of the incident. The Cisco-owned network intelligence company traced the Microsoft outage to an external BGP change by Microsoft that affected service providers. (Read more about network and infrastructure outages in our top 10 outages of 2022 wrap-up.)

Several Microsoft BGP prefixes were withdrawn completely and then almost immediately readvertised, ThousandEyes said.
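The pattern ThousandEyes describes, full withdrawals followed almost immediately by readvertisements, is the kind of churn that can be spotted by counting how often a prefix alternates between withdraw and announce events. The sketch below is a minimal illustration of that idea, not ThousandEyes' tooling; the event format, the example prefix, and the flap threshold are all assumptions made for demonstration.

```python
# Minimal sketch (not ThousandEyes' tooling): counting withdraw/announce churn
# per prefix from a hypothetical feed of BGP UPDATE events.
from collections import defaultdict

def churn_by_prefix(events, window_s=300):
    """events: iterable of (timestamp, prefix, action) where action is
    'withdraw' or 'announce'. Returns prefixes whose update pattern suggests
    flapping inside any window of `window_s` seconds."""
    per_prefix = defaultdict(list)
    for ts, prefix, action in events:
        per_prefix[prefix].append((ts, action))

    flapping = {}
    for prefix, updates in per_prefix.items():
        updates.sort()
        times = [t for t, _ in updates]
        # Slide a window and count transitions between withdraw and announce.
        for i, start in enumerate(times):
            j = i
            while j < len(times) and times[j] - start <= window_s:
                j += 1
            transitions = sum(
                1 for k in range(i + 1, j)
                if updates[k][1] != updates[k - 1][1]
            )
            if transitions >= 4:  # arbitrary threshold for "rapid readvertisement"
                flapping[prefix] = max(flapping.get(prefix, 0), transitions)
    return flapping

# Toy example: one prefix withdrawn and readvertised repeatedly within 90 seconds.
sample = [
    (0, "203.0.113.0/24", "withdraw"), (20, "203.0.113.0/24", "announce"),
    (40, "203.0.113.0/24", "withdraw"), (60, "203.0.113.0/24", "announce"),
    (80, "203.0.113.0/24", "withdraw"), (90, "203.0.113.0/24", "announce"),
]
print(churn_by_prefix(sample))  # {'203.0.113.0/24': 5}
```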

Border Gateway Protocol (BGP) tells internet traffic which route to take, and the BGP best-path selection algorithm determines the optimal paths to use for traffic forwarding.

The withdrawal of BGP routes prior to the outage appeared to affect mainly direct peers, ThousandEyes said. With a direct path unavailable during the withdrawal periods, the next best available path would have been through a transit provider. Once direct paths were readvertised, the BGP best-path selection algorithm would have selected the shorter direct path, leading to a reversion to the original route. These readvertisements repeated several times, causing considerable route-table instability.

"This was changing rapidly, causing a lot of churn in the global internet routing tables," said Kemal Sanjta, principal internet analyst at ThousandEyes, in a webcast analysis of the Microsoft outage. "As a result, we can see that a lot of routers were executing the best-path selection algorithm, which is not really an inexpensive operation from a power-consumption perspective."

More importantly, the routing changes caused significant packet loss, leaving customers unable to reach Microsoft Teams, Outlook, SharePoint, and other applications.

"Microsoft was erratically switching between transit providers before installing a best path, and then it was repeating the same thing again, and that's never good for the customer experience," Sanjta said.

In addition to the rapid changes in traffic paths, there was a massive shift of traffic through transit-provider networks that was difficult for those providers to absorb, which explains the levels of packet loss that ThousandEyes documented.

"Given the popularity of Microsoft services such as SharePoint, Teams and other services that were impacted as part of this event, they were likely receiving pretty large amounts of traffic when the traffic was diverted to them," Sanjta said.
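The fallback-and-revert behavior described above can be pictured with a toy model of best-path selection. The sketch below uses AS-path length as a stand-in for BGP's full, multi-step decision process; the route entries, AS numbers, and flap sequence are illustrative assumptions, not data from the incident.

```python
# Toy model of the fallback-and-revert behavior: a shorter AS path (direct peer)
# wins best-path selection whenever it is advertised; the transit path is used
# only while the direct route is withdrawn. Values are illustrative only.

def best_path(routes):
    """Pick the route with the shortest AS path (a simplification of BGP's
    multi-step best-path selection)."""
    return min(routes, key=lambda r: len(r["as_path"])) if routes else None

direct = {"via": "direct peer", "as_path": [8075]}               # one hop to the origin AS
transit = {"via": "transit provider", "as_path": [3356, 8075]}   # longer path via transit

# Each step mimics one withdraw/readvertise cycle of the direct route.
events = ["announced", "withdrawn", "announced", "withdrawn", "announced"]
table = [transit]  # the transit path stays available throughout
for state in events:
    if state == "announced" and direct not in table:
        table.append(direct)
    elif state == "withdrawn" and direct in table:
        table.remove(direct)
    chosen = best_path(table)
    # Every flap forces routers to rerun best-path selection and shift traffic.
    print(f"direct route {state:9s} -> traffic forwarded via {chosen['via']}")
```

Each withdraw/readvertise cycle in the toy run shifts traffic between the direct peer and the transit provider, which is the same churn ThousandEyes observed at internet scale.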

Depending on the routing technology these ISPs were using (for example, software-defined networking or MPLS traffic engineering enabled by the network-control protocol RSVP), "all of these services required some time to react to an influx of a large amount of traffic. And if they don't have sufficient time to react to the influx of large amounts of traffic, inevitably, what you're going to see is overutilization of certain interfaces, ultimately leading to drops." The resulting heavy packet loss "is something that would definitely be observed by the customers, and it would show itself in a really bad experience."

As for the cause of the connectivity disruptions, ThousandEyes said the scope and rapidity of the changes suggest an administrative change, most likely involving automation technology, that destabilized global paths to Microsoft's prefixes.

"Given the rapidity of these changes in the routing table, we think that some of this was caused by automated action on the Microsoft side," Sanjta said. "Essentially, we believe that there was certain automation that kicked in, that did something that was unexpected from a traffic-engineering perspective, and it …"
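Sanjta's point about interfaces being overutilized when diverted traffic arrives faster than a transit link can absorb it can be pictured with a simple capacity model. The sketch below is a back-of-the-envelope simulation only; the link capacity, buffer size, and traffic figures are made-up numbers, not measurements from the outage.

```python
# Rough model of why diverted traffic causes drops: when offered load on a
# transit interface exceeds its capacity for longer than its buffers can absorb,
# the excess is tail-dropped. All figures are made-up numbers for illustration.

def simulate_interface(capacity_gbps, buffer_gb, offered_gbps_per_sec):
    """Return per-second drop percentages for a fixed-capacity interface."""
    buffered = 0.0
    drops = []
    for offered in offered_gbps_per_sec:
        excess = offered - capacity_gbps          # traffic beyond line rate this second
        buffered = max(0.0, buffered + excess)    # queue drains when excess is negative
        dropped = max(0.0, buffered - buffer_gb)  # whatever the buffer cannot hold is lost
        buffered = min(buffered, buffer_gb)
        drops.append(100.0 * dropped / offered if offered else 0.0)
    return drops

# Steady 80 Gb/s jumps to 150 Gb/s when traffic is diverted onto a 100 Gb/s
# transit link, then returns to normal once direct paths come back.
load = [80, 80, 150, 150, 150, 80, 80]
for second, pct in enumerate(simulate_interface(100, 10, load)):
    print(f"t={second}s offered={load[second]:>3} Gb/s loss={pct:4.1f}%")
```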
