4 networking best practices learned from the Atlassian network outage



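Last month, software tools vendor Atlassian suffered a major network outage that lasted two weeks and affected more than 400 of its over 200,000 customers. The outage took down several of its products, including Jira, Confluence, Atlassian Access, Opsgenie, and Statuspage.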


The outage was the result of a series of unfortunate internal errors by Atlassian’s own staff, and not the result of a cyberattack or malware. In the end, no customer lost more than a few minutes’ worth of data transactions, and the vast majority of customers didn’t see any downtime whatsoever.

What is interesting about the entire Atlassian outage situation is how badly the company managed its initial communication of the incident to its customers, and then how it eventually published a lengthy blog post that goes into great detail about the circumstances.



The document is frank, factual, and full of important revelations, and it should be required reading for any engineering or network manager. Any business that depends on software can use it as a template for locating and fixing similar mistakes of its own, and as a discussion framework for honestly assessing its disaster recovery playbooks.

Lessons learned from the incident

The trouble began when the company decided to delete a legacy app that was being made redundant by the purchase of a functionally similar piece of software. However, they made the mistake of giving two different teams separate but related responsibilities. One team requested that the redundant app be deleted, but another was charged with figuring out how to actually do the task. That should have raised some red flags immediately.

The two teams didn’t use the same language and parameters, and as a result had immediate communication problems. For example, one team used the app ID to identify the software to be deleted, but the other team thought they were talking about the ID for the entire cloud instance where the apps were located.


The team that requests network changes and the team that actually implements them should be one and the same. If not, then you need to put in place solid communication tools to ensure that they are in sync, use the same language, and are precise about procedures. Because of the miscommunication, Atlassian engineers didn’t realize the extent of their mistake for several days.
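One way to make "using the same language" concrete is to give each kind of identifier its own type, so that a deletion script rejects anything that is not explicitly an app ID. The sketch below is illustrative only: `AppId`, `SiteId`, and `plan_app_deletion` are hypothetical names, not anything from Atlassian's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppId:
    """Identifies a single app (hypothetical type for illustration)."""
    value: str


@dataclass(frozen=True)
class SiteId:
    """Identifies a whole cloud site (hypothetical type for illustration)."""
    value: str


def plan_app_deletion(targets):
    """Return the app ID strings to delete, refusing anything that is not an AppId."""
    plan = []
    for target in targets:
        if not isinstance(target, AppId):
            # Stop before anything is deleted if the request mixes up ID kinds.
            raise TypeError(
                f"expected AppId, got {type(target).__name__}: {target!r}"
            )
        plan.append(target.value)
    return plan
```

With this in place, a request that passes a `SiteId` where an `AppId` is expected fails while the plan is being built, before any deletion runs.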

But cross-team communication was only one part of the problem. When Atlassian analyzed the communications between its managers and its customers, it found that details about the outage were posted within a day on its own monitoring systems. However, the company could not directly reach some of its customers, because contact information was lost when the legacy sites were deleted and other information was woefully outdated.

Plus, the deleted data contained information that was necessary for customers to fill out a valid support request ticket. Getting around this problem required a group of developers to build and deploy a new support ticketing process. The company also admits they should have reached out earlier in the outage timeline and not waited until they had a full picture of the scope of the recovery processes.

This would have allowed customers to better plan around the incident, even without specific time frames. “We should have acknowledged our uncertainty in providing a site restoration date sooner and made ourselves available earlier for in-person discussions so that our customers could make plans accordingly. We should have been transparent about what we did know about the outage and what we didn’t know.”


Treat your customer data with care: ensure that it is current, accurate, and backed up in multiple, separate places. Make sure your customer data can survive a disaster, and include specific checks for it in any playbook.
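A playbook check of this kind could be as simple as comparing the primary contact store against a separate backup copy and flagging stale entries. The following is a minimal sketch under assumed conditions: the record layout (`email`, `last_verified`), the 180-day threshold, and the function name `audit_contacts` are all hypothetical.

```python
from datetime import date, timedelta


def audit_contacts(primary, backup, today, max_age_days=180):
    """Return {customer_id: [problems]} for contacts that need attention.

    primary and backup map customer_id -> {"email": str, "last_verified": date}.
    """
    problems = {}
    cutoff = today - timedelta(days=max_age_days)
    for customer_id, record in primary.items():
        issues = []
        if customer_id not in backup:
            issues.append("missing from backup")
        elif backup[customer_id]["email"] != record["email"]:
            issues.append("backup out of sync")
        if record["last_verified"] < cutoff:
            issues.append("stale")
        if issues:
            problems[customer_id] = issues
    return problems
```

Run on a schedule, a check like this surfaces missing or outdated contact details before an incident rather than during one.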


“Our site-level deletions that happened in April did not have runbooks that could be quickly automated for the scale of this event,” they wrote in their analysis. “We had the ability to recover a single site, but we had not built capabilities and processes for recovering a large batch of sites.”
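The gap between recovering one site and recovering many can be illustrated with a small wrapper that applies a single-site restore routine across a batch and records failures instead of stopping at the first one. This is only a sketch: `restore_site` stands in for whatever single-site procedure already exists, and `restore_batch` and its parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def restore_batch(site_ids, restore_site, max_workers=4):
    """Run a single-site restore over many sites; return (restored, failed)."""
    restored, failed = [], {}

    def attempt(site_id):
        try:
            restore_site(site_id)
            return site_id, None
        except Exception as exc:
            # Record the failure and let the other sites continue.
            return site_id, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as site_ids.
        for site_id, error in pool.map(attempt, site_ids):
            if error is None:
                restored.append(site_id)
            else:
                failed[site_id] = error
    return restored, failed
```

Returning the failed sites with their errors gives responders a list to retry or escalate, which a one-site-at-a-time runbook does not provide.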

In the blog confessional, they chart their previous large-scale incident management process. It has a lot of moving parts, and it wasn’t up to the task to “handle the depth, expansiveness and duration of the April incident.”








Copyright © 2022 IDG Communications, Inc.
