4 networking best practices learned from the Atlassian network outage

The software vendor published a detailed analysis of what went wrong to help others avoid the same fate


Last month, software tools vendor Atlassian suffered a major network outage that lasted two weeks and affected more than 400 of its over 200,000 customers. The outage took down several of its products, including Jira, Confluence, Atlassian Access, Opsgenie, and Statuspage.

While only a handful of customers were affected for the full two weeks, the outage was significant in terms of the depth of the problems the company's engineers uncovered and the time it took them to find and fix them.

The outage was the result of a series of unfortunate internal errors by Atlassian’s own staff, and not the result of a cyberattack or malware. In the end, no customer lost more than a few minutes’ worth of data transactions, and the vast majority of customers didn’t see any downtime whatsoever.

What is interesting about the entire Atlassian outage is how badly the company managed its initial communication of the incident to its customers, and how it eventually published a lengthy blog post detailing the circumstances.

Rarely has a vendor that suffered such a massive and public outage taken the trouble to piece together what happened and why, and provided a roadmap that others can learn from.

In the post, they carefully describe their existing IT infrastructure, point out the flaws in their disaster recovery plans, explain how they will address those shortcomings to prevent future outages, and lay out the timelines, workflows, and ways in which they intend to improve their processes.

The document is frank, factual, and full of important revelations, and it should be required reading for any engineering or network manager. It can serve as a template for any business that depends on software to locate and fix similar mistakes of its own, and as a discussion framework for honestly assessing your own disaster recovery playbooks.

Lessons learned from the incident

The trouble began when the company decided to delete a legacy app that had been made redundant by the purchase of a functionally similar piece of software. However, they made the mistake of splitting the work between two different teams with separate but related responsibilities: one team requested that the redundant app be deleted, while another was charged with figuring out how to actually do it. That should have raised red flags immediately.

The two teams didn’t use the same language and parameters, and as a result had immediate communication problems. For example, one team used the app ID to identify the software to be deleted, but the other team thought they were talking about the ID for the entire cloud instance where the apps were located.
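This kind of mix-up can be caught mechanically. Below is a minimal sketch, in Python, of a guardrail that forces every deletion request to state the type of identifier explicitly and refuses anything that doesn't match; the `DeletionRequest` structure and ID prefixes are hypothetical illustrations, not Atlassian's actual tooling.

```python
from dataclasses import dataclass

# Hypothetical ID prefixes (not Atlassian's real scheme) that make the *type*
# of the thing being deleted explicit in every request.
APP_PREFIX = "app-"
SITE_PREFIX = "site-"

@dataclass
class DeletionRequest:
    requested_by: str   # team that filed the request
    executed_by: str    # team that will actually run the deletion
    target_type: str    # "app" or "site", stated explicitly by the requester
    target_id: str      # must carry the matching prefix

def validate(req: DeletionRequest) -> None:
    """Refuse any request where the stated type and the ID itself disagree."""
    expected = {"app": APP_PREFIX, "site": SITE_PREFIX}.get(req.target_type)
    if expected is None:
        raise ValueError(f"unknown target type '{req.target_type}'")
    if not req.target_id.startswith(expected):
        raise ValueError(
            f"'{req.target_id}' does not look like a {req.target_type} ID; "
            f"requester ({req.requested_by}) and executor ({req.executed_by}) "
            "may not be talking about the same object"
        )

# The mix-up behind the outage would fail fast here: one team means an app,
# the other supplies an ID for the whole cloud instance.
# validate(DeletionRequest("team-a", "team-b", "app", "site-1234"))  # raises ValueError
```

The point is not the specific check but the principle: the request and the execution should share a single, unambiguous vocabulary that a machine can verify.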

Lesson 1: Improve internal and external communications

The team that requests network changes and the team that actually implements them should be one and the same. If not, then you need to put solid communication tools in place to ensure that they are in sync, using the same language, and precise about procedures. Because of the miscommunication, Atlassian engineers didn't realize the extent of their mistake for several days.

But cross-team communication was only one part of the problem. When Atlassian analyzed the communications between its various managers and its customers, it discovered that it had posted details about the outage on its own monitoring systems within a day, but it wasn't able to directly reach some of its customers, because their contact information was lost when the legacy sites were deleted, and other contact information was woefully outdated.

Plus, the deleted data contained information that was necessary for customers to fill out a valid support request ticket. Getting around this problem required a group of developers to build and deploy a new support ticketing process. The company also admits they should have reached out earlier in the outage timeline and not waited until they had a full picture of the scope of the recovery processes.

This would have allowed customers to better plan around the incident, even without specific time frames. “We should have acknowledged our uncertainty in providing a site restoration date sooner and made ourselves available earlier for in-person discussions so that our customers could make plans accordingly. We should have been transparent about what we did know about the outage and what we didn’t know.”

Lesson 2: Protect your customer data

Treat your customer data with care: ensure that it is current and accurate, and that it is backed up in multiple, separate places. Make sure your customer data can survive a disaster, and include specific checks for it in every playbook.
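One way to make "specific checks" concrete is a small verification step that confirms every customer record exists in each backup location and was refreshed recently. The sketch below is a generic illustration under assumed names (the location labels, `MAX_AGE` window, and record layout are placeholders, not any particular vendor's backup API).

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=1)  # assumed freshness requirement for contact records

def verify_customer_backups(customer_ids, backups, now=None):
    """backups maps a location name to {customer_id: last_updated_datetime}.
    Returns a list of problems; an empty list means the playbook check passes."""
    now = now or datetime.now(timezone.utc)
    problems = []
    for cid in customer_ids:
        for location, records in backups.items():
            updated = records.get(cid)
            if updated is None:
                problems.append(f"{cid}: missing from {location}")
            elif now - updated > MAX_AGE:
                problems.append(f"{cid}: stale in {location} (last update {updated})")
    return problems

# Example run against two assumed backup locations:
now = datetime.now(timezone.utc)
backups = {
    "primary-backup": {"cust-1": now, "cust-2": now - timedelta(days=3)},
    "offsite-backup": {"cust-1": now},
}
print(verify_customer_backups(["cust-1", "cust-2"], backups))
# -> flags cust-2 as stale in one location and missing from the other
```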

This raises another point about disaster recovery. During the April outage, Atlassian missed its recovery time objective (obviously, given the weeks it took to restore systems) but managed to hit its recovery point objective, because it was able to restore data to within minutes of the actual outage. It also had no automated way to select a group of customer sites and restore all of their interconnected products from backup to a prior point in time.

“Our site-level deletions that happened in April did not have runbooks that could be quickly automated for the scale of this event,” they wrote in their analysis. “We had the ability to recover a single site, but we had not built capabilities and processes for recovering a large batch of sites.”
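What a runbook "that could be quickly automated for the scale of this event" might look like is sketched below: iterate over a batch of sites and restore each one's interconnected products to the same point in time, in parallel rather than one manual recovery at a time. The `restore_site` function and its parameters are assumptions for illustration, not Atlassian's tooling.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime

def restore_site(site_id: str, point_in_time: datetime) -> str:
    """Placeholder for a per-site restore: bring back every product the site
    uses (issues, wikis, access config, ...) from backup to the same instant."""
    raise NotImplementedError("wire this to your actual restore pipeline")

def restore_batch(site_ids: list, point_in_time: datetime, workers: int = 8) -> dict:
    """Restore many sites concurrently and report per-site success or failure,
    instead of treating each site as a one-off manual recovery."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(restore_site, s, point_in_time): s for s in site_ids}
        for future in as_completed(futures):
            site = futures[future]
            try:
                results[site] = future.result()
            except Exception as exc:  # keep going; one failed site shouldn't stall the batch
                results[site] = f"failed: {exc}"
    return results
```

Having even a rough version of this written, tested, and rehearsed ahead of time is the difference between a single-site capability and a batch one.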

In the blog confessional, they chart their previous large-scale incident-management process; you can see that it has a lot of moving parts and wasn't able to “handle the depth, expansiveness and duration of the April incident.”

Lesson 3: Test complex disaster recovery scenarios

Check and re-check your disaster recovery procedures, playbooks, and processes to ensure they meet your various objectives. Make sure you test scenarios across a variety of customer infrastructures. That means specifically addressing and anticipating larger-scale incident responses, and understanding the complex relationships among customers who use multiple products or depend on interlocking collections and sequences of your applications.

If you rely on automation, make sure your APIs are functioning properly and send appropriate warning signals when they aren't. This was one of the problems Atlassian had to debug as the outage dragged on for days.
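A minimal sketch of the "send a warning signal when the API isn't working" idea, assuming a generic HTTP health endpoint and an `alert` hook of your own; neither is specific to Atlassian's stack.

```python
import urllib.error
import urllib.request

HEALTH_URL = "https://api.example.com/health"   # placeholder endpoint
TIMEOUT_SECONDS = 5

def alert(message: str) -> None:
    """Placeholder: page the on-call engineer, post to chat, open an incident, etc."""
    print(f"ALERT: {message}")

def check_api_health() -> bool:
    """Return True if the API answers its health check; alert loudly if not."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=TIMEOUT_SECONDS) as response:
            if response.status != 200:
                alert(f"health check returned HTTP {response.status}")
                return False
            return True
    except (urllib.error.URLError, TimeoutError) as exc:
        alert(f"health check failed entirely: {exc}")
        return False
```

Run on a schedule, a check like this turns a silently broken automation pipeline into an alarm you can act on before customers notice.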

Lesson 4: Protect your configuration data

Finally, there is the question of how data gets deleted, which is what started the entire outage. They now realize that data, and especially data for an entire site, should never be deleted outright. Atlassian is moving to what it calls “soft deletes,” in which data is not disposed of immediately, but only after a defined rollback window has passed and a number of safeguards have been satisfied.

Atlassian is establishing a “universal soft delete” policy across all of its systems and creating a set of standards and internal reviews to go with it. Soft deletes should be more than just one option among many: don't delete any configuration data until the process has been tested across your entire infrastructure.
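A soft delete is a well-established pattern: instead of removing a row or a site outright, mark it deleted and keep it recoverable until a retention window expires. The sketch below shows one common way to do this with a `deleted_at` marker; the table name and 30-day retention period are assumptions for illustration, not Atlassian's "universal soft delete" implementation.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)   # assumed rollback window

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sites (
        site_id    TEXT PRIMARY KEY,
        config     TEXT NOT NULL,
        deleted_at TEXT              -- NULL means the site is live
    )
""")

def soft_delete(site_id: str) -> None:
    """Mark the site deleted instead of dropping its data."""
    conn.execute("UPDATE sites SET deleted_at = ? WHERE site_id = ?",
                 (datetime.now(timezone.utc).isoformat(), site_id))

def restore(site_id: str) -> None:
    """Roll a soft-deleted site back by clearing the marker."""
    conn.execute("UPDATE sites SET deleted_at = NULL WHERE site_id = ?", (site_id,))

def purge_expired() -> None:
    """Only rows whose retention window has fully passed are ever removed for real."""
    cutoff = (datetime.now(timezone.utc) - RETENTION).isoformat()
    conn.execute("DELETE FROM sites WHERE deleted_at IS NOT NULL AND deleted_at < ?",
                 (cutoff,))
```

The design choice is deliberate friction: a mistaken deletion becomes a quick `restore` call during the retention window instead of a weeks-long rebuild from backups.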


Copyright © 2022 IDG Communications, Inc.
