It’s not always possible to predict massive traffic spikes, but some organizations are fortunate enough to have some idea of when their traffic will increase. Those with that foresight can take steps to ensure that they are ready when the time comes.
For example, ahead of the 2018 midterm elections, the New York Times predicted that they would see a surge in users leading up to and on Election Day.
According to Shesh Patel, engineering manager at the New York Times, midterm elections aren’t typically as popular as presidential elections, but the paper knew that 2018 would be different.
In order to ensure that they could handle the increased traffic, the engineering team performed stress tests on the system, he explained in a talk at Datadog Dash last month.
During their tests, they found some issues in the system that they were able to fix. By the time Election Day came, they had already rehearsed what to do when things went awry, and they were able to fix problems quickly. As an added bonus, the team now feels more prepared for the 2020 elections, Patel explained.
According to Tomer Levy, CEO and founder of monitoring company Logz.io, the first step in preparing is sitting down and planning. He recommends that teams talk through the consequences of a potential outage, though he admits that planning has limitations because it’s hard to predict how web traffic will behave.
The second part of the process is testing. “You want to make sure you try to simulate that and that’s a lot of work,” said Levy. “It’s easier said than done. But if you can simulate high traffic loads and then see how the downstream parts of your applications … behave under these conditions, you can find the weak spot where you need to invest more.”
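As a rough illustration of the kind of simulation Levy describes, the following Python sketch fires a batch of concurrent calls at a stand-in for a downstream component and summarizes errors and latency. The function name `run_load_test`, the `handler` callable, and the `fake_billing_call` stub are all hypothetical; a real stress test would replace the stub with actual requests against a staging environment and would likely use a dedicated load-testing tool.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(handler, total_requests, concurrency):
    """Call `handler` total_requests times across `concurrency` threads.

    `handler` is any zero-argument callable standing in for one request to
    the system under test (an assumption for this sketch). Returns a summary
    dict with the error count and latency percentiles in seconds.
    """
    if total_requests < 1 or concurrency < 1:
        raise ValueError("total_requests and concurrency must be at least 1")

    def timed_call(_):
        start = time.perf_counter()
        try:
            handler()
            ok = True
        except Exception:
            # Any exception counts as a failed request.
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    errors = sum(1 for ok, _ in results if not ok)

    def percentile(p):
        # Nearest-rank percentile over the sorted latencies.
        index = max(0, math.ceil(p / 100 * len(latencies)) - 1)
        return latencies[index]

    return {
        "requests": total_requests,
        "errors": errors,
        "p50": percentile(50),
        "p95": percentile(95),
    }


def fake_billing_call():
    """Hypothetical downstream call; sleeps briefly to mimic work."""
    time.sleep(0.001)


if __name__ == "__main__":
    print(run_load_test(fake_billing_call, total_requests=200, concurrency=20))
```

Rerunning a test like this at increasing concurrency levels and watching where the error count or the 95th-percentile latency starts to climb is one way to locate the weak spot that needs more investment.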
Third, and perhaps most important, is gaining visibility. According to Levy, most companies lack the visibility needed to understand when they’re going through a massive traffic spike and where that traffic is coming from.
“So before you can do anything, you need to know that you have issues and where the issues are,” said Levy. “So once you put up good visibility and then you start planning your tests and eventually the last part is you have to build your application in a way which can scale.”
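To make the visibility point concrete, here is a minimal Python sketch of a monitor that answers the two questions Levy raises: is traffic spiking, and where is it coming from. The `TrafficMonitor` class, its parameters, and the source labels are illustrative assumptions, not part of any monitoring product; production systems would typically rely on a metrics or logging platform rather than in-process counters.

```python
from collections import Counter, deque


class TrafficMonitor:
    """Track requests in a sliding time window and flag spikes.

    Assumes timestamps (in seconds) are passed in non-decreasing order.
    `baseline_per_window` is the request count considered normal for one
    window; a spike is anything above `spike_factor` times that baseline.
    """

    def __init__(self, window_seconds, baseline_per_window, spike_factor=3.0):
        self.window_seconds = window_seconds
        self.baseline_per_window = baseline_per_window
        self.spike_factor = spike_factor
        self._events = deque()  # (timestamp, source) pairs, oldest first

    def record(self, timestamp, source):
        """Register one request from `source` at `timestamp`."""
        self._events.append((timestamp, source))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the sliding window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def is_spiking(self, now):
        """Return True if the current window exceeds the spike threshold."""
        self._evict(now)
        return len(self._events) > self.spike_factor * self.baseline_per_window

    def top_sources(self, now, n=3):
        """Return the n busiest sources in the current window."""
        self._evict(now)
        return Counter(source for _, source in self._events).most_common(n)


if __name__ == "__main__":
    monitor = TrafficMonitor(window_seconds=60, baseline_per_window=10)
    for _ in range(40):
        monitor.record(100, "social")
    print(monitor.is_spiking(100), monitor.top_sources(100))
```

The fixed baseline is the simplest possible choice; a rolling average of recent windows would adapt better to normal daily variation.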
According to Levy, the current trends in software development lend themselves well to doing this kind of planning and testing. With microservices and containers, it’s easy to drill down into the different layers of an application.
Once you look at the different layers, including the support layers, it’s easy to scale them up. For example, a team could add more resources to a part of the application that is currently acting as a bottleneck. “You have to know about it and launch automatically more of those instances to serve more billing requests [or whatever the bottleneck is] when we go beyond this level of web spike. With strong monitoring, you can take proactive measures to add resources and add cycles and add workers to support these spikes,” Levy said.
Designing your application to do this allocation automatically will give you relief when the day finally comes and you are hit with higher-than-normal traffic, he explained.
In addition to technical preparation, organizations should also ensure that their employees are ready to respond when the time comes. Levy recommends 24/7 shifts before, during, and after a high-traffic incident.
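The automatic allocation described here usually comes down to a rule that turns an observed load into an instance count. The Python sketch below shows one such rule under stated assumptions: the function name `desired_instances`, the per-instance capacity figure, the headroom percentage, and the minimum and maximum bounds are all hypothetical values chosen for illustration, and a real deployment would delegate this decision to the autoscaler of its cloud or container platform.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      min_instances=2, max_instances=50, headroom_percent=20):
    """Return how many instances to run for the observed load.

    capacity_per_instance is the request rate one instance can serve
    (an assumed, pre-measured figure). headroom_percent adds spare
    capacity on top of the observed load. The result is clamped to the
    range [min_instances, max_instances].
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    if min_instances > max_instances:
        raise ValueError("min_instances cannot exceed max_instances")

    # Load including headroom, divided by what one instance can handle.
    needed = math.ceil(
        requests_per_second * (100 + headroom_percent)
        / (100 * capacity_per_instance)
    )
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    # Hypothetical billing service: each instance handles 100 requests/second.
    for load in (50, 1000, 100000):
        print(load, desired_instances(load, capacity_per_instance=100))
```

The upper bound matters as much as the lower one: it caps cost and protects downstream dependencies if the load measurement itself is wrong.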
“Usually good engineering teams will have that briefing and will have real-time postmortem on issues they had,” Levy said. The goal is to ensure that the team actually operates as a team during a critical, real-time event.
It’s also important to revisit plans regularly as technology changes or gets updated.
“Internally, we run at a very high scale with a lot of big web spikes,” said Levy. “Our customers experience big events and we experience similar traffic. We have a monthly meeting [where we discuss] where are the risks, what are the new bottlenecks, what has changed, what new risks have emerged, and how do we address them?”
By constantly refining your plan, your organization will be able to incorporate lessons it has learned from past events and ensure it is ready for whatever comes next.