QCon New York Sessions - Incident Response with Etsy

“Incident response – what makes it so terribly difficult?” – John Allspaw at QCon New York

“Anomaly response does not happen the way we might imagine it does,” John Allspaw, CTO at Etsy, said in his opening keynote presentation at QCon New York, “Incident Response: Trade-offs Under Pressure.”

Can we trust tools?

One of Allspaw's first points was that organizations cannot rely on tools alone to understand how and why incidents occur. Instead, teams need to rely on processes and reasoning to respond to anomalies effectively. And they cannot, he said, treat these outages as a mystery that is constantly developing over time.

John Allspaw on tools at QCon New York

Allspaw believes that tools designed for incident response may never actually simplify the process.

“An outage is not a detective story,” Allspaw said. “It’s static, and it’s there.”

A model of reasoning

In order to properly deal with outage-causing anomalies, Allspaw recommended that organizations implement a “model of reasoning” that does not “distinguish between diagnosis and therapy.”

John Allspaw presents "model of reasoning" at QCon New York

Allspaw presents this model as an ideal strategy for anomaly response.

Avoiding “cognitive fixation”

Listeners were also warned not to fall into the traps of “thematic vagabonding” and “cognitive fixation” – meaning that those debugging the code can become so wrapped up in fixing bugs and symptoms that they fail to dig deeper and discover the actual root cause of the issue.

“As one thread of diagnosis comes in, you start running to more,” Allspaw said. He said that avoiding this requires developers and testers to communicate about what they are seeing and not get stuck alone on a path of just fixing bug after bug.

In fact, he provided a list of “prompts” that teams can use to frame particular questions, dividing them into four “stages” of incident response: observations, hypotheses, coordination and suggesting actions. By asking these questions, team members may be able to avoid “cognitive fixation” and get to the root of the problem.

Allspaw's "questions to ask" at QCon New York

Allspaw provided a list of question ideas and prompts that can help move anomaly response forward.

Final notes

Allspaw also talked about the importance of linking anomalies to any known, recent changes in the code or application and, even more, of having peers review your hypotheses.
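As a rough illustration of linking an anomaly to recent changes, the sketch below filters a change log down to entries made shortly before the anomaly was noticed. It is not from Allspaw's talk: the `CHANGES` list, the `recent_changes` function, the one-hour window and all timestamps are hypothetical, and a real team would pull this data from its own deploy or change-tracking system.

```python
from datetime import datetime, timedelta

# Hypothetical change log of (timestamp, description) pairs.
# These entries are invented for illustration only.
CHANGES = [
    (datetime(2024, 1, 1, 9, 0), "deploy: search service config"),
    (datetime(2024, 1, 1, 13, 40), "deploy: checkout feature flag"),
    (datetime(2024, 1, 1, 14, 5), "schema change: orders table"),
]


def recent_changes(anomaly_time, changes, window=timedelta(hours=1)):
    """Return changes made within `window` before the anomaly, newest first."""
    start = anomaly_time - window
    hits = [c for c in changes if start <= c[0] <= anomaly_time]
    return sorted(hits, key=lambda c: c[0], reverse=True)


if __name__ == "__main__":
    # Assumed time at which the anomaly was first observed.
    anomaly = datetime(2024, 1, 1, 14, 15)
    for when, what in recent_changes(anomaly, CHANGES):
        print(when.isoformat(), what)
```

The output is only a list of candidates. In keeping with the talk's advice, each one is a hypothesis to check with a peer rather than a conclusion.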

“Validate the hypothesis that most easily comes to mind,” he said, adding that anyone who grows confident they have found the cause of an outage should always check that confidence through peer review.

The "punch line" of John Allspaw's talk at QCon New York

Allspaw sums up his presentation at QCon New York by saying teams need to rethink how they approach incident and anomaly response.
