Problem solve Get help with specific problems with your technologies, process and projects.

QCon New York Sessions – Fault injection with Microsoft

When it comes to testing software, many of today’s organizations rely heavily on comprehensive testing, especially unit testing, to minimize the risk of outages. But in this session, Michalis Zervos of Microsoft talked to audience members about what some consider the “next generation” of creating software resiliency: actually taking those anticipated faults and forcing them to occur to your software.

Fault injection,” as Zervos refers to it, can be performed on everything from virtual machines, to custom applications to hardware. And this is a practice Zervos’ team at Microsoft actively uses and promotes in order to see not just how particular services and such are affected by certain unwanted events, but also how the dependent services and software are affected as well.

Fault injection benefits

Zervos explains some of the reasons to adopt fault injection alongside testing.

“We create ‘storms in the cloud’ to see how it performs under pressure and failure and use that to create resiliency,” he said. And according to Zervos, fault injection can be used for more than just testing resiliency. It can also be used for things like testing new features, training and verifying staged deployments.

Zervos covered the numerous faults that teams could consider injecting, including creating a kernal panic, “hooking” and disrupting critical service code, crashing critical processes and even pulling the power plug on your data center. He also suggested a few publically available tools that development teams can use to make the process easier, such as Consume.exe, Sysinternals tools and “managed code fault injection” through TestApi, a library of test and utility APIs.

Zervos did warn audience members that fault injection cannot be performed without certain precautions and considerations in order to achieve accurate results and avoid creating more problems. He cautioned that teams need to still follow fundamental security principles such as the least-privilege principle, make extensive use of code signing, create a “safety net” for the automatic removal of faults should they get out of a tester’s control and have a “kill switch” available, which he said can save developers and testers “a lot of grief.”

Zervos also stressed this importance of extensive verification and reporting when it comes to fault injection. He also instructed audience members that it is useful to manage fault injection from a centralized location.

“If you are not able to verify what happened, you don’t get the most out of your system,” he said.

System architecture for fault injection

Zervos presents his own system architecture in relation to a centralized fault management service.

One of Zervos’ final points was that it is not enough to simply perform fault injection every now and again. He stressed that teams need to integrate fault injection as a continuous part of the production cycle and find creative ways to encourage teams to adopt its practice. One suggestion he made was the idea of “recovery games,” in which one team member simulates an attack on a particular system and another team member, often a trainee, must record what occurs and take the proper steps to mitigate the risks of an outage. By implementing these types of programs, Zervos said his organization was able to increase adoption of fault-injection and also garner helpful insights about the behaviors of team members, such as that some spent too much time debugging and not enough time actually mitigating the problem.

“It needs to be part of the engineering process and part of the culture of the company,” Zervos said.

Fault injection recovery game goals

Zervos provides examples of the goals that can be achieved through adoption and training programs such as “recovery games.”

John Billings, technical lead on one of the infrastructure teams at Yelp and attendee of Zervos’ talk, said he thoroughly enjoyed the session and believes that fault injection is “the next step in actually testing resiliency of production systems,” he said.

Billings, who also held a talk at QCon on the “human side of microservices,” said he particularly liked the fact that Zervos spent his time discussing the general principles of fault injection rather than talking about specific technologies. And while his company does already make use of fault injection techniques, he is hoping to push the adoption of this strategy even further within his company and hopes that others will as well.

“Tests can only cover so much that you’ve thought about beforehand,” he said. “If you actually have fault injection happening all the time in production, you get that additional level of reliability that otherwise would be very difficult to achieve.”

Billings also said he liked the idea of introducing “fault injection games” as an approach to encouraging the adoption of this strategy, but believes that these adoption strategies must be align with a company’s individual culture. For instance, he noted hearing about the idea of a “badge-based system” that awards teams particular badges for completing and adopting certain testing and production techniques.

“You have to experiment and just see what works for your particular culture and your company,” he said.