
5 proven patterns for resilient software architecture design

Maintaining a resilient software architecture is a constant battle. Luckily, there are a few design methods that can help teams instill the reliability they desperately need.

There is an ironic dichotomy in today's software stacks: While many things are programmed to be operationally deterministic and predictable, the inevitably unstable and unpredictable nature of deployment and production environments means that failures can come at any time. On top of that, there seems to be an exponential rise in the effort and cost related to eliminating failures, especially as architectures transform into increasingly distributed collections of dynamic and complex processes.

Luckily, there is a set of coding and design patterns that eases the path to failure mitigation and can help put the right recovery mechanisms in place to solidify the resilience of a software architecture and the applications housed within it. In this article, we'll explore five resilience patterns that can help teams get in front of failures, prevent them from running rampant through a distributed software system, and -- when needed -- gradually decommission problematic components without disturbing the whole operation.


The bulkhead pattern

The first design we'll examine is the bulkhead pattern. Here, the term bulkhead refers to the compartmentalized partitions often built into the hull of a ship. These bulkheads have a few responsibilities, one of them being to help the ship stay afloat in the case of a water breach in the hull. If water manages to enter one part of the ship, the leak will likely remain confined to the compartment where the hole appeared rather than flood the entire vessel. Some of the more modern bulkhead designs are even capable of halting fires and mitigating the potential damage caused by an electromagnetic pulse.

When multiple subsystems run as part of a single, overarching process, one component fault can easily cause cascading failures. Inspired by the bulkheads built into the hulls of ships, this resilience pattern enables developers to design a system as multiple independent subsystems and services, each running in its own private machine or container. This limits the effect of a failure on neighboring processes, allows teams to examine those failures in isolation, and is arguably a hallmark benefit of an architecture style like microservices, which demands loose coupling between software components.
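One common way to apply the bulkhead pattern inside a single process is to give each downstream dependency its own bounded worker pool, so a slowdown in one dependency cannot exhaust the threads serving the others. The sketch below is a minimal, hypothetical illustration in Python; the dependency names and pool sizes are assumptions, not part of the pattern itself.

```python
from concurrent.futures import ThreadPoolExecutor

# Each dependency gets its own small, bounded pool -- the "compartments"
# of the bulkhead. A hung payments call can tie up at most 4 threads,
# leaving inventory calls unaffected.
BULKHEADS = {
    "inventory": ThreadPoolExecutor(max_workers=4),
    "payments": ThreadPoolExecutor(max_workers=4),
}

def call_dependency(name, fn, *args):
    """Run fn in the worker pool reserved for the named dependency."""
    return BULKHEADS[name].submit(fn, *args)

# Usage: the call runs inside the "inventory" compartment only.
future = call_dependency("inventory",
                         lambda sku: {"sku": sku, "in_stock": True},
                         "A-100")
print(future.result())
```

The same idea scales up from thread pools to separate containers or machines; the compartment boundary just moves outward.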

The backpressure pattern

Backpressure is a resilience approach in which individual application systems and services autonomously push back incoming workloads that exceed their current throughput capacity. For instance, if slow database queries and congested network traffic are causing long delays in remote service calls, a service can push back those workloads to preserve its performance. This also informs the calling system why responses are delayed, which can prevent it from stalling while waiting for a response or from making repeated, failed calls.

Ideally, backpressure mechanisms ought to cascade across multiple nodes -- if a component in a chain of calls is unable to deliver the required throughput, it should be able to push that workload back through as many upstream producers and steps in the process as necessary. As the system or service in question recovers and catches up, it can then gradually allow "backpressured" calls to reach it again. This resilience pattern can often help manage throughput naturally, without the need to pile an unfair or unregulated number of requests on any single component.
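At its simplest, a service can exert backpressure by guarding its intake with a bounded queue and rejecting work immediately when that queue is full, so the caller learns to back off rather than silently piling up requests. Here's a minimal sketch in Python; the queue size and return values are illustrative assumptions.

```python
import queue

# A bounded intake: at most 2 items may be pending at once.
work = queue.Queue(maxsize=2)

def submit(item):
    """Accept the item if there is capacity; otherwise signal backpressure."""
    try:
        work.put_nowait(item)
        return "accepted"
    except queue.Full:
        return "backpressure"  # caller should slow down or retry later

print(submit("a"), submit("b"), submit("c"))  # accepted accepted backpressure
```

In a cascading setup, the caller would propagate that "backpressure" signal to its own producers the same way, letting the slowdown ripple upstream instead of overloading any single component.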

The circuit breaker pattern

While backpressure is an effective way to regulate load, it often isn't a strong enough method to address performance problems on its own. For instance, companies operating complex software environments need systems that can recognize the onset of performance problems and disconnect completely from producers to allow the systems to catch up. This is where the circuit breaker pattern comes into play.

Mimicking the electrical circuit breakers that cut connections in the event of power surges and overloads, this pattern defines a predetermined workload stress limit at which the receiving entity sets its incoming connection to an "open" state. This means it will reject any new workload requests and halt the pending message queue. When workload stress levels and throughput drop back down to an acceptable level, the circuit closes and the entity starts accepting requests again.

By outfitting these isolated entities to open and close connections, this resilience pattern can stop temporary outages from becoming cascading failures that run rampantly across large swaths of the software stack. As such, it's a particularly important pattern for those designing large, distributed systems to familiarize themselves with.
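A bare-bones circuit breaker can be expressed as a small wrapper that counts consecutive failures, rejects calls outright once a threshold is crossed, and allows a trial call after a cool-down period. The sketch below is a simplified illustration in Python -- the thresholds, timings and class name are assumptions, and production libraries add more states and telemetry.

```python
import time

class CircuitBreaker:
    """Opens after max_failures consecutive errors; rejects calls
    until reset_after seconds have elapsed, then allows a trial call."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: request rejected")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

While the circuit is open, callers fail fast instead of queueing behind a struggling service, which is exactly what gives the downstream system room to recover.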

Fast failure, or slow success?

Speaking generally, a fast failure response is arguably better than a successful network request left in limbo. The reason is that slow success responses have the potential to block resources on the caller until the response is received. Under high enough throughput, this can lead to a rapid exhaustion of resources during wait times.

An effective way to deal with this is to enforce explicit timeouts in service-level agreements (SLAs). If a particular database, service or other component cannot complete a business transaction within a timeframe defined by the SLA, it should consider the transaction a failure and terminate processing. This approach ensures that resource bottlenecks are kept to a minimum, although keep in mind that all participating services should be idempotent to avoid duplicating transactions unnecessarily.
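One way to enforce such an SLA deadline is to run the call on a worker and wait on the result with an explicit timeout, converting a slow success into a fast, explicit failure. This Python sketch is a hypothetical illustration; the helper name and SLA values are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

pool = ThreadPoolExecutor(max_workers=2)

def call_with_sla(fn, sla_seconds, *args):
    """Fail fast if fn does not complete within the SLA deadline."""
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=sla_seconds)
    except TimeoutError:
        future.cancel()
        raise RuntimeError(f"SLA of {sla_seconds}s exceeded; failing fast")

# A fast call succeeds normally...
print(call_with_sla(lambda: "ok", 1.0))

# ...while a call that blows its SLA fails immediately for the caller.
try:
    call_with_sla(lambda: time.sleep(1), 0.05)
except RuntimeError as exc:
    print(exc)
```

Note that cancelling the future does not stop work already running on the worker thread, which is one reason the surrounding text stresses idempotency: the caller may retry a transaction the original worker eventually completes.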


Throttle batch workloads

While resilience patterns like backpressure and circuit breakers are equipped to deal with sudden surges, it is also important to directly address one of the most common causes behind these surges -- batch processing of records. Batch processes load notoriously large numbers of records into a queue and pump them through a processing pipeline on hasty schedules. Unsurprisingly, this builds abrupt, performance-dampening spikes of stress on services, databases and all other related components.

The problem is this: Since batch processes usually run directly on a server, and therefore aren't subject to the existing load balancers on those servers, there is no mechanism in place to manage their throughput. However, it's possible to modify batch jobs so that they push data as regular OLTP transactions through standard network entry points, forcing them to submit to load balancers and trigger the appropriate remediating mechanisms when throughput exceeds acceptable rates.

This type of workload throttling technique can maintain a consistent push rate, ensure that the load balancer distributes jobs appropriately, and help the previously mentioned backpressure and circuit breaker patterns kick in when needed without issue. Keep in mind that, since this technique requires teams to set acceptable limits on workload throughput, you'll need to consider whether a particular service or application system specifically requires a higher rate -- possibly one that exceeds the predetermined limit.
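The throttling described above can be as simple as drip-feeding batch records through the normal entry point at a fixed rate instead of dumping them into a queue all at once. The Python sketch below is a hypothetical illustration; the function name, rate and `send` callback are assumptions standing in for the real load-balanced API call.

```python
import time

def submit_batch(records, per_second, send):
    """Push batch records one at a time, capped at per_second submissions,
    so each goes through the usual load-balanced entry point."""
    interval = 1.0 / per_second
    for record in records:
        send(record)          # normally an HTTP call behind the load balancer
        time.sleep(interval)  # cap the push rate

# Usage: collect what was sent, at a rate the downstream systems can absorb.
sent = []
submit_batch(["r1", "r2", "r3"], per_second=100, send=sent.append)
print(sent)  # ['r1', 'r2', 'r3']
```

Because every record now flows through the standard entry point, the backpressure and circuit breaker mechanisms described earlier apply to batch traffic exactly as they do to interactive traffic.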

Graceful degradation

When all else fails, and a component or service fails completely, it helps to maintain continuity using a fallback mechanism that allows alternative components to automatically pick up the reins. For instance, if the service that supports a website's "product search" function goes down, perhaps an API that handles product listings can use simple database queries to pull up listings and push results to the user. Similarly, developers could configure an alternate payment gateway to step in if it recognizes that the circuit breaker is open for the website's main payment service. Either way, this will at least keep the user interface functional, despite the service outage.
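In code, the fallback described above often reduces to trying the primary service and switching to a degraded alternative when it fails. Here's a minimal Python sketch of the product-search example; the function names and the `db-match` result format are illustrative assumptions.

```python
def search_products(term, primary, fallback):
    """Try the primary search service; on failure, return a degraded
    result from the fallback so the UI stays functional."""
    try:
        return primary(term)
    except Exception:
        return fallback(term)

def primary_search(term):
    # Simulates the dedicated search service being down.
    raise ConnectionError("search service unavailable")

def db_fallback(term):
    # Simulates a simple database query on the product listings.
    return [f"db-match:{term}"]

print(search_products("widget", primary_search, db_fallback))
```

In practice the fallback decision is often driven by the circuit breaker state rather than a bare exception handler, so repeated failures route traffic to the alternative without even attempting the primary call.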

Teams that design an architecture using graceful degradation patterns like this can offer monumentally better continuity for users. For a systems architect, this means it's critically important to recognize that failures will inevitably occur. Instead of prevention, the key to a resilient software architecture rests in strengthening its ability to automatically contain and mitigate failures right as they occur -- and maybe even begin the recovery process on its own.
