Essays and research on topics in software engineering.

Fault Tolerance Patterns

This section consists of essays and examples pertaining to fault tolerance patterns.

Circuit Breaker

A fault-tolerance primitive that sheds load from struggling subsystems.

A circuit breaker is a tool to shed load from struggling subsystems, decrease latency of requests by failing fast, introduce subsystem monitoring, and give operations teams granular control over system interactions [1][2].

A circuit breaker is essentially a protective sleeve around a function call. Clients make requests to the breaker instead of to the protected function, and the breaker selectively forwards those requests to the protected function. If the function becomes unresponsive or erroneous, the breaker may "trip" and refuse to forward subsequent requests.

In this way a breaker can reduce load on strained subsystems accessed by its protected call. This not only gives the strained subsystem an opportunity to recover [1][2], but also prevents the client from dedicating resources to a slow subsystem [1][2], freeing those resources for productive work. The client of a tripped breaker may furthermore fall back on secondary measures (e.g. cached data or default values); while not ideal, this ensures end-users receive usable data within an acceptable window of time [2].

A circuit breaker may generate logs, alerts, or other diagnostics when it is tripped, informing operators of possible system degradation, and may even expose an interface to operators which allows them to manually toggle the breaker between states for troubleshooting [1].
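
A minimal sketch of this mechanism, in Python, is shown below. The class, method, and parameter names (CircuitBreaker, request, failure_threshold, reset_timeout) are illustrative assumptions rather than the API of any particular library: the breaker wraps a protected callable, counts consecutive failures, trips once a threshold is reached, and begins forwarding again only after a cooldown period.

    import time

    class CircuitOpenError(Exception):
        """Raised when the breaker refuses to forward a request."""

    class CircuitBreaker:
        def __init__(self, protected_call, failure_threshold=5, reset_timeout=30.0):
            self._call = protected_call          # the protected function
            self._failure_threshold = failure_threshold
            self._reset_timeout = reset_timeout  # seconds to wait before forwarding again
            self._failures = 0
            self._opened_at = None               # None means the breaker is closed (forwarding)

        def request(self, *args, **kwargs):
            # While tripped, refuse to forward until the cooldown has elapsed.
            if self._opened_at is not None:
                if time.monotonic() - self._opened_at < self._reset_timeout:
                    raise CircuitOpenError("breaker is open; failing fast")
                self._opened_at = None           # cooldown elapsed: try forwarding again

            try:
                result = self._call(*args, **kwargs)
            except Exception:
                self._failures += 1
                if self._failures >= self._failure_threshold:
                    self._opened_at = time.monotonic()   # trip the breaker
                raise
            else:
                self._failures = 0               # success resets the failure count
                return result

A client that catches CircuitOpenError can fall back on cached data or a default value, as described above.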

Examples

The following examples demonstrate operation of a circuit breaker.

Rate Limiter

A fault-tolerance primitive that prevents clients of a system from exceeding a sustainable request rate.

Rate-limiters exist to protect services from unsustainable request volumes, such as during spikes in user traffic or denial-of-service attacks (including friendly-fire from misbehaving collaborator services) [3][4]. While services can utilize rate-limiting to protect themselves from high request volume, they can also use rate-limiting to control the volume of their own requests to unprotected legacy back-ends that are otherwise vulnerable [3]. In either case, rate-limiters deny requests above a certain throughput, shedding load off vulnerable systems, and may even prioritize certain types of requests over others to ensure less-significant systems degrade before more-critical systems [4].

Clients of rate-limited systems should respond to rate-limiting intelligently or else risk cascading failures as errors from one service propagate uncontrolled throughout a network of collaborating services, possibly with self-reinforcing consequences [3]. In order for clients to respond to rate-limiting, services must signal clients when rate-limiting is active, such as by returning HTTP status code 429 Too Many Requests [5] during periods of high load; clients will typically want to retry such requests after backing off (see Retry below). As an alternative to rejecting excessive requests, services may be able to enqueue them for asynchronous fulfillment. In this scenario a service would immediately return a job ID to the client in response to their request, and the client would know to poll the service or subscribe to an appropriate messaging channel for the response [6][3].

Rate-limiting strategies which discard traffic in excess of an allowed rate (e.g. with HTTP 429 responses) are referred to as traffic policing strategies, and strategies which instead defer requests until they conform to a tolerable request rate (e.g. by enqueuing requests for asynchronous fulfillment) are referred to as traffic shaping strategies [7][8].
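
The sketch below contrasts the two strategies at the point where a request exceeds the allowed rate. It is illustrative only: handle_request, fulfill, and the allow callable are hypothetical names, and a real service would return proper HTTP responses rather than dictionaries.

    import queue
    import uuid

    work_queue = queue.Queue(maxsize=1000)   # pending jobs awaiting asynchronous fulfillment

    def fulfill(request):
        # Placeholder for the real work performed on behalf of the request.
        return {"echo": request}

    def handle_request(request, allow, policing=True):
        """allow is a callable implementing the rate-limiting decision (e.g. a token bucket)."""
        if allow(request):
            return {"status": 200, "body": fulfill(request)}

        if policing:
            # Traffic policing: discard the excess request and signal the client.
            return {"status": 429, "headers": {"Retry-After": "1"}}

        # Traffic shaping: defer the request and hand back a job ID the client can poll.
        job_id = str(uuid.uuid4())
        work_queue.put((job_id, request))
        return {"status": 202, "body": {"job_id": job_id}}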

There are multiple algorithms for rate-limiting (see [3][9]), some of which we will explore below.

Token Bucket (Leaky Bucket) Algorithm

Note
There are two implementations of the leaky bucket: one used to police traffic and one used to shape it. The traffic-shaping variant is conceptually the same as the token bucket described below; the two amount to different mental models of the same algorithm [10].

The token bucket algorithm is used to police or shape traffic to conform to a desired level of "burstiness" and average request rate [11][10]. This strategy limits each request by the resources (e.g. database connections, CPU time) it requires such that overall traffic is limited on the basis of available system capacity rather than simple volume of requests [3].

The algorithm is based on the analogy of a bucket filled with tokens. Requesters must retrieve some number of tokens from the bucket based on the capacity they wish to utilize. If the bucket has enough tokens, the requester removes the tokens and computes a response. If the bucket has too few tokens, the request is rejected or enqueued for later computation depending on implementation. At a fixed rate, tokens are added back to the bucket, but any excess tokens beyond the bucket’s capacity are simply discarded [3][12][10]. The capacity of the bucket therefore determines the maximum instantaneous capacity that may be utilized by requests, and the rate at which tokens are added governs the long-term average utilization [12][10][11].
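
A minimal token bucket sketch follows. The class and method names are assumptions for illustration; costs are expressed in abstract tokens, as described above, so that a request which needs more capacity must acquire more tokens.

    import time

    class TokenBucket:
        def __init__(self, capacity, refill_rate):
            self.capacity = capacity           # maximum burst size, in tokens
            self.refill_rate = refill_rate     # tokens added per second (long-term average rate)
            self.tokens = capacity
            self.last_refill = time.monotonic()

        def _refill(self):
            now = time.monotonic()
            elapsed = now - self.last_refill
            # Add tokens at the fixed rate, discarding any beyond the bucket's capacity.
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
            self.last_refill = now

        def try_acquire(self, cost=1):
            """Deduct `cost` tokens and return True if available; otherwise return False
            so the caller can reject or enqueue the request."""
            self._refill()
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

For example, a request that needs two database connections might call try_acquire(cost=2), while a cheap read might call try_acquire(cost=1).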

In telecommunications contexts, token buckets are often used to limit the average rate at which bits flow into a network with limited bandwidth [12][10][11]. The literature describes such a token bucket algorithm with the equation CIR = CBS / T [11], where:

  • CIR, the committed information rate or mean rate, is the average rate at which bits are forwarded to the network;

  • CBS, the committed burst size, is the maximum volume of bits that may be submitted at once during the time period T; and

  • T is the maximum duration of a burst during which the network is fully utilized.

The capacity of a token bucket is represented by CBS; clients of the bucket can never request more than CBS tokens at a single time. Tokens are returned to the bucket at the rate CIR, which by definition means all CBS tokens are replenished every interval T. Therefore in the worst case, clients can request at most CBS tokens every interval T, which over the long-term conforms to the desired average request rate CIR.
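
To make the relationship concrete with illustrative figures (not drawn from the cited sources): a bucket with CBS = 1,000,000 bits replenished over T = 0.1 s yields CIR = 1,000,000 / 0.1 = 10,000,000 bits per second. A sender may burst up to one megabit at any instant, but over any longer window it cannot average more than ten megabits per second.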

Examples

See the following examples:

Retry

A Retry is a fault-tolerance mechanism that helps clients make successful requests despite interference from transient faults.

Retries help clients carry out requests against remote services even when short-lived, "transient" faults might otherwise prevent them from doing so [13]. These faults, like "the momentary loss of network connectivity to components and services, the temporary unavailability of a service, or timeouts that occur when a service is busy," frequently go away on their own [14]. If clients re-attempt requests that fail due to these faults, they are therefore likely to eventually succeed [14]. As a consequence, retry mechanisms improve a client’s odds of successfully communicating with a service at the cost of issuing additional requests, which comes with a number of special considerations, outlined below.

Design Considerations

Increased and Synchronized Traffic

Because retry mechanisms send more traffic to services in response to faults, they can place additional strain on already struggling services and hamper recovery [13][14].

Retry traffic load is managed using backoff, a delay between client retry attempts. These delays help to distribute client retry traffic over time, lessening the risk of overloading a service [14]. However, retry-with-backoff can generate thundering herds when many clients synchronize their retry schedule. The resulting waves of traffic can exacerbate service degradation and cause additional faults [13][15].

Introducing jitter, or randomization, into the retry delay smooths these problematic spikes of traffic into a more uniform request pattern [13][21]. As demonstrated in [21], randomization prevents clients from synchronizing, producing a near-uniform rate of requests instead.
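
The sketch below combines exponential backoff with "full" jitter, where the client sleeps a random duration between zero and the current backoff ceiling; the function name and parameter defaults are illustrative assumptions, not taken from the cited sources.

    import random
    import time

    def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=10.0):
        """Retry `operation` on failure, sleeping a randomized, exponentially growing delay."""
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    raise                                  # give up after the final attempt
                ceiling = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, ceiling))     # jitter: sleep a random fraction of the ceiling

Capping max_delay and max_attempts also bounds how long the client can spend waiting, a concern discussed under Client Waiting below.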

Even when backoff and jitter are designed into a retry mechanism, the resulting retry traffic may nonetheless worsen the condition of service providers that have entered a degraded state. Consider instrumenting at-risk services with circuit breakers [13][16] or rate limiters [13] to manage this risk.

Note

(TODO) Because backoff distributes request load over time into a near-uniform rate of request, it plays a very similar role to the rate limiter, which enforces a constant request rate from clients by shedding excess requests. At time of writing, I do not have references which indicate how the two differ.

It may be that rate limiters in the literature are assumed to enforce static rates. Suppose a service with a static rate limiter becomes degraded for some unforeseen reason (e.g. database queries are suddenly slow), and that causes the rate at which it can sustainably service requests to drop below the enforced rate. The service then attempts to serve more traffic than it can; essentially, the static rate limit no longer reflects the actual capacity of the system, allowing it to degrade further. Retry-with-backoff is dynamic in that the client varies its request rate over time in response to actual service capacity. Therefore retry-with-backoff may serve as a second layer of defense when static rate limiters fail.

Put another way, rate limiters shed load to prevent service degradation, whereas retries-with-backoff slow down in response to observed service load so that services can recover.

See Netflix’s discussion of adaptive concurrency limits.

Kinds of Backoff

Retries can be made immediately or after a possibly growing delay. While [21] considers exponential backoff to be a good standard, it and many other sources do not identify the properties of different backoff protocols or when best to use them. The choice is significant, however, since backoff impacts service load, perceived availability, and perceived responsiveness. For example, the Twitter API instructs developers to use no backoff, linear backoff, and exponential backoff on a case-by-case basis [20] (a sketch of these schedules follows the quote):

Once an established connection drops, attempt to reconnect immediately. If the reconnect fails, slow down your reconnect attempts according to the type of error experienced:

  • Back off linearly for TCP/IP level network errors. These problems are generally temporary and tend to clear quickly. Increase the delay in reconnects by 250ms each attempt, up to 16 seconds.

  • Back off exponentially for HTTP errors for which reconnecting would be appropriate. Start with a 5 second wait, doubling each attempt, up to 320 seconds.

  • Back off exponentially for HTTP 429 errors ("Rate limit exceeded"). Start with a 1 minute wait and double each attempt. Note that every HTTP 429 received increases the time you must wait until rate limiting will no longer be in effect for your account.
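
As a rough illustration of the quoted guidance, the sketch below computes the delay before a given reconnect attempt. The constants come directly from the quote; the function itself is hypothetical and not part of any Twitter SDK.

    def reconnect_delay(attempt, error_kind):
        """Seconds to wait before reconnect attempt `attempt` (0-based) after an immediate retry fails."""
        if error_kind == "tcp":
            return min(16.0, 0.250 * (attempt + 1))    # linear: +250 ms per attempt, capped at 16 s
        if error_kind == "http":
            return min(320.0, 5.0 * (2 ** attempt))    # exponential: 5 s doubling, capped at 320 s
        if error_kind == "http_429":
            return 60.0 * (2 ** attempt)               # exponential: 1 minute doubling per the quote
        raise ValueError(f"unknown error kind: {error_kind}")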

Client Waiting

Backoff protocols such as exponential backoff grow rapidly and can cause clients to wait a long time both between requests and in total [13]. Both [16] and [13] recommend capping the maximum delay between requests and the maximum number of retry attempts to avoid undesirable waiting.

Idempotence

It may not be safe to retry failed requests against an API that causes side effects, such as database state changes. A service might successfully create side effects even when a request fails overall; a subsequent attempt could trigger those effects again [13][14]. If repeated side effects are not desirable, design APIs that are idempotent instead, and only issue retries to idempotent APIs [13][14].

Example

Suppose a client issues a bank API request to increment the value of an account by $100. The service successfully increments the value of the account, but the request fails overall due to an unrelated error (e.g. a network error). The client inappropriately retries the request, and as a result, increments the account by $200 cumulatively, not $100 as intended. Because the bank API is not idempotent, it is not safe to retry failed requests against it.
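
One common way to make such an API safe to retry is to require an idempotency key: the client attaches a unique key to each logical operation, and the service applies the side effect at most once per key. The sketch below is an assumption for illustration (in-memory storage, hypothetical function names), not the design of the bank API above.

    _processed = {}   # idempotency key -> result of the first successful application

    def increment_account(account, amount, idempotency_key):
        """Apply the increment at most once per idempotency key."""
        if idempotency_key in _processed:
            return _processed[idempotency_key]   # a retried request returns the recorded result
        account["balance"] += amount             # the side effect happens exactly once
        result = {"balance": account["balance"]}
        _processed[idempotency_key] = result
        return result

    # A client retrying with the same key cannot increment the account twice:
    acct = {"balance": 0}
    increment_account(acct, 100, "req-42")
    increment_account(acct, 100, "req-42")       # retried request; balance remains 100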

Compounding Retries

When multiple services within a request path each use retry, a request to a service early in the path can multiply into a substantial number of requests that place additional strain on the service near the end [13][14]. For example, if the first service makes 3 retries to the second, which makes 3 retries to the third, the result may be as many as 9 requests to the third service. Limit retry logic to only one point in the stack to avoid the issue [13], or only introduce retry when the consequences of failure are well-understood [14].

Transient and Non-Transient Faults

Because a Retry only helps a system recover from transient faults, retries must have selective triggers. Retries should generally not trigger in response to client errors (e.g. HTTP 4xx-series responses) since those are unlikely to ever succeed, but generally should trigger in response to server errors (e.g. HTTP 5xx-series responses) because those may succeed on a subsequent attempt [13]. This is contextual, however; Microsoft notes that internal server errors (e.g. HTTP 500 responses) may be caused by business-logic defects, and these types of failures would not succeed on retried attempts. Retry trigger logic is further complicated in an eventually-consistent environment where otherwise chronic errors may resolve as state propagates throughout a system [13].
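
A selective trigger might therefore classify responses by status code, as in the sketch below. The classification follows the general guidance in this section and is an assumption that would need adjusting for context (for example, the HTTP 500 caveat above).

    def should_retry(status_code):
        """Selective retry trigger: server errors may succeed later; most client errors will not."""
        if status_code == 429:               # rate limited: retry after backing off (see Rate Limiter)
            return True
        if 500 <= status_code < 600:         # server errors: may be transient (but see the HTTP 500 caveat)
            return True
        return False                         # other 4xx client errors are unlikely to ever succeed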

Timeout

A timeout limits the amount of time one process spends waiting on another [17][18]. When, for example, a web service fails, it may do so without responding to its clients in any way [17]. A client could wait indefinitely for a response, but by using a timeout, it can instead assume the service has failed [17] and respond to the error with a fallback (TODO) behavior [18]. By preempting latent requests and using fallback behavior, a client can stay responsive to its own clients and reallocate resources that would otherwise be dedicated to an unresponsive request [19].

Clients that utilize timeouts must be aware that timed-out requests may still succeed in the backing service: an operation can both succeed and time out [18]. For example, suppose a web service experiences unusually high load and becomes latent. Though it still successfully handles all requests, it may fail to do so within the configured timeout of its clients. The clients will receive timeouts even though the service completes requests normally.
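
A minimal sketch of a timeout with a fallback, using Python's standard library, is shown below. The one-second budget and the FALLBACK value are illustrative assumptions.

    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    FALLBACK = {"source": "cache", "data": None}     # e.g. cached data or a default value

    _pool = ThreadPoolExecutor(max_workers=4)        # shared pool for outbound calls

    def fetch_with_timeout(call, timeout=1.0):
        """Wait at most `timeout` seconds for `call`; fall back instead of waiting indefinitely."""
        future = _pool.submit(call)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            # The call may still complete successfully in the background;
            # the timeout only limits how long this client waits for it.
            return FALLBACK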

References

Useful Libraries

  • Resilience4j implements a number of useful fault-tolerance mechanisms like circuit breakers, bulkheads, and retry mechanisms. Resilience4j is inspired by Netflix’s Hystrix library, which is no longer actively maintained but provides similar configuration-based mechanisms. As stated in the Hystrix README, Netflix is moving away from the configuration-heavy strategies employed by Resilience4j and Hystrix towards research on adaptive concurrency limits.