When Traffic Server is used as a delivery server for a live video streaming event, a huge number of concurrent requests can arrive for the same object. Depending on the type of object requested, the cache lookup can return either a stale copy of the object (e.g. manifest files) or a complete cache miss (e.g. segment files). ATS currently supports different types of connection collapse (e.g. the read-while-writer functionality, the stale_while_revalidate plugin, etc.), but read-while-writer can only kick in once the complete response headers for the object have been received and validated. Until that happens, every incoming request for the same object that results in a cache miss or a stale hit is forwarded to the origin. In a scenario such as a live event, this leaves a window long enough for hundreds of requests for the same object to be forwarded to the origin. This is commonly known as a Thundering Herd. It has been observed in production that this significantly increases latency for the objects waiting in the read-while-writer state.
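To give a feel for the size of that window, here is a minimal back-of-the-envelope sketch. It assumes a steady arrival rate and that every request arriving before the origin's response headers are validated is forwarded. The function name and the numbers are illustrative assumptions, not measurements from ATS.

```python
def requests_forwarded_before_collapse(arrival_rate_per_s: int,
                                       header_latency_ms: int) -> int:
    """Estimate how many requests for one object reach the origin.

    Assumption (not from ATS itself): requests arrive at a steady rate and
    read-while-writer can only engage once the response headers have been
    received and validated, header_latency_ms after the first request.
    """
    # The first request is always forwarded; the rest arrive inside the window.
    return 1 + (arrival_rate_per_s * header_latency_ms) // 1000


# Illustrative numbers only: 2000 requests/s and 200 ms until headers arrive.
herd_size = requests_forwarded_before_collapse(2000, 200)
```

Under these assumed numbers, a few hundred requests for the same object would reach the origin before any collapsing takes effect.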
Note that there are also a couple of related settings, open-read-retry-timeout and max-open-read-retries, that can alleviate the thundering herd to some extent by retrying to acquire the read lock for the object. With these settings configured, ATS retries acquiring the read lock for the configured number of attempts, each with the configured duration. If the read lock is still not available (because the write lock is held by the first request that was forwarded to the origin, e.g. because the response headers have not been received yet), all the waiting requests are simply forwarded to the origin (with the cache disabled for each of them).
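The retry behaviour described above can be sketched roughly as follows. This is a simplified model, not the ATS implementation: the function name, the `try_read_lock` callback, and the injectable `sleep` are hypothetical, and the two parameters merely mirror the settings named in the text.

```python
import time
from typing import Callable


def resolve_after_read_retries(try_read_lock: Callable[[], bool],
                               max_open_read_retries: int,
                               open_read_retry_timeout_s: float,
                               sleep: Callable[[float], None] = time.sleep) -> str:
    """Return "cache" if the read lock is acquired, otherwise "origin".

    Hypothetical model: each attempt tries the read lock once and, on
    failure, waits for the configured duration before the next attempt.
    """
    for _ in range(max_open_read_retries):
        if try_read_lock():
            return "cache"  # the writer has progressed far enough to share the object
        sleep(open_read_retry_timeout_s)
    # Retries exhausted: the request bypasses the cache and goes to the origin.
    return "origin"
```

The sketch makes the tuning problem visible: the waiters are protected only while the number of retries multiplied by the retry duration exceeds the time the origin needs to return its headers. Once that budget runs out, every waiting request falls through to the origin at about the same time.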
It is almost impossible to tune the above settings so that they help in all possible situations (traffic, concurrent connections, network conditions, etc.). For this reason, a configurable workaround is proposed below that avoids the thundering herd completely. Justin Laue and Phil Sorber developed the initial proof-of-concept patch for this solution. It is now being discussed whether to turn it into a generic stale-while-revalidate solution within the core and possibly deprecate the existing plugin solution, which doesn't work well due to the write lock issue.
The solution being discussed/developed is as follows (tracked in TS-3549, "Configurable option to avoid thundering herd due to concurrent requests for the same object", status: Closed):
On failing to obtain a write lock for an object (which means there is likely another ongoing parallel request for the same object that was forwarded to the origin):
- If it's a regular cache miss, then, based on a configured setting, a 502 error is returned so that the client (e.g. a player) can retry. The 502 error also includes a special internal ATS header named @Ats-Internal with an appropriate value, to allow for custom logging or for plugins to take appropriate action (e.g. to prevent a fail-over if there is a plugin that fails over on a regular 502 error).
- If it's a cache refresh miss, one of the actions below is taken, depending on the configuration:
  *) The stale copy of the object is served if it is a valid cacheable response and is still within max_stale_age.
  *) If the stale copy cannot be returned, either because its age is greater than max_stale_age or because the object is otherwise non-cacheable, etc., then either:
    a) a 502 error is returned with an @Ats-Internal header explaining the reason, or
    b) the request is forwarded to the origin.
  *) Honor the stale-while-revalidate extensions of the Cache-Control header.
  *) Return the stale copy even for the first client request that requires a cache refresh.
Currently, this solution also does not support the Stale-If-Error functionality.
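The decision flow on a write lock failure can be sketched as below. This is only an illustration of the first three branches described above (502 on a miss, serve stale on a refresh miss, and the 502-or-forward fallback). All type names, policy flags, and the reason strings placed in @Ats-Internal are hypothetical, since the text does not specify the actual configuration values or header contents.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    FORWARD_TO_ORIGIN = "forward_to_origin"
    RETURN_502 = "return_502"
    SERVE_STALE = "serve_stale"


@dataclass
class Policy:  # hypothetical flags standing in for the configured setting
    error_on_miss: bool
    serve_stale_on_refresh: bool
    error_when_stale_unusable: bool
    max_stale_age_s: int


@dataclass
class CachedObject:
    age_s: int
    cacheable: bool


@dataclass
class Decision:
    outcome: Outcome
    ats_internal_reason: Optional[str] = None  # placeholder @Ats-Internal value


def on_write_lock_failure(cached: Optional[CachedObject], policy: Policy) -> Decision:
    """Choose an action after failing to obtain the write lock for an object."""
    if cached is None:
        # Regular cache miss: optionally ask the client to retry with a 502.
        if policy.error_on_miss:
            return Decision(Outcome.RETURN_502, "write-lock-failed-on-miss")
        return Decision(Outcome.FORWARD_TO_ORIGIN)
    # Cache refresh miss: prefer the stale copy when it is still usable.
    usable = cached.cacheable and cached.age_s <= policy.max_stale_age_s
    if policy.serve_stale_on_refresh and usable:
        return Decision(Outcome.SERVE_STALE)
    if policy.error_when_stale_unusable:
        return Decision(Outcome.RETURN_502, "stale-copy-unusable")
    return Decision(Outcome.FORWARD_TO_ORIGIN)
```

For example, with all three flags enabled and a `max_stale_age_s` of 60, a cacheable object that is 10 seconds old would be served stale, while one that is 120 seconds old would produce a 502 carrying the placeholder reason.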