Cascading Recoveries and Loops: The Hidden Risks of Cross-System Interactions

#Tech

This post examines interactions between systems and components, and how those interactions can lead to "cascading recovery" problems.

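When one system recovers, its recovery behavior can have unexpected negative effects on the systems that depend on it, and can even form a vicious cycle.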

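The post illustrates the phenomenon in detail with a Producer-Consumer System case study, covering problems such as missing data, recomputation, and increased load on the message bus.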

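To address these problems, the post recommends an Additive Recovery pattern, combined with Budget Recovery and Stream Shaping to avoid peak resource load and thereby keep the system from becoming unstable.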
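To make the idea of budgeting recovery work concrete, here is a minimal Python sketch. It is not taken from the original post: the function name `replay_schedule` and all of its parameters are hypothetical, and the model is deliberately simplistic (fixed live traffic, fixed downstream capacity, discrete ticks). It only shows how capping backlog replay per tick keeps the downstream system below its capacity instead of flushing the whole backlog at once.

```python
def replay_schedule(backlog_size, live_rate, downstream_capacity, recovery_budget):
    """Return the number of items sent downstream on each tick until the backlog drains.

    Hypothetical model: live traffic is always sent first, and backlog replay
    is limited to the smaller of the recovery budget and the spare capacity.
    """
    spare = downstream_capacity - live_rate
    if recovery_budget <= 0 or spare <= 0:
        raise ValueError("backlog can never drain with these settings")

    sent_per_tick = []
    remaining = backlog_size
    while remaining > 0:
        # Replay only what the budget and the spare capacity both allow.
        replay = min(remaining, recovery_budget, spare)
        remaining -= replay
        sent_per_tick.append(live_rate + replay)
    return sent_per_tick


if __name__ == "__main__":
    # Unshaped recovery would send live traffic plus the whole backlog in one tick.
    unshaped_peak = 60 + 100
    shaped = replay_schedule(backlog_size=100, live_rate=60,
                             downstream_capacity=100, recovery_budget=20)
    print("unshaped peak:", unshaped_peak)
    print("shaped per-tick load:", shaped)
```

In this sketch the shaped recovery takes longer to drain the backlog, but the per-tick load never exceeds the assumed downstream capacity. That is the trade-off the budgeting approach accepts: slower recovery in exchange for not overloading the dependent system.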

Opening of the original post (English, first 3 paragraphs only)

My last metastable blog post discussed the interactions between systems and components and how they can lead to metastable failures. Specifically, I looked at interactions between systems/components and how signals can be misinterpreted by different systems due to ambiguity: a timeout may mean a transient fault that can be fixed by retrying, but it may also mean a catastrophic overload where retrying only makes things worse. This leads to an important realization: some common actions that systems take in response to signals can be good under some assumptions and bad under others.

Last month, Todd Porter from Meta, Dave Maier from Portland State, and I gave a talk at SREcon Americas on Metastability in Recovery. This talk discusses how cross-system interactions and ambiguous assumptions can make recovery of a system of systems impossible due to cascading amplifications and feedback loops.

It is a common fact that failures can spread. A lot of engineering effort is often put into siloing/compartmentalizing systems to reduce the blast radius. But if failures can spread, so do the recoveries from failures. Failures (often) spread visibly — as one system starts to affect another, alarms begin to sound on the newly affected system. The recoveries spread differently and often more silently. Even if we successfully isolate failures and prevent them from propagating directly to other components, it does not mean that recoveries from these failures will not spread. Consider two systems: A and B; system A experiences a problem, but the problem remains isolated to A. However, system B usually receives its work from A, and while A is failing, it sends less work to B. The recovery of A, however, can be rather unexpected to B — it now starts receiving all the work, and maybe more, as A’s recovery tries to process and ship the backlog to B.

※ For copyright reasons, only the first 3 paragraphs are quoted. Please read the original post for the full content.
