Experiencing failure at scale is, as the popular Marvel character Thanos would say, “Inevitable”. Memory leaks, software bugs, hardware faults and network I/O failures are just a few examples. It’s a problem of simple mathematics: the probability of failure rises as the total number of operations performed increases, and every component added to scale the application raises it further.

So how do you tackle this so-called “Inevitable” problem that comes with scaling? A common-sense approach is to figure out what can go wrong. Teams that build on the cloud and do not run their own datacenter will most commonly encounter failures caused by single points of failure, an unreliable network, slow processes and unexpected load.

Let’s dive into the single point of failure first. A single point of failure in a distributed system is the equivalent of Thanos’s snap in the popular Marvel movie Infinity War: it’s that one component that brings down the entire system with it. In reality, distributed systems can have multiple single points of failure.

A service that needs to read from a non-replicated database to boot up is an example of a single point of failure: if the database is not reachable, the service won’t start.

Another example is a service that exposes an HTTP API over TLS using a certificate that has to be renewed manually. If, for whatever reason, the certificate is not renewed before it expires, most clients will no longer be able to access the service.
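
To make the certificate example concrete, a small monitoring script can catch an expiring certificate before it turns into an outage. This is a minimal sketch, assuming a Python environment; the host name is a placeholder, not part of the service described above.

```python
import ssl
import socket
import time

def days_until_cert_expiry(host: str, port: int = 443) -> float:
    """Connect to host:port over TLS and return how many days remain on its certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'notAfter' is the certificate's expiry date as a string; convert it to epoch seconds.
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400

if __name__ == "__main__":
    # "example.com" is a placeholder host; alert well before the certificate lapses.
    remaining = days_until_cert_expiry("example.com")
    if remaining < 30:
        print(f"TLS certificate expires in {remaining:.0f} days -- renew it now")
```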

Role of HPC in Failures in Distributed Systems

There are several ways in which HPC systems can contribute to failures in distributed systems. One possibility is that the HPC system itself fails, due to either hardware or software problems. This can disrupt the operation of the distributed system, since the other computers in the system may rely on the HPC system to perform certain tasks.

Another possibility is that the communication channels between the computers in the distributed system may fail, disrupting the flow of information between them. This can occur due to problems with the network infrastructure, such as cable breaks or hardware failures.

Finally, it is also possible for the distributed system to fail due to problems with the software that is being used to coordinate the activities of the computers. This could include bugs in the software, or problems with the way that the software is configured.

In general, HPC systems can play a critical role in the operation of distributed systems, and their failure can have significant consequences. It is important to carefully design and maintain HPC systems to ensure that they are reliable and able to support the operation of the distributed systems that rely on them.

So what do you do to prevent it?

Identify the single points of failure while the system is being architected, before they can cause any harm. Examine every component of the system and ask what would happen if it failed. Some single points of failure can be removed by introducing redundancy; others can’t, and for those the goal is to reduce the blast radius when things go wrong.
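
As a rough illustration of the redundancy idea, here is a minimal sketch of a client that falls back to a second replica when the first one is unreachable. The endpoint URLs and the `fetch_user` helper are hypothetical, purely for illustration.

```python
import urllib.request
import urllib.error

# Hypothetical replicas of the same read-only API; any one of them can serve the request.
REPLICAS = [
    "https://users-primary.internal.example/api/users/42",
    "https://users-replica.internal.example/api/users/42",
]

def fetch_user() -> bytes:
    """Try each replica in turn so that no single endpoint is a single point of failure."""
    last_error = None
    for url in REPLICAS:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # remember the failure and try the next replica
    raise RuntimeError("all replicas failed") from last_error
```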

The Nebula or Unreliable Network

Similar to Nebula, who betrays her father’s trust in Marvel’s Endgame, an unreliable network can fail you without being spotted. Say a client makes a remote network call: it sends a request to a server and expects a response some time later. Ideally, the client receives that response shortly after sending the request. But what if the response is badly delayed?

Now what are the client’s options? Wait for a response that may never arrive or fail the request with an error or an exception.

Just like Nebula, slow network calls are silent killers of distributed systems. Because the client doesn’t know whether the response will ever show up, it can spend a long time waiting before giving up, if it gives up at all. The wait can in turn cause degradations that are extremely hard to debug.
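
In practice, the client should put an explicit upper bound on how long it is willing to wait. The snippet below is a minimal sketch using Python’s standard library; the URL and the timeout value are placeholders, not recommendations.

```python
import urllib.request
import urllib.error

REQUEST_TIMEOUT_SECONDS = 3  # an explicit upper bound on how long we are willing to wait

try:
    # "example.com" is a placeholder endpoint used only for illustration.
    with urllib.request.urlopen("https://example.com/health",
                                timeout=REQUEST_TIMEOUT_SECONDS) as response:
        body = response.read()
except TimeoutError:
    # The response never arrived in time; fail fast instead of waiting forever.
    print("request timed out, failing fast")
except urllib.error.URLError as exc:
    # The network call failed outright (DNS error, connection refused, ...).
    print(f"request failed: {exc.reason}")
```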

Slow Processes


Slow processes are like the many scenes in Marvel where you thought the heroes almost had the bad guy and had won, only to be let down.

What’s the difference between a slow process and a broken process? Well, from an observer’s point of view, none: neither is performing any useful work.

The prime cause of slow processes is resource leaks, and memory leaks are arguably the most common kind. How does one happen? A memory leak manifests itself as a steady increase in memory consumption over time. Runtimes with garbage collection don’t help either: if a reference to an object that is no longer needed is kept somewhere, the object won’t be deleted by the garbage collector.

So a memory leak keeps consuming memory until it exhausts it, at which point the OS starts constantly swapping memory pages to disk while the garbage collector runs ever more frequently, trying to release any shred of memory it can. The constant paging and the garbage collector eating up CPU cycles slow the process down further. When there is no more physical memory left and no more space in the swap file, the process can no longer allocate memory and most operations fail.
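
Here is a minimal sketch of how such a leak can happen even in a garbage-collected language: an ever-growing module-level cache keeps references alive, so the collector can never reclaim them. The cache and handler names are made up for illustration.

```python
# A module-level cache that is never evicted: every entry stays reachable,
# so the garbage collector can never reclaim it, and memory grows forever.
_RESPONSE_CACHE: dict[str, bytes] = {}

def handle_request(request_id: str, payload: bytes) -> bytes:
    result = payload.upper()              # stand-in for real work
    _RESPONSE_CACHE[request_id] = result  # leak: nothing ever removes old entries
    return result
```

Bounding the cache, for example with an LRU policy or a time-to-live, removes this particular leak.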

While memory is the most well-known, it is just one of many resources that can leak. For example, when using a thread pool, a thread can be lost if it blocks on a synchronous call that never returns: if a thread makes a synchronous, blocking HTTP call without setting a timeout and the call never returns, the thread is never given back to the pool. Since the pool has a fixed size, continuously losing threads eventually leaves it with no threads at all.
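
The sketch below illustrates that failure mode with Python’s `concurrent.futures`; the blocking call is simulated and the pool size is deliberately tiny, and the script intentionally hangs to make the point. The fix, bounding the blocking call with a timeout, is noted in the comment.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def blocking_call_without_timeout() -> None:
    # Stands in for a synchronous HTTP call made without a timeout that never returns.
    # The real fix is to bound the call itself, e.g. client.get(url, timeout=5).
    threading.Event().wait()  # blocks forever

pool = ThreadPoolExecutor(max_workers=2)  # a deliberately tiny, fixed-size pool

# Each hung call permanently occupies one worker thread...
pool.submit(blocking_call_without_timeout)
pool.submit(blocking_call_without_timeout)

# ...so this perfectly healthy task sits in the queue forever and never runs.
future = pool.submit(lambda: "useful work")
print(future.done())  # False, and it will stay that way
```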

You might think that making asynchronous calls rather than synchronous ones would mitigate the problem. But modern HTTP clients use socket pools to avoid recreating TCP connections and paying a hefty performance fee. If a request is made without a timeout, the connection is never returned to the pool, and since the pool has a limited size, eventually there won’t be any connections left to communicate with the host.
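
As a rough sketch using the third-party `requests` library (one possible HTTP client, not something prescribed here), a bounded connection pool combined with per-request timeouts keeps a hung request from silently draining the pool. The URL and timeout values are placeholders.

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# A small, fixed-size connection pool: without timeouts, a handful of hung
# requests would be enough to exhaust it.
session.mount("https://", HTTPAdapter(pool_connections=1, pool_maxsize=10, pool_block=True))

try:
    # (connect timeout, read timeout): the connection always comes back to the pool,
    # either with a response or with a timeout error. "example.com" is a placeholder.
    response = session.get("https://example.com/api", timeout=(3.05, 10))
except requests.exceptions.Timeout:
    print("request timed out; the connection was not lost forever")
```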

On top of all that, the code you write isn’t the only thing accessing memory, threads and sockets. The libraries your application depends on use the same resources, and they can do all kinds of shady things with them. Without digging into their implementation, assuming the source is available in the first place, you can’t be sure whether they will wreak havoc or not.

Unexpected Load

Every system has a limit to how much load it can withstand without scaling. Depending on how the load increases, you are bound to hit that brick wall sooner or later. But an organic increase in load, which gives you time to scale your service out accordingly, is one thing; a sudden and unexpected spike is quite another.

For example, consider the number of requests received by a service in a period of time. The rate and the type of incoming requests can change over time, and sometimes suddenly, for a variety of reasons:

The requests might have a seasonality: depending on the hour of the day, the service gets hit by users in different countries.

Some requests are much more expensive than others and abuse the system in ways you didn’t anticipate, like scrapers slurping up data from your site at superhuman speed.

Some requests are malicious, such as DDoS attacks that try to saturate your service’s bandwidth, denying legitimate users access to it.
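
One common line of defense against sudden spikes is to reject excess requests before they pile up. The token-bucket rate limiter below is a minimal sketch of that idea; the capacity and refill rate are made-up numbers.

```python
import time

class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float) -> None:
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request instead of letting it queue up

# Example: allow bursts of 100 requests, sustained 50 requests per second.
limiter = TokenBucket(capacity=100, rate=50)
if not limiter.allow():
    print("429 Too Many Requests")  # reject early rather than degrade for everyone
```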

Cascading Failures

You would think that if your system has hundreds of processes, it shouldn’t make much of a difference if a small percentage are slow or unreachable. The thing about faults is that they tend to spread like cancer, propagating from one process to another until the whole system is brought to its knees. This effect is referred to as a cascading failure: a failure in one portion of the system increases the probability that other portions fail.

For example, suppose there are multiple clients querying two database replicas A and B, which are behind a load balancer. Each replica is handling about 50 transactions per second.

Suddenly, replica B becomes unavailable because of a network fault. The load balancer detects that B is unavailable and removes it from its pool. Because of that, replica A has to pick up the slack for replica B, doubling the load it was previously under.

As replica A starts to struggle to keep up with the incoming requests, the clients experience more failures and timeouts. In turn, they retry the same failing requests several times, adding insult to injury.
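
Clients can avoid piling onto a struggling replica by retrying with capped exponential backoff and jitter instead of hammering it immediately. The sketch below is illustrative only; the retry counts, delays and `query_replica` callable are assumptions, not part of the example above.

```python
import random
import time

def query_with_backoff(query_replica, max_attempts: int = 4) -> bytes:
    """Retry a failing call with exponential backoff and jitter instead of retrying immediately."""
    for attempt in range(max_attempts):
        try:
            return query_replica()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up instead of retrying forever
            # Exponential backoff (0.1s, 0.2s, 0.4s, ...) capped at 2s, with jitter
            # so that many clients don't all retry in lockstep.
            backoff = min(2.0, 0.1 * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))
```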

Eventually, replica A is under so much load that it can no longer serve requests promptly and becomes, for all intents and purposes, unavailable, causing it to be removed from the load balancer’s pool as well. In the meantime, replica B comes back online and the load balancer puts it back in the pool, at which point it’s flooded with requests that kill it almost instantly. This feedback loop of doom can repeat several times.

Cascading failures are very hard to get under control once they have started. The best way to mitigate one is to not have one in the first place, by stopping the cracks in one service from propagating to others.
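
One well-known way to stop the cracks from spreading is a circuit breaker: after enough consecutive failures, the client stops calling the struggling dependency for a while, giving it room to recover. Below is a minimal, illustrative sketch; the thresholds are arbitrary and this is not a production-ready implementation.

```python
import time

class CircuitBreaker:
    """Fail fast for `reset_timeout` seconds after `failure_threshold` consecutive failures."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn):
        # While the breaker is open, fail fast instead of adding load to the dependency.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial request through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the breaker again
        return result
```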

Redundant MLOps in Slow Processes and Cascading Failures

While redundant MLOps can help to ensure that machine learning processes are able to continue running smoothly, they can also have an impact on the speed of those processes.

One way in which redundant MLOps can affect the speed of a process is by adding extra overhead. For example, if multiple systems are performing the same task, they may need to communicate with each other to coordinate their activities, which can take time and slow down the overall process. In addition, having multiple systems performing the same task can lead to duplication of effort, which can also contribute to slower overall performance.

Another way in which redundant MLOps can impact the speed of a process is by introducing bottlenecks. If one of the redundant systems is slower than the others, it may act as a bottleneck, causing the overall process to slow down.

If redundant MLOps are not properly designed and managed, they can also lead to cascading failures: if one of the redundant systems fails and the others are unable to compensate for the lost capacity, the overall process can fail, leading to disruption and downtime.

Overall, while redundant MLOps can help to improve the reliability and availability of machine learning processes, they can also have an impact on their speed and can introduce additional complexity. It is important to carefully consider the trade-offs involved in using redundant MLOps and to carefully design and manage these systems to ensure that they are effective and efficient.