No Free Lunch Theorem for Public Clouds

I recently learned that the common saying, “There is no free lunch,” is not just an expression people use but also a theorem in some fields of mathematics. For example, the “No Free Lunch Theorems for Optimization” [1,2] state, in simple terms, that if an algorithm performs well on a certain class of problems, it pays for that with degraded performance on all the remaining problems.
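For reference, the formal result in [1] (their Theorem 1) says that, averaged over all possible cost functions, every search algorithm performs the same; stated loosely in their notation:

$$\sum_f P\big(d^y_m \mid f, m, a_1\big) \;=\; \sum_f P\big(d^y_m \mid f, m, a_2\big)$$

where $f$ ranges over all cost functions, $a_1$ and $a_2$ are any two search algorithms, and $d^y_m$ is the sequence of cost values seen after $m$ distinct evaluations. Any advantage on one subset of functions is exactly offset elsewhere.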

It got me thinking about what an equivalent result for cloud computing, especially public clouds, might be. It is well known that public clouds provide a pay-as-you-go model of computing with ease of use and elasticity. One doesn’t have to worry about making all the architectural decisions, deploying the infrastructure and running layers of cloud software to create a cloud.

 Two of the most prevalent concerns with public clouds are:

  1.  Meeting the security and compliance requirements of an enterprise
  2.  Lack of integration with local data and other systems

In addition, there are other fundamental trade-offs that one should be aware of. Public clouds are, by definition, designed for large scale – a typical deployment within a datacenter consists of tens of thousands of compute servers spanning multiple racks, pods and rows. Similarly, a scalable storage layer is deployed to provide block and object storage functionality. All these clusters, each serving a specific use case, are stitched together using a fat-tree or spine-leaf datacenter network topology.
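To see why scale translates into network distance, here is a toy sketch (with made-up coordinates, not any provider's real placement metadata) that counts the switches traversed between two servers in an idealized three-tier fat-tree; every extra switch adds latency:

```python
def fat_tree_switch_hops(server_a, server_b):
    """Count switches traversed between two servers in an idealized
    three-tier fat-tree (top-of-rack -> aggregation -> core).

    Servers are identified by illustrative (pod, rack) coordinates;
    this is a toy model, not a real cloud placement API.
    """
    pod_a, rack_a = server_a
    pod_b, rack_b = server_b
    if (pod_a, rack_a) == (pod_b, rack_b):
        return 1   # same rack: only the top-of-rack switch
    if pod_a == pod_b:
        return 3   # same pod: ToR -> aggregation -> ToR
    return 5       # different pods: up to the core layer and back

# Same rack vs. opposite ends of the datacenter: every extra switch
# adds queuing and serialization delay along the path.
print(fat_tree_switch_hops((0, 0), (0, 0)))  # 1
print(fat_tree_switch_hops((0, 0), (7, 3)))  # 5
```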

Such large scale comes with several costs: 

High I/O Latency: Imagine an I/O-intensive application that depends on high IOPS (I/Os per second) and low latency for performance. In a typical datacenter, most arrays or distributed storage systems with SSD tiers can easily provide sub-millisecond latency and more than 25,000 IOPS in the presence of concurrent I/Os from the application. (Tip: to see how your application is faring, apply Little's Law and simply divide the number of concurrent I/Os by the per-I/O latency to get the IOPS; for example, an application issuing 16 concurrent I/Os with a per-I/O latency of 0.5 ms gets 32,000 IOPS.) In contrast, if you consider the Provisioned IOPS (PIOPS) offering from AWS, the latency observed by end users [3] is in the range of hundreds of milliseconds, which also indicates very high variance. Much of the higher latency in public clouds can be attributed to their scale – the software layers and the application data are all distributed several network hops away.
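Here is the Little's Law tip as a tiny Python helper; the 0.5 ms figure is the example from the text, and the 5 ms figure is a hypothetical slower store for comparison:

```python
def iops_from_littles_law(concurrent_ios, per_io_latency_s):
    """Little's Law: throughput = concurrency / latency.
    Here, IOPS = outstanding I/Os / per-I/O latency (in seconds)."""
    return concurrent_ios / per_io_latency_s

# The example from the text: 16 outstanding I/Os at 0.5 ms each.
print(iops_from_littles_law(16, 0.0005))  # 32000.0 IOPS

# The same application against a hypothetical store with 5 ms
# per-I/O latency sees an order of magnitude fewer IOPS.
print(iops_from_littles_law(16, 0.005))   # 3200.0 IOPS
```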

Loss of Network Proximity: Several optimizations in computer systems rely on caching, applied at the processor, operating system and storage layers, to improve performance. This quest for data locality, however, is largely nullified in large-scale public clouds. Imagine an application consisting of 50 VMs running microservices that talk to each other to provide a specific piece of functionality. Ideally, one should be able to control which VMs share the same rack, host or availability zone. The performance difference between VMs sharing the same host and VMs merely sharing the same rack is already large: our internal tests show that two VMs running on the same host can get more than 25 Gbps of throughput between them, while two VMs within the same rack achieve 2.5 Gbps of network bandwidth with sub-millisecond latency. Now consider VMs running across different rows or pods within a large datacenter: the bandwidth drops significantly and latency increases with each hop in the network. In fact, studies [4] have shown that network performance is tied to the size of the instance itself. While this makes sense for public clouds, from a customer's point of view even smaller instances deserve better network I/O performance and lower latency.
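If you want to see the proximity effect on your own instances, a rough single-stream probe like the sketch below (standard library only; port 5001 is an arbitrary choice, and a dedicated tool such as iperf will give more accurate numbers) makes the same-host versus cross-rack difference visible:

```python
import socket
import time

CHUNK = 1 << 20          # 1 MiB payload per send
TOTAL_BYTES = 512 << 20  # push 512 MiB per measurement run

def serve(port=5001):
    """Run this on one VM: accept a connection and drain the stream."""
    with socket.create_server(("", port)) as srv:
        conn, _ = srv.accept()
        with conn:
            while conn.recv(CHUNK):
                pass

def measure(host, port=5001):
    """Run this on the other VM: push TOTAL_BYTES and report Gbps.
    Single TCP stream, so this is a rough lower bound on link capacity."""
    payload = b"\0" * CHUNK
    sent = 0
    with socket.create_connection((host, port)) as sock:
        start = time.monotonic()
        while sent < TOTAL_BYTES:
            sock.sendall(payload)
            sent += CHUNK
        elapsed = time.monotonic() - start
    return (sent * 8) / elapsed / 1e9  # gigabits per second
```

Running the pair between two VMs on the same host, the same rack, and different pods shows how quickly throughput falls off with distance.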

Lack of control over placement: In a large-scale environment, we often settle for quick, good-enough solutions. I remember that at VMware we would place VMs by looking at multiple resources (CPU, memory, I/O latency) and affinity rules across all hosts in a cluster as part of the DRS and SDRS solutions. This was feasible at the scale of 32 hosts and three to four thousand VMs. At the scale of a public cloud, such detailed selection is not possible, and providing controls like affinity and anti-affinity rules is very difficult. In fact, it is hard to have a complete view of the system at any point in time, given the rate of churn. As a result, users have much more limited control over the placement of workloads on the underlying infrastructure.
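To make the DRS-style idea concrete, here is a deliberately simplified, greedy placement sketch with one anti-affinity rule; the data structures and scoring are my own illustration, not VMware's or any cloud provider's actual algorithm:

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    free_cpu: float                      # spare vCPUs
    free_mem: float                      # spare memory in GiB
    vms: set = field(default_factory=set)

def place(vm, cpu, mem, anti_affinity, hosts):
    """Greedy, affinity-aware placement: pick a host that satisfies the
    resource demand and runs none of the VMs in `anti_affinity`.
    Real schedulers (e.g. DRS/SDRS) also weigh I/O latency, affinity
    groups and migration cost; this is only the skeleton of the idea."""
    def fits(h):
        return (h.free_cpu >= cpu and h.free_mem >= mem
                and not (anti_affinity & h.vms))

    def headroom(h):
        # Crude score: the smaller of the two leftover resources.
        # (Mixes units; a real scheduler normalizes each dimension.)
        return min(h.free_cpu - cpu, h.free_mem - mem)

    candidates = [h for h in hosts if fits(h)]
    if not candidates:
        return None                      # no host satisfies the constraints
    best = max(candidates, key=headroom)
    best.free_cpu -= cpu
    best.free_mem -= mem
    best.vms.add(vm)
    return best.name

hosts = [Host("h1", free_cpu=8, free_mem=64), Host("h2", free_cpu=16, free_mem=32)]
print(place("web-1", cpu=4, mem=16, anti_affinity={"web-0"}, hosts=hosts))  # "h2"
```

Even this toy version needs an up-to-date view of every host's free resources, which is exactly what becomes impractical at public-cloud scale.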

These limitations are not specific to any one provider; they exist for all public clouds. So, based on these observations, here is my “No Free Lunch” theorem for public clouds:

Applications pay for the scale and elasticity of a public cloud with higher I/O latency and a loss of network proximity and placement control between the various components of an application. So while public clouds are suitable for a certain set of applications, that suitability is paid for with lower and less predictable performance for all other classes of applications.

Now, of course, private clouds come with their own trade-offs. I will elaborate on the “No Free Lunch” theorem for them in the next post.

[1] Wolpert, D.H., Macready, W.G. (1997), “No Free Lunch Theorems for Optimization,” IEEE Transactions on Evolutionary Computation 1, 67.

[2] Wolpert, D.H., Macready, W.G. (1995), “No Free Lunch Theorems for Search,” Technical Report SFI-TR-95-02-010, Santa Fe Institute.

[3] http://blog.parse.com/learn/engineering/parse-databases-upgraded-to-amazon-provisioned-iops/

[4] https://gigaom.com/2014/04/12/need-for-speed-testing-the-networking-performance-of-the-top-4-cloud-providers/
