Wednesday, September 7, 2011

Warehouse-Scale Computing: Entering the Teenage Decade

Motivation:
The keynote at ISCA 2011 by Luiz Andre Barroso describes the challenges that warehouse-scale computing faces in the next ten years. Tracing the history of warehouse-scale computing from the early 2000s, the talk offers a window into the changes that have taken place over the last decade.


Main Ideas:
- Importance of low latency: The biggest idea that I found in the talk was the emphasis on low-latency computation and I/O. For computation, this means that wimpy cores are not always good enough and that brawny cores make it easier to exploit request-level parallelism.
- In terms of I/O, the advent of flash and how it is integrated into the storage hierarchy is an important problem. While flash has very good random I/O latency compared to disks, its tail latency is high, sometimes worse than disk, because of slow erase cycles.
- Power management: Over the last decade, power management has mostly looked at how to make the datacenter as a whole more efficient. The gains from this are diminishing, and we now need to look at how to make individual machines more efficient. Further, there are other related problems, like avoiding the use of potable water for cooling and building servers that cope better with load peaks.
- Disaggregation: The main idea behind disaggregation is that resources are utilized better when they can be shared across the datacenter and are not in small silos. Disk resources can already be considered disaggregated, as network speeds allow access to remote disks to be almost as fast as local disk accesses. There is a lot of work happening in the area of full bisection bandwidth in datacenters, and faster networks could hopefully lead to more resources being disaggregated.
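
To make the point about flash tail latency above more concrete, here is a minimal Python sketch of how a small fraction of slow operations can leave the median untouched while dominating the high percentiles. All numbers here are hypothetical and chosen only for illustration (100 us for a normal read, 10,000 us for a read stalled behind an erase, one stalled request in every 200); they are not measurements from the talk or from any real device, and the helper names `make_latencies` and `percentile` are my own.

```python
import math
import statistics


def make_latencies(n, fast_us, slow_us, slow_every):
    """Build n latencies (microseconds) where every slow_every-th
    request is slow. Deterministic on purpose, so the result is
    reproducible; a real device would stall at irregular times."""
    return [slow_us if i % slow_every == 0 else fast_us for i in range(n)]


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical values: 0.5% of requests stall behind a slow operation.
latencies = make_latencies(n=10_000, fast_us=100, slow_us=10_000, slow_every=200)

print("median:", percentile(latencies, 50))
print("p99:   ", percentile(latencies, 99))
print("p99.9: ", percentile(latencies, 99.9))
print("mean:  ", statistics.mean(latencies))
```

With these assumed numbers the median and the 99th percentile both stay at the fast latency, while the 99.9th percentile jumps to the stalled latency. This is why a device can look excellent on typical latency and still be a problem for a service that cares about its slowest requests.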

Trade-offs/influence:
- The changing nature of the workloads (e.g., Google Instant, Twitter) has led to the need for warehouse-scale computers to support low-latency operations along with other constraints such as energy efficiency.
- The example showing how much slower TCP is compared to RDMA highlights that many parts of the stack need to be restructured to build a low-latency warehouse-scale computer. I think this idea could strongly influence research and development over the next decade.