Wednesday, September 7, 2011

The Datacenter as a Computer - Chapters 3, 4, 7

Motivation:
Chapters 3, 4 and 7 of the book introduce the hardware building blocks of a datacenter and describe its power infrastructure. They also discuss the failure modes of a warehouse-scale computer and motivate the need for fault-tolerant software.

Main ideas:
- Warehouse-scale computing uses low-end servers for cost-efficiency reasons. On the TPC-C price/performance benchmark, lower-end servers are up to 20x more cost-efficient than a high-end server.
- However, there are limits to how low-end the components can be, and using hardware from embedded systems may not bring much benefit. This is because some parts of a computation are difficult to parallelize, and latency would suffer if lower-power processors were used.
- The power supply to a warehouse-scale computer has multiple levels of redundancy, and the architecture of the datacenter is optimized for efficient cooling.
- Even though low-end hardware components are used, configuration and software failures are more frequent than hardware failures. Further, while hardware failures are statistically independent, many of the software or configuration related failures are correlated and can lead to a loss of availability.
- A survey of machine restart events in a Google datacenter showed that 95% of the machines restart less often than once a month, but the distribution has a long tail. The average machine availability was found to be 99.84%. Even so, a service that runs on 2000 machines will see a failure about once every 2.5 hours, so software design needs to take such frequent failures into account.
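
The arithmetic behind the last point can be sketched in a few lines of Python. This is only a rough illustration: it assumes that machines fail independently and at the same average rate, which the long-tailed distribution above says is not quite true, and the function names are made up for this example. It works backwards from the 2000-machine, 2.5-hour figure to the per-machine failure rate that figure implies, and converts the 99.84% availability into downtime per year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def fleet_failure_interval_hours(machine_mtbf_hours, machines):
    """Expected time between failures anywhere in the fleet.

    Assumes independent machines that all share the same mean time
    between failures (a simplification, not the book's exact model).
    """
    return machine_mtbf_hours / machines


def implied_machine_mtbf_hours(fleet_interval_hours, machines):
    """Per-machine mean time between failures implied by a fleet-wide interval."""
    return fleet_interval_hours * machines


def annual_downtime_hours(availability):
    """Hours per year a machine is unavailable at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    machines = 2000
    fleet_interval = 2.5  # hours between failures, from the notes above

    mtbf = implied_machine_mtbf_hours(fleet_interval, machines)
    print("implied per-machine MTBF (days):", mtbf / 24)       # about 208 days
    print("failures per day in the fleet:", 24 / fleet_interval)  # about 10
    print("downtime per machine (hours/year):", annual_downtime_hours(0.9984))
```

Under these assumptions each machine fails only about once every seven months, yet the service as a whole still sees roughly ten failures a day, which is why fault tolerance has to live in the software.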

Tradeoffs, Influence:
- The most important trade-off in the design of a warehouse-scale computer is the choice of hardware components. The book shows that many factors influence this decision and that designers need to find a sweet spot among a large number of choices.
- Software components like file systems have been redesigned to handle failures (for example GFS), and this leads to the trade-off between consistency, availability and partition tolerance (the CAP theorem). Storage systems like Bigtable and Yahoo's PNUTS have shown the importance of availability with eventual consistency.