Wednesday, September 21, 2011

Paxos Made Simple

Motivation:
This paper presents the Paxos algorithm in plain English. The Paxos algorithm solves the problem of reaching consensus in a distributed system where agents may fail and messages may be dropped or delayed.

Main points:
- The Paxos algorithm has three main roles for agents: Proposers, Acceptors and Learners. These agents behave according to a set of rules prescribed by the algorithm.
- Every proposal has a proposal number (id) and a value. An acceptor responds to a prepare request only if its id is greater than that of any prepare request the acceptor has already responded to, and it accepts a proposal only if it has not already promised a higher-numbered one.
- The algorithm has two phases (see the sketch after this list). Phase 1: A proposer chooses a proposal number and sends a prepare request to a majority of acceptors. Each acceptor that grants the request replies with a promise not to accept lower-numbered proposals, along with the highest-numbered proposal it has already accepted, if any.
- Phase 2: If the proposer receives responses from a majority of acceptors, it sends an accept request to those acceptors with the proposal number and a value: the value of the highest-numbered proposal reported in the responses, or its own value if none was reported.
- Acceptors notify the learners whenever they accept a proposal, so that the chosen value is disseminated. To reduce the number of messages, a distinguished learner can be elected which propagates the value to the other learners.
- Progress can stall if two proposers keep issuing proposals with ever-increasing numbers, each pre-empting the other. However, the safety of the algorithm is still not affected.
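The two phases can be condensed into a small sketch. This is my own illustrative Python, not the paper's presentation: the class names, the in-memory message passing and the single-process setup are all simplifying assumptions (no persistence, networking, failures or learners).

```python
# Minimal single-process sketch of the two Paxos phases (illustrative only).

class Acceptor:
    def __init__(self):
        self.promised_id = -1        # highest prepare id responded to
        self.accepted_id = -1        # id of the highest proposal accepted so far
        self.accepted_value = None

    def prepare(self, proposal_id):
        # Phase 1b: promise not to accept lower-numbered proposals and
        # report any proposal already accepted.
        if proposal_id > self.promised_id:
            self.promised_id = proposal_id
            return ('promise', self.accepted_id, self.accepted_value)
        return ('reject', None, None)

    def accept(self, proposal_id, value):
        # Phase 2b: accept unless a higher-numbered prepare has been promised.
        if proposal_id >= self.promised_id:
            self.promised_id = proposal_id
            self.accepted_id = proposal_id
            self.accepted_value = value
            return True
        return False


def propose(acceptors, proposal_id, my_value):
    # Phase 1a: send prepare to the acceptors (here: all of them).
    replies = [a.prepare(proposal_id) for a in acceptors]
    granted = [r for r in replies if r[0] == 'promise']
    if len(granted) <= len(acceptors) // 2:
        return None                  # no majority; retry with a higher id

    # Phase 2a: use the value of the highest-numbered accepted proposal,
    # or our own value if no acceptor has accepted anything yet.
    prev = max(granted, key=lambda r: r[1])
    value = prev[2] if prev[1] >= 0 else my_value

    acks = sum(a.accept(proposal_id, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, 'apple'))   # 'apple' is chosen
print(propose(acceptors, 2, 'banana'))  # still 'apple': later proposals carry the chosen value
```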

Trade-offs/Influence:
- The Paxos algorithm has been influential in the design of distributed systems built for datacenters. Used in systems like Chubby and ZooKeeper, Paxos has a wide range of applications such as leader election and distributed locking.

Monday, September 19, 2011

Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

Motivation:

This paper presents a formal proof of the CAP theorem, which states that a web service cannot simultaneously guarantee consistency, availability and partition tolerance.

Main points:
- The proof of the CAP theorem is divided into two parts: the first for asynchronous networks and the second for partially synchronous networks.
- For asynchronous networks (where there are no clocks and messages can be arbitrarily delayed), it is shown that it is impossible to guarantee availability and atomic consistency for all fair executions. This is proved by considering two groups of nodes and showing that if all messages between them are lost, a read served by the second group cannot reflect a write completed in the first group (a toy illustration follows this list).
- In an asynchronous network it is possible to provide any two of the three properties. A centralized algorithm where all data is written to a central node, with other nodes forwarding requests to it, is consistent and partition tolerant (CP); if the network is assumed never to partition, the same design is also consistent and available (CA). Web-cache-like systems that may return stale data are an example of an available, partition-tolerant (AP) system.
- For partially synchronous networks, where timeouts and local clocks can be introduced, the CAP theorem still holds.
- A notion of Delayed-t consistency is introduced: if no messages are lost for a period of time t, then the partial ordering among operations is maintained correctly. This is an example of a model where consistency is weakened but there is a bound on how stale the returned data can be (weak consistency).
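To make the partition argument concrete, here is a tiny toy simulation of my own (not from the paper): two replicas of a register that cannot exchange messages must either refuse to answer (giving up availability) or answer with possibly stale data (giving up atomic consistency). The `Replica` class, the `partitioned` flag and the `sacrifice` parameter are illustrative assumptions.

```python
# Toy illustration of the CAP trade-off during a partition (not from the paper).
# While partitioned, replication messages between the two replicas are dropped.

class Replica:
    def __init__(self, name):
        self.name = name
        self.value = 0

    def write(self, value, peers, partitioned):
        self.value = value
        if not partitioned:
            for p in peers:              # replicate only when the network allows it
                p.value = value

    def read(self, partitioned, sacrifice):
        if partitioned and sacrifice == 'availability':
            # CP behaviour: refuse to answer rather than risk inconsistency.
            raise TimeoutError(f'{self.name}: cannot confirm the latest value')
        return self.value                # AP behaviour: answer, possibly with stale data


g1, g2 = Replica('G1'), Replica('G2')
g1.write(42, peers=[g2], partitioned=True)   # the write reaches only G1

print(g2.read(partitioned=True, sacrifice='consistency'))   # prints 0: stale but available
try:
    g2.read(partitioned=True, sacrifice='availability')     # stays consistent, not available
except TimeoutError as e:
    print(e)
```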

Trade-offs/Influence:
- The CAP theorem is the basis on which eventually consistent storage systems like BigTable, PNUTS etc. have been built.
I think it has been very influential over the last ten years and has also led to the growth of key-value stores and other NoSQL systems.
- More fundamentally, the CAP theorem has led to consistency being sacrificed in favor of availability (since it is not possible to prevent all network partitions). This is interesting because most traditional storage systems and databases prioritized consistency. However, for Internet services where availability matters more and the cost of presenting an inconsistent view is not very high (delayed tweets or a friend count on Facebook), companies like Google and Amazon have found it more beneficial to build eventually consistent systems.

Cluster-based Scalable Network Services


Motivation:
This paper looks at the problem of how to design a cluster for Internet services and how to provide software services which can be shared across all the services in a cluster.

Main points:
- The three main qualities that a network service should have are scalability, availability and cost-effectiveness.
- The paper describes the use of commodity building blocks to design a cluster with the insight that commodity machines give the best price-performance ratio.
- This leads to a set of challenges in the software layer, which include load balancing, scheduling and maintaining shared state across a large number of machines.
- The cluster architecture consists of a pool of worker nodes, shared cache servers and frontends, along with a centralized manager used for load balancing and a common user-profile database (a sketch follows this list).
- This paper also proposes the use of BASE (Basically Available, Soft state, Eventual consistency) semantics for data stored in the caches and workers. This increases the availability of data and makes it easier to scale.
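A rough sketch of that architecture is below. The class names, the least-loaded dispatch policy and the dictionary cache are my own assumptions for illustration; they are not details from the paper.

```python
# Sketch of the cluster architecture: frontends hand requests to a centralized
# manager, which checks the shared cache (soft state, possibly stale under
# BASE semantics) and otherwise load-balances across the worker pool.

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.load = 0

    def handle(self, request):
        self.load += 1
        return f'worker-{self.wid} served {request}'

class Manager:
    """Centralized load balancer over the worker pool."""
    def __init__(self, workers, cache):
        self.workers = workers
        self.cache = cache

    def dispatch(self, request):
        if request in self.cache:
            return self.cache[request]                        # may be stale
        worker = min(self.workers, key=lambda w: w.load)      # least-loaded worker
        result = worker.handle(request)
        self.cache[request] = result
        return result

class Frontend:
    def __init__(self, manager):
        self.manager = manager

    def serve(self, request):
        return self.manager.dispatch(request)


frontend = Frontend(Manager([Worker(i) for i in range(4)], cache={}))
print(frontend.serve('/search?q=paxos'))
print(frontend.serve('/search?q=paxos'))   # second request is answered from the cache
```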

Trade-offs/Influence:
- The architecture described here has been very influential and is in fact pretty close to the Google serving infrastructure (as of 2003). In fact even the Mesos design has similar goals of provisioning a cluster for multiple applications.
- The idea of using commodity building blocks and providing transparent fault tolerance is the basis for the design of MapReduce-like systems
- BASE semantics have also been very influential, and this line of thinking in turn led to the CAP theorem, which has been useful in building storage systems like Bigtable, PNUTS etc.

Wednesday, September 14, 2011

The Datacenter Needs an Operating System

Motivation:
With a large number of Internet, enterprise and scientific services deployed in datacenters, this paper motivates the need for an operating system in the datacenter to provide ease of development and efficient sharing.

Main ideas:
- There is a growing diversity of applications deployed in the datacenter, and just as on the personal computer, this diversity leads to the need for an operating system.
- There are four main roles envisioned for a datacenter OS
   a. Resource sharing: Efficiently share cluster resources among all the jobs and provide optimal scheduling and fairness (a hypothetical API sketch follows this list).
   b. Data sharing: Provide common data interfaces (analogous to pipes) to allow different applications to be chained together.
   c. Programming abstractions: Simplify development of new programming frameworks by providing standard APIs and services.
   d. Debugging and monitoring: Provide support for end-to-end debugging; this could potentially involve a clean-slate approach that imposes some restrictions on applications.
- Finally, the paper also identifies how the academic research community can contribute to this goal, given that much of this infrastructure is currently being built inside industry.
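As an illustration of the resource-sharing role, here is a hypothetical API sketch. It is not the interface of any real system (Mesos or otherwise); the `ClusterScheduler` class, its `request` method and the numbers are assumptions made purely to show the idea of frameworks asking a shared, cluster-wide scheduler for resources.

```python
# Hypothetical resource-sharing API a datacenter OS might expose (illustrative).

from dataclasses import dataclass

@dataclass
class Allocation:
    framework: str
    cpus: int
    mem_gb: int

class ClusterScheduler:
    def __init__(self, total_cpus, total_mem_gb):
        self.free_cpus = total_cpus
        self.free_mem_gb = total_mem_gb
        self.allocations = []

    def request(self, framework, cpus, mem_gb):
        """Grant resources if available; a real scheduler would also enforce fairness."""
        if cpus <= self.free_cpus and mem_gb <= self.free_mem_gb:
            self.free_cpus -= cpus
            self.free_mem_gb -= mem_gb
            alloc = Allocation(framework, cpus, mem_gb)
            self.allocations.append(alloc)
            return alloc
        return None            # the framework must wait or scale down its request


sched = ClusterScheduler(total_cpus=1000, total_mem_gb=4000)
print(sched.request('mapreduce', cpus=600, mem_gb=2400))
print(sched.request('web-serving', cpus=600, mem_gb=1000))   # None: not enough CPUs left
```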

Trade-offs/influence:
The main trade-off I see here is between the difficulty of coming up with standard interfaces and the benefits we would gain from them. The paper mentions that there are many different software stacks today performing some of the functions mentioned above, e.g., Google's GFS/MapReduce stack and Hadoop's software stack. However, as each datacenter is owned by a single organization, and most of the software is developed in-house at places like Google and Amazon, there is less of a need to come up with standard interfaces.
Further, PC operating systems are driven by a commodity business model in which the user can assemble a computer from parts bought from different vendors and run a large number of applications. This encourages a data and resource sharing model that application developers can rely on. Similarly, a standard sharing model for the datacenter will take hold once users and application developers need to run multiple programs at the same time.
Finally, I think it is important to have an operating system for the datacenter because it would prevent vendors from locking customers into their proprietary solutions. For example, if one wants to use both Amazon AWS and Microsoft Azure (say, to provide primary-backup failure handling), it is not currently possible to run the same application across both frameworks. Having a more standard model for describing data and resources could help users switch easily between cluster computing providers.

Wednesday, September 7, 2011

The Datacenter as a Computer - Chapters 3,4,7

Motivation:
Chapters 3, 4 and 7 of the book introduce the hardware building blocks and provide details on the power infrastructure used in a datacenter. Further, they discuss the different failure modes of a warehouse-scale computer and motivate the need for building fault-tolerant software.

Main ideas:
- The server hardware used in warehouse-scale computing consists of low-end servers, for cost-efficiency reasons. Using the TPC-C price/performance benchmark, one can see that lower-end servers are up to 20x more cost-efficient than a high-end server.
- However, there are limits to how low-end the components can be, and using hardware from embedded systems may not give much benefit. This is because parts of a computation are difficult to parallelize, and latency would suffer if low-power processors were used.
- The power supply to a warehouse scale computer has many redundancy levels and the architecture of the datacenter is optimized for more efficient cooling.
- Even though low-end hardware components are used, configuration and software failures are more frequent than hardware failures. Further, while hardware failures are largely statistically independent, many software or configuration related failures are correlated and could lead to a loss of availability.
- A survey of machine restart events in a Google datacenter showed that 95% of machines restart less often than once a month, but the distribution has a long tail. The average machine availability was found to be 99.84%. For a service that runs on 2000 machines this translates to a failure roughly once every 2.5 hours, so software design needs to take such frequent failures into account (a back-of-the-envelope check follows this list).
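The 2.5-hour figure can be sanity-checked with a quick back-of-the-envelope calculation. The per-machine numbers below are inferred from the figures quoted in the chapter, so treat them as rough approximations.

```python
# Back-of-the-envelope check of the failure-rate claim.
# If a fleet of N machines sees one failure every 2.5 hours, each machine
# fails roughly once every N * 2.5 machine-hours.

n_machines = 2000
fleet_failure_interval_hours = 2.5

per_machine_mtbf_hours = n_machines * fleet_failure_interval_hours
print(per_machine_mtbf_hours / 24)      # ~208 days between failures per machine

# 99.84% availability means each machine is down ~0.16% of the time,
# i.e. roughly 14 hours of downtime per machine per year.
availability = 0.9984
print((1 - availability) * 365 * 24)    # ~14 hours of downtime per year
```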

Tradeoffs, Influence:
- The most important trade-off in the design of a warehouse scale computer is the choice of hardware components used for it. This work shows that there are many factors which influence such a decision and designers need to find a sweet spot over a large number of choices.
- Software components like file systems have been redesigned to handle failures (GFS) and this leads to the trade-off between consistency, availability and tolerance to network partitions (CAP theorem). Storage systems like Bigtable and Yahoo's PNUTS have shown the importance of availability with eventual consistency.

Warehouse-Scale Computing: Entering the Teenage Decade

Motivation:
The keynote at ISCA 2011 by Luiz Andre Barroso describes the challenges that warehouse scale computing faces in the next ten years. Tracing the history of warehouse scale computing from the early 2000s, the talk offers a window into the changes that have taken place over the last decade.


Main Ideas:
- Importance of low latency: The biggest idea I found in the talk was the stress on the importance of low-latency computation and I/O. For computation, this means that using wimpy cores is not always good enough, and having brawny cores makes it easier to exploit request-level parallelism.
- In terms of I/O, the advent of flash and how it is integrated into the storage hierarchy is an important problem. While flash has very good random I/O latency compared to disks, its tail latency is high due to slow erase cycles, sometimes worse than disk (a toy illustration follows this list).
- Power management: Over the last decade, power management has mostly looked at how to make the datacenter more efficient. The gains from this are smaller now, and we need to look at how to make individual machines more efficient. Further, there are related problems like avoiding the use of potable water for cooling and building servers that handle load peaks more gracefully.
- Disaggregation: The main idea behind disaggregation is that resources are utilized better when they can be shared across the datacenter rather than kept in small silos. Disk resources can be considered disaggregated, since network speeds allow remote disk accesses to be almost as fast as local ones. There is a lot of work happening on full bisection bandwidth in datacenters, and faster networks could lead to more resources being disaggregated.
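A tiny simulation of my own, with made-up numbers, illustrates the tail-latency point: even if only a small fraction of flash reads stall behind an erase cycle, the high percentiles blow up while the median stays low. The 0.2% stall probability and the latency values are assumptions, not measurements from the talk.

```python
# Illustrative simulation (made-up numbers) of how rare slow operations
# dominate tail latency: most flash reads are fast, but reads that land
# behind an erase cycle are very slow, which hurts the 99.9th percentile.

import random
random.seed(0)

def flash_read_latency_us():
    if random.random() < 0.002:      # assumed: 0.2% of reads stall behind an erase
        return 10_000                # ~10 ms, worse than a disk seek
    return 100                       # typical fast flash read

samples = sorted(flash_read_latency_us() for _ in range(100_000))

def percentile(p):
    return samples[int(p / 100 * (len(samples) - 1))]

# The median and 99th percentile look fine; the 99.9th percentile does not.
print(percentile(50), percentile(99), percentile(99.9))
```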

Trade-offs/influence:
- The changing nature of the workloads (e.g., Google Instant, Twitter) has led to the need for warehouse scale computers to support low latency operations along with other constraints such as energy efficiency.
- The example showing how much slower TCP is compared to RDMA highlights that many parts of the stack need to be restructured to build a low-latency warehouse scale computer. I think this idea could greatly influence research and development over the next decade.