Monday, November 7, 2011

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric


Motivation:
Datacenter networks need to support VM migration without losing open
connections. Layer 2 networks can support such functionality but cannot
scale to many thousands of nodes. Layer 3 networks are more scalable,
but they impose significant configuration overhead. This paper proposes
a new scalable Layer 2 network fabric for datacenters.

Main points
- The paper assumes a datacenter topology that is a multi-rooted tree.
  This applies to fat-trees and some of the other topologies developed
  in recent years.
- The main idea in the paper is the introduction of a Pseudo MAC (PMAC)
  address which allows end hosts to be named hierarchically at Layer 2.
  The edge switches translate between PMACs and actual MACs in each
  direction (see the sketch after this list).
- The networking fabric is coordinated by a centralized fabric manager.
  The fabric manager helps avoid broadcasting ARP requests and assists
  in fault-tolerant routing.
- The paper also describes a location discovery protocol which helps
  switches bootstrap automatically and discover their role in the
  multi-rooted tree.
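
Here is a minimal sketch of the PMAC idea in Python. The 16/8/8/16-bit
pod.position.port.vmid split follows the paper, but the helper names and
the toy fabric-manager table below are my own illustration, not PortLand
code:

    # Sketch of PortLand-style PMACs: a 48-bit pseudo MAC that encodes a
    # host's location as pod.position.port.vmid (16/8/8/16 bits).
    def make_pmac(pod, position, port, vmid=0):
        assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
        return (pod << 32) | (position << 24) | (port << 16) | vmid

    def pmac_to_str(pmac):
        return ":".join(f"{(pmac >> s) & 0xff:02x}" for s in range(40, -1, -8))

    # The fabric manager keeps an IP -> PMAC mapping so that edge switches
    # can answer ARP requests directly instead of broadcasting them.
    fabric_manager_arp = {}

    def register_host(ip, pod, position, port):
        fabric_manager_arp[ip] = make_pmac(pod, position, port)

    def proxy_arp(ip):
        return fabric_manager_arp.get(ip)

    # Example: a host in pod 2, attached to edge switch position 1, port 3.
    register_host("10.2.1.3", pod=2, position=1, port=3)
    print(pmac_to_str(proxy_arp("10.2.1.3")))   # 00:02:01:03:00:00

End hosts keep their actual MACs: the source edge switch rewrites the
source MAC to the PMAC on ingress, and the destination edge switch
rewrites the destination PMAC back to the real MAC, so core switches can
forward purely on the hierarchical PMAC prefix.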

Trade-offs
- This paper tries to get the benefits of Layer 3 (hierarchical
  namespace, better routing, etc.) and the benefits of Layer 2 (VM
  migration without losing connections).
- The major insight in this paper is that since the datacenter network
  topology is hierarchical and well-known, a new level of indirection
  (the PMAC) can be used to get the benefits of both Layer 2 and Layer 3.
- The centralized manager simplifies the design but is a scalability
  bottleneck. The authors propose that a small cluster could be used,
  but it's not clear if this would affect the other properties of the
  system.

Wednesday, October 26, 2011

Piccolo - Building Fast, Distributed Programs with Partitioned Tables

Motivation:
This paper looks at the problem of building an easy-to-use programming abstraction for performing large-scale computation on in-memory data.

Main points
- With the growing size of DRAM in individual machines, the working set of many applications fits in the aggregate memory of a large cluster. This creates the need for a programming model that is efficient and makes it easier for users to perform large-scale computation.

- Data-flow based systems like MapReduce don't expose shared global state across workers. While that makes them easier to program, this paper argues that it limits both the range of applications and their performance. Traditional frameworks like MPI use message passing to manage shared state but are hard to use.

- This paper proposes partitioned tables distributed across the cluster as the mechanism for reading and modifying shared state. The computation is structured around one master, which acts as the control thread, and many worker nodes which execute 'kernels'. Workers can read and modify data in both local and remote partitions.

- Since many workers can update the same key, this can lead to write-write conflicts. Piccolo uses user-defined accumulation functions to resolve such conflicts (see the sketch after this list).

- Other features of the system: user-specified locality for kernels, load balancing via work stealing, and checkpoint/recovery using a consistent snapshot (the Chandy-Lamport algorithm).
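
A toy sketch of the partitioned-table-with-accumulator idea is below. The class and method names are illustrative rather than Piccolo's actual API; it only shows how concurrent updates to the same key are merged:

    # Toy model of a Piccolo-style partitioned table. The user-defined
    # accumulator merges concurrent updates to a key instead of treating
    # them as write-write conflicts.
    class PartitionedTable:
        def __init__(self, num_partitions, accumulate):
            self.partitions = [dict() for _ in range(num_partitions)]
            self.accumulate = accumulate
            self.num_partitions = num_partitions

        def _partition(self, key):
            return hash(key) % self.num_partitions

        def update(self, key, value):
            part = self.partitions[self._partition(key)]
            part[key] = self.accumulate(part[key], value) if key in part else value

        def get(self, key):
            return self.partitions[self._partition(key)].get(key)

    # Example: PageRank-style contributions accumulated by summation.
    ranks = PartitionedTable(num_partitions=4, accumulate=lambda old, new: old + new)
    for contribution in [0.15, 0.2, 0.05]:   # updates from different kernels
        ranks.update("page-42", contribution)
    print(ranks.get("page-42"))              # ~0.4

In the real system the partitions live on different workers; updates to remote partitions are shipped to the owning worker and applied there with the accumulator, which is what makes concurrent writes to the same key benign.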

Trade-offs
- The main trade-off this paper addresses is performance vs. ease of use. As the authors acknowledge in their evaluation, using MPI should give similar or better performance, but it is harder to use. While a key-value interface is definitely easier to use, the release consistency semantics could lead to non-deterministic behavior.

- Having global shared state leads to expensive checkpoint and recovery methods. The main insight here is that MapReduce has a simple fault tolerance model because it restricts the amount of global state a task can update. It remains to be seen if the benefits of exposing more shared state are worth the more complex snapshot and recovery machinery.

- Overall, I think the bigger question is which applications we wish to target with in-memory parallel programs. If the applications are iterative or sequential, then Spark or MapReduce would be good enough. However, if there are applications which require such fine-grained updates to shared state, Piccolo would be a good programming model.

Wednesday, September 21, 2011

Paxos Made Simple

Motivation:
This paper presents the Paxos algorithm in plain English. The Paxos algorithm solves the problem of reaching consensus in a distributed system where agents may fail and messages may be dropped or delayed.

Main points:
- The Paxos algorithm has three main roles for agents: Proposers, Acceptors and Learners. These agents behave according to a set of rules prescribed by the algorithm.
- Every proposal that is sent has a proposal number (id) and a value. An acceptor accepts a proposal only if it has not already responded to a prepare request with a higher number; the constraint on which values may be proposed is enforced by the proposer in Phase 2.
- The algorithm has two phases. Phase 1: A proposer chooses a proposal number and sends a prepare request to a majority of acceptors. Each acceptor that has not already promised a higher number replies with a promise not to accept lower-numbered proposals, along with the highest-numbered proposal it has already accepted (if any).
- Phase 2: If the proposer receives responses from a majority of acceptors, it sends an accept request to those acceptors with the proposal number and a value: the value of the highest-numbered proposal among the responses, or a value of its own choosing if none of the acceptors had accepted anything. (A sketch of both phases follows this list.)
- Acceptors send notifications to the learners to disseminate the value that has been chosen. To reduce the number of messages, we could elect a distinguished learner who propagates messages to other learners.
- Progress can stall if two proposers keep issuing proposals with ever-increasing numbers, each preempting the other. However, the safety of the algorithm is still not affected; a distinguished proposer is needed only for liveness.
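
Below is a compact single-decree sketch of the two phases. It is an in-process simulation with no failures, retransmissions or learners, and all names are mine rather than anything prescribed by the paper:

    # Minimal single-decree Paxos sketch (synchronous, in-process "messages").
    class Acceptor:
        def __init__(self):
            self.promised_n = -1      # highest prepare number promised so far
            self.accepted_n = -1      # number of the highest accepted proposal
            self.accepted_v = None    # value of the highest accepted proposal

        def prepare(self, n):
            # Phase 1b: promise not to accept proposals numbered below n.
            if n > self.promised_n:
                self.promised_n = n
                return (self.accepted_n, self.accepted_v)
            return None

        def accept(self, n, v):
            # Phase 2b: accept unless a higher-numbered prepare was promised.
            if n >= self.promised_n:
                self.promised_n = self.accepted_n = n
                self.accepted_v = v
                return True
            return False

    def propose(acceptors, n, value):
        # Phase 1a: send prepare(n) and collect promises.
        promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
        if len(promises) <= len(acceptors) // 2:
            return None               # no majority; retry with a higher number
        # Adopt the value of the highest-numbered accepted proposal, if any.
        prev_n, prev_v = max(promises, key=lambda p: p[0])
        chosen = prev_v if prev_n >= 0 else value
        # Phase 2a: ask the acceptors to accept (n, chosen).
        accepted = sum(a.accept(n, chosen) for a in acceptors)
        return chosen if accepted > len(acceptors) // 2 else None

    acceptors = [Acceptor() for _ in range(5)]
    print(propose(acceptors, n=1, value="A"))   # 'A' is chosen
    print(propose(acceptors, n=2, value="B"))   # still 'A': later proposals keep it

The second call shows the safety property: once a value has been accepted by a majority, any later proposal is forced to carry the same value.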

Trade-offs/Influence:
- The Paxos algorithm has been influential in the design of distributed systems built for datacenters. Used in systems like Chubby and ZooKeeper, Paxos has a wide range of applications like leader election, distributed locks etc.

Monday, September 19, 2011

Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

Motivation:

This paper presents a proof of the CAP theorem, which states that a web service cannot simultaneously be consistent, available and partition-tolerant.

Main points:
- The proof of the CAP theorem is divided into two parts: first for asynchronous networks and then for partially synchronous networks.
- For asynchronous networks (where processes can be arbitrarily slow), it is shown that it is impossible to guarantee availability and atomic consistency for all fair executions. This is proved by considering two partitioned nodes and showing that if no messages are exchanged between them, a write made at the first node and a read from the second node cannot return consistent results (a toy version of this argument appears after this list).
- It is possible in an asynchronous network to satisfy any two of the three constraints by relaxing the third. A centralized algorithm where all data is written to a central node and other nodes forward requests to it satisfies consistency and partition tolerance (CP), and is also an example of a consistent and available system (CA) when there are no partitions. Web-cache-like systems, where stale data can be returned, are an example of an available, partition-tolerant (AP) system.
- For partially synchronous networks, we can introduce the idea of timeouts and local clocks, but the CAP theorem still holds.
- A notion of Delayed-t consistency is introduced: if no messages are lost for a time period t, then the partial ordering among operations is maintained correctly. This is an example of a weaker consistency model where consistency is sacrificed but there is a bound on how stale the returned data can be.
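
A toy version of the two-node partition argument (purely illustrative; the real proof reasons over all fair executions of an atomic read/write object):

    # Two replicas that cannot exchange messages must either serve possibly
    # stale data (giving up consistency) or refuse requests (giving up
    # availability).
    class Replica:
        def __init__(self):
            self.value = "v0"

    def write(primary, replica, value, partitioned):
        primary.value = value
        if not partitioned:
            replica.value = value         # replication message delivered

    a, b = Replica(), Replica()
    write(a, b, "v1", partitioned=True)   # the update never reaches b

    # AP choice: b stays available but returns stale data.
    print(a.value, b.value)               # v1 v0 -> not atomically consistent

    # CP choice (not shown): b would instead block or return an error until
    # the partition heals, sacrificing availability.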

Trade-offs/Influence:
- The CAP theorem is the basis on which eventually consistent storage systems like BigTable, PNUTS, etc. have been built. I think it has been very influential over the last ten years and has also led to the growth of key-value stores and other NoSQL systems.
- More fundamentally, the CAP theorem has led to consistency being sacrificed in favor of availability (since it is not possible to prevent all network partitions). This is interesting because a lot of the traditional storage systems and databases prioritized consistency. However, for Internet services where availability is more important and the cost of presenting an inconsistent view is not very high (delayed tweets, or a stale friend count on Facebook), companies like Google and Amazon have found it more beneficial to build eventually consistent systems.

Cluster-based Scalable Network Services


Motivation:
This paper looks at the problem of how to design a cluster for Internet services and how to provide software layers that can be shared across all the services running on the cluster.

Main Points
- The three main qualities that a network service should have are scalability, availability and cost-effectiveness.
- The paper describes the use of commodity building blocks to design a cluster with the insight that commodity machines give the best price-performance ratio.
- This leads to a set of challenges in the software layer, which include load balancing, scheduling and maintaining shared state across a large number of machines.
- The cluster architecture consists of a pool of worker nodes, shared cache servers and frontends. There is a centralized manager used for load-balancing and a common user-profile database.
- This paper also proposes the use of BASE semantics for data stored in the cache and workers. This increases the availability of data and makes it easier to scale.

Trade-offs/Influence:
- The architecture described here has been very influential and is in fact pretty close to the Google serving infrastructure (as of 2003). Even the Mesos design has similar goals of provisioning a cluster for multiple applications.
- The idea of using commodity building blocks and providing transparent fault tolerance is the basis for the design of MapReduce-like systems.
- BASE semantics have also been very influential and led in turn to the CAP theorem, which has informed the design of storage systems like Bigtable, PNUTS, etc.

Wednesday, September 14, 2011

The Datacenter Needs an Operating System

Motivation:
With a large number of Internet, enterprise and scientific services deployed in datacenters, this paper motivates the need for an operating system in the datacenter to provide ease of development and efficient sharing.

Main ideas:
- There is a growing diversity of applications deployed in the datacenter and, as with the personal computer, this leads to the need for an operating system.
- There are four main roles envisioned for a datacenter OS
   a. Resource sharing: Efficiently share cluster resources among all the jobs and provide optimal scheduling and fairness.
   b. Data sharing: Provide common data interfaces (analogous to pipes) to allow different applications to be chained together.
   c. Programming abstractions: Simplify development of new programming frameworks by providing standard APIs and services.
   d. Debugging and monitoring: Provide support for end-to-end debugging; this could potentially involve a clean-slate approach that imposes some restrictions.
- Finally, the paper also identifies how the academic research community can contribute to this goal, given that much of this infrastructure is currently being built in industry.

Trade-offs/influence:
The main trade-off I see here is between the challenge of coming up with standard interfaces and the benefits we would gain from them. The paper mentions that there are many different software stacks today performing some of the functions mentioned above, e.g., Google's GFS/MapReduce stack, Hadoop's software stack, etc. However, as datacenters are owned by a single organization and most of the software is developed in-house at places like Google and Amazon, there is less of a need to come up with standard interfaces.
Further, PC operating systems are driven by a commodity business model where the user can assemble a computer from parts bought from different vendors and run a large number of applications. This encourages having a data and resource sharing model that application developers can use. Similarly, a standard sharing model for datacenters will develop once users and application developers need to run multiple programs at the same time.
Finally, I think it is important to have an operating system for the datacenter as it will prevent vendors from locking customers into their proprietary solutions. For example, if one wants to use both Amazon AWS and Microsoft Azure (say, to provide primary-backup failure handling), users cannot run the same application across both frameworks. Having a more standard model for describing data and resources could help users switch easily between cluster computing providers.

Wednesday, September 7, 2011

The Datacenter as a Computer - Chapters 3,4,7

Motivation:
Chapters 3, 4 and 7 of the book introduce the hardware building blocks and provide details on the power infrastructure used in a datacenter. Further, they discuss the different failure modes of a warehouse-scale computer and motivate the need for building fault-tolerant software.

Main ideas:
- The server hardware used in warehouse-scale computing consists of low-end servers, for cost-efficiency reasons. Using the TPC-C price/performance benchmark, one can see that lower-end servers are up to 20x more efficient than a high-end server.
- However, there are limits to how low-end the components can be, and using hardware from embedded systems may not give much benefit. This is because there are parts of a computation that are difficult to parallelize, and latency would suffer if lower-power processors were used.
- The power supply to a warehouse scale computer has many redundancy levels and the architecture of the datacenter is optimized for more efficient cooling.
- Even though low-end hardware components are used, configuration and software failures are more frequent than hardware failures. Further, while hardware failures are largely statistically independent, many of the software or configuration related failures are correlated and could lead to a loss of availability.
- A survey of machine restart events in a Google datacenter showed that 95% of the machines restart less often than once a month, but the distribution has a long tail. The average machine availability was found to be 99.84%. Yet for a service which runs on 2000 machines, there will be a machine failure roughly once every 2.5 hours, and software design needs to take such frequent failures into account (a quick back-of-envelope check follows this list).
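
A quick back-of-envelope check on those numbers, assuming failures are independent and spread evenly over time (purely illustrative arithmetic on the figures quoted above):

    # Back-of-envelope arithmetic on the availability figures quoted above.
    machines = 2000
    cluster_failure_interval_hours = 2.5      # one failure somewhere every 2.5 hrs

    # Implied mean time between restarts for a single machine.
    per_machine_interval_days = machines * cluster_failure_interval_hours / 24
    print(per_machine_interval_days)          # ~208 days between restarts

    # 99.84% availability corresponds to roughly 14 hours of downtime per year.
    availability = 0.9984
    print((1 - availability) * 365 * 24)      # ~14 hours/year per machine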

Tradeoffs, Influence:
- The most important trade-off in the design of a warehouse scale computer is the choice of hardware components used for it. This work shows that there are many factors which influence such a decision and designers need to find a sweet spot over a large number of choices.
- Software components like file systems have been redesigned to handle failures (GFS), and this leads to the trade-off between consistency, availability and tolerance to network partitions (the CAP theorem). Storage systems like Bigtable and Yahoo's PNUTS have shown the importance of availability with eventual consistency.