Monday, November 7, 2011

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric


Motivation:
Datacenter networks need to support VM migration without losing any
open connections. Layer 2 networks can support such functionality but
cannot scale to many thousands of nodes. While layer 3 networks are
more scalable, they impose significant configuration overhead. This
paper proposes a new scalable layer 2 network fabric for datacenters.

Main points
- The paper assumes a datacenter topology that is a multi-rooted tree.
  This is applicable for fat-trees and some of the other topologies
  developed in recent years.
- The main idea in the paper is the introduction of a Pseudo MAC (PMAC)
  address which allows end hosts to be named hierarchically at layer 2.
  The edge switches translate between PMACs and actual MACs (AMACs) in
  both directions; see the sketch after this list.
- The networking fabric is coordinated by a centralized fabric manager.
  The fabric manager helps avoid broadcasting ARP requests and assists
  in fault-tolerant routing.
- The paper also describes a Location Discovery Protocol which helps
  switches bootstrap automatically and discover their position (edge,
  aggregation, or core) in the multi-rooted tree.
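
As a concrete illustration of the PMAC idea, here is a minimal Python
sketch of how an edge switch might assign PMACs and rewrite frames.
The field layout follows the paper's pod.position.port.vmid encoding
(16 + 8 + 8 + 16 = 48 bits), but the class and method names are my own
illustration, not PortLand's actual implementation.

    # Minimal sketch of PMAC assignment at a PortLand-style edge switch.
    # Field widths follow the paper's pod.position.port.vmid layout
    # (16 + 8 + 8 + 16 = 48 bits); all names here are illustrative.
    class EdgeSwitch:
        def __init__(self, pod, position):
            self.pod, self.position = pod, position
            self.next_vmid = {}      # port -> next vmid to hand out
            self.pmac_to_amac = {}   # for rewriting frames sent to a host
            self.amac_to_pmac = {}   # for rewriting frames from a host

        def assign_pmac(self, amac, port):
            """Give a newly seen host MAC (AMAC) a hierarchical PMAC."""
            vmid = self.next_vmid.get(port, 0)
            self.next_vmid[port] = vmid + 1
            pmac = (self.pod << 32) | (self.position << 24) | (port << 16) | vmid
            self.pmac_to_amac[pmac] = amac
            self.amac_to_pmac[amac] = pmac
            # In the real system the mapping is also reported to the
            # fabric manager, which answers ARP queries with the PMAC.
            return pmac

        def rewrite_egress(self, dst_pmac):
            """Rewrite a destination PMAC back to the host's real MAC."""
            return self.pmac_to_amac[dst_pmac]

    # Example: a host on port 3 of the edge switch at pod 2, position 1.
    sw = EdgeSwitch(pod=2, position=1)
    pmac = sw.assign_pmac(amac=0x001122334455, port=3)
    assert sw.rewrite_egress(pmac) == 0x001122334455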

Trade-offs
- This paper tries to get the benefits of Layer 3 (a hierarchical
  namespace, better routing, etc.) together with the benefits of
  Layer 2 (migrating VMs without losing connections).
- The major insight in this paper is that since the datacenter network
  topology is hierarchical and known in advance, a new level of
  indirection (the PMAC) can deliver the benefits of both Layer 2 and
  Layer 3.
- The centralized manager simplifies the design but is a scalability
  bottleneck. The authors propose that a small cluster could be used,
  but it's not clear whether this would affect the other properties of
  the system.

Wednesday, October 26, 2011

Piccolo - Building Fast, Distributed Programs with Partitioned Tables

Motivation:
This paper looks at the problem of building an easy-to-use programming abstraction for performing large-scale computation on in-memory data.

Main points
- With the growing size of DRAM in machines, the working set of many applications fits in the aggregate memory of a large cluster. This creates the need for a programming model that is efficient and makes it easy for users to perform large-scale computation.

- Data-flow based systems like MapReduce don't expose shared global state across workers. While that makes them easier to program, this paper argues that it limits both the range of applications and their performance. Traditional frameworks like MPI use message passing to manage shared state but are hard to use.

- This paper proposes partitioned tables distributed across the cluster as the mechanism for reading and modifying shared state. The computation is structured around one master, which acts as the control thread, and many worker nodes which execute 'kernels'. The workers can read and modify data in both local and remote partitions.

- Since many workers can update the same key, writes can conflict. Piccolo uses user-defined accumulation functions to resolve such write-write conflicts; see the sketch after this list.

- Other features of the system: user-specified locality for kernels, load balancing using work stealing, and checkpoint recovery using a consistent snapshot (the Chandy-Lamport algorithm).
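
To make the table-plus-accumulator idea concrete, here is a toy Python sketch in the spirit of Piccolo's PageRank example. The real system is a C++ library with a different API; the class names, hash partitioning, and damping factor below are my own illustration of the idea, not Piccolo's code.

    # Toy partitioned table with a user-defined accumulator that resolves
    # write-write conflicts, in the spirit of Piccolo (illustrative only).
    class PartitionedTable:
        def __init__(self, num_partitions, accumulate):
            self.parts = [dict() for _ in range(num_partitions)]
            self.accumulate = accumulate  # merges concurrent updates to a key

        def _part(self, key):
            return self.parts[hash(key) % len(self.parts)]

        def get(self, key, default=0.0):
            return self._part(key).get(key, default)

        def update(self, key, value):
            part = self._part(key)
            part[key] = self.accumulate(part[key], value) if key in part else value

    # One PageRank-style kernel iteration: several workers may push
    # contributions into the same destination key, and the sum accumulator
    # merges them instead of raising a write-write conflict.
    ranks = PartitionedTable(num_partitions=4, accumulate=lambda a, b: a + b)
    links = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
    for page, outlinks in links.items():
        share = 1.0 / len(outlinks)  # every page starts with rank 1.0
        for dst in outlinks:
            ranks.update(dst, 0.85 * share)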

Trade-offs
- The main trade-off this paper addresses is performance vs. ease of use. As the authors acknowledge in their evaluation, using MPI should give similar or better performance, but it is harder to use. While a key-value interface is definitely easier to use, the release consistency semantics could lead to non-deterministic behavior.

- Having global shared state leads to expensive checkpoint and recovery methods. The main insight here is that MapReduce has a simple fault tolerance model because it restricts the amount of global state that a task can update. It remains to be seen whether the benefits of exposing more shared state are worth the complex snapshot and recovery machinery.

- Overall, I think the bigger question is which applications we wish to target with in-memory parallel programs. If the applications are iterative or sequential, then Spark or MapReduce would be good enough. However, if there are applications which require such fine-grained updates to shared state, Piccolo would be a good programming model.

Wednesday, September 21, 2011

Paxos Made Simple

Motivation:
This paper presents the Paxos algorithm in plain English. The Paxos algorithm solves the problem of reaching consensus in a distributed system where agents may fail and messages may be dropped or delayed.

Main points:
- The Paxos algorithm has three main roles for agents: Proposers, Acceptors and Learners. These agents behave according to a set of rules prescribed by the algorithm.
- Every proposal that is sent has a proposal number and a value. An acceptor accepts a proposal only if it has not already responded to a prepare request with a higher number; the rules below ensure that every chosen proposal carries the same value.
- The algorithm has two phases. Phase 1: a proposer chooses a proposal number and sends a prepare request to a majority of acceptors. Each acceptor that has not already seen a higher number replies with a promise not to accept lower-numbered proposals, along with the highest-numbered proposal it has already accepted (if any).
- Phase 2: If the proposer receives responses from a majority of acceptors, it sends an accept request to those acceptors with the proposal number and a value: the value of the highest-numbered proposal among the responses, or a value of its own choosing if none was reported. (See the sketch after this list.)
- Acceptors send notifications to the learners to disseminate the value that has been chosen. To reduce the number of messages, a distinguished learner can be elected to propagate the chosen value to the other learners.
- The election could fail to make progress if two proposers keep incrementing their proposal numbers one after the other. However, the safety of the algorithm is still not affected; electing a distinguished proposer (a leader) is the standard way to guarantee progress.
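
Since the acceptor's rules carry most of the safety argument, here is a minimal Python sketch of a single acceptor's state machine. Message transport, the proposer's majority counting, and learner notification are omitted; the structure follows the paper's description, but the code itself is only an illustration.

    # Sketch of a Paxos acceptor. 'promised' is the highest proposal
    # number this acceptor has responded to; 'accepted' is the
    # highest-numbered proposal it has accepted so far, if any.
    class Acceptor:
        def __init__(self):
            self.promised = -1
            self.accepted = None  # (number, value)

        def on_prepare(self, n):
            """Phase 1b: promise to ignore proposals numbered below n."""
            if n > self.promised:
                self.promised = n
                # The reply carries any previously accepted proposal so
                # the proposer can adopt its value in phase 2.
                return ('promise', n, self.accepted)
            return ('reject', n, None)

        def on_accept(self, n, value):
            """Phase 2b: accept unless a higher-numbered promise was made."""
            if n >= self.promised:
                self.promised = n
                self.accepted = (n, value)
                return ('accepted', n, value)  # learners would be told here
            return ('reject', n, None)

    # Example: a prepare then an accept with the same number succeeds,
    # but repeating the prepare is rejected (1 is no longer > promised).
    a = Acceptor()
    assert a.on_prepare(1) == ('promise', 1, None)
    assert a.on_accept(1, 'v') == ('accepted', 1, 'v')
    assert a.on_prepare(1) == ('reject', 1, None)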

Trade-offs/Influence:
- The Paxos algorithm has been influential in the design of distributed systems built for datacenters. Used in Chubby and, in variant form, in ZooKeeper, Paxos-style consensus has a wide range of applications such as leader election and distributed locks.

Monday, September 19, 2011

Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

Motivation:

This paper presents a proof of the CAP theorem, which states that it is impossible for a web service to simultaneously guarantee consistency, availability, and partition tolerance.

Main points:
- The proof of the CAP theorem is divided into two parts: one for asynchronous networks and one for partially synchronous networks.
- For asynchronous networks (where there are no clocks and messages can be arbitrarily delayed), it is shown that it is impossible to guarantee availability and atomic consistency in all fair executions. This is proved by considering two nodes separated by a partition: if no messages are exchanged, a write at the first node followed by a read at the second cannot return consistent results (a toy version of this argument appears after this list).
- It is possible in an asynchronous network to provide any two of the three guarantees by relaxing the third. A centralized algorithm where all data is written to a central node and other nodes forward requests to it satisfies consistency and partition tolerance (CP) and, in the absence of partitions, consistency and availability (CA). Web-cache-like systems where stale data could be returned are an example of an available, partition-tolerant (AP) system.
- For partially synchronous networks, we can introduce the idea of timeouts and local clocks, but the CAP theorem still holds.
- A notion of Delayed-t consistency is introduced: if no messages are lost for a time period t, then the partial ordering among operations is maintained correctly. This is an example of a weaker consistency model where consistency is sacrificed but there is a bound on how stale the returned data can be.
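
Here is a toy Python rendering of the two-node argument above: the replicas must answer every request (availability), but the partition drops their replication messages, so a read at the second replica misses a completed write at the first. This is only an illustration of the intuition, not the paper's formal proof.

    # Two replicas that stay available during a partition cannot stay
    # atomically consistent: g2 never learns about g1's write.
    class Replica:
        def __init__(self):
            self.value = 0
            self.peers = []

        def write(self, v, partitioned):
            self.value = v
            if not partitioned:
                for peer in self.peers:
                    peer.value = v  # replication message is delivered
            return 'ok'  # must respond either way to remain available

        def read(self):
            return self.value  # must respond, even if possibly stale

    g1, g2 = Replica(), Replica()
    g1.peers, g2.peers = [g2], [g1]
    g1.write(42, partitioned=True)      # the write completes at g1 only
    assert g1.read() == 42 and g2.read() == 0  # inconsistent responses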

Trade-offs/Influence:
- The CAP theorem is the basis on which eventually consistent storage systems like BigTable and PNUTS have been built. I think it has been very influential over the last ten years and has also led to the growth of key-value stores and other NoSQL systems.
- More fundamentally, the CAP theorem has led to consistency being sacrificed in favor of availability (it is not possible to prevent all network partitions). This is interesting because many traditional storage systems and databases prioritized consistency. However, for Internet services where availability is more important and the cost of presenting an inconsistent view is not very high (delayed tweets, or a stale friend count on Facebook), companies like Google and Amazon have found it more beneficial to run eventually consistent systems.

Cluster-based Scalable Network Services


Motivation:
This paper looks at how to design a cluster for Internet services and how to provide software layers which can be shared across all the services in the cluster.

Main Points
- The three main qualities that a network service should have are scalability, availability and cost-effectiveness.
- The paper describes the use of commodity building blocks to design a cluster, with the insight that commodity machines give the best price-performance ratio.
- This leads to a set of challenges in the software layer, which include load balancing, scheduling, and maintaining shared state across a large number of machines.
- The cluster architecture consists of a pool of worker nodes, shared cache servers and frontends. There is a centralized manager used for load-balancing and a common user-profile database.
- This paper also proposes the use of BASE semantics for data stored in the cache and workers. This increases the availability of data and makes it easier to scale.

Trade-offs/Influence:
- The architecture described here has been very influential and is in fact pretty close to Google's serving infrastructure (as of 2003). Even the Mesos design has similar goals of provisioning a cluster for multiple applications.
- The idea of using commodity building blocks and providing transparent fault tolerance is the basis for the design of MapReduce-like systems.
- BASE semantics have also been very influential and led in turn to the CAP theorem, which has informed the design of storage systems like Bigtable, PNUTS, etc.

Wednesday, September 14, 2011

The Datacenter Needs an Operating System

Motivation:
With a large number of Internet, enterprise and scientific services deployed in datacenters, this paper motivates the need for an operating system in the datacenter to provide ease of development and efficient sharing.

Main ideas:
- There is a growing diversity of applications deployed in the datacenter, and, much as with the personal computer, this diversity creates the need for an operating system.
- There are four main roles envisioned for a datacenter OS
   a. Resource sharing: Efficiently share cluster resources among all the jobs and provide optimal scheduling and fairness.
   b. Data sharing: Provide common data interfaces (analogous to pipes) to allow different applications to be chained together.
   c. Programming abstractions: Simplify development of new programming frameworks by providing standard APIs and services.
   d. Debugging and monitoring: Provide support for end-to-end debugging; this could potentially require a clean-slate approach that imposes some restrictions.
- Finally, the paper also identifies how the academic research community can contribute to this goal, given that most of this software is currently being built in industry.

Trade-offs/influence:
The main trade-off I see here is between the challenge of coming up with standard interfaces and the benefits we would gain from them. The paper mentions that there are many different software stacks today performing some of the functions mentioned above, e.g., Google's GFS/MapReduce stack, Hadoop's stack, etc. However, as datacenters are owned by a single organization, and most of the software is developed in-house at places like Google and Amazon, there is less of a need to come up with standard interfaces.
Further, PC operating systems are driven by a commodity business model where the user can assemble a computer from parts bought from different vendors and run a large number of applications. This encourages a data- and resource-sharing model that application developers can rely on. Similarly, a standard sharing model for datacenters will develop once users and application developers need to run multiple programs at the same time.
Finally, I think it is important to have an operating system for the datacenter as it would prevent vendors from locking customers into their proprietary solutions. For example, if one wants to use both Amazon AWS and Microsoft Azure (say, to provide primary-backup failure handling), one cannot run the same application across both frameworks today. Having a more standard model for describing data and resources could help users switch easily between cluster computing providers.

Wednesday, September 7, 2011

The Datacenter as a Computer - Chapters 3,4,7

Motivation:
Chapters 3, 4, and 7 of the book introduce the hardware building blocks and provide details on the power infrastructure used in a datacenter. Further, they discuss the different failure modes of a warehouse scale computer and motivate the need for building fault-tolerant software.

Main ideas:
- The server hardware used in warehouse scale computing is typically the low-end server, for cost-efficiency reasons. Using the TPC-C price/performance benchmark, one can see that low-end servers are up to 20x more efficient than high-end servers.
- However, there are limits to how low-end the components can be, and using hardware from embedded systems may not yield much benefit. This is because there are parts of a computation that are difficult to parallelize, and latency would suffer if lower-power processors were used.
- The power supply to a warehouse scale computer has many redundancy levels and the architecture of the datacenter is optimized for more efficient cooling.
- Even though low-end hardware components are used, software and configuration failures are more frequent than hardware failures. Further, while hardware failures are largely statistically independent, many of the software or configuration related failures are correlated and could lead to a loss of availability.
- A survey of machine restart events in a Google datacenter showed that 95% of the machines restart less often than once a month, but the distribution has a long tail. The average machine availability was found to be 99.84%. For a service which runs on 2000 machines, this means a machine failure roughly once every 2.5 hours, and software design needs to take such frequent failures into account (see the arithmetic below).
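
To spell out the arithmetic (assuming failures are independent, which is itself only an approximation): one machine-level failure every 2.5 hours across 2000 machines corresponds to a per-machine MTBF of roughly

    \mathrm{MTBF} \approx 2000 \times 2.5\,\mathrm{h} = 5000\,\mathrm{h} \approx 208\ \mathrm{days},

i.e., about 1.75 restarts per machine per year, which is consistent with most machines restarting less than once a month plus a long tail of less reliable machines.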

Tradeoffs, Influence:
- The most important trade-off in the design of a warehouse scale computer is the choice of hardware components used for it. This work shows that many factors influence such a decision, and designers need to find a sweet spot across a large space of choices.
- Software components like file systems have been redesigned to handle failures (GFS), and this leads to the trade-off between consistency, availability, and tolerance to network partitions (the CAP theorem). Storage systems like Bigtable and Yahoo!'s PNUTS have shown the importance of availability with eventual consistency.

Warehouse-Scale Computing: Entering the Teenage Decade

Motivation:
The ISCA 2011 keynote by Luiz Andre Barroso describes the challenges that warehouse scale computing faces in the next ten years. Tracing the history of warehouse scale computing from the early 2000s, the talk offers a window into the changes that have taken place over the last decade.


Main Ideas:
- Importance of low latency: The biggest idea I took away from the talk was the stress on the importance of low-latency computation and I/O. For computation, this means that wimpy cores are not always good enough, and brawny cores make it easier to exploit request-level parallelism.
- In terms of I/O, the advent of flash and how it is integrated into the storage hierarchy is an important problem. While flash has very good random I/O latency compared to disks, its tail latency is high due to slow erase cycles (sometimes worse than disk).
- Power management: Over the last decade, power management has mostly looked at how to make the datacenter facility more efficient. The gains from this are smaller now, and we need to look at how to make individual machines more efficient. Further, there are other related problems, like avoiding the use of potable water for cooling and building servers that cope better with load peaks.
- Disaggregation: The main idea behind disaggregation is that resources are utilized better when they can be shared across the datacenter rather than kept in small silos. Disk resources can already be considered disaggregated, as network speeds allow accesses to remote disks to be almost as fast as local disk accesses. There is a lot of work happening in the area of full bisection bandwidth in datacenters, and faster networks could lead to more resources being disaggregated.

Trade-offs/influence:
- The changing nature of the workloads (e.g., Google Instant, Twitter) has led to the need for warehouse scale computers to support low latency operations along with other constraints such as energy efficiency.
- The example showing how much slower TCP is compared to RDMA highlights that many parts of the stack need to be restructured to build a low-latency warehouse scale computer. I think this idea could be a great influence on research and development over the next decade.

Wednesday, August 31, 2011

The Datacenter as a Computer (Chapters 1, 2)

Motivation:
Introduce the concept of warehouse scale computers and outline the
differences from traditional datacenters. The first two chapters
introduce the architecture and software components of such clusters.

Main Ideas:
- There is a broad shift to server-side computation due to the ease of
deployment and the growing influence of large scale workloads like
websearch.
- Warehouse-scale computers are defined as clusters consisting of
thousands of machines and services which are deployed across them.
With such large scale, failures are commonplace and are a part of the
design.
- The architecture of the warehouse scale computer consists of a set of
servers arranged in racks and connected by a rack-level switch. The
racks are then connected by a cluster-level switch which could be
over-subscribed by a factor of 5 or 10 (see the worked example after
this list).
- The bandwidth and latency varies a lot for memory / disk accesses
on a local machine vs. machines in the same rack vs. machines in other
racks.
- There are many levels of software components in a warehouse scale
computer. The resource manager, storage servers, and monitoring
infrastructure are examples of lower-level software components.
- Applications in warehouse scale clusters could be latency sensitive,
like websearch, or batch processing jobs, like indexing and article
similarity.
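
As a worked example of over-subscription (the numbers here are my own,
not the book's): if a rack of 40 servers, each with a 1 Gbps link to
the rack switch, shares an 8 Gbps uplink to the cluster switch, the
over-subscription factor is

    \frac{40 \times 1\,\mathrm{Gbps}}{8\,\mathrm{Gbps}} = 5,

leaving each server only about 200 Mbps of bandwidth for traffic that
leaves the rack.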

Trade-offs, influence
- This work analyzes many trade-offs involved in constructing a
warehouse scale computer. These include
- Using NAS storage solutions vs. a distributed file system like GFS.
GFS provides better availability and performance benefits for
locality-aware applications.
- Choice between buying and building software components for warehouse
scale computing. Building components in-house gives more flexibility
and ease of maintenance.
- Software components can also trade off advanced features against
performance. For example, Bigtable does not support multi-row
transactions and GFS does not have POSIX semantics; these features are
not frequently needed in the workloads observed at Google, Microsoft,
etc.
- The advent of warehouse scale computers has been a key driving factor
for the rise of cloud computing. With increasing operating costs for power
consumption, warehouse scale computers can help in building more
efficient systems to support newer applications.
However, most applications will retain a client-side component (the
authors acknowledge this), say in a mobile device or a web browser.
With changing hardware costs, there could be a shift in the most
efficient way to split the computation between the server and clients.

Tuesday, August 30, 2011

Above the Clouds: A Berkeley View of Cloud Computing

Problem Motivation:
- To define cloud computing and to analyze the trade-offs for cloud
providers and users

Main Ideas:
- Cloud computing refers to both the applications delivered as
services over the Internet and the hardware and systems software in
the datacenters that provide those services.
- Computing available as a utility is an important shift in IT
infrastructure; much as semiconductor foundries did for chip design,
it can enable many more users to build scalable web services.
- Cloud providers benefit from statistical multiplexing and bulk
purchasing. They can make money because it is cheaper to build larger
datacenters, where costs are amortized. Becoming a provider also helps
companies like Microsoft and IBM fulfill the needs of their existing
customers.
- The advantage for users is that cloud computing allows them to scale
their applications based on traffic patterns. This includes scaling up
and scaling down; depending on the interface, this can be automatic or
specified by the programmer.
- Economic factors have influenced the growth of cloud computing. Jim
Gray observed in 2003 that the cost of wide-area networking has fallen
more slowly than CPU and storage costs, so it is better to put
applications closer to the data. This trend continues (as seen in 2008
numbers), and when operational costs are included, it is cheaper to
rent than to buy clusters.
- There are ten obstacles identified for the adoption and growth of
cloud computing. Some of these are policy issues like licensing, while
others are research opportunities like a scalable storage solution.

Trade-offs, Influence
- The paper analyzes the economic trade-off for an organization between
using cloud computing and building its own datacenter. This trade-off
is captured in an equation (reproduced after this list), and based on
cluster utilization, companies can decide when it would be profitable
to switch to the cloud.
- The cloud computing interface also represents a trade-off between
flexibility and ease of use. While Amazon EC2 instances are close to
bare-metal hardware and provide flexibility, automatic scale-up and
scale-down are not easy to achieve.
On the other hand, Google AppEngine is designed for a restricted set of
applications, but provides automatic scaling.
- With the growing adoption of cloud computing, the ideas presented in
this paper are influential today, and I think they will remain
influential in the future.
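
For reference, the trade-off equation from the paper compares the
expected profit of running in the cloud against running in a
fixed-capacity datacenter (quoting from memory, so the notation may
differ slightly from the paper's):

    \mathrm{UserHours}_{cloud} \times (\mathrm{revenue} - \mathrm{Cost}_{cloud})
      \ge \mathrm{UserHours}_{datacenter} \times
          \left(\mathrm{revenue} - \frac{\mathrm{Cost}_{datacenter}}{\mathrm{Utilization}}\right)

The left-hand side is the expected profit from using cloud computing;
on the right-hand side, the fixed datacenter cost is divided by average
utilization, since idle owned capacity still has to be paid for. When
utilization is low, the inequality favors the cloud.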