Wednesday, October 26, 2011

Piccolo - Building Fast, Distributed Programs with Partitioned Tables

Motivation:
This paper looks at the problem of building an easy-to-use programming abstraction for performing large-scale computation on in-memory data.

Main points
- With the growing size of DRAM in machines, the working sets of many applications fit in the aggregate memory of a large cluster. This creates the need for a programming model that is efficient and makes it easier for users to perform large-scale computation.

- Data-flow systems like MapReduce don't expose shared global state across workers. While that makes them easier to program, this paper argues that it limits both the range of applications and their performance. Traditional frameworks like MPI manage shared state through message passing but are hard to use.

- This paper proposes partitioned tables, distributed across the cluster, as the mechanism for reading and modifying shared state. The computation is structured around a single master, which acts as the control thread, and many worker nodes that execute 'kernels'. Workers can read and modify data in both local and remote partitions (a toy sketch follows).
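To make the model concrete, here is a minimal, single-process sketch of the partitioned-table idea. All names are illustrative, not Piccolo's actual API, and the real system places partitions on different workers and ships remote updates over the network; here everything is local.

```python
# Toy, in-process sketch of a Piccolo-style partitioned table.
# Names are hypothetical; in the real system partitions live on
# different workers and remote accesses go over the network.

class PartitionedTable:
    def __init__(self, num_partitions):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition(self, key):
        # Keys are sharded across partitions by hash.
        return hash(key) % len(self.partitions)

    def get(self, key, default=0):
        return self.partitions[self._partition(key)].get(key, default)

    def put(self, key, value):
        self.partitions[self._partition(key)][key] = value

# A "kernel" is ordinary sequential code; it accesses the table the
# same way whether the partition holding a key is local or remote.
def word_count_kernel(table, words):
    for w in words:
        table.put(w, table.get(w) + 1)

table = PartitionedTable(num_partitions=4)
word_count_kernel(table, ["to", "be", "or", "not", "to", "be"])
print(table.get("to"))  # -> 2
```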

- Since many workers can update the same key concurrently, write-write conflicts can arise. Piccolo resolves such conflicts with user-defined accumulation functions, sketched below.
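A minimal sketch of the accumulation idea, using a sum as the merge function. The API shown is illustrative, not Piccolo's; the point is that concurrent updates to one key are merged rather than overwriting each other.

```python
# Sketch of conflict resolution via a user-defined accumulator:
# concurrent updates to the same key are merged with the accumulator
# instead of clobbering one another (illustrative API).

class AccumulatedTable:
    def __init__(self, accumulate, initial):
        self.data = {}
        self.accumulate = accumulate   # should be commutative/associative
        self.initial = initial

    def update(self, key, delta):
        current = self.data.get(key, self.initial)
        self.data[key] = self.accumulate(current, delta)

    def get(self, key):
        return self.data.get(key, self.initial)

# Two workers contribute rank mass to the same page; with a sum
# accumulator the final value is independent of arrival order.
ranks = AccumulatedTable(accumulate=lambda a, b: a + b, initial=0.0)
ranks.update("page-1", 0.3)   # update from worker A
ranks.update("page-1", 0.5)   # update from worker B
print(ranks.get("page-1"))    # -> 0.8 either way
```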

- Other features of the system: user-specified locality for kernels, load balancing via work stealing (illustrated below), and checkpoint/recovery using a consistent snapshot (the Chandy-Lamport algorithm).
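As a rough illustration of the load-balancing piece, the toy model below has each worker draining a queue of pending kernel instances for the partitions it owns, with an idle worker stealing a not-yet-started instance from the busiest queue. This is my own sketch, not the paper's code; as I understand it, in Piccolo the stolen partition's table data would also have to migrate to respect locality.

```python
from collections import deque

# Toy model of work stealing: each worker queues pending kernel
# instances for the partitions it owns; an idle worker takes a
# not-yet-started instance from the most loaded queue.

queues = {
    "worker-0": deque(["kernel@p0", "kernel@p2", "kernel@p4"]),
    "worker-1": deque([]),   # finished its own partitions early
}

def steal(idle_worker, queues):
    busiest = max(queues, key=lambda w: len(queues[w]))
    if busiest != idle_worker and queues[busiest]:
        task = queues[busiest].pop()      # take from the tail
        queues[idle_worker].append(task)
        return task
    return None

print(steal("worker-1", queues))  # -> 'kernel@p4'
```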

Trade-offs
- The main trade-off this paper addresses is performance vs. ease of use. As the authors acknowledge in their evaluation, MPI should give similar or better performance but is harder to use. While a key-value interface is definitely easier to use, the release-consistency semantics can lead to non-deterministic behavior, as the example below illustrates.
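To see why weaker consistency can surface as non-determinism, consider a merge function that is not commutative: the value a reader observes then depends on the order in which buffered updates from different workers are applied. A toy illustration:

```python
# If the merge function is not commutative (here: "last write wins"),
# the result depends on the order in which workers' buffered updates
# are flushed, so different runs may observe different values.

def last_write_wins(old, new):
    return new

def apply_updates(updates):
    state = {}
    for key, value in updates:
        state[key] = last_write_wins(state.get(key), value)
    return state["k"]

from_worker_a = [("k", 1), ("k", 2)]
from_worker_b = [("k", 9)]

print(apply_updates(from_worker_a + from_worker_b))  # -> 9
print(apply_updates(from_worker_b + from_worker_a))  # -> 2 (order-dependent)
```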

- Having global shared state leads to expensive checkpoint and recovery methods. The main insight here is that MapReduce has a simple fault-tolerance model because it restricts the amount of global state a task can update. It remains to be seen whether the benefits of exposing more shared state are worth the complexity of snapshot-based recovery (a sketch of the underlying marker idea follows).
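For reference, the snapshot machinery rests on the Chandy-Lamport marker protocol: on first seeing a marker, a process records its local state and forwards the marker, and messages arriving on a channel before that channel's marker are folded into the snapshot. The toy sketch below shows only the local-state half of that idea; recording of in-flight channel messages is omitted.

```python
# Toy single-worker sketch of the Chandy-Lamport idea: on the first
# marker, record local state; updates applied after the marker belong
# to the next epoch. (The real protocol also logs in-flight messages
# per channel, which this sketch omits.)

class Worker:
    def __init__(self):
        self.state = {}
        self.snapshot = None

    def apply_update(self, key, delta):
        self.state[key] = self.state.get(key, 0) + delta

    def on_marker(self):
        if self.snapshot is None:          # record state exactly once
            self.snapshot = dict(self.state)

w = Worker()
w.apply_update("k", 1)   # pre-marker update: included in the snapshot
w.on_marker()
w.apply_update("k", 5)   # post-marker update: excluded
print(w.snapshot)        # -> {'k': 1}
print(w.state)           # -> {'k': 6}
```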

- Overall, I think the bigger question is which applications we want to target with in-memory parallel programs. If the applications are iterative or sequential, then Spark or MapReduce would be good enough. However, for applications that require fine-grained updates to shared state, Piccolo would be a good programming model.