Top Critiques for Week 4: Pig and Dremel

(in no particular order)

Pig Latin is a language that lets the programmer express database queries as a sequence of procedural program units, so that they can be easily implemented using map-reduce. It is, in effect, a deconstruction of SQL into procedural steps. Pig is the system (built on the map-reduce architecture) that executes Pig Latin programs and can be used for ad-hoc data processing. Pig Latin's data model consists of atoms (single fields), tuples (sequences of fields), bags (collections of tuples), and maps (keys mapped to bags/tuples/atoms), along with language constructs such as LOAD, FOREACH, and GROUP for data processing. An exciting feature that comes with Pig is its interactive debugger, Pig Pen, which chooses a small example subset of the data and shows the output at each step, making debugging faster and more convenient.
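To make the dataflow style concrete, here is a minimal sketch that mimics a LOAD/GROUP/FOREACH pipeline in plain Python; the toy visit data and field names are invented for illustration, not taken from the paper:

```python
# Pig Latin's data model mimicked in Python: atoms are scalars,
# tuples are Python tuples, bags are lists of tuples, maps are dicts.

# LOAD: a bag of (user, url, time) tuples -- toy data.
visits = [
    ("amy", "cnn.com", 8),
    ("amy", "bbc.com", 10),
    ("fred", "cnn.com", 12),
]

# GROUP visits BY user: produces (group, bag-of-tuples) pairs.
grouped = {}
for user, url, time in visits:
    grouped.setdefault(user, []).append((user, url, time))

# FOREACH grouped GENERATE group, COUNT(visits)
counts = {user: len(bag) for user, bag in grouped.items()}
print(counts)  # {'amy': 2, 'fred': 1}
```

Each step corresponds to one Pig Latin statement, which is what makes the program easy to map onto a sequence of map-reduce jobs.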

Clear advantages of Pig are its procedural style of data processing, which allows nested operations; its convenient debugger; its built-in support for user-defined functions; and its ability to operate directly on input data without any schema.

However, the paper assumes, with only anecdotal evidence, that data analysts are more comfortable with procedural-style programming than with SQL-like queries for data processing. It also mentions the overhead Pig incurs when compiling Pig Latin into map-reduce jobs, but provides no data on how that overhead is offset by Pig's performance gains. Finally, user-defined functions can currently be written only in Java, which constrains the developer.

(credit: Ravi Theja M)


Dremel is an interactive ad-hoc data analysis system for read-only, nested data. It provides a SQL-like language for users to write ad-hoc queries, but it does not translate the queries into MapReduce programs. To enable interactive analysis over "trillion-row tables", Dremel combines two techniques: (1) multi-level execution trees and (2) columnar storage for nested data.

For its implementation, the paper first addresses how to use columnar storage for nested data sets.
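The core idea can be sketched as striping each nested record into per-path columns. This is a simplified illustration: real Dremel also stores repetition and definition levels alongside the values so that nesting can be reconstructed; the sample record below is invented for this sketch.

```python
# Stripe a nested record into flat columns, keyed by dotted field path.
def stripe(record, prefix="", columns=None):
    if columns is None:
        columns = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            stripe(value, path, columns)          # descend into nested group
        elif isinstance(value, list):
            for item in value:                    # repeated field
                if isinstance(item, dict):
                    stripe(item, path, columns)
                else:
                    columns.setdefault(path, []).append(item)
        else:
            columns.setdefault(path, []).append(value)
    return columns

doc = {"DocId": 10,
       "Name": [{"Url": "http://A"}, {"Url": "http://B"}]}
cols = stripe(doc)
print(cols)  # {'DocId': [10], 'Name.Url': ['http://A', 'http://B']}
```

Because each column is stored contiguously, a query touching only a few fields reads only those columns, which is where Dremel's I/O savings come from.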

Then the paper describes a multi-level serving tree for executing queries. A client sends a query to a root server, which reads metadata for the tables and distributes the query to the next level in the serving tree. Each level in the tree rewrites the query it receives and passes it on to the next level; finally, the leaf servers communicate with the actual storage layer. This execution model is especially well suited to aggregation queries, because they return relatively small results and, in many cases, the operations are commutative (e.g., count()).
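The serving-tree idea for a commutative aggregate can be sketched as follows; the tablets are toy in-memory lists rather than a real storage layer, and the fan-out value is arbitrary:

```python
# Multi-level execution tree for a commutative aggregate (COUNT).
def leaf_count(tablet):
    # Leaf servers scan actual storage; here a tablet is just a list.
    return len(tablet)

def serve(tablets, fan_out=2):
    # Each level merges partial results from the level below,
    # until a single result reaches the root.
    partials = [leaf_count(t) for t in tablets]
    while len(partials) > 1:
        partials = [sum(partials[i:i + fan_out])
                    for i in range(0, len(partials), fan_out)]
    return partials[0]

tablets = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
print(serve(tablets))  # 10
```

Because count() is commutative and each partial result is small, intermediate servers do very little work, which is what makes the tree scale to many leaves.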

Lastly, the paper briefly describes Dremel's query dispatcher. Query execution is scheduled in terms of slots, where a slot is a processing unit corresponding to an execution thread on a leaf server. If processing of a tablet is too slow, the dispatcher reschedules it on another server.
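A toy model of that rescheduling policy might look like the following; the server names, timing values, and threshold are all invented, and a real dispatcher would of course track running queries concurrently rather than inspect a static table:

```python
# Assign each tablet to the first server whose observed processing
# time is under a slowness threshold (toy model, no threads).
SLOW_THRESHOLD = 5.0

def dispatch(tablet_times, servers):
    # tablet_times[server][tablet] -> observed processing time (seconds)
    assignments = {}
    for tablet in tablet_times[servers[0]]:
        for server in servers:
            if tablet_times[server][tablet] <= SLOW_THRESHOLD:
                assignments[tablet] = server
                break
    return assignments

times = {"s1": {"t1": 2.0, "t2": 9.0},
         "s2": {"t1": 3.0, "t2": 1.5}}
print(dispatch(times, ["s1", "s2"]))  # {'t1': 's1', 't2': 's2'}
```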


The fundamental assumption behind a system like Dremel is that web-scale data analysis usually processes "temporary" data. Thus, it is better to process data in place, without an expensive data-import step (which can be seen as a kind of "pre-processing" to improve query execution over that data). Many systems are built on this assumption, such as Hive and Pig Latin. (Dremel, however, does not use MapReduce as its platform, because it focuses on "interactive" analysis.) I think the assumption is largely true, but if the workload becomes an incremental one, it would be better to spend some more time up front (preprocessing, indexing, etc.). Then again, it's a tradeoff that depends on the system's usage and/or intended uses.

(credit: Anonymous)

The authors seek to address the challenge of performing interactive data analysis at large scale. They present Dremel, an interactive query system that analyzes read-only, nested, large-scale data in place at very fast execution times by using a columnar storage format for nested data. Because data in web and scientific computing is often non-relational, a data model more flexible than SQL's was essential to Dremel's design; the authors propose the columnar storage format for nested data as a design choice, providing details of its operation and experiments on its efficiency.

Another design choice is to base the query language on SQL while still being able to query columnar nested storage efficiently. SQL is well known and offers many useful high-level querying features such as joins and aggregations.

While the authors leave multiple-pass aggregations for future work, I found it surprising that the paper lacks any experimental data for them, especially given the authors' claim that "most" Dremel queries are one-pass aggregations, which implies that multiple-pass aggregations are in use as well. I also thought the "natural question" of where the crossover point lies for Dremel's benefits as the number of fields grows was left largely unanswered, as the authors offer a mere "It depends" response. Combined with the MR/Dremel comparison on columnar vs. record-oriented data, where the authors intentionally access only a single field to maximize the apparent performance gain of columnar nested storage, I found myself not completely convinced of the performance gains based on the experiments provided in this paper.

(credit: Mike Over)