CSE 704: Top Critiques for Week 5

Top Critiques for Week 5: MonetDB and DataCyclotron

(in no particular order)

DataCyclotron is a novel approach to distributed query processing that takes advantage of improved network communication speeds. It is a self-organizing database architecture aimed at data systems with volatile workloads, and it tries to maximize resource utilization with little global supervision. It should be noted that it is still a research-level system in which many aspects are yet to be optimized.

DataCyclotron's ring topology is built around powerful processing nodes (with sufficient RAM), RDMA (the ability to access a node's main memory remotely; requires a special NIC), and an enhanced database underneath all this. The system is managed by a control layer between the network topology and the database, called the DataCyclotron layer (DaCy). DaCy continuously moves hot data around the ring network and assigns tasks to nodes depending on each data chunk's level of interest (LOI, which is measured by multiple metrics).
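To make the idea of a circulating hot set concrete, here is a minimal toy sketch. It is not the paper's algorithm: the function name `simulate_ring`, the `requests` layout, and the LOI rule (add 1 for every node that uses a chunk during a trip around the ring, then multiply by a decay factor) are all assumptions made for illustration only.

```python
from collections import deque

def simulate_ring(num_nodes, chunks, requests, cycles, threshold=0.5, decay=0.5):
    """Toy model of a hot set circulating in a ring (hypothetical LOI rule)."""
    # Assumed rule, not the paper's formula: LOI grows by 1 for each node
    # that uses the chunk in a cycle and is multiplied by `decay` afterwards.
    loi = {c: 1.0 for c in chunks}
    hot = deque(chunks)          # chunks currently travelling around the ring
    served = []                  # (cycle, node, chunk) for every request met
    for cycle in range(cycles):
        for chunk in list(hot):  # iterate over a copy so removal is safe
            for node in range(num_nodes):      # one full trip around the ring
                if chunk in requests.get((cycle, node), ()):
                    loi[chunk] += 1.0
                    served.append((cycle, node, chunk))
            loi[chunk] *= decay
            if loi[chunk] < threshold:
                hot.remove(chunk)              # cooled down: leaves the ring
    return list(hot), loi, served

# requests maps (cycle, node) -> set of chunk ids that node wants in that cycle
requests = {(0, 1): {"A"}, (1, 2): {"A"}, (2, 0): {"A"}}
hot, loi, served = simulate_ring(3, ["A", "B"], requests, cycles=3)
print(hot, loi, served)
```

In this run chunk A is requested in every cycle and keeps circulating, while chunk B is never requested and drops out of the hot set after the second cycle.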

The paper discusses various scenarios for a DaCy implementation, such as a varying number of nodes in the ring, different query types, and non-uniform workloads, and how each affects DaCy's performance. Finally, it measures DaCy's performance on related metrics such as data access latency, hot set management, ring extension, and throughput.

All results in the paper are based on a DaCy implementation on MonetDB, but how would it perform if implemented on traditional and more popular row-based databases? The motivation for choosing a ring topology over other network topologies is not discussed in full detail. Might a bus topology suit sequential tasks better? The results are based on a network connected via InfiniBand, but how well would DaCy perform on widely available Ethernet?

(credit: Ravi Theja M.)

The paper presents a dynamic way of processing distributed queries. It avoids the need to predict workload requirements statically, and instead adapts and self-organizes at run time. It exploits newer technologies such as Remote Direct Memory Access (RDMA), and adapts older mindsets, such as reducing network traffic at all costs, to current conditions. Along the way, it ditches the roles assigned to machines, allowing for superb load balancing, as a query is free to execute on any available machine, avoiding potential bottlenecks. It seems to favor a more holistic approach to handling data, and avoids optimizing for specific queries, opting instead to slowly adapt its structure to fit its given workload. The approach seems novel and practical, but the paper glosses over the fact that it is based on a homogeneous cluster. It would be interesting to see whether that assumption could be relaxed, and how the system would perform if it were. If your cluster is significantly large, having to upgrade it all at once may be a significant drawback. Similarly, if your cluster is, say, 70% of the way through its expected lifetime and you need to add nodes, you may have to purchase old hardware that will be in service for only a relatively short time.

(credit: Jon Logan)

Summary

The paper answers the need for an architecture-conscious database that tackles the memory wall, the imbalance between CPU speed and memory access speed, in applications that analyze large datasets, such as OLAP tools.
Rather than concentrating only on raw execution speed, MonetDB optimizes query execution by making efficient use of the system's memory hierarchy.
MonetDB makes use of vertical storage, a bulk query algebra, cache-conscious algorithms, and memory access cost modeling to overcome the memory wall.
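As a rough illustration of what vertical storage and bulk operators mean, the sketch below contrasts a row-wise layout with a column-wise one and runs a selection followed by a projection, each over a whole column at a time. The table, the helper names `select_gt` and `fetch`, and the use of plain Python lists are assumptions for illustration; this is not MonetDB's actual storage format or API.

```python
# Row-wise layout: one tuple per record, all attributes stored together.
rows = [(1, "ann", 31), (2, "bob", 45), (3, "cy", 27)]

# Vertical (decomposed) layout: one array per attribute.
# The position in each array plays the role of the row identifier.
columns = {
    "id":   [1, 2, 3],
    "name": ["ann", "bob", "cy"],
    "age":  [31, 45, 27],
}

def select_gt(column, bound):
    """Bulk selection: scan one whole column and materialize matching positions."""
    return [pos for pos, value in enumerate(column) if value > bound]

def fetch(column, positions):
    """Bulk projection: materialize the values found at the given positions."""
    return [column[pos] for pos in positions]

# SELECT name FROM t WHERE age > 30, touching only the two columns involved.
positions = select_gt(columns["age"], 30)
names = fetch(columns["name"], positions)
print(names)
```

The query never reads the `id` column, and each operator runs a tight loop over one contiguous array, which is the kind of access pattern that benefits from CPU caches.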

Review & Evaluation

The paper concentrates on describing the architecture of MonetDB rather than the system's performance.
Although it uses relational algebra, it supports SQL, XQuery and SPARQL at the front end.
It is not meant for general-purpose applications.
The size of the database depends on the underlying hardware platform.
The decomposed storage model will probably consume more space to store the same amount of data than other storage architectures.
Concurrent execution of queries is a limitation.
It uses a decomposed storage scheme built on memory-mapped files, with automated index selection and maintenance.
The relational operators materialize their results and are self-optimizing.
Analytical details about failure tolerance would be desirable.

(credit: Anonymous)