DataCyclotron is a novel approach to distributed query processing, motivated by improved network communication speeds. It is a self-organizing database architecture aimed at systems with volatile workloads, designed to maximize resource utilization with little global supervision. It should be noted that it is still a research-level system in which many aspects are yet to be optimized.
DataCyclotron's ring topology is built around powerful processing nodes (each with sufficient RAM), RDMA (the ability to access a node's main memory remotely, which requires a special NIC), and an enhanced database underneath. The system is managed by a control layer between the network topology and the database, called the DataCyclotron layer (DaCy). DaCy continuously moves hot data around the ring network and assigns tasks to nodes depending on each data chunk's level of interest (LOI), which is measured by multiple metrics.
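To make the hot-set mechanics concrete, here is a minimal Python sketch of the idea (not DaCy's actual code, which lives inside MonetDB and forwards chunks over RDMA): chunks circulate the ring, a node with queries waiting on a chunk raises the chunk's LOI when it passes by, and interest decays on every hop until a cold chunk drops out of the hot set. The constants, class names, and the single decay-based metric are illustrative assumptions; the paper's LOI combines multiple metrics.

```python
LOI_THRESHOLD = 0.5  # eviction cutoff for the hot set (hypothetical value)
DECAY = 0.8          # per-hop interest decay (hypothetical value)

class Chunk:
    """A data fragment circulating the ring, tagged with a level of interest."""
    def __init__(self, chunk_id):
        self.id = chunk_id
        self.loi = 1.0  # freshly loaded chunks start out hot

class Node:
    """A ring node: serves pending queries from passing chunks and bumps their LOI."""
    def __init__(self, node_id):
        self.id = node_id
        self.pending = set()  # chunk ids needed by queries queued at this node

    def receive(self, chunk):
        if chunk.id in self.pending:
            self.pending.discard(chunk.id)
            chunk.loi += 1.0   # a local query consumed the chunk: interest rises
        chunk.loi *= DECAY     # interest cools on every hop, used or not
        return chunk.loi >= LOI_THRESHOLD  # keep circulating?

def pump(nodes, hot_set):
    """One revolution of the ring; cold chunks fall back to persistent storage."""
    survivors = []
    for chunk in hot_set:
        keep = True
        for node in nodes:  # RDMA node-to-node forwarding, abstracted as a loop
            keep = node.receive(chunk)
            if not keep:
                break
        if keep:
            survivors.append(chunk)
    return survivors

nodes = [Node(i) for i in range(4)]
hot = [Chunk("orders"), Chunk("lineitem")]
nodes[2].pending.add("orders")  # a query waiting at node 2 needs the orders chunk
hot = pump(nodes, hot)
print([(c.id, round(c.loi, 2)) for c in hot])  # only the requested chunk stays hot
```

In this toy run, the unrequested lineitem chunk cools below the threshold within one revolution and is evicted, while the orders chunk is refreshed at node 2 and stays in circulation.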
The paper discusses various possible scenarios for the DaCy implementation, such as varying the number of nodes in the ring, different query types, and non-uniform workloads, and how each affects DaCy's performance. Finally, it measures DaCy's performance on related metrics such as data access latency, hot-set management, ring extension, and throughput.
All results in the paper are based on a DaCy implementation on MonetDB, but how would it perform if implemented on a traditional, more popular row-based database? The motivation behind choosing a ring topology over other network topologies is not discussed in full detail; might a bus topology suit sequential tasks better? The implementation results are based on a network connected via InfiniBand, but how well would DaCy perform over the more widely available Ethernet?
(credit: Ravi Theja M.)
The paper presents a dynamic way of processing distributed queries. It avoids the need to predict workload requirements statically, and instead adapts dynamically, self-organizing as the workload changes. It exploits newer technologies such as Remote Direct Memory Access (RDMA), and revisits old mindsets, such as reducing network traffic at all costs, in light of more current hardware. Along the way, it ditches fixed roles assigned to machines, allowing for superb load balancing, as a query is free to execute on any available machine, avoiding potential bottlenecks. It favors a more holistic approach towards handling data, and avoids optimizing for specific queries, instead opting to slowly adapt its structure to fit its given workload.

The approach taken by the paper seems novel and practical, but it glosses over the fact that it assumes a homogeneous cluster. It would be interesting to see if that assumption could be weakened, and how the system would perform if it were. I think that if your cluster is significantly large, the idea of having to upgrade it all at once may be a significant drawback. Similarly, if your cluster is, say, 70% of the way through its expected lifetime and you need to expand the number of nodes, you may have to purchase old hardware to handle the data for a relatively short lifecycle.
(credit: Jon Logan)
(credit: Anonymous)