Hive is an extension of the Hadoop framework that allows developers to define their MapReduce jobs in terms of SQL queries. Hive integrates SQL semantics into Hadoop as follows: data tables are stored (and partitioned) as plain HDFS files, SQL queries are compiled into a MapReduce plan (map: selection, projection; reduce: join, aggregation), and results are written back to HDFS files. The Hive system provides several benefits.
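To make the compilation strategy concrete, here is a minimal sketch (not Hive's actual planner) of how a query like `SELECT key, COUNT(*) FROM t WHERE v > 10 GROUP BY key` splits into a map phase doing selection and projection and a reduce phase doing aggregation; the data and function names are illustrative assumptions.

```python
def map_phase(rows):
    # selection (WHERE v > 10) and projection (emit only the key with a count of 1)
    for key, v in rows:
        if v > 10:
            yield (key, 1)

def reduce_phase(pairs):
    # stands in for shuffle + reduce: aggregate the counts per key
    counts = {}
    for key, n in pairs:
        counts[key] = counts.get(key, 0) + n
    return counts

rows = [("a", 5), ("a", 20), ("b", 30), ("b", 40), ("a", 11)]
result = reduce_phase(map_phase(rows))
# result == {"a": 2, "b": 2}  -- ("a", 5) is filtered out by the map phase
```

In the real system each phase runs distributed over HDFS splits; the point of the sketch is only the division of labor between map and reduce.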
HadoopDB is a new data analysis platform that combines two approaches currently available, parallel DBMSs and MapReduce. MapReduce is scalable, flexible, and low-cost, while parallel DBMSs show better performance and efficiency. The authors design HadoopDB as MapReduce running over relational DBMS instances, specifically PostgreSQL. HadoopDB extends Hive's HiveQL-to-MapReduce translator by pushing as much query processing as possible into each database instance (since certain operations, such as joins, are more efficient inside a DBMS).
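The split-execution idea can be sketched as follows, with sqlite3 standing in for the per-node PostgreSQL instances (an assumption purely for illustration): each "node" pushes an aggregation into its local database, and a Hadoop-style reduce step merges the partial results.

```python
import sqlite3

def node_partial_count(rows):
    # push the aggregation down into the node's local SQL engine
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (k TEXT, v INTEGER)")
    db.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return dict(db.execute("SELECT k, COUNT(*) FROM t GROUP BY k"))

def merge(partials):
    # Hadoop-side reduce: sum the per-node partial counts
    total = {}
    for p in partials:
        for k, n in p.items():
            total[k] = total.get(k, 0) + n
    return total

node1 = node_partial_count([("a", 1), ("a", 2), ("b", 3)])
node2 = node_partial_count([("b", 4), ("b", 5)])
combined = merge([node1, node2])
# combined == {"a": 2, "b": 3}
```

The design point is that only small partial aggregates cross the network, while the heavy relational work stays inside each single-node DBMS.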
The main idea behind HadoopDB is simple and straightforward, considering the debate between the parallel database and MapReduce research communities at the time. This paper showed the feasibility of a hybrid system that takes the good features from both technologies, parallel DBMSs and MapReduce. Even though HadoopDB's performance wasn't comparable to Vertica's in many cases, there is still room for improvement, as Hive and Hadoop are improving and PostgreSQL is not a column-store DBMS. Regarding fault tolerance, would it be possible for a parallel DBMS to recover from a node failure without restarting the entire query? Since Vertica (a parallel DBMS) outperforms HadoopDB, I wonder whether it is inherently impossible for parallel DBMSs to scale. One of the advantages of using Hadoop (MapReduce) is scalability and flexibility (handling unstructured data). But if we are dealing with structured data, as HadoopDB does, I think MapReduce loses one of its edges, because MapReduce-style flexibility is not required for structured data.
In this paper, the authors present an open-source data warehousing solution called Hive. Hive runs on Hadoop and makes processing large data sets on Hadoop much easier. Before Hive was introduced, such large data sets could be processed only through the MapReduce programming model, which is quite low level. Hive offers a data warehousing layer on top of this model with the help of metadata. The system was developed by Facebook as their data processing on the Hadoop infrastructure became very complicated. They introduce a new query language called HiveQL, which is similar to SQL but with a limited set of features. The paper explains the data model and query language with the help of a few HiveQL examples; MapReduce scripts can also be embedded as part of a query. This section is followed by an explanation of how data is stored and retrieved using SerDe (serialization and deserialization) functions, followed by a presentation of the system architecture. The final section explains how Hive is used at Facebook.
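The SerDe idea can be illustrated with a small sketch: Hive delegates row (de)serialization to pluggable functions so a table can live in an arbitrary HDFS file format. Below, a simple delimited text format stands in for one such format; the delimiter choice and helper names are illustrative assumptions, not Hive's API.

```python
FIELD_DELIM = "\x01"  # assumed field separator for this toy format

def serialize(row):
    # turn a tuple of column values into one text line
    return FIELD_DELIM.join(str(c) for c in row)

def deserialize(line, types):
    # turn a text line back into typed column values
    return tuple(t(f) for t, f in zip(types, line.split(FIELD_DELIM)))

line = serialize(("alice", 30))
row = deserialize(line, (str, int))
# row == ("alice", 30)
```

Because the query engine only ever sees the deserialized rows, swapping the storage format means swapping the SerDe pair, not rewriting queries.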
(credit: Sakthi Sundar Alagusundaram Ganesan)