Thursday, October 25, 2012


Now here is something that's really going to change the way people think about Hadoop. The BI world has always criticized Hadoop for not integrating well with traditional business intelligence processes, and has always felt that it cannot deliver results in real time. So here is the answer. Cloudera, the biggest player in the Hadoop world (just a personal view), has recently launched a real-time query engine for Hadoop called Impala. And the best part is that Cloudera has decided to distribute Impala under the Apache license, which means another treat for open source lovers. Although it is only a beta release, I think Impala is worth a try. Cloudera is using the Strata + Hadoop World event in New York City to unveil Impala. Cloudera claims that Impala can process queries 10 to 30 times faster than Hive/MapReduce. (Sounds quite impressive.)

Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for both batch-oriented and real-time queries. Impala is an addition to the tools available for querying big data. It does not replace the batch processing frameworks built on MapReduce, such as Hive. Hive and other frameworks built on MapReduce remain best suited for long-running batch jobs, such as Extract, Transform, and Load (ETL) jobs.

Cloudera Impala Diagram

The Impala solution is composed of the following components:
1. Impala State Store - The state store coordinates information about all instances of impalad running in your environment. This information is used to find data so the distributed resources can be used to respond to queries.
2. impalad - This process runs on the datanodes and responds to queries from the Impala shell. impalad receives requests from the database connector layer and schedules the tasks for optimal execution. Periodically, impalad updates the Impala State Store with its name and address.
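To make the relationship between the two components a little more concrete, here is a toy Python sketch of the "impalad reports its name and address to the state store" idea. It is not Impala's actual implementation or API: the class names `StateStore` and `Impalad`, the `heartbeat` and `live_daemons` methods, the staleness window, and the host addresses are all hypothetical and exist only for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StateStore:
    """Toy stand-in for the Impala State Store (hypothetical, not Impala code)."""

    # Maps a daemon name to (address, time of its last report).
    members: dict = field(default_factory=dict)

    def register(self, name, address, now=None):
        # Record, or refresh, the daemon's name and address.
        self.members[name] = (address, time.time() if now is None else now)

    def live_daemons(self, max_age_seconds, now=None):
        # A daemon counts as live if it reported within the given window.
        now = time.time() if now is None else now
        return sorted(
            name
            for name, (_, seen) in self.members.items()
            if now - seen <= max_age_seconds
        )


@dataclass
class Impalad:
    """Toy stand-in for one impalad process (hypothetical, not Impala code)."""

    name: str
    address: str

    def heartbeat(self, store, now=None):
        # Update the state store with this daemon's name and address.
        store.register(self.name, self.address, now)


# Made-up hosts; explicit timestamps keep the example deterministic.
store = StateStore()
Impalad("impalad-1", "10.0.0.1:22000").heartbeat(store, now=0)
Impalad("impalad-2", "10.0.0.2:22000").heartbeat(store, now=50)
print(store.live_daemons(max_age_seconds=30, now=60))  # ['impalad-2']
```

In this sketch, a daemon that stops reporting simply ages out of the live list. How the real state store tracks and expires impalad instances may well differ.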

More about Impala can be found on the Cloudera Impala page.
