Monday, July 2, 2012


Hive is a wonderful tool for those who like to perform batch operations over large amounts of data residing on a Hadoop cluster and who are comparatively new to the NoSQL world. Not only does it provide warehousing capabilities on top of a Hadoop cluster, it also offers a superb SQL-like interface, which makes it very easy to use and makes task execution feel familiar. But one thing which newbies like me have always wanted is support for the BETWEEN operator in Hive.

Since the release of version 0.9.0 earlier this year, Hive provides some new and very useful features, and the BETWEEN operator is one of them.

Apart from this, some new UDFs have also been added, viz. printf(), sort_array(), and java_method().
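Just to give you a feel for them, here is a quick sketch of these UDFs from the shell. (Hive 0.9 still needs a FROM clause even for constant expressions, so some_table below is just a placeholder - substitute any existing table of yours.)

hive> SELECT printf("employee no. %d", 7) FROM some_table LIMIT 1;
hive> SELECT sort_array(array(3, 1, 2)) FROM some_table LIMIT 1;
hive> SELECT java_method("java.lang.Math", "max", 2, 3) FROM some_table LIMIT 1;

Here printf() formats values C-style, sort_array() sorts an array in ascending order, and java_method() reflectively calls a static Java method.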

Not only this, the new Hive also has something special for our HBase users. Hive can now read primitive types stored in HBase tables in their native binary format, and it supports filter pushdown on keys.
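As a quick illustration (the table and column names here are made up for the example): to tell the HBase storage handler that a column holds a primitive stored in binary form, a "#b" suffix can be appended to its entry in the hbase.columns.mapping -

hive> CREATE TABLE hbase_demo(key int, val int)
hive> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
hive> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val#b")
hive> TBLPROPERTIES ("hbase.table.name" = "demo");

Without the "#b" suffix, the handler falls back to treating the stored bytes as UTF-8 strings.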

Now here is a quick, very basic example of how to use BETWEEN -

First create a simple demo table (assuming our input file is tab-delimited) :
hive> CREATE TABLE employee(uid INT, name STRING)
hive> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Now we'll load some data into this table :
hive> LOAD DATA LOCAL INPATH '/home/mohammad/file.txt' OVERWRITE INTO
hive> TABLE employee;

Now, we'll have a look over the contents of the table -
hive> SELECT * FROM employee;

1 abc
2 xyz
3 pqr
4 abc1
Time taken: 0.153 seconds

Now, suppose we want to fetch the details of those employees whose uids are between 2 and 4. To do that -
hive> SELECT * FROM employee
hive>  WHERE uid BETWEEN 2 AND 4;

Once this query is executed successfully we'll get the following output on our shell -
2 xyz
3 pqr
4 abc1
Time taken: 21.344 seconds
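For the record, BETWEEN is inclusive of both bounds, so the query above is just a shorter way of writing the comparison pair we had to use before 0.9.0 -

hive> SELECT * FROM employee
hive> WHERE uid >= 2 AND uid <= 4;

Hive 0.9.0 also accepts the negated form, i.e. uid NOT BETWEEN 2 AND 4.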

Well, that was a very basic example of how to use the BETWEEN operator provided by Hive. I know you guys are much better than me at writing SQL queries, so go ahead and KEEP HADOOPING.

NOTE : For further details you can visit the official home page.
