Saturday, June 30, 2012


Apart from its various useful features, HBase provides another advanced feature called COUNTERS.
HBase gives us a mechanism to treat columns as counters. Counters allow us to increment a column value with the least possible overhead.

Advantage of using Counters
While trying to increment a value stored in a table, we would normally have to lock the row, read the value, increment it, write it back to the table, and finally release the lock on the row so that it can be used by other clients. This could keep a row locked for a long period and may cause contention between clients trying to access the same row. Counters help us overcome this problem: increments are done under a single row lock, so write operations to a row are synchronized.

**Older versions of HBase supported calls which involved one RPC per counter update, but newer versions allow us to bundle multiple counter updates in a single RPC call.

Counters are limited to a single row, though we can update multiple counters in that row simultaneously. This means we operate on one row at a time when working with counters.

The HBase API provides the Increment class to perform increment operations.

**To increment columns of a row, instantiate an Increment object with the row to increment. At least one column to increment must be specified using the addColumn(byte[], byte[], long) method.
Alternatively, we can use the incrementColumnValue(row, family, qualifier, amount) method on an instance of the HTable class.
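To make both approaches concrete, here is a small sketch. It assumes a running HBase cluster, the HBase client jars on the classpath, and a table named "counters" with a column family "daily" (the table, family, row key, and qualifiers are just examples, not anything prescribed by HBase):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "counters");

// Option 1: increment a single counter via incrementColumnValue().
// Returns the new value after the increment.
long hits = table.incrementColumnValue(Bytes.toBytes("20120630"),
    Bytes.toBytes("daily"), Bytes.toBytes("hits"), 1L);

// Option 2: bundle several counters of the same row into one call
// using the Increment class.
Increment incr = new Increment(Bytes.toBytes("20120630"));
incr.addColumn(Bytes.toBytes("daily"), Bytes.toBytes("hits"), 1L);
incr.addColumn(Bytes.toBytes("daily"), Bytes.toBytes("errors"), 1L);
Result result = table.increment(incr);

table.close();
```

Both increments above happen atomically on the server side, which is exactly what saves us from the lock-read-increment-write-unlock cycle described earlier.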

Thursday, June 28, 2012

FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Quite often the very first blow that people starting their Hive journey face is this exception :

FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

And the worst part is that even after Googling for hours and trying out different solutions provided by different people, this error just keeps on haunting them. The solution is simple. You just have to change the ownership and permissions of the Hive directories which you have created for warehousing your data. Use the following commands to do that.

$HADOOP_HOME/bin/hadoop fs  -chmod g+w  /tmp
$HADOOP_HOME/bin/hadoop fs  -chmod 777  /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs  -chmod 777  /user/hive/warehouse
You are good to go now. Keep Hadooping.
NOTE : Please do not forget to tell me whether it worked for you or not.

Thursday, June 21, 2012


If you have successfully configured Hadoop on a single machine in pseudo-distributed mode and are looking for some help to use HBase on top of that, then you may find this writeup useful. Please let me know if you face any issues.

Since you are able to use Hadoop, I am assuming you have all the pieces in place. So we'll directly start with the HBase configuration. Please follow the steps shown below to do that:

1 - Download the HBase release from one of the Apache mirrors, then unzip it at some convenient location (I'll refer to this location as HBASE_HOME from now on).

2 - Go to the conf directory inside the unzipped HBASE_HOME and make these changes :

     - In the hbase-env.sh file, modify these lines as shown :
       export JAVA_HOME=/usr/lib/jvm/java-6-sun
       export HBASE_MANAGES_ZK=true
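3 - Point HBase at your HDFS. A typical conf/hbase-site.xml for pseudo-distributed mode might look like the fragment below; this is a sketch, and it assumes your NameNode is listening at hdfs://localhost:9000 (adjust the host and port to match your core-site.xml):

```xml
<configuration>
  <!-- Where HBase stores its data; must match your NameNode address -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <!-- A single-node HDFS cannot replicate beyond itself -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

After this, start HDFS first and then HBase with HBASE_HOME/bin/start-hbase.sh; since HBASE_MANAGES_ZK=true, HBase will start its own ZooKeeper for you.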

Wednesday, June 20, 2012


Sometimes you may try to aggregate data from different sources and dump it into a common location, say your HDFS. In such a scenario it is useful to create a directory inside HDFS corresponding to each host machine. To do this, Flume-NG provides a suitable escape sequence, %{host}. Unfortunately it was not working with early releases of Flume-NG. In that case the only solution was to create a custom interceptor that adds a host header key to each event, along with the corresponding hostname as the header value.

But luckily the guys at Cloudera did a great job and contributed an interceptor that provides this feature out of the box. Now we just have to add a few lines to our configuration file and we are good to go. For example, suppose we are collecting Apache web server logs from different hosts into a directory called flume inside HDFS. It would be quite fussy to figure out which log is coming from which host. So we'll use %{host} in the configuration files for the agents running on each machine. This will create a separate directory for each host inside the flume directory and store the logs from that host there. A simple configuration file may look like this :
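The following is a sketch of such an agent configuration. It assumes the Apache access log lives at /var/log/apache2/access.log and HDFS at hdfs://localhost:9000; the agent, source, channel, and sink names are arbitrary placeholders:

```properties
agent.sources = apache-src
agent.channels = mem-ch
agent.sinks = hdfs-sink

# Tail the Apache access log
agent.sources.apache-src.type = exec
agent.sources.apache-src.command = tail -F /var/log/apache2/access.log
agent.sources.apache-src.channels = mem-ch

# The host interceptor stamps each event with a "host" header
agent.sources.apache-src.interceptors = hostint
agent.sources.apache-src.interceptors.hostint.type = host
agent.sources.apache-src.interceptors.hostint.useIP = false

agent.channels.mem-ch.type = memory

# %{host} expands to the value of the "host" header, giving one
# HDFS directory per machine under /flume
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = mem-ch
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:9000/flume/%{host}
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
```

Run the same file (with the paths adjusted) on every machine, and each host's logs land in their own directory.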

Friday, June 15, 2012


The first HBase sink was committed to the Flume 1.2.x trunk a few days ago. In this post we'll see how we can use this sink to collect data from a file stored in the local filesystem and dump this data into an HBase table. We should have Flume built from the trunk in order to achieve that. If you haven't built it yet and are looking for some help, you can visit my other post that shows how to build and use Flume-NG.

First of all we have to write the configuration file for our agent. This agent will collect data from the file and dump it into the Hbase table. A simple configuration file might look like this :
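As a sketch, such a configuration might look like the following. It assumes the data file is /home/username/data.log and that the target HBase table "demo_table" with column family "cf" already exists (all names here are placeholders):

```properties
agent1.sources = tail1
agent1.channels = mem1
agent1.sinks = hbase1

# Read events from a local file
agent1.sources.tail1.type = exec
agent1.sources.tail1.command = tail -F /home/username/data.log
agent1.sources.tail1.channels = mem1

agent1.channels.mem1.type = memory

# The new HBase sink from the 1.2.x trunk
agent1.sinks.hbase1.type = org.apache.flume.sink.hbase.HBaseSink
agent1.sinks.hbase1.channel = mem1
agent1.sinks.hbase1.table = demo_table
agent1.sinks.hbase1.columnFamily = cf
agent1.sinks.hbase1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
```

Start the agent with this file and each line of data.log should turn up as a cell in demo_table.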


In this post we'll see how to build flume-ng from trunk and use it for data aggregation.

Prerequisites :

In order to do a hassle-free build we should have the following two things pre-installed on our box :
1- Thrift
2- Apache Maven 3.x.x

Build the project :

Once we are done with this, we have to build Flume-NG from the trunk. Use the following commands to do this :

$ svn co flume

This will create a directory named flume inside our /home/username/ directory. Now go inside this directory and start the build.

$ cd flume
$ mvn3 install -DskipTests

NOTE : If everything went fine you will receive a BUILD SUCCESS message after this. But sometimes you may get an error somewhat like this :

Thursday, June 14, 2012


The default MapReduce output format, TextOutputFormat, writes records as lines of text. Its keys and values may be of any type, since TextOutputFormat turns them into strings by calling toString() on them.

Each key-value pair is separated by a tab character. We can change this separator to some character of our choice using the mapreduce.output.textoutputformat.separator property (in the older MapReduce API this was mapred.textoutputformat.separator).

To do this you have to add this line in your driver, before the Job is created -
conf.set("mapreduce.output.textoutputformat.separator", ",");
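In context, the relevant part of a driver using the new (mapreduce) API might look like this sketch; the job name is a placeholder and the usual WordCount wiring is elided:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

Configuration conf = new Configuration();
// Must be set on the Configuration before the Job is created
conf.set("mapreduce.output.textoutputformat.separator", ",");

Job job = new Job(conf, "my job");
job.setOutputFormatClass(TextOutputFormat.class);
// ... set mapper, reducer, key/value classes, input and output paths as usual ...
```

With this in place the output lines will read key,value instead of the default tab-separated key	value.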

Error while executing MapReduce WordCount program (Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable)

Quite often I see questions from people who are comparatively new to the Hadoop world, or just starting their Hadoop journey, saying that they get the error specified below while executing the traditional WordCount program :

Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable

If you are also getting this error then you have to set your MapOutputKeyClass explicitly, like this :

- If you are using the older MapReduce API (org.apache.hadoop.mapred) then do this :
  conf.setMapOutputKeyClass(Text.class);

- And if you are using the new MapReduce API (org.apache.hadoop.mapreduce) then do this :
  job.setMapOutputKeyClass(Text.class);

REASON : The reason for this is that your MapReduce application is probably using TextInputFormat as the InputFormat class, and this class generates keys of type LongWritable and values of type Text by default, while your application expects keys of type Text. That's why you get this error.

NOTE : For detailed information you can visit the official MapReduce page.

Saturday, June 2, 2012

How to install maven3 on ubuntu 11.10

If you are trying to use the maven2 package that comes shipped with your Ubuntu 11.10 and it is not working as intended, you can try the following steps to install maven3 :

1 - First of all add the repository for maven3. Use following command for this -
     $ sudo add-apt-repository ppa:natecarlson/maven3

2 - Now update the repositories - 
     $ sudo apt-get update

3 - Finally install maven3 - 
      $ sudo apt-get install maven3

NOTE : To check whether the installation was done properly or not, issue the following command -
              $ mvn3 --version
