Thursday, May 24, 2012

Tips for Hadoop newbies (Part I).

A few months ago, after graduating, I wanted to try something new. That led me to start learning and working with Apache Hadoop, Apache's platform for distributed computing. Like a good student, I started by reading the documentation. Trust me, there are many good posts and documents available for learning Hadoop and setting up a Hadoop cluster. But even after following everything properly, I occasionally ran into problems for which I could not find solutions. I posted questions on the mailing lists, searched the internet, asked the experts, and finally got my issues resolved, but it took a lot of precious time and effort. Hence I decided to write these things down, so that anyone who is just starting out doesn't have to go through the same trouble.

Please share your comments and suggestions if you have any. They will help me refine these tips further and add to my knowledge, as I am still a learner.

1 - If there is some problem with the Namenode, first of all check your hosts file. Proper DNS resolution is very important for a Hadoop cluster to work properly. Then check whether ssh is working. For a pseudo-distributed mode configuration your hosts file should look like this - localhost ubuntu.ubuntu-domain ubuntu
     For a fully-distributed mode configuration, add the IP addresses and the hostnames accordingly.
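
A quick way to see how a hostname resolves from inside the JVM, which is also how Hadoop looks it up, is a small check like the one below. This is only a sketch: the class name HostCheck and its resolve method are my own illustration, not part of Hadoop, and "localhost" is just a default that you should replace with the hostnames from your own hosts file.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Minimal sketch (hypothetical helper, not part of Hadoop):
// prints how each hostname resolves from inside the JVM.
class HostCheck {

    // Returns the resolved IP address as text, or null if the name cannot be resolved.
    static String resolve(String hostname) {
        try {
            return InetAddress.getByName(hostname).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // "localhost" is only a default; pass your own hostnames as arguments.
        String[] names = args.length > 0 ? args : new String[] {"localhost"};
        for (String name : names) {
            String ip = resolve(name);
            System.out.println(name + " -> " + (ip == null ? "NOT RESOLVED" : ip));
        }
    }
}
```

Run it with the hostnames listed in your hosts file as arguments. If any of them prints NOT RESOLVED, or resolves to an address you did not expect, fix the hosts file before looking anywhere else.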

2 - Once you have formatted the filesystem successfully, issue the bin/ command to start the Namenode and Datanode, and then issue bin/ to start the Jobtracker and Tasktracker. People usually use the bin/ command to start these processes, but that command is now deprecated.

3 - If you are unable to start the Namenode after a system restart, add the following property to your core-site.xml file -
     - hadoop.tmp.dir : This is the base for temporary Hadoop directories. By default it lives under the
       tmp directory, which can cause problems because the tmp directory is cleared at every reboot.
       Pointing it to a directory outside tmp avoids this.
4 - If you are losing your data every time you restart your machine, add the following properties to your hdfs-site.xml file -
     Create two directories at some location of your choice and assign the paths of these directories as the  
     values of above specified properties. This is required because the default location for these properties 
     is defined by the hadoop.tmp.dir property. So it is always better to move these directories out of it. 

5 - If the Datanode does not start, change the permissions of the directory which you have assigned as the value for property to 755.
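
On the command line this is simply chmod 755 on that directory. If you would rather do it from Java, for example inside a setup utility, the sketch below does the same thing with the standard java.nio.file API. The class name DirPermissions and its methods are hypothetical names of mine, the directory path is whatever you pass in, and it only works on POSIX filesystems (Linux, macOS), not on Windows.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermissions;

// Minimal sketch (hypothetical helper): set a directory to 755 and read it back.
class DirPermissions {

    // Sets rwxr-xr-x (octal 755) on the given directory.
    static void setTo755(Path dir) {
        try {
            Files.setPosixFilePermissions(dir, PosixFilePermissions.fromString("rwxr-xr-x"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the current permissions in "rwxr-xr-x" form.
    static String describe(Path dir) {
        try {
            return PosixFilePermissions.toString(Files.getPosixFilePermissions(dir));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java DirPermissions <datanode_data_directory>");
            return;
        }
        // args[0] is the Datanode data directory you configured (your own path).
        Path dir = Paths.get(args[0]);
        System.out.println("before: " + describe(dir));
        setTo755(dir);
        System.out.println("after:  " + describe(dir));
    }
}
```

You need to own the directory (or be root) for the change to succeed, exactly as with chmod.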

6 - If you are trying to access data stored in HDFS through the HDFS API but get a message saying that the directory does not exist, your code is probably looking for the file in your local filesystem instead of HDFS. In that case, add the following two resources to your Hadoop Configuration object -
     - conf.addResource(new Path("path_to_the_core-site.xml_file"))
     - conf.addResource(new Path("path_to_the_hdfs-site.xml_file"))

In case you were able to deploy your Hadoop cluster properly, but are facing problems with other Hadoop ecosystem projects such as HBase, you can try the following -

1 - If you are facing HMaster related issues, again make sure that you don't have any DNS related problems. Also set the required properties properly in the hbase-site.xml file. For a pseudo-distributed mode HBase configuration, the content of the hbase-site.xml file could be something like this -

                <description>the value should correspond to the value of property
                                      in the hadoop core-site.xml file.
                 <description>Property from ZooKeeper's config zoo.cfg.
                                     The port at which the clients will connect.
                 <description>Property from ZooKeeper's config zoo.cfg.
                                      The directory where the snapshot is stored.

Also modify the following lines in the file -

      export HBASE_REGIONSERVERS=/path_to_the_hbase_directory/conf/regionservers

      export HBASE_MANAGES_ZK=true
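
With HBASE_MANAGES_ZK=true, HBase starts its own ZooKeeper, so after starting HBase it can help to confirm that something is actually accepting connections on the ZooKeeper client port. The sketch below is one way to do that with plain Java sockets. The class name PortCheck is hypothetical, and the defaults of localhost and 2181 are assumptions on my part (2181 is the usual ZooKeeper client port); use the host and port from your own configuration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Minimal sketch (hypothetical helper): checks whether a TCP port accepts connections.
class PortCheck {

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, or unresolvable host all end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed defaults; replace them with the values from your own setup.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        boolean up = isListening(host, port, 2000);
        System.out.println(host + ":" + port + (up ? " is accepting connections" : " is NOT reachable"));
    }
}
```

If the port is not reachable, the ZooKeeper client will keep logging connection errors, so check the ZooKeeper settings and the hosts file before digging into HMaster itself.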

2 - If HMaster is still not responding as expected, or if you are getting some exception like -

2011-12-06 13:59:29,979 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown. Call to localhost/ failed on local exception:
2011-12-06 13:59:30,577 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/
2011-12-06 13:59:30,577 WARN org.apache.zookeeper.ClientCnxn: Session 0x134127deaaf0002 for server null, unexpected error, closing socket connection and attempting reconnect Connection refused
at Method)
at org.apache.zookeeper.ClientCnxn$

This exception is thrown if there is some incompatibility between Hadoop and HBase. To overcome this problem, copy the hadoop-core-*.jar from the hadoop directory to the hbase/lib directory. Then kill the HBase processes and start them again. Now, if you get the exception shown below -
2011-12-06 14:51:05,778 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration.
This happens if the proper commons-configuration jar is not present in hbase/lib. To overcome this problem, copy commons-configuration-1.6.jar from the hadoop/lib directory to the hbase/lib directory. Then kill the HBase processes again and restart them.
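
If you want to confirm that the missing class is really visible after copying the jar, a tiny check like the following can help. It is only a sketch: ClasspathCheck is a hypothetical helper of mine, and the default class name is taken from the NoClassDefFoundError message above.

```java
// Minimal sketch (hypothetical helper): reports whether a class can be found
// on the classpath this program was started with.
class ClasspathCheck {

    // Returns true if the named class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default taken from the exception message; pass another class name to check it instead.
        String name = args.length > 0 ? args[0] : "org.apache.commons.configuration.Configuration";
        System.out.println(name + (isOnClasspath(name) ? " is" : " is NOT") + " on the classpath");
    }
}
```

Compile it, then run it with the HBase jars on the classpath, for example java -cp "hbase/lib/*:." ClasspathCheck, adjusting the path to wherever your hbase/lib directory is. If it still reports that the class is not on the classpath, the jar did not end up where HBase looks for it.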
I hope this post proves useful for those who are starting their Hadoop journey. In the next post of this series I will cover the issues that we normally face while configuring other Hadoop sub-projects such as Hive and Pig.
