Monday, July 23, 2012

HOW TO CONFIGURE HADOOP

You can find countless posts on this topic over the internet, and most of them are really good. But quite often, newbies face issues even after doing everything as specified. I was no exception. In fact, many times, friends who are just starting their Hadoop journey call me up and tell me that they are facing issues even after doing everything in order. So, I thought of writing down the things that worked for me. I am not going into detail, as there are many better posts that outline everything pretty well. I'll just show how to configure Hadoop on a single Linux box in pseudo-distributed mode.

Prerequisites :

1- Sun (Oracle) Java must be installed on the machine.
2- ssh must be installed and a keypair must already be generated.

NOTE : Ubuntu comes with its own Java implementation (i.e. OpenJDK), but Sun (Oracle) Java is the preferred choice for Hadoop. You can visit this link if you need some help on how to install it.

NOTE : You can visit this link if you want to see how to setup and configure ssh on your Ubuntu box.
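If you want to quickly verify both prerequisites before moving on, commands along these lines should do (just a sketch; adjust the key type and paths as you like) :

    cluster@ubuntu:~$ java -version                                    # should report the Sun/Oracle JVM
    cluster@ubuntu:~$ ssh-keygen -t rsa -P ""                          # passphrase-less keypair, accept the default location
    cluster@ubuntu:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys  # authorize the new key for localhost
    cluster@ubuntu:~$ ssh localhost                                    # should log you in without a password prompt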

Versions used :

1- Linux (Ubuntu 12.04)
2- Java (Oracle java-7)
3- Hadoop (Apache hadoop-1.0.3)
4- OpenSSH_5.9p1 Debian-5ubuntu1, OpenSSL 1.0.1 14

If you have everything in place, start following the steps shown below to configure Hadoop on your machine :

1- Download the stable release of Hadoop (hadoop-1.0.3 at the time of this writing) from the Apache repository and copy it to some convenient location, say your home directory.

2- Now, right-click the compressed file you have just downloaded and choose 'Extract Here'. This will create the hadoop-1.0.3 folder inside your home directory. We'll call this location HADOOP_HOME hereafter. So, your HADOOP_HOME=/home/your_username/hadoop-1.0.3
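If you prefer the terminal over the file manager, the same download and extraction can be done as shown below (the mirror URL is only an example; any Apache mirror carrying hadoop-1.0.3 will do) :

    cluster@ubuntu:~$ wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz
    cluster@ubuntu:~$ tar -xzf hadoop-1.0.3.tar.gz
    cluster@ubuntu:~$ ls ~/hadoop-1.0.3          # this is the directory we'll refer to as HADOOP_HOME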

3- Edit the HADOOP_HOME/conf/hadoop-env.sh file to set the JAVA_HOME variable to point to the appropriate JVM.

    export JAVA_HOME=/usr/lib/jvm/java-7-oracle

NOTE : Before moving further, create a directory inside your home directory, hdfs for instance, with subdirectories name, data and tmp. We'll use these directories as the values of properties in the configuration files.
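For example, assuming your username is cluster (substitute your own), all three directories can be created in one go :

    cluster@ubuntu:~$ mkdir -p ~/hdfs/name ~/hdfs/data ~/hdfs/tmp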

NOTE : Change the permissions of the directories created in the previous step to 755. Permissions that are too open or too restrictive may result in abnormal behavior. Use the following command to do that :

cluster@ubuntu:~$ sudo chmod -R 755 /home/cluster/hdfs/


4- Now, we'll start with the actual configuration process. Hadoop is configured using a set of configuration files present inside the HADOOP_HOME/conf directory. These are XML files containing a set of properties in the form of key-value pairs. We'll modify the following 3 files for our setup :

    I- HADOOP_HOME/conf/core-site.xml : Add the following lines between the <configuration></configuration> tags -

    <property>
            <name>fs.default.name</name>
            <value>hdfs://localhost:9000</value>
     </property>
     <property>
             <name>hadoop.tmp.dir</name>
             <value>/home/your_username/hdfs/tmp</value>
     </property>

     fs.default.name : This is the URI (protocol specifier, hostname, and port) that describes the NameNode for the cluster. Each node in the system on which Hadoop is expected to operate needs to know the address of the NameNode.

    hadoop.tmp.dir : A base for Hadoop's temporary directories. The value of this property defaults to /tmp, which is typically cleared on reboot, so it is better to point it to some other location.
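For reference, this is roughly what the complete core-site.xml should look like once the two properties are in place (the XML declaration and the empty <configuration> element are already present in the stock file) :

     <?xml version="1.0"?>
     <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

     <configuration>
          <property>
                  <name>fs.default.name</name>
                  <value>hdfs://localhost:9000</value>
          </property>
          <property>
                  <name>hadoop.tmp.dir</name>
                  <value>/home/your_username/hdfs/tmp</value>
          </property>
     </configuration>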

     II- HADOOP_HOME/conf/hdfs-site.xml : Add the following lines between the <configuration></configuration> tags -

     <property>
             <name>dfs.name.dir</name>
             <value>/home/your_username/hdfs/name</value>
      </property>
      <property>
             <name>dfs.data.dir</name>                  
             <value>/home/your_username/hdfs/data</value>
      </property>
      <property>
             <name>dfs.replication</name>
             <value>1</value>
      </property>

      dfs.name.dir : This is the path on the local file system where the NameNode instance stores its metadata. If not specified explicitly, it defaults to a directory under hadoop.tmp.dir.

    dfs.data.dir : This is the path on the local file system in which the DataNode instance should store its data. It also defaults to a directory under hadoop.tmp.dir, if not specified explicitly.

    III- HADOOP_HOME/conf/mapred-site.xml : Add the following lines between the <configuration></configuration> tags -

     <property>
              <name>mapred.job.tracker</name>
              <value>localhost:9001</value>
      </property>

      mapred.job.tracker : The host and port at which the JobTracker will run.

NOTE : Although there are many more properties that play an important role when working with a large, fully distributed cluster, the properties shown above are sufficient to set up a pseudo-distributed Hadoop cluster on a single machine.

5- The configuration part is over now. In order to proceed further, we have to format our HDFS first (like any other file system). Use the following command to do that :

    cluster@ubuntu:~/hadoop-1.0.3$ bin/hadoop namenode -format

If everything went OK, you'll see something like this on your terminal :

12/07/23 05:43:22 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.3
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by 'hortonfo' on Tue May  8 20:31:25 UTC 2012
************************************************************/
12/07/23 05:43:22 INFO util.GSet: VM type       = 64-bit
12/07/23 05:43:22 INFO util.GSet: 2% max memory = 17.77875 MB
12/07/23 05:43:22 INFO util.GSet: capacity      = 2^21 = 2097152 entries
12/07/23 05:43:22 INFO util.GSet: recommended=2097152, actual=2097152
12/07/23 05:43:22 INFO namenode.FSNamesystem: fsOwner=cluster
12/07/23 05:43:22 INFO namenode.FSNamesystem: supergroup=supergroup
12/07/23 05:43:22 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/07/23 05:43:22 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/07/23 05:43:22 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/07/23 05:43:22 INFO namenode.NameNode: Caching file names occuring more than 10 times 
12/07/23 05:43:22 INFO common.Storage: Image file of size 113 saved in 0 seconds.
12/07/23 05:43:23 INFO common.Storage: Storage directory /home/cluster/hdfs/name has been successfully formatted.
12/07/23 05:43:23 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.0.1
************************************************************/
cluster@ubuntu:~/hadoop-1.0.3$

6- Once the formatting is done, start the NameNode, Secondary NameNode and DataNode daemons using the command shown below :   

     cluster@ubuntu:~/hadoop-1.0.3$ bin/start-dfs.sh


This will emit the following lines on the terminal :

starting namenode, logging to /home/cluster/hdfs/logs/hadoop-cluster-namenode-ubuntu.out
localhost: starting datanode, logging to /home/cluster/hdfs/logs/hadoop-cluster-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /home/cluster/hdfs/logs/hadoop-cluster-secondarynamenode-ubuntu.out
cluster@ubuntu:~/hadoop-1.0.3$

7- To start the JobTracker and TaskTracker daemons, use :

      cluster@ubuntu:~/hadoop-1.0.3$ bin/start-mapred.sh

This will emit the following lines on the terminal :

starting jobtracker, logging to /home/cluster/hdfs/logs/hadoop-cluster-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /home/cluster/hdfs/logs/hadoop-cluster-tasktracker-ubuntu.out
cluster@ubuntu:~/hadoop-1.0.3$

NOTE : To check whether everything is working fine or not, we'll use the jps command (it ships with the JDK; on Ubuntu it is also available via the OpenJDK packages) :
cluster@ubuntu:~/hadoop-1.0.3$ jps
12537 Jps
12042 SecondaryNameNode
12173 JobTracker
11783 DataNode
11487 NameNode
12421 TaskTracker
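As an additional sanity check, you can try a couple of basic HDFS operations (the /test path is just an example) :

cluster@ubuntu:~/hadoop-1.0.3$ bin/hadoop fs -mkdir /test
cluster@ubuntu:~/hadoop-1.0.3$ bin/hadoop fs -ls /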

NOTE : Hadoop also provides web interfaces that we can use to monitor our cluster. Point your web browser to http://localhost:50070 to see the NameNode status and to http://localhost:50030 to see the MapReduce (JobTracker) status.



You can find all the information about your HDFS on this page. You can even browse the file system and download files from here.


This page shows the status of the JobTracker and includes information about all the MapReduce jobs which ran on the cluster.
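If you are working on a machine without a browser, a quick check from the terminal also tells you whether the web interfaces are up (assuming curl is installed) :

cluster@ubuntu:~$ curl -s http://localhost:50070/ | head     # NameNode web UI
cluster@ubuntu:~$ curl -s http://localhost:50030/ | head     # JobTracker web UI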

Please do not forget to provide me your valuable comments and suggestions.


8 comments:

  1. Thanks for providing such a wonderful tutorial.

    Now, I am able to run Hadoop on my Ubuntu 12.04 (32-bit) laptop.

    Please share some good tutorials where I can learn Hadoop.

    Replies
    1. You are always welcome. I normally try to write posts on the areas where folks usually get stuck, like this one. It's very difficult to cover each and everything as the Hadoop ecosystem is quite vast. Please let me know if you need something in particular; I'll definitely try to cover that. As far as learning about Hadoop is concerned, you can visit the official homepage at hadoop.apache.org. Apart from this, you can try "Hadoop: The Definitive Guide"; it's a great book. You can also visit my other posts; I have covered a few other things as well. HTH.
      Thank you.

  2. Hi, while formatting the namenode I am getting the below message:

    orugantp@ubuntu:~/hadoop-1.0.3$ bin/hadoop namenode -format
    13/02/28 09:35:43 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG: host = ubuntu/127.0.1.1
    STARTUP_MSG: args = [-format]
    STARTUP_MSG: version = 1.0.3
    STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by 'hortonfo' on Tue May 8 20:31:25 UTC 2012
    ************************************************************/
    Re-format filesystem in /home/orugantp/hdfs/name ? (Y or N) y
    Format aborted in /home/orugantp/hdfs/name
    13/02/28 09:35:57 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
    ************************************************************/

  3. Hi everyone!
    Thanks for your valuable posts. Can I know how to install Hive on Linux for Apache Hadoop? I am getting some problems while downloading it. So can you guys please help me out with a clear, step-by-step process?

    Thanks

  4. Download Hive from this link : http://apache.techartifact.com/mirror/hive/stable/hive-0.9.0.tar.gz

    Extract it.

    Set HADOOP_HOME
    Set HIVE_HOME
    Add Hive to your PATH

    Start Hive using : bin/hive (see the sketch below)
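    A rough shell sketch of those steps (the Hive version and directory names are just examples) :

    $ wget http://apache.techartifact.com/mirror/hive/stable/hive-0.9.0.tar.gz
    $ tar -xzf hive-0.9.0.tar.gz
    $ export HADOOP_HOME=/home/your_username/hadoop-1.0.3
    $ export HIVE_HOME=/home/your_username/hive-0.9.0
    $ export PATH=$PATH:$HIVE_HOME/bin
    $ hive          # or: cd $HIVE_HOME && bin/hive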

  5. Hi Mohammad, thanks for your tutorial :)! It is a very good one :)
    I'm new to Hadoop; I'm learning it now.
    I have a question: do we have to format HDFS ($ bin/hadoop namenode -format) every time before we go further with Hadoop, or is it only once, and why?

  6. You are welcome, Elma. And a big 'NO' regarding reformatting HDFS. It's a one-time activity, and doing it after storing data in HDFS will cause complete data loss. It is actually a storage-related thing and is applicable to all kinds of storage, not only HDFS. Have you ever noticed that before starting to use any hard drive or flash drive etc., you need to format it first? This is required to create a proper layout (creation of blocks, tracks, sectors etc.) which will be used for data storage. It's like buying a piece of land: you can obviously go there, but in order to actually live there you have to build a house on that land. Same is the case here.

