CloudFront: 2012

Wednesday, December 19, 2012

India to Open Up Its Data

Here comes a moment to be proud of. India has joined a select group of over 20 countries whose governments have launched open data portals. With a view to improving government transparency and efficiency, www.data.gov.in, (which is in beta right now) will provide access to a valuable repository of datasets, from government departments, ministries, and agencies, and autonomous bodies.

Data Portal India is a platform for supporting Open Data initiative of Government of India. The portal is intended to be used by Ministries/Department/Organizations of Government of India to publish datasets, and applications for public use. It intends to increase transparency in the functioning of Government and also opens avenues for many more innovative uses of Government Data to give different perspective.

The entire product is available for download at the Open Source Code Sharing Platform GitHub.

Open data will be made up of “non-personally identifiable data” collected, compiled, or produced during the normal course of governing. It will be released under an unrestricted license -- meaning it is freely available for everyone to use, reuse, or distribute, but citations will be required.

For a detailed info and all the terms and conditions, you can visit the official web site.

Exception in thread "main" java.lang.NoSuchFieldError: type at org.apache.hadoop.hive.ql.parse.HiveLexer.mKW_CREATE(HiveLexer.java:1602)

Now you have successfully configured Hadoop and everything is running perfectly fine. So, you decided to give Hive a try. But, oops...as soon as you try to create the very first table you find yourself into something like this :

Exception in thread "main" java.lang.NoSuchFieldError: type
at org.apache.hadoop.hive.ql.parse.HiveLexer.mKW_CREATE(HiveLexer.java:1602)
at org.apache.hadoop.hive.ql.parse.HiveLexer.mTokens(HiveLexer.java:6380)
at org.antlr.runtime.Lexer.nextToken(Lexer.java:89)
at org.antlr.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:133)
at org.antlr.runtime.BufferedTokenStream.sync(BufferedTokenStream.java:127)
at org.antlr.runtime.CommonTokenStream.setup(CommonTokenStream.java:132)
at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:91)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:547)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:438)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:416)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Need not worry. It's something related to antlr-*.jar which is present inside you HIVE_HOME/lib directory. Just make sure you don't have any other antlr-*.jar in your classpath. If it still doesn't work, download the latest version from the ANTLR website and put it inside your HIVE_HOME/lib. Restart Hive and you are good to go...

NOTE: If you want to see how to configure Hadoop you can go here

Friday, November 30, 2012

How to add minimize and maximize buttons in Gnome 3

In a recent post I had shown the steps to add the Power off option in the menu, which is absent by default, in Gnome 3. In this post we'll see another feature which most of us would love to have in our Gnome environment, especially guys coming from PC world. The minimize and maximize buttons.

In order to do that follow the specified below :

1. Open run box by pressing ALT+F2. Type dconf-editor in it and hit enter(Alternatively you can open your terminal and type there).
* If you don't have dconf install on your machine, you can install it using this command -
apache@hadoop:~$ sudo apt-get install dconf-tools

2. This will open the dconf-editor and you will see something like this on your screen -

Now, go to org->gnome->shell->overrides and click there. It will open the editor in the right pane. Here click on the value against button-layout. By default it'll show only close there.

Add minimize and maximize option here and you are done.

Let me know if you face any issue. Thank you.

How to add shutdown option in the Gnome 3 menu

While I am just in love with Gnome 3, one thing which I don't like about it is the absence of shutdown or power off option. But, the good news it that it can be added very easily. All you have to do is follow these three simple steps :

1. Add the repository -
  apache@hadoop:~$ sudo add-apt-repository ppa:ferramroberto/gnome3

2. Update it -
  apache@hadoop:~$ sudo apt-get update

3. Install the Gnome extensions -
  apache@hadoop:~$ sudo apt-get install gnome-shell-extensions

Now restart your machine and you are good to go.

NOTE : This is procedure if you are working on Ubuntu. You might have to use a different procedure to add the repository and do the installation depending on your OS.

Thank you.

Thursday, October 25, 2012

HOW TO INSTALL AND USE MICROSOFT HDINSIGHT (HADOOP ON WINDOWS)

HDInsight is Microsoft’s 100% Apache compatible Hadoop distribution, supported by Microsoft. HDInsight, available both on Windows Server or as an Windows Azure service, empowers organizations with new insights on previously untouched unstructured data, while connecting to the most widely used Business Intelligence (BI) tools on the planet. In this post we'll directory jump into the hands-on. But, if you want more on HDInsight, you can visit my another post here.

NOTE : OS used - Windows 7

So let's get started.

First of all go to the Microsoft Big Data page, and click on the Download HDInsight Server link (shown in the blue eclipse). You will see something like this :

Once you click the link it will guide you to the Download Center. Now, go to the Instructions heading and click on Microsoft Web Platform Installer.

This will automatically download and install all the required thing.

Once the installation is over open the Microsoft Web Platform Installer and go to the Top Right corner of the Microsoft Web Platform Installer UI where you will find a Search Box. Type Hadoop in there. This will show you Microsoft HDInsight for Windows Server Community Technology Preview bar. Select it and click on install. And you are done.

NOTE : It may take some time to install all the necessary components depending upon your connection speed.

On successful completion of HDInsight you can find the Hadoop Command Line icon on your desktop. Also you will find a brand new directory named Hadoop inside your C drive. This indicates that everything was OK and you are good to go.

TESTING TIME

It's time now to test HDInsight.

Step1. Go to the C:\Hadoop\hadoop-1.1.0-SNAPSHOT\bin directory :
c:\>cd Hadoop\hadoop-1.1.0-SNAPSHOT\bin

Step2. Now, start the daemons using start_daemons.cmd :
c:\Hadoop\hadoop-1.1.0-SNAPSHOT\bin>start_daemons.cmd

It will show you something like this on your terminal :

This means that your Hadoop processes have been started successfully and you are all set.

Let us use few of the Hadoop Commands to get ourselves familiar with Hadoop.

1. List all the directories, sub-directories and file present in Hdfs. And we do it using fs -ls :

2. Create a new directory inside Hdfs.We use fs -mkdir to do that :

You would have become familiar with the Hadoop shell by now. But I would suggest to go to the official Hadoop Page and try more in order to get a good grip. HAPPY HADOOPING..!!

HDInsight, Hadoop For Windows

Now, here is a treat for those who want to go Hadoop way but don't love Linux that much. Microsoft is rolling out the first preview editions of its Apache Hadoop integration for Windows Server and Azure in a marriage of open source and commercial code, after a year of beta testing. And they call it HDInsight. HDInsight is available both on Windows Server ( or Windows 7) or as an Windows Azure service. HDInsight will empower organizations with new insights on previously untouched unstructured data, while connecting to the most widely used Business Intelligence (BI) tools.

Microsoft collaborated with Hortonworks to make this happen. Last year Microsoft had announced that it will integrate Hadoop into its forthcoming SQL Server 2012 release and Azure platforms, and had committed to full compatibility with the Apache code base. The first previews have been shown off at the Hadoop World show in New York and are open for download. HDInsight delivers Apache Hadoop compatibility for the enterprise and simplify deployment of Hadoop-based solutions. In addition, delivering these capabilities on the Windows Server and Azure platforms enables customers to use the familiar tools of Excel, PowerPivot for Excel and Power View to easily extract actionable insights from the data.

Microsoft also announced that it is going to expand partnership with Hortonworks, to give customers access to an enterprise-ready distribution of Hadoop with the newly released solutions. Having said that, I hope this Microsoft+Hortonwork relationship gets growing so that we keep on getting great things like HDInsight.

You can find more about HDInsight here. And if you are planning to give HDInsight a shot you can visit my post on this which shows how to install and start using Hadoop on windows using HDInsight.

CLOUDERA IMPALA

Now here is something that's really gonna change the way people thing about Hadoop. Hadoop was always criticized by the BI world as it does not integrate well with traditional business intelligence processes, as they say. The BI world has always felt that Hadoop lacks the capability of delivering real time. So, here is the answer for that. The biggest player of the Hadoop world (just a personal view), Cloudera, has recently launched a Real-Time Query Engine for Hadoop, and they call it as Impla. And the best part is that Cloudera has decided to distribute Impala under Apache's licence, which means another treat for open source lovers. Although, it is just the beta release, I think it's worth giving Impala a try. Cloudera is using the Strata + Hadoop World event in New York City to unveil Impala. As Cloudera claims Impala can process queries 10 to 30 times faster than Hive/MapReduce. (Sounds quite impressive)

Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries. Impala is an addition to tools available for querying big data. Impala does not replace the batch processing frameworks built on MapReduce such as Hive. Hive and other frameworks built on MapReduce are best suited for long running batch jobs, such as those involving batch processing of Extract, Transform, and Load (ETL) type jobs.

Cloudera Impala Diagram

The Impala solution is composed of the following components :
1. Impala State Store - The state store coordinates information about all instances of impalad running in your environment. This information is used to find data so the distributed resources can be used to respond to queries.
2.impalad - This process runs on datanodes and responds to queries from the Impala shell. impalad receives requests from the database connector layer and schedules the tasks for optimal execution. Intermittently, the impalad updates the Impala State Store of its name and address.

More about Impala can be found out at the Cloudera Imapala page.

Tuesday, October 23, 2012

UBUNTU FOR ANDROID

Now, here is something that would really catch each open-source lover's attention. Team Ubuntu has come up with the idea of running Ubuntu on your Android device. Ubuntu for Android, as they call it, is an upcoming free and open source variant of Ubuntu designed to run on Android phones. It is expected to come pre-loaded on several phones. Android was shown at Mobile World Congress 2012. The best thing about Ubuntu for Android is that you don't have to reboot the device as both the operating systems use the same kernel.

Other salient features include :

1. Both Ubuntu and Android run at the same time on the device, without emulation or virtualization.

2. When the device is connected to a desktop monitor, it features a standard Ubuntu Desktop interface.

3. When the device is connected to a TV, the interface featured is the Ubuntu TV experience.

4. Ability to run standard Ubuntu Desktop applications, like Firefox, Thunderbird, VLC, etc.

5. Ability to run Android applications on the Ubuntu Desktop.

6. Make and receive calls and SMSs directly from the Desktop

In order to use Ubuntu for Android, the device should have :

1. Dual-core 1 GHz CPU

2. Video acceleration: shared kernel driver with associated X driver; OpenGL, ES/EGL

3. Storage: 2 GB for OS disk image

4. HDMI: video-out with secondary framebuffer device

5. USB host mode

6. 512 MB RAM

Here is an image showing Ubuntu running on an Android device, docked to a desktop monitor.

For more info you can visit the official page.

Tuesday, August 28, 2012

HOW TO INSTALL NAGIOS ON UBUNTU 12.04

Monitoring plays an important role in running our systems smoothly. It is always better to diagnose the problems and take some measures as early as possible, rather than waiting for things to go worse.

Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes. For a detailed information on Nagios you can visit the official documentation page here. I'll just cover the steps to install and get Nagios working on your Ubuntu box.

First of all install Nagios on your Ubuntu box using the following command :
$ sudo apt-get install -y nagios3

It will go through, and ask you about what mail server you want to use. You'll see something like this on your screen.

enter image description here

Pick one as per your requirements.

enter image description here

It will then ask you about the domain name you want to have email sent from. Again, fill that out based upon your needs.

It will ask you what password you want to use - put in a secure password. This is for the admin account nagiosadmin.

NOTE : Keep the account name and password name in mind as you'll need it to log in to Nagios.

enter image description here

It'll ask you to verify your password.

Once the install is all done, point your web browser to localhost/nagios3 or whatever the IP address/domain name of the server you installed Nagios on is.

At this point you'll be asked to enter your password. Enter your password and hit enter.

enter image description here

Once you've done that, you're in and you'll see something like this on your screen.

enter image description here

Nagios automatically adds in the 'localhost' to the config, and does load, current users, disk space, http and ssh checks.

enter image description here

Before leaving we have to do one more thing to make Nagios all set. We need to have it accept external commands so that we can acknowledge problems, add comments, etc. To do that, we need to edit a few files.

Start by opening /etc/nagios3/nagios.cfg with the following command.
$ sudo nano /etc/nagios3/nagios.cfg

Now, search for check_external_commands, and turn the check_external_commands=0 into check_external_commands=1.

Now, restart apache by running.

$ sudo service apache2 restart

Not done yet! We need to edit /etc/group. There should be a line like this in there.

nagios:x:114

Change it to
nagios:x:114:www-data

Save and close this file.

Now, we need to edit the /var/lib/nagios3/rw files permission with.
$ sudo chmod g+x /var/lib/nagios3/rw

And then (because of how permissions work) we need to edit the permissions of the directory above that with.
$ sudo chmod g+x /var/lib/nagios3

Now, restart nagios with.
$ sudo service nagios3 restart

And you should be good to go! Happy monitoring!

NOTE : Do not forget to let me know whether it worked for you or not.

Wednesday, August 1, 2012

How To Find A File In Linux(Through The Shell)

In *nix family of operating systems, we can find a file easily with the help of find command.

SYNTAX : $ find {directory-name} -name {filename}

To find a file in root directory :

Sometimes, it happens that we don't have any clue about the location of the file we are trying to search. In such a case we can search the entire system via the root directory (/). For example, if we want to search for a file named demo.txt, but we don't know where it could probably be present, then we would do something like this :

$ sudo find / -name 'demo.txt'

NOTE : Sometimes we may need special privileges to search for a particular file, so we'll use 'sudo' for that.

To find a file in a specific directory :

If we know the probable location of the file, but not sure about it, we can do this :

$ sudo find /path/of/the/directory -name 'demo.txt'

Wednesday, July 25, 2012

HOW TO RUN MAPREDUCE PROGRAMS USING ECLIPSE

Hadoop provides us a plugin for Eclipse that helps us to connect our Hadoop cluster to Eclipse. We can then run MapReduce jobs and browse Hdfs, through the Eclipse itself. But it requires a few things to be done in order to achieve that. Normally, it is said that we just have to copy hadoop-eclipse-plugin-*.jar to the eclipse/plugins directory in order to get things going. But unfortunately it did not work for me. When I tried to connect eclipse to my Hadoop cluster it threw this error :

An internal error occurred during: "Map/Reduce location status updater".
org/codehaus/jackson/map/JsonMappingException

You may face some different error, but it would be somewhat similar to this. This is because of the fact that some required jars are missing from the plugin that comes with Hadoop. Then, I tried a few things and it turned out to be positive.

So, I thought of sharing it, so that if anybody else is facing the same issue, can try it out. Just try the steps outlined below and let me know if it works for you.

First of all setup a Hadoop cluster properly on your machine. If you need some help on that just go here. Then download eclipse compatible with your environment from eclipse home. Also set your HADOOP_HOME to point to your hadoop folder.

Now, follow these steps :

1- Go to your HADOOP_HOME/contrib folder. Copy the hadoop-eclipse-plugin-*.jar somewhere and extract it. This will give a folder named hadoop-eclipse-plugin-*

2- Now, add following 5 jars from your HADOOP_HOME/lib folder to the hadoop-eclipse-plugin-*/lib folder, you have got just now after extracting the plugin :
commons-configuration-1.6.jar
commons-httpclient-3.0.1.jar
commons-lang-2.4.jar
jackson-core-asl-1.0.1.jar
jackson-mapper-asl-1.0.1.jar

3- Now, modify the hadoop-eclipse-plugin-*/META-INF/MANIFEST.MF file and change the Bundle-ClassPath to :

Bundle-ClassPath: classes /, lib / hadoop-core.jar, lib/commons-cli-1.2.jar, lib/commons-httpclient-3.0.1.jar, lib/jackson-core-asl-1.0.1.jar , lib/jackson-mapper-asl-1.0.1.jar, lib/commons-configuration-1.6.jar, lib/commons-lang-2.4.jar

4- Now, re 'jar' the package and place this new jar inside eclipse/plugin directory and restart the eclipse.

You are good to go now. Do let me know it it doesn't work for you.

NOTE : For details you can visit the official home page.

Thank you.

EDIT : If you are not able to see the job status at the JobTracker port(50070) you might find this post of mine useful.

Monday, July 23, 2012

HOW TO SETUP AND CONFIGURE 'ssh' ON LINUX (UBUNTU)

SSH (Secure Shell) is a network protocol secure data communication, remote shell services or command execution and other secure network services between two networked computers that it connects via a secure channel over an insecure network. The ssh server runs on a machine (server) and ssh client runs on another machine (client).

ssh has 2 main components :
1- ssh : The command we use to connect to remote machines - the client.
2- sshd : The daemon that is running on the server and allows clients to connect to the server.
ssh is pre-enabled on Linux, but in order to start sshd daemon, we need to install ssh first. Use this command to do that :

$ sudo apt-get install ssh
This will install ssh on your machine. In order to check if ssh is setup properly do this :

$ which ssh
It will throw this line on your terminal
/usr/bin/ssh

$ which sshd
It will throw this line on your terminal
/usr/bin/sshd

SSH uses public-key cryptography to authenticate the remote computer and allow it to authenticate the user, if necessary. Well, there are numerous post and links that explain about ssh in much detail. You can just google ssh if you want to learn about it. I'll now show the steps required to configure ssh.

1- First of all create a ssh-keypair using this command :
$ ssh-keygen -t rsa -P ""
Once you issue this command it will ask you for the name of directory where you want to store the key. Simple hit enter without giving any name, and your key will be created and saved to the default location i.e /.ssh directory inside your home directory. (Files and directories having names tarting with a dot (.) are hidden files and directories in Linux. To see these files and directories just go to your home folder and press Ctrl+h).

cluster@ubuntu:~$ ssh-keygen -t rsa -P ""

Generating public/private rsa key pair.

Enter file in which to save the key (/home/cluster/.ssh/id_rsa):

Hit enter, and you will see something like this :

Your identification has been saved in /home/cluster/.ssh/id_rsa.
Your public key has been saved in /home/cluster/.ssh/id_rsa.pub.
The key fingerprint is:
66:4f:72:26:2b:18:57:43:64:4f:3e:5a:58:d1:2c:30 cluster@ubuntu
The key's randomart image is:
+--[ RSA 2048]-----+
| .E.++ |
| o B. o |
| + =. |
| . + . |
| . . S + |
| + o O |
| . . . . |
| . |
| |
+---------------------+
cluster@ubuntu:~$

Your keypair has been created now. The keypair contains the keys in 2 different files present under the .ssh directory. These are id_rsa (the private key) and id_rsa.pub (the public key).

To connect to a remote machine, just give ssh command along with the hostname of that machine.

NOTE : Hostname of the machine to which you want to ssh must be present in your /etc/hosts file along with its IP address.

For example, if you want to connect to machine called 'client', do this :

cluster@ubuntu:~$ ssh localhost
cluster@localhost's password:

Enter the password of the client machine to login. Now, you are at the terminal of the client machine. Just use a few commands like ls or cat to cross check. Once you give the password and hit enter you will see something like this on your terminal :

cluster@ubuntu:~$ ssh client
cluster@client's password:
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-26-generic x86_64)

* Documentation: https://help.ubuntu.com/

90 packages can be updated.
10 updates are security updates.

Last login: Fri Jul 20 01:08:28 2012 from client
cluster@client:~$

NOTE : In some cases you may want to use passwordless ssh (for example while working with Apache Hadoop). To do that you just have to copy the public key, i'e the content of your id_rsa.pub file to the authorized_keys in the .ssh directory of the client machine. Use the following command to do that :

cluster@ubuntu:~$cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Now, if you do ssh client, you won't be asked for any password.

cluster@ubuntu:~$ ssh client
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-26-generic x86_64)

* Documentation: https://help.ubuntu.com/

90 packages can be updated.
10 updates are security updates.

Last login: Fri Jul 20 01:08:28 2012 from client
cluster@client:~$

HOW TO CONFIGURE HADOOP

You can find countless posts on the same topic over the internet. And most of them are really good. But quite often, newbies face some issues even after doing everything as specified. I was no exception. In fact, many a times, my friends who are just starting their Hadoop journey, call me up and tell me that they are facing some issues even after doing everything in order. So, I thought of writing down the things which worked for me. I am not going in detail as there are many better post that outline everything pretty well. I'll just show how to configure Hadoop on a single Linux box in pseudo distributed mode.

Prerequisites :

1- Sun(Oracle) java must be installed on the machine.
2- ssh must be installed and keypair must be already generated.

NOTE : Ubuntu comes with its own java compiler (i.e OpenJDK), but Sun(Oracle) java is the preferable choice for Hadoop. You can visit this link if you need some help on how to install it.

NOTE : You can visit this link if you want to see how to setup and configure ssh on your Ubuntu box.

Versions used :

1- Linux (Ubuntu 12.04)

2- Java (Oracle java-7)

3- Hadoop (Apache hadoop-1.0.3)
4- OpenSSH_5.9p1 Debian-5ubuntu1, OpenSSL 1.0.1 14

If you have everything in place, start following the steps shown below to configure Hadoop on your machine :

1- Download the stable release of Hadoop (hadoop-1.0.3 at the time of this writing) from the repository and copy it to some convenient location. Say your home directory.

2- Now, right click the compressed file which you have downloaded just now and choose extract here. This will create the hadoop-1.0.3 folder inside your home directory. We'll call this location as HADOOP_HOME hereafter. So, your HADOOP_HOME=/home/your_username/hadoop-1.0.3

3- Edit the /HADOOP_HOME/conf/hadoop-env.sh file to set the JAVA_HOME variable to point to appropriate jvm.

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

NOTE : Before moving further, create a directory, hdfs for instance, with sub directories viz. name, data and tmp. We'll use these directories as the values of properties in the configuration files.

NOTE : Change the permissions of the directories created in the previous step to 755. Too open or too close permissions may result in abnormal behavior. Use the following command to do that :

cluster@ubuntu:~$ sudo chmod -R 755 /home/cluster/hdfs/

HOW TO INSTALL SUN(ORACLE) JAVA ON UBUNTU 12.04 IN 3 EASY STEPS

If you have upgraded to Ubuntu 12.04 or just made a fresh Ubuntu installation you might want to install sun(oracle) java on it. Although Ubuntu has its own jdk, the OpenJdk, but there certain things that demand for sun(oracle) java. You can follow the steps shown below to do that -

1 - Add the “WEBUPD8″ PPA :

hadoop@master:~$ sudo add-apt-repository ppa:webupd8team/java

2 - Update the repositories :

hadoop@master:~$ sudo apt-get update

3 - Begin the installation :

hadoop@master:~$ sudo apt-get install oracle-java7-installer

Now, to test if the installation was ok or not do this :

hadoop@master:~$ java -version

If everything was ok you must be able to see something like this on your terminal :

hadoop@master:~$ java -version

java version "1.7.0_05"

Java(TM) SE Runtime Environment (build 1.7.0_05-b05)

Java HotSpot(TM) 64-Bit Server VM (build 23.1-b03, mixed mode)

hadoop@master:~$

Monday, July 2, 2012

BETWEEN OPERATOR IN HIVE

Hive is a wonderful tool for those who like to perform batch operations to process their large amounts of data residing on a Hadoop cluster and who are comparatively new to the NOSQL world. Not only it provides us warehousing capabilities on top of a Hadoop cluster, but also a superb SQL like interface which makes it very easy to use and makes our task execution more familiar. But, one thing which newbies like me always wanted to have is the support of BETWEEN operator in Hive.

Since the release of version 0.9.0 earlier this year, Hive provides us some new and very useful features. BETWEEN operator is one among those.

HBASE COUNTERS (PART I)

Apart from various useful features, Hbase provides another advanced and useful feature called COUNTERS.
Hbase provides us a mechanism to treat columns as counters. Counters allow us to increment a column value with least possible overhead.

Advantage of using Counters
While trying to increment a value stored in a table, we would have to lock the row, read the value, increment it, write it back to the table, and finally remove the look from the row, so that it can be used by other clients. This could cause a row to be locked for a long period and may possibly cause a clash between the clients tying to access the same row. Counters help us to overcome this problem as Increments are done under a single row lock, so write operations to a row are synchronized.

**Older versions of Hbase supported calls which involved one RPC per counter update. But the newer versions allow us to bundle multiple counters in a single RPC call.

Counters are limited to a single row, though we can update multiple counters simultaneously. This means that we operate on one row at a time when working with counters.

Hbase API provide the Increment class to perform Increment operations.

**To increment columns of a row, instantiate an Increment object with the row to increment. At least one column to increment must be specified using the addColumn(byte[], byte[], long) method.
Alternatively we can use the the incrementColumnValue(row, family, qualifier, amount) method through an instance of Htable class.

FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Quite often the very first blow which people, starting their Hive journey, face is this exception :

FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

And the worst part is that even after Googling for hours and trying out different solutions provided by different peoples this error just keeps on haunting them. The solution is simple. You just have to change the owner ship and permissions of the Hve directories which you have crated for warehousing your data. Use the following commands t do that.

$HADOOP_HOME/bin/hadoop fs  -chmod g+w  /tmp

$HADOOP_HOME/bin/hadoop fs  -chmod 777  /tmp

$HADOOP_HOME/bin/hadoop fs -chmod g+w  /user/hive/warehouse

$HADOOP_HOME/bin/hadoop fs  -chmod 777  /user/hive/warehouse

You are good to go now. Keep Hadooping.

NOTE : Please do not forget to tell me whether it worked for you or not.

Thursday, June 21, 2012

HOW TO CONFIGURE HBASE IN PSEUDO DISTRIBUTED MODE ON A SINGLE LINUX BOX

If you have successfully configured Hadoop on a single machine in pseudo-distributed mode and looking for some help to use Hbase on top of that then you may find this writeup useful. Please let me know if you face any issue.

Since you are able to use Hadoop, I am assuming you have all the pieces in place . So we'll directly start with Habse configuration. Please follow the steps shown below to do that:

1 - Download the Hbase release from one of the mirrors using the link shown below. Then unzip it at some convenient location (I'll call this location as HBASE_HOME now on) -
http://apache.techartifact.com/mirror/hbase/

2 - Go to the /conf directory inside the unzipped HBASE_HOME and do these changes :

- In the hbase-env.sh file modify these line as shown :
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HBASE_REGIONSERVERS
=/PATH_TO_YOUR_HBASE_FOLDER/conf/regionservers
export HBASE_MANAGES_ZK=true

HOW TO USE %{host} ESCAPE SEQUENCE IN FLUME-NG

Sometimes you may try to aggregate data from different sources and dump it into a common location, say your HDFS. In such a scenario it will be useful to create a directory inside the HDFS corresponding to each host machine. To do this FLEME-NG provide a suitable escape sequence, the %{host}. Unfortunately it was not working with early releases of FLUME-NG. In such case the only solution was to create a custom interceptor that adds a host header key to each event, along with the corresponding hostname as the header value.

But, luckily guys at Clouders did a great job and contributed an Interceptor to provide this feature out of the box. Now we just have to add few lines in our configuration file and we are good to go. For example, suppose we are collecting Apache web server logs from different hosts into a directory called flume inside the HDFS. It would be quite fussy to figure out which log is coming from which host. So we''ll use %{host} in our agent configuration files for the agents running on each machine. This will create a separate directory for each host inside the flume directory and store the logs from that host there itself. A simple configuration file may look like this :

HOW TO MOVE DATA INTO AN HBASE TABLE USING FLUME-NG

The first Hbase sink was commited to the Flume 1.2.x trunk few days ago. In this post we'll see how we can use this sink to collect data from a file stored in the local filesystem and dump this data into an Hbase table. We should have Flume built from the trunk in order to achieve that. If you haven't built it yet and looking for some help, you can visit my other post that shows how to build and use Flume-NG at this link :
http://cloudfront.blogspot.in/2012/06/how-to-build-and-use-flume-ng.html

First of all we have to write the configuration file for our agent. This agent will collect data from the file and dump it into the Hbase table. A simple configuration file might look like this :

HOW TO BUILD AND USE FLUME-NG

In this post we'll see how to build flume-ng from trunk and use it for data aggregation.

Prerequisites :

In order to to do a hassle free build we should have following two things pre-installed on our box :
1- Thrift
2- Apache Maven-3.x.x

Build the project :

Once we are done with this we have to build flume-ng from the trunk. Use following commands to do this :

$ svn co https://svn.apache.org/repos/asf/incubator/flume/trunk flume

This will create a directory flume inside our /home/username/ directory. Now go inside this directory and start the build process.

$ cd flume
$ mvn3 install -DskipTests

NOTE : If everything was fine then you will receive a BUILD SUCCESS message after this. But sometimes you may get an error somewhat like this :

HOW TO CHANGE THE DEFAULT KEY-VALUE SEPARATOR OF A MAPREDUCE JOB

The default MapReduce output format, TextOutputFormat, writes records as lines of text. Its keys and values may be of any type, since TextOutputFormat turns them to strings by calling toString() on them.

Each key-value pair is separated by a tab character. We can change this separator to some character of our choice using the mapreduce.output.textoutputformat.separator (In the older MapReduce API this was mapred.textoutputformat.separator).

To do this you have to add this line in your driver function -
Configuration.set("mapreduce.output.key.field.separator", ",");

Error while executing MapReduce WordCount program (Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable)

Quite often I see questions from people who are comparatively new to the Hadoop world or just starting their Hadoop journey that they are getting below specified error while executing the traditional WordCount program :

Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable

If you are also getting this error then you have to set your MapOutputKeyClass explicitly like this :

- If you are using the older MapReduce API then do this :
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);

- And if you are using the new MapReduce API then do this :
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

REASON : The reason for this is that your MapReduce application might be using TextInputFormat as the InputFormat class and this class generates keys of type LongWritable and values of type Text by default. But your application might be expecting keys of type Text. That's why you get this error.

NOTE : For detailed information you can visit the official MapReduce page.

Saturday, June 2, 2012

How to install maven3 on ubuntu 11.10

If you are trying to install maven2 that comes shipped with your ubutnu 11.10, and it is not working as intended you can try following steps to install maven3 :

1 - First of all add the repository for maven3. Use following command for this -
$ sudo add-apt-repository ppa:natecarlson/maven3

2 - Now update the repositories -
$ sudo apt-get update

3 - Finally install maven3 -
$ sudo apt-get install maven3

NOTE : To check whether installation was done properly or not, issue the following
command -
$ mvn --version

Thursday, May 24, 2012

Tips for Hadoop newbies (Part I).

Few moths ago, after completing my graduation I thought of doing something new. In quest of that I started learning and working on Apache's platform for distributed computing, the Apache Hadoop. Like a good student I started with reading the documentation. Trust me there are many good posts and documentations available for learning Hadoop and setting up a Hadoop cluster. But even after following everything properly, at times I ran into few problems and I could not find solutions for them. I posted questions on the mailing lists, searched over the internet, asked the experts and finally got my issues resolved. But it took a lot of precious time and efforts. Hence I decided to write down those things, so that if anyone who is just starting off doesn't have to face all those things.

Please provide me with your valuable comments and suggestions if you have any. That will help me a lot in refining things further, and to add on to my knowledge, as I am still a learner.

1 - If there is some problem with the Namenode, first of all check your hosts file. Proper DNS resolution is very important for Hadoop cluster to work properly.Then see whether ssh is working fine or not. For a pseudo-distributed mode configuration your hosts file should look this -
127.0.0.1 localhost
127.0.0.1 ubuntu.ubuntu-domain ubuntu
For fully-distributed mode configuration add the IP addresses and the hostnames accordingly.

Wednesday, December 19, 2012

Friday, November 30, 2012

Thursday, October 25, 2012

Tuesday, October 23, 2012

Tuesday, August 28, 2012

Wednesday, August 1, 2012

Wednesday, July 25, 2012

Monday, July 23, 2012

Thursday, July 19, 2012

Monday, July 2, 2012

Saturday, June 30, 2012

Thursday, June 28, 2012

FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Thursday, June 21, 2012

Wednesday, June 20, 2012

Friday, June 15, 2012

Thursday, June 14, 2012

Saturday, June 2, 2012

Thursday, May 24, 2012