Posts

Showing posts from 2012

India to Open Up Its Data

Here comes a moment to be proud of. India has joined a select group of over 20 countries whose governments have launched open data portals. With a view to improving government transparency and efficiency, www.data.gov.in (which is in beta right now) will provide access to a valuable repository of datasets from government departments, ministries, agencies, and autonomous bodies.


Data Portal India is a platform for supporting the Open Data initiative of the Government of India. The portal is intended to be used by Ministries/Departments/Organizations of the Government of India to publish datasets and applications for public use. It intends to increase transparency in the functioning of the Government and also opens avenues for many more innovative uses of Government data to give different perspectives.

The entire product is available for download at the Open Source Code Sharing Platform GitHub.

Open data will be made up of “non-personally identifiable data” collected, compiled, or produced during …

Exception in thread "main" java.lang.NoSuchFieldError: type at org.apache.hadoop.hive.ql.parse.HiveLexer.mKW_CREATE(HiveLexer.java:1602)

Now you have successfully configured Hadoop and everything is running perfectly fine. So, you decide to give Hive a try. But, oops... as soon as you try to create the very first table, you run into something like this :


Exception in thread "main" java.lang.NoSuchFieldError: type
        at org.apache.hadoop.hive.ql.parse.HiveLexer.mKW_CREATE(HiveLexer.java:1602)
        at org.apache.hadoop.hive.ql.parse.HiveLexer.mTokens(HiveLexer.java:6380)
        at org.antlr.runtime.Lexer.nextToken(Lexer.java:89)
        at org.antlr.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:133)
        at org.antlr.runtime.BufferedTokenStream.sync(BufferedTokenStream.java:127)
        at org.antlr.runtime.CommonTokenStream.setup(CommonTokenStream.java:132)
        at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:91)
        at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:547)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDrive…

How to add minimize and maximize buttons in Gnome 3

In a recent post I showed the steps to add the Power off option, which is absent by default, to the menu in Gnome 3. In this post we'll see another feature which most of us would love to have in our Gnome environment, especially those coming from the PC world: the minimize and maximize buttons.

To do that, follow the steps specified below :

1. Open the run box by pressing ALT+F2. Type dconf-editor in it and hit Enter (alternatively, you can open your terminal and type it there).
* If you don't have dconf installed on your machine, you can install it using this command -
apache@hadoop:~$ sudo apt-get install dconf-tools

2. This will open the dconf-editor and you will see something like this on your screen -

Now, go to org->gnome->shell->overrides and click there. It will open the editor in the right pane. Here click on the value against button-layout. By default it'll show only close there.


Add the minimize and maximize options here (for example, :minimize,maximize,close) and you are done.
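
The same change can also be made from the terminal. The sketch below assumes that the gsettings tool is installed and that your Gnome Shell version still exposes the org.gnome.shell.overrides schema mentioned above. The BUTTON_LAYOUT variable name is only for illustration.

```shell
# Buttons listed after the colon appear on the right side of the title bar.
BUTTON_LAYOUT=':minimize,maximize,close'

# Apply the layout only if gsettings is available on this machine.
if command -v gsettings >/dev/null 2>&1; then
    gsettings set org.gnome.shell.overrides button-layout "$BUTTON_LAYOUT" \
        || echo "Could not set button-layout; the schema may differ on this Gnome version." >&2
else
    echo "gsettings not found; use dconf-editor as described above." >&2
fi
```

If the command succeeds, the buttons should show up on newly opened windows; restarting the shell (ALT+F2, then r) may be needed on some setups.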

Let me know if you face any issue…

How to add shutdown option in the Gnome 3 menu

While I am just in love with Gnome 3, one thing I don't like about it is the absence of a shutdown or power off option. But the good news is that it can be added very easily. All you have to do is follow these three simple steps :

1. Add the repository -
apache@hadoop:~$ sudo add-apt-repository ppa:ferramroberto/gnome3

2. Update it -
apache@hadoop:~$ sudo apt-get update

3. Install the Gnome extensions -
apache@hadoop:~$ sudo apt-get install gnome-shell-extensions

Now restart your machine and you are good to go.

NOTE : This is the procedure if you are working on Ubuntu. You might have to use a different procedure to add the repository and do the installation depending on your OS.

Thank you.

HOW TO INSTALL AND USE MICROSOFT HDINSIGHT (HADOOP ON WINDOWS)

HDInsight is Microsoft’s 100% Apache compatible Hadoop distribution, supported by Microsoft. HDInsight, available both on Windows Server and as a Windows Azure service, empowers organizations with new insights on previously untouched unstructured data, while connecting to the most widely used Business Intelligence (BI) tools on the planet. In this post we'll jump directly into the hands-on part. But if you want more on HDInsight, you can visit my other post here.

NOTE : OS used - Windows 7

So let's get started.

First of all go to the Microsoft Big Data page, and click on the Download HDInsight Server link (shown in the blue ellipse). You will see something like this :



Once you click the link it will guide you to the Download Center. Now, go to the Instructions heading and click on Microsoft Web Platform Installer.



This will automatically download and install all the required things.

Once the installation is over, open the Microsoft Web Platform Installer and go to the Top Right co…

HDInsight, Hadoop For Windows

Now, here is a treat for those who want to go the Hadoop way but don't love Linux that much. Microsoft is rolling out the first preview editions of its Apache Hadoop integration for Windows Server and Azure in a marriage of open source and commercial code, after a year of beta testing. And they call it HDInsight. HDInsight is available both on Windows Server (or Windows 7) and as a Windows Azure service. HDInsight will empower organizations with new insights on previously untouched unstructured data, while connecting to the most widely used Business Intelligence (BI) tools.

Microsoft collaborated with Hortonworks to make this happen. Last year Microsoft announced that it would integrate Hadoop into its forthcoming SQL Server 2012 release and Azure platforms, and committed to full compatibility with the Apache code base. The first previews have been shown off at the Hadoop World show in New York and are open for download. HDInsight delivers Apache Hadoop compatibility for the e…

CLOUDERA IMPALA

Now here is something that's really going to change the way people think about Hadoop. Hadoop has always been criticized by the BI world for not integrating well with traditional business intelligence processes. The BI world has always felt that Hadoop lacks the capability of delivering results in real time. So, here is the answer to that. The biggest player of the Hadoop world (just a personal view), Cloudera, has recently launched a Real-Time Query Engine for Hadoop, and they call it Impala. And the best part is that Cloudera has decided to distribute Impala under the Apache license, which means another treat for open source lovers. Although it is just the beta release, I think it's worth giving Impala a try. Cloudera is using the Strata + Hadoop World event in New York City to unveil Impala. Cloudera claims that Impala can process queries 10 to 30 times faster than Hive/MapReduce. (Sounds quite impressive.)


Cloudera Impala provides fast, interactive SQL queries directly o…

UBUNTU FOR ANDROID

Now, here is something that would really catch every open-source lover's attention. Team Ubuntu has come up with the idea of running Ubuntu on your Android device. Ubuntu for Android, as they call it, is an upcoming free and open source variant of Ubuntu designed to run on Android phones. It is expected to come pre-loaded on several phones. Ubuntu for Android was shown at Mobile World Congress 2012. The best thing about Ubuntu for Android is that you don't have to reboot the device, as both operating systems use the same kernel.


Other salient features include :

1. Both Ubuntu and Android run at the same time on the device, without emulation or virtualization.
2. When the device is connected to a desktop monitor, it features a standard Ubuntu Desktop interface.
3. When the device is connected to a TV, the interface featured is the Ubuntu TV experience.
4. Ability to run standard Ubuntu Desktop applications, like Firefox, Thunderbird, VLC, etc.
5. Ability to run Android applications on the Ubu…

HOW TO INSTALL NAGIOS ON UBUNTU 12.04

Monitoring plays an important role in running our systems smoothly. It is always better to diagnose problems and take some measures as early as possible, rather than waiting for things to get worse.

Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes. For detailed information on Nagios you can visit the official documentation page here. I'll just cover the steps to install and get Nagios working on your Ubuntu box.

First of all install Nagios on your Ubuntu box using the following command :
$ sudo apt-get install -y nagios3

The installer will run and ask you which mail server you want to use. You'll see something like this on your screen.



Pick one as per your requirements.



It will then ask you about the domain name you want to have email sent from. Again, fill that out based upon your needs.

It will ask you what password you want to use - put in a secure passw…

How To Find A File In Linux (Through The Shell)

In the *nix family of operating systems, we can find a file easily with the help of the find command.

SYNTAX : $ find {directory-name} -name {filename}

To find a file in root directory :

Sometimes, it happens that we don't have any clue about the location of the file we are trying to search. In such a case we can search the entire system via the root directory (/). For example, if we want to search for a file named demo.txt, but we don't know where it could probably be present, then we would do something like this :

$ sudo find / -name 'demo.txt'

NOTE : Sometimes we may need special privileges to search for a particular file, so we'll use 'sudo' for that.

To find a file in a specific directory :

If we know the probable location of the file, but are not sure about it, we can do this :

$ sudo find /path/of/the/directory -name 'demo.txt'
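
Here is a small sketch of the searches above that can be tried without touching real files. The directory layout and file names are made up for illustration. The -type f option restricts results to regular files, and -iname (supported by GNU find) matches the name without regard to case.

```shell
# Create a throwaway directory tree to search in (hypothetical layout).
SEARCH_ROOT=$(mktemp -d)
mkdir -p "$SEARCH_ROOT/docs/old"
touch "$SEARCH_ROOT/docs/old/demo.txt" "$SEARCH_ROOT/docs/Demo.TXT"

# Exact, case-sensitive match on the file name.
EXACT_MATCHES=$(find "$SEARCH_ROOT" -type f -name 'demo.txt')

# Case-insensitive match: counts both demo.txt and Demo.TXT.
ANYCASE_COUNT=$(find "$SEARCH_ROOT" -type f -iname 'demo.txt' | wc -l)

echo "$EXACT_MATCHES"
echo "case-insensitive matches: $ANYCASE_COUNT"

# Remove the throwaway tree again.
rm -rf "$SEARCH_ROOT"
```

Since the sketch only searches a directory it owns, sudo is not needed here; it matters only when find has to descend into directories the current user cannot read.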


HOW TO RUN MAPREDUCE PROGRAMS USING ECLIPSE

Hadoop provides us a plugin for Eclipse that helps us connect our Hadoop cluster to Eclipse. We can then run MapReduce jobs and browse HDFS from within Eclipse itself. But it requires a few things to be done in order to achieve that. Normally, it is said that we just have to copy hadoop-eclipse-plugin-*.jar to the eclipse/plugins directory in order to get things going. But unfortunately it did not work for me. When I tried to connect Eclipse to my Hadoop cluster it threw this error :


An internal error occurred during: "Map/Reduce location status updater".
org/codehaus/jackson/map/JsonMappingException

You may face a different error, but it would be somewhat similar to this. This happens because some required jars are missing from the plugin that comes with Hadoop. I tried a few things and eventually got it working.



So, I thought of sharing it, so that anybody else facing the same issue can try it out. Just try the steps outlined below and let me…

HOW TO SETUP AND CONFIGURE 'ssh' ON LINUX (UBUNTU)

SSH (Secure Shell) is a network protocol for secure data communication, remote shell services or command execution, and other secure network services between two networked computers, which it connects via a secure channel over an insecure network. The ssh server runs on one machine (the server) and the ssh client runs on another machine (the client).

ssh has 2 main components :
1- ssh : The command we use to connect to remote machines - the client. 
2- sshd : The daemon that is running on the server and allows clients to connect to the server.
ssh is pre-enabled on Linux, but in order to start sshd daemon, we need to install ssh first. Use this command to do that :

$ sudo apt-get install ssh
This will install ssh on your machine. In order to check if ssh is set up properly do this :

$ which ssh
It will print this line on your terminal
/usr/bin/ssh

$ which sshd
It will print this line on your terminal
/usr/sbin/sshd
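
Both checks can be combined in a short script. This is only a sketch: report_binary is a made-up helper name, and it relies on the shell's built-in command -v rather than which.

```shell
# Print where a command lives, or say that it is missing.
report_binary() {
    if path=$(command -v "$1" 2>/dev/null); then
        echo "$1 -> $path"
    else
        echo "$1 -> not found"
        return 1
    fi
}

# Check the client and the daemon; a missing sshd usually means
# the server package has not been installed yet.
for binary in ssh sshd; do
    report_binary "$binary" || true
done
```

If sshd is reported as not found, installing ssh as shown above should fix it.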

SSH uses public-key cryptography to authenticate the remote computer and allow it to authenticate the user…

HOW TO CONFIGURE HADOOP

You can find countless posts on the same topic over the internet. And most of them are really good. But quite often, newbies face some issues even after doing everything as specified. I was no exception. In fact, many times my friends who are just starting their Hadoop journey call me up and tell me that they are facing some issues even after doing everything in order. So, I thought of writing down the things which worked for me. I am not going into detail as there are many better posts that outline everything pretty well. I'll just show how to configure Hadoop on a single Linux box in pseudo distributed mode.
Prerequisites :

1- Sun(Oracle) java must be installed on the machine.
2- ssh must be installed and keypair must be already generated.

NOTE : Ubuntu comes with its own Java implementation (i.e. OpenJDK), but Sun(Oracle) java is the preferable choice for Hadoop. You can visit this link if you need some help on how to install it.

NOTE : You can visit this link if you want to see how to setu…

HOW TO INSTALL SUN(ORACLE) JAVA ON UBUNTU 12.04 IN 3 EASY STEPS

If you have upgraded to Ubuntu 12.04 or just made a fresh Ubuntu installation, you might want to install sun(oracle) java on it. Although Ubuntu has its own JDK, OpenJDK, there are certain things that demand sun(oracle) java. You can follow the steps shown below to do that -

1 - Add the "WEBUPD8" PPA : hadoop@master:~$ sudo add-apt-repository ppa:webupd8team/java
2 - Update the repositories : hadoop@master:~$ sudo apt-get update
3 - Begin the installation : hadoop@master:~$ sudo apt-get install oracle-java7-installer
Now, to test whether the installation was ok, do this : hadoop@master:~$ java -version
If everything was ok you should see something like this on your terminal :
hadoop@master:~$ java -version
java version "1.7.0_05"
Java(TM) SE Runtime Environment (build 1.7.0_05-b05)
Java HotSpot(TM) 64-Bit Server VM (build 23.1-b03, mixed mode)
hadoop@master:~$

BETWEEN OPERATOR IN HIVE

Hive is a wonderful tool for those who like to perform batch operations to process large amounts of data residing on a Hadoop cluster and who are comparatively new to the NoSQL world. Not only does it provide us warehousing capabilities on top of a Hadoop cluster, but also a superb SQL-like interface which makes it very easy to use and makes our task execution more familiar. But one thing which newbies like me always wanted to have is support for the BETWEEN operator in Hive.

Since the release of version 0.9.0 earlier this year, Hive provides us some new and very useful features. The BETWEEN operator is one of those.

HBASE COUNTERS (PART I)

Apart from various useful features, Hbase provides another advanced and useful feature called COUNTERS.
Hbase provides us a mechanism to treat columns as counters. Counters allow us to increment a column value with least possible overhead.

Advantage of using Counters
Without counters, to increment a value stored in a table, we would have to lock the row, read the value, increment it, write it back to the table, and finally remove the lock from the row, so that it can be used by other clients. This could cause a row to be locked for a long period and may cause contention between the clients trying to access the same row. Counters help us overcome this problem, as increments are done under a single row lock, so write operations to a row are synchronized.

**Older versions of Hbase supported calls which involved one RPC per counter update. But the newer versions allow us to bundle multiple counters in a single RPC call

Counters are limited to a single row, though we can update multiple cou…

FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Quite often the very first blow which people, starting their Hive journey, face is this exception :

FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
And the worst part is that even after Googling for hours and trying out different solutions provided by different people, this error just keeps on haunting them. The solution is simple. You just have to change the permissions of the Hive directories which you have created for warehousing your data. Use the following commands to do that.
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod 777 /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod 777 /user/hive/warehouse
You are good to go now. Keep Hadooping.
NOTE : Please do not forget to tell me whether it worked for you or…

HOW TO CONFIGURE HBASE IN PSEUDO DISTRIBUTED MODE ON A SINGLE LINUX BOX

If you have successfully configured Hadoop on a single machine in pseudo-distributed mode and are looking for some help to use Hbase on top of that, then you may find this writeup useful. Please let me know if you face any issue.

Since you are able to use Hadoop, I am assuming you have all the pieces in place. So we'll directly start with the Hbase configuration. Please follow the steps shown below to do that:

1 - Download the Hbase release from one of the mirrors using the link shown below. Then unzip it at some convenient location (I'll call this location HBASE_HOME from now on) -
http://apache.techartifact.com/mirror/hbase/

2 - Go to the conf directory inside the unzipped HBASE_HOME and make these changes :

     - In the hbase-env.sh file modify these lines as shown :
       export JAVA_HOME=/usr/lib/jvm/java-6-sun
       export HBASE_REGIONSERVERS=/PATH_TO_YOUR_HBASE_FOLDER/conf/regionservers
       export HBASE_MANAGES_ZK=true


HOW TO USE %{host} ESCAPE SEQUENCE IN FLUME-NG

Sometimes you may try to aggregate data from different sources and dump it into a common location, say your HDFS. In such a scenario it is useful to create a directory inside HDFS corresponding to each host machine. To do this FLUME-NG provides a suitable escape sequence, %{host}. Unfortunately it was not working with early releases of FLUME-NG. In that case the only solution was to create a custom interceptor that adds a host header key to each event, along with the corresponding hostname as the header value.

But luckily the guys at Cloudera did a great job and contributed an Interceptor to provide this feature out of the box. Now we just have to add a few lines to our configuration file and we are good to go. For example, suppose we are collecting Apache web server logs from different hosts into a directory called flume inside HDFS. It would be quite tedious to figure out which log is coming from which host. So we'll use %{host} in our agent configuration files fo…

HOW TO MOVE DATA INTO AN HBASE TABLE USING FLUME-NG

The first Hbase sink was committed to the Flume 1.2.x trunk a few days ago. In this post we'll see how we can use this sink to collect data from a file stored in the local filesystem and dump this data into an Hbase table. We should have Flume built from the trunk in order to achieve that. If you haven't built it yet and are looking for some help, you can visit my other post that shows how to build and use Flume-NG at this link :
http://cloudfront.blogspot.in/2012/06/how-to-build-and-use-flume-ng.html

First of all we have to write the configuration file for our agent. This agent will collect data from the file and dump it into the Hbase table. A simple configuration file might look like this :

HOW TO BUILD AND USE FLUME-NG

In this post we'll see how to build flume-ng from trunk and use it for data aggregation.

Prerequisites :

In order to do a hassle-free build we should have the following two things pre-installed on our box :
1- Thrift
2- Apache Maven-3.x.x

Build the project :

Once we are done with this we have to build flume-ng from the trunk. Use the following commands to do this :

$ svn co https://svn.apache.org/repos/asf/incubator/flume/trunk flume

This will create a directory named flume inside the directory where the command was run (our /home/username/ directory here). Now go inside this directory and start the build process.

$ cd flume
$ mvn3 install -DskipTests

NOTE : If everything was fine then you will receive a BUILD SUCCESS message after this. But sometimes you may get an error somewhat like this :

HOW TO CHANGE THE DEFAULT KEY-VALUE SEPARATOR OF A MAPREDUCE JOB

The default MapReduce output format, TextOutputFormat, writes records as lines of text. Its keys and values may be of any type, since TextOutputFormat turns them to strings by calling toString() on them.

Each key-value pair is separated by a tab character. We can change this separator to some character of our choice using the mapreduce.output.textoutputformat.separator property (in the older MapReduce API this was mapred.textoutputformat.separator).

To do this you have to add this line in your driver function, where conf is the Configuration object of your job -
conf.set("mapreduce.output.textoutputformat.separator", ",");

Error while executing MapReduce WordCount program (Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable)

Quite often I see questions from people who are comparatively new to the Hadoop world, or just starting their Hadoop journey, saying that they are getting the error below while executing the traditional WordCount program :

Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable

If you are also getting this error then you have to set your MapOutputKeyClass explicitly like this :

- If you are using the older MapReduce API then do this :
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);

- And if you are using the new MapReduce API then do this :
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

REASON : The reason for this is that your MapReduce application might be using TextInputFormat as the InputFormat class and this class generates keys of type LongWritable and values of type Text by default. But your application might be expecting keys of type…

How to install maven3 on ubuntu 11.10

If you are trying to use the maven2 that ships with your Ubuntu 11.10 and it is not working as intended, you can try the following steps to install maven3 :

1 - First of all add the repository for maven3. Use the following command for this -
     $ sudo add-apt-repository ppa:natecarlson/maven3

2 - Now update the repositories - 
     $ sudo apt-get update

3 - Finally install maven3 - 
      $ sudo apt-get install maven3

NOTE : To check whether the installation was done properly, issue the following command (the package from this PPA installs the binary as mvn3) -
     $ mvn3 --version

Tips for Hadoop newbies (Part I).

A few months ago, after completing my graduation, I thought of doing something new. In quest of that I started learning and working on Apache's platform for distributed computing, Apache Hadoop. Like a good student I started with reading the documentation. Trust me, there are many good posts and documents available for learning Hadoop and setting up a Hadoop cluster. But even after following everything properly, at times I ran into a few problems and could not find solutions for them. I posted questions on the mailing lists, searched over the internet, asked the experts and finally got my issues resolved. But it took a lot of precious time and effort. Hence I decided to write those things down, so that anyone who is just starting off doesn't have to face all those problems.
Please provide me with your valuable comments and suggestions if you have any. That will help me a lot in refining things further, and to add on to my knowledge, as I am still a learner.
1 - If there…