Thursday, October 25, 2012


HDInsight is Microsoft’s 100% Apache-compatible Hadoop distribution, supported by Microsoft. HDInsight, available both on Windows Server and as a Windows Azure service, empowers organizations with new insights on previously untouched unstructured data, while connecting to the most widely used Business Intelligence (BI) tools on the planet. In this post we'll jump directly into the hands-on part. But if you want to know more about HDInsight, you can visit my other post here.

NOTE : OS used - Windows 7

So let's get started.

First of all, go to the Microsoft Big Data page and click on the Download HDInsight Server link (shown in the blue ellipse). You will see something like this :

Once you click the link it will guide you to the Download Center. Now, go to the Instructions heading and click on Microsoft Web Platform Installer.

This will automatically download and install all the required components.

Once the installation is over, open the Microsoft Web Platform Installer and go to the top right corner of its UI, where you will find a search box. Type Hadoop in there. This will show you the Microsoft HDInsight for Windows Server Community Technology Preview entry. Select it and click on Install. And you are done.

NOTE : It may take some time to install all the necessary components depending upon your connection speed.

On successful installation of HDInsight you will find the Hadoop Command Line icon on your desktop. You will also find a brand new directory named Hadoop on your C drive. This indicates that everything went OK and you are good to go.


It's time now to test HDInsight.

Step1. Go to the C:\Hadoop\hadoop-1.1.0-SNAPSHOT\bin directory :
c:\>cd Hadoop\hadoop-1.1.0-SNAPSHOT\bin

Step2. Now, start the daemons using start_daemons.cmd :

It will show you something like this on your terminal :

This means that your Hadoop processes have been started successfully and you are all set.

Let us use a few of the Hadoop commands to get ourselves familiar with Hadoop.

1. List all the directories, sub-directories and files present in HDFS. We do it using fs -ls :

2. Create a new directory inside HDFS. We use fs -mkdir to do that :
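For anyone who would rather script these two commands than type them, here is a minimal Python sketch. It assumes a hadoop launcher is reachable on your PATH (the Hadoop Command Line shortcut normally sets that up). The helper names and the /demo directory are made up for illustration and are not part of HDInsight.

```python
import shutil
import subprocess


def hadoop_fs_command(*args):
    """Build the argument list for a `hadoop fs` invocation."""
    return ["hadoop", "fs", *args]


def run_hadoop_fs(*args):
    """Run `hadoop fs <args>` if a hadoop launcher is on the PATH."""
    command = hadoop_fs_command(*args)
    launcher = shutil.which("hadoop")
    if launcher is None:
        # Nothing to run against, so just show what would have been executed.
        print("hadoop not found on PATH; would have run:", " ".join(command))
        return None
    # Swap in the resolved launcher path (e.g. hadoop.cmd on Windows).
    return subprocess.run([launcher, *command[1:]], check=False)


if __name__ == "__main__":
    run_hadoop_fs("-ls", "/")         # list the contents of the HDFS root
    run_hadoop_fs("-mkdir", "/demo")  # /demo is a hypothetical directory name
```

Running fs -ls / again after the second call should show the new directory in the listing.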

You should be familiar with the Hadoop shell by now. But I would suggest that you go to the official Hadoop page and try more commands in order to get a good grip. HAPPY HADOOPING..!!

HDInsight, Hadoop For Windows

Now, here is a treat for those who want to go the Hadoop way but don't love Linux that much. After a year of beta testing, Microsoft is rolling out the first preview editions of its Apache Hadoop integration for Windows Server and Azure, in a marriage of open source and commercial code. And they call it HDInsight. HDInsight is available both on Windows Server (or Windows 7) and as a Windows Azure service. HDInsight will empower organizations with new insights on previously untouched unstructured data, while connecting to the most widely used Business Intelligence (BI) tools.

Microsoft collaborated with Hortonworks to make this happen. Last year Microsoft announced that it would integrate Hadoop into its then-forthcoming SQL Server 2012 release and Azure platforms, and committed to full compatibility with the Apache code base. The first previews have been shown off at the Hadoop World show in New York and are open for download. HDInsight delivers Apache Hadoop compatibility for the enterprise and simplifies deployment of Hadoop-based solutions. In addition, delivering these capabilities on the Windows Server and Azure platforms enables customers to use the familiar tools of Excel, PowerPivot for Excel and Power View to easily extract actionable insights from the data.

Microsoft also announced that it is going to expand its partnership with Hortonworks, to give customers access to an enterprise-ready distribution of Hadoop with the newly released solutions. Having said that, I hope this Microsoft+Hortonworks relationship keeps growing so that we keep on getting great things like HDInsight.

You can find more about HDInsight here. And if you are planning to give HDInsight a shot, you can visit my post on this, which shows how to install and start using Hadoop on Windows using HDInsight.


Now here is something that's really gonna change the way people think about Hadoop. Hadoop has always been criticized by the BI world for not integrating well with traditional business intelligence processes, as they say. The BI world has always felt that Hadoop lacks the capability of delivering results in real time. So, here is the answer to that. The biggest player of the Hadoop world (just a personal view), Cloudera, has recently launched a Real-Time Query Engine for Hadoop, and they call it Impala. And the best part is that Cloudera has decided to distribute Impala under the Apache license, which means another treat for open source lovers. Although it is just the beta release, I think it's worth giving Impala a try. Cloudera is using the Strata + Hadoop World event in New York City to unveil Impala. Cloudera claims that Impala can process queries 10 to 30 times faster than Hive/MapReduce. (Sounds quite impressive)

Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries. Impala is an addition to tools available for querying big data. Impala does not replace the batch processing frameworks built on MapReduce such as Hive. Hive and other frameworks built on MapReduce are best suited for long running batch jobs, such as those involving batch processing of Extract, Transform, and Load (ETL) type jobs.

Cloudera Impala Diagram

The Impala solution is composed of the following components :
1. Impala State Store - The state store coordinates information about all instances of impalad running in your environment. This information is used to find data so the distributed resources can be used to respond to queries.
2. impalad - This process runs on datanodes and responds to queries from the Impala shell. impalad receives requests from the database connector layer and schedules the tasks for optimal execution. At intervals, impalad updates the Impala State Store with its name and address.
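
To make the division of labour between the two components a bit more concrete, here is a toy, in-memory Python model. It is only a sketch and not Impala's code or API. The class names, method names, node names and addresses are all made up, and the round-robin assignment is a deliberate simplification standing in for Impala's real scheduling.

```python
class StateStore:
    """Toy registry of the impalad instances that have checked in."""

    def __init__(self):
        self._members = {}

    def register(self, name, address):
        # Registering the same name again simply refreshes its address.
        self._members[name] = address

    def members(self):
        # Return a copy so callers cannot modify the registry by accident.
        return dict(self._members)


class Impalad:
    """Toy stand-in for the daemon running on one datanode."""

    def __init__(self, name, address, state_store):
        self.name = name
        self.address = address
        self.state_store = state_store

    def heartbeat(self):
        # Tell the state store this daemon's name and address.
        self.state_store.register(self.name, self.address)

    def plan(self, fragments):
        """Spread query fragments round-robin over the registered daemons."""
        peers = sorted(self.state_store.members())
        if not peers:
            raise RuntimeError("no impalad instances are registered")
        return {frag: peers[i % len(peers)] for i, frag in enumerate(fragments)}


if __name__ == "__main__":
    store = StateStore()
    node_a = Impalad("node-a", "10.0.0.1", store)  # hypothetical addresses
    node_b = Impalad("node-b", "10.0.0.2", store)
    node_a.heartbeat()
    node_b.heartbeat()
    print(node_a.plan(["scan-0", "scan-1", "scan-2"]))
```

Whichever daemon receives the query acts as the planner here, and it can only hand work to daemons that have already reported to the state store.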

More about Impala can be found on the Cloudera Impala page.

Tuesday, October 23, 2012


Now, here is something that will really catch every open-source lover's attention. Team Ubuntu has come up with the idea of running Ubuntu on your Android device. Ubuntu for Android, as they call it, is an upcoming free and open source variant of Ubuntu designed to run on Android phones. It is expected to come pre-loaded on several phones. Ubuntu for Android was shown at Mobile World Congress 2012. The best thing about Ubuntu for Android is that you don't have to reboot the device, as both operating systems use the same kernel.

Other salient features include :

1. Both Ubuntu and Android run at the same time on the device, without emulation or virtualization.
2. When the device is connected to a desktop monitor, it features a standard Ubuntu Desktop interface.
3. When the device is connected to a TV, the interface featured is the Ubuntu TV experience.
4. Ability to run standard Ubuntu Desktop applications, like Firefox, Thunderbird, VLC, etc.
5. Ability to run Android applications on the Ubuntu Desktop.
6. Make and receive calls and SMSs directly from the Desktop

In order to use Ubuntu for Android, the device should have :

1. Dual-core 1 GHz CPU
2. Video acceleration: shared kernel driver with associated X driver; OpenGL, ES/EGL
3. Storage: 2 GB for OS disk image
4. HDMI: video-out with secondary framebuffer device
5. USB host mode
6. 512 MB RAM
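
If you want to check a phone's spec sheet against this list, the comparison can be sketched in a few lines of Python. This is only an illustration. The DeviceSpec fields and the function name are made up, the numbers are treated as minimums (my assumption), and the video acceleration item is left out because it cannot be reduced to a simple number.

```python
from dataclasses import dataclass


@dataclass
class DeviceSpec:
    """Hypothetical description of a phone, filled in from its spec sheet."""
    cpu_cores: int
    cpu_ghz: float
    ram_mb: int
    free_storage_gb: float
    has_hdmi_out: bool
    has_usb_host: bool


def unmet_requirements(spec):
    """Return the labels of the listed requirements the device fails."""
    checks = [
        ("dual-core CPU", spec.cpu_cores >= 2),
        ("1 GHz clock", spec.cpu_ghz >= 1.0),
        ("512 MB RAM", spec.ram_mb >= 512),
        ("2 GB storage for the OS disk image", spec.free_storage_gb >= 2),
        ("HDMI video-out", spec.has_hdmi_out),
        ("USB host mode", spec.has_usb_host),
    ]
    return [label for label, ok in checks if not ok]


if __name__ == "__main__":
    phone = DeviceSpec(cpu_cores=2, cpu_ghz=1.2, ram_mb=256,
                       free_storage_gb=4, has_hdmi_out=True, has_usb_host=True)
    print(unmet_requirements(phone))
```

An empty list means the device clears every item checked here.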

Here is an image showing Ubuntu running on an Android device, docked to a desktop monitor.

For more info you can visit the official page.
