Posts

How to work with Avro data using Apache Spark (Spark SQL API)

We all know how cool Spark is when it comes to fast, general-purpose cluster computing. Apart from the core APIs, Spark also provides a rich set of higher-level APIs, which include Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for high-throughput, fault-tolerant stream processing of live data streams. Through this post we'll explore the Spark SQL API and see how to use it with Avro data. As stated earlier, Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. A DataFrame is a distributed collection of data organised into named columns. We can think of a DataFrame as a table in a relational database or as a dataframe in R or Python. Apache Avro is a very popular data serialization system, especially in the Big Data world. We'll use the spark-avro library for this. spark-avro is a beaut…
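To give a flavour of where the post goes, here is a minimal sketch of reading Avro into a DataFrame with spark-avro on a Spark 1.x setup; the file path, package version and column name are placeholders I've assumed, so adjust them to your own data:

    // assumes spark-avro is on the classpath, e.g. spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc is the SparkContext the shell gives you

    // load an Avro file into a DataFrame via the spark-avro data source
    val df = sqlContext.read.format("com.databricks.spark.avro").load("/path/to/episodes.avro")
    df.printSchema()

    // register it as a temporary table and query it with plain SQL
    df.registerTempTable("episodes")
    sqlContext.sql("SELECT title FROM episodes").show()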

Fun with HBase shell

The HBase shell is great, especially while getting yourself familiar with HBase. It provides lots of useful shell commands with which you can perform trivial tasks like creating tables, putting some test data into them, scanning a whole table, fetching data from a specific row, and so on. Executing help in the HBase shell will give you the list of all the HBase shell commands. If you need help on a specific command, type help "command". For example, help "get" will give you a detailed explanation of the get command. But this post is not about the above said stuff. We will try to do something fun here. Something which is available, but less known. So, get ready, start your HBase daemons, open the HBase shell and get your hands dirty. For those of us who are unaware, the HBase shell is based on JRuby, the Java Virtual Machine-based implementation of Ruby. More specifically, it uses the Interactive Ruby Shell (IRB), which is used to enter Ruby commands and get an immediate…
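To set the stage, this is the kind of trivial session the post builds on; the table and column family names are just examples:

    hbase(main):001:0> create 'test', 'cf'
    hbase(main):002:0> put 'test', 'row1', 'cf:a', 'value1'
    hbase(main):003:0> get 'test', 'row1'
    hbase(main):004:0> scan 'test'
    hbase(main):005:0> help "get"

And since the shell is IRB underneath, plain Ruby works in it too, for example looping a put:

    hbase(main):006:0> (1..5).each { |i| put 'test', "row#{i}", 'cf:a', "value#{i}" }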

List of my top 10 most voted SO answers

Here is a list of my top 10 most voted answers on Stack Overflow. All these questions are related to cloud computing, including discussions on distributed storage and computing tools like Hadoop, HBase, etc. I hope you find it as useful as others did.
1. What is SAAS, PAAS and IAAS? With examples
2. When to use HBase and when to use Hive
3. Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill)
4. Comparing Cassandra's CQL vs Spark/Shark queries vs Hive/Hadoop (DSE version)
5. Difference between HBase and Hadoop
6. How does impala provide faster query response compared to hive
7. How can I develop an ASP.NET web application using Hadoop as Database?
8. How (in Hadoop), is the data put into map and reduce functions in correct types?
9. Why do we need Hadoop passwordless ssh?
10. PIG VS HIVE VS Native Map Reduce

Analyzing your data on the fly with Pig through Mortar Watchtower

Let me start by thanking Mortar for developing such an amazing tool. Isn't it really cool to have the ability to make your Pig development faster without having to write a complete script, run it, and then wait for local or remote Pig to finish the execution and finally give you the final data? Quite often, when writing a Pig script, I find it very time consuming to debug what each line of my script is doing. Moreover, the fact that Pig is a dataflow language makes it even more important to have a clear idea of what exactly your data looks like at each step of the flow. This obviously helps in writing compact and efficient scripts. Trust me, you don't want to write inefficient code while dealing with petabytes of data. It's a bitter truth that Hadoop development iterations are slow. Traditional programmers have always had the benefit of re-compiling their app, running it, and seeing the results within seconds. They have near instant validation that what they're building…

How to run Hive queries through the Hive Web Interface

One of the things I really like about Hadoop and related projects is the WebUI they provide. It makes our life a lot easier. Just point your web browser to the appropriate URL and quickly perform the desired action, be it browsing through HDFS files or glancing over HBase tables. Otherwise you need to go to the shell and issue the associated commands one by one for each action [I know I'm a bit lazy ;)]. Hive is no exception and provides us a WebUI, called the Hive Web Interface, or HWI for short. But somehow I feel it is less documented and talked about compared to the HDFS and HBase WebUIs. That doesn't make it any less useful, though. In fact, I personally find it quite helpful. With its help you can do various operations like browsing your DB schema, seeing your sessions, querying your tables, etc. You can also see the System and User variables like the Java Runtime, your OS architecture, your PATH, and so on. OK, enough brand building. Let's get started and see how to…
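For a quick taste before the full walkthrough, HWI runs as a regular Hive service; the snippet below is a minimal sketch assuming a stock configuration, where hive.hwi.listen.port defaults to 9999 (both it and hive.hwi.listen.host can be overridden in hive-site.xml):

    # start the Hive Web Interface service
    hive --service hwi
    # then point your browser at http://localhost:9999/hwi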

Visualizing Pig Queries Through Lipstick

Quite often while working with Pig you would have reached a situation wherein you found that your Pig scripts have reached such a level of complexity that the flow of execution, and its relation to the MapReduce jobs being executed, has become difficult to visualize. And this eventually means additional effort to develop, maintain, debug, and monitor the execution of the scripts. But not anymore. Thankfully, Netflix has developed a tool that enables developers to visualize and monitor the execution of their data flows at a logical level, and they call it Lipstick. As an implementation of PigProgressNotificationListener, Lipstick piggybacks on top of all Pig scripts executed in our environment, notifying a Lipstick server of job executions and periodically reporting progress as the script executes. Lipstick has got some really cool features. For instance, once you are at the Lipstick main page you can see all the Pig jobs that are currently running or have…
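As a rough illustration of how that piggybacking is wired up, Pig lets you plug a progress listener in through the pig.notification.listener property; the Lipstick class name and server-URL property below are assumptions from memory, so verify them against the Lipstick README before relying on them:

    # hypothetical invocation: register Lipstick's listener and point it at your Lipstick server
    pig -Dpig.notification.listener=com.netflix.lipstick.listeners.LipstickPPNL \
        -Dlipstick.server.url=http://lipstick-host:9292 \
        myscript.pig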

How to install MapR M3 on Ubuntu through the Ubuntu Partner Archive

In a recent post of mine I mentioned the partnership between MapR and Canonical towards an initiative to make Hadoop available natively on Ubuntu through the Ubuntu Partner Archive. Since the packages have now been released, I thought I'd show how to get it done. Trust me, it's really cool to install Hadoop with just one apt-get install :) First things first. Open your sources.list file and add the MapR repositories into it.

    deb http://package.mapr.com/releases/v2.1.2/ubuntu/ mapr optional
    deb http://package.mapr.com/releases/ecosystem/ubuntu binary/

Now, update your repositories.

    sudo apt-get update

Note: If it throws any error regarding the MapR repositories, just uncomment the lines which allow us to add software from Canonical's partner repository.

    ## Uncomment the following two lines to add software from Canonical's
    ## 'partner' repository.
    ## This software is not part of Ubuntu, but is offered by Canonical and the
    ## respective vendors…
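For completeness, once apt-get update goes through cleanly, the install itself really is a single apt-get line followed by MapR's configure.sh; the package set below is an assumed single-node layout based on MapR's standard package roles, so check the MapR docs for the exact roles you want:

    # assumed single-node package selection; adjust roles to taste
    sudo apt-get install mapr-cldb mapr-fileserver mapr-jobtracker mapr-tasktracker mapr-zookeeper mapr-webserver
    # tell the node where its CLDB and ZooKeeper live (here, itself) and name the cluster
    sudo /opt/mapr/server/configure.sh -C <this-host> -Z <this-host> -N my.cluster.com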