Friday, January 10, 2014

List of my top 10 most voted SO answers

Here is a list of my top 10 most voted answers on Stack Overflow. All of these questions are related to cloud computing, including discussions of distributed storage and computing tools like Hadoop, HBase, etc. I hope you find them as useful as others did.

Analyzing your data on the fly with Pig through Mortar Watchtower

Let me start by thanking Mortar for developing such an amazing tool. Isn't it cool to speed up your Pig development without having to write a complete script, run it, and then wait for local or remote Pig to finish executing before you see the final data? Quite often, when writing a Pig script, I find it very time-consuming to debug what each line of my script is doing. Moreover, because Pig is a dataflow language, it is even more important to have a clear idea of what exactly your data looks like at each step of the flow. This helps in writing compact and efficient scripts. Trust me, you don't want to write inefficient code while dealing with petabytes of data.

It's a bitter truth that Hadoop development iterations are slow. Traditional programmers have always had the benefit of recompiling their app, running it, and seeing the results within seconds. They have near-instant validation that what they're building is actually working. When you're working with Hadoop and dealing with petabytes of data, your development iteration time is more like hours (sometimes even days). With Watchtower, the folks at Mortar have made an awesome effort to bring back that almost instant iteration cycle developers are used to. Not only that, Watchtower also helps surface the semantics of your Pig scripts, giving you insight into how your scripts are working, not just that they are working.

What is Watchtower?

Watchtower is basically a daemon that continuously watches your data and the script running over it in real time. It stores the state of your data at each step and shows how that data changes as your script proceeds, displaying the exact flow of your data inline with your script. Watchtower also helps you find errors in your script as you go, so you don't have to wait for the script to finish executing. Since Watchtower is constantly sending data through your entire script, errors are surfaced and displayed instantly.
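To make the idea concrete, here is a minimal Python sketch of the general technique: push a small sample through each step of a pipeline, keep the rows produced after every step, and report an error at the step where it happens. This is only a conceptual illustration and not how Watchtower is actually implemented. The function trace_pipeline, the sample rows, and the step names are all hypothetical.

```python
def trace_pipeline(sample, steps):
    """Run a small sample through each step and record the rows after every step.

    `steps` is a list of (name, function) pairs; each function takes a list of
    rows and returns a new list of rows. Tracing stops at the first step that
    raises, and the error is recorded against that step.
    """
    trace = []
    rows = list(sample)
    for name, fn in steps:
        try:
            rows = fn(rows)
        except Exception as exc:
            # Surface the failure at the exact step that caused it.
            trace.append((name, f"ERROR: {exc!r}"))
            break
        trace.append((name, rows))
    return trace


# Hypothetical sample data, standing in for a few sampled input records.
sample = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 0},
    {"user": "c", "clicks": 7},
]

# Hypothetical steps, loosely analogous to FILTER and FOREACH ... GENERATE in Pig.
steps = [
    ("filter clicks > 0", lambda rows: [r for r in rows if r["clicks"] > 0]),
    ("project user", lambda rows: [{"user": r["user"]} for r in rows]),
]

for name, result in trace_pipeline(sample, steps):
    print(f"{name}: {result}")
```

Because every intermediate result is kept, you can see what the data looks like after each step instead of only at the end, which is the kind of feedback described above.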

This is what Watchtower provides (courtesy of the Watchtower homepage):

  • Instant Sampling of Your Data: Watchtower samples your data in the background while writing your script. This means that when you start writing code, Watchtower is able to provide instant and accurate examples of your data flowing through your script.
  • Complete File Watching: Watchtower watches all files in your Mortar Project for changes. If Watchtower detects a change in any of your scripts, UDFs, or even your data, it will recalculate the samples instantly and show you what changed.
  • Instant Schema Evaluation: Watchtower re-evaluates your schema on file save, not only verifying that you referred to the implied schema correctly, but also showing how Pig builds up the schema and generates field names. This is incredibly powerful for the novice (or experienced!) Pig developer who doesn't fully understand how Pig uses features like the disambiguate operator.
  • Instant Error Catching: Since Watchtower is running data through your entire script, errors in your script and UDFs are surfaced immediately, allowing you to debug and fix them before you ship your job to a Hadoop cluster.
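
The file-watching feature in the list above can be pictured with a small Python sketch that compares content hashes of a project directory before and after an edit. Again, this is an assumption about the general technique, not Watchtower's actual mechanism. The helper names snapshot and changed_files are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map every file under `root` to a SHA-256 hash of its contents."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before, after):
    """Return the paths that were added, removed, or modified between snapshots."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

A watcher built on this idea would take a snapshot periodically and re-run the sample whenever changed_files returns a non-empty list.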

To get started with Watchtower, visit its installation page. It contains all the information you need to install and use it.

This page contains a detailed description of how Watchtower works, along with a short introductory video.

How to work with Avro data using Apache Spark (Spark SQL API)

We all know how cool Spark is when it comes to fast, general-purpose cluster computing. Apart from the core APIs, Spark also provides a rich ...