How to work with Avro data using Apache Spark (Spark SQL API)

We all know how cool Spark is when it comes to fast, general-purpose cluster computing. Apart from the core APIs, Spark also provides a rich set of higher-level APIs, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for high-throughput, fault-tolerant stream processing of live data streams.

Through this post we'll explore the Spark SQL API and see how to use it with Avro data. As stated earlier, Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. A DataFrame is a distributed collection of data organised into named columns. We can think of a DataFrame as a table in a relational database, or as a data frame in R or Python.
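To make the idea concrete, here is a minimal sketch of creating a DataFrame and querying it with SQL. It assumes a local Spark setup with the SQLContext API; the column names and sample rows are purely illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("DataFrameDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// A DataFrame behaves like a relational table with named columns
val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

// Register it as a temporary table so we can query it with plain SQL
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 26").show()
```

Because the data is organised into named columns, Spark SQL can run the same query through either the SQL string above or the equivalent DataFrame methods (`people.filter($"age" > 26).select("name")`).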

Apache Avro is a very popular data serialization system, especially in the Big Data world. We'll use the spark-avro library for this. spark-avro is a beautiful library…
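As a quick preview of what this looks like in practice, here is a hedged sketch of reading and writing Avro files through spark-avro's data source. It assumes the `com.databricks:spark-avro` package is on the classpath and an existing `sqlContext`; the file paths are placeholders, not real data:

```scala
// Assumes the spark-avro package is available, e.g. started with:
//   spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
import org.apache.spark.sql.SQLContext

// Read an Avro file into a DataFrame (path is illustrative)
val episodes = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("/path/to/episodes.avro")

// The Avro schema is mapped onto DataFrame columns automatically
episodes.printSchema()

// Write a DataFrame back out as Avro (output path is illustrative)
episodes.write
  .format("com.databricks.spark.avro")
  .save("/path/to/output")
```

Once loaded, the Avro data is just an ordinary DataFrame, so everything Spark SQL offers (SQL queries, filters, joins) applies to it directly.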
