Posts

Showing posts from 2015

How to work with Avro data using Apache Spark(Spark SQL API)

We all know how cool Spark is when it comes to fast, general-purpose cluster computing. Apart from the core APIs Spark also provides a rich set of higher-level Apis which include  Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming  for high-throughput, fault-tolerant stream processing of live data streams. Through this post we'll explore the Spark SQL API and see how to use it with Avro data. As stated earlier, Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine. A DataFrame is a distributed collection of data organised into named columns. We can think of a DataFrame as a table in a relational database or as a dataframe in R or Python. Apache Avro is a very popular data serialization system, specially in BigData world. We'll use the spark-avro library for this. spark-avro is a beaut