How to work with Avro data using Apache Spark (Spark SQL API)

We all know how cool Spark is when it comes to fast, general-purpose cluster computing. Apart from the core APIs, Spark also provides a rich set of higher-level APIs, which include Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for high-throughput, fault-tolerant stream processing of live data streams.

In this post we'll explore the Spark SQL API and see how to use it with Avro data. As stated earlier, Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. A DataFrame is a distributed collection of data organised into named columns. We can think of a DataFrame as a table in a relational database, or as a data frame in R or Python.

Apache Avro is a very popular data serialization system, especially in the Big Data world. We'll use the spark-avro library for this. spark-avro is a beautiful library provided by Databricks, which helps us read and write Avro data with Spark.
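
Before any of the code below will compile, the spark-avro package needs to be on the classpath. Here is a minimal sketch of the Maven dependency; the Scala suffix and version shown are assumptions, so pick the ones that match your Spark and Scala versions:

<!-- spark-avro from Databricks; adjust the artifact suffix and version to your setup -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.10</artifactId>
    <version>2.0.1</version>
</dependency>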

We begin by creating an instance of JavaSparkContext using a SparkConf.

SparkConf conf = new SparkConf().setAppName("SQLApiDemo").setMaster("local");

JavaSparkContext javaSparkContext = new JavaSparkContext(conf);

Once created, this JavaSparkContext instance is used to create an instance of SQLContext, which will be used to read, operate on and write the Avro data.

SQLContext sqlContext = new SQLContext(javaSparkContext);

Next, we use the load() API provided by SQLContext to read Avro data from a given source. The load() API returns a DataFrame created out of the data read from the specified source. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. Another good thing is that the DataFrame API is available in Scala, Java, Python, and R.
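
If you are on Spark 1.4 or later, the same read can also be expressed through the DataFrameReader returned by sqlContext.read(). A minimal sketch, with the path as a placeholder:

// Read Avro data via the DataFrameReader API (Spark 1.4+)
DataFrame df = sqlContext.read()
        .format("com.databricks.spark.avro")
        .load("/path/to/avro_data/");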

Here is a complete example showing how to read and work with Avro data using a DataFrame:

import java.io.IOException;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SQLApiDemo {

    public static void main(String[] args) throws IOException {

        SparkConf conf = new SparkConf().setAppName("SQLApiDemo").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load the Avro data into a DataFrame using the spark-avro data source
        DataFrame df = sqlContext.load("/Users/tariq/avro_data/browser.avro/", "com.databricks.spark.avro");

        // Print the schema inferred from the Avro data, then show the rows
        df.printSchema();
        df.show();
    }
}
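
spark-avro works in the other direction too. Here is a minimal sketch of writing a DataFrame back out as Avro files; the output path is just a placeholder, and the write() API assumes Spark 1.4 or later (on 1.3 you would use df.save(path, source) instead):

// Write the DataFrame back out as Avro files (output path is a placeholder)
df.write()
  .format("com.databricks.spark.avro")
  .save("/path/to/avro_output/");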

Not just this, the DataFrame API also allows us to perform various structured data processing operations. For example:
df.select("name").show();
// name
// Michael
// Andy
// Justin

// Select everybody, but increment the age by 1
df.select("name", df.col("age").plus(1)).show();
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

// Select people older than 21
df.filter(df("name") > 21).show();
// age name
// 30  Andy

// Count people by age
df.groupBy("age").count().show();
// age  count
// null 1
// 19   1
// 30   1
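
And because Spark SQL can act as a distributed SQL query engine, the same DataFrame can be registered as a temporary table and queried with plain SQL. A minimal sketch (the table name is arbitrary):

// Register the DataFrame as a temporary table and query it with SQL
df.registerTempTable("people");
DataFrame adults = sqlContext.sql("SELECT name, age FROM people WHERE age > 21");
adults.show();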

