CloudFront: How to work with Avro data using Apache Spark (Spark SQL API)

Thursday, September 24, 2015

We all know how cool Spark is when it comes to fast, general-purpose cluster computing. Apart from the core APIs, Spark also provides a rich set of higher-level APIs, which include Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for high-throughput, fault-tolerant processing of live data streams.

In this post we'll explore the Spark SQL API and see how to use it with Avro data. As stated earlier, Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. A DataFrame is a distributed collection of data organised into named columns. We can think of a DataFrame as a table in a relational database or as a data frame in R or Python.

Apache Avro is a very popular data serialization system, especially in the big data world. To work with Avro from Spark, we'll use the spark-avro library. spark-avro is a beautiful library provided by Databricks that helps us read and write Avro data.

We begin with creating an instance of JavaSparkContext using SparkConf.

SparkConf conf = new SparkConf().setAppName("SQLApiDemo").setMaster("local");

JavaSparkContext sc = new JavaSparkContext(conf);

Once created, this JavaSparkContext instance is used to create an instance of SQLContext, which will be used to read, operate on and write the Avro data.

SQLContext sqlContext = new SQLContext(sc);

Next, we use the load() API provided by SQLContext to read Avro data from a given source. load() returns a DataFrame built from the data it reads. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs. Another good thing about DataFrames is that the DataFrame API is available in Scala, Java, Python, and R.
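Before pointing load() at a path, it can help to confirm that a file really is an Avro object container file. Such files begin with the four bytes 'O', 'b', 'j', 1. The following is only a minimal sketch in plain Java; the class AvroFileCheck and its methods are hypothetical helpers invented for this illustration and are not part of Avro or spark-avro.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Avro or spark-avro.
final class AvroFileCheck {

    // An Avro object container file starts with the bytes 'O', 'b', 'j', 1.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // Returns true if the given bytes start with the Avro container magic.
    static boolean hasAvroMagic(byte[] header) {
        return header.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }

    // Reads up to four bytes from the start of a file and checks them.
    static boolean looksLikeAvro(Path file) throws IOException {
        byte[] header = new byte[MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (filled < header.length) {
                int n = in.read(header, filled, header.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        return filled == MAGIC.length && hasAvroMagic(header);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AvroFileCheck <file>");
            return;
        }
        System.out.println(looksLikeAvro(Paths.get(args[0])));
    }
}
```

The check only inspects the header of a single file. It says nothing about whether the schema or the data blocks that follow are valid, so it is a quick sanity check rather than a validator.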

Here is a complete example showing how to read and work with Avro data using a DataFrame:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SQLApiDemo {

    public static void main(String[] args) {

        // Run locally with a single worker thread
        SparkConf conf = new SparkConf().setAppName("SQLApiDemo").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load the Avro data through the spark-avro data source
        DataFrame df = sqlContext.load("/Users/tariq/avro_data/browser.avro/", "com.databricks.spark.avro");

        // Print the schema derived from the Avro data, then the first rows
        df.printSchema();
        df.show();

        sc.stop();
    }
}

Beyond loading and displaying data, the DataFrame API also lets us perform various structured data processing operations. For example, on a DataFrame with name and age columns:

df.select("name").show();
// name
// Michael
// Andy
// Justin

// Select everybody, but increment the age by 1
df.select("name", df.col("age").plus(1)).show();
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

// Select people older than 21
df.filter(df.col("age").gt(21)).show();
// age name
// 30  Andy

// Count people by age
df.groupBy("age").count().show();
// age  count
// null 1
// 19   1
// 30   1
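To make the semantics of these operations concrete, here is a plain-Java sketch that computes the same results on the three sample rows using ordinary collections. It is only an analogue for illustration and does not use Spark at all; the class DataFrameOpsSketch, the Person type, and every method name in it are assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java analogue of the DataFrame calls above; none of this is Spark API.
final class DataFrameOpsSketch {

    // One row of the sample data; age is nullable, as in the output above.
    static final class Person {
        final String name;
        final Integer age;

        Person(String name, Integer age) {
            this.name = name;
            this.age = age;
        }
    }

    static final List<Person> ROWS = Arrays.asList(
            new Person("Michael", null),
            new Person("Andy", 30),
            new Person("Justin", 19));

    // Analogue of df.select("name")
    static List<String> selectName(List<Person> rows) {
        return rows.stream().map(p -> p.name).collect(Collectors.toList());
    }

    // Analogue of df.col("age").plus(1): a null age stays null
    static List<Integer> agePlusOne(List<Person> rows) {
        return rows.stream()
                .map(p -> p.age == null ? null : Integer.valueOf(p.age + 1))
                .collect(Collectors.toList());
    }

    // Analogue of filtering on age > limit: rows with a null age are dropped
    static List<String> olderThan(List<Person> rows, int limit) {
        return rows.stream()
                .filter(p -> p.age != null && p.age > limit)
                .map(p -> p.name)
                .collect(Collectors.toList());
    }

    // Analogue of df.groupBy("age").count(): null forms its own group.
    // A loop is used because Collectors.groupingBy rejects null keys.
    static Map<Integer, Long> countByAge(List<Person> rows) {
        Map<Integer, Long> counts = new LinkedHashMap<>();
        for (Person p : rows) {
            counts.merge(p.age, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(selectName(ROWS));
        System.out.println(agePlusOne(ROWS));
        System.out.println(olderThan(ROWS, 21));
        System.out.println(countByAge(ROWS));
    }
}
```

The main difference from the real thing is that Spark evaluates these operations lazily and in parallel across a cluster, while this sketch runs eagerly on a small in-memory list.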

11 comments:

Unknown, September 16, 2019 at 5:20 PM
Nice and good article. It is very useful for me to learn and understand easily. Thanks for sharing your valuable information and time. Please keep updating big data online training
siva sreedhar, September 25, 2019 at 12:40 PM
I like your post very much. It is very much useful for my research. I hope you to share more info about this. Keep posting Spark Certification
Alfred Avina, January 29, 2020 at 6:21 PM
Capsule theory is an excellent concept to talk about, but you can't ignore the relation of capsule theories with Google cloud big data services.
veera cynixit, July 24, 2020 at 2:30 PM
Very nice blog, keep updating more posts.

We are offering some courses; interested candidates can visit now:

big data online training

hadoop admin course

ios online training

android online training
And many more trainings available here.
Lafay Tech Plaza, April 17, 2021 at 6:26 PM
Hadoop, an open-source software framework for storing and processing data services on clusters of commodity hardware, is rapidly gaining steam in the enterprise. According to an InfoWorld analysis, Hadoop is now used by more than half of the Global 500. Hopefully that number will continue to grow as more and more companies realize the value that Hadoop can bring to their data analysis and storage needs.
Sreyobhilashi Institute, September 25, 2021 at 12:38 PM
I think it's an old 2015 video, so it's too old. In the latest versions Avro is available by default, with no need for any dependency. It works in Spark 2.4.8 or Spark 3.x. Anyway, thanks for sharing your knowledge.
Regards,
Venu spark training in Hyderabad

bigdata training in Hyderabad
Kamalesh, December 1, 2021 at 5:05 PM
Excellent article and this helps to enhance your knowledge regarding new things. Waiting for more updates.
Angular 11 New Features
Angular Latest Stable Version
Unknown, December 31, 2021 at 12:29 PM
Thank you ever so much for your article. Really cool.
core java online training hyderabad
java online training india
naresh, August 8, 2023 at 5:13 PM
Hi.
Thanks for sharing this information. This blog was good and it gives good information on general-purpose cluster computing and its importance.

Here is some IBM DataStage information; maybe it's helpful to you.
IBM DataStage Training
TotalCloudAI, August 13, 2024 at 9:41 AM
Improve your data abilities with TotalCloudAI's Databricks training courses, which include data engineering, machine learning, and advanced analytics. For more visit us!