In this post we'll explore the Spark SQL API and see how to use it with Avro data. As stated earlier, Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. A DataFrame is a distributed collection of data organised into named columns; we can think of it as a table in a relational database or as a data frame in R or Python.
Apache Avro is a very popular data serialization system, especially in the Big Data world. To read and write Avro data from Spark SQL we'll use spark-avro, a library provided by Databricks.
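Since spark-avro is published as a separate package, it has to be on the application's classpath. One way to pull it in is the --packages flag of spark-submit; the version coordinates and the jar name below are assumptions, so check the spark-avro README for the release matching your Spark version:

# Hypothetical version coordinates and jar name; use the spark-avro release that matches your Spark version
spark-submit --packages com.databricks:spark-avro_2.10:1.0.0 --class SQLApiDemo sql-api-demo.jar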
We begin by creating an instance of JavaSparkContext using SparkConf.
SparkConf conf = new SparkConf().setAppName("SQLApiDemo").setMaster("local");
JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
Once created, this JavaSparkContext instance is used to create an instance of SQLContext, which will be used to read, operate on and write the Avro data.
SQLContext sqlContext = new SQLContext(javaSparkContext);
Next, we use the load() API provided by SQLContext to read Avro data from a given source. load() returns a DataFrame created out of the data read from the specified source. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. Another good thing about the DataFrame API is that it is available in Scala, Java, Python, and R.
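To illustrate a couple of those other sources, here is a minimal sketch; the JSON file path, the Person bean class, and the personRdd RDD are hypothetical:

// From a structured data file (JSON); the path is hypothetical
DataFrame fromJson = sqlContext.jsonFile("/Users/tariq/people.json");
// From an existing RDD of JavaBeans (Person and personRdd are hypothetical)
DataFrame fromRdd = sqlContext.createDataFrame(personRdd, Person.class);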
Here is a complete example showing how to read and work with Avro data using the DataFrame API:
import java.io.IOException;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SQLApiDemo {

    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf().setAppName("SQLApiDemo").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load the Avro data into a DataFrame, using spark-avro as the data source
        DataFrame df = sqlContext.load("/Users/tariq/avro_data/browser.avro/", "com.databricks.spark.avro");
        df.printSchema(); // print the schema inferred from the Avro data
        df.show();        // show the first rows
    }
}
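spark-avro handles the write side as well: the same data source name can be passed to the DataFrame's save() API to serialize a DataFrame back out as Avro. A minimal sketch, with a hypothetical output path:

// Write the DataFrame back out as Avro files (the output path is hypothetical)
df.save("/Users/tariq/avro_data/output/", "com.databricks.spark.avro");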
Beyond reading and writing, the DataFrame API also allows us to perform various structured data processing operations. For example:
df.select("name").show();
// name
// Michael
// Andy
// Justin
// Select everybody, but increment the age by 1
df.select("name", df.col("age").plus(1)).show();
// name (age + 1)
// Michael null
// Andy 31
// Justin 20
// Select people older than 21
df.filter(df.col("age").gt(21)).show();
// age name
// 30 Andy
// Count people by age
df.groupBy("age").count().show();
// age count
// null 1
// 19 1
// 30 1
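And since Spark SQL can act as a distributed SQL query engine, we can register the DataFrame as a temporary table and query it with plain SQL. A minimal sketch; the table name "people" is just an illustration:

// Register the DataFrame as a temporary table and query it with SQL
df.registerTempTable("people");
DataFrame adults = sqlContext.sql("SELECT name, age FROM people WHERE age > 21");
adults.show();
// name age
// Andy 30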