This post continues our Learn Hadoop blog series, following the Introduction to Big Data Hadoop Online Training video on YouTube and our blog post on Hadoop ecosystem basics.
This is the Apache Spark Tutorial for Beginners, part of our Hadoop Online Training. Let’s begin.
After this training you will be able to answer the following questions: What are the features of Apache Spark? Why use Apache Spark? What is Apache Spark used for? Which use cases suit Apache Spark?
As we know, Big Data refers to the huge amounts of structured and unstructured data generated in a short time, both on the internet and offline, from sources such as Facebook (text, audio, video, pictures), Twitter and the medical domain. This data can have problems such as missing values or broken links. The volume of data is said to double roughly every two years, and that is what makes it Big Data: there is enough of it to make predictions for marketing, to respond to calamities, to advance medicine and to help human lives.
Why Apache Spark?
At a high level, Apache Spark is a fast, high-performance engine for distributed cluster computing. It computes much faster than disk-based systems largely because it keeps data in memory, which helps a great deal with iterative computations. Spark is suitable for both batch processing (the territory of Hadoop MapReduce) and real-time processing (the territory of Apache Storm). Spark is written in Scala and runs on the JVM, and Spark programs can be written in Java, Scala or Python.
Compared with Hadoop MapReduce, Spark jobs usually run much faster. Hadoop MapReduce performs batch processing on Big Data, so it cannot serve real-time use cases. Spark is also easier to program than MapReduce.
In short, Apache Spark processes data in real time, handles inputs from various sources, is easy to use (programs are easy to write and understand) and is very fast at processing data. Examples of real-time processing with Spark today include stock market analysis, Twitter analytics, banking fraud detection and medicine.
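The difference between batch and real-time processing described above can be sketched in a few lines. This is a minimal illustration in plain Python, not Spark code; the event list and the function names `batch_count` and `streaming_count` are made up for this example. The batch version waits for all the data and computes once, while the streaming-style version updates a running result after every small batch, so partial answers are available early.

```python
# Plain-Python illustration of batch versus streaming-style processing.
# This is NOT Spark code; the events and function names are hypothetical.
from collections import Counter

events = ["buy", "sell", "buy", "buy", "sell", "hold"]


def batch_count(all_events):
    """Batch style: wait for the full dataset, then compute once."""
    return Counter(all_events)


def streaming_count(event_stream, batch_size):
    """Streaming style: update a running result after each small batch."""
    running = Counter()
    snapshots = []
    for start in range(0, len(event_stream), batch_size):
        micro_batch = event_stream[start:start + batch_size]
        running.update(micro_batch)
        # A partial result is available as soon as each small batch is done.
        snapshots.append(dict(running))
    return running, snapshots


batch_result = batch_count(events)
stream_result, snapshots = streaming_count(events, batch_size=2)

print(batch_result == stream_result)
for snapshot in snapshots:
    print(snapshot)
```

Both approaches end with the same counts; the streaming-style version simply makes intermediate results visible while data is still arriving.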
So, what is Apache Spark? Apache Spark is an open source framework for real-time or batch processing, developed under the Apache Software Foundation. Spark is easy to use because it gives you a single interface for programming an entire cluster. Spark is very reliable. It is not built on top of MapReduce; it is a separate processing engine that can run on a Hadoop cluster and read data from HDFS. It can also be used stand-alone, without the Hadoop file system. Spark comes as a simple binary download that only needs to be unpacked, with no separate installation step.
Let’s look at the features of Apache Spark. Spark is reported to run up to 100x faster than other large-scale data processing systems for in-memory workloads. It can be programmed in Java, Scala, R and Python. It provides a simple programming layer and has a very powerful caching capability. You can deploy your application on Mesos, on YARN or on a standalone cluster.
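To make the caching point concrete, here is a minimal sketch in plain Python, not the Spark API. It assumes a hypothetical `load_dataset` function standing in for an expensive read from slow storage, and compares an iterative job that reloads the data on every pass with one that loads it once and keeps it in memory. The function names and the toy data are illustrative assumptions.

```python
# Plain-Python illustration of why in-memory caching helps iterative jobs.
# This is NOT Spark code; load_dataset is a hypothetical stand-in for an
# expensive read from disk or a distributed file system.

load_count = 0  # counts how many times the "expensive" load runs


def load_dataset():
    """Pretend to read a dataset from slow storage."""
    global load_count
    load_count += 1
    return [1.0, 2.0, 3.0, 4.0]


def iterate_without_cache(passes):
    """Reload the data on every pass."""
    total = 0.0
    for _ in range(passes):
        data = load_dataset()
        total += sum(data)
    return total


def iterate_with_cache(passes):
    """Load once, keep the data in memory, and reuse it on every pass."""
    data = load_dataset()  # loaded a single time
    total = 0.0
    for _ in range(passes):
        total += sum(data)
    return total


load_count = 0
uncached_total = iterate_without_cache(5)
uncached_loads = load_count

load_count = 0
cached_total = iterate_with_cache(5)
cached_loads = load_count

print(uncached_total, uncached_loads)
print(cached_total, cached_loads)
```

Both versions compute the same total, but the cached version touches slow storage only once. In Spark itself, the comparable step is to mark a dataset as cached (for example by calling `cache()` on an RDD or DataFrame) before iterating over it.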
The Spark ecosystem is built around a central engine called Spark Core. On top of it sit the libraries: Spark SQL (SQL queries computed in memory, used for structured data), Spark Streaming (real-time processing of structured or unstructured data), MLlib (a machine learning library with supervised algorithms, where the expected output is known for the training data, and unsupervised algorithms, where no output is known), GraphX (graph data and graph computations) and SparkR (R on Spark, still in beta testing at the time of writing).
For the case study, consider a real-time calamity such as a tsunami, an earthquake, heavy rains or a failure of railway signals, where a prediction just a few minutes or seconds in advance can save hundreds of thousands of lives. Apache Spark can handle all the requirements of this case study: process data in real time (just as the calamity is about to happen or has happened), handle inputs from multiple sources (different sources will be sending data), produce results that are easy to use (shown in graphical form for faster understanding) and transmit alerts in bulk (messages sent within seconds of receiving results so that lives can be saved).
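The flow of this case study can be sketched as a small pipeline. The example below is plain Python, not Spark code, and everything in it is an assumption made for illustration: the `Reading` record, the source names, the alert threshold and the subscriber list. It merges readings from several sources, flags regions whose readings cross the threshold, and builds one alert message per subscriber in each affected region.

```python
# Plain-Python sketch of the case-study pipeline; this is NOT Spark code.
# All source names, the threshold and the subscribers are hypothetical.
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Reading:
    source: str       # which feed produced the reading
    region: str       # where the reading was taken
    magnitude: float  # severity on an assumed common scale


ALERT_THRESHOLD = 6.0  # assumed cut-off for raising an alert


def merge_sources(*sources: Iterable[Reading]) -> List[Reading]:
    """Combine readings arriving from several independent sources."""
    merged: List[Reading] = []
    for source in sources:
        merged.extend(source)
    return merged


def detect_alerts(batch: Iterable[Reading]) -> Dict[str, float]:
    """Return the worst magnitude per region at or above the threshold."""
    worst: Dict[str, float] = {}
    for reading in batch:
        if reading.magnitude >= ALERT_THRESHOLD:
            worst[reading.region] = max(
                worst.get(reading.region, 0.0), reading.magnitude
            )
    return worst


def build_messages(alerts: Dict[str, float],
                   subscribers: Dict[str, List[str]]) -> List[str]:
    """Create one alert message per subscriber in each affected region."""
    messages: List[str] = []
    for region, magnitude in sorted(alerts.items()):
        for person in subscribers.get(region, []):
            messages.append(
                f"To {person}: alert for {region}, magnitude {magnitude}"
            )
    return messages


seismic = [Reading("seismic", "coast", 7.2), Reading("seismic", "inland", 3.1)]
buoys = [Reading("buoy", "coast", 6.5)]
subscribers = {"coast": ["alice", "bob"], "inland": ["carol"]}

batch = merge_sources(seismic, buoys)
alerts = detect_alerts(batch)
messages = build_messages(alerts, subscribers)

for message in messages:
    print(message)
```

In a real deployment, a streaming engine such as Spark Streaming would run the detection step continuously on each incoming batch across a cluster, and the messages would go to an actual notification service rather than being printed.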