In this Hadoop tutorial we will learn the basics of the Hadoop Ecosystem. Let's begin.
What is the Hadoop Ecosystem?
The Hadoop Ecosystem is not a single tool, programming language, or framework. It is a group of tools that are used together by various companies, across various domains, for different tasks.
Hadoop alone cannot provide all the facilities or services required to process big data. Hadoop can store big data, and it can process big data to a certain extent, but other requirements come up. For example, we may want to build a recommendation engine over big data, run clustering algorithms on it, or get real-time insights from it. Because Hadoop is a batch processing framework, if you want real-time insight you will need another tool that can work over HDFS, or can utilize and leverage HDFS. So what we understand here is that one single tool like Hadoop will not solve all your problems; you will have to use various other tools over Hadoop, or along with Hadoop, to solve the problems you have. These tools, along with Hadoop, are known as the Hadoop Ecosystem.
Let's look at the important tools in the Hadoop Ecosystem that can be used with Hadoop, and the functions they perform in their own domains.
HDFS – The Hadoop Distributed File System is the storage unit of Hadoop. An HDFS cluster is formed by DataNodes (cheap commodity hardware) clustered together using the Hadoop framework. In HDFS you can store any kind of data, be it structured, unstructured, or semi-structured, and once the data is stored you can view it all as a single unit. Data is spread across the various nodes, and HDFS also maintains information about the stored data (called metadata), i.e. which data is stored at which position. So HDFS basically has two components: the NameNode (which manages the entire cluster and keeps the metadata of the data stored in the DataNodes) and the DataNodes (the commodity machines which actually store the data).
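To give a feel for working with HDFS, here are a few basic HDFS shell commands; the file and directory names below are just examples, not anything your cluster will already have.

    # list the root of the distributed file system
    hdfs dfs -ls /
    # copy a local file into HDFS (example paths)
    hdfs dfs -put sales.csv /data/sales.csv
    # read the file back from HDFS
    hdfs dfs -cat /data/sales.csv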
YARN – Introduced in Hadoop 2.0, YARN enabled the various ecosystem tools to connect with HDFS and leverage big data. YARN stands for Yet Another Resource Negotiator, and its main purpose is to allocate resources for tasks running over the Hadoop cluster. It has two daemons: the ResourceManager (master daemon) and the NodeManagers (slave daemons). As soon as a client submits a job, the ResourceManager allocates resources for it; these resources are nothing but containers in which the job is executed. The NodeManagers launch the jobs in these containers and manage the work on their nodes. YARN became the next layer over HDFS, and MapReduce now connects with YARN to get resources for executing MapReduce tasks.
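As a quick illustration, the yarn command line lets you see what the ResourceManager and NodeManagers are handling; the output naturally depends on your own cluster.

    # list applications currently known to the ResourceManager
    yarn application -list
    # list the NodeManagers registered with the cluster
    yarn node -list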
MapReduce – The core data-processing component of the Hadoop Ecosystem, used for processing data through programming. It basically comprises two functions: the Map function and the Reduce function. The Map function performs operations like filtering, grouping, and sorting; the output of the Map function is then aggregated by the Reduce function, and the summarized result is written back to HDFS. MapReduce itself is written in Java, but your jobs are not confined to any one language; they can be written in Python, Perl, or other languages as well.
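Below is a minimal word-count sketch using the Hadoop MapReduce Java API; the class names are our own, and the driver class that configures and submits the job is omitted for brevity.

    // Word count: the classic MapReduce example.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input line.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the counts emitted for each word and write the total.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }
    }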
Apache PIG – A data processing service using queries. Pig is a data processing tool that runs/sits over Hadoop. Apache Pig has its own language called Pig Latin, which is a data-flow (instructional) language; for example, to load data there is a LOAD command. Pig was developed to make the programmer's life easy: roughly one line of Pig Latin can replace a hundred lines of MapReduce code. A Pig script performs the required functions like grouping, filtering, sorting, etc., and finally dumps the data on the screen or stores it in HDFS.
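Here is a small Pig Latin sketch of such a flow; the file path and schema are hypothetical.

    -- load a tab-delimited file from HDFS (hypothetical path and schema)
    users = LOAD '/data/users.txt' AS (name:chararray, age:int);
    -- filter: keep only adults
    adults = FILTER users BY age >= 18;
    -- group by age and count each group
    by_age = GROUP adults BY age;
    counts = FOREACH by_age GENERATE group AS age, COUNT(adults) AS n;
    -- dump to the screen (use STORE instead to write back to HDFS)
    DUMP counts;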
Apache HIVE – Hive was developed at Facebook. At the time Hive was developed, RDBMSs were flourishing; even the Facebook website was running on MySQL. This was a problem, because their workforce was trained in MySQL, while writing a MapReduce program required learning another programming language. So Facebook came up with a tool called Hive, with which you can write SQL-like queries, called Hive Query Language (HQL), to execute the same tasks over the Hadoop cluster and leverage big data. Just like with Pig, using Hive you write simple SQL-like queries, and the tasks you used to execute with MapReduce can now be executed through Hive without getting into the complexities of MapReduce. You can also connect to Hive from client applications, such as Java programs, if you have that requirement. So Hive is an important tool for those who do not want to get into writing MapReduce programs, and the HQL syntax is similar to SQL. Note that not every SQL query can be performed using Hive, but the most common operations like grouping and filtering can.
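A minimal HQL sketch, with hypothetical table and column names, shows how familiar the syntax feels to anyone who knows SQL.

    -- define a table over delimited data (names are hypothetical)
    CREATE TABLE page_views (user_id INT, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- an ordinary-looking aggregation that Hive runs over the cluster
    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url
    ORDER BY hits DESC;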
Apache SPARK – Spark is a leading tool in the Hadoop Ecosystem. MapReduce on Hadoop can only be used for batch processing, which means we cannot get results in real time. With data growing so fast, there is a requirement for real-time analytics as well. Real-time analytics cannot be done using MapReduce and Hadoop alone, and that is where Spark comes into the picture. Spark can work standalone as well as over a Hadoop cluster, provides real-time processing, and can be up to 100 times faster than MapReduce for in-memory workloads.
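A minimal Spark word count using the Java API gives a taste of the programming model; the input and output paths are hypothetical.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            // run locally; point setMaster at YARN to run over a Hadoop cluster
            SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt"); // hypothetical path
                JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
                counts.saveAsTextFile("hdfs:///data/output"); // hypothetical path
            }
        }
    }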
Apache HBase – A NoSQL database that runs over Hadoop. Apache HBase is an open-source, non-relational, distributed database. It can store and handle any kind of data. It is modeled after Google's BigTable and can store any big data that sits in the Hadoop file system. You can use Apache HBase as a back-end for a website or web application that must query in real time, which cannot be done with Pig/Hive/MapReduce etc. You can also write Java applications that connect to HBase, or access it through its REST, Thrift, and Avro APIs.
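A minimal sketch with the HBase Java client, writing one cell and reading it back; the table, column family, and values are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // write: row "u1", column family "info", qualifier "name"
                Put put = new Put(Bytes.toBytes("u1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                              Bytes.toBytes("Alice"));
                table.put(put);
                // read the same cell back
                Result r = table.get(new Get(Bytes.toBytes("u1")));
                System.out.println(Bytes.toString(
                        r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }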
Apache Drill – An open-source application that works with NoSQL databases, file systems, and even individual flat files, all at the same time. For example, imagine we have to fetch data stored in different data stores such as HDFS, HBase, and a MongoDB database. Each of these data stores has its own syntax for retrieving data, so we would have to write a different query for each one. Using Apache Drill we can connect to all the data stores at once, execute a single query, extract data from all three, and use that data in our application. This works because Drill follows ANSI SQL, which lets one query be understood across all three data stores.
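A sketch of what such a query could look like in Drill; the storage plugin names, paths, and fields here are assumptions, not something a specific cluster will already define.

    -- one ANSI SQL statement spanning a JSON file in HDFS and a MongoDB collection
    SELECT o.user_id, o.total, u.name
    FROM dfs.`/data/orders.json` o
    JOIN mongo.shop.users u ON o.user_id = u.user_id
    WHERE o.total > 100;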
Apache Oozie – The job scheduler of the Hadoop Ecosystem. For example, suppose we have a task that needs to be performed every 30 minutes: we can define a workflow in Oozie and schedule it so that it executes on its own every 30 minutes, rather than someone triggering it manually. We are doing two things here: defining a workflow, which is one or more tasks executed in sequence by various tools like MapReduce, Hive, or Pig, and defining the frequency at which the workflow needs to be executed. The Oozie coordinator is the component that handles this scheduling, and it can also run a workflow only when its input data is available, i.e. it acts as an event-based job scheduler.
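A minimal sketch of a coordinator definition for the 30-minute example; the application name, dates, and workflow path are hypothetical.

    <!-- run the workflow at hdfs:///apps/my-workflow every 30 minutes -->
    <coordinator-app name="every-30-min" frequency="${coord:minutes(30)}"
                     start="2024-01-01T00:00Z" end="2024-12-31T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
      <action>
        <workflow>
          <app-path>hdfs:///apps/my-workflow</app-path>
        </workflow>
      </action>
    </coordinator-app>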
Flume – A data ingestion service, used for ingesting data into HDFS. Using Flume you can ingest almost any kind of data into HDFS and work on it after it is stored. Flume gives you the flexibility of collecting data from social media such as Facebook and Twitter, from network traffic, or from services where logs are generated at regular intervals, and of moving that data into HDFS.
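A minimal sketch of a Flume agent configuration that tails a log file into HDFS; the agent, source, channel, and sink names are our own.

    # source: tail a log file; channel: in-memory buffer; sink: HDFS
    agent.sources = tail
    agent.channels = mem
    agent.sinks = toHdfs

    agent.sources.tail.type = exec
    agent.sources.tail.command = tail -F /var/log/app.log
    agent.sources.tail.channels = mem

    agent.channels.mem.type = memory

    agent.sinks.toHdfs.type = hdfs
    agent.sinks.toHdfs.hdfs.path = hdfs:///flume/logs
    agent.sinks.toHdfs.channel = mem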
Apache Sqoop – Another data ingestion service, one that mostly handles structured data. Sqoop works between relational databases and HDFS: you can move data into HDFS from an RDBMS and export it back out again.
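Two sketches of what Sqoop commands look like; the JDBC URL, username, tables, and directories are hypothetical.

    # import a MySQL table into HDFS (-P prompts for the password)
    sqoop import --connect jdbc:mysql://dbhost/shop \
        --username analyst -P \
        --table orders --target-dir /data/orders

    # export HDFS data back into an RDBMS table
    sqoop export --connect jdbc:mysql://dbhost/shop \
        --username analyst -P \
        --table order_summary --export-dir /data/order_summary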
Zookeeper – The coordinator. ZooKeeper ensures coordination between the various tools in the Hadoop Ecosystem so that they can perform their tasks without interruption, and it helps manage the services running in the Hadoop cluster.
Apache Ambari – The cluster manager. Ambari provisions, manages, and monitors Apache Hadoop clusters, including the health of the cluster. It was developed by Hortonworks and later given to Apache.
There are many other tools in the Hadoop Ecosystem, but the ones listed above are the most important.