What is HIVE – Learn Hadoop Online

This post continues our Learn Hadoop blog series, following the earlier posts Introduction to Big Data Hadoop Online Training, Learn Hadoop Ecosystem Basics, and Learn Spark, a Hadoop Ecosystem Component.

This Hadoop Online Training tutorial is the Apache HIVE Tutorial for Beginners.

Big Data Hadoop Tutorial - What is HIVE

After this training you will be able to answer the following questions: What are the features of Apache HIVE? Why use Apache HIVE? What is Apache HIVE used for? Which use cases suit Apache HIVE?

A Brief History of Apache HIVE

Facebook started building HIVE in 2007, when its data grew from a few GB to a few TB per day and included text, images, videos and many other formats. Traditional databases could not handle this volume. Facebook could process the huge datasets in parallel using Hadoop storage and MapReduce, but it faced a new challenge: analyses that analysts expressed in SQL had to be rewritten as MapReduce programs. Facebook analyzed this problem and came up with HIVE, which provides a SQL-like interface whose queries are converted into MapReduce jobs. Facebook later handed HIVE over to Apache as an open source project.

Note – Yahoo developed PIG to solve the same issue of data growing exponentially

Using HIVE Query Language (HQL) we can create databases and tables, read data, and use partitions and buckets to organize the data. HIVE offers schema flexibility, and JDBC/ODBC drivers are available for reading data from client applications. Creating tables, storing data, and processing data are all straightforward with HQL.

Advantages of HIVE

A data warehouse package built on top of Hadoop. All the usual data warehouse functions, such as creating databases, tables, and views, can be performed in HIVE. Mainly used for data analysis by business analysts who already have SQL expertise. Targeted towards SQL experts. Can be used without knowing Java or the Hadoop APIs.

Limitations of HIVE

Used only for managing and querying structured data. Not designed for online transaction processing (it does not provide selective insert / update). Does not offer real-time queries or row-level updates. Latency for HIVE queries is very high, because each query is converted into a batch job.

Advantages of HQL

Filter data using a WHERE clause. Partitioning is supported to speed up reading and consuming data (partitions can be created, dropped, and altered). Ability to store the results of one query in another table. Ability to store the results of one query in an HDFS directory.
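To see why partitioning speeds up reads, here is a minimal Python sketch. It is not HQL and not how HIVE is implemented; the rows, the `country` partition key, and the function names are all hypothetical. It only illustrates the idea that a query filtering on the partition key has to scan one partition instead of the whole table.

```python
from collections import defaultdict

# Hypothetical rows: (country, user, amount). Illustrative data only.
ROWS = [
    ("IN", "asha", 120),
    ("US", "bob", 80),
    ("IN", "ravi", 45),
    ("UK", "carol", 200),
]


def build_partitions(rows):
    """Group rows by the partition key, like one storage directory per value."""
    partitions = defaultdict(list)
    for country, user, amount in rows:
        partitions[country].append((user, amount))
    return dict(partitions)


def query(partitions, country, min_amount):
    """Scan only the matching partition, then apply a WHERE-style filter."""
    scanned = partitions.get(country, [])
    matches = [(user, amount) for user, amount in scanned if amount >= min_amount]
    return matches, len(scanned)


partitions = build_partitions(ROWS)
result, rows_scanned = query(partitions, "IN", 100)
print(result, rows_scanned)
```

Only the two rows in the "IN" partition are scanned, not all four rows of the table. The `result` list could in turn be written to another table or to a directory, which mirrors the last two points above.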

Difference between HIVE and PIG (you may ask: if PIG is there, why use HIVE?)

HIVE | PIG
Used by analysts generating daily reports | Preferred by programmers and researchers
SQL-like query language | Pig Latin, a procedural language
Supports partitioning for better processing of data | No partitioning support
Limited JDBC/ODBC support | No support for JDBC/ODBC
Web interface supported | Web interface not supported
Shells / Streaming / Java supported | Shells / Streaming / Java supported


SPARK * Hadoop EcoSystem * Hadoop Training Online

This post continues our Learn Hadoop blog series, following the Introduction to Big Data Hadoop Online Training video on YouTube and the Learn Hadoop Ecosystem Basics blog post.

This is the Apache Spark Tutorial for Beginners, part of our Hadoop Online Training. Let's begin.

Learn Apache Spark - A Hadoop Ecosystem Component

After this training you will be able to answer the following questions: What are the features of Apache Spark? Why use Apache Spark? What is Apache Spark used for? Which use cases suit Apache Spark?

As we know, Big Data is a huge amount of structured or unstructured data generated in a short amount of time, online or offline, by sources such as Facebook (text, audio, video, pictures), Twitter, and the medical domain. This data can have problems such as missing values or broken links. The volume of data is said to double roughly every 2 years, and there will be enough of it to support predictions for marketing, disaster response, medicine, and saving human lives.

Why Apache Spark?

At a high level, Apache Spark is a fast, high-performance distributed cluster computing framework. It is much faster at computation largely because it uses memory to perform iterative computations instead of writing intermediate results to disk at every step. Spark suits both batch processing (the domain of Hadoop MapReduce) and real-time processing (the domain of Apache Storm). Spark is written in Scala and runs on the JVM; Java, Scala and Python can all be used to program it.

Compared with Hadoop MapReduce, Spark jobs run much faster. Hadoop implements batch processing on Big Data, so it cannot serve real-time use cases. Spark is also easier to use than MapReduce programming. In short, Apache Spark processes data in real time, handles inputs from various sources, is easy to use (programs are easy to understand and write), and processes data very quickly. Examples of real-time processing with Spark today include stock market analysis, Twitter analytics, banking fraud detection, and medicine.

So, what is Apache Spark? Apache Spark is an open source framework for real-time or batch processing, developed under the Apache Software Foundation. Spark is easier to use because it gives you a single interface for programming an entire cluster. Spark is reliable, and it is not built on top of MapReduce: it has its own execution engine and serves as an alternative to MapReduce. It can be used stand-alone, without the Hadoop File System. Spark comes as a simple binary download that you unpack and run, with no separate installation step.

Let's see the uses of Apache Spark. Spark is reported to run up to about 100 times faster than Hadoop MapReduce on some in-memory workloads. It can be programmed using Java, Scala, R and Python. It provides a simple programming layer and has a very powerful caching capability. You can deploy your application on Mesos, on YARN, or on a standalone cluster.

The Spark Ecosystem consists of a central engine called the Spark Core Engine. On top of it sit the libraries: Spark SQL (in-memory SQL computation, used only for structured data), Spark Streaming (real-time processing of structured or unstructured data), MLlib (a machine learning library with supervised algorithms, which learn from data whose output is known, and unsupervised algorithms, where no output is known), GraphX (graph processing and graph computations), and SparkR (R on Spark, still in beta at the time of writing).

For a case study, consider a real-time calamity such as a tsunami, an earthquake, heavy rains, or a failure of railway signals, where a prediction made just a few minutes or seconds earlier can save hundreds of thousands of lives. Apache Spark can handle all the requirements of this case study: process data in real time (just as the calamity is about to happen or has happened), handle inputs from multiple sources (different sources would be sending the data), make the final data easy to use (results shown in graphical form for faster understanding), and transmit alerts in bulk (messaging within seconds of receiving results so that lives can be saved).

Learn all about Apache Spark in depth in the Online Hadoop Training by ITJobZone.biz.

Learn Hadoop Online – Basic Terminologies and their Applications

This is the second blog post in our Hadoop Online Training series.

It follows the Introduction to Big Data Hadoop Online Training video on YouTube and our first Hadoop tutorial blog post, on Hadoop Ecosystem basics. In this Hadoop tutorial we will learn basic Hadoop terminology and its applications through questions and answers. Let's begin.

Learn Hadoop from our Big Data Hadoop Online Training to understand these terms through practical implementations and live practice examples in the Hadoop Ecosystem.


    Q 1. What is Big data?

    Big data is not just a large amount of data (terabytes or petabytes). It is data so large and complex that traditional data processing systems such as an RDBMS cannot process it. More than its volume, it is the nature of the data that makes it Big data. Big data is mainly characterized by its Volume, Velocity and Variety. Volume is the huge size of the data to be stored, Velocity is the rate at which data is being transferred at a point in time, and Variety is the type of data. By variety, data is classified as structured, semi-structured or unstructured.

    Structured data – Data that is stored in table format and has relations. For example: CRM data, RDBMS tables.

    Semi-structured data – Data that has some structure but no relations. For example: JSON, XML.

    Unstructured data – Data that has no structure at all. For example: audio, video, images.

    Semi-structured and unstructured data are not well suited to storage in RDBMS systems. A commonly quoted estimate is that about 90% of the world's data is unstructured, which makes analysis a problem.

    Q2. What is the difference between Hadoop and Traditional database management systems?

    Hadoop is built to store large amounts of semi-structured and unstructured data, whereas a traditional RDBMS efficiently stores only structured data.

    Hadoop stores data directly as files, whereas an RDBMS stores data in a fixed tabular format, which has some drawbacks.

    Hadoop does not waste space on data that is missing or NULL, whereas a traditional RDBMS stores the NULL values as well.

    Hadoop is most suitable for analysis of data (OLAP), whereas a traditional RDBMS is most suitable for transaction processing (OLTP).

    Hadoop allows you to store data without providing any schema and to supply the schema when you read or process the data, whereas a traditional RDBMS needs the schema to be defined when you write data.

    Hadoop provides fast writes, whereas RDBMS systems provide fast reads of data.
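The schema-on-read point above can be sketched in a few lines of Python. This is only an illustration under assumptions: the raw text, the `SCHEMA` list, and `read_with_schema` are hypothetical names, not part of Hadoop. The raw data is stored exactly as it arrived, and a schema is applied only at the moment the data is read.

```python
import csv
import io

# Raw data is stored as plain text with no schema attached.
RAW = "1,asha,120\n2,bob,\n3,carol,200\n"

# Hypothetical schema, chosen only at read time: (column name, type).
SCHEMA = [("id", int), ("name", str), ("amount", int)]


def read_with_schema(raw_text, schema):
    """Apply column names and types while reading; empty fields become None."""
    records = []
    for fields in csv.reader(io.StringIO(raw_text)):
        record = {}
        for (name, cast), value in zip(schema, fields):
            record[name] = cast(value) if value != "" else None
        records.append(record)
    return records


records = read_with_schema(RAW, SCHEMA)
for record in records:
    print(record)
```

A different reader could apply a different schema to the same stored text, which is not possible when the schema is fixed at write time.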

    Q 3. What is the difference between Vertical and Horizontal Scalability?

    Vertical scalability means that when you need more storage or processing power, you add more hardware to the existing system. For example, if you have a system with 8 GB of RAM and a 1 TB hard drive, you add more hard drives and RAM to that same system, for which you have to shut down your server. This is limited by factors such as data transfer rate and hardware compatibility, so you cannot scale beyond a certain range (a few terabytes). This pattern of scalability is used by traditional RDBMS.

    Horizontal scalability means that when you need more storage or processing power, you simply add another machine, of equal or any desired capacity, to the network, without worrying about downtime and the other factors. This makes it very easy to scale and to store and process huge amounts of data efficiently, petabytes and even more. This pattern is used in Hadoop and NoSQL databases.

    Q 4. How does Hadoop store/process Big data?

    Hadoop creates a cluster of several machines, connected by some network topology, that works as a single unit of storage and processing. For storage, Hadoop keeps its data on a distributed file system called HDFS (Hadoop Distributed File System). When a large file is stored on HDFS, the file system breaks it into pieces that are spread over several machines. This also decreases the time required to read the file by a huge factor, because several systems can read their pieces simultaneously over their own I/O channels.
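The splitting step can be illustrated with a small Python sketch. It is a simplification under assumptions: the 256-byte block size, the node names, and the round-robin placement are made up for the example. Real HDFS uses far larger blocks and also keeps replicas of each block, both of which are left out here.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the file contents into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes):
    """Assign block indexes to nodes round-robin (a stand-in placement policy)."""
    placement = {node: [] for node in nodes}
    for index, _ in enumerate(blocks):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


data = b"x" * 1000                       # a pretend 1000-byte file
blocks = split_into_blocks(data, 256)    # hypothetical block size
placement = place_blocks(blocks, ["node1", "node2", "node3"])
print([len(block) for block in blocks])
print(placement)
```

The `placement` dictionary plays the role of the metadata that records which node holds which block, so each node can read its own blocks in parallel.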

    To process the data stored in HDFS, Hadoop uses a simple programming framework called MapReduce. Using a simple program written in Java, you can process the data on the same nodes on which it is stored; this is called data locality. Suppose you have a large file and want to perform a word count on it. The MapReduce framework counts the words in the part of the file stored on each node, and every node then sends its intermediate counts to another node, which computes the final sums.
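The word count described above can be sketched in plain Python. This is not Hadoop's Java API; the function names and the two sample chunks are hypothetical, and everything runs in one process. Each chunk stands for the part of the file held by one node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Runs per node: emit a (word, 1) pair for every word in the local chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Collect the intermediate pairs from all nodes, grouped by word."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Sum the intermediate counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


chunks = ["big data is big", "hadoop stores big data"]  # one chunk per node
word_counts = reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))
print(word_counts)
```

In a real cluster the map step runs where each chunk is stored, and only the small intermediate pairs travel over the network.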

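    Q 5. What is Hadoop Streaming?

    The Hadoop distribution includes a generic programming interface, referred to as Hadoop Streaming, for writing MapReduce programs in any desired language, such as Python, Perl or Ruby. The mapper and reducer are ordinary programs that read their input from standard input and write their output to standard output.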
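As a rough sketch of what a streaming mapper and reducer look like, the Python below works on lines of text and tab-separated key/value pairs. To keep it runnable in one file, both are written as functions over lists of lines, and `sorted` stands in for the sorting Hadoop performs between the two phases. In an actual streaming job each would be its own script reading `sys.stdin` and printing to standard output. The function names are illustrative.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key (input must be sorted)."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


mapped = sorted(mapper(["big data", "big cluster"]))  # stand-in for Hadoop's sort
reduced = list(reducer(mapped))
print(reduced)
```

Because the contract is just text in and text out, the same two programs could be written in any language that can read and write standard streams.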

    Q 6. What are Hadoop Daemons?

    HDFS and MapReduce clusters use background processes that are referred to as Hadoop daemons. These daemons must always be running in the background for a Hadoop cluster to function properly.

    Namenode – This is the central node of the cluster, which stores all the metadata of the system. Hadoop 1.x allows only one Namenode, which makes it a single point of failure: if the Namenode is down, no metadata is available and you cannot access the data stored in the data nodes. Hadoop 2.x solves this with High Availability, which lets two Namenodes be configured and kept in sync so that if one goes down the other takes over. Only one Namenode is active at a time and the other is on standby. Hadoop 3.x allows more than two Namenodes for better availability.

    Datanode – There can be many (hundreds or thousands of) Datanodes in a single Hadoop cluster. Datanodes store the actual data of files, which is distributed across the nodes in the form of blocks.

    Secondary Namenode – Despite its name, this is not a standby Namenode. It periodically copies the metadata from the Namenode and keeps a checkpoint of it. If the Namenode goes down, the Secondary Namenode does not take over the Namenode's role.

    Resource Manager – The RM is the master in a YARN cluster. It manages all the available resources and allocates them across the Node Managers in the cluster.

    Node Manager – The NM takes instructions from the Resource Manager and manages the resources on a single node.

    Q 7. What are the components of Hadoop Ecosystem?

    Hadoop itself has only a few core components, HDFS, Hadoop MapReduce and YARN (plus the shared Hadoop Common utilities), which together store and process data efficiently. Depending on the application's needs, you will have to add more components for data access, transfer, serialization, visualization and so on. The main components that, together with Hadoop, make up the Hadoop Ecosystem are:

  • HDFS
  • Hadoop MapReduce
  • YARN
  • Hadoop Common
  • Data Access components – Pig, Hive
  • Real-Time storage components – HBase, Accumulo
  • Data Integration components – Sqoop, Flume
  • Data Serialization Components – Thrift, Avro
  • Data Management, monitoring and coordination – Ambari, Oozie, ZooKeeper, Ganglia
  • Data Intelligence Components – Mahout, Drill


What is Hadoop Ecosystem * Hadoop Training Online

In this Hadoop tutorial we will Learn Hadoop Ecosystem basics. Let’s begin….

Hadoop Online Training - Hadoop EcoSystem

What is a Hadoop Ecosystem?

Hadoop Ecosystem is not a single tool, a programming language, or a single framework. It is a group of tools that various companies in various domains use together for different tasks.

Hadoop alone cannot provide all the facilities or services required to process big data. Hadoop can store big data and can process it up to a point, but there are other requirements. For example, we may want to build recommendation engines over big data, run clustering algorithms over it, or get real-time insight from it. Hadoop is a batch processing framework, so for real-time insight you need another tool that works over HDFS or leverages HDFS. One single tool like Hadoop will not solve all your problems; you have to use various other tools over Hadoop, or along with Hadoop, to solve the problems you have. These tools, along with Hadoop, are known as the Hadoop Ecosystem.

The sections below cover the important tools in the Hadoop Ecosystem that can be used with Hadoop and the functions they perform in their own domains.

HDFS – Hadoop Distributed File System is the storage unit of Hadoop. An HDFS cluster is formed by data nodes (cheap commodity hardware) clustered together by the Hadoop framework. Using HDFS you can store any kind of data, be it structured, unstructured or semi-structured. Once the data is stored in HDFS, you can view all of it as a single unit. HDFS stores data across various nodes and also maintains information about the stored data, called metadata, which records what data is stored at which position. HDFS therefore has two components: the NameNode (manages the entire cluster and keeps the metadata for the data stored in the data nodes) and the DataNode (commodity hardware that actually stores the data).

YARN – Introduced in Hadoop 2.0, YARN enabled various ecosystem tools to connect with HDFS and leverage big data. YARN stands for Yet Another Resource Negotiator. Its main purpose is to allocate resources for tasks that run on the Hadoop cluster. It has two daemons: the ResourceManager (master daemon) and the NodeManager (slave daemon). As soon as a client submits a job, the ResourceManager allocates resources for it; these resources are nothing but the containers in which the job's tasks are executed. The NodeManager executes the tasks in these containers and manages the work on its node. YARN sits as a layer over HDFS, and MapReduce now connects to YARN to get resources for executing its tasks.

MapReduce – The core component in the Hadoop Ecosystem for data processing through programming. It basically comprises two functions, the Map function and the Reduce function. The Map function performs operations such as filtering, grouping and sorting; the output of the Map function is aggregated by the Reduce function, and the summarized result is written back to HDFS. MapReduce itself is written in Java, but jobs are not confined to any one language: they can be written in Python, Perl or other languages.

Apache PIG – Processing service using queries. PIG is a data processing tool that runs over Hadoop. Apache Pig has its own language, called Pig Latin, which is a data flow language or an instructional language. For example, to load data you use the LOAD statement. The tool was developed to make the programmer's life easier: roughly 1 line of Pig Latin does the work of about 100 lines of MapReduce code. A Pig script loads the data, performs the required functions such as grouping, filtering and sorting, and finally dumps the result on the screen or stores it in HDFS.

Apache HIVE – Developed at Facebook. At the time HIVE was developed, RDBMS were flourishing. Even the Facebook website ran on MySQL, and this was a problem: the workforce was trained in MySQL, but writing a MapReduce program required knowing another programming language. So Facebook came up with a tool called HIVE, with which you write SQL-like queries in HIVE Query Language (HQL), execute the same task over the Hadoop cluster, and leverage Big Data. Just as with PIG, you write simple queries, and the task you previously executed with MapReduce can now be executed through HIVE without getting into the complexities of MapReduce. You can also connect to HIVE from client applications such as Java programs, if you have that requirement. HIVE is therefore an important tool for those who do not want to write MapReduce programs, and the HQL syntax is similar to SQL. Note that HIVE does not support every SQL query, but the most common functions, such as grouping and filtering, can be performed.

Apache SPARK – SPARK is a leading tool in the Hadoop Ecosystem. MapReduce on Hadoop can only be used for batch processing, which means we cannot get results in real time. With data growing so fast, there is a requirement for real-time analytics as well, and this is where SPARK comes into the picture. SPARK can work standalone as well as over a Hadoop cluster and can provide real-time results. Apache SPARK is also claimed to be up to 100 times faster than Hadoop MapReduce.

Apache HBase – A NoSQL database that runs over Hadoop. Apache HBase is an open source, non-relational, distributed database that is capable of storing and handling any kind of data. It is modeled after Google BigTable and can be used to store any Big Data held in the Hadoop file system. You can use Apache HBase as a back end for a website or web application that queries in real time, which cannot be done with PIG, HIVE or MapReduce. You can also write Java applications and connect to HBase using the REST, Thrift and Avro APIs.

Apache Drill – An open source application that works with NoSQL databases, flat file systems and simple files, all at the same time. For example, imagine that we have to fetch data stored in different data stores such as HDFS, HBase and MongoDB. Each of these data stores has its own syntax for getting data, so we would have to write different syntax for each. Using Apache Drill we can connect to all three at once, execute one query, extract data from all three, and use the data in our application. This works because Drill follows ANSI SQL, which lets you write one query that can be run against all three data stores.

Apache Oozie – Job scheduler in the Hadoop Ecosystem. For example, suppose we have a task that needs to be performed every 30 minutes. We can define a workflow in Oozie and schedule it to execute on its own every 30 minutes, rather than having someone trigger it manually. We are doing two things here: defining a workflow, which is a sequence of one or more tasks executed by various tools such as MapReduce, Hive and Pig, and defining the frequency at which the workflow needs to be executed. The Oozie coordinator is the component that handles both, and it can also start the workflow only when the data is available, which makes it an event-based job scheduler.
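The two ideas in the Oozie description, a workflow as an ordered list of tasks and a coordinator that fires it at a fixed frequency only when data is available, can be sketched in Python. This is not Oozie's API (Oozie workflows are defined in XML); `scheduled_times`, `run_workflow`, and the placeholder steps are hypothetical names used only to illustrate the concept.

```python
from datetime import datetime, timedelta


def scheduled_times(start, end, frequency_minutes):
    """List every trigger time from start (inclusive) to end (exclusive)."""
    times = []
    current = start
    while current < end:
        times.append(current)
        current += timedelta(minutes=frequency_minutes)
    return times


def run_workflow(actions, data_available):
    """Run the actions in sequence, but only if the input data is available."""
    if not data_available:
        return []
    return [action() for action in actions]


# A workflow of two placeholder steps, triggered every 30 minutes for 2 hours.
workflow = [lambda: "hive step done", lambda: "pig step done"]
times = scheduled_times(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 2, 0), 30)
results = run_workflow(workflow, data_available=True)
skipped = run_workflow(workflow, data_available=False)
print(len(times), results, skipped)
```

The schedule yields four trigger times, and the workflow steps run in order only for triggers where the data check passes.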

Flume – Data ingesting service for ingestion into HDFS. Using Flume you can ingest any kind of data into HDFS and perform tasks on the data after storing it. Flume gives you the flexibility to extract data from social media such as Facebook and Twitter, from network traffic, or from any service where logs are generated at regular intervals, and move the data into HDFS.

Apache Sqoop – Data ingesting service that mostly handles structured data. It is used between a relational database and HDFS: you can move data in both directions between HDFS and an RDBMS.

ZooKeeper – Coordinator. Ensures coordination between the various tools in the Hadoop Ecosystem so that they can perform a particular task without interruption. It also helps manage the services running in the Hadoop cluster.

Apache Ambari – Cluster manager. Provisions, manages and monitors the Apache Hadoop cluster, including the health of the cluster. Developed by Hortonworks and later given to Apache.

There are many other tools in the Hadoop Ecosystem, but the ones listed above are the most important.
