Learn Hadoop Online – Basic Terminologies and their Applications

This is the second blog in our Hadoop Online Training series.

  • Watch our Introduction to Big Data Hadoop Online Training video on YouTube and read our first blog of Hadoop Tutorials on Hadoop Ecosystem basics. In this Hadoop tutorial we will learn basic Hadoop terminologies and their applications through questions and answers.
  • Let’s begin…

    Learn Hadoop from our Big Data Hadoop Online Trainings to understand these terminologies through practical implementations and live practice examples in the Hadoop Ecosystem.

 

    Q 1. What is Big data?

    Big data is not just a large amount of data (terabytes or petabytes); it is data so large and complex that it cannot be processed by traditional data processing systems such as RDBMS. More than the volume of the data, it is its nature that defines whether it is considered Big data. Big data is mainly characterized by its Volume, Velocity and Variety: Volume is the sheer size of the data to be stored, Velocity is the rate at which data arrives at a point in time, and Variety is the type of data. By variety, data is classified as structured, semi-structured and unstructured.

    Structured data – Data stored in table format with relations. For example: CRM, RDBMS.

    Semi-structured data – Data that has some structure but no relations. For example: JSON, XML.

    Unstructured data – Data with no structure at all. For example: audio, video, images.

    Semi-structured and unstructured data are not suitable for storage in RDBMS systems, and about 90% of the world’s data is unstructured, which makes it a problem for traditional analysis.
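    As a tiny illustration of the three varieties (the records below are invented for the example), the same customer fact can appear as a structured row, a semi-structured JSON document, or unstructured free text:

```python
import json

# Structured: a fixed-schema row, as in an RDBMS table (hypothetical columns).
structured_row = ("C001", "Asha", "asha@example.com")

# Semi-structured: JSON has structure (keys, nesting) but no enforced relations.
semi_structured = json.loads('{"id": "C001", "name": "Asha", "tags": ["gold"]}')

# Unstructured: free text (or audio/video bytes) with no inherent schema.
unstructured = "Asha called support on Monday and asked about her gold status."

print(structured_row[1], semi_structured["name"])  # both refer to the same customer
```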

    Q 2. What is the difference between Hadoop and traditional database management systems?

    Hadoop is built to store large amounts of semi-structured and unstructured data, whereas a traditional RDBMS stores only structured data efficiently.

    Hadoop stores the data directly as files, whereas an RDBMS stores data in a tabular format that comes with its own drawbacks.

    Hadoop handles NULL values gracefully and does not waste space storing data that is missing or NULL, whereas a traditional RDBMS stores the NULL values and consumes space for them.

    Hadoop is most suitable for analysis of data (OLAP), whereas a traditional RDBMS is most suitable for transaction processing (OLTP).

    Hadoop allows you to store data without providing any schema and apply a schema only when you read/process the data (schema-on-read), whereas a traditional RDBMS needs the schema to be defined before you write data (schema-on-write); see the sketch after this list.

    As a consequence, Hadoop provides fast writes (nothing is validated at load time), while RDBMS systems provide fast reads (the data is already structured when queried).
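    Here is a minimal sketch of the schema-on-read idea in plain Python (purely illustrative; the field names and raw data are made up). Under schema-on-write the malformed row would be rejected before it is stored; under schema-on-read everything lands on disk as-is and the structure is imposed only when the data is read:

```python
import csv
import io

# Data lands on disk exactly as it arrived, including the malformed second row.
raw = "C001,Asha,29\nC002,Ravi,not_a_number\n"

def matches_schema(row):
    """The 'schema': three fields, the last of which must be an integer age."""
    return len(row) == 3 and row[2].isdigit()

# Schema-on-read: apply the schema while reading, skipping rows that fail it.
for row in csv.reader(io.StringIO(raw)):
    if matches_schema(row):
        print({"id": row[0], "name": row[1], "age": int(row[2])})
    else:
        print("skipped malformed row:", row)
```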

    Q 3. What is the difference between Vertical and Horizontal Scalability?

    Vertical scalability means that if you need more storage or processing power, you add more hardware to the existing system. For example, if you have a system with 8 GB RAM and a 1 TB hard drive, you add more disk and RAM to that same system, for which you have to shut down your server. This is limited by various parameters such as data transfer rate and hardware compatibility, so you cannot scale beyond a certain range, typically only some terabytes. This pattern of scalability is used by traditional RDBMS.

    Horizontal scalability means that if you need more storage or processing power, you do not need to worry about downtime and similar factors; you simply add another machine of the desired storage/processing capacity to the network. For example, ten 1 TB nodes give you 10 TB, and adding an eleventh node adds another terabyte with no downtime. This makes it very easy to scale and to store/process huge amounts of data efficiently, petabytes and even more. This pattern is used by Hadoop and NoSQL databases.

    Q 4. How does Hadoop store/process Big data?

    Hadoop creates a cluster of several machines, connected by some network topology, which works as a single unit of storage and processing. For storage, Hadoop keeps its data on a distributed filesystem called HDFS (Hadoop Distributed File System). When a large file is stored on HDFS, the filesystem breaks it into blocks which are spread across several machines. This also reduces the time required to read the file by a huge factor, since several machines can read the data simultaneously, each with its own I/O channels.
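    As a rough illustration of the block layout, here is a small sketch assuming the Hadoop 2.x defaults of a 128 MB block size and a replication factor of 3 (both are configurable, via dfs.blocksize and dfs.replication):

```python
import math

BLOCK_SIZE_MB = 128   # assumed default block size (dfs.blocksize)
REPLICATION = 3       # assumed default replication factor (dfs.replication)

def hdfs_block_layout(file_size_mb):
    """Show how HDFS would split a file of the given size into blocks."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    print(f"{file_size_mb} MB file -> {blocks} blocks of up to {BLOCK_SIZE_MB} MB")
    print(f"with replication {REPLICATION}, the cluster stores "
          f"{blocks * REPLICATION} block copies in total")

hdfs_block_layout(1024)   # a 1 GB file -> 8 blocks, 24 replicated copies
```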

    For processing the data stored in HDFS, Hadoop uses a simple programming framework called MapReduce. Using a simple program written in Java, you can process the data on the same nodes on which it is stored; this is called data locality. So if you have a large file and you want to perform a word count on it, the MapReduce framework will count the words stored on each node in parallel, and all nodes will send their intermediate counts to another node for the final sum.
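    To make that flow concrete, here is a minimal sketch in plain Python (not actual MapReduce code; it just simulates the map, shuffle and reduce phases of a word count on one machine):

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word, as each node would do locally."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all intermediate counts by word across 'nodes'."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts of each word to get the final totals."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["hadoop stores big data", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(shuffle(pairs)))   # {'hadoop': 2, 'stores': 1, 'big': 2, ...}
```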

    Q 5. What is Hadoop Streaming?

    The Hadoop distribution ships with a generic programming interface that lets you write MapReduce programs in any language that can read from standard input and write to standard output, such as Python, Perl or Ruby; this is referred to as Hadoop Streaming.
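    For example, a streaming word count needs only two small scripts (the file names here are arbitrary). The mapper reads raw lines from stdin and emits tab-separated word/count pairs; the framework sorts them by key, so the reducer sees equal words on adjacent lines and just sums each run:

```python
#!/usr/bin/env python3
# mapper.py - reads raw input lines from stdin, emits "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - input arrives sorted by key, so equal words are adjacent;
# sum the run of counts for each word and emit the total.
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{total}")
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print(f"{current_word}\t{total}")
```

    The job is then submitted through the streaming jar, roughly like `hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the jar’s exact name and path vary by Hadoop version and distribution).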

    Q 6. What are Hadoop Daemons?

    HDFS and MapReduce/YARN clusters use some background processes which are referred to as Hadoop daemons. These daemons need to run in the background at all times for the proper functioning of a Hadoop cluster.

    Namenode – This is the central node of the cluster and stores all the metadata of the filesystem. Only one Namenode is allowed in Hadoop 1.x, which makes it a single point of failure: if the Namenode is down, no metadata is available and you cannot access the data stored on the Datanodes. Hadoop 2.x comes with a solution called High Availability, which allows two Namenodes to be configured in sync, so if one Namenode goes down the other takes over; only one Namenode is active at a time while the other is on standby. The upcoming Hadoop 3.x allows more than two Namenodes for even better availability.

    Datanode – There can be many (hundreds or thousands of) Datanodes in a single Hadoop cluster. Datanodes store the actual data of the files, distributed in the form of blocks across the different nodes.

    Secondary Namenode – This periodically fetches the metadata from the Namenode and merges it into a fresh checkpoint, so it can serve as a partial backup of the metadata. It is not an active standby, which means that if the Namenode goes down, the Secondary Namenode will not take over the Namenode’s role immediately.

    Resource Manager – The RM is the master of a YARN cluster; it manages all the available resources in the cluster and allocates them to the Node Managers.

    Node Manager – The NM takes instructions from the Resource Manager and manages the resources on a single node.

    Q 7. What are the components of Hadoop Ecosystem?

    Hadoop itself has basically only three components, HDFS, Hadoop MapReduce and YARN, which can store and process data efficiently. But depending on the application’s needs you will have to add more components for data access, transfer, serialization, visualization and so on. A few of the main components, along with Hadoop, make up the Hadoop Ecosystem:

  • HDFS
  • Hadoop MapReduce
  • YARN
  • Spark
  • Hadoop Common
  • Data Access components – Pig, Hive
  • Real-Time storage components – HBase, Accumulo
  • Data Integration components – Sqoop, Flume
  • Data Serialization Components – Thrift, Avro
  • Data Management, monitoring and coordination – Ambari, Oozie, ZooKeeper, Ganglia
  • Data Intelligence Components – Mahout, Drill
  • Want to learn more about Big Data Hadoop? Follow us. Get Big Data Hadoop training online from expert trainers at ITJobZone.biz – Start Hadoop Training Online Now.

 
