Hadoop Interview Questions

What is Big Data?

Any data that cannot be stored in a traditional RDBMS is termed Big Data. Most of the data we use today has been generated in the past two decades, and it is largely unstructured or semi-structured in nature. More than the volume of the data, it is the nature of the data that determines whether it is considered Big Data or not.

What do the four V’s of Big Data denote?

IBM has a nice, simple explanation for the four critical features of big data:
a) Volume – Scale of data
b) Velocity – Analysis of streaming data
c) Variety – Different forms of data
d) Veracity – Uncertainty of data

What is Hadoop and what are its components?

When “Big Data” emerged as a problem, Apache Hadoop evolved as a solution to it. Apache Hadoop is a framework which provides us various services or tools to store and process Big Data. It helps in analyzing Big Data and making business decisions out of it, which can’t be done efficiently and effectively using traditional systems.

♣ Tip: Now, while explaining Hadoop, you should also explain the main components of Hadoop, i.e.:

  • Storage unit– HDFS (NameNode, DataNode)
  • Processing framework– YARN (ResourceManager, NodeManager)

What are the basic differences between relational database and HDFS?

  • RDBMS relies on structured data and the schema is known before the data is loaded (schema on write), whereas HDFS can store any kind of data – structured, semi-structured or unstructured – and the schema is applied only when the data is read (schema on read).
  • RDBMS is best suited for fast reads and writes of individual records (OLTP), whereas HDFS is designed for writing data once and reading it many times for large-scale batch analytics.
  • RDBMS usually requires licensed software and high-end servers, whereas Hadoop/HDFS is open source and runs on commodity hardware.

Name some companies that use Hadoop.

Yahoo (one of the biggest users and contributor of more than 80% of the code to Hadoop). Other well-known users include Facebook, Amazon, Netflix, Adobe, eBay, Spotify and Twitter.

On what concept does the Hadoop framework work?

The Hadoop framework works on the following two core components:

1) HDFS – Hadoop Distributed File System is the Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks and it operates on a Master-Slave architecture.

2) Hadoop MapReduce – This is the Java-based programming paradigm of the Hadoop framework that provides scalability across various Hadoop clusters. MapReduce distributes the workload into various tasks that can run in parallel. A Hadoop job performs two separate tasks: the map job breaks down the data sets into key-value pairs or tuples, and the reduce job then takes the output of the map job and combines the data tuples into a smaller set of tuples. The reduce job is always performed after the map job.
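
For illustration only (this sketch is not part of the original answer), a minimal word-count job written against the Hadoop MapReduce Java API looks like the following; the class names and the input/output paths passed on the command line are hypothetical.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map job: break each input line into (word, 1) key-value pairs
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce job: combine the tuples emitted by the map job into a smaller set
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // hypothetical input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // hypothetical output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}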

Explain about the different schedulers available in Hadoop.

  • FIFO Scheduler – This scheduler does not consider the heterogeneity in the system but orders the jobs based on their arrival times in a queue.
  • COSHH – This scheduler considers the workload, cluster and user heterogeneity for scheduling decisions.
  • Fair Sharing – This Hadoop scheduler defines a pool for each user. The pool contains a number of map and reduce slots on a resource. Each user can use their own pool to execute the jobs.

What is jps command used for?

jps command is used to verify whether the daemons that run the Hadoop cluster are working or not. The output of jps command shows the status of the NameNode, Secondary NameNode, DataNode, TaskTracker and JobTracker.

What are the main components of a Hadoop Application?

Hadoop applications use a wide range of technologies that provide great advantages in solving complex business problems.

Core components of a Hadoop application are-

1) Hadoop Common

2) Hadoop Distributed File System (HDFS)

3) Hadoop MapReduce


Data Access Components are – Pig and Hive

Data Storage Component is – HBase

Data Integration Components are – Apache Flume, Sqoop, Chukwa

Data Management and Monitoring Components are – Ambari, Oozie and Zookeeper.

Data Serialization Components are – Thrift and Avro

Data Intelligence Components are – Apache Mahout and Drill.

What is Hadoop Streaming?

The Hadoop distribution has a generic application programming interface for writing Map and Reduce jobs in any desired programming language like Python, Perl, Ruby, etc. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the Mapper or Reducer.

What are the most commonly defined input formats in Hadoop?

The most common Input Formats defined in Hadoop are:

  • Text Input Format – This is the default input format defined in Hadoop.
  • Key Value Input Format – This input format is used for plain text files wherein the files are broken down into lines.
  • Sequence File Input Format – This input format is used for reading files in sequence.

List some use cases of the Hadoop Ecosystem

Text Mining, Graph Analysis, Semantic Analysis, Sentiment Analysis, Recommendation Systems.

Which is the best operating system to run Hadoop?

Ubuntu or another Linux distribution is the most preferred operating system to run Hadoop. Windows can also be used, but it leads to several problems and is not recommended.

What are the network requirements to run Hadoop?

  • SSH is required to launch server processes on the slave nodes.
  • A password-less SSH connection is required between the master, the secondary machine and all the slaves.

What is the best practice to deploy a secondary NameNode?

It is always better to deploy a secondary NameNode on a separate standalone machine. When the secondary NameNode is deployed on a separate machine it does not interfere with the operations of the primary node.

What happens when the NameNode on the Hadoop cluster goes down?

The file system goes offline whenever the NameNode is down.

How will you decide whether you need to use the Capacity Scheduler or the Fair Scheduler?

Fair Scheduling is the process in which resources are assigned to jobs such that all jobs get an equal share of resources over time. Fair Scheduler can be used under the following circumstances:

  1. If you want the jobs to make equal progress instead of following the FIFO order, then you must use Fair Scheduling.
  2. If you have slow connectivity and data locality plays a vital role and makes a significant difference to the job runtime, then you must use Fair Scheduling.
  3. Use Fair Scheduling if there is a lot of variability in the utilization between pools.

Capacity Scheduler allows running the Hadoop MapReduce cluster as a shared, multi-tenant cluster to maximize the utilization and throughput of the Hadoop cluster. Capacity Scheduler can be used under the following circumstances:

  1. If the jobs require scheduler determinism, then Capacity Scheduler can be useful.
  2. Capacity Scheduler's memory-based scheduling method is useful if the jobs have varying memory requirements.
  3. If you want to enforce resource allocation because you know the cluster utilization and workload very well, then use Capacity Scheduler.

I want to see all the jobs running in a Hadoop cluster. How can you do this?

Using the command hadoop job -list gives the list of jobs running in a Hadoop cluster.

How often should the NameNode be reformatted?

The NameNode should never be reformatted. Doing so will result in complete data loss. NameNode is formatted only once at the beginning after which it creates the directory structure for file system metadata and namespace ID for the entire file system.

If Hadoop spawns 100 tasks for a job and one of the tasks fails, what does Hadoop do?

The task will be started again on a new TaskTracker, and if it fails more than 4 times (the default setting, which can be changed), the job will be killed.
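
As a hedged illustration (the property names below are the standard MRv2/YARN ones, not quoted from this answer), the per-task retry limit can be changed through the job configuration:

import org.apache.hadoop.conf.Configuration;

public class TaskRetryConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Allow up to 6 attempts per map/reduce task instead of the default 4
    conf.setInt("mapreduce.map.maxattempts", 6);
    conf.setInt("mapreduce.reduce.maxattempts", 6);
  }
}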

How can you add and remove nodes from the Hadoop cluster?

  • To add new nodes to the HDFS cluster, the hostnames should be added to the slaves file and then DataNode and TaskTracker should be started on the new node.
  • To remove or decommission nodes from the HDFS cluster, the hostnames should be removed from the slaves file and -refreshNodes should be executed.

You increase the replication level but notice that the data is under replicated. What could have gone wrong?

Nothing has necessarily gone wrong. If there is a huge volume of data, replication takes time proportional to the data size because the cluster has to copy the data, and it might take a few hours.

Explain about the different configuration files and where are they located.

The configuration files are located in the "conf" sub-directory. Hadoop has 3 different configuration files: hdfs-site.xml, core-site.xml and mapred-site.xml.

Which operating system(s) are supported for production Hadoop deployment?

Linux is the supported operating system for production Hadoop deployment. Other operating systems, such as Windows, can be used for development or evaluation but are not recommended for production.

What is the role of the namenode?

The namenode is the "brain" of the Hadoop cluster. It is responsible for managing the distribution of blocks on the system based on the replication policy, and it also supplies the specific block addresses for the data based on client requests.

What happens on the namenode when a client tries to read a data file?

The namenode will look up the information about the file in the edit log and then retrieve the remaining information from the filesystem memory snapshot (fsimage).

Since the namenode needs to support a large number of clients, the primary namenode will only send back the information about the data location. The datanode itself is responsible for the retrieval.

What are the hardware requirements for a Hadoop cluster (primary and secondary namenodes and datanodes)?

There are no strict requirements for datanodes. However, the namenodes require a specified amount of RAM to store the filesystem image in memory. Based on the design of the primary namenode and secondary namenode, the entire filesystem information will be stored in memory. Therefore, both namenodes need enough memory to contain the entire filesystem image.

What mode(s) can Hadoop code be run in?

Hadoop can be deployed in standalone mode, pseudo-distributed mode or fully-distributed mode.

Hadoop was specifically designed to be deployed on a multi-node cluster. However, it can also be deployed on a single machine, as a single process, for testing purposes.

How would an Hadoop administrator deploy various components of Hadoop in production?

Deploy the namenode and jobtracker on the master node, and deploy datanodes and tasktrackers on multiple slave nodes.

There is a need for only one namenode and jobtracker on the system. The number of datanodes depends on the available hardware.

Is there a standard procedure to deploy Hadoop?

No, there are some differences between various distributions. However, they all require that the Hadoop jars be installed on the machine.

There are some common requirements for all Hadoop distributions but the specific procedures will be different for different vendors since they all have some degree of proprietary software.

What is the role of the secondary namenode?

Secondary namenode performs CPU intensive operation of combining edit logs and current filesystem snapshots.

The secondary namenode was separated out as a process due to having CPU intensive operations and additional requirements for metadata back-up.

What are the side effects of not running a secondary name node?

The cluster performance will degrade over time since the edit log will grow bigger and bigger.

If the secondary namenode is not running at all, the edit log will grow significantly and it will slow the system down. Also, the system will go into safemode for an extended time since the namenode needs to combine the edit log and the current filesystem checkpoint image.

What happens if one of the datanodes has a much slower CPU?

The task execution will be as fast as the slowest worker. However, if speculative execution is enabled, the slowest worker will not have such a big impact.

Hadoop was specifically designed to work with commodity hardware. The speculative execution helps to offset the slow workers. The multiple instances of the same task will be created and job tracker will take the first result into consideration and the second instance of the task will be killed.

What is speculative execution?

If speculative execution is enabled, the job tracker will issue multiple instances of the same task on multiple nodes and it will take the result of the task that finished first. The other instances of the task will be killed.

The speculative execution is used to offset the impact of the slow workers in the cluster. The jobtracker creates multiple instances of the same task and takes the result of the first successful task. The rest of the tasks will be discarded.

What are real-time industry applications of Hadoop?

Hadoop, well known as Apache Hadoop, is an open-source software platform for scalable and distributed computing of large volumes of data. It provides rapid, high-performance and cost-effective analysis of structured and unstructured data generated on digital platforms and within the enterprise. It is used in almost all departments and sectors today. Some of the instances where Hadoop is used:

  • Managing traffic on streets.
  • Streaming processing.
  • Content Management and Archiving Emails.
  • Processing Rat Brain Neuronal Signals using a Hadoop Computing Cluster.
  • Fraud detection and Prevention.
  • Advertisements Targeting Platforms are using Hadoop to capture and analyze click stream, transaction, video and social media data.
  • Managing content, posts, images and videos on social media platforms.
  • Analyzing customer data in real-time for improving business performance.
  • Public sector fields such as intelligence, defense, cyber security and scientific research.
  • Financial agencies are using Big Data Hadoop to reduce risk, analyze fraud patterns, identify rogue traders, more precisely target their marketing campaigns based on customer segmentation, and improve customer satisfaction.
  • Getting access to unstructured data like output from medical devices, doctor’s notes, lab results, imaging reports, medical correspondence, clinical data, and financial data.

Compare Hadoop & Spark?

Criteria             | Hadoop                   | Spark
Dedicated storage    | HDFS                     | None
Speed of processing  | Average                  | Excellent
Libraries            | Separate tools available | Spark Core, SQL, Streaming, MLlib, GraphX

How is Hadoop different from other parallel computing systems?

Hadoop provides a distributed file system (HDFS), which lets you store and handle massive amounts of data on a cluster of machines while handling data redundancy. The primary benefit is that, since data is stored in several nodes, it is better to process it in a distributed manner: each node can process the data stored on it instead of spending time moving it over the network.

On the contrary, in a relational database computing system, you can query data in real time, but it is not efficient to store data in tables, records and columns when the data is huge.
Hadoop also provides a scheme to build a column database with Hadoop HBase, for runtime queries on rows.

What all modes Hadoop can be run in?

Hadoop can run in three modes:

  • Standalone Mode: The default mode of Hadoop, it uses the local file system for input and output operations. This mode is mainly used for debugging purposes, and it does not support the use of HDFS. Further, in this mode there is no custom configuration required for the mapred-site.xml, core-site.xml and hdfs-site.xml files. It is much faster than the other modes.
  • Pseudo-Distributed Mode (Single-Node Cluster): In this case, you need configuration for all three files mentioned above. All daemons run on one node and thus both the Master and the Slave node are the same machine.
  • Fully Distributed Mode (Multi-Node Cluster): This is the production phase of Hadoop (what Hadoop is known for), where data is used and distributed across several nodes on a Hadoop cluster. Separate nodes are allotted as Master and Slaves.

Explain the major difference between HDFS block and InputSplit?

In simple terms, a block is the physical representation of data while a split is the logical representation of the data present in the block. A split acts as an intermediary between the block and the mapper.
Suppose the record "myTectra" is spread across two blocks:
Block 1: myTect
Block 2: ra

Now, while reading, the map will read Block 1 but does not know how to process the part of the record that lies in Block 2 at the same time. Here InputSplit comes into play: it forms a logical grouping of Block 1 and Block 2 so that they are processed as a single record.

It then forms key-value pairs using the InputFormat and RecordReader and sends them to the map for further processing. With InputSplit, if you have limited resources, you can increase the split size to limit the number of maps. For instance, if there are 10 blocks of 64 MB each (640 MB in total) and there are limited resources, you can set the split size to 128 MB. This will form logical groups of 128 MB, with only 5 maps executing at a time.

However, if the file is marked as non-splittable, the whole file will form one InputSplit and will be processed by a single map, which consumes more time when the file is bigger.
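
As a rough sketch (assuming the new-API FileInputFormat helpers; this is not taken from the original answer), the split size can be controlled per job like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split-size demo");
    // Ask for ~128 MB splits even if the HDFS block size is 64 MB,
    // so fewer map tasks are launched when resources are limited.
    long splitSize = 128L * 1024 * 1024;
    FileInputFormat.setMinInputSplitSize(job, splitSize);
    FileInputFormat.setMaxInputSplitSize(job, splitSize);
  }
}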

What is distributed cache and what are its benefits?

Distributed Cache, in Hadoop, is a service provided by the MapReduce framework to cache files when needed. Once a file is cached for a specific job, Hadoop makes it available on each data node, both in the file system and in memory, where the map and reduce tasks are executing. Later, you can easily access and read the cache file and populate any collection (like an array or hashmap) in your code.

Benefits of using distributed cache are:

  • It distributes simple, read only text/data files and/or complex types like jars, archives and others. These archives are then un-archived at the slave node.
  • Distributed cache tracks the modification timestamps of cache files, which ensures that the files are not modified while a job is executing.

Explain the difference between NameNode, Checkpoint NameNode and BackupNode?

  • NameNode is the core of HDFS that manages the metadata – the information of what file maps to what block locations and what blocks are stored on what datanode. In simple terms, it’s the data about the data being stored. NameNode supports a directory tree-like structure consisting of all the files present in HDFS on a Hadoop cluster. It uses following files for namespace:
    fsimage file- It keeps track of the latest checkpoint of the namespace.
    edits file-It is a log of changes that have been made to the namespace since checkpoint.
  • Checkpoint NameNode has the same directory structure as NameNode, and creates checkpoints for the namespace at regular intervals by downloading the fsimage and edits files and merging them in the local directory. The new image after merging is then uploaded to the NameNode.
    There is a similar node, commonly known as the Secondary NameNode, but it does not support the 'upload to NameNode' functionality.
  • Backup Node provides similar functionality as Checkpoint, enforcing synchronization with NameNode. It maintains an up-to-date in-memory copy of file system namespace and doesn’t require getting hold of changes after regular intervals. The backup node needs to save the current state in-memory to an image file to create a new checkpoint.

What are the most common Input Formats in Hadoop?

There are three most common input formats in Hadoop:

  • Text Input Format: Default input format in Hadoop.
  • Key Value Input Format: used for plain text files where the files are broken into lines
  • Sequence File Input Format: used for reading files in sequence
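
As a small illustrative sketch (assuming the new mapreduce API; not part of the original answer), the input format is chosen on the Job object:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "input-format demo");
    job.setInputFormatClass(TextInputFormat.class);            // default: (byte offset, line) records
    // job.setInputFormatClass(KeyValueTextInputFormat.class); // plain text split into key/value per line
    // job.setInputFormatClass(SequenceFileInputFormat.class); // binary key/value sequence files
  }
}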

Define DataNode and how does NameNode tackle DataNode failures?

DataNode stores data in HDFS; it is the node where the actual data resides in the file system. Each datanode sends a heartbeat message to notify that it is alive. If the namenode does not receive a message from a datanode for 10 minutes, it considers the datanode to be dead or out of service, and starts replication of the blocks that were hosted on that datanode so that they are hosted on some other datanode. A BlockReport contains the list of all blocks on a DataNode. The system then starts to replicate the blocks that were stored on the dead DataNode.

The NameNode manages the replication of data blocks from one DataNode to another. In this process, the replication data transfers directly between DataNodes such that the data never passes through the NameNode.

What are the core methods of a Reducer?

The three core methods of a Reducer are:

  • setup(): this method is used for configuring various parameters like input data size and distributed cache. Signature: public void setup(Context context)
  • reduce(): the heart of the reducer, always called once per key with the associated reduce task. Signature: public void reduce(Key key, Value value, Context context)
  • cleanup(): this method is called to clean up temporary files, only once at the end of the task. Signature: public void cleanup(Context context)
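
A minimal skeleton (illustrative only; the class name is hypothetical) showing where the three methods sit in the new Reducer API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CoreMethodsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void setup(Context context) {
    // Called once per task: read configuration, distributed-cache files, etc.
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Called once per key with all of its values
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }

  @Override
  protected void cleanup(Context context) {
    // Called once at the end of the task: close handles, remove temporary files
  }
}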

What is SequenceFile in Hadoop?

Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key/value pairs. The map outputs are stored as SequenceFile internally. It provides Reader, Writer and Sorter classes. The three SequenceFile formats are:

  • Uncompressed key/value records.
  • Record-compressed key/value records – only 'values' are compressed here.
  • Block-compressed key/value records – both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
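
For illustration (a sketch assuming the Hadoop 2.x option-based writer API; the path and record values are hypothetical), a block-compressed SequenceFile can be written like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/demo.seq");  // hypothetical output path

    SequenceFile.Writer writer = null;
    try {
      // Block-compressed SequenceFile of (IntWritable, Text) records
      writer = SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(path),
          SequenceFile.Writer.keyClass(IntWritable.class),
          SequenceFile.Writer.valueClass(Text.class),
          SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
      for (int i = 0; i < 5; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
    } finally {
      if (writer != null) {
        writer.close();
      }
    }
  }
}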

What is Job Tracker role in Hadoop?

Job Tracker's primary functions are resource management (managing the task trackers), tracking resource availability and task life cycle management (tracking task progress and fault tolerance). It is a process that runs on a separate node, often not on a DataNode. The Job Tracker communicates with the NameNode to identify data locations, finds the best Task Tracker nodes to execute tasks on given nodes, monitors individual Task Trackers and submits the overall job back to the client. It tracks the execution of MapReduce workloads local to the slave node.

What is the use of RecordReader in Hadoop?

Since Hadoop splits data into various blocks, RecordReader is used to read the split data into a single record. For instance, if our input data is split like:
Row 1: Welcome to
Row 2: Intellipaat
it will be read as "Welcome to Intellipaat" using RecordReader.

What is Speculative Execution in Hadoop?

One limitation of Hadoop is that, by distributing the tasks on several nodes, there is a chance that a few slow nodes limit the rest of the program. There are various reasons for tasks to be slow, and they are sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks, Hadoop tries to detect when a task runs slower than expected and then launches an equivalent task as a backup. This backup mechanism in Hadoop is Speculative Execution. It creates a duplicate task on another disk, so the same input can be processed multiple times in parallel. When most tasks in a job come to completion, the speculative execution mechanism schedules duplicate copies of the remaining (slower) tasks across the nodes that are currently free. When these tasks finish, the JobTracker is notified. If other copies are still executing speculatively, Hadoop notifies the TaskTrackers to quit those tasks and reject their output. Speculative execution is enabled by default in Hadoop. To disable it, set the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false.
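
A hedged sketch showing one way to set the options named above from the driver code (property names as given in the answer; newer releases map them to equivalent mapreduce.* names):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DisableSpeculation {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Turn speculative execution off for both map and reduce tasks
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    Job job = Job.getInstance(conf, "job without speculative execution");
  }
}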

What happens if you try to run a Hadoop job with an output directory that is already present?

It will throw an exception saying that the output directory already exists.
To run the MapReduce job, you need to ensure that the output directory does not already exist in HDFS.
To delete the directory before running the job, you can use the shell: hadoop fs -rmr /path/to/your/output/ Or via the Java API: FileSystem.get(conf).delete(outputDir, true);
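
As a small sketch (the output path is hypothetical), the same check-and-delete can be done from the driver before submitting the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanOutputDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path outputDir = new Path("/path/to/your/output");  // same path later passed to FileOutputFormat
    FileSystem fs = FileSystem.get(conf);
    if (fs.exists(outputDir)) {
      fs.delete(outputDir, true);  // 'true' deletes the directory recursively
    }
  }
}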

How can you debug Hadoop code?

First, check the list of MapReduce jobs currently running. Next, make sure that there are no orphaned jobs running; if there are, you need to determine the location of the ResourceManager (RM) logs.

  1. Run: ps -ef | grep -i ResourceManager
    and look for the log directory in the displayed result. Find the job-id from the displayed list and check if there is any error message associated with that job.
  2. On the basis of the RM logs, identify the worker node that was involved in the execution of the task.
  3. Now, log in to that node and run: ps -ef | grep -i NodeManager
  4. Examine the NodeManager log. The majority of errors come from the user-level logs for each MapReduce job.

How to configure Replication Factor in HDFS?

hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all files placed in HDFS.
You can also modify the replication factor on a per-file basis using the Hadoop FS shell:

$ hadoop fs -setrep -w 3 /my/file

Conversely, you can also change the replication factor of all the files under a directory:

$ hadoop fs -setrep -w 3 -R /my/dir

How to compress mapper output but not the reducer output?

To achieve this compression, you should set:

conf.setBoolean("mapreduce.map.output.compress", true);
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);

What is the difference between Map Side join and Reduce Side Join?

Map-side join is performed when the data reaches the map, and you need a strict structure for defining a map-side join. On the other hand, Reduce-side join (repartitioned join) is simpler than map-side join since the input datasets need not be structured. However, it is less efficient as it has to go through the sort and shuffle phases, which come with network overheads.

How can you transfer data from Hive to HDFS?

By writing the query:

hive> insert overwrite directory '/' select * from emp;

You can write your query for the data you want to import from Hive to HDFS. The output you receive will be stored in part files in the specified HDFS path.

What companies use Hadoop, any idea?

Yahoo! (the biggest contributor to the creation of Hadoop) – the Yahoo search engine uses Hadoop; Facebook – developed Hive for analysis; Amazon, Netflix, Adobe, eBay, Spotify and Twitter also use Hadoop.

In Hadoop what is InputSplit?

It splits input files into chunks and assigns each split to a mapper for processing.

Mention Hadoop core components?

Hadoop core components include,

  • HDFS
  • MapReduce

What is NameNode in Hadoop?

NameNode in Hadoop is where Hadoop stores all the file location information in HDFS. It is the master node on which job tracker runs and consists of metadata.

Mention what are the data components used by Hadoop?

Data components used by Hadoop are

  • Pig
  • Hive

What does ‘jps’ command do?

It gives the status of the daemons which run the Hadoop cluster. It gives output mentioning the status of the namenode, datanode, secondary namenode, JobTracker and TaskTracker.

How to restart Namenode?

Step 1: Run stop-all.sh and then run start-all.sh, OR

Step 2: Run sudo hdfs (press enter), su-hdfs (press enter), /etc/init.d/ha (press enter) and then /etc/init.d/hadoop-0.20-namenode start (press enter).

Which are the three modes in which Hadoop can be run?

The three modes in which Hadoop can be run are −

  1. Standalone (local) mode
  2. Pseudo-distributed mode
  3. Fully distributed mode

What does /etc /init.d do?

/etc/init.d specifies where daemons (services) are placed and lets you see the status of these daemons. It is very Linux-specific and has nothing to do with Hadoop.

What if a Namenode has no data?

It cannot be part of the Hadoop cluster.

What happens to job tracker when Namenode is down?

When Namenode is down, your cluster is OFF, this is because Namenode is the single point of failure in HDFS.

What are the four characteristics of Big Data?

The four characteristics of Big Data are −

Volume − Facebook generating 500+ terabytes of data per day.

Velocity − Analyzing 2 million records each day to identify the reason for losses.

Variety − Images, audio, video, sensor data, log files, etc.

Veracity − Biases, noise and abnormality in data.

How is analysis of Big Data useful for organizations?

Effective analysis of Big Data provides a lot of business advantage as organizations will learn which areas to focus on and which areas are less important. Big data analysis provides some early key indicators that can prevent the company from a huge loss or help in grasping a great opportunity with open hands! A precise analysis of Big Data helps in decision making! For instance, nowadays people rely so much on Facebook and Twitter before buying any product or service. All thanks to the Big Data explosion.

Why do we need Hadoop?

Every day a large amount of unstructured data is getting dumped into our machines. The major challenge is not to store large data sets in our systems but to retrieve and analyze the big data in organizations, especially data present in different machines at different locations. In this situation, the necessity for Hadoop arises. Hadoop has the ability to analyze data present in different machines at different locations very quickly and in a very cost-effective way. It uses the concept of MapReduce, which enables it to divide the query into small parts and process them in parallel. This is also known as parallel computing.

What is the basic difference between traditional RDBMS and Hadoop?

Traditional RDBMS is used for transactional systems to report and archive the data, whereas Hadoop is an approach to store huge amounts of data in a distributed file system and process it. RDBMS will be useful when you want to seek one record from Big Data, whereas Hadoop will be useful when you want Big Data in one shot and perform analysis on it later.

What is Fault Tolerance?

Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting the data back present in that file. To avoid such situations, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations also. So even if one or two of the systems collapse, the file is still available on the third system.

Replication causes data redundancy, then why is it pursued in HDFS?

HDFS works with commodity hardware (systems with average configurations) that has high chances of getting crashed any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places. Any data on HDFS gets stored at least 3 different locations. So, even if one of them is corrupted and the other is unavailable for some time for any reason, then data can be accessed from the third one. Hence, there is no chance of losing the data. This replication factor helps us to attain the feature of Hadoop called Fault Tolerant.

Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?

No, calculations will be done only on the original data. The master node knows which node exactly has that particular data. In case one of the nodes is not responding, it is assumed to have failed. Only then will the required calculation be done on the second replica.

Is Namenode also a commodity hardware?

No. The Namenode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS. The Namenode has to be a high-availability machine.

What is a Datanode?

Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients.

Why do we use HDFS for applications having large data sets and not when there are lot of small files?

HDFS is more suitable for a large amount of data in a single file as compared to a small amount of data spread across multiple files. This is because the Namenode is a very expensive, high-performance system, so it is not prudent to occupy its space with the unnecessary amount of metadata that is generated for multiple small files. When there is a large amount of data in a single file, the Namenode occupies less space. Hence, for optimized performance, HDFS supports large data sets instead of multiple small files.

What is a job tracker?

Job tracker is a daemon that runs on the namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns the tasks to the different task trackers. In a Hadoop cluster, there will be only one job tracker but many task trackers. It is the single point of failure for Hadoop and the MapReduce service: if the job tracker goes down, all the running jobs are halted. It receives heartbeats from the task trackers, based on which the job tracker decides whether an assigned task is completed or not.

What is a task tracker?

Task tracker is also a daemon that runs on datanodes. Task Trackers manage the execution of individual tasks on slave node. When a client submits a job, the job tracker will initialize the job and divide the work and assign them to different task trackers to perform MapReduce tasks. While performing this action, the task tracker will be simultaneously communicating with job tracker by sending heartbeat. If the job tracker does not receive heartbeat from task tracker within specified time, then it will assume that task tracker has crashed and assign that task to another task tracker in the cluster.

What is a heartbeat in HDFS?

A heartbeat is a signal indicating that a node is alive. A datanode sends a heartbeat to the Namenode, and a task tracker sends its heartbeat to the job tracker. If the Namenode or job tracker does not receive a heartbeat, they will decide that there is some problem in the datanode, or that the task tracker is unable to perform the assigned task.

What is a ‘block’ in HDFS?

A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large as compared to disk blocks, particularly to minimize the cost of seeks.

If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size? No, not at all! 64 MB is just a unit where the data will be stored. In this particular situation, only 50 MB will be consumed by an HDFS block and 14 MB will be free to store something else. It is the MasterNode that does data allocation in an efficient manner.

What are the benefits of block transfer?

A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three).

How indexing is done in HDFS?

Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data which will say where the next part of the data will be.

Are job tracker and task trackers present in separate machines?

Yes, job tracker and task tracker are present in different machines. The reason is job tracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.

What is the communication channel between client and namenode/datanode?

Clients communicate with the namenode and datanodes using RPC over TCP/IP. SSH is only used by the cluster start-up scripts to launch the daemons on the nodes; it is not the client communication channel.

What is a rack?

Rack is a storage area with all the datanodes put together. These datanodes can be physically located at different places. Rack is a physical collection of datanodes which are stored at a single location. There can be multiple racks in a single location.

What is a Secondary Namenode? Is it a substitute to the Namenode?

The secondary Namenode constantly reads the data from the RAM of the Namenode and writes it into the hard disk or the file system. It is not a substitute to the Namenode, so if the Namenode fails, the entire Hadoop system goes down.

Explain how do ‘map’ and ‘reduce’ works?

The Namenode takes the input, divides it into parts and assigns them to data nodes. These datanodes process the tasks assigned to them, produce key-value pairs and return the intermediate output to the Reducer. The reducer collects these key-value pairs from all the datanodes, combines them and generates the final output.

Why ‘Reading‘ is done in parallel and ‘Writing‘ is not in HDFS?

Through a MapReduce program, a file can be read in parallel by splitting it into blocks while reading. But while writing, the incoming values are not yet known to the system, so MapReduce cannot be applied and no parallel writing is possible.

Copy a directory from one node in the cluster to another?

Use the distcp command to copy: hadoop distcp <source> <destination>

What is rack awareness?

Rack awareness is the way in which the namenode decides how to place blocks based on the rack definitions. Hadoop will try to minimize the network traffic between datanodes within the same rack and will only contact remote racks if it has to. The namenode is able to control this due to rack awareness.

Which file holds the Hadoop core configuration?

core-site.xml holds the Hadoop core configuration.

Is there a hdfs command to see available free space in hdfs?

hadoop dfsadmin -report

How do you gracefully stop a running job?

hadoop job -kill jobid

Does the name-node stay in safe mode till all under-replicated files are fully replicated?

No. During safe mode, replication of blocks is prohibited. The name-node waits until all or a majority of data-nodes report their blocks.

How to make a large cluster smaller by taking out some of the nodes?

Hadoop offers the decommission feature to retire a set of existing data-nodes. The nodes to be retired should be included into the exclude file, and the exclude file name should be specified as a configuration parameter dfs.hosts.exclude.

The decommission process can be terminated at any time by editing the configuration or the exclude files and repeating the -refreshNodes command

What is a Combiner?

The Combiner is a ‘mini-reduce’ process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
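
As an illustrative sketch (reusing the hypothetical TokenizerMapper and IntSumReducer classes from the word-count sketch earlier in this document), a reducer class is registered as the combiner on the Job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setMapperClass(WordCount.TokenizerMapper.class);  // hypothetical mapper from the earlier sketch
    job.setCombinerClass(WordCount.IntSumReducer.class);  // 'mini-reduce' run on each mapper's output
    job.setReducerClass(WordCount.IntSumReducer.class);   // full reduce across all mappers
  }
}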

What is Twitter Bootstrap?

Bootstrap is a sleek, intuitive and powerful mobile-first front-end framework for faster and easier web development. It uses HTML, CSS and JavaScript.

What does Bootstrap package includes?

Bootstrap package includes −

  • Scaffolding − Bootstrap provides a basic structure with Grid System, link styles and background. This is covered in detail in the section Bootstrap Basic Structure.
  • CSS − Bootstrap comes with feature of global CSS settings, fundamental HTML elements styled and enhanced with extensible classes, and an advanced grid system. This is covered in detail in the section Bootstrap with CSS.
  • Components − Bootstrap contains over a dozen reusable components built to provide iconography, dropdowns, navigation, alerts, popovers, and much more. This is covered in detail in the section Layout Components.
  • JavaScript Plugins − Bootstrap contains over a dozen custom jQuery plugins. You can easily include them all, or one by one. This is covered in details in the section Bootstrap Plugins.
  • Customize − You can customize Bootstrap’s components, LESS variables, and jQuery plugins to get your very own version.

What is Contextual classes of table in Bootstrap?

The Contextual classes allow you to change the background color of your table rows or individual cells.

Class – Description
.active – Applies the hover color to a particular row or cell
.success – Indicates a successful or positive action
.warning – Indicates a warning that might need attention
.danger – Indicates a dangerous or potentially negative action

What is Bootstrap Grid System?

Bootstrap includes a responsive, mobile first fluid grid system that appropriately scales up to 12 columns as the device or viewport size increases. It includes predefined classes for easy layout options, as well as powerful mixins for generating more semantic layouts.

What are Bootstrap media queries?

Media Queries in Bootstrap allow you to move, show and hide content based on viewport size.

Show a basic grid structure in Bootstrap?

Following is basic structure of Bootstrap grid −

<div class="container">
   <div class="row">
      <div class="col-*-*"></div>
      <div class="col-*-*"></div>      
   <div class="row">...</div>
<div class="container">....

What are Offset columns?

Offsets are a useful feature for more specialized layouts. They can be used to push columns over for more spacing, for example. The .col-xs-* classes don’t support offsets, but they are easily replicated by using an empty cell.

How can you order columns in Bootstrap?

You can easily change the order of built-in grid columns with .col-md-push-* and .col-md-pull-* modifier classes where * range from 1 to 11.
