Hadoop Interview Questions and Answers For Experienced

1.What is Hadoop?

Hadoop is an open-source distributed computing platform written in Java. Its design follows ideas published by Google: HDFS is modeled on the Google File System, and the processing layer implements the MapReduce model.

2.What platform and Java version is required to run Hadoop?

Java 1.6.x or higher is suitable for Hadoop, preferably from Sun. Linux and Windows are the supported operating systems for Hadoop, but BSD, Mac OS X and Solaris are also known to work.

3.What are the most common Input Formats in Hadoop?

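  • Text Input Format (TextInputFormat): the default input format in Hadoop; each line of the file is one record.
  • Key Value Input Format (KeyValueTextInputFormat): used for plain text files; each line is split into a key and a value at a separator (a tab by default).
  • Sequence File Input Format (SequenceFileInputFormat): used for reading sequence files.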

4.What is SSH?

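SSH stands for Secure Shell, also called Secure Socket Shell. It is a network protocol for securely logging in to a remote host and running commands on it.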

5.What is Hadoop Map Reduce?

The Hadoop MapReduce framework is used for processing large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step process: map, then reduce.

6.What kind of Hardware is best for Hadoop?

Hadoop can run on dual-processor or dual-core machines with 4-8 GB of ECC RAM. The best configuration depends on the workflow needs.

7.What is Sequence File in Hadoop?

Extensively used in Map Reduce I/O formats, Sequence File is a flat file containing binary key/value pairs. The map outputs are stored as Sequence File internally. It provides Reader, Writer and Sorter classes. The three Sequence File formats are:

  1. Uncompressed key/value records.
  2. Record compressed key/value records – only ‘values’ are compressed here.
  3. Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable.

8.What is the Use of SSH in Hadoop?

Hadoop's cluster start and stop scripts use SSH to launch daemons on remote nodes. SSH provides authenticated (password or key based), encrypted access to a remote host; it is a more secure alternative to rlogin and telnet.

9.What is Name Node in Hadoop?

The NameNode is the master node of HDFS. It stores the file system metadata, including the location of every block of every file. In Hadoop 1 deployments the JobTracker often runs on the same master machine.

10.How does Hadoop MapReduce work?

Take word count as an example. During the map phase, the input data is divided into splits that are analyzed by map tasks running in parallel across the Hadoop cluster; each map task counts the words in its own split. During the reduce phase, the per-split counts are aggregated across the entire collection of documents.
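The word count flow described here can be sketched in plain Java without any Hadoop dependency. This is only an illustration of the two phases, assuming in-memory strings stand in for input splits; the class and method names (WordCountSketch, map, reduce, wordCount) are hypothetical and are not part of the Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word found in one input split.
    public static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the counts emitted for each word across all splits.
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Driver: run map over every split, then reduce the combined output.
    // A real cluster would run the map calls in parallel on different nodes.
    public static Map<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String split : splits) {
            intermediate.addAll(map(split));
        }
        return reduce(intermediate);
    }

    public static void main(String[] args) {
        // Two strings stand in for two input splits.
        System.out.println(wordCount(Arrays.asList("big data big cluster", "data node")));
    }
}
```

Running main prints {big=2, cluster=1, data=2, node=1}. The sketch leaves out the shuffle and sort step that Hadoop performs between the two phases; here the TreeMap simply groups the pairs by key.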

11.What is an Input Split in Hadoop? Explain.

When a Hadoop job runs, it divides the input files into chunks and assigns each chunk to a mapper for processing. Each chunk is called an input split (InputSplit).

12.How will you format HDFS?

$ hadoop namenode -format

On Hadoop 2 and later, the equivalent command is:

$ hdfs namenode -format

13.What are the main configuration parameters that the user needs to specify to run a MapReduce job?

The user of the MapReduce framework needs to specify:

  • Job’s input locations in the distributed file system
  • Job’s output location in the distributed file system
  • Input format
  • Output format
  • Class containing the map function
  • Class containing the reduce function
  • JAR file containing the mapper, reducer and driver classes

14.How many input blocks are made by the Hadoop framework?

With the default block size of 64 MB, Hadoop will make 5 blocks for the following three files:

  • One block for a 64 KB file
  • Two blocks for a 65 MB file, and
  • Two blocks for a 127 MB file

The block size is configurable.
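The arithmetic behind these numbers is a ceiling division of file size by block size. A minimal sketch, assuming a 64 MB block size and the three file sizes listed above (the class name BlockCountSketch and the method blocksFor are hypothetical helpers, not Hadoop calls):

```java
public class BlockCountSketch {

    public static final long KB = 1024L;
    public static final long MB = 1024L * KB;

    // Number of blocks needed to hold a file: ceil(fileSize / blockSize).
    // An empty file needs no blocks.
    public static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB;
        long[] fileSizes = {64 * KB, 65 * MB, 127 * MB};
        long total = 0;
        for (long size : fileSizes) {
            total += blocksFor(size, blockSize);
        }
        System.out.println(total); // prints 5 (1 + 2 + 2)
    }
}
```

The last block of each file is only as large as the remaining data, so the 65 MB file occupies one full 64 MB block plus a 1 MB block.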

15.How can you debug Hadoop code?

First, check the list of MapReduce jobs currently running. Next, make sure there are no orphaned jobs running; if there are, you need to determine the location of the ResourceManager (RM) logs.

  1. Run: “ps -ef | grep -i ResourceManager”
    and look for the log directory in the displayed result. Find the job ID in the displayed list and check whether any error message is associated with that job.
  2. On the basis of the RM logs, identify the worker node that was involved in the execution of the task.
  3. Now, log in to that node and run: “ps -ef | grep -i NodeManager”
  4. Examine the NodeManager log. The majority of errors come from the user-level logs for each MapReduce job.

16.What is the Hadoop MapReduce API contract for a key and value class?

For a key and value class, the Hadoop MapReduce API contract has two parts:

  • The value must implement the org.apache.hadoop.io.Writable interface
  • The key must implement the org.apache.hadoop.io.WritableComparable interface
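As a rough illustration of what this contract asks of a key class, the sketch below writes the same three methods (write, readFields, compareTo) in plain Java so that it compiles without Hadoop on the classpath. WordKey and roundTrip are hypothetical names; in a real job the class would declare that it implements org.apache.hadoop.io.WritableComparable instead of only Comparable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical key type. A real Hadoop key would declare
// "implements WritableComparable<WordKey>" (assumption: Hadoop on the classpath).
public class WordKey implements Comparable<WordKey> {

    private String word = "";

    // No-argument constructor so an empty instance can be created before readFields().
    public WordKey() { }

    public WordKey(String word) {
        this.word = word;
    }

    public String getWord() {
        return word;
    }

    // Writable side of the contract: serialize the fields to a binary stream.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
    }

    // Writable side of the contract: restore the fields from a binary stream.
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
    }

    // Comparable side of the contract: defines the sort order of keys.
    @Override
    public int compareTo(WordKey other) {
        return word.compareTo(other.word);
    }

    // Demonstration helper: serialize a key and read it back into a fresh instance.
    public static WordKey roundTrip(WordKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            WordKey copy = new WordKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A value class needs only the write and readFields pair; keys add compareTo because the framework sorts them before the reduce phase. Key classes usually also override hashCode() and equals() so that equal keys are partitioned consistently.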

17.What is the use of RecordReader in Hadoop?

An InputSplit defines a slice of work but does not describe how to access it. The RecordReader class is responsible for loading the data from its source and converting it into key/value pairs suitable for reading by the Mapper. The RecordReader instance is created by the Input Format.
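To make the RecordReader's role concrete, here is a small plain-Java sketch that turns raw text into (offset, line) records, similar in spirit to what a Mapper receives when reading text files line by line. It does not use the Hadoop RecordReader class; LineRecordSketch and records are hypothetical names, and the offsets are character positions, which match byte offsets only for ASCII input.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LineRecordSketch {

    // Converts raw text into key/value records:
    // key = position where the line starts, value = the line without its newline.
    public static List<Map.Entry<Long, String>> records(String data) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        int start = 0;
        while (start < data.length()) {
            int end = data.indexOf('\n', start);
            if (end < 0) {
                end = data.length(); // last line has no trailing newline
            }
            out.add(new AbstractMap.SimpleEntry<>(Long.valueOf(start), data.substring(start, end)));
            start = end + 1; // skip past the newline
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> record : records("first line\nsecond line\n")) {
            System.out.println(record.getKey() + " -> " + record.getValue());
        }
    }
}
```

The Mapper never sees the raw file; it only sees the stream of records, which is why swapping the Input Format changes what keys and values the same Mapper receives.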

18.How do you compress the mapper output but not the reducer output?

To achieve this, set the following on the job's Configuration object (conf):

conf.setBoolean("mapreduce.map.output.compress", true);

conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);

19.What is Hive?

Hive is data warehouse software that facilitates querying and managing large data sets residing in distributed storage.

20.List out Hadoop’s three configuration files?

The three configuration files are:

  • core-site.xml
  • mapred-site.xml
  • hdfs-site.xml

21.What is JobTracker in Hadoop?

JobTracker is a service within Hadoop that monitors and assigns map tasks and reduce tasks to the corresponding TaskTrackers on the data nodes.

22.What are real-time industry applications of Hadoop?

Hadoop, formally Apache Hadoop, is an open-source software platform for scalable, distributed computing over large volumes of data. It provides fast, cost-effective analysis of structured and unstructured data generated on digital platforms and within the enterprise, and it is used across many departments and sectors today. Some of the instances where Hadoop is used:

  • Managing traffic on streets.
  • Stream processing.
  • Content Management and Archiving Emails.
  • Processing Rat Brain Neuronal Signals using a Hadoop Computing Cluster.
  • Fraud detection and Prevention.
  • Advertisements Targeting Platforms are using Hadoop to capture and analyze click stream, transaction, video and social media data.
  • Managing content, posts, images and videos on social media platforms.
  • Analyzing customer data in real-time for improving business performance.
  • Public sector fields such as intelligence, defense, cyber security and scientific research.
  • Financial agencies are using Big Data Hadoop to reduce risk, analyze fraud patterns, identify rogue traders, more precisely target their marketing campaigns based on customer segmentation, and improve customer satisfaction.
  • Getting access to unstructured data like output from medical devices, doctor’s notes, lab results, imaging reports, medical correspondence, clinical data, and financial data.

23.What is the Hive Metastore?

The Hive Metastore is a database that stores the metadata of your Hive tables, such as table name, column names, data types, table location and the number of buckets in the table.

25.What are the functionalities of JobTracker?

These are the main tasks of JobTracker:

  • To accept jobs from clients.
  • To communicate with the NameNode to determine the location of the data.
  • To locate TaskTracker nodes with available slots.
  • To submit the work to the chosen TaskTracker node and monitor the progress of each task.

26.What is the present version of Hive?


27.What are the core methods of a Reducer?

The three core methods of a Reducer are:

  1. setup(): called once at the start of the task; used for configuring parameters such as input data size and the distributed cache.
    protected void setup(Context context)
  2. reduce(): the heart of the reducer; called once per key with the values associated with that key.
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
  3. cleanup(): called only once at the end of the task, to clean up temporary files.
    protected void cleanup(Context context)
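The order in which these three methods fire can be sketched with a simplified stand-in written in plain Java. This is not the Hadoop Reducer class: ReducerLifecycleSketch, run and demo are hypothetical names, the Context parameter is omitted, and the grouped input map stands in for the sorted, grouped keys a reduce task would receive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReducerLifecycleSketch {

    // Records the order in which the lifecycle methods are invoked.
    private final List<String> calls = new ArrayList<>();

    // Called once before any key is processed.
    protected void setup() {
        calls.add("setup");
    }

    // Called once per key with all values grouped under that key.
    protected void reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        calls.add("reduce(" + key + ")=" + sum);
    }

    // Called once after the last key, even if reduce() threw an exception.
    protected void cleanup() {
        calls.add("cleanup");
    }

    // Drives the lifecycle: setup once, reduce per key, cleanup once.
    public List<String> run(Map<String, List<Integer>> grouped) {
        setup();
        try {
            for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
                reduce(entry.getKey(), entry.getValue());
            }
        } finally {
            cleanup();
        }
        return calls;
    }

    // Small demonstration with two keys, already grouped and sorted.
    public static List<String> demo() {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("big", Arrays.asList(1, 1));
        grouped.put("data", Arrays.asList(1, 1, 1));
        return new ReducerLifecycleSketch().run(grouped);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Running main prints [setup, reduce(big)=2, reduce(data)=3, cleanup], which shows setup and cleanup bracketing one reduce call per key.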

28.What are the network requirements for using Hadoop?

The network requirements for using Hadoop are:

  • Password-less SSH connection
  • Secure Shell (SSH) for launching server processes

29.What is Hadoop Streaming?

Hadoop Streaming is a utility that allows you to create and run map/reduce jobs. It is a generic API that allows programs written in any language to be used as a Hadoop mapper or reducer.

30.Can I access Hive without Hadoop?

Yes, Hive can be used without HDFS by pointing it at other data storage systems such as Amazon S3, GPFS (IBM) and the MapR file system.
