About Top 23 Hadoop Interview Questions and Answers

These are the top 23 Hadoop interview questions and answers for freshers and experienced candidates. Go through these Hadoop interview questions before attending interviews.

Top Answers to Hadoop Interview Questions:

Q1.What are the Input Format types in Hadoop?

  • TextInputFormat
  • KeyValueTextInputFormat
  • SequenceFileInputFormat
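
To make the difference between the first two formats concrete, here is a minimal Python sketch that only emulates the records they hand to a mapper; it is not the Hadoop API. It assumes that TextInputFormat produces (byte offset, line) pairs and that the key/value text format (KeyValueTextInputFormat in Hadoop) splits each line at the first tab. The function names and the sample data are hypothetical.

```python
def text_input_format(data: bytes):
    """Yield (byte_offset, line) pairs, emulating TextInputFormat records."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is the offset of the line start; the value is the line text.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


def key_value_text_input_format(data: bytes, separator: str = "\t"):
    """Yield (key, value) pairs, splitting each line at the first separator."""
    for _, line in text_input_format(data):
        key, _, value = line.partition(separator)
        # With no separator in the line, the whole line is the key.
        yield key, value


sample = b"alice\t30\nbob\t25\n"  # hypothetical input file contents
text_records = list(text_input_format(sample))
kv_records = list(key_value_text_input_format(sample))
print(text_records)
print(kv_records)
```

SequenceFileInputFormat is left out of the sketch because it reads Hadoop's binary sequence file format rather than plain text lines.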

Q2.What is the default InputFormat in Hadoop?

  • TextInputFormat

Q3.What is the default block size in HDFS?

  • The default block size of HDFS is 128 MB in Hadoop 2.x and later (it was 64 MB in Hadoop 1.x)
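
The block size determines how many blocks a file is split into. The following sketch works through that arithmetic for block sizes of 64 MB and 128 MB; the 1000 MB file size and the function name are hypothetical, chosen only for illustration.

```python
MB = 1024 * 1024


def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks needed for a file; the last block may be partial."""
    if file_size_bytes == 0:
        return 0
    # Integer ceiling division avoids floating-point rounding.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


file_size = 1000 * MB  # hypothetical 1000 MB file
blocks_64 = block_count(file_size, 64 * MB)
blocks_128 = block_count(file_size, 128 * MB)
print(blocks_64, blocks_128)
```

A larger block size means fewer blocks per file, and so fewer block entries for the NameNode to track.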

Q4.Define Hive?

Hive is data warehouse software that facilitates querying and processing large data sets held in a distributed storage system.

Q5.What is Hive Metastore?

Data about data is called metadata: table names, column names, data types, locations, and so on. The Hive Metastore stores the metadata of Hive tables.

Q6.What is the stable version of Hive?


Q7.Which organization developed Hive?

Facebook

Q8.How can we access Hive without Hadoop?

Hive can work with other data storage systems, such as Amazon S3, GPFS, or the MapR file system, in place of HDFS.

Q9.Can we upload 3TB data file into HIVE?

Yes. The Hive Metastore stores only the metadata of the 3 TB table; the actual data is stored in HDFS. The file can be uploaded from the local file system, but this takes a long time if the local machine is poorly specified (RAM size, processing speed, and so on).

Q10.Explain the troubleshooting process if either the NameNodes or DataNodes are not running?

First, check the cluster ID (the clusterID property) in the NameNode's VERSION file and in the DataNode's VERSION file. The two cluster IDs must match; otherwise the NameNodes and DataNodes will not synchronize. If the cluster IDs differ, change them so that both files carry the same cluster ID.
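
The comparison can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the VERSION file is treated as a plain key=value properties file in which the cluster ID appears as clusterID, and the two sample file contents below are hypothetical. In a real cluster you would read the files from the current/ subdirectory of the NameNode and DataNode storage directories instead.

```python
def read_cluster_id(version_text: str):
    """Return the clusterID value from VERSION file contents, or None."""
    for line in version_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "clusterID":
            return value.strip()
    return None


def cluster_ids_match(namenode_version: str, datanode_version: str) -> bool:
    """True only if both files define the same clusterID."""
    nn_id = read_cluster_id(namenode_version)
    dn_id = read_cluster_id(datanode_version)
    return nn_id is not None and nn_id == dn_id


# Hypothetical VERSION file contents, for illustration only.
namenode_version = "#sample\nnamespaceID=1\nclusterID=CID-aaaa\nstorageType=NAME_NODE\n"
datanode_version = "#sample\nclusterID=CID-bbbb\nstorageType=DATA_NODE\n"

ids_match = cluster_ids_match(namenode_version, datanode_version)
print("cluster IDs match:", ids_match)
```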

Q11.How to start Hadoop using commands?

  • start-dfs.sh
  • start-yarn.sh

Q12.Differentiate Namenode and Datanode?

Hadoop follows a master-slave model. The Namenode is the master and the Datanodes are slaves. The Namenode holds the file system metadata, including which Datanodes store each block of a file; it does not store the file data itself. Datanodes are responsible for writing data to disk and serving it back to clients.

Q13.What is a NameSpace?

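A namespace is a container in which names are local, that is, unique only within that container. In HDFS, the Namenode maintains the file system namespace: the tree of all files and directories together with their metadata. It is a hierarchical structure.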
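
The hierarchical structure can be pictured as nested dictionaries. The sketch below is only an illustration of a file system tree, not how the Namenode stores it; the paths and function names are hypothetical.

```python
def build_namespace(paths):
    """Build a nested-dict tree from absolute paths."""
    root = {}
    for path in paths:
        node = root
        for part in (p for p in path.strip("/").split("/") if p):
            # Create the child entry if it does not exist yet, then descend.
            node = node.setdefault(part, {})
    return root


def list_dir(tree, path):
    """Return the sorted child names of a directory in the tree."""
    node = tree
    for part in (p for p in path.strip("/").split("/") if p):
        node = node[part]
    return sorted(node)


# Hypothetical paths, for illustration only.
tree = build_namespace(["/user/alice/data.csv", "/user/bob/logs/app.log", "/tmp"])
root_entries = list_dir(tree, "/")
user_entries = list_dir(tree, "/user")
print(root_entries, user_entries)
```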

Q14.Does Hadoop provide security?

HDFS has a file permission model similar to that of the UNIX file system. Ownership and permissions can be modified using chown and chmod. Stronger authentication can be achieved by enabling Kerberos.
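
Permissions are usually given to chmod as octal modes. As a small illustration, the sketch below uses Python's standard stat module to show what two octal modes mean in the familiar rwx notation; the chosen modes and the helper name are hypothetical examples, and the code does not talk to HDFS.

```python
import stat


def describe_mode(octal_mode: int, is_dir: bool = False) -> str:
    """Render an octal permission mode in ls-style rwx notation."""
    type_bits = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(type_bits | octal_mode)


file_mode = describe_mode(0o644)              # owner read/write, others read
dir_mode = describe_mode(0o750, is_dir=True)  # owner full, group read/execute
print(file_mode, dir_mode)
```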

Q15.What is HDFS?

A distributed file system manages data across a network of machines. HDFS is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers (ordinary machines that run server programs and carry out the associated tasks). The main feature of HDFS is its write-once, read-many model, which avoids the need for concurrency control during data processing. HDFS is designed to work with the MapReduce system, where computation is moved to the data.

Q16.Describe MapReduce?

MapReduce is a programming model, together with its implementation, used to process large data sets with a parallel, distributed algorithm on a cluster. MapReduce splits the input data set into independent chunks and processes these chunks in parallel on different nodes.
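
The model can be illustrated with the classic word count. The following is a single-process Python sketch of the map, shuffle, and reduce steps; it is not Hadoop code, it runs the chunks sequentially rather than on different nodes, and the function names and input chunks are hypothetical.

```python
from collections import defaultdict


def map_phase(chunk: str):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently, which is what allows parallelism.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


chunks = ["the quick brown fox", "the lazy dog", "the fox"]  # hypothetical input
counts = word_count(chunks)
print(counts)
```

Because each call to map_phase depends only on its own chunk, a real cluster can run the map step on whichever nodes hold the data.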

Q17.Explain Hadoop read operation function?

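HDFS follows a write-once, read-many (WORM) model. It has a single-master, multiple-slave architecture in which the Namenode acts as master and the Datanodes as slaves. Metadata is stored on the Namenode and the actual data on the Datanodes. To read a file, a client first asks the Namenode which Datanodes hold the file's blocks and then reads the blocks directly from those Datanodes. Files cannot be edited in place once they are stored in HDFS.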

Q18.Explain Edge Node / Gateway Node in Hadoop?

An edge node works as a gateway (interface) between the Hadoop cluster and the outside network. Edge nodes should be kept separate from the nodes that run Hadoop services such as HDFS and MapReduce. Edge nodes centralize the configuration entries needed to reach the Hadoop cluster nodes, which reduces administration work.

Q19.What Is Apache Yarn?

YARN stands for Yet Another Resource Negotiator. It is Hadoop's cluster resource management system. It was introduced to the Apache Hadoop framework in Hadoop 2 to improve on the MapReduce implementation.

Q20.What Is Checksum?

A checksum is a small value computed from the contents of a stored or transmitted data unit. By recomputing the checksum and comparing it with the one it received, a receiver can tell whether the data arrived intact. Datanodes are responsible for verifying the data they receive before storing the data and its checksum.
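
The verification step can be sketched as follows. This is a minimal Python illustration, not the HDFS implementation: it uses CRC-32 from the standard zlib module as a stand-in checksum, and the 512-byte chunk size, function names, and sample data are assumptions made for the example.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumed chunk size for this sketch


def chunk_checksums(data: bytes, chunk_size: int = BYTES_PER_CHECKSUM):
    """Compute one CRC-32 value per fixed-size chunk of the data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def first_bad_chunk(data: bytes, expected, chunk_size: int = BYTES_PER_CHECKSUM):
    """Return the index of the first chunk whose checksum differs, or None."""
    actual = chunk_checksums(data, chunk_size)
    if len(actual) != len(expected):
        return min(len(actual), len(expected))
    for index, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            return index
    return None


original = bytes(range(256)) * 5       # 1280 bytes of sample data
stored = chunk_checksums(original)     # checksums kept alongside the data

corrupted = bytearray(original)
corrupted[600] ^= 0xFF                 # flip one byte inside the second chunk

ok_result = first_bad_chunk(original, stored)
bad_result = first_bad_chunk(bytes(corrupted), stored)
print(ok_result, bad_result)
```

Checksumming per chunk rather than per file means a mismatch points at the damaged region instead of only reporting that something is wrong.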

Q21.What Is Serialization/Deserialization?

Serialization is the process of converting an object's state into a byte stream that can be persisted or transmitted over a network. Deserialization is the reverse process: rebuilding objects from a byte stream or from persisted data.
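
A round trip makes the two directions concrete. The sketch below uses Python's standard struct module and a made-up record layout (a 4-byte integer followed by a length-prefixed UTF-8 string); the layout and function names are hypothetical and are not Hadoop's own serialization format.

```python
import struct

HEADER = ">iH"  # big-endian: 4-byte signed int, 2-byte unsigned string length


def serialize(record_id: int, name: str) -> bytes:
    """Convert the record's state into a byte stream."""
    encoded = name.encode("utf-8")
    return struct.pack(HEADER, record_id, len(encoded)) + encoded


def deserialize(payload: bytes):
    """Rebuild the record from the byte stream."""
    record_id, length = struct.unpack_from(HEADER, payload, 0)
    start = struct.calcsize(HEADER)
    name = payload[start:start + length].decode("utf-8")
    return record_id, name


payload = serialize(42, "hadoop")
restored = deserialize(payload)
print(len(payload), restored)
```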

Q22.What is RPC?

RPC stands for Remote Procedure Call. It is used for inter-process communication between the nodes in Hadoop. RPC relies on serialization and deserialization to send calls and results over the network.

Q23.What is SSH?

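SSH (Secure Shell) is a network protocol for using network services securely over an unsecured network. It has built-in username and password authentication, as well as key-based authentication, for remote login to computer systems. In Hadoop, it is used by the HDFS and YARN users, for example so that the start scripts can launch daemons on the cluster nodes.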

