Mastering Big Data: A Comprehensive Guide to Setting up Hadoop and Its Ecosystem Components

August 10, 2023 by Dhawal
Introduction

Hadoop is an open-source framework designed for processing and storing large datasets across distributed clusters of computers.
It provides a reliable, scalable, and cost-effective solution for handling big data. Developed under the Apache Software Foundation, Hadoop has become a foundational technology in the field of big data analytics.
In this blog, we will cover how to set up the Hadoop ecosystem in a Linux/RHEL-based environment.

Prerequisites:

  • A set of machines or virtual machines (VMs) with a Linux-based operating system (Ubuntu, CentOS, etc.); in this guide we will install the Hadoop ecosystem on RHEL.
  • Basic familiarity with the Linux command line.
  • Java Development Kit (JDK) 8 installed on all machines.
  • Passwordless SSH access to all machines from the main node; the Hadoop start scripts use SSH to launch daemons on the workers (see the sketch after this list).
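
A minimal way to set up passwordless SSH from the main node, assuming a hypothetical hadoop user and worker hostname placeholders (adjust both to your environment):

ssh-keygen -t rsa -b 4096

ssh-copy-id hadoop@<worker_node_1>

ssh-copy-id hadoop@<worker_node_2>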

Hadoop Setup (v2.10.2):

Step 1: Download and extract the Hadoop binary v2.10.2

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.2/hadoop-2.10.2.tar.gz

tar -xvf hadoop-2.10.2.tar.gz

Step 2: Configure the environment variables

Edit the ~/.bashrc file and add the following lines:

export HADOOP_HOME=/path/to/hadoop/directory

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
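
Hadoop also needs to know where Java lives. Assuming a JDK 8 installed at the path below (adjust it to your machine), add:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0

The same value can also be set in $HADOOP_HOME/etc/hadoop/hadoop-env.sh so the Hadoop scripts pick it up.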

Reload the file so the changes take effect in the current shell:

source ~/.bashrc

Repeat these steps for the other Hadoop ecosystem components as well (e.g., HBASE_HOME for HBase and HIVE_HOME for Hive).


Step 3: Edit the configuration files

The following configuration files, located under $HADOOP_HOME/etc/hadoop, need to be modified to set up Hadoop.

1. core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://<your_main_node>:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/path/to/hadoop-tmp-dir</value>
  </property>
</configuration>


2. hdfs-site.xml

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/path/to/namenode-dir</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/path/to/datanode-dir</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
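
The directories referenced above should exist and be writable by the user running Hadoop on each node. Using the placeholder paths from the snippets, something like:

mkdir -p /path/to/hadoop-tmp-dir /path/to/namenode-dir /path/to/datanode-dir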


3. slaves

Add one line per DataNode, containing that node's IP address or hostname.
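
For example, with two hypothetical worker hostnames:

<worker_node_1>

<worker_node_2>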

4. yarn-site.xml

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value><your_main_node></value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>


Step 4: Start the Hadoop cluster


a. Format HDFS (run this once, on the main node; reformatting an existing cluster wipes HDFS metadata)

hdfs namenode -format

b. Start Hadoop

start-dfs.sh

start-yarn.sh


To verify that Hadoop is working, run the jps command on each node and check that the expected Java services are up: NameNode, SecondaryNameNode, and ResourceManager on the main node, DataNode and NodeManager on the workers.
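
On a dedicated main node, the output should look roughly like this (PIDs will differ; a DataNode and NodeManager will also appear if the main node doubles as a worker):

jps

12101 NameNode

12245 SecondaryNameNode

12678 ResourceManager

13212 Jps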

You can also verify through the web UIs, using the host/IP you provided in the configuration:

Namenode UI:

<cluster-ip>:50070

ResourceManager UI:

<cluster-ip>:8088

HBase (v2.4.15)

Prerequisites:

    • Hadoop cluster (HDFS and YARN) already set up.
    • Java Development Kit (JDK) installed on all nodes.
    • Hadoop running in pseudo-distributed or fully-distributed mode.
    • ZooKeeper installed and configured (HBase requires ZooKeeper for coordination).

Download and extract the HBase binary, then configure the environment variables as in Step 2 of the Hadoop setup (using HBASE_HOME):

wget https://archive.apache.org/dist/hbase/2.4.15/hbase-2.4.15-bin.tar.gz

tar -xvf hbase-2.4.15-bin.tar.gz

Configuration file changes:

hbase-site.xml

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://<your_main_node>:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value><zookeeper_node_1>,<zookeeper_node_2>,...</value>
  </property>
</configuration>
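
Since HBase here runs on the HDFS cluster with an external ZooKeeper (per the prerequisites), two more settings are usually needed: add this property inside the <configuration> block above, and set the export in conf/hbase-env.sh so HBase does not start its own ZooKeeper.

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>

export HBASE_MANAGES_ZK=false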


regionservers

Add one hostname or IP per line for each RegionServer you want to run.


Start HBase

./bin/start-hbase.sh
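
To check that HBase came up, run jps again (HMaster on the main node, HRegionServer on the workers) or open the HBase Master web UI, which listens on port 16010 by default in HBase 2.x:

<your_main_node>:16010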

Apache Hive (v2.3.9)

Download and extract the Hive binary, then configure the environment variables as in Step 2 of the Hadoop setup (using HIVE_HOME):

wget https://archive.apache.org/dist/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz

tar -xvf apache-hive-2.3.9-bin.tar.gz

Configuration changes:

hive-site.xml

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://mysqlendpoint:mysqlport/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>mysql-pass</value>
    <description>password to use against the metastore database</description>
  </property>
</configuration>
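
A MySQL-backed metastore also needs the JDBC driver class and user name configured. These are the standard property names, with placeholder values in the same style as above; add them inside the <configuration> block:

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>mysql-user</value>
  </property>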


Make sure the MySQL JDBC connector JAR (mysql-connector-java) is present in Hive's lib directory ($HIVE_HOME/lib).
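
On Hive 2.x, the metastore schema usually has to be initialized once against the MySQL database before the first start:

schematool -dbType mysql -initSchema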

Start Hive Metastore

hive --service metastore

Start HiveServer2

hiveserver2
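
To verify, you can connect with Beeline, which ships with Hive; HiveServer2 listens on port 10000 by default:

beeline -u jdbc:hive2://<your_main_node>:10000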