Introduction
Hadoop is an open-source framework designed for processing and storing large datasets across distributed clusters of computers.
It provides a reliable, scalable, and cost-effective solution for handling big data. Developed under the Apache Software Foundation, Hadoop has become a foundational technology in the field of big data analytics.
In this blog we will cover how to set up the Hadoop ecosystem in a Linux (RHEL-based) environment.
Prerequisites:
- A set of machines or virtual machines (VMs) running a Linux-based operating system (Ubuntu, CentOS, etc.); in this guide we install the Hadoop ecosystem on RHEL.
- Basic familiarity with the Linux command line.
- Java 8 Development Kit (JDK) installed on all machines.
- SSH access to all machines from a central node (passwordless SSH can be set up as sketched below).
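A minimal sketch for setting up passwordless SSH from the central node to the other machines (the user and hostnames below are examples, substitute your own):
# generate a key pair on the central node (skip if one already exists)
ssh-keygen -t rsa -b 4096
# copy the public key to every machine in the cluster
ssh-copy-id user@node1
ssh-copy-id user@node2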
Hadoop Setup (v2.10.2):
Step 1: Download and extract the Hadoop binary v2.10.2.
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.2/hadoop-2.10.2.tar.gz
tar -xvf hadoop-2.10.2.tar.gz
Step 2: Configure the environment variables.
Edit the .bashrc file and add the following lines:
export HADOOP_HOME=/path/to/hadoop/directory
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Source the file so the new variables take effect:
source ~/.bashrc
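To confirm the setup, check that the Hadoop binaries are on the PATH. Hadoop also reads JAVA_HOME from its own environment file, so set it there as well (the JDK path below is an example, use your actual Java 8 location):
# verify the binaries are reachable
hadoop version
# point Hadoop at your JDK (example path, adjust as needed)
echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0' >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh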
Repeat the same steps for the other Hadoop ecosystem components (HBase, Hive) as you install them later.
Step 3: Edit the configuration files
The following files under $HADOOP_HOME/etc/hadoop need to be modified to set up Hadoop.
1. core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://<your_main_node>:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/path/to/hadoop/tmp</value>
  </property>
</configuration>
2. hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/path/to/namenode/dir</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/path/to/datanode/dir</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
3. slaves
Add the IP address or hostname of each DataNode, one per line (see the example below).
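Assuming two worker hosts named datanode1 and datanode2 (hypothetical names), the slaves file would look like:
datanode1
datanode2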
4. yarn-site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value><your_main_node></value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Step 4: Start the Hadoop cluster
a. Format HDFS (first setup only; formatting erases existing HDFS metadata):
hdfs namenode -format
b. Start Hadoop:
start-dfs.sh
start-yarn.sh
To verify that Hadoop is working, run the jps command on each node and check that the expected Java services are running, as in the sample below.
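On the main node you would typically see output like the following (process IDs will differ; worker nodes should show DataNode and NodeManager instead):
jps
# sample output on the main node:
# 12345 NameNode
# 12360 SecondaryNameNode
# 12400 ResourceManager
# 12500 Jps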
You can also browse the web UIs using the IP you provided in the configuration:
NameNode UI:
<cluster-ip>:50070
ResourceManager UI:
<cluster-ip>:8088
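As a quick smoke test, you can write a file into HDFS and read it back (the paths below are just examples):
# create a small local file
echo "hello hadoop" > /tmp/hello.txt
# copy it into HDFS and read it back
hdfs dfs -mkdir -p /smoke-test
hdfs dfs -put /tmp/hello.txt /smoke-test/
hdfs dfs -cat /smoke-test/hello.txt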
HBase (v2.4.15)
Prerequisites:
- Hadoop cluster (HDFS and YARN) already set up.
- Java Development Kit (JDK) installed on all nodes.
- Hadoop running in pseudo-distributed or fully distributed mode.
- ZooKeeper installed and configured (HBase requires ZooKeeper for coordination).
Download and extract the HBase binary, then configure the environment variables as in Step 2 of the Hadoop setup.
wget https://archive.apache.org/dist/hbase/2.4.15/hbase-2.4.15-bin.tar.gz
tar -xvf hbase-2.4.15-bin.tar.gz
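For example, assuming the archive was extracted under /opt (the path is an assumption, adjust to your layout):
# add to ~/.bashrc, then source it
export HBASE_HOME=/opt/hbase-2.4.15
export PATH=$PATH:$HBASE_HOME/bin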
Configuration file changes:
hbase-site.xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://<your_main_node>:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value><zookeeper_node_1>,<zookeeper_node_2>,...</value>
  </property>
</configuration>
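For a fully distributed cluster with an external ZooKeeper, you will typically also want to mark the cluster as distributed in hbase-site.xml and tell HBase not to manage its own ZooKeeper:
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
And in conf/hbase-env.sh:
export HBASE_MANAGES_ZK=false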
regionservers
Add the hostname or IP of each region server you want, one per line.
Start HBase
./bin/start-hbase.sh
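To verify, jps should now show HMaster on the master node (and HRegionServer on the region servers); you can also check cluster health from the HBase shell:
hbase shell
# inside the shell, report servers and load:
status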
Apache Hive (v2.3.9)
Download and extract the Hive binary, then configure the environment variables as in Step 2 of the Hadoop setup.
wget https://archive.apache.org/dist/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
tar -xvf apache-hive-2.3.9-bin.tar.gz
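Again assuming the archive was extracted under /opt (the path is an assumption):
export HIVE_HOME=/opt/apache-hive-2.3.9-bin
export PATH=$PATH:$HIVE_HOME/bin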
Configuration changes:
hive-site.xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://mysqlendpoint:mysqlport/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>mysql-pass</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>
Make sure the MySQL JDBC connector JAR (mysql-connector-java) is present in Hive's lib directory.
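For example (the connector version below is an assumption, use whichever you downloaded), and note that the metastore schema must be initialized in MySQL once before first use; Hive ships a schematool for this:
# copy the JDBC connector into Hive's lib directory (version is an example)
cp mysql-connector-java-5.1.49.jar $HIVE_HOME/lib/
# initialize the metastore schema in MySQL (first run only)
schematool -dbType mysql -initSchema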
Start the Hive metastore
hive --service metastore
Start HiveServer2
hiveserver2
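To verify, connect with Beeline (HiveServer2 listens on port 10000 by default; the credentials depend on your setup):
beeline -u jdbc:hive2://localhost:10000
# inside beeline, a simple sanity check:
show databases;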