Step 1 : Java Installation
1.1 Install the latest (or your desired) version of Java
sudo apt install default-jdk default-jre -y

1.2 Check the Java version
java -version
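If you prefer to pin a specific Java release instead of the distribution default, Hadoop 3.3.x runs on Java 8 or Java 11; a minimal sketch for Ubuntu (openjdk-11-jdk is the standard Ubuntu package for OpenJDK 11):
sudo apt install openjdk-11-jdk -y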

Step 2 : Create Hadoop User (Optional)
If you want to manage Hadoop files separately from your own account, create a dedicated Hadoop user.
2.1 Create a new user called hadoop.
sudo adduser hadoop

2.2 Make the hadoop user a member of the sudo group.
sudo usermod -aG sudo hadoop
The -aG options in the usermod command above stand for append (-a) to the supplementary Groups (-G) list.

2.3 Now switch to the hadoop user.
sudo su - hadoop

Step 3 : Configure Password-less SSH
Note : If you completed Step 2, switch to the hadoop user (sudo su - hadoop) before continuing with this step.
3.1 Install OpenSSH server and client
sudo apt install openssh-server openssh-client -y


3.2 Generate a public/private key pair.
ssh-keygen -t rsa
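ssh-keygen prompts for a key file location and a passphrase; pressing Enter at each prompt accepts the defaults and leaves the key without a passphrase, which is what password-less SSH needs. If you want to script this step, a non-interactive sketch (the key path below is simply the default one):
ssh-keygen -t rsa -b 4096 -P '' -f ~/.ssh/id_rsa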

3.3 Add the generated public key from id_rsa.pub to authorized_keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

3.4 Change the file permissions for authorized_keys.
chmod 640 ~/.ssh/authorized_keys

3.5 Check to see if the password-less SSH is working.
ssh localhost

Step 4 : Install and Configure Apache Hadoop as the hadoop user
Note : Check that you are using the hadoop user; if not, use the following command to switch to the hadoop user.
sudo su - hadoop
4.1 Download a stable Hadoop release (this guide uses 3.3.1)
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
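If this URL returns a 404, the release has likely been rotated off the main download mirror; Apache keeps older releases in its archive. A hedged alternative, assuming 3.3.1 has moved to the archive:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz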

If wget is not installed, install it first and then rerun the download command.
sudo apt-get install wget
4.2 Extract the downloaded tar file
tar -xvzf hadoop-3.3.1.tar.gz

4.3 Create Hadoop directory
To ensure that all of your files are organised in one location, move the extracted directory to /usr/local/.
sudo mv hadoop-3.3.1 /usr/local/hadoop
To keep the Hadoop logs in one place, create a separate logs directory inside /usr/local/hadoop.
sudo mkdir /usr/local/hadoop/logs
Finally, change the ownership of the Hadoop directory to the hadoop user.
sudo chown -R hadoop:hadoop /usr/local/hadoop
4.4 Configure Hadoop
Open the ~/.bashrc file in a text editor.
nano ~/.bashrc
When the nano editor opens, paste the following lines at the end of the file:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

After pasting the lines above, press CTRL+O to save (recent versions of nano also accept CTRL+S) and CTRL+X to exit the editor.
Then load the new environment variables into the current shell:
source ~/.bashrc
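To confirm the new variables are active in the current shell, a quick sanity check (which hadoop should print a path under /usr/local/hadoop/bin):
echo $HADOOP_HOME
which hadoop
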
Step 5 : Configure Java Environment Variables
Hadoop relies on a number of components, such as YARN, HDFS, and MapReduce, to carry out its work. To configure these components and the related project settings, you must define the Java environment variables in the hadoop-env.sh configuration file.
5.1 Find the Java path and the OpenJDK directory with the following commands
which javac
readlink -f /usr/bin/javac
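If you would rather derive the JAVA_HOME value directly instead of copying it by hand, a small sketch that strips the trailing /bin/javac from the readlink output (it assumes javac resolves to a path ending in /bin/javac):
readlink -f /usr/bin/javac | sed 's:/bin/javac::'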

5.2 Edit Hadoop-env.sh file
This file contains Hadoop's environment variable settings. You can use them to modify the behaviour of the Hadoop daemons, such as where log files are stored and the maximum amount of heap used. The main variable you need to change in this file is JAVA_HOME, which specifies the path to the Java installation that Hadoop uses.
Open the hadoop-env.sh file in your preferred text editor first. In this case, I’ll use nano.
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Now add the following lines at the end of the file, using the Java and OpenJDK paths you found in step 5.1 for the JAVA_HOME value:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"

5.3 Javax activation
On Java 11 the javax.activation classes are no longer bundled with the JDK, so the javax.activation-api JAR needs to be added to Hadoop's lib directory. Change to that directory first.
cd /usr/local/hadoop/lib
Now download the javax.activation-api JAR file into this directory.
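A hedged example that fetches it from Maven Central (the URL and the 1.2.0 version are assumptions; use whichever version you need):
wget https://repo1.maven.org/maven2/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar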

Verify your Hadoop installation by running the following command.
hadoop version

Step 6 : Edit core-site.xml File
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following configuration to set the default file system URL (fs.defaultFS), replacing the default local file system with HDFS:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
This example uses values specific to the local system. Use values that match your system's requirements, and keep them consistent throughout the configuration process.
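To confirm Hadoop picks up the new value, you can query it with the getconf tool (a quick sanity check; it should print the hdfs://localhost:9000 URL configured above):
hdfs getconf -confKey fs.defaultFS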

Step 7 : Edit hdfs-site.xml File
Use the following command to open the hdfs-site.xml file for editing:
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following configuration to the file and, if needed, adjust the NameNode and DataNode directories to your custom locations:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
If necessary, create the specific directories you defined for the dfs.name.dir and dfs.data.dir values.
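For example, the directories used in the configuration above can be created in one command while logged in as the hadoop user (adjust the paths if you chose custom locations):
mkdir -p ~/hadoopdata/hdfs/namenode ~/hadoopdata/hdfs/datanode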

Step 8 : Edit mapred-site.xml File
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following configuration to change the default MapReduce framework name value to yarn:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Step 9 : Edit yarn-site.xml File
Open the yarn-site.xml file in a text editor:
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Append the following configuration to the file:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

Step 10 : Format the HDFS NameNode and validate the Hadoop configuration
10.1 Switch to the hadoop user
sudo su - hadoop

10.2 Format the NameNode
hdfs namenode -format


Step 11 : Launch the Apache Hadoop Cluster
11.1 Launch the NameNode and DataNode daemons
start-dfs.sh

11.2 Launch the YARN ResourceManager and NodeManager
start-yarn.sh

11.3 Verify the running components
jps
jps stands for Java Virtual Machine Process Status; it lists the Java processes running for the current user.
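On a healthy single-node setup, the output should list one line per daemon, similar to the hedged example below (the process IDs are placeholders and will differ on your machine):
4961 NameNode
5103 DataNode
5326 SecondaryNameNode
5571 ResourceManager
5700 NodeManager
6012 Jps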

Step 12 : Access the Hadoop UI from a Browser
Once you know your server's IP address (or simply use localhost) and the relevant Hadoop ports, you can access the Hadoop web dashboards.
Use your preferred browser and navigate to your localhost URL or IP. The default port 9870 gives you access to the Hadoop NameNode UI:
http://localhost:9870
The NameNode user interface provides a comprehensive overview of the entire cluster.

The default port 9864 is used to access individual DataNodes directly from your browser:
http://localhost:9864

The YARN Resource Manager is accessible on port 8088:
http://localhost:8088
The Resource Manager is an invaluable tool that allows you to monitor all running processes in your Hadoop cluster.
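If you want to check the web endpoints from the command line before opening a browser, a small curl sketch (it only verifies that each port answers; 200 is the expected HTTP status code):
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9870
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088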
