Tutorial-HDP-Administration - I HDFS & YARN
1. Prelude
Example: 10.10.20.21 tos.master.com
Mount the shared folder in the VM host. Henceforth it will be referred to as the Software folder. You can refer to the supplementary document for enabling and mounting the shared folder.
Comment out the last line in case of any issue, as shown below.
Issue
2. Ambari
Hadoop requires Java, so you need to install the JDK and set JAVA_HOME on all the nodes.
# su - root
# mkdir /YARN
# tar -xvf jdk-8u181-linux-x64.tar.gz -C /YARN
# cd /YARN
# mv jdk1.8.0_181 jdk
To include JAVA_HOME for all bash users, make an entry in /etc/profile.d as follows:
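A minimal sketch of such an entry, assuming the JDK is extracted to /YARN/jdk (the script name java.sh is illustrative):
# vi /etc/profile.d/java.sh
export JAVA_HOME=/YARN/jdk
export PATH=$JAVA_HOME/bin:$PATH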
We have extracted Java into the /YARN/jdk folder and will specify JAVA_HOME using the root logon.
(CentOS 7 64-bit CLI)
Type bash at the command prompt to reinitialize the shell scripts.
Next, we will download the Ambari repository file so that the yum utility can use it.
Steps
1. #mkdir /apps
2. #cd /apps
3. Download the Ambari repository file to a directory on your installation host (see the example after this list).
4. yum install wget
Important
Do not modify the ambari.repo file name. This file is expected to be available on the
Ambari Server host during Agent registration.
5. Confirm that the repository is configured by checking the repo list.
yum repolist
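For step 3, the download typically looks like the following sketch; the exact repository URL depends on your Ambari version and OS (shown here for Ambari 2.7.x on CentOS 7, so treat the URL as an assumption):
# wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.7.3.0/ambari.repo -O /etc/yum.repos.d/ambari.repo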
1. Install Ambari. This also installs the default PostgreSQL Ambari database.
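Assuming the repository from the previous section is configured, the install command is:
# yum install ambari-server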
Ambari Server by default uses an embedded PostgreSQL database. When you install the Ambari Server, the PostgreSQL packages and dependencies must be available for install. These packages are typically available as part of your operating system repositories.
Before starting the Ambari Server, you must set it up. Setup configures Ambari to talk to the Ambari database, installs the JDK, and allows you to customize the user account that the Ambari Server daemon will run as.
ambari-server setup
The command manages the setup process. Run the command on the Ambari server host to start
the setup process.
Respond to the setup prompt:
1. If you have not temporarily disabled SELinux, you may get a warning. Accept the default (y),
and continue.
2. By default, Ambari Server runs under root. Accept the default (n) at the Customize user account
for ambari-server daemon prompt, to proceed as root.
3. If you have not temporarily disabled iptables you may get a warning. Enter y to continue.
4. Select a JDK version to download. Enter 1 to download Oracle JDK 1.8. Alternatively, you can choose to enter a Custom JDK. If you choose Custom JDK, you must manually install the JDK on all hosts and specify the Java Home path.
For our lab, enter 2 (Custom JDK) and give the Java home as /YARN/jdk.
Note
JDK support depends entirely on your choice of Stack versions. By default, Ambari
Server setup downloads and installs Oracle JDK 1.8 and the accompanying Java
Cryptography Extension (JCE) Policy Files.
5. Enable Ambari Server to download and install GPL Licensed LZO packages [y/n] (n)? y
6. Accept the Oracle JDK license when prompted. You must accept this license to download the
necessary JDK from Oracle. The JDK is installed during the deploy phase.
7. Select n at Enter advanced database configuration to use the default, embedded PostgreSQL
database for Ambari. The default PostgreSQL database name is ambari. The default user name
and password are ambari/bigdata.
8. Setup completes.
ambari-server start
ambari-server status
To stop the Ambari Server: Do not execute this command. It’s for your information.
ambari-server stop
On Ambari Server start, Ambari runs a database consistency check looking for issues. If any issues are found, Ambari Server start will abort and display the following message: DB configs consistency check failed. Ambari writes more details about database consistency check results to the /var/log/ambari-server/ambari-server-check-database.log file.
You can force Ambari Server to start by skipping this check with the following option (use only when there is an issue):
ambari-server start --skip-database-check
If you have database issues and choose to skip this check, do not make any changes to your cluster topology or perform a cluster upgrade until you correct the database consistency issues.
Solution:
# ambari-server status
Using python /usr/bin/python
Ambari-server status
Ambari Server not running. Stale PID File at: /var/run/ambari-server/ambari-server.pid
# ambari-server reset
Using python /usr/bin/python
Resetting ambari-server
**** WARNING **** You are about to reset and clear the Ambari Server database. This will
remove all cluster host and configuration information from the database. You will be required to
re-configure the Ambari server and re-run the cluster wizard.
Are you SURE you want to perform the reset [yes/no] (no)? yes
Confirm server reset [yes/no](no)? yes
Resetting the Server database...
Creating schema and user...
done.
Creating tables...
done.
Ambari Server 'reset' completed successfully.
Prerequisites
Ambari Server must be running.
This will download the required JDBC driver for MySQL:
#yum install mysql-connector-java*
#cd /usr/lib/ambari-agent
#cp /usr/share/java/mysql-connector-java.jar .
Note: Whenever there is an issue related to a jar file, determine the jar from the log file, then manually download it and copy it to the folder as shown above.
Log on to Ambari Web using a web browser and install the HDP cluster software.
Stop the firewall in the VM.
#systemctl stop firewalld
2. Log in to the Ambari Server using the default user name/password: admin/admin. For a new
cluster, the Cluster Install wizard displays a Welcome page.
Click Sign In
3. Debugging – Ambari (A)
Debug logs help us troubleshoot Ambari issues better and faster. They contain more details of the internal calls, which help us understand the problem better.
Check the current log level in the log4j.properties file.
In the above picture the rootLogger value is shown as INFO,file; we need to change it to DEBUG,file.
#tail -f /var/log/ambari-server/ambari-server.log
1. Open the relevant configuration file in a UNIX text editor:
2. vi /etc/ambari-server/conf/log4j.properties
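After the change, the root logger line should read as follows (a sketch; the stock file ships with INFO,file):
log4j.rootLogger=DEBUG,file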
ambari-server restart
Check the DEBUG log in the ambari-server.log file and determine the entry showing the heartbeat received from your cluster node. In case you have an issue with any of the nodes, look for the heartbeat received at the Ambari server. If it is not in the log file, then check the Ambari agent status.
Command :
tail -f /var/log/ambari-server/ambari-server.log
Please revert the log level to INFO once the debug logs are collected, using the same steps. Debug logs take a lot of space and can sometimes cause service failures.
ambari-agent restart
NOTE: The Ambari agent logging level changes only on this host and will not affect the other hosts in the cluster.
tail -f /var/log/ambari-agent/ambari-agent.log
Look for an entry that shows the heartbeat being sent from the agent to the server, as shown below.
After this, revert the setting to INFO and restart the Ambari agent.
Goal: You will verify some of the settings related to HDFS and YARN in the configuration files so that you become familiar with the various config files.
ambari-server start
ambari-server status
To stop the Ambari Server: Do not execute this command. It’s for your information.
ambari-server stop
You can view the log file in case of any issue: /var/log/ambari-server/ambari-server*.log
http://<your.ambari.server>:8080, where <your.ambari.server> is the name of your Ambari server host.
For example, a default Ambari server host is located at http://tos.master.com:8080/#/login.
4. Log in to the Ambari Server using the default user name/password: admin/admin. You can
change these credentials later.
http://tos.master.com:8080/#/main/dashboard/metrics
HDFS Services:
NameNode
DataNode
Yarn services:
Resource Manager
Node Manager
NameNode
In case the NameNode takes quite a long time to start up (i.e., exceeds 10 minutes), verify the logs in the following location and view the latest file.
/var/lib/ambari-agent/data/out*txt
If there is an error message as shown below, execute the command to come out of safe mode.
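The safe mode exit command (run as the hdfs user):
# su - hdfs
$ hdfs dfsadmin -safemode leave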
Open a terminal and vi the file /etc/hadoop/conf/core-site.xml. Verify the port number and the host that runs the NameNode service.
<property>
<name>fs.defaultFS</name>
<value>hdfs://tos.hp.com:8020</value>
<final>true</final>
</property>
<property>
<name>hadoop.proxyuser.hdfs.groups</name>
<value>*</value>
</property>
You can also verify using the ambari console as shown below:
This is the graphical representation of the config file. All changes have to be made from the web console only, so that synchronization to all slave nodes is managed by Ambari; otherwise you have to do it manually.
Verify the replication factor and the physical location of the data or block configured for the
cluster.
#vi /etc/hadoop/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/apps/YARN/data/hadoop/hdfs/nn</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>file:/apps/YARN/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>fs.checkpoint.edits.dir</name>
<value>file:/apps/YARN/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/apps/YARN/data/hadoop/hdfs/dn</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>hp.tos.com:50070</value>
</property>
</configuration>
Verify /etc/hadoop/conf/mapred-site.xml.
MapReduce-related settings: MapReduce executes in YARN mode in this cluster.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml (you can use the vi editor to view it): the pluggable shuffle and pluggable sort capabilities allow replacing the built-in shuffle and sort logic with alternate implementations.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
#vi /etc/hadoop/conf/hadoop-env.sh
HADOOP_HEAPSIZE="500"
HADOOP_NAMENODE_INIT_HEAPSIZE="500"
After starting HDFS, you can verify the Java processes as shown below.
#su -
#su hdfs
#jps
This command will list the java processes started for the Hadoop – Yarn.
You can also verify the services using the web interface. You need to replace the IP with that of your server.
Access the NameNode UI and DataNode UI. Get familiar with the various features of these UIs, especially the nodes that belong to the HDP cluster and the files stored in HDFS.
http://tos.master.com:50070/dfshealth.html#tab-overview
You can click on the various tabs to familiarize with the web UI.
How many DataNodes are there in the cluster? Right now, only one. What about any node being decommissioned?
Any snapshots being taken? You can verify this after the snapshot lab. All snapshot information will be stored here.
http://tos.hp.com:8088/
Whenever you submit a job to the YARN cluster, the job will be listed in this console. How many resources it consumes will also be displayed here.
Congrats! You have successfully completed understanding the main configuration of the YARN cluster.
You will be able to submit a MapReduce job to the Hadoop YARN cluster at the end of this lab.
You need to ensure that the Hadoop cluster is configured and started before proceeding.
We are going to use the sample MapReduce examples provided by the Hadoop installation, running as the hdfs user, to understand how to submit an MR job.
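A typical submission of the bundled pi example looks like this sketch; the jar path assumes an HDP layout, and the arguments (number of maps, samples per map) are illustrative:
# su - hdfs
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 5 1000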
You can verify the execution of jobs using the YARN web console.
http://tos.hp.com:8088/ui2/#/cluster-overview
We will talk about the Queues when we discuss scheduler later in the training.
Now the job is in the ACCEPTED state. Finally, it will be in the RUNNING state as shown below.
Click on the application ID link and verify the resources consumed by this job. Hover the mouse over the color to get the exact value of memory consumption.
Find out where the AM executes for the job we have just submitted.
In my case it’s the slave node; it may be different in your execution.
In the above example, it asks for 5 containers, each of 768 MB and 1 vcore.
Finally, at the end of the job execution, the pi result is shown as above.
Errata: /etc/hadoop/conf/yarn-site.xml
Issue: MapReduce jobs not proceeding and stuck in the ACCEPTED state.
Solution: verify the yarn-site.xml file and run the YARN services as the yarn user only.
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>124</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
Reduce the memory of the ResourceManager/NodeManager to about 250 MB each if jobs are unable to execute, and start the MRv2 history server.
6. Using HDFS
In this lab you will begin to get acquainted with the Hadoop tools. You will manipulate files in
HDFS, the Hadoop Distributed File System.
Set Up Your Environment
Before starting the labs, start up the VM and HDFS. You need to log on as the hdfs user for this exercise.
ambari-server start
ambari-server status
To stop the Ambari Server: Do not execute this command. It’s for your information.
ambari-server stop
You can view the log file in case of any issue: /var/log/ambari-server/ambari-server*.log
http://<your.ambari.server>:8080, where <your.ambari.server> is the name of your Ambari server host.
For example, a default Ambari server host is located at http://tos.master.com:8080/#/login.
6. Log in to the Ambari Server using the default user name/password: admin/admin. You can
change these credentials later.
http://tos.master.com:8080/#/main/dashboard/metrics
HDFS Services:
NameNode
DataNode
Yarn services:
Resource Manager
Node Manager
NameNode
In case the NameNode takes quite a long time to start up (i.e., exceeds 10 minutes), verify the logs in the following location and view the latest file.
/var/lib/ambari-agent/data/out*txt
If there is an error message as shown below, execute the command to exit safe mode.
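As before, the safe mode exit command is:
$ hdfs dfsadmin -safemode leave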
Data files (local): you need to copy all of these files into your VM. All exercises need to be performed using the hdfs logon unless specified otherwise. You can create a data folder in your home directory and put all the data inside that folder.
/software/data/shakespeare.tar.gz
/software/data/access_log.gz
/software/data/pg20417.txt
Hadoop is already installed, configured, and running on your virtual machine. Most of your
interaction with the system will be through a command-line wrapper called hadoop. If you run
this program with no arguments, it prints a help message. To try this, run the following command
in a terminal window:
# su - hdfs
$ hadoop
The hadoop command is subdivided into several subsystems. For example, there is a subsystem
for working with files in HDFS and another for launching and managing MapReduce processing
jobs.
Exploring HDFS
The subsystem associated with HDFS in the Hadoop wrapper program is called FsShell. This
subsystem can be invoked with the command hadoop fs.
Open a terminal window (if one is not already open) by double-clicking the Terminal icon on the desktop.
In the terminal window, enter:
$ hadoop fs
You see a help message describing all the commands associated with the FsShell subsystem.
Enter:
$ hadoop fs -ls /
This shows you the contents of the root directory in HDFS. There will be multiple entries, one of
which is /user. Individual users have a “home” directory under this directory, named after their
username; your username in this course is hdfs, therefore your home directory is /user/hdfs.
This is different from running hadoop fs -ls /foo, which refers to a directory that doesn’t exist. In
this case, an error message would be displayed.
Note that the directory structure in HDFS has nothing to do with the directory structure of the
local filesystem; they are completely separate namespaces.
Uploading Files
Besides browsing the existing filesystem, another important thing you can do with FsShell is to
upload new data into HDFS. Change directories to the local filesystem directory containing the
sample data we will be using in the homework labs.
$ cd /Software
If you perform a regular Linux ls command in this directory, you will see a few files, including two
named shakespeare.tar.gz and shakespeare-stream.tar.gz. Both of these contain the complete
works of Shakespeare in text format, but with different formats and organizations. For now we
will work with shakespeare.tar.gz.
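The extraction step (shown as a screenshot in the original) is along these lines:
$ tar -xzf shakespeare.tar.gz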
This creates a directory named shakespeare/ containing several files on your local filesystem.
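The upload itself is a single put command; a sketch, assuming you run as the hdfs user:
$ hadoop fs -put shakespeare /user/hdfs/shakespeare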
This copies the local shakespeare directory and its contents into a remote, HDFS directory named
/user/hdfs/shakespeare.
Now try the same fs -ls command but without a path argument:
$ hadoop fs -ls
You should see the same results. If you don’t pass a directory name to the -ls command, it assumes you mean your home directory, i.e. /user/hdfs.
Relative paths
If you pass any relative (non-absolute) paths to FsShell commands (or use relative paths in
MapReduce programs), they are considered relative to your home directory.
We will also need a sample web server log file, which we will put into HDFS for use in future labs.
This file is currently compressed using GZip. Rather than extract the file to the local disk and then
upload it, we will extract and upload in one step.
First, create a directory in HDFS in which to store it:
Now, extract and upload the file in one step. The -c option to gunzip uncompresses to standard output, and the dash (-) in the hadoop fs -put command takes whatever is being sent to its standard input and places that data in HDFS.
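A sketch of both steps, assuming the HDFS directory is named weblog and the source file is access_log.gz:
$ hadoop fs -mkdir weblog
$ gunzip -c access_log.gz | hadoop fs -put - weblog/access_log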
Run the hadoop fs -ls command to verify that the log file is in your HDFS home directory.
The access log file is quite large – around 500 MB. Create a smaller version of this file, consisting
only of its first 5000 lines, and store the smaller version in HDFS. You can use the smaller version
for testing in subsequent labs.
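One way to do this, reusing the pipe trick above (the directory name testlog is illustrative):
$ hadoop fs -mkdir testlog
$ gunzip -c access_log.gz | head -n 5000 | hadoop fs -put - testlog/test_access_log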
This lists the contents of the /user/hdfs/shakespeare HDFS directory, which consists of the files
comedies, glossary, histories, poems, and tragedies.
The glossary file included in the compressed file you began with is not strictly a work of
Shakespeare, so let’s remove it:
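The removal command, with the path relative to your HDFS home directory:
$ hadoop fs -rm shakespeare/glossary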
Note that you could leave this file in place if you so wished. If you did, then it would be included in subsequent computations across the works of Shakespeare, and would skew your results slightly. As with many real-world big data problems, you make trade-offs between the labor to purify your input data and the precision of your results.
Enter:
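The command referred to here has the following shape (a sketch; histories is one of the uploaded Shakespeare files):
$ hadoop fs -cat shakespeare/histories | tail -n 50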
This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command is handy for
viewing the output of MapReduce programs. Very often, an individual output file of a MapReduce
program is very large, making it inconvenient to view the entire file in the terminal. For this
reason, it’s often a good idea to pipe the output of the fs -cat command into head, tail, more, or
less.
To download a file to work with on the local filesystem use the fs -get command. This command
takes two arguments: an HDFS path and a local path. It copies the HDFS contents into the local
filesystem:
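For example (the local target path is illustrative):
$ hadoop fs -get shakespeare/poems ~/shakepoems.txt
$ less ~/shakepoems.txt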
There are several other operations available with the hadoop fs command to perform most
common filesystem manipulations: mv, cp, mkdir, etc.
$ hadoop fs
This displays a brief usage report of the commands available within FsShell. Try playing around
with a few of these commands if you like.
In order to work with HDFS you need to use the hadoop fs command. For example, to list the / and /tmp directories you need to input the following commands:
hadoop fs -ls /
hadoop fs -ls /tmp
There are many commands you can run within the Hadoop filesystem. For example to make the
directory test you can issue the following command:
hadoop fs -mkdir test
You should be aware that you can pipe (using the | character) any HDFS command to be used with
the Linux shell. For example, you can easily use grep with HDFS by doing the following:
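For example (assumes a test directory exists in your HDFS home directory):
$ hadoop fs -ls | grep test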
As you can see, the grep command only returned the lines which had test in them (thus removing the "Found x items" line and the oozie-root directory from the listing).
In order to move files between your regular Linux filesystem and HDFS you will likely use the put and get commands. First, move a single file to the Hadoop filesystem.
Copy pg20417.txt from the software folder to the data folder.
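A sketch, assuming the data folder created earlier in your home directory:
$ cp /software/data/pg20417.txt ~/data/
$ hadoop fs -put ~/data/pg20417.txt pg20417.txt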
You should now see a new file called /user/hdfs/pg* listed. In order to view the contents of this file
we will use the -cat command as follows:
We can also use the linux diff command to see if the file we put on HDFS is actually the same as
the original on the local filesystem. You can do this as follows:
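Sketches of both (bash process substitution lets diff read HDFS output directly):
$ hadoop fs -cat pg20417.txt | head -n 20
$ diff <(hadoop fs -cat pg20417.txt) ~/data/pg20417.txt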
In order to use HDFS commands recursively, you generally add an "r" to the HDFS command (in the Linux shell this is generally done with the "-R" argument). For example, to do a recursive listing we'll use the -lsr command rather than just -ls. Try this:
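For example (on recent Hadoop releases -lsr is deprecated in favor of -ls -R):
$ hadoop fs -lsr /user/hdfs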
In order to find the size of files you need to use the -du or -dus commands. Keep in mind that
these commands return the file size in bytes. To find the size of the pg20417.txt file use the
following command:
To find the size of all files individually in the /user/root directory use the following command:
To find the size of all files in total of the /user/root directory use the following command:
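Sketches of all three (the -dus form is deprecated in favor of -du -s on recent releases):
$ hadoop fs -du pg20417.txt
$ hadoop fs -du /user/root
$ hadoop fs -dus /user/root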
If you would like to get more information about a given command, invoke -help as follows:
hadoop fs -help
For example, to get help on the dus command you'd do the following:
hadoop fs -help dus
You can observe HDFS’s NameNode console as follows. Familiarize yourself with the various options.
http://10.10.20.20:50070/dfshealth.html#tab-overview
Click on Datanodes
Click on Snapshot
Click on Logs
http://192.168.246.131:50070/logs/
You can verify the log by clicking on the datanode log file.
You can verify the Hadoop file system health. Check for minimally replicated blocks, if any.
#hadoop fsck /
This is a dfsadmin command for reporting on each DataNode. It displays the status of the Hadoop cluster. Are there any under-replicated blocks or corrupt replicas?
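The command in question:
# su - hdfs
$ hdfs dfsadmin -report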
Go to the log folder:
# cd /var/hadoop/logs
# ls
# vi hadoop.txt
hadoop fsck /
#hadoop version
The default port is 50070. To get a list of files in a directory you would use:
curl -i http://tos.hp.com:50070/webhdfs/v1/user/root/output/?op=LISTSTATUS
#su - hdfs
#hadoop fs -put yelp_academic_dataset_review.json /user/hdfs/mydata
The NameNode has all the metadata related to the file, such as the replication factor, locations, racks, etc.
We can view this information by executing the command below.
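A sketch of such an fsck invocation for the file uploaded above (the exact HDFS path depends on how the put was issued):
$ hdfs fsck /user/hdfs/mydata -files -blocks -locations -racks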
On running the above command, the gateway node runs fsck and connects to the NameNode. The NameNode checks for the file and the time it was created.
Next, the NameNode will go to the particular block pool ID which contains the metadata information.
Based on the block pool ID, it will search for the block IDs on the DataNodes and details such as the rack on which the data is stored, based on the replication factor.
Further, it will give you information regarding blocks which are over-replicated or under-replicated, corrupt blocks, and the number of DataNodes and racks used, along with the health status of the file system.
Apart from this, the scheduler also plays a role in distributing the resources and scheduling a job when storing data into HDFS. In this case, I’m using the YARN architecture. The details related to scheduling are present in yarn-site.xml. The default scheduler used is the capacity scheduler.
#hadoop fs -ls /
https://community.cloudera.com/t5/Community-Articles/Set-log-level-of-namenode/ta-p/249460
https://leveragebigdata.blogspot.com/2017/01/debugging-apache-hadoop.html
https://stackoverflow.com/questions/19198367/is-there-a-way-to-debug-namenode-or-datanode-of-hadoop-using-eclipse
#hdfs dfsadmin -metasave metasave-report.txt
#cat metasave-report.txt
Determine the meta file of the block and verify the status as shown below:
Optional:
Enable Debug using the following option and restart the services.
Set the NameNode Java heap size (memory) to 2.5 GB using the following option.
Use Services > [HDFS] > Configs to optimize service performance for the service.
1. In Ambari Web, click a service name in the service summary list on the left.
2. From the service Summary page, click the Configs tab, then use one of the following tabs to manage configuration settings.
o Use the Configs tab to manage configuration versions and groups.
o Use the Settings tab to manage Smart Configs by adjusting the green, slider buttons.
o Use the Advanced tab to edit specific configuration properties and values.
3. Click Save.
4. Enter a description for this configuration version that includes your current changes.
5. Review and confirm each recommended change.
Restart all affected services.
Let us configure:
Click on the HDFS service > Configs > Config Group > Manage Config Groups > Add.
Enter the following details:
Click OK.
Select the group on the left side and add the slavea host on the right.
Click Save.
Now let us change the memory setting of Slave A. Select the config group which we have just created above.
Override Configurations
Once you have created the configuration group and assigned some hosts to it, you are ready to override configuration values. This section uses the HDFS maximum Java heap size property as an example to describe how to override configuration values.
1. On the HDFS configuration page, from the Group drop-down list, select the configuration group created in the previous section. You will see that the configuration values displayed are identical to the ones in the default group. Configuration groups show the full list of configuration properties.
2. Click the Override button next to the property for which you want to set a new value. Enter a new value in the text box shown below the default value.
3. You will not be able to save the configuration changes unless you specify a value that is different from the default value.
4. Click Save on the top of the configuration page to save the configuration. Enter a description
for the change in the Save Configuration wizard and click Save again.
5. Ambari web UI opens up a new wizard dialog with the save configuration result.
7. Ambari web UI shows different configuration values defined in various groups when it displays
the default group.
In this post, we describe how you can override component configuration on a subset of hosts. This is a very useful and straightforward way to apply host-specific configuration values when a cluster is a heterogeneous mixture of hosts. You can also re-assign hosts from the non-default configuration groups to the default group or to other non-default configuration groups.
And before we start, here’s a nifty trick for your tests: when running the benchmarks described in the following sections, you might want to use the Unix time command to measure the elapsed time. This saves you the hassle of navigating to the Hadoop JobTracker web interface to get the (almost) same information. Simply prefix every Hadoop command with time:
time hadoop jar hadoop-*examples*.jar ...
TestDFSIO
The TestDFSIO benchmark is a read and write test for HDFS. It is helpful for tasks such
as stress testing HDFS, to discover performance bottlenecks in your network, to shake
out the hardware, OS and Hadoop setup of your cluster machines (particularly the
NameNode and the DataNodes) and to give you a first impression of how fast your
cluster is in terms of I/O.
The default output directory is /benchmarks/TestDFSIO
When a write test is run via -write, the TestDFSIO benchmark writes its files to /benchmarks/TestDFSIO on HDFS. Files from older write runs are overwritten.
Benchmark results are saved in a local file called TestDFSIO_results.log in the current local
directory (results are appended if the file already exists) and also printed to STDOUT.
Run write tests before read tests
The read test of TestDFSIO does not generate its own input files. For this reason, it is a convenient practice to first run a write test via -write and then follow up with a read test via -read (while using the same parameters as during the previous -write run).
# su - yarn
$ export YARN_EXAMPLES=/usr/hdp/current/hadoop-mapreduce-client
$ cd /usr/hdp/current/hadoop-mapreduce-client
Run a write test (as input data for the subsequent read test)
TestDFSIO is designed in such a way that it will use 1 map task per file, i.e. it is a 1:1 mapping
from files to map tasks. Splits are defined so that each map gets only one filename, which it
creates ( -write ) or reads ( -read ).
The command to run a write test that generates 10 output files of size 1GB for a total of 10GB is:
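A sketch of the write test, plus the follow-up read and clean runs; the tests jar name and location vary by distribution (shown here for an HDP layout, so treat the path as an assumption):
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean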
The cleaning run will delete the output directory /benchmarks/TestDFSIO on HDFS.
Here, the most notable metrics are Throughput mb/sec and Average IO rate mb/sec. Both of them are based on the file size written
(or read) by the individual map tasks and the elapsed time to do so.
Two derived metrics you might be interested in are estimates of the “concurrent” throughput and average IO rate (for the lack of a
better term) your cluster is capable of. Imagine you let TestDFSIO create 1,000 files but your cluster has only 200 map slots. This
means that it takes about five MapReduce waves ( 5 * 200 = 1,000 ) to write the full test data because the cluster can only run
200 map tasks at the same time. In this case, simply take the minimum of the number of files (here: 1,000 ) and the number of
available map slots in your cluster (here: 200 ), and multiply the throughput and average IO rate by this minimum. In our example,
the concurrent throughput would be estimated at 4.989 * 200 = 997.8 MB/s and the concurrent average IO rate at 5.185 *
200 = 1,037.0 MB/s .
You do not need to re-generate input data before every TeraSort run (step 2). So you can skip step 1 (TeraGen) for later TeraSort
runs if you are satisfied with the generated data.
Figure 1 shows the basic data flow. We use the included HDFS directory names in the later examples.
Figure 1: Hadoop Benchmarking and Stress Testing: The basic data flow of the TeraSort benchmark suite.
Using the HDFS output directory /user/hduser/terasort-input as an example, the command to run TeraGen in order to generate 1 TB of input data (i.e. 1,000,000,000,000 bytes) is shown below.
Please note that the first parameter supplied to TeraGen is 10 billion (10,000,000,000), i.e. not 1 trillion = 1 TB (1,000,000,000,000). The reason is that the first parameter specifies the number of rows of input data to generate, each of which has a size of 100 bytes.
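A sketch of the TeraGen run (the examples jar path assumes an HDP layout; 10,000,000,000 rows x 100 bytes = 1 TB), followed by the TeraSort run over that input:
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen 10000000000 /user/hduser/terasort-input
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort /user/hduser/terasort-input /user/hduser/terasort-output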
Here is the actual TeraGen data format per row to clear things up:
Using the output directory /user/hdfs/terasort-output from the previous sections and the
report (output) directory /user/hdfs/terasort-validate as an example, the command to run
the TeraValidate test is:
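A sketch, with the directories as named above:
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teravalidate /user/hdfs/terasort-output /user/hdfs/terasort-validate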
NameNode benchmark
The following command runs a NameNode benchmark that creates 1000 files using 12 maps and 6 reducers. It uses a custom output directory based on the machine’s short hostname. This is a simple trick to ensure that one box does not accidentally write into the same output directory as another box running NNBench at the same time.
Note that by default the benchmark waits 2 minutes before it actually starts!
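A sketch matching that description (the tests jar path assumes an HDP layout; the baseDir trick uses the machine's short hostname):
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar nnbench \
    -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 \
    -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true \
    -baseDir /benchmarks/NNBench-$(hostname -s)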
On all nodes:
DataNode, ZooKeeper, ZKFC & JournalNode, wherever they are installed.
On slavea:
NameNode
At this point we have only one ResourceManager configured in the cluster.
Click Next to approve the changes and start automatically configuring ResourceManager HA.
6. On Configure Components, click Complete when all the progress bars finish tracking.
As you can see, one will be active and the other will be in standby.
In my case, rm1 is the primary RM. (YARN > Configs > Custom yarn-site.xml)
Stop the RM service on the master node. A Failing Over RM node message will then be displayed in the console, as shown below:
You can verify from the dashboard also that rm2 is now the primary ResourceManager.
Now the job will be orchestrated by the new primary RM, i.e. rm2, and we don’t need to resubmit the job to the cluster.
You can also determine the status of rm2, i.e. slavea, using the yarn command.
yarn rmadmin -getServiceState rm2
This means that the current active node will remain the primary ResourceManager until it fails over, even after the earlier primary node comes back up.
Case study: debug and resolve job-related issues in a running Hadoop cluster. You can’t execute any job whose virtual memory demand exceeds what is configured in the configuration file.
If the virtual memory usage exceeds the allowed configured memory, the container will be killed and the job will fail.
Let us enable the flag in the custom yarn-site.xml file so that the NodeManager can monitor the virtual memory usage of the cluster, i.e. yarn.nodemanager.vmem-check-enabled = true.
Dashboard > YARN > Configs > Advanced
After some time the job will fail with the following errors.
Container physical memory consumption at this juncture: virtual memory usage is beyond the permissible limit.
1.9 GB of 824 MB virtual memory used (virtual memory usage exceeds the limit). Killing container.
Observation:
Open the file and observe the allowed virtual-to-physical memory ratio. It’s 4 times here for each map container.
#vi /etc/hadoop/3.1.0.0-78/0/yarn-site.xml
yarn.nodemanager.vmem-pmem-ratio = 4 times
Since mapreduce.map.memory.mb is set to 206 MB, the total allowed virtual memory is 4 * 206 = 824 MB.
#vi /etc/hadoop/3.1.0.0-78/0/mapred-site.xml
However, as shown in the log below, 1.9 GB of virtual memory is demanded against the allowed 824 MB configured. Hence the job failed.
You can verify this from the log. This error is due to the overall consumption of virtual memory being more than the allocated allowed virtual memory. How do we resolve this? One way is to increase the physical memory and raise the allowed virtual memory ratio. Another way is to disable the validation of virtual memory, which is what we will do!
Concepts:
The NodeManager can monitor the memory usage (virtual and physical) of a container. If its virtual memory usage exceeds “yarn.nodemanager.vmem-pmem-ratio” times “mapreduce.reduce.memory.mb” or “mapreduce.map.memory.mb”, then the container will be killed if “yarn.nodemanager.vmem-check-enabled” is true.
Solution:
Set yarn.nodemanager.vmem-check-enabled to false and restart the cluster services, i.e. the NodeManager and ResourceManager.
Then resubmit the job all over again. Do not update the XML files directly; all changes have to be made through the Ambari UI only.
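For reference, the resulting property in yarn-site.xml (written by Ambari when you save the change) looks like this:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>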