Tutorial-HDP-Administration V III
Big Data - Admin Course
1. Prelude
Ex: 10.10.20.21 tos.master.com
Mount the shared folder in the VM host. Henceforth it will be referred to as the Software folder. You can refer to the supplement document for enabling and mounting the shared folder.
Comment the last line in case of any issue as shown below.
Issue
2. Ambari
Hadoop requires Java, so you need to install the JDK and set the Java home.
#su - root
#mkdir /YARN
#tar -xvf jdk-8u181-linux-x64.tar.gz -C /YARN
#mv /YARN/jdk1.8.0_181 /YARN/jdk
To include JAVA_HOME for all bash users, make an entry in /etc/profile.d as follows:
We have extracted Java into the /YARN/jdk folder and will specify the Java home using the root logon.
(CentOS 7 64-bit-CLI)
Type bash at the command prompt to reinitialize the shell scripts.
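The /etc/profile.d entry described above can be sketched as follows. This is a minimal example assuming Java was extracted to /YARN/jdk as in the steps above; the file name java.sh is our choice, not mandated by the tutorial:

```shell
# /etc/profile.d/java.sh -- sets JAVA_HOME for all bash users at login
export JAVA_HOME=/YARN/jdk
export PATH=$JAVA_HOME/bin:$PATH
```

After creating the file, start a new bash session (or type bash) so the profile script is sourced.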
Next we will download the Ambari repo file so that the yum utility can install Ambari from it.
Steps
1. #mkdir /apps
2. #cd /apps
3. Download the Ambari repository file to a directory on your installation host.
4. yum install wget
5.
wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.7.3.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
Important
Do not modify the ambari.repo file name. This file is expected to be available on
the Ambari Server host during Agent registration.
6. Confirm that the repository is configured by checking the repo list.
yum repolist
1. Install Ambari. This also installs the default PostgreSQL Ambari database.
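A sketch of the installation command for this step, assuming the ambari.repo file configured above is in place (this is the standard package name in the Hortonworks repository):

```shell
# Install the Ambari Server package from the configured repo
yum install -y ambari-server
```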
Ambari Server by default uses an embedded PostgreSQL database. When you install the
Ambari Server, the PostgreSQL packages and dependencies must be available for install.
These packages are typically available as part of your Operating System repositories.
Before starting the Ambari Server, you must set up the Ambari Server. Setup configures
Ambari to talk to the Ambari database, installs the JDK and allows you to customize the
user account the Ambari Server daemon will run as.
ambari-server setup
The command manages the setup process. Run the command on the Ambari server host to
start the setup process.
Respond to the setup prompt:
1. If you have not temporarily disabled SELinux, you may get a warning. Accept the
default (y), and continue.
2. By default, Ambari Server runs under root. Accept the default (n) at the Customize user
account for ambari-server daemon prompt, to proceed as root.
3. If you have not temporarily disabled iptables you may get a warning. Enter y to
continue.
4. Select a JDK version to download. Enter 1 to download Oracle JDK 1.8. Alternatively,
you can choose to enter a Custom JDK. If you choose Custom JDK, you must
manually install the JDK on all hosts and specify the Java Home path:
For our lab, enter 2 (Custom JDK) and enter the Java home as /YARN/jdk.
Note
JDK support depends entirely on your choice of Stack versions. By default,
Ambari Server setup downloads and installs Oracle JDK 1.8 and the
accompanying Java Cryptography Extension (JCE) Policy Files.
5. Accept the Oracle JDK license when prompted. You must accept this license to
download the necessary JDK from Oracle. The JDK is installed during the deploy phase.
ambari-server start
ambari-server status
To stop the Ambari Server: Do not execute this command. It’s for your information.
ambari-server stop
On Ambari Server start, Ambari runs a database consistency check looking for issues. If
any issues are found, Ambari Server start will abort and display the following
message: DB configs consistency check failed. Ambari writes more details about database
consistency check results to the /var/log/ambari-server/ambari-server-check-database.log file.
You can force Ambari Server to start by skipping this check with the following option:
(optional; use only when there is an issue)
ambari-server start --skip-database-check
If you have database issues and choose to skip this check, do not make any changes to your cluster topology or perform a cluster upgrade until you correct the database consistency issues.
Solution:
# ambari-server status
Using python /usr/bin/python
Ambari-server status
Ambari Server not running. Stale PID File at: /var/run/ambari-server/ambari-server.pid
# ambari-server reset
Using python /usr/bin/python
Resetting ambari-server
**** WARNING **** You are about to reset and clear the Ambari Server database. This
will remove all cluster host and configuration information from the database. You will be
required to re-configure the Ambari server and re-run the cluster wizard.
Are you SURE you want to perform the reset [yes/no] (no)? yes
Confirm server reset [yes/no](no)? yes
Resetting the Server database...
Creating schema and user...
done.
Creating tables...
done.
Ambari Server 'reset' completed successfully.
Next Steps
Prerequisites
Ambari Server must be running.
This will download the required drivers for MySQL:
#yum install mysql-connector-java*
#cd /usr/lib/ambari-agent
#cp /usr/share/java/mysql-connector-java.jar .
Note: Whenever there is any issue related to a jar file, determine the jar from the log file, then manually download it and copy it to the folder as shown above.
Log on to Ambari Web using a web browser and install the HDP cluster software.
Stop the firewall in the VM.
#systemctl stop firewalld
Steps
1. Point your web browser to
http://<your.ambari.server>:8080
where <your.ambari.server> is the name of your Ambari server host.
For example, a default Ambari server host is located
at http://tos.master.com:8080/#/login.
2. Log in to the Ambari Server using the default user name/password: admin/admin. For a
new cluster, the Cluster Install wizard displays a Welcome page.
Click Sign In
Install MySQL as usual and start the service. During installation, you will be asked whether you want to accept the results of the .rpm file's GPG verification. If no error or mismatch occurs, enter y.
yum install mysql-server
systemctl start mysqld
// Create User for Ranger KMS after upgrading mysql server
mysql_upgrade --force -uroot
Next Step
Steps
1. In Name your cluster, type a name for the cluster you want to create.
Use no white spaces or special characters in the name.
2. Choose Next.
Choose the following Version
ssh-keygen
4. Press Enter key twice to accept all default values.
5. Copy the SSH Public Key (id_rsa.pub) to the root account on your target hosts.
($HOME/.ssh/id_rsa.pub)
6. # cd $HOME/.ssh
.ssh/id_rsa
.ssh/id_rsa.pub
7. Add the SSH Public Key to the authorized_keys file on your target hosts.
cat id_rsa.pub >> authorized_keys
8. Depending on your version of SSH, you may need to set permissions on the .ssh
directory (to 700) and the authorized_keys file in that directory (to 600) on the target
hosts.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
9. From the Ambari Server, make sure you can connect to each host in the cluster using
SSH, without having to enter a password.
ssh root@<remote.target.host>
where <remote.target.host> has the value of each host name in your cluster.
#ssh [email protected]
10. If the following warning message displays during your first connection: Are you sure
you want to continue connecting (yes/no)? Enter Yes.
11. Retain a copy of the SSH Private Key (id_rsa file) on the machine from which you will
run the web-based Ambari Install Wizard.
Above command will copy the private key to the /apps folder so that you can copy the file
on your laptop using the winscp tool.
Select id_rsa. You need to download the file to your Windows machine before selecting this option.
Next
Accept all the options as shown above; some of the services are required by default.
Open a terminal and execute the following to copy the MySQL jar to the appropriate folders:
#cd /usr/lib/ambari-agent
#cp /usr/share/java/mysql-connector-java.jar .
#cp /usr/share/java/mysql-connector-java.jar /var/lib/ambari-server/resources
Click Next, accepting all default values, until the Deploy button is displayed.
Reduce the memory settings if any error occurs due to insufficient memory.
Click Deploy
Errors: In case of the following errors, copy the jar as mentioned below.
2019-10-15 03:29:04,668 - The repository with version 3.1.4.0-315 for this command has been marked as resolved. It will be
used to report the version of the component which was installed
#cd /usr/lib/ambari-agent
#cp /usr/share/java/mysql-connector-java.jar .
#cp /usr/share/java/mysql-connector-java.jar /var/lib/ambari-server/resources
After copying the above jar, retry the installation.
In case the system doesn't start for quite a long time, i.e. exceeds 10+ minutes, verify the log in the following location. View the latest out*txt file.
/var/lib/ambari-agent/data/out*txt
If there is a message similar to the one shown below, execute the next command to come out of safe mode.
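A sketch of the safe-mode exit step, assuming the standard HDFS admin CLI is on the path (run it as the hdfs user):

```shell
# Check whether the NameNode is in safe mode, then leave it
su - hdfs -c "hdfs dfsadmin -safemode get"
su - hdfs -c "hdfs dfsadmin -safemode leave"
```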
Solutions: Open a MySQL console and grant the hive user access to the MySQL server.
Use the password which was supplied at the time of installation, i.e. Life2134.
# mysql -uroot -p
mysql> use mysql;
mysql> select user,host from user;
mysql> grant all on *.* to 'hive'@'tos.hp.com' identified by 'Life2134';
At the end of this step, there should be an entry for hive in the user table. Now you can start the Hive Metastore and then the Hive Server. However, you need to start the Ranger server before starting the Hive Server. The sequence to start the Hive server is as follows:
MySql Server
Hive Metastore
Ranger Admin, Ranger User Sync – Master and Slave
Hive Server
-------------------------------------------- Lab ends here ---------------------------------------------
Goal: You will verify some of the settings related to HDFS and YARN in the config files so that you become familiar with the various config files.
ambari-server start
ambari-server status
To stop the Ambari Server: Do not execute this command. It's for your information.
ambari-server stop
You can view the log file in case of any issue: /var/log/ambari-server/ambari-server*.log
http://<your.ambari.server>:8080
where <your.ambari.server> is the name of your Ambari server host.
For example, a default Ambari server host is located
at http://tos.master.com:8080/#/login.
4. Log in to the Ambari Server using the default user name/password: admin/admin. You
can change these credentials later.
Log on to the Ambari server; click on any of the services listed below the Services tab in the dashboard. Click on the red icon and choose Start from the menu option.
http://tos.master.com:8080/#/main/dashboard/metrics
HDFS Services:
NameNode
DataNode
Yarn services:
Resource Manager
Node Manager
Namenode
In case the namenode takes quite a long time to start up, i.e. exceeds 10+ minutes, verify the log in the following location. View the latest file out of it.
/var/lib/ambari-agent/data/out*txt
If there is an error message as shown below, execute the command to come out of safe mode.
Open a terminal and vi the file /etc/hadoop/conf/core-site.xml. Verify the port number and the host that runs the namenode services.
<property>
<name>fs.defaultFS</name>
<value>hdfs://tos.hp.com:8020</value>
<final>true</final>
</property>
<property>
<name>hadoop.proxyuser.hdfs.groups</name>
<value>*</value>
</property>
You can also verify using the ambari console as shown below:
This is the graphical representation of the config file. All changes have to be made from the web console only, so that synchronization to all slave nodes is managed by Ambari; otherwise you have to do it manually.
Verify the replication factor and the physical location of the data or block configured for
the cluster.
#vi /etc/hadoop/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/apps/YARN/data/hadoop/hdfs/nn</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>file:/apps/YARN/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>fs.checkpoint.edits.dir</name>
<value>file:/apps/YARN/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/apps/YARN/data/hadoop/hdfs/dn</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>hp.tos.com:50070</value>
</property>
</configuration>
Verify /etc/hadoop/conf/mapred-site.xml
MapReduce-related settings: MapReduce executes in YARN mode in this cluster.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
You can use the vi editor to view yarn-site.xml. The pluggable shuffle and pluggable sort capabilities allow replacing the built-in shuffle and sort logic with alternate implementations.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
#vi /etc/hadoop/conf/hadoop-env.sh
HADOOP_HEAPSIZE="500"
HADOOP_NAMENODE_INIT_HEAPSIZE="500"
After starting HDFS, you can verify the Java processes as shown below:
#su -
#su hdfs
#jps
This command will list the Java processes started for Hadoop – YARN.
You can also verify the services using the web interface. You need to replace the IP with that of your server.
Access the NameNode UI and DataNode UI. Get familiar with the various features of these UIs, especially the nodes that belong to the HDP cluster and the files stored in HDFS.
http://tos.master.com:50070/dfshealth.html#tab-overview
You can click on the various tabs to familiarize with the web UI.
Any snapshot being taken? You can verify this after the snapshot lab. All snapshot information will be stored here.
http://tos.hp.com:8088/
Whenever you submit a job to the YARN cluster, the job will be listed in this console. The resources it consumes will also be displayed here.
You will be able to submit a MapReduce job to the Hadoop YARN cluster at the end of this lab. You need to ensure that the Hadoop cluster is configured and started before proceeding. We are going to use the sample MapReduce examples provided by the Hadoop installation, run as the hdfs user, to understand how to submit an MR job.
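A minimal sketch of such a submission. The examples jar path below is the usual HDP location, but it is an assumption; verify the exact path on your cluster before running:

```shell
# As the hdfs user, submit the sample pi estimator:
# 5 map tasks, 10 samples per map
su - hdfs -c "hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 5 10"
```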
You can verify the execution of jobs using the YARN web console.
http://tos.hp.com:8088/ui2/#/cluster-overview
We will talk about the Queues when we discuss scheduler later in the training.
Now, the job is in the ACCEPTED state. Finally it will be in the RUNNING state as shown below.
Click on the application id link and verify the resources consumed by this job. Hover the mouse over the color to get the exact value of memory consumption.
Find out where the AM executes for the job we have submitted now.
In my case it is the slave node, which may be different for your execution.
In the above example it asks for 5 containers, each of 768 MB and 1 vcore.
Finally, at the end of the job execution, the pi result will be shown as above.
Errata: /etc/hadoop/conf/yarn-site.xml
Issue: MapReduce jobs not proceeding and stuck in the ACCEPTED state.
Solution: Verify the yarn-site.xml file and execute YARN services with the yarn user only.
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>124</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
6. HDFS ACL
HDFS ACLs give you the ability to specify fine-grained file permissions for specific named
users or named groups, not just the file’s owner and group.
To use ACLs, first you’ll need to enable ACLs on the NameNode by adding the following
configuration property to hdfs-site.xml and restarting the NameNode. Verify using the Ambari UI.
<property>
<name>dfs.namenode.acls.enabled</name>
<value>true</value>
</property>
This example demonstrates how a user ("hdfs"), shares folder access with colleagues from
another team ("hadoopdev"), so that the hadoopdev team can collaborate on the content of
that folder; this is accomplished by updating the default extended ACL of that directory:
Open a terminal and create a user – henry, with password life213 – using the root logon:
#su root
#groupadd hadoopdev
#useradd henry -G hadoopdev
#passwd henry
Create a folder /project using the hdfs user and dump a file.
#su hdfs
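The folder-creation step described above might look like the following sketch (run as the hdfs user; the FsShell commands are standard):

```shell
# As the hdfs user, create the HDFS folder that the team will share
hdfs dfs -mkdir /project
hdfs dfs -ls /
```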
Create a file in the /app folder and update with the content below.
#vi /app/henry.txt
Copy the above file to the HDFS /project folder using the hdfs user; otherwise you will have permission issues.
Open a new terminal and switch user to henry then display the file content.
#su henry
You will be able to display the content since the Other group has r-x, i.e. read access, on this directory.
Why is it denied? Verify the permission access of the Other group. (hint: r-x)
Make the files and sub-directories created within the content directory readable and
writable by team "hadoopdev":
# su hdfs
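The ACL update described above can be sketched with setfacl, assuming the /project directory and the hadoopdev group created earlier in this lab:

```shell
# Grant the hadoopdev group rwx on the directory itself...
hdfs dfs -setfacl -m group:hadoopdev:rwx /project
# ...and as a default ACL, so new files and sub-directories inherit it
hdfs dfs -setfacl -m default:group:hadoopdev:rwx /project
```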
Inspect the new sub-directory ACLs to verify that HDFS has applied the new default
values:
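The inspection can be done with getfacl (directory name assumed from the earlier steps):

```shell
# Show the extended ACLs recursively, including inherited default entries
hdfs dfs -getfacl -R /project
```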
Note: At the time it is created, the default ACL is copied from the parent directory to the
child directory. Subsequent changes to the parent directory default ACL do not change the
ACLs of the existing child directories.
From two single-node clusters to a multi-node cluster – We will build a multi-node cluster
using two Linux boxes in this lab.
Prerequisites
You have already installed the HDP cluster on the master node, i.e. hp.tos.com, using Ambari.
Henceforth, we will call the designated master machine just the master and the slave-only
machine the slave. We will also give the two machines these respective hostnames in their
networking setup, most notably in /etc/hosts. If the hostnames of your machines are
different (e.g. node01) then you must adapt the settings in this tutorial as appropriate.
On both machines, rename the hostname by following the steps mentioned below, using root credentials:
# nmtui
Networking
Both machines should be able to talk to each other over the network. You need to enter the IPs of your respective machines in the hosts file of both machines.
Enable SSH access so that both machines can talk to each other without a password. The root user on the master must be able to connect to all the slave machines:
a) To its own user account on the master – i.e. ssh master in this context and not
necessarily ssh localhost – and
b) To the root user account on the slave via a password-less SSH login.
You have to add root@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of root@slave (in this user's $HOME/.ssh/authorized_keys).
1. You need to configure a public/private key pair for this; you can refer to "Configuring SSH" in the supplement to configure SSH on the master node.
2. You can do this manually or use the following SSH command on master node:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@slave
This command will prompt you for the login password for user root on slave, then copy the
public SSH key for you, creating the correct directory and fixing the permissions as
necessary.
The final step is to test the SSH setup by connecting with user root from the master to the user account root on the slave. This step is also needed to save the slave's host key fingerprint to root@master's known_hosts file.
3. So, connecting from master to master…
$ ssh master
…
Hadoop YARN: Cluster Overview of Services. At the end you will have the following
services in the cluster of two nodes.
Basically, the “master” daemons are responsible for coordination and management of the
“slave” daemons while the latter will do the actual data storage and data processing work.
Masters vs. Slaves
Typically one machine in the cluster is designated as the NameNode and another machine as the ResourceManager, exclusively. These are the actual "master nodes". The rest of the machines in the cluster act as both DataNode and NodeManager. These are the slaves or "worker nodes".
Ensure that Java is installed and that JAVA_HOME and the PATH variable are set as done on the master.
Log in to Ambari (tos.master.com), click on Hosts and choose Add New Hosts from
the Actions menu.
In the Confirm Hosts step, the Ambari server connects to the new node using SSH; it registers the new node to the cluster and installs the Ambari Agent in order to keep control over it. Registering phase:
After some time it will be displayed as above. If there is any issue, click on the status link to determine the root cause.
In the Assign Slaves and Clients step, define your node to be a DataNode with a NodeManager installed as well (if you are running Apache Storm, Supervisor is also an option).
Click next.
1. In step Configurations, there is not much to do, unless you operate with more than
one Configuration Group.
2. Click Next.
If there is any failure in the above steps, verify the log by clicking on the failed link and perform the necessary steps. For example, if SQL drivers were updated manually on the master node, you need to perform the same steps on the slave node too.
Next
Background: The HDFS name table is stored on the NameNode's (here: master) local filesystem in the directory specified by dfs.namenode.name.dir. The name table is used by the NameNode to store tracking and coordination information for the DataNodes.
Verify the Name node directory
#cd /hadoop/hdfs/namenode/current
#ls -ltr
Start the following services: click on Hosts, then click on each host listed in the menu and start the services as below:
Master:
Name Node
Data Node
Resource Manager
Node Manager
Zookeeper
Slave:
DataNode
Node Manager.
http://10.10.20.20:50070/dfshealth.html#tab-datanode
You can verify the datanode using the Web UI – Name Node UI console:
You can view the NodeManager nodes with the URL http://10.10.20.20:8088/cluster
Execute the MR job on the two-node cluster by referring to the "Map Reduce Job Submission – YARN" section.
-------------------------------------------- Lab ends here -----------------------------------------
8. Using HDFS
In this lab you will begin to get acquainted with the Hadoop tools. You will manipulate
files in HDFS, the Hadoop Distributed File System.
Set Up Your Environment
Before starting the labs, start up the VM and HDFS. You need to log on with the hdfs user for this exercise.
ambari-server start
ambari-server status
To stop the Ambari Server: Do not execute this command. It's for your information.
ambari-server stop
You can view the log file in case of any issue: /var/log/ambari-server/ambari-server*.log
http://<your.ambari.server>:8080
where <your.ambari.server> is the name of your Ambari server host.
For example, a default Ambari server host is located
at http://tos.master.com:8080/#/login.
6. Log in to the Ambari Server using the default user name/password: admin/admin. You
can change these credentials later.
http://tos.master.com:8080/#/main/dashboard/metrics
HDFS Services:
NameNode
DataNode
Yarn services:
Resource Manager
Node Manager
Namenode
In case the namenode takes quite a long time to start up, i.e. exceeds 10+ minutes, verify the log in the following location. View the latest file out of it.
/var/lib/ambari-agent/data/out*txt
If there is an error message as shown below, execute the command to exit safe mode.
Data files (local): you need to copy all these files into your VM. All exercises need to be performed using the hdfs logon unless specified otherwise. You can create a data folder in your home directory and put all the data inside that folder.
/software/data/shakespeare.tar.gz
/software/data/access_log.gz
/software/data/pg20417.txt
Hadoop is already installed, configured, and running on your virtual machine. Most of
your interaction with the system will be through a command-line wrapper called hadoop.
If you run this program with no arguments, it prints a help message. To try this, run the
following command in a terminal window:
# su - hdfs
$ hadoop
The hadoop command is subdivided into several subsystems. For example, there is a
subsystem for working with files in HDFS and another for launching and managing
MapReduce processing jobs.
Exploring HDFS
The subsystem associated with HDFS in the Hadoop wrapper program is called FsShell.
This subsystem can be invoked with the command hadoop fs.
Open a terminal window (if one is not already open) by double-clicking the Terminal icon on the desktop.
In the terminal window, enter:
$ hadoop fs
You see a help message describing all the commands associated with the FsShell
subsystem.
Enter:
$ hadoop fs -ls /
This shows you the contents of the root directory in HDFS. There will be multiple entries,
one of which is /user. Individual users have a “home” directory under this directory,
named after their username; your username in this course is hdfs, therefore your home
directory is /user/hdfs.
This is different from running hadoop fs -ls /foo, which refers to a directory that doesn’t
exist. In this case, an error message would be displayed.
Note that the directory structure in HDFS has nothing to do with the directory structure of
the local filesystem; they are completely separate namespaces.
Uploading Files
Besides browsing the existing filesystem, another important thing you can do with FsShell
is to upload new data into HDFS. Change directories to the local filesystem directory
containing the sample data we will be using in the homework labs.
$ cd /Software
If you perform a regular Linux ls command in this directory, you will see a few files,
including two named shakespeare.tar.gz and shakespeare-stream.tar.gz. Both of these
contain the complete works of Shakespeare in text format, but with different formats and
organizations. For now we will work with shakespeare.tar.gz.
This creates a directory named shakespeare/ containing several files on your local
filesystem.
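The extraction step referred to above would be standard tar usage, assuming you are in the /Software directory containing the archive:

```shell
# Unpack the Shakespeare archive into ./shakespeare on the local disk
tar -xzf shakespeare.tar.gz
ls shakespeare/
```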
This copies the local shakespeare directory and its contents into a remote, HDFS directory
named /user/hdfs/shakespeare.
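The upload described above is a single put, run as the hdfs user from the directory where shakespeare/ was extracted:

```shell
# Recursively copy the local shakespeare/ directory into HDFS
hadoop fs -put shakespeare /user/hdfs/shakespeare
```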
Now try the same fs -ls command but without a path argument:
$ hadoop fs -ls
You should see the same results. If you don't pass a directory name to the -ls command, it assumes you mean your home directory, i.e. /user/hdfs.
Relative paths
If you pass any relative (non-absolute) paths to FsShell commands (or use relative paths
in MapReduce programs), they are considered relative to your home directory.
We will also need a sample web server log file, which we will put into HDFS for use in
future labs. This file is currently compressed using GZip. Rather than extract the file to the
local disk and then upload it, we will extract and upload in one step.
First, create a directory in HDFS in which to store it:
Now, extract and upload the file in one step. The -c option to gunzip uncompresses to standard output, and the dash (-) in the hadoop fs -put command takes whatever is being sent to its standard input and places that data in HDFS.
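A sketch of both steps, assuming the access_log.gz file from the Software data folder; the HDFS directory name weblog is our choice, not mandated by the lab:

```shell
# Create the HDFS directory for the log
hadoop fs -mkdir weblog
# Uncompress to stdout and stream straight into HDFS via '-'
gunzip -c /software/data/access_log.gz | hadoop fs -put - weblog/access_log
```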
Run the hadoop fs -ls command to verify that the log file is in your HDFS home directory.
The access log file is quite large – around 500 MB. Create a smaller version of this file,
consisting only of its first 5000 lines, and store the smaller version in HDFS. You can use
the smaller version for testing in subsequent labs.
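One way to build the smaller file is to reuse the same cat/put pipe. The source path weblog/access_log and the target name testlog are illustrative assumptions; adjust them to wherever you stored the log:

```shell
# Take the first 5000 lines and store them as a separate HDFS file
hadoop fs -cat weblog/access_log | head -n 5000 | hadoop fs -put - testlog
```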
Enter:
This lists the contents of the /user/hdfs/shakespeare HDFS directory, which consists of
the files comedies, glossary, histories, poems, and tragedies.
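The listing command implied above is:

```shell
# List the shakespeare directory in the HDFS home directory
hadoop fs -ls shakespeare
```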
The glossary file included in the compressed file you began with is not strictly a work of
Shakespeare, so let’s remove it:
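The removal is a standard FsShell delete:

```shell
# Delete the glossary file from the HDFS shakespeare directory
hadoop fs -rm shakespeare/glossary
```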
Note that you could leave this file in place if you so wished. If you did, then it would be included in subsequent computations across the works of Shakespeare and would skew your results slightly. As with many real-world big data problems, you make trade-offs between the labor to purify your input data and the precision of your results.
Enter:
$ hadoop fs -cat shakespeare/histories | tail -n 50
This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command is handy
for viewing the output of MapReduce programs. Very often, an individual output file of a
MapReduce program is very large, making it inconvenient to view the entire file in the
terminal. For this reason, it’s often a good idea to pipe the output of the fs -cat command
into head, tail, more, or less.
To download a file to work with on the local filesystem, use the fs -get command. This
command takes two arguments: an HDFS path and a local path. It copies the HDFS
contents into the local filesystem:
There are several other operations available with the hadoop fs command to perform most
common filesystem manipulations: mv, cp, mkdir, etc.
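A hedged sketch of -get; the Shakespeare path is an assumption based on the earlier steps, and the block skips itself when no Hadoop client is installed:

```shell
# fs -get <hdfs-path> <local-path>: copies HDFS contents to the local filesystem.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -get shakespeare/poems ./poems.txt   # hypothetical paths
  GET_SKETCH=ran
else
  GET_SKETCH=skipped                             # no Hadoop client on this machine
fi
```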
$ hadoop fs
This displays a brief usage report of the commands available within FsShell. Try playing
around with a few of these commands if you like.
In order to work with HDFS you need to use the hadoop fs command. For example, to list
the / and /tmp directories you need to input the following commands:
hadoop fs -ls /
hadoop fs -ls /tmp
There are many commands you can run within the Hadoop filesystem. For example to
make the directory test you can issue the following command:
hadoop fs -mkdir test
You should be aware that you can pipe (using the | character) any HDFS command to be
used with the Linux shell. For example, you can easily use grep with HDFS by doing the
following:
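On a cluster the pipe would look like hadoop fs -ls / | grep test; since an HDFS listing is ordinary text on standard output, the filtering behaves exactly like a local listing, which this sketch demonstrates without a cluster:

```shell
# Pipe a listing into grep; only lines containing "test" survive.
mkdir -p demo/test_dir demo/oozie-root
ls demo | grep test                  # prints: test_dir
```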
As you can see, the grep command only returned the lines which had test in them (thus
removing the "Found x items" line and the oozie-root directory from the listing).
In order to move files between your regular Linux filesystem and HDFS you will likely use
the put and get commands. First, move a single file to the Hadoop filesystem.
Copy pg20417.txt from software folder to data folder
You should now see a new file called /user/hdfs/pg* listed. In order to view the contents of
this file we will use the -cat command as follows:
We can also use the Linux diff command to see if the file we put on HDFS is actually the
same as the original on the local filesystem. You can do this as follows:
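A hedged sketch: on the cluster this would be hadoop fs -cat pg20417.txt | diff - /data/pg20417.txt (paths assumed from the earlier copy step). The dash tells diff to compare its standard input against the local file, as this local stand-in shows:

```shell
# Two identical files stand in for the HDFS copy and the local original.
printf 'same contents\n' > local_original.txt
printf 'same contents\n' > hdfs_copy.txt
cat hdfs_copy.txt | diff - local_original.txt && echo "files match"
```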
Since the diff command produces no output we know that the files are the same (the diff
command prints all the lines in the files that differ).
In order to use HDFS commands recursively, you generally add an "r" to the HDFS
command (in the Linux shell this is generally done with the "-R" argument). For example,
to do a recursive listing we'll use the -lsr command rather than just -ls. Try this:
In order to find the size of files you need to use the -du or -dus commands. Keep in mind
that these commands return the file size in bytes. To find the size of the pg20417.txt file
use the following command:
To find the size of all files individually in the /user/root directory use the following
command:
To find the size of all files in total of the /user/root directory use the following command:
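A hedged sketch of the three size queries (the /user/root paths are assumptions); sizes are reported in bytes, as the local byte count at the end illustrates:

```shell
# Cluster-side sketch (requires an HDFS client):
#   hadoop fs -du /user/root/pg20417.txt   # size of one file, in bytes
#   hadoop fs -du /user/root               # size of each file individually
#   hadoop fs -dus /user/root              # total size of all files
# Byte counting, shown locally:
printf '0123456789' > sample.txt
wc -c < sample.txt                         # prints 10
```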
If you would like to get more information about a given command, invoke -help as
follows:
hadoop fs -help
For example, to get help on the dus command you'd do the following:
hadoop fs -help dus
You can observe the HDFS NameNode console as follows. Familiarize yourself with the
various options:
http://10.10.20.20:50070/dfshealth.html#tab-overview
Click on Datanodes
Click on Snapshot
Click on Logs
http://192.168.246.131:50070/logs/
You can verify the log by clicking on the datanode log file.
You can verify the Hadoop file system health. Check for minimally replicated blocks, if
any.
#hadoop fsck /
This is a dfsadmin command for reporting on each DataNode. It displays the status of the
Hadoop cluster. Are there any under-replicated blocks or corrupt replicas?
Go to log folder:
# cd /var/hadoop/logs
# ls
# vi hadoop.txt
hadoop fsck /
#hadoop version
The default port is 50070. To get a list of files in a directory you would use:
1. curl -i "http://tos.hp.com:50070/webhdfs/v1/user/root/output/?op=LISTSTATUS"
Set the NameNode Java heap size (memory) to 2.5 GB using the following option.
Use Services > [HDFS] > Configs to optimize service performance for the service.
1. In Ambari Web, click a service name in the service summary list on the left.
2. From the service Summary page, click the Configs tab, then use one of the
following tabs to manage configuration settings.
o Use the Configs tab to manage configuration versions and groups.
o Use the Settings tab to manage Smart Configs by adjusting the green, slider buttons.
o Use the Advanced tab to edit specific configuration properties and values.
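For the 2.5 GB heap called for above, a sketch of the value as it might be entered; the field label and unit follow Ambari's HDFS Settings tab and may differ slightly by version:

```
Ambari > HDFS > Configs > Settings
  NameNode Java heap size = 2560 (MB)
```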
3. Click Save.
4. Enter a description for this configuration version that includes your current changes.
5. Review and confirm each recommended change.
Restart all affected services.
dfs.nfs3.dump.dir=/tmp/.hdfs-nfs
3.1. Run the following commands to stop the nfs and rpcbind services.
(If they are not running, the following commands will fail, which is
no problem):
3.2. Now start the NFS services using the hadoop script. Open a new terminal for each
command on the slave, using the root user ID:
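A sketch of steps 3.1 and 3.2; the service names follow CentOS 7 and the HDP NFS gateway, and the commands are shown as comments because they must be run manually as root on the slave:

```shell
# 3.1 Stop the system NFS services (failures are fine if they are not running):
#   systemctl stop nfs
#   systemctl stop rpcbind
# 3.2 Start the gateway daemons, one per terminal:
#   hdfs portmap
#   hdfs nfs3
NFS_SKETCH=not_executed   # sketch only; run the commented commands as root
```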
3.3. Verify that the required services are up and running from the master console.
# rpcinfo -p tos.slave.com
3.4. Verify that the HDFS namespace is exported and can be mounted
by any client from the master console.
# showmount -e tos.slave.com
#mkdir -p /home/hdfs/hdfs-mount
4.2. Change its ownership to the hdfs user.
mkdir -p /home/hdfs/hdfs-mount
chown hdfs /home/hdfs/hdfs-mount
4.3. Execute the following command on a single line to mount HDFS to the
hdfs-mount directory:
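A sketch of the mount command; the slave hostname comes from this lab, the options are the ones the HDFS NFS gateway requires (NFSv3, TCP, no locking), and the line is commented because mounting needs root and a running gateway:

```shell
#   mount -t nfs -o vers=3,proto=tcp,nolock tos.slave.com:/ /home/hdfs/hdfs-mount
MOUNT_SKETCH=not_executed   # sketch only; run as root once the gateway is up
```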
# ls -l /home/hdfs/hdfs-mount
You can display the content of the file from the local OS.
Before You Begin: SSH into node1 i.e master aka tos.master.com
1.4. Use the find command to identify the location of the block on your
local file system. The command will look like the following, and you may
need to run it on master or slave:
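A hedged sketch; the block ID is hypothetical (substitute the one that hdfs fsck /user/hdfs/data -files -blocks -locations reports), and /hadoop/hdfs/data is the default HDP DataNode directory, which may differ on your hosts:

```shell
# Search the DataNode's local storage for the block's backing files.
find /hadoop/hdfs/data -name 'blk_1073741845*' 2>/dev/null || true
```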
As seen above, the block is replicated to hp and slave, since we have a default replication
factor of 3 but only 2 nodes in the cluster.
Step 2: Enable Snapshots
2.1. Now let’s enable the /user/hdfs/data directory for taking snapshots:
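A hedged sketch of enabling and taking the snapshot (the snapshot name snap1 is an assumption; run as the hdfs user). The block skips itself when no HDFS client is present:

```shell
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfsadmin -allowSnapshot /user/hdfs/data     # mark directory snapshottable
  hdfs dfs -createSnapshot /user/hdfs/data snap1   # take a snapshot named snap1
  SNAP_SKETCH=ran
else
  SNAP_SKETCH=skipped
fi
```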
3.2. Verify the snapshot was created by viewing the contents of the
data/.snapshot folder:
4.2. Use the ls command to verify the file is no longer in the data folder in HDFS.
Earlier the Block ID ended with 4; now it ends with 5. The Block ID changes since it is a
new block. You can view the block snapshot using the web UI too (NameNode UI).
In this lab we will configure the cache directive and store two files in the cache pool.
Then we will execute some commands to determine the stats of the pools and the cache
being used.
Ensure that the following folder and file exist before configuring the cache. We are
going to cache the files in this folder. If the following files are not present, refer to the
HDFS lab to store them.
#hadoop fs -ls /user/hdfs/weblog
Let us find out whether any cache directives and pools are configured in the cluster.
#hdfs cacheadmin -listPools
#hdfs cacheadmin -listDirectives
Let us add the file excite-small.log to the cache by executing the following command.
#hdfs cacheadmin -addDirective -path /user/hdfs/weblog/excite-small.log -pool cPTos
Let us get the stats of the pool after adding the file.
#hdfs cacheadmin -listPools -stats cPTos
Although we have configured caching, certain steps must be completed before it
becomes effective.
Let us remove the directive using the following commands. You can determine the
directive ids using the list directive command used before.
#hdfs cacheadmin -removeDirective 1
#hdfs cacheadmin -removeDirective 2
You need to configure some OS-related settings before the caching becomes effective.
Change the OS setting, i.e. the memory limit.
Set memlock limits on each DataNode. This will take effect after you log out and log in
again. Use the root login for these commands.
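A sketch of the limits entry; the value is in KB and is an assumption sized just above the 0.5 GB lock limit configured in the next step, and the hdfs user is assumed to run the DataNode:

```
# /etc/security/limits.conf on each DataNode ("-" sets both soft and hard limits)
hdfs - memlock 600000
```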
Change HDFS configurations to support Caching. Add the following to HDFS configs, Custom hdfs-site.xml in Ambari.
This example is for 0.5 GB (in bytes)
1. dfs.datanode.max.locked.memory=536870912
Click Add, then Save.
Let us verify the cache configured using the dfsadmin report; use the hdfs user ID for
running these commands.
# hdfs dfsadmin -report
This way you can determine the total capacity of cache configured and used.
---------------------- Labs End Here ---------------------------------------------------
1. Attach the archival storage. To set up two data volumes, mount the disks on
separate directories, as shown in the following example:
Navigate to HDFS > Settings > DataNode and add the archival storage volumes to the
data node directories, as shown in the following figure:
/hadoop/hdfs/data
[ARCHIVE]/hadoop/hdfs/archive
Save.
In our case we will archive the access_log file. If it is not there, complete the HDFS lab
to store the mentioned file in HDFS.
Apply the COLD storage policy to a directory. As the hdfs user, assign the COLD
storage policy to an HDFS directory by running the following commands from the
command line:
su - hdfs
hdfs dfs -mkdir /archives
hdfs dfs -chmod 777 /archives
hdfs storagepolicies -setStoragePolicy -path /archives -policy COLD
After setup is complete, verify the policies by running the following command:
hdfs storagepolicies -getStoragePolicy -path /archives
Output from this command will be similar to the following example:
Verify the setup. When you first check the blocks on this archive data node, it
returns 0 because there is no data, as shown in the following figure:
HDFS > Summary > Quick Links > NameNode UI > Datanodes, then click on the node; or go
directly to:
http://tos.hp.com:50075/datanode.html
Note: Because this example shows only one archive data node, the replication factor
for the file that was uploaded to archive storage must be set to 1; otherwise, there
will not be enough data nodes for other replicas.
If you check the blocks on this data node again, it now returns a nonzero value that
depends on the size of the uploaded file, as shown in the following figure:
This number does not change if you upload files to directories other than /archives.
You will observe that there is an increase in the number of blocks in the data folder
volume, whereas the archive volume remains the same; i.e. if the COLD policy is not
applied, the data is stored in the data folder.
You need to complete the Storage Policy lab before going ahead with this lab.
Since this is cold storage, let us apply an EC policy so that we can optimize space.
Before configuring EC policies, you can view the list of available erasure coding
policies:
hdfs ec -listPolicies
In the previous example, the list of erasure coding policies indicates that RS-6-3-1024k
is already enabled.
Check the block status of the directory before applying erasure coding on it.
#hdfs fsck /archives
The following example shows how you can set the erasure coding policy RS-6-3-1024k
on a particular directory:
hdfs ec -setPolicy -path /archives -policy RS-6-3-1024k
To confirm whether the specified directory has the erasure coding policy applied, run the
hdfs ec -getPolicy command:
Take a note of the warning. Setting a policy only affects newly created files, and does
not affect existing files.
Checking the block status on an erasure-coded directory
After enabling erasure coding on a directory, you can check the block status by running
the hdfs fsck command. The following example output shows the status of the
erasure-coded blocks, which is still not reflected:
hdfs fsck /archives
As shown below, the replication factor is shown as 1 for the erasure-coded folder.
http://tos.hp.com:50070/explorer.html#/archives
In this tutorial we will explore how we can configure YARN Capacity Scheduler using
Ambari.
Login to Ambari.
Let’s dive into YARN dashboard by selecting Yarn from the left-side bar or the drop down
menu.
We will start updating the configuration for Yarn Capacity Scheduling policies. Click
on Configs tab and click on Advanced.
Next, scroll down to the Scheduler section of the page. The default capacity scheduling
policy just has one queue which is default.
Let’s check out the scheduling policy visually. Scroll up to the top of the page, click
on SUMMARY, and then select ResourceManager UI from the Quick Links section.
Once in ResourceManager UI select Queues, you will see a visual representation of the
Scheduler Queue and resources allocated to it.
Let’s change the capacity scheduling policy to where we have separate queues and policies
for Engineering, Marketing and Support.
Replace the content for the configurations shown below in the Capacity
Scheduler textbox:
yarn.scheduler.capacity.maximum-am-resource-percent=0.2
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.root.Engineering.Development.acl_administer_jobs=*
yarn.scheduler.capacity.root.Engineering.Development.acl_administer_queue=*
yarn.scheduler.capacity.root.Engineering.Development.acl_submit_applications=*
yarn.scheduler.capacity.root.Engineering.Development.capacity=20
yarn.scheduler.capacity.root.Engineering.Development.maximum-capacity=100
yarn.scheduler.capacity.root.Engineering.Development.state=RUNNING
yarn.scheduler.capacity.root.Engineering.Development.user-limit-factor=1
yarn.scheduler.capacity.root.Engineering.QE.acl_administer_jobs=*
yarn.scheduler.capacity.root.Engineering.QE.acl_administer_queue=*
yarn.scheduler.capacity.root.Engineering.QE.acl_submit_applications=*
yarn.scheduler.capacity.root.Engineering.QE.capacity=80
yarn.scheduler.capacity.root.Engineering.QE.maximum-capacity=90
yarn.scheduler.capacity.root.Engineering.QE.state=RUNNING
yarn.scheduler.capacity.root.Engineering.QE.user-limit-factor=1
yarn.scheduler.capacity.root.Engineering.acl_administer_jobs=*
yarn.scheduler.capacity.root.Engineering.acl_administer_queue=*
yarn.scheduler.capacity.root.Engineering.acl_submit_applications=*
yarn.scheduler.capacity.root.Engineering.capacity=60
yarn.scheduler.capacity.root.Engineering.maximum-capacity=100
yarn.scheduler.capacity.root.Engineering.queues=Development,QE
yarn.scheduler.capacity.root.Engineering.state=RUNNING
yarn.scheduler.capacity.root.Engineering.user-limit-factor=1
yarn.scheduler.capacity.root.Marketing.Advertising.acl_administer_jobs=*
yarn.scheduler.capacity.root.Marketing.Advertising.acl_administer_queue=*
yarn.scheduler.capacity.root.Marketing.Advertising.acl_submit_applications=*
yarn.scheduler.capacity.root.Marketing.Advertising.capacity=30
yarn.scheduler.capacity.root.Marketing.Advertising.maximum-capacity=40
yarn.scheduler.capacity.root.Marketing.Advertising.state=STOPPED
yarn.scheduler.capacity.root.Marketing.Advertising.user-limit-factor=1
yarn.scheduler.capacity.root.Marketing.Sales.acl_administer_jobs=*
yarn.scheduler.capacity.root.Marketing.Sales.acl_administer_queue=*
yarn.scheduler.capacity.root.Marketing.Sales.acl_submit_applications=*
yarn.scheduler.capacity.root.Marketing.Sales.capacity=70
yarn.scheduler.capacity.root.Marketing.Sales.maximum-capacity=80
yarn.scheduler.capacity.root.Marketing.Sales.minimum-user-limit-percent=20
yarn.scheduler.capacity.root.Marketing.Sales.state=RUNNING
yarn.scheduler.capacity.root.Marketing.Sales.user-limit-factor=1
yarn.scheduler.capacity.root.Marketing.acl_administer_jobs=*
yarn.scheduler.capacity.root.Marketing.acl_submit_applications=*
yarn.scheduler.capacity.root.Marketing.capacity=10
yarn.scheduler.capacity.root.Marketing.maximum-capacity=40
yarn.scheduler.capacity.root.Marketing.queues=Sales,Advertising
yarn.scheduler.capacity.root.Marketing.state=RUNNING
yarn.scheduler.capacity.root.Marketing.user-limit-factor=1
yarn.scheduler.capacity.root.Support.Services.acl_administer_jobs=*
yarn.scheduler.capacity.root.Support.Services.acl_administer_queue=*
yarn.scheduler.capacity.root.Support.Services.acl_submit_applications=*
yarn.scheduler.capacity.root.Support.Services.capacity=80
yarn.scheduler.capacity.root.Support.Services.maximum-capacity=100
yarn.scheduler.capacity.root.Support.Services.minimum-user-limit-percent=20
yarn.scheduler.capacity.root.Support.Services.state=RUNNING
yarn.scheduler.capacity.root.Support.Services.user-limit-factor=1
yarn.scheduler.capacity.root.Support.Training.acl_administer_jobs=*
yarn.scheduler.capacity.root.Support.Training.acl_administer_queue=*
yarn.scheduler.capacity.root.Support.Training.acl_submit_applications=*
yarn.scheduler.capacity.root.Support.Training.capacity=20
yarn.scheduler.capacity.root.Support.Training.maximum-capacity=60
yarn.scheduler.capacity.root.Support.Training.state=RUNNING
yarn.scheduler.capacity.root.Support.Training.user-limit-factor=1
yarn.scheduler.capacity.root.Support.acl_administer_jobs=*
yarn.scheduler.capacity.root.Support.acl_administer_queue=*
yarn.scheduler.capacity.root.Support.acl_submit_applications=*
yarn.scheduler.capacity.root.Support.capacity=30
yarn.scheduler.capacity.root.Support.maximum-capacity=100
yarn.scheduler.capacity.root.Support.queues=Training,Services
yarn.scheduler.capacity.root.Support.state=RUNNING
yarn.scheduler.capacity.root.Support.user-limit-factor=1
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.queues=Support,Marketing,Engineering
yarn.scheduler.capacity.root.unfunded.capacity=50
At this point, the configuration is saved, but we still need to restart the components
affected by the configuration change, as indicated in the orange band below:
Also note that there is now a new version of the configuration as indicated by the
green Current label.
Let’s restart the daemons by clicking on the three dots ... next to Services under the
Ambari Stack. Select Restart All Required and wait for the restart to complete.
And then go back to the Resource Manager UI and refresh the page. Yup! There’s our new
policy:
Click on QE; you can see that one job is running in this queue.
Keep refreshing the above page to determine the % of resources used during the
execution of the jobs by the respective queues.
7.2. You should see resources being used in both the Support and Engineering queues.
RESULT: You just defined two queues for the Capacity Scheduler, configured
specific capacities for each queue, and submitted a job to each queue.
Return to Ambari Dashboard then select Config History and select YARN:
Finally, select Make Current and re-start services one last time.
17. NameNode HA
Successful Outcome: Your cluster will have a Standby NameNode along with Active
NameNode.
Before You Begin: Open the Ambari UI.
Prerequisite:
You have added two nodes in the hadoop cluster and the services are up as shown below:
You require three nodes for configuring NameNode HA, hence let us add another node,
i.e. slavea.
Server  Service
Master  Ambari server, Ambari agent
Slave   Ambari agent
Slavea  Ambari agent (new addition)
Refer to the earlier lab, Adding a Node to the Cluster. A summary is given below:
Ensure that Java is installed on the host. You can refer to the first lab for this.
cp /usr/share/java/mysql-connector-java.jar /var/lib/ambari-agent/tmp/
Update /etc/hosts on all three machines with the following lines. Ensure that you use
your IPs appropriately. Append the entries; don’t delete the existing ones.
10.10.20.24 tos.slave.com
10.10.20.25 tos.slavea.com
Log in to Ambari (tos.master.com), click on Hosts and choose Add New Hosts from
the Actions menu.
Select the following options and ensure that the other options are left at their defaults.
In the Assign Slaves and Clients step, define the node to be a DataNode with a
NodeManager installed as well.
As shown above, you need a three-node cluster before proceeding with this lab.
Install ZooKeeper on the two slave nodes: slave and slavea.
Click on Hosts > hostname > Add > ZooKeeper.
Enter HACluster as the Nameservice ID for your NameNode HA cluster. (The name must
consist of one word and not have any special characters.) Click the Next button:
2.3. Notice that a Secondary NameNode is not allowed if you have NameNode
HA configured. Ambari takes care of this for you, as you can see on the
Review step of the wizard:
3.1. SSH into node1.
3.2. Put the NameNode in Safe Mode:
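A hedged sketch of this step and the checkpoint the wizard waits for (run as the hdfs user on node1; the wizard itself shows the authoritative commands). The block skips itself when no HDFS client is present:

```shell
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfsadmin -safemode enter    # put the NameNode in Safe Mode
  hdfs dfsadmin -saveNamespace     # create a checkpoint of the namespace
  SAFEMODE_SKETCH=ran
else
  SAFEMODE_SKETCH=skipped
fi
```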
3.4. Once Ambari recognizes that your cluster is in Safe Mode and a
Checkpoint has been made, you will be able to click the Next button.
4.2. Once all the tasks are complete, click the Next button.
5.2. Once Ambari determines that the JournalNodes are initialized, you
will be able to click the Next button:
7.2. On slavea, run the command to initialize the metadata for the
new NameNode:
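A hedged sketch of the initialization (the Ambari wizard displays the authoritative command); -nonInteractive avoids a confirmation prompt, and the block skips itself when no HDFS client is present:

```shell
if command -v hdfs >/dev/null 2>&1; then
  hdfs namenode -bootstrapStandby -nonInteractive   # copy the active NameNode's metadata
  NN_SKETCH=ran
else
  NN_SKETCH=skipped
fi
```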
8.2. Click the Done button when all the tasks are complete.
10.2. Next to the Active NameNode component, select Stop from the Action
menu:
10.4. Now start the stopped NameNode again, and you will notice that it
becomes a Standby NameNode:
RESULT: You now have NameNode HA configured on your cluster, and you
have also verified that the HA works when one of the NameNodes stops.
On all nodes: DataNode, ZooKeeper, ZKFC and JournalNode, wherever installed.
On slavea: NameNode.
At this point we have only one Resource Manager configured in the cluster.
Tos.Tech | http://thinkopensource.in
4. On Select Host (Slavea), accept the default selection or choose an available host.
Click Next to approve the changes and start automatically configuring ResourceManager
HA.
6. On Configure Components, click Complete when all the progress bars finish
tracking.
As you can see, one will be active and the other will be in standby.
In my case rm1 is the primary RM (YARN > Configs > Custom yarn-site.xml).
Stop the RM service of the master node. The terminal then displays the text below:
A failing-over RM node message will be displayed in the console, as shown below:
You can verify from the dashboard also that rm2 is the primary Resource Manager now.
The job will now be orchestrated by the new primary RM, i.e. rm2, and we don’t need to
resubmit the job to the cluster.
You can also determine the status of rm2, i.e. slavea, using the yarn command:
yarn rmadmin -getServiceState rm2
This means that the current active node will remain the primary Resource Manager until
it fails over, even after the earlier primary node comes back up.
Case study: debug and resolve job-related issues in a Hadoop cluster. You cannot
execute any job whose virtual memory demand exceeds what is configured in the
configuration file.
If the virtual memory usage exceeds the allowed configured memory, the container will
be killed and the job will fail.
Let us enable the flag in the custom yarn-site.xml file so that the NodeManager can
monitor the virtual memory usage of the cluster, i.e. yarn.nodemanager.vmem-check-enabled
= true (Dashboard > YARN > Configs > Advanced).
After some time the job will fail with the following errors. Container physical memory
consumption at this juncture: the virtual memory usage is beyond the permissible limit.
1.9 GB of 824 MB virtual memory used (virtual memory usage exceeds the limit).
Killing container.
Observation:
Open the file and observe the allowed virtual-to-physical memory ratio. Here it is 4 for
each map container.
#vi /etc/hadoop/3.1.0.0-78/0/yarn-site.xml
yarn.nodemanager.vmem-pmem-ratio = 4
Since mapreduce.map.memory.mb is set to 206 MB, the total allowed virtual memory
is 4 * 206 = 824 MB.
#vi /etc/hadoop/3.1.0.0-78/0/mapred-site.xml
However, as shown in the log below, 1.9 GB of virtual memory is demanded, more than the
allowed 824 MB configured. Hence the job failed.
You can verify this from the log. The error is due to the overall consumption of virtual
memory exceeding the allocated allowed virtual memory. How do we resolve this? One way
is to increase the physical memory and raise the allowed virtual memory ratio. Another
way is to disable the validation of virtual memory, which is what we will do.
Concepts:
The NodeManager can monitor the memory usage (virtual and physical) of each container.
If a container's virtual memory exceeds “yarn.nodemanager.vmem-pmem-ratio” times
"mapreduce.reduce.memory.mb" or "mapreduce.map.memory.mb", then the container will
be killed if “yarn.nodemanager.vmem-check-enabled” is true.
Solution:
Set yarn.nodemanager.vmem-check-enabled to false and restart the affected cluster
services, i.e. the NodeManager and ResourceManager.
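As a sketch, the property as it would appear in yarn-site.xml; set it through Ambari's custom yarn-site section rather than editing the file directly:

```xml
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
  <description>Disable virtual-memory checking for containers.</description>
</property>
```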
Then resubmit the job. Do not update the XML files directly; all changes have to be made
through the Ambari UI.
Set up Queues
Capacity Scheduler queues can be set up in a hierarchy that reflects the database structure,
resource requirements, and access restrictions required by the various organizations,
groups, and users that utilize cluster resources.
The fundamental unit of scheduling in YARN is a queue. The capacity of each queue
specifies the percentage of cluster resources that are available for applications submitted to
the queue.
For example, suppose that a company has three organizations: Engineering, Support, and
Marketing. The Engineering organization has two sub-teams: Development and QA. The
Support organization has two sub-teams: Training and Services. And finally, the Marketing
organization is divided into Sales and Advertising. The following image shows the queue
hierarchy for this example:
Each child queue is tied to its parent queue with the
yarn.scheduler.capacity.<queue-path>.queues configuration property in the
capacity-scheduler.xml file. The top-level "support", "engineering", and "marketing"
queues would be tied to the "root" queue as follows:
Property: yarn.scheduler.capacity.root.queues
Value: support,engineering,marketing
Example:
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>support,engineering,marketing</value>
<description>The top-level queues below root.</description>
</property>
Similarly, the children of the "support" queue would be defined as follows:
Property: yarn.scheduler.capacity.support.queues
Value: training,services
Example:
<property>
<name>yarn.scheduler.capacity.support.queues</name>
<value>training,services</value>
<description>child queues under support</description>
</property>
The children of the "engineering" queue would be defined as follows:
Property: yarn.scheduler.capacity.engineering.queues
Value: development,qa
Example:
<property>
<name>yarn.scheduler.capacity.engineering.queues</name>
<value>development,qa</value>
<description>child queues under engineering</description>
</property>
And the children of the "marketing" queue would be defined as follows:
Property: yarn.scheduler.capacity.marketing.queues
Value: sales,advertising
Example:
<property>
<name>yarn.scheduler.capacity.marketing.queues</name>
<value>sales,advertising</value>
<description>child queues under marketing</description>
</property>
The following table shows how the queue resources are adjusted as users submit jobs to a
queue with a minimum-user-limit-percent value of 20%:
For example, the following properties would set the root acl_submit_applications value
to " " (a single space character) to block access to all users and groups, and also
restrict access to its child "support" queue to the users "sherlock" and "pacioli" and
the members of the "cfo-group" group:
<property>
<name>yarn.scheduler.capacity.root.acl_submit_applications</name>
<value> </value>
</property>
<property>
<name>yarn.scheduler.capacity.root.support.acl_submit_applications</name>
<value>sherlock,pacioli cfo-group</value>
</property>
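Once ACLs are configured (and queue ACLs are only enforced when yarn.acl.enable is true), a user can check which queue operations they are permitted. On a standard HDP client the mapred CLI prints the effective ACLs for the current user:

```shell
# Show the SUBMIT_APPLICATIONS / ADMINISTER_QUEUE operations allowed
# to the current user on each queue
mapred queue -showacls
```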
Step 1: Create a directory on HDFS for the node-label store, owned by the 'yarn' user.
1. sudo su hdfs
2. hadoop fs -mkdir -p /yarn/node-labels
3. hadoop fs -chown -R yarn:yarn /yarn
4. hadoop fs -chmod -R 700 /yarn
Step 2: Make sure that a user directory exists for the 'yarn' user on HDFS; if not, create it using the commands below.
Note – You can run the commands below from any HDFS client.
1. sudo su hdfs
2. hadoop fs -mkdir -p /user/yarn
3. hadoop fs -chown -R yarn:yarn /user/yarn
4. hadoop fs -chmod -R 700 /user/yarn
Step 3: Configure the properties below in yarn-site.xml via the Ambari UI. If you do not have the Ambari UI, add them manually to /etc/hadoop/conf/yarn-site.xml and restart the required services.
1. yarn.node-labels.enabled=true
2. yarn.node-labels.fs-store.root-dir=hdfs://<namenode-host>:<namenode-rpc-port>/<complete-path_to_node_label_directory>
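In yarn-site.xml form, the two properties above would look like the following sketch; the URI is illustrative (the host tos.master.com and RPC port 8020 are assumptions) and must point at the node-label directory created in Step 1:

```xml
<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Substitute your NameNode host, RPC port, and node-label path -->
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs://tos.master.com:8020/yarn/node-labels</value>
</property>
```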
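Step 4 (creating the cluster node labels) is implied by the step numbering and by the verification text that follows, but its command is not shown here. Assuming the labels 'x' and 'y' used in Step 6, a typical creation command (run as a YARN administrator) is:

```shell
# Register two node labels, x and y, with the ResourceManager
yarn rmadmin -addToClusterNodeLabels "x,y"
```

On releases that support non-exclusive labels, a label can also be declared as, for example, "x(exclusive=false)" so that idle labeled capacity can be shared.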
You can verify that the node labels have been created by looking at the ResourceManager UI under the 'Node Labels' option in the left pane, or by listing the labels from any YARN client.
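The listing command itself is not reproduced here; assuming a recent Hadoop release, the labels can be listed from any YARN client as follows:

```shell
# Print all node labels registered with the ResourceManager
yarn cluster --list-node-labels
```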
Step 5: Allocate node labels to the node managers using the command below:
Note – Don't worry about the port if you have only one NodeManager running per host.
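The allocation command does not survive in this copy; a sketch, assuming two hypothetical NodeManager hosts node1.example.com and node2.example.com, is:

```shell
# Assign label x to node1 and label y to node2; append ":port" to a host
# only when several NodeManagers run on it
yarn rmadmin -replaceLabelsOnNode "node1.example.com=x node2.example.com=y"
```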
Step 6: Map node labels to the queues.
I have created two queues, 'a' and 'b', such that queue 'a' can access nodes with labels 'x' and 'y', whereas queue 'b' can only access nodes with label 'y'. By default, all queues can access nodes with the 'default' label.
Below is my capacity scheduler configuration:
1. yarn.scheduler.capacity.maximum-am-resource-percent=0.2
2. yarn.scheduler.capacity.maximum-applications=10000
3. yarn.scheduler.capacity.node-locality-delay=40
4. yarn.scheduler.capacity.queue-mappings-override.enable=false
5. yarn.scheduler.capacity.root.a.a1.accessible-node-labels=x,y
6. yarn.scheduler.capacity.root.a.a1.accessible-node-labels.x.capacity=30
7. yarn.scheduler.capacity.root.a.a1.accessible-node-labels.x.maximum-capacity=100
8. yarn.scheduler.capacity.root.a.a1.accessible-node-labels.y.capacity=50
9. yarn.scheduler.capacity.root.a.a1.accessible-node-labels.y.maximum-capacity=100
10. yarn.scheduler.capacity.root.a.a1.acl_administer_queue=*
11. yarn.scheduler.capacity.root.a.a1.acl_submit_applications=*
12. yarn.scheduler.capacity.root.a.a1.capacity=40
13. yarn.scheduler.capacity.root.a.a1.maximum-capacity=100
14. yarn.scheduler.capacity.root.a.a1.minimum-user-limit-percent=100
15. yarn.scheduler.capacity.root.a.a1.ordering-policy=fifo
16. yarn.scheduler.capacity.root.a.a1.state=RUNNING
17. yarn.scheduler.capacity.root.a.a1.user-limit-factor=1
18. yarn.scheduler.capacity.root.a.a2.accessible-node-labels=x,y
19. yarn.scheduler.capacity.root.a.a2.accessible-node-labels.x.capacity=70
20. yarn.scheduler.capacity.root.a.a2.accessible-node-labels.x.maximum-capacity=100
21. yarn.scheduler.capacity.root.a.a2.accessible-node-labels.y.capacity=50
22. yarn.scheduler.capacity.root.a.a2.accessible-node-labels.y.maximum-capacity=100
23. yarn.scheduler.capacity.root.a.a2.acl_administer_queue=*
24. yarn.scheduler.capacity.root.a.a2.acl_submit_applications=*
25. yarn.scheduler.capacity.root.a.a2.capacity=60
26. yarn.scheduler.capacity.root.a.a2.maximum-capacity=60
27. yarn.scheduler.capacity.root.a.a2.minimum-user-limit-percent=100
28. yarn.scheduler.capacity.root.a.a2.ordering-policy=fifo
29. yarn.scheduler.capacity.root.a.a2.state=RUNNING
30. yarn.scheduler.capacity.root.a.a2.user-limit-factor=1
31. yarn.scheduler.capacity.root.a.accessible-node-labels=x,y
32. yarn.scheduler.capacity.root.a.accessible-node-labels.x.capacity=100
33. yarn.scheduler.capacity.root.a.accessible-node-labels.x.maximum-capacity=100
34. yarn.scheduler.capacity.root.a.accessible-node-labels.y.capacity=50
35. yarn.scheduler.capacity.root.a.accessible-node-labels.y.maximum-capacity=100
36. yarn.scheduler.capacity.root.a.acl_administer_queue=*
37. yarn.scheduler.capacity.root.a.acl_submit_applications=*
38. yarn.scheduler.capacity.root.a.capacity=40
39. yarn.scheduler.capacity.root.a.maximum-capacity=40
40. yarn.scheduler.capacity.root.a.minimum-user-limit-percent=100
41. yarn.scheduler.capacity.root.a.ordering-policy=fifo
42. yarn.scheduler.capacity.root.a.queues=a1,a2
43. yarn.scheduler.capacity.root.a.state=RUNNING
44. yarn.scheduler.capacity.root.a.user-limit-factor=1
45. yarn.scheduler.capacity.root.accessible-node-labels=x,y
46. yarn.scheduler.capacity.root.accessible-node-labels.x.capacity=100
47. yarn.scheduler.capacity.root.accessible-node-labels.x.maximum-capacity=100
48. yarn.scheduler.capacity.root.accessible-node-labels.y.capacity=100
49. yarn.scheduler.capacity.root.accessible-node-labels.y.maximum-capacity=100
50. yarn.scheduler.capacity.root.acl_administer_queue=*
51. yarn.scheduler.capacity.root.b.accessible-node-labels=y
52. yarn.scheduler.capacity.root.b.accessible-node-labels.y.capacity=50
53. yarn.scheduler.capacity.root.b.accessible-node-labels.y.maximum-capacity=100
54. yarn.scheduler.capacity.root.b.acl_administer_queue=*
55. yarn.scheduler.capacity.root.b.acl_submit_applications=*
56. yarn.scheduler.capacity.root.b.b1.accessible-node-labels=y
57. yarn.scheduler.capacity.root.b.b1.accessible-node-labels.y.capacity=100
58. yarn.scheduler.capacity.root.b.b1.accessible-node-labels.y.maximum-capacity=100
59. yarn.scheduler.capacity.root.b.b1.acl_administer_queue=*
60. yarn.scheduler.capacity.root.b.b1.acl_submit_applications=*
61. yarn.scheduler.capacity.root.b.b1.capacity=100
62. yarn.scheduler.capacity.root.b.b1.maximum-capacity=100
63. yarn.scheduler.capacity.root.b.b1.minimum-user-limit-percent=100
64. yarn.scheduler.capacity.root.b.b1.ordering-policy=fifo
65. yarn.scheduler.capacity.root.b.b1.state=RUNNING
66. yarn.scheduler.capacity.root.b.b1.user-limit-factor=1
67. yarn.scheduler.capacity.root.b.capacity=60
68. yarn.scheduler.capacity.root.b.maximum-capacity=100
69. yarn.scheduler.capacity.root.b.minimum-user-limit-percent=100
70. yarn.scheduler.capacity.root.b.ordering-policy=fifo
71. yarn.scheduler.capacity.root.b.queues=b1
72. yarn.scheduler.capacity.root.b.state=RUNNING
73. yarn.scheduler.capacity.root.b.user-limit-factor=1
74. yarn.scheduler.capacity.root.capacity=100
75. yarn.scheduler.capacity.root.queues=a,b
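To exercise this configuration, a job can be submitted to one of the leaf queues; with MapReduce, the queue name and an optional node-label expression are passed as job properties. The jar path below is the usual HDP location and may differ on your cluster:

```shell
# Submit the pi example to leaf queue a1, requesting containers on nodes labeled x
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi \
  -Dmapreduce.job.queuename=a1 \
  -Dmapreduce.job.node-label-expression=x \
  10 100
```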
21. Miscellaneous
Enable User Home Directory Creation.
1. Edit the ambari.properties file using a command-line editor (vi, in this example).
vi /etc/ambari-server/conf/ambari.properties
2. Add the following property:
ambari.post.user.creation.hook.enabled=true
3. Add the script path to the ambari properties file:
ambari.post.user.creation.hook=/var/lib/ambari-server/resources/scripts/post-user-creation-hook.sh
4. Restart Ambari server.
ambari-server restart
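With the hook enabled, Ambari runs the script each time a user is created, for example through the Ambari REST API. The host, credentials, and user name below are placeholders:

```shell
# Creating a user via the Ambari REST API triggers the post-user-creation
# hook, which creates /user/alice on HDFS with the correct ownership
curl -u admin:admin -H 'X-Requested-By: ambari' -X POST \
  -d '{"Users/user_name":"alice","Users/password":"changeme","Users/active":true}' \
  http://tos.master.com:8080/api/v1/users
```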
Default scheduler:
yarn.scheduler.capacity.default.minimum-user-limit-percent=100
yarn.scheduler.capacity.maximum-am-resource-percent=0.2
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.acl_submit_applications=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.default.acl_administer_jobs=*
yarn.scheduler.capacity.root.default.acl_submit_applications=*
yarn.scheduler.capacity.root.default.capacity=100
yarn.scheduler.capacity.root.default.maximum-capacity=100
yarn.scheduler.capacity.root.default.state=RUNNING
yarn.scheduler.capacity.root.default.user-limit-factor=1
yarn.scheduler.capacity.root.queues=default
yarn.scheduler.capacity.schedule-asynchronously.enable=true
yarn.scheduler.capacity.schedule-asynchronously.maximum-threads=1
yarn.scheduler.capacity.schedule-asynchronously.scheduling-interval-ms=10
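A quick sanity check on any capacity-scheduler configuration is that the capacities of a parent's direct children sum to 100. A minimal sketch over the default-scheduler properties above, written to a scratch file for illustration:

```shell
# Write the relevant default-scheduler properties to a scratch file
cat > /tmp/capacity-scheduler.props <<'EOF'
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.default.capacity=100
yarn.scheduler.capacity.root.queues=default
EOF

# Sum the capacity of each direct child of root (a single path segment
# followed immediately by ".capacity=")
total=$(awk -F= '/^yarn\.scheduler\.capacity\.root\.[a-z0-9]+\.capacity=/ {s+=$2} END {print s}' \
        /tmp/capacity-scheduler.props)
echo "root children capacity total: ${total}"   # expect 100
```

The same check applied to the Step 6 configuration would verify, for example, that a1 (40) and a2 (60) sum to 100 under queue 'a'.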