Final Manual Practical 1 DE
EXPERIMENT NO: 1
THEORY:
2. Extract the NiFi files from the .tar.gz file using the following command:
tar xvzf nifi.tar.gz
3. You will now have a folder named nifi-1.12.1. You can run NiFi by executing
the following from inside the folder:
bin/nifi.sh start
4. If you already have Java installed and configured, when you run the status tool as
shown in the following snippet, you will see a path set for JAVA_HOME:
sudo bin/nifi.sh status
5. If you do not see JAVA_HOME set, you may need to install Java using the following
command:
sudo apt install openjdk-11-jre-headless
6. Then, you should edit .bash_profile to include the following line so that NiFi
can find the JAVA_HOME variable:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
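As a concrete sketch, the export line can be appended to .bash_profile and reloaded in one step. The JVM path below assumes the openjdk-11 apt package from the previous step; verify the exact directory name on your machine with ls /usr/lib/jvm before relying on it:

```shell
# Append the JAVA_HOME export to .bash_profile and reload it in the current shell.
# The path assumes the openjdk-11 apt package; adjust it if `ls /usr/lib/jvm` differs.
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> "$HOME/.bash_profile"
. "$HOME/.bash_profile"
# Confirm the variable is now set:
echo "JAVA_HOME is set to: $JAVA_HOME"
```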
8. When you check the status of NiFi again, you should now see a path for JAVA_HOME:
Figure 2.1 – NiFi is running
9. When NiFi is ready, which may take a minute, open your web browser and go to
http://localhost:8080/nifi/. You should see the following screen:
11. The first tool on the NiFi component toolbar is the Processor tool. The other
tools, from left to right, are as follows:
• Input Port
• Output Port
• Processor Group
• Remote Processor Group
• Funnel
• Template
• Label
1. You must have a value set for any parameters that are bold. Each parameter has
a question mark icon to help you.
2. You can also right-click on the processor and select the option you want to use.
3. For GenerateFlowFile, all the required parameters are already filled out.
4. In the preceding screenshot, we have added a value to the Custom Text parameter.
To add custom properties, you can click the plus sign at the upper right of
the window. You will be prompted for a name and a value. We added a property
named filename and set the value to This is a file from nifi.
5. Once configured, the yellow warning icon in the box will turn into a square
(stop button).
To create a connection, hover over the processor box and a circle and arrow will appear:
Installing Apache Airflow can be accomplished using pip. But, before installing Apache
Airflow, you can change the location of the Airflow install by exporting AIRFLOW_HOME.
If you want Airflow to install to /opt/airflow, export the AIRFLOW_HOME variable,
as shown:
export AIRFLOW_HOME=/opt/airflow
If you want Airflow to work with PostgreSQL, then you should install the sub-package by running the following:
pip install 'apache-airflow[postgres]'
To install Apache Airflow with the options for PostgreSQL, Slack, and Celery, use
the following command:
pip install 'apache-airflow[postgres,slack,celery]'
The default database for Airflow is SQLite. This is acceptable for testing and running on
a single machine, but to run in production and in clusters, you will need to change the
database to something else, such as PostgreSQL.
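For example, switching the metadata database to PostgreSQL is a one-line change to the sql_alchemy_conn setting in $AIRFLOW_HOME/airflow.cfg. The user, password, and database name below are placeholders for illustration, not values from this manual; after editing, re-initialize the database (airflow initdb on Airflow 1.x, airflow db init on 2.x):

```ini
; In $AIRFLOW_HOME/airflow.cfg -- placeholder credentials, adjust to your setup
[core]
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db
```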
airflow: command not found: If the airflow command cannot be found, you may need to
add it to your path:
export PATH=$PATH:/home/<username>/.local/bin
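A quick sketch of fixing this for the current shell session (on Ubuntu, pip installs user-level console scripts to ~/.local/bin); add the same export line to .bash_profile to make it permanent:

```shell
# Add pip's user-level script directory to PATH for this shell session.
export PATH="$PATH:$HOME/.local/bin"
# Confirm the directory is now on PATH (prints the matching entry):
echo "$PATH" | tr ':' '\n' | grep '\.local/bin'
```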
The Airflow web server runs on port 8080, the same port as Apache NiFi. You already
changed the NiFi port to 9300 in the nifi.properties file, so you can start the
Airflow web server using the following command:
airflow webserver
If you did not change the NiFi port, or have any other processes running on port 8080,
you can specify the port for Airflow using the -p flag, as shown:
airflow webserver -p 8081
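If you are unsure whether port 8080 is free, a small sketch like the following chooses a port before launching. ss is part of iproute2 on Ubuntu, and the fallback port 8081 is just an example:

```shell
# Pick 8080 if it is free, otherwise fall back to 8081.
PORT=8080
if ss -ltn 2>/dev/null | grep -q ":$PORT "; then
  PORT=8081
fi
echo "Starting the Airflow web server on port $PORT"
# airflow webserver -p "$PORT"
```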
Next, start the Airflow scheduler so that you can run your data flows at set intervals. Run
this command in a different terminal so that you do not kill the web server:
airflow scheduler
Airflow will run without the scheduler, but you will receive a warning when you launch
the web server if the scheduler is not running.
Airflow installs several example data flows (Directed Acyclic Graphs, or DAGs) during
installation. You should see them on the main screen.
Installing pgAdmin 4
pgAdmin 4 will make managing PostgreSQL much easier if you are new to relational
databases. The web-based GUI will allow you to view your data and allow you to visually
create tables. To install pgAdmin 4, take the following steps:
1. You need to add the pgAdmin repository to Ubuntu. The following commands add the
repository key and source, then install pgAdmin 4:
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" >> /etc/apt/sources.list.d/pgdg.list'
sudo apt update
sudo apt install pgadmin4 pgadmin4-apache2 -y
2. You will be prompted to enter an email address for a username and then for
a password. You should see the following screen:
Conclusion: Here we studied how to install and configure many of the tools used by data
engineers. You now have two working databases – Elasticsearch and PostgreSQL – as well as two
tools for building data pipelines – Apache NiFi and Apache Airflow – which together comprise
a data engineering infrastructure.