
COURSE BOOK

HYDROINFORMATICS: DATA
MANAGEMENT AND ANALYSIS

Waqas Ahmed, Assistant Professor


Center for Advanced Studies in Water, Mehran UET, Jamshoro, Pakistan
CHAPTER 1: INTRODUCTION TO THE COURSE BOOK
Why read this book?
Why are you reading this course book? To answer that, let me first ask you a question. Take a minute and look at Figure 1: what do the data in Figure 1 represent? Isn't it difficult to answer?

Figure 1: Unprocessed data


Now look at the same data in Figure 2 and answer the same question: what do the data in Figure 2 represent? Now it is a little easier to answer.

Figure 2: Borehole data processed in spread sheet

One last question, which will help you understand why you should read and learn from this book: looking at Figure 2, can you tell which point was taken in the residential area? I am sure it will be difficult to answer. Now look at Figure 3 and answer the same question; this time, I hope you will be able to.

Figure 3: Borehole observation points plotted on a map


From the example above you can understand how large data sets can be messy and require processing for effective interpretation. This book is written with that in mind, so that you can organize data sets and perform analysis on them.

What is in the course book?
The global water cycle is a complex system that involves physical, biogeochemical, ecological and human interactions. Understanding these systems requires processing a huge volume of data. As
the volume of data increases, there is an opportunity to better understand and design systems using
advanced data analysis techniques.

This course book will introduce you to approaches and tools used for data handling, analysis and visualization. It will help you to: use and modify metadata, consider the data life cycle, implement data models for application in hydrology, understand data formats, analyze data using statistical techniques, and write computer code to retrieve, analyze, and visualize hydrologic data.

This book includes theoretical background and hands-on practice in data handling for hydrology and environmental problems. Students will complete a project-based exercise that applies techniques and tools from the book. The book is divided into three sections. The learning objectives of each section are given below:

Data and Data Lifecycle


1. Explain the importance of data handling and visualization
2. Describe the data lifecycle
3. Create a data management plan
4. Describe hydrologic metadata
Data Models and Databases
5. Describe hydrologic data models
6. Design and use databases to organize and store data using standard data models
7. Manipulate data using Structured Query Language (SQL)
8. Deploy a database on a web-based server
Data Analysis, Automation and Visualization
9. Perform basic programming operations
10. Import, plot and visualize data
11. Extract statistical parameters for different datasets
12. Prepare data for modelling of a groundwater system
13. Build, simulate and visualize the modelling results
14. Develop an Arduino-based data logger

What is Hydroinformatics?

Hydroinformatics encompasses the development of theories, methodologies, algorithms and tools; methods for the testing of concepts, analysis and verification; and knowledge representation and their communication, as they relate to the effective use of data for the characterization of the
and their communication, as they relate to the effective use of data for the characterization of the
water cycle through the various systems. This entails issues related to data acquisition, archiving,
representation, search, communication, visualization, and their use in modeling and knowledge
discovery. In this context, hydroinformatics is an applied science. It sits at the crossroads of hydrology, computer science and engineering, information science, geographic information systems, and high-end computing. In brief, hydroinformatics is about information management (see Figure 4).

Figure 4: Hydroinformatics is about information management

The very term hydroinformatics has many definitions and covers many ideas, depending on who attempts to define it. Nevertheless, there is a consensus that hydroinformatics, considered as a
technological activity, cannot be defined as a ‘hard’ science. To the contrary, it is regarded as a
bridge that aims to close at least several gaps between ‘hard’, water-oriented sciences,
technologies and activities on the one hand, and such ‘soft’ social requirements as sustainable
development, legal regulations and political aims on the other.

Why is Hydroinformatics essential?

To make the case for why hydroinformatics is essential, I want you to think of water-related issues in Pakistan; I will then conclude with how technology can help us solve these issues.

List out water-related issues


(Instructor sheet)
 Mismanagement
 Scarcity
 Dams
 Corruption
 Inequitable distribution
 Saline GW
 Poor quality drinking water
 Politics

Consider the issue of dams, say the Kalabagh Dam. The question is whether it should be constructed or not. Under present conditions, there is no way to build this dam, even if the purpose is to ensure water supply to cities and agriculture, to generate power, or to protect the downstream areas against floods. Political conflicts, provincial and national, over water rights have not only dragged the decision process out for decades but have led to a political deadlock.

Thus, when water problems are concerned, we are living in a conflicting world where the
stakeholders, players and decision makers are not only, as was the case 50 years ago, engineers,
but also others, including whole ranges of citizens and their associative and political emanations
as well as the information carriers (media). All are in one way or another water-oriented. It is of
capital importance to every stakeholder, to every actor in this play, to understand that eventually
only negotiated solutions are acceptable, even if they cannot be completely satisfactory for all
involved. We are in the situation where the rules of allocation of limited resources under
constraints are sought. Thus, if an armed solution is excluded, for every specific situation the
stakeholders, willy-nilly, will have to search for a compromise solution. And, once the choice of a
solution is made from several possible solutions, its implementation has to be done as an
engineering solution because that is the traditional purpose of engineers: to build, to implement,
to organize workable structures and solutions which are ad minima satisfactory to meet social
requirements within the available budget.
A rational search, if not for consensus then at least for compromise, is difficult to imagine without
two conditions:
1. Elaboration and presentation to all interested parties of the information, so as to share information in an equitable way, i.e. so that it is identical in content and intelligible to all parties – a task impossible without objective tools.

2. Use of objective tools (methodologies, processes) allowing for the confrontation of
consequences of various potentially possible scenarios and solutions as well as for an
iterative approach of the final choice.
At this stage it is necessary to explicate the terminology. Let us stress the difference between data
and information. The data may be collected in the field (e.g. measurements of physical quantities
such as water level, degree of pollution, etc.); satellite imagery, optical or radar, are also data,
although of a different kind; state variables defining ecosystem or biodiversity, their evolution,
location of houses, regulations and laws – all are data that can be measured or collected in the field
or which may result from projections, extrapolations and modeling. The information is elaborated
from the data under a form that should be intelligible to the stakeholders. If the information is to
provide an equitable basis for seeking a consensus, it must be supplied under the same form for all
stakeholders and must be considered by all of them as objective. That means that the tools and the
ways in which the data are fused and transformed into information must not be partial, subjective, or suspected by some stakeholders: there is a need for objective ‘data→information’
transformation tools.
An objective tool (modelling software, risk management system, knowledge base and its content, information system, communication network) is a tool that is considered by all concerned stakeholders as valid (Cunge 1998). That means that all concerned regard the outputs of the tool as valid consequences of the inputs, whatever the inputs, the latter representing hypotheses
proposed by stakeholders. This ‘tool’ can be ‘immaterial’, e.g. a methodology, or a model, or a
data and information processing system. The essential point is that its credibility be accepted by
all concerned. Thus if an output (e.g. the degree of groundwater pollution) is unacceptable to a
stakeholder, it means that for him or her the corresponding input hypothesis (e.g. quantity of
pesticide per area) is unacceptable. Such a tool allows negotiations based on merit, not on passion
– it is an essential link in a loop of consensus as shown in Figure 4, supposing that consensus is
possible.
Obviously, the only way to arrive at such a situation is to involve all those concerned in the
development of requirements for the tool in the development of the tool itself and in its validation.
And, as mentioned above, the presentation of the information in a form that is equitable, intelligible
and identical for all concerned is an essential point in this ‘objective’ approach. The most obvious
and practically proven way to develop such ‘objective’ tools is to finance and control such developments through a common investment effort by the interested stakeholders.
Why is it that hydroinformatics, and not, for example, Information & Communications Technologies (ICTs) alone, can and should play an essential role in building and applying such tools? It is because in a conflicting world it is of the utmost importance to have an end-
user oriented approach to be able to construct interfaces between the technology (which is
unintelligible to most of the end-users) and the end-users (stakeholders) themselves. The word
interface means here the capacity to understand the end-user problems, to be credited by the
stakeholders with good technical understanding of the problems and with the capacity to assess
the technical feasibility (not a preference!) of the proposed solutions. And hydroinformatics is
precisely such an activity: simply to apply ICT tools to existing hydraulics and hydrological
methods would not satisfy at all the basic requirements that come into play here. The decision
makers are today politicians, elected bodies, media, public opinion, associations, etc. But whatever
they decide and whenever we talk about the water-related realm, the engineers are supposed to
carry out their decisions. Hydroinformatics should seize the chance and the leading role because

of its particular position, because of its experience of ICT tools applied to numerical modelling,
thanks to its links with hydraulics and hydrology and to engineering and because of its expertise
in water problems. If hydroinformatics-oriented institutes and companies do not show the way,
somebody else will appear to do something that outwardly resembles hydroinformatics simply
because there is such an obvious historical, material and social set of requirements for such an
activity. It is difficult, maybe even impossible, to sketch the more distant future of hydroinformatics. It is interesting, however, to describe the tendencies and certain obvious directions that are being taken by some of the components of what we today call hydroinformatics
activities.

Hydroinformatics is, in essence, traditional science plus information technology (IT). Having read this far, you may be thinking: “I am a water resources manager or an environmental engineer, not a computer scientist, so why should I know all this messy computer stuff?” Wait a minute; look at the flow chart you developed, and you will notice one component common to all project flow charts: data, databases, and data analysis. This is a really essential part of planning and management. According to Cunge, J. & Erlich, M. (1999), nowadays three categories of consultancy exist:

 Traditional consulting companies employing “standard” market-available tools without particular ambition to acquire expertise in the field of hydroinformatics
 Consulting companies having a specific expertise in the field of applications of hydroinformatics tools, employing this expertise for their own projects and studies as well as for advising the water industry, contractors and the former stream of companies

 Consulting companies and institutes who are also developers of hydroinformatics and
modeling software, the latter being put on the market and made commercially available to
the two previous streams.
Customers such as civil contractors, government organizations and water-related industries are potential users of hydroinformatics. In today’s world, only those consultancies that have expertise in hydroinformatics (especially category II) will attract more business.

CHAPTER 2: DATA LIFE CYCLE AND DATA MANAGEMENT
The Data Life
A dataset has a longer lifespan than the research project that creates it. Data can be used and re-used for future research, if:
• shared
• managed well
• properly preserved
• made available

Data Life Cycle


Because data are valued assets, we need to manage data over their entire lifecycle, beyond the immediate need. The goal of managing data over their lifecycle is to eliminate waste, operate efficiently, and practice good data stewardship. The first step is to plan data management, which is done through a data management plan. The next step is to acquire data through experiments or other means. You then need to process the raw data and analyze them. Once analyzed, the data need to be preserved and shared with concerned stakeholders.

Figure 5: Data life cycle diagram

Data Management Plan


A data management plan or DMP is a formal document that outlines how you will handle your data both during your research and after the project is complete. A data management plan is prepared to:
• identify and secure resources
• maintain, secure, and utilize data
• procure funding and budget
• identify technical and staff resources
• set up a system to store and manipulate the data

Data Management Checklist

• points relevant to consider when planning appropriate data management for research
• select what is relevant for your research
You can access the checklist at www.data-archive.ac.uk/create-manage/planning-for-sharing/data-management-checklist

Data Management Planning Tools
• dmponline.dcc.ac.uk
• dmp.cdlib.org

Acquire
Acquisition involves collecting or adding to the data holdings. There are different methods of acquiring data:
• collecting new data (simulations, experiments, sensor data, literature, etc.)
• converting/transforming legacy data
• sharing/exchanging data
• purchasing data

Process
Processing denotes actions or steps performed on data to verify, organize, transform, integrate, and
extract data in an appropriate output form for subsequent use. This includes organization of data files and content, data synthesis or integration, format transformations, and may include
calibration activities (of sensors and other field and laboratory instrumentation). Both raw and
processed data require complete metadata to ensure that results can be duplicated. Methods of
processing must be rigorously documented to ensure the utility and integrity of the data.

Analyze
Analysis involves actions and methods performed on data that help describe facts, detect patterns,
develop explanations, and test hypotheses. This includes data quality assurance, statistical data
analysis, modeling, and interpretation of analysis results.
In this course you will learn how to analyze data using MS Excel and the MATLAB toolkit.

Preserve
Preservation involves actions and procedures to keep data for some period of time and/or to set data aside for future use; it includes data archiving and/or data submission to a data repository. There should be frequent back-ups on remote servers, and nowadays there is the opportunity to use cloud services to preserve the data, such as DataONE, HydroShare, Dropbox, Google Drive, GitHub,

Publish/Share
The ability to prepare and issue, or disseminate, quality data to the public and to other agencies is
an important part of the lifecycle process. The data should be medium- and agent-independent,
with an understanding that transfer may occur via automated or non-automated mechanisms. We
need to ensure that data are shared, but with controls to protect proprietary and pre-decisional data
and the integrity of the data itself. Data sharing also requires complete metadata to be useful to
those who are receiving the data.

Describe (Metadata, Documentation)


Throughout the data lifecycle process, documentation must be updated to reflect actions taken
upon the data. This includes acquisition, processing, and analysis, but may touch upon any stage
of the lifecycle. Updated and complete metadata are critical to maintaining data quality. The key
distinction between metadata and documentation is that metadata, in the standard sense of "data
about data," formally describes various key attributes of each data element or collection of
elements, while documentation makes reference to data in the context of their use in specific
systems, applications, settings. Documentation also includes ancillary materials (e.g., field notes) from which metadata can be derived. In the former sense, it's "all about the data;" in the latter, it's
"all about the use."

Manage Quality
Protocols and methods must be employed to ensure that data are properly collected, handled,
processed, used, and maintained at all stages of the scientific data lifecycle. This is commonly
referred to as "QA/QC" (Quality Assurance/Quality Control). QA focuses on building-in quality
to prevent defects while QC focuses on testing for quality (e.g., detecting defects). QA makes sure
you are doing the right things, the right way. QC makes sure the results of what you've done are
what you expected.

Back Up & Secure


Steps must be taken to protect data from accidental data loss, corruption, and unauthorized access.
This includes routinely making additional copies of data files or databases that can be used to restore the original data or to recover earlier instances of the data.

Example of DMP

Creator(s): DMP dmpcurator


Affiliation: University of California, Office of the President
Last modified: May 30, 2014
Copyright information: The above plan creator(s) have agreed that others may use as much of
the text of this plan as they would like in their own plans, and customize it as necessary. You do
not need to credit the creators as the source of the language used, but using any of their plan's text
does not imply that the creator(s) endorse, or have any relationship to, your project or proposal.
Products of research
Air samples at Mauna Loa Observatory will be collected continuously from air intakes located at
five towers – a central tower and four towers located at compass quadrants.
Raw data files will contain continuously measured CO2 concentrations, calibration standards, reference standards, daily check standards, and blanks. The sample lines located at compass
quadrants were used to examine the influence of source effects associated with wind directions
[3,4]. In addition to the CO2 data, we will record weather data (wind speed and direction,
temperature, humidity, precipitation, and cloud cover). Site conditions at Mauna Loa Observatory
will also be noted and retained. The final data product will consist of 5-minute, 15-minute, hourly,
daily, and monthly average atmospheric concentration of CO2, in mole fraction in water-vapor-
free air measured at the Mauna Loa Observatory, Hawaii. Data are reported as a dry mole fraction
defined as the number of molecules of CO2 divided by the number of molecules of dry air
multiplied by one million (ppm).
The final data product has been thoroughly documented in the open literature [2] and in Scripps
Institution of Oceanography Internal Reports [1].
Data format
The data generated (raw CO2 measurements, meteorological data, calibration and reference standards) will be placed in comma-separated-values files in plain ASCII format, which are readable over long time periods. The final data file will contain dates for each observation (time, day, month
and year) and the average CO2 concentration. The final data product distributed to most users will
occupy less than 500 KB; raw and ancillary data, which will be distributed on request, will occupy
less than 10 MB.
Metadata will comprise two formats: contextual information about the data in a text-based document, and ISO 19115 standard metadata in an XML file. These two formats for metadata were
chosen to provide a full explanation of the data (text format) and to ensure compatibility with
international standards (xml format). The standard XML file will be more complete; the document
file will be a human readable summary of the XML file.

Access to data, and data sharing practices and policies
The data product will be updated monthly due to updates to the record, revisions due to
recalibration of standard gases, and due to errors. The date of the update will be included in the
data file and will be part of the data file name. Versions of the data product that have been revised due to errors/updates (other than new data) will be retained in an archive system. A revision history
document will describe the revisions made.
Daily and monthly backups of the data files will be retained at the Keeling Group Lab
(http://scrippsco2.ucsd.edu , accessed 05/2011), at the Scripps Institution of Oceanography
Computer Center, and at the Woods Hole Oceanographic Institution’s Computer Center.
Policies and provisions for re-use, re-distribution and production of derivatives
The final data product will be released to the public as soon as the recalibration of standard gases has been completed and the data have been prepared, typically within six months of collection.
There is no period of exclusive use by the data collectors. Users can access documentation and
final monthly CO2 data files via the Scripps CO2 Program website (http://scrippsco2.ucsd.edu ).
The data will be made available via ftp download from the Scripps Institution of Oceanography
Computer Center. Raw data (continuous concentration measurements, weather data, etc.) will be
maintained on an internally accessible server and made available on request at no charge to the
user.
Archiving of data
Our intent is that the long-term high quality final data product generated by this project will be
available for use by the research and policy communities in perpetuity. The raw supporting data
will be available in perpetuity as well, for use by researchers to confirm the quality of the Mauna
Loa Record. The investigators have made arrangements for long-term stewardship and curation at
the Carbon Dioxide Information and Analysis Center (CDIAC), Oak Ridge National Laboratory
(see letter of support). The standardized metadata record for the Mauna Loa CO2 data will be
added to the metadata record database at CDIAC, so that interested users can discover the Mauna
Loa CO2 record along with other related Earth science data. CDIAC has a standardized data product citation [5], including a DOI, that indicates the version of the Mauna Loa Data Product and how to
obtain a copy of that product.

Class Activity-1: Data Management Plan
MEHRAN UNIVERSITY OF ENGINEERING & TECHNOLOGY
US- PAKISTAN CENTER FOR ADVANCED STUDIES IN WATER
(USPCAS-W)
HYDRO-INFORMATICS: DATA MANAGEMENT AND ANALYSIS

Name:
Roll No. : Time Allowed: 15 Minutes
Name:
Roll No. :
The LBOD Stage-I project area is located in Sindh, in the Lower Indus Basin. It lies between latitudes 24º 10' and 26º 40' N and longitudes 68º 09' and 69º 26' E in the districts of Shaheed Benazirabad, Sanghar and Mirpur Khas. The project is located on the left bank of the River Indus in the command of Sukkur Barrage.

In 2011 this area received an extreme rainfall event, unprecedented in the history of the region. According to FAO and SUPARCO estimates, about 1.83 million acres (2,850 sq. miles) of land were inundated in four districts on the 1st of October, two weeks after the last rainfall event. Two major factors influenced drainage operation, which were beyond the design assumptions of the system. Large quantities of runoff generated outside the network found an evacuation route through the LBOD network. Breaches in irrigation channels and direct irrigation discharges into the drainage network substantially added to the local flooding. Seeing the severity of the event, SIDA launched a project, “Regional Plan for the Left Bank of Indus, Delta”, to ensure safe, timely, and unconstrained disposal of drainage and storm water, and to rehabilitate and improve the existing LBOD infrastructure. Your center has been awarded the project, in which you have been asked to:
1. Procure and Manage the data for the project
a. Meteorological data from the Pakistan Meteorological Department (PMD)
(End product: monthly average precipitation time series stored in a SQL database)
b. Landsat Satellite Images to quantify the land cover and land use (LCLU) in the area
(End product: Classified images based on FAO defined classes; Projection UTM)
2. Provide expert opinion on the tools to be used in the project
3. Perform rainfall analysis for the extreme events

This project starts on 01.04.2016 and ends in six months. You are given the right to archive the data and its metadata on a cloud server and share it, while ensuring that it is properly cited and SIDA is acknowledged. You are requested to keep the metadata conformant to Federal Geographic Data Committee (FGDC) standards and discoverable on the internet.

a) Provide a flow chart showing the steps you will take throughout the data life cycle of the
project

b) Develop a data management plan (DMP) for this project. Following the given template.

Project Title:

Creator:

Last modified:

Copyright:

Products of Research

Data Format

Access to data, and data sharing practices and policies

Policies and provisions for re-use, re-distribution and production of derivatives

Archiving of data

EXERCISE 1: Data Management Plan and Metadata [2 pt]

Submission Date: Before next class


Learning Objectives
 To create a data management plan
 To list attributes for metadata description
Deliverables
 Data Management Plan for a project
 Metadata description
Project details and Task
Keenjhar Lake is situated in Thatta District, Sindh, Pakistan. It is 122 km away from Karachi and
18 km away from the town of Thatta. It is the second largest fresh water lake in Pakistan. It is an
important source that provides drinking water to Thatta District and Karachi city.
Since Keenjhar is one of the major sources for Karachi and Thatta, the government has decided to estimate its annual water budget and assess its water quality.
Water quality samples are proposed to be taken at the KB Feeder and the Horoolo Drain.
The KB Feeder canal is the source of fresh water to the lake and originates from the River Indus. The maintenance of the water quality of the canal falls under the jurisdiction of the Sindh Irrigation Department, a major stakeholder. The effluent waste of industries operating in the Kotri Industrial Area is discharged into the canal and is one of the major sources of anthropogenic contamination. Figure 6 shows the proposed site for sample collection.

Figure 6: KB Feeder

The Horoolo Drain is a stormwater drain which carries rainwater to Keenjhar Lake when heavy rain falls in the surrounding localities. The length of the Horoolo Drain is about 4 km from Horoolo Bridge to its connecting point with the lake. The bed of the drain is about 3–4 feet lower than the lake, so lake water commonly flows back into the drain channel. The water in this section of the drain is consumed by the nearby settled population, cattle, and aquatic life (fish, turtles, etc.). In case of heavy rain, the water level in the drain becomes higher than the lake water level and the drain water flows into the lake. The lower channel and sampling sites are shown in Figure 7.

Figure 7: Horoolo Drain

The department had already prepared shape files for the KB Feeder and Horoolo Drain in the year 2000, but they did not prepare the metadata for the files at that time, and now they are having difficulty finding/searching the data. They want the data to be searchable, and it should only be used by the department, not by the public. If the public wants to access the data, they should request it officially from the IT manager.
You are hired to help them solve their problems with data handling. For prequalification, they ask you to deliver the deliverables mentioned below:
 A Data Management Plan (Hint: you can use the online tool at dmp.cdlib.org for preparing the DMP)
 List of attributes for the Metadata for the available data in the department

Data Access and Submission


You will submit the printed DMP and a list of metadata attributes for the different kinds of datasets in the department.

Grading Rubric: Assignment-1                                        Date:
Student (Name and Roll Number):

Each category is scored against five standards (No Evidence, Doesn't Meet Standard, Nearly Meets Standard, Meets Standard, Exceeds Standard), with a Self Score and an Instructor Score recorded per category.

Title (1)
• No Evidence: absent (0)
• Doesn't Meet: evidence of two or less (0)
• Nearly Meets: evidence of three (0)
• Meets: evidence of four (1)
• Exceeds: the main point can be assessed from the title alone; name, instructor's name, course, date; neatly finished (1)

Introduction (3)
• No Evidence: absent, no evidence (0)
• Doesn't Meet: there is no clear introduction or main topic (1)
• Nearly Meets: the introduction states the main topic but either does not give a full overview, or is too detailed, leading to annoying repetition later (2)
• Meets: the introduction states the main topic and previews the structure of the report (2)
• Exceeds: the introduction states the main topic and previews the structure of the report; good overview of the design and strategy; an effective summary; gives enough detail to interest the reader (3)

Organization and structural development of the idea: procedure, results, discussion (10)
• No Evidence: not applicable
• Doesn't Meet: paragraphs fail to develop the main idea; no evidence of structure or organization (1–5)
• Nearly Meets: organization of ideas not fully developed; paragraphs lack supporting detail sentences; no transitions (6–7)
• Meets: paragraph development present but not perfected; each paragraph has sufficient supporting detail sentences; no transitions (8)
• Exceeds: writer demonstrates logic and sequencing of ideas through well-developed paragraphs; each paragraph has thoughtful supporting detail sentences that develop the main idea; the first sentence of each paragraph is the summary sentence; transitions enhance structure (9–10)

Engineering Calculations and Design (70)
• No Evidence: design point(s) not addressed (3–42%)
• Doesn't Meet: the writer has no clue what they are talking about (45–58%)
• Nearly Meets: sketchy; left out required design points; did not work on this as much as you should have, and it shows; many important answers are incorrect (61–79%)
• Meets: discussion lacks adequate detail, but all the necessary points are covered and nearly all answers are correct (82–88%)
• Exceeds: provides what was explicitly asked for; the function of each piece is demonstrated to the reader in adequate, but not overwhelming, detail; answers are correct and reasonable (91–100%)
Breakdown: a) Data Management Plan (80); b) List of Metadata attributes (10)

Word Usage and Format (10)
• No Evidence: not applicable
• Doesn't Meet: numerous and distracting errors in punctuation, capitalization, spelling, sentence structure, word usage, significant figures, tables, and figures; data vomited onto page(s); unacceptable/unprofessional at the graduate level (1–5)
• Nearly Meets: misspelled words, poor English grammar and word choice; main body of report is either longer or significantly less than one page; figures are too small and/or under-labeled, although they are usually of acceptable quality and focus; tables incoherent or not cohesive; bad font sizes; too much or too little data in the appendices; could be improved by being more meticulous (6–7)
• Meets: almost no errors in punctuation, capitalization, spelling, sentence structure, word usage, significant figures, and presentation of figures, tables, and appendices; main body of report is one page or less (8)
• Exceeds: punctuation, capitalization, spelling, sentence structure, word usage, and significant figures all correct; main body of report is one page or less; clear, consistent fonts; good word-processing skills; figures have adequate contrast; informative figure and table titles and legends; figures have appropriate axis tick spacing, labels, units, and legends; table columns cohesive, labeled, and specify units; document is stapled; appendices, if provided, are separated by topic, and each has a title, discussion, and proper formatting and display of information (9–10)

Conclusion (4)
• No Evidence: absent (0)
• Doesn't Meet: incomplete and/or not focused (1)
• Nearly Meets: the conclusion does not adequately restate the main results (2)
• Meets: the conclusion restates the main results (3)
• Exceeds: the conclusion restates the main results, and is an effective summary (4)

References (2)
• No Evidence: absent (0)
• Doesn't Meet: with many errors, off-the-wall sources used (0)
• Nearly Meets: with some errors, appropriate sources were used (1)
• Meets: with few errors, good sources were used (2)
• Exceeds: all cited works (text, visual, and data sources) are in the correct format with no errors; uses innovative sources of information (2)

TOTAL (100)

CHAPTER 3: DATA MODEL

Data model

A formal method of describing the behavior of the real-world features. A fully developed data
model supports entity classes, relationships between entities, integrity rules and operations on the
entities (ESRI GIS glossary).

Let’s understand what is meant by the above definition. Figure 8 shows the physical features that occur in a watershed. Some of the important features that exist in a watershed are:

1. Watershed boundary
2. Aquifer boundary
3. Water body
4. Rivers/streams/canals
5. Measurement stations (i.e. precipitation, temperature, flow, ground/surface water level, etc.)
6. Wells

Figure 8: Physical features in a watershed and Arc Hydro framework data model [refer]

Now, you intend to store information related to these features in a database, for which you need a data model. What the data model does is take these features and define each of them in terms of entities (tables) and attributes (column titles of each table), and combine different features through common attributes by building relationships among them. Basically, a data model is the blueprint for the physical implementation of a database. The basic elements used to describe a data model are defined below:

• Entity – a class of real-world objects having common attributes (e.g., sites, variables, methods). Physically speaking, an entity is a table in the database.
• Attribute – a characteristic or property of an entity (site name, latitude, longitude). Physically speaking, an attribute is a column in a table in the database.
• Relationship – an association between two or more entities. Physically speaking, a relationship connects one table to another table through a matching attribute in the database.

For this watershed example, we have to see which features have common attributes. For example, watersheds will have common attributes such as ID, name, area, and type of boundary. Similarly, observations taken with respect to time will have common attributes such as date/time of measurement, measurement value, variable measured, and unit of measurement. Monitoring points will have common attributes such as ID, latitude, longitude, and observation type. Rivers will have common attributes such as ID, name, length, bed slope, and width.

Figure 8 shows the standard data model that is used in the Arc Hydro framework model [refer]. This model includes the following entities:

 Aquifer – Polygon features representing aquifer boundaries. The features can be classified to
represent different zones such as outcrop and confined sections of the aquifer.
 MonitoringPoint – Point features representing locations where hydrologic variables are
measured, such as stream-gage stations and precipitation gages.
 SeriesCatalog – Table for indexing and summarizing time series stored in the TimeSeries
table.
 TimeSeries – Table for storing time varying data such as water levels, flow, and water
quality.
 VariableDefinition – Table for storing the definitions of the variables whose values are stored in the TimeSeries table.
 WaterBody – Polygon features representing areas such as ponds, lakes, swamps, and
estuaries.
 WaterLine – Line features representing hydrographic “blue lines”, which represent mapped
streams and water body center lines.
 WaterPoint – Point features representing hydrographic features such as springs, water
withdrawal/discharge locations, and structures.
 Watershed – Polygon features representing drainage areas contributing water flow from the
land surface to the water system.
 Well – Point features representing well locations and their attributes.

Each entity has its associated attributes. Table 1 shows the attributes for the Aquifer entity. You will see in the table that each attribute has an associated name, data type, and description. For example, the HydroID attribute is the unique ID that is used to build relationships, and its data type is a long integer. A sketch of this entity as a SQL table is given after Table 1.

Table 1: Attributes for Aquifer entity

Field name | Type | Description
HydroID | Long Integer | Unique feature identifier in the geodatabase used for creating relationships between classes of the data model.
HydroCode | Text | Permanent public identifier of the feature used for relating features with external information systems.
Name | Text | Text attribute representing the name of the aquifer.
HGUID | Long Integer | Relates aquifer polygons with more detailed descriptions of hydrogeologic units defined in the HydrogeologicUnit table.
FType | Text | Distinguishes between types of aquifers or zones within an aquifer (e.g., unconfined and confined).
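To make the mapping from data model to database concrete, here is a minimal sketch of the Aquifer entity declared as a MySQL table. The translation of Long Integer to INT and Text to VARCHAR, and the column lengths, are assumptions for illustration, not part of the Arc Hydro standard:

    CREATE TABLE Aquifer (
      HydroID   INT NOT NULL,    -- unique identifier used to build relationships
      HydroCode VARCHAR(50),     -- permanent public identifier
      Name      VARCHAR(100),    -- name of the aquifer
      HGUID     INT,             -- links to the HydrogeologicUnit table
      FType     VARCHAR(50),     -- aquifer type or zone (e.g., unconfined, confined)
      PRIMARY KEY (HydroID)
    );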

Designing a data model

In this book, we will focus on designing a data model for a relational database. There are three stages in the design of a data model:
Stage 1: Conceptual data model
Stage 2: Logical data model
Stage 3: Physical data model
Stage 1: Conceptual data model

A conceptual data model is simply an information model. Suppose you need to design a database for storing information about an orchard farm. The first step is to think in simple terms about what physical features there will be and how they are related to each other. Figure 9 shows an example of a conceptual data model.

Figure 9: Conceptual model for orchard farm

At the conceptual stage, we only identify the main entities and their relationships. Further detail is added at stage 2, i.e. the logical data model design stage.

Stage 2: Logical data model


In the logical data model design stage we add further details to our data model. At this stage of model design, we provide the attributes of the entities, define primary keys and foreign keys, and establish relationships using the primary key and foreign key of each entity. The logical data model contains the details but is independent of the software on which it will be implemented. Those details are incorporated in stage 3, i.e. the physical data model.
Figure 10 shows the logical data model. There are three entities: i) Orchards; ii) AppleTrees; iii) Apples. Orchards has three attributes: OrchardID (integer data type), Ownersname (text data type), and Area_acres (double). OrchardID is the primary key. The primary key is an attribute that is a persistent and unique identifier. A primary key must exist, must not be nullable, and must be unique. The entity to which a primary key belongs is called the parent table for that key.

Figure 10: Logical data model for orchard farm

The AppleTrees entity has five attributes, of which Orchards_OrchardID is the foreign key. A foreign key is the primary key of the parent table placed in the child table. In the example here, Orchards is the parent entity and AppleTrees is the child entity, and they are related via the OrchardID key. The relationship between these two entities is one-to-many. Four different kinds of relationship can exist between any two entities: one-to-one, one-to-many, many-to-one, and many-to-many.
Let’s understand what these relationships mean and how to interpret them. In the example of the orchard farm, we know that one specific tree will belong to one and only one orchard, so from tree to orchard the relationship is one-to-one; and one orchard will have one or more trees, so from orchard to trees the relationship is one-to-many. In reading the relationship given in Figure 11, from left to right we read that an Orchard has 1 or more AppleTrees; from right to left, that an AppleTree is located at 1 and only 1 Orchard.

Figure 11: Relationship between orchards and Apple Trees entity. Left to Right: Orchards has 0 or more AppleTrees of
data. Right to Left: AppleTree is located at 1 and only 1 orchard

Stage 3: Physical data model


We have gone through all the basics of data modelling terminology. Now, let’s see how it looks physically. Here, the Orchards entity is converted to a table. Its attributes are the headers of the columns. Each row contains one data value. All these tables are connected through primary and foreign keys. In the physical data model, entities are converted to tables, attributes are converted to column names, and column data types are defined. Tables are related to other tables through the unique identifiers (primary and foreign keys). At this stage the design becomes software-specific; for example, there will be some differences when implementing the model on MySQL versus SQL Server. A sketch of the corresponding SQL is given below.
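As an illustration, here is a minimal sketch of the orchard data model as MySQL DDL. The column names follow Figure 10; the TreeID attribute name and the column lengths are assumptions for illustration:

    CREATE TABLE Orchards (
      OrchardID  INT NOT NULL,           -- primary key: persistent, unique, not null
      Ownersname VARCHAR(100),
      Area_acres DOUBLE,
      PRIMARY KEY (OrchardID)
    );

    CREATE TABLE AppleTrees (
      TreeID             INT NOT NULL,   -- assumed primary key of the child table
      Orchards_OrchardID INT NOT NULL,   -- foreign key back to the parent table
      PRIMARY KEY (TreeID),
      FOREIGN KEY (Orchards_OrchardID) REFERENCES Orchards (OrchardID)
    );

The FOREIGN KEY clause is what physically implements the one-to-many relationship: many AppleTrees rows may carry the same Orchards_OrchardID, but each tree points to exactly one orchard.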

Figure 12: Database for orchard farm data model

Class Activity-2: Designing a data model
In this activity, we will design a data model capable of storing information about the parameters measured at a well. The target is to implement it on a MySQL server, and we will use MySQL Workbench to achieve this task. Thinking conceptually, the following entities and attributes are sought:

Boundary
Attribute | Data Type | Description
HydroID | INT | Unique feature identifier in the database used for creating relationships between classes of the data model.
HydroCode | VARCHAR | Permanent public identifier of the feature used for relating features with external information systems.
Name | VARCHAR | Text attribute representing the name of the aquifer.
BType | VARCHAR | Distinguishes between types of aquifers or zones within an aquifer (e.g., unconfined and confined).

Time series
Attribute | Data Type | Description
FeatureID | INT | Unique feature identifier. Equal to the HydroID of the feature associated with the time series value.
TsYear | DATETIME | Time stamp specifying the date and time associated with the time series value.
UTCOffset | DOUBLE | Number of hours the time coordinate system used to define the time stamp is displaced from Coordinated Universal Time.
TsValue | DOUBLE | Numerical value of the variable at the given location and time.
SElevValue | DOUBLE | Numerical value of the surface elevation at the given location.

Variabledefinition
Attribute | Data Type | Description
HydroID | INT | Unique numerical identifier for the variable within the database.
VarName | VARCHAR | The name of the variable.
VarDesc | VARCHAR | The description of the variable.
VarUnits | VARCHAR | Units of measure for the variable.

Well
Attribute | Data Type | Description
HydroID | INT | Unique feature identifier in the database used for creating relationships between classes of the data model.
HydroCode | VARCHAR | Permanent public identifier of the feature used for relating features with external information systems.
FType | VARCHAR | Distinguishes between types of wells.
Longitude | FLOAT | Numerical value of longitude.
Latitude | FLOAT | Numerical value of latitude.

Steps to complete the task

Step 1: Open MySQL Workbench and double-click the localhost connection.
Step 2: Create a new model (File -> New Model, or Ctrl+N).
Step 3: Double-click the “Add Diagram” icon.
Step 4: Create the entities mentioned above.
Step 5: Name the entities and add attributes to them.
Step 6: Create the relationships.
Step 7: Save and export the ER diagram.

Screenshots of each step

Step 1: Open MySQL Workbench and double-click the localhost connection.

Step 2: Create a new model (File -> New Model, or Ctrl+N).

Step 3: Double-click on the “Add Diagram” icon.

Step 4: Create entities
To create new entities in your drawing, click on the “New Table” button on the toolbar just to the
left of the gridded canvas. Then click on the canvas. You will see a new table/entity show up.
Since MySQL is a relational database management system, entities in MySQL Workbench are
called tables.
To adjust the location of an entity on your drawing, make sure the pointer tool is selected on the
toolbar and then click on an entity and drag it to a new location.

Step-5: Naming and Adding Attributes to Entities

When you double click on an entity (table) in the drawing you will notice that it becomes
selected and the panel at the bottom of the window will reflect the properties of the selected
entity.
Here we will create four entities i.e. boundary, well timeseries and variabledefinition with
following attributes.
Boundary
HydroID HydroCode Name BType
Time series
FeatureID TsYear UTCOffset TsValue SElevValue

Variabledefinition
HydroID VarName VarDesc VarUnits

Well
HydroID HydroCode FType Longitude Latitude

The finished entities will look like the ones below:

Step 6: Creating relationships

Before you create a relationship between two entities, you need to first create a primary key for
each entity. You can then create a relationship by first clicking on the desired type of relationship
on the toolbar and then clicking first on the child table (well) followed by the parent table
(boundary). You will notice that the relationship is created and a foreign key is added to the child
table. If you mess up, just right click on the relationship and select “Delete” from the context
menu.

Next, if you double-click on the relationship, its properties will be shown in the panel at the bottom of the window. If you click on the “Foreign Key” tab at the bottom, you can use the available options to modify the cardinality and participation of the relationship.

In the example above, you would be telling MySQL Workbench how a “boundary” entity is related to “wells.” In this example, the most likely relationship is for one “boundary” to be related to “zero or more” “wells.” So, you would want to uncheck the check box next to the “Mandatory” label on the “wells” side of the relationship. The SQL equivalent is sketched below.
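In SQL terms, cardinality and participation settings correspond to how the foreign key column is declared. Here is a sketch, assuming the well and boundary tables from this activity; the foreign-key column name boundary_HydroID is invented for illustration:

    -- Mandatory participation: every well must belong to a boundary,
    -- so the foreign key column is declared NOT NULL.
    ALTER TABLE well
      ADD COLUMN boundary_HydroID INT NOT NULL,
      ADD FOREIGN KEY (boundary_HydroID) REFERENCES boundary (HydroID);

    -- Optional participation: declaring the column as nullable instead
    -- (boundary_HydroID INT NULL) would allow wells with no boundary.

Either way, nothing prevents a boundary from having zero wells; the “zero or more” side of the relationship needs no constraint at all.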

Step 7: Saving and exporting the diagram

When you finish your model, you will want to save it and then export it as an image so you can paste it into a Word document as an appendix. At the top of the MySQL Workbench window, click the “File” menu and then click “Save Model.” Select where you want to save your model file, give it a name, and then click “Save.”
To export your model as an image, click on the “File” menu and then select Export -> Export as PNG. Select where you want to save your PNG file, give it a name, and then click “Save.” You can then insert the PNG file into a Word document.

Exercise-2 Data Model Design

Submission Date: Before next class


Learning Objectives
1. Develop data models to represent, organize, and store data
2. Design and use relational databases to organize, store, and manipulate data
Problem Statement
For all investigations and projects where a lot of data are needed, an important task is to manage the collected data. Your task is to design a database for handling geological and hydrogeological point data assigned to boreholes, observation wells, discharge wells and any other points of hydrogeological relevance, and the respective time series.

Make a list of all data you consider relevant and design a structure for a database that allows you to integrate and administrate different types of data connected to point information.

For the structuring of the layer-related database you should consider that at each point there could be a different number of lithological layers (i.e. gravel, sand, loam, silt), which can be assigned to different geological layers (stratigraphy). Aquifer layers summarize different lithological layers into hydrogeological units to which hydrogeological data (heads, hydraulic parameters, etc.) are assigned according to hydrostratigraphy.

Also consider the integration and management of additional information like time series, chemical
analyses and practical information such as owners, addresses, images etc. Find an adequate
structure of the database for the integration of this information.

Deliverable:
Provide a one-page briefing report along with a full-page entity-relationship diagram that shows
your logical model design. In presenting your design:
1. Provide an introduction to the problem.
2. Describe the methods you used to develop your design.
3. Describe your results:
a. Describe the entities and relationships that you have included in your data model.
b. Explain how you will structure the metadata to avoid repetition.
c. Overview the software technology, file formats, etc. you will use to organize the
data and implement your data model.
d. Describe how you could make it easier to get data into and out of your data
model.
4. Provide a brief summary/conclusion section that specifies whether/how your data model
design will facilitate querying and retrieval of subsets of data.
5. Provide a full page entity-relationship diagram as an appendix to your write-up that
shows the entities needed to describe the data, their attributes, and the relationships
between them.

Grading Rubric: Assignment-2                                        Date:
Student (Name and Roll Number):

Each category is scored against five standards (No Evidence, Doesn't Meet Standard, Nearly Meets Standard, Meets Standard, Exceeds Standard), with a Self Score and an Instructor Score recorded per category.

Title (1)
• No Evidence: absent (0)
• Doesn't Meet: evidence of two or less (0)
• Nearly Meets: evidence of three (0)
• Meets: evidence of four (1)
• Exceeds: the main point can be assessed from the title alone; name, instructor's name, course, date; neatly finished (1)

Introduction (3)
• No Evidence: absent, no evidence (0)
• Doesn't Meet: there is no clear introduction or main topic (1)
• Nearly Meets: the introduction states the main topic but either does not give a full overview, or is too detailed, leading to annoying repetition later (2)
• Meets: the introduction states the main topic and previews the structure of the report (2)
• Exceeds: the introduction states the main topic and previews the structure of the report; good overview of the design and strategy; an effective summary; gives enough detail to interest the reader (3)

Organization and structural development of the idea: procedure, results, discussion (10)
• No Evidence: not applicable
• Doesn't Meet: paragraphs fail to develop the main idea; no evidence of structure or organization (1–5)
• Nearly Meets: organization of ideas not fully developed; paragraphs lack supporting detail sentences; no transitions (6–7)
• Meets: paragraph development present but not perfected; each paragraph has sufficient supporting detail sentences; no transitions (8)
• Exceeds: writer demonstrates logic and sequencing of ideas through well-developed paragraphs; each paragraph has thoughtful supporting detail sentences that develop the main idea; the first sentence of each paragraph is the summary sentence; transitions enhance structure (9–10)

Engineering Calculations and Design (70)
• No Evidence: design point(s) not addressed (3–42%)
• Doesn't Meet: the writer has no clue what they are talking about (45–58%)
• Nearly Meets: sketchy; left out required design points; did not work on this as much as you should have, and it shows; many important answers are incorrect (61–79%)
• Meets: discussion lacks adequate detail, but all the necessary points are covered and nearly all answers are correct (82–88%)
• Exceeds: provides what was explicitly asked for; the function of each piece is demonstrated to the reader in adequate, but not overwhelming, detail; answers are correct and reasonable (91–100%)
Breakdown: a) Describe the entities and relationships (10); b) Explain how you will structure the metadata to avoid repetition (5); c) Overview the software technology, file formats, etc. (5); d) Entity-relationship diagram included in appendix (50)

Word Usage and Format (10)
• No Evidence: not applicable
• Doesn't Meet: numerous and distracting errors in punctuation, capitalization, spelling, sentence structure, word usage, significant figures, tables, and figures; data vomited onto page(s); unacceptable/unprofessional at the graduate level (1–5)
• Nearly Meets: misspelled words, poor English grammar and word choice; main body of report is either longer or significantly less than one page; figures are too small and/or under-labeled, although they are usually of acceptable quality and focus; tables incoherent or not cohesive; bad font sizes; too much or too little data in the appendices; could be improved by being more meticulous (6–7)
• Meets: almost no errors in punctuation, capitalization, spelling, sentence structure, word usage, significant figures, and presentation of figures, tables, and appendices; main body of report is one page or less (8)
• Exceeds: punctuation, capitalization, spelling, sentence structure, word usage, and significant figures all correct; main body of report is one page or less; clear, consistent fonts; good word-processing skills; figures have adequate contrast; informative figure and table titles and legends; figures have appropriate axis tick spacing, labels, units, and legends; table columns cohesive, labeled, and specify units; document is stapled; appendices, if provided, are separated by topic, and each has a title, discussion, and proper formatting and display of information (9–10)

Conclusion (4)
• No Evidence: absent (0)
• Doesn't Meet: incomplete and/or not focused (1)
• Nearly Meets: the conclusion does not adequately restate the main results (2)
• Meets: the conclusion restates the main results (3)
• Exceeds: the conclusion restates the main results, and is an effective summary (4)

References (2)
• No Evidence: absent (0)
• Doesn't Meet: with many errors, off-the-wall sources used (0)
• Nearly Meets: with some errors, appropriate sources were used (1)
• Meets: with few errors, good sources were used (2)
• Exceeds: all cited works (text, visual, and data sources) are in the correct format with no errors; uses innovative sources of information (2)

TOTAL (100)

CHAPTER 4: DEVELOPMENT OF A DATABASE

Here I will try to link together the material we covered in the data model chapter, and explain how what we have learned so far applies to a real project. My aim is to create a subsurface database for Sindh which stores information about hydrogeological features. Keeping the following steps in mind, I will achieve my aim:
1. Conceptualize the database structure/data model, i.e. identify entities, attributes and relationships among them. My thought was that I should follow the standard structure of an available model and customize it according to my needs. I started looking at the ArcHydro model for groundwater, and it was a good choice, as it gave me a good base for storing data for my study area.
2. I also needed to select database software in which I could create and implement the database. As my vision is to create a web-based application for data sharing in Pakistan, I needed a working platform that can connect to the internet, is free, and integrates with different programming languages. So I chose MySQL Server, and I used MySQL Workbench for the creation and implementation of the database.
3. While thinking through these soft measures, I also needed to collect the data and digitize them. I used the help of students and my irrigation department partners to achieve this task. The data received were either in hard format, or in soft format but too scattered to be usable for my purpose. Together with the students, I was able to digitize all of the information into a structure that is readily usable for import into a database, or that can easily be programmed for use in other software.
4. Investigating the data I received, I started thinking of a data model, and came up with the one shown in Figure 13 with the help of brainstorming with stakeholders.
5. I then implemented this data model as a database and imported all of the data that had been digitized in Excel or other formats.
6. Now I have great freedom to handle my data using SQL scripting and other languages.

In this chapter, we will go through the steps to develop a database to store hydrogeological
information.

Figure 13: Entity-Relationship diagram or Data model

Steps to replicate the database development process

1. Open MySQL Workbench, use your server connection, and open the server.
2. Open the "wtd" data model (Ctrl+O).
3. Perform forward engineering to convert the data model into a database (Ctrl+G).
4. Now go to the connected server and refresh it; you will notice that entities have been converted to tables and attributes to columns.

5. The next step is to access the database. You can do this using SQL scripting. Write "USE wtd;" in the Query tab. This makes sure that you are working in wtd.
6. To see the details of a table you created, use "DESC Tablename;"
7. To select a table, use "SELECT * FROM Tablename;"
8. Now we will load the data into the tables. There are two ways to do that (a short sketch of option (b) follows the list);

a. Use the import tab in Workbench to import the csv files
b. Use the SQL script "INSERT INTO table_name (column1, column2, column3, ...) VALUES (value1, value2, value3, ...);"
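For instance, a minimal sketch of option (b), assuming the well table and its columns (HydroID, HydroCode, Ftype, Longitude, Latitude) from the data model in Figure 13:

USE wtd;
INSERT INTO well (HydroID, HydroCode, Ftype, Longitude, Latitude)
VALUES (1, '104', 'obp', 68.7603, 24.7931);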

Once you have created the database, perform the following tasks;

In Class Task-1- Find all the wells that are in the "Rohri" command area. Export them to a csv file.

In Class Task-2- Find out the Latitude and Longitude of Well “36a”

Challenge Task-3- Create a new table in the database that groups the time series values by canal command boundary, such that it gives the average water table of all the wells in each command area.

EXERCISE 3: Data base creation and data population [1 point]
Submission Date: next class
Learning Objectives
1. Create database from data model design
2. Populate database with groundwater data
3. Write one page report
Deliverables
Submit a one-page report that addresses the questions mentioned below, with a screenshot of your populated database as an appendix. Additionally, once you have loaded the data into the database, export the data to .csv files and submit them as an electronic upload to the LMS.
Computer and Data Requirements

1. Data required for this assignment are available in a zip file on the LMS.
2. The data model required for this assignment is available in a zip file on the LMS
3. MySQL workbench and server
Problem
You are hired as a research assistant on a groundwater assessment project. The first task given to you is to organize and store all of the groundwater data recorded for this project. Your supervisor has created a data model using Workbench and would like you to use it as the structure for your database. You are also free to develop your own data model; if you do, provide a justification for developing your own rather than using the one given by your supervisor. Provide a one-page report addressing all the points mentioned below;
1. How is the acquired raw data arranged?
2. Entity relationship diagram.
3. Flow chart of the steps for creating the database.
4. Normalization steps for the raw data.
5. What problems did you face while formatting the given data into the desired format?
6. What errors did you receive while importing the data into the database?
7. What steps did you take to remove those errors?
8. What are the advantages of using a standard data model?
9. What are the disadvantages of not using a standard data model?

Grading Rubric: Assignment-3 Date:
Student (Name and Roll Number):

Category No Doesn’t Meet Nearly Meets Self- Instructor


Meets Standard Exceeds Standard
(Max. Score) Evidence Standard Standard Score Score
Title Absent Evidence of Evidence of three Evidence of Title – can assess main
(1) two or less four point from title alone;
Name, Instructor’s
0 0 0 Name, Course, Date,
1 Neatly finished 1
Introduction Absent, There is no Introduction states The The introduction states
(3) no clear the main topic but introduction the main topic and
introduction either: states the main previews the structure of
evidenc or main topic. 1. Does not topic and the report. Good
e give a full previews the overview of the design
overview, structure of the and strategy. An
1 Or: report. effective summary.
2. Too detailed, Gives enough detail to
leading to interest the reader.
0
annoying 2 3
repetition later.
2
Organizati Not Paragraphs Organization of Paragraph Writer demonstrates
on and applicabl fail to ideas not fully development logic and sequencing of
e develop the developed. present but not ideas through well-
structural main idea. No Paragraphs lack perfected. Each developed paragraphs.
developme evidence of supporting detail paragraph has Each paragraph has
nt of the structure or sentences. No sufficient thoughtful, supporting
idea: organization. transitions. supporting detail sentences that
procedure, detail develop the main idea.
sentences. No The first sentence of
results, transitions. each paragraph is the
discussion 1–5 6-7 summary sentence.
(10) 8 Transitions enhance
structure. 9 - 10
Engineerin Design The writer Sketchy: left out Discussion Provides what was
g point(s) has no clue required design lacks adequate explicitly asked for. The
not what they are points. Did not detail, but all function of each piece is
Calculation addresse talking about. work on this as the necessary demonstrated to the
s and d. 45 – 58% much as you points are reader in adequate, but
Design should have, and it covered and not overwhelming,
(70) 3 – 42% shows. Many nearly all detail. Answers are
important answers answers are correct and reasonable.
are incorrect. correct. 91 – 100%
61 – 79% 82 – 88%

Provide evidence (Screenshot etc.) showing that you successfully loaded the data and
metadata (20)

Description about the errors encountered and their removal (20)


Description of advantages and disadvantages of using a standard data model like ODM (20)
Description about the raw data (10)

40
Category No Doesn’t Meet Nearly Meets Self- Instructor
Meets Standard Exceeds Standard
(Max. Score) Evidence Standard Standard Score Score
Word Not Numerous Misspelled words, Almost no Punctuation,
Usage and applicabl and poor English errors in capitalization, spelling,
e distracting grammar and word punctuation, sentence structure, word
Format errors in choice. Main body capitalization, usage, and significant
(10) punctuation, of report is either spelling, figures all correct. Main
capitalization, longer or sentence body of report is one
spelling, significantly less structure, word page or less. Clear,
sentence than one page. usage, consistent fonts. Good
structure, Figures are too significant word processing skills.
word usage, small and/or figures, and Figures have adequate
significant under-labeled, presentation of contrast. Informative
figures, although they are figures, tables, figure and table titles
tables, and usually of and and legends. Figures
figures. Data acceptable quality appendices. have appropriate axis
vomited onto and focus. Tables Main body of tick spacing, labels,
page(s). incoherent or not report is one units, and legends.
Unacceptable cohesive. Bad font page or less Table columns cohesive,
/ sizes. Too much or labeled, and specify
unprofessiona too little data in units. Document is
l at the appendices. Could 8 stapled. Appendices, if
graduate be improved by provided, are separated
level. 1 – 5 being more by topic, and each have
meticulous. a title, discussion, and
6-7 proper formatting and
display of information
9 - 10
Conclusion Absent Incomplete The conclusion The conclusion The conclusion restates
(4) and/or not does not restates the the main results, and is
0 focused. 1 adequately restate main results. 3 an effective summary. 4
the main results. 2
References Absent With many With some errors, With few All cited works; text,
(2) errors, off- appropriate sources errors, good visual, and data sources
0 the-wall were used. sources were are done in the correct
sources used. 1 used format with no errors.
0 2 Uses innovative sources
of information. 2
TOTAL
(100)

41
CHAPTER 5: STRUCTURE QUERY LANGUAGE

What is Structured Query Language?


SQL is a special-purpose programming language for managing data in relational database management systems (RDBMS). It has been adopted by the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO) as the standard data access language.

Basic SQL Query Structure


A basic SQL query consists of a SELECT, a FROM, and a WHERE clause
– SELECT
• Specifies the columns to appear in the result
– FROM
• Specifies the tables to use
– WHERE
• Filters the results based on criteria
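Putting the three clauses together, here is a minimal sketch; the table and column names follow the wtddb examples used later in this chapter:

USE wtddb;
SELECT HydroCode, Latitude
FROM well
WHERE Latitude > 24.9;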
Important Commands

 SELECT - extracts data from a database


 UPDATE - updates data in a database
 DELETE - deletes data from a database
 INSERT INTO - inserts new data into a database
 CREATE DATABASE - creates a new database
 ALTER DATABASE - modifies a database
 CREATE TABLE - creates a new table
 ALTER TABLE - modifies a table
 DROP TABLE - deletes a table
 CREATE INDEX - creates an index (search key)
 DROP INDEX - deletes an index

Create a New Query in MySQL Workbench

Create a new SQL tab for executing queries, then tell SQL which database to use:

USE sindhaquiferdb;

Selecting Data
Syntax
SELECT Field_1, Field_2, Field_n FROM TableName;

Example
Example of Select queries:
1. Select all values in the well table;
USE wtddb;
SELECT * FROM well;

– The “*” means – give me all of the fields/columns in the table


2. Select the HydroID, HydroCode, and Ftype columns from well
USE wtddb;
SELECT HydroID, HydroCode, Ftype FROM well;

Adding Criteria to SELECT Queries


The “WHERE” clause specifies which data values or records will be returned
based on criteria
• Conditional operators used with the WHERE clause are:
= Equal
> Greater than
< Less than
<= Less than or equal
>= Greater than or equal
<> Not equal to
LIKE Match a substring, with “%” as a wildcard character
IN/NOT IN Supply a list of items to test
BETWEEN Test between two given values

Syntax
SYNTAX for adding criteria to a SELECT query:
SELECT Field_1, Field_2, Field_n
FROM TableName
WHERE Field_1 = SomeCondition AND/OR Field_2 = AnotherCondition;
Example
Example of adding criteria to a SELECT query:
3. Which wells are north of 24.9 degrees latitude?
USE wtddb;
SELECT * FROM well
WHERE Latitude > 24.9;
4. Which wells are between 26 and 27 degrees latitude?
USE wtddb;
SELECT * FROM well
WHERE Latitude BETWEEN 26 AND 27;

Multiple Criteria and Boolean Operators


Example
5. Select only those values in which TsValue is less than or equal to 70 and that lie in the boundary DESERT.
USE wtddb;
SELECT * FROM timeseries
WHERE TsValue <= 70
AND Well_Boundary_HydroCode = 'DESERT';
6. Select those values in which TsValue is less than or equal to 70, or that lie in the boundary DESERT.
USE wtddb;
SELECT * FROM timeseries
WHERE TsValue <= 70
OR Well_Boundary_HydroCode = 'DESERT';
7. Select all those values in which TsValue is not less than or equal to 70
USE wtddb;
SELECT * FROM timeseries
WHERE NOT TsValue <= 70;
LIMIT ROW SELECTION
Syntax
SELECT column_name(s)
FROM table_name
WHERE condition
LIMIT number;

Example
8. Select only those values in which TsValue is less than or equal to 70, but show only 10 rows.
USE wtddb;
SELECT * FROM timeseries
WHERE TsValue <= 70
LIMIT 10;

Sorting Results Using ORDER BY


The ORDER BY clause can be used to arrange query results in ascending (ASC) or
descending (DESC) order.

Example
9. Arrange the wells from the smallest to the highest TsValue.
USE wtddb;
SELECT TsValue,Well_HydroCode
FROM timeseries
ORDER BY TsValue ASC;
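To reverse the order, use DESC in the same way:

USE wtddb;
SELECT TsValue,Well_HydroCode
FROM timeseries
ORDER BY TsValue DESC;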

NULL Values
• Missing (unknown) info is represented by NULL values
• Result of any comparison involving a NULL value is Unknown

Example
10. Select the wells whose HydroCode is missing:
USE wtddb;
SELECT * FROM well
WHERE HydroCode IS NULL;

11. Select the wells whose HydroCode is present:
USE wtddb;
SELECT * FROM well
WHERE HydroCode IS NOT NULL;

Selecting from More than One Table
(USING JOIN)
• The “JOIN” statement makes queries relational
• Joins allow you to select information from more than one table using one
SELECT statement

Syntax
SELECT LeftTable.Field1, LeftTable.Field2, RightTable.Field1,
RightTable.Field2
FROM LeftTable
Join_Type RightTable
ON JoinCondition;

TYPES OF JOIN

INNER JOIN: Takes every record in the LeftTable and looks for 1 or more
matches in the RightTable based on the JoinCondition. All matched records are
added to the result.

Syntax
SELECT column_name(s)
FROM table1
INNER JOIN table2 ON table1.column_name = table2.column_name;

Example
12. Produce a table that contains the HydroCode, latitude, longitude, year, and value of the wells, using the wtddb database.

USE wtddb;
SELECT
well.HydroCode, well.Latitude, well.Longitude, timeseries.TsYear, timeseries.TsValue
FROM timeseries
INNER JOIN well ON well.HydroCode = timeseries.Well_HydroCode;

OUTER JOIN: Brings two tables together but includes data even if the
JoinCondition does not find matching records
3 Variations: LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN

Syntax
LEFT JOIN
SELECT column_name(s)
FROM table1
LEFT JOIN table2 ON table1.column_name = table2.column_name;
RIGHT JOIN
SELECT column_name(s)
FROM table1
RIGHT JOIN table2 ON table1.column_name = table2.column_name;

FULL JOIN
SELECT column_name(s)
FROM table1
FULL OUTER JOIN table2 ON table1.column_name = table2.column_name;
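Note that MySQL, which we use in this book, does not support FULL OUTER JOIN directly. A common workaround is to combine a LEFT and a RIGHT join with UNION; a sketch using our well and timeseries tables:

SELECT well.HydroCode, timeseries.TsValue
FROM well
LEFT JOIN timeseries ON well.HydroCode = timeseries.Well_HydroCode
UNION
SELECT well.HydroCode, timeseries.TsValue
FROM well
RIGHT JOIN timeseries ON well.HydroCode = timeseries.Well_HydroCode;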

Insert into Table

Syntax
INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);

Example
13. Insert a new administrative boundary named Hyderabad, with code Hyd, into the boundary table.
USE wtddb;
INSERT INTO boundary(HydroID,HydroCode,Name,Btype)
VALUES (14,'Hyd','Hyderabad','AU');

UPDATE TABLE
Syntax
UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;

Example
14. Update the code of the administrative boundary named Hyderabad in the boundary table from Hyd to Hyder.

USE wtddb;
UPDATE boundary
SET HydroCode = 'Hyder'
WHERE HydroID = 14;

DELETE from table

Syntax
DELETE FROM table_name
WHERE condition;

Example
15. Delete the administrative boundary Hyderabad (HydroID = 14) from the boundary table.

USE wtddb;
DELETE FROM boundary
WHERE HydroID = 14;

MIN/MAX from table


Syntax
SELECT MIN(column_name)
FROM table_name
WHERE condition;

SELECT MAX(column_name)
FROM table_name
WHERE condition;

Example
16. Calculate the minimum depth to water table
USE wtddb;
SELECT MIN(TsValue)
FROM timeseries;

17. Calculate the maximum depth to water table
USE wtddb;
SELECT MAX(TsValue)
FROM timeseries;

COUNT(), AVG() and SUM()


Syntax
COUNT SYNTAX

SELECT COUNT(column_name)
FROM table_name
WHERE condition;

Example
18. Count the number of data points in the database
USE wtddb;
SELECT COUNT(TsValue)
FROM timeseries;

Syntax
AVERAGE SYNTAX:

SELECT AVG(column_name)
FROM table_name
WHERE condition;

Example
19. Calculate the average depth to water table
USE wtddb;
SELECT AVG(TsValue)
FROM timeseries;

Syntax
SUM SYNTAX:

SELECT SUM(column_name)
FROM table_name
WHERE condition;
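Example
Following the pattern of the MIN, MAX, and AVG examples above, the total of all depth-to-water values can be computed as:

USE wtddb;
SELECT SUM(TsValue)
FROM timeseries;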

LIKE Operator
The LIKE operator is used in a WHERE clause to search for a specified pattern in
a column.

There are two wildcards used in conjunction with the LIKE operator:

 % - The percent sign represents zero, one, or multiple characters


 _ - The underscore represents a single character

Syntax
LIKE SYNTAX
SELECT column1, column2, ...
FROM table_name
WHERE columnN LIKE pattern;

Example
20. Select all values from the well table that lie in the FULLELI boundary

USE wtddb;
SELECT * FROM well
WHERE Boundary_HydroCode LIKE 'F%';

Wildcards
A wildcard character is used to substitute any other character(s) in a string.


Using the % Wildcard


Example
21. Match any Boundary_HydroCode that begins with 'F':

USE wtddb;
SELECT * FROM well
WHERE Boundary_HydroCode LIKE 'F%';

Using the _ Wildcard


Example
22. Match any three-character Boundary_HydroCode that begins with 'F' and ends with 'L':

USE wtddb;
SELECT * FROM well
WHERE Boundary_HydroCode LIKE 'F_L';

IN Operator
The IN operator allows you to specify multiple values in a WHERE clause.

The IN operator is a shorthand for multiple OR conditions.

Syntax
SELECT column_name(s)
FROM table_name
WHERE column_name IN (value1, value2, ...);

Example
23. Select all wells whose Boundary_HydroCode is either 'FUL' or 'LC':
USE wtddb;
SELECT * FROM well
WHERE Boundary_HydroCode IN ('FUL', 'LC');

BETWEEN Operator
The BETWEEN operator selects values within a given range. The values can be
numbers, text, or dates.

The BETWEEN operator is inclusive: begin and end values are included.

Syntax
SELECT column_name(s)
FROM table_name
WHERE column_name BETWEEN value1 AND value2;

Example
24. Select the time series values between 100 and 150:

USE wtddb;
SELECT * FROM timeseries
WHERE TsValue BETWEEN 100 AND 150;

UNION Operator
The UNION operator is used to combine the result-set of two or more SELECT
statements.

 Each SELECT statement within UNION must have the same number of
columns
 The columns must also have similar data types
 The columns in each SELECT statement must also be in the same order

Syntax
SELECT column_name(s) FROM table1
UNION
SELECT column_name(s) FROM table2;

Example
This example is not particularly meaningful for our database, but it demonstrates the syntax:
25.
USE wtddb;
SELECT TsValue FROM timeseries
UNION
SELECT SElevValue FROM timeseries;

GROUP BY Statement
The GROUP BY statement is often used with aggregate functions (COUNT,
MAX, MIN, SUM, AVG) to group the result-set by one or more columns.

Syntax
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
ORDER BY column_name(s);

Example
26. Count the time series values below 100 in each boundary:
USE wtddb;
SELECT COUNT(TsValue), Well_Boundary_HydroCode
FROM timeseries
WHERE TsValue < 100
GROUP BY Well_Boundary_HydroCode;

SQL Database
CREATE DATABASE Statement
The CREATE DATABASE statement is used to create a new SQL database.

Syntax
CREATE DATABASE databasename;

Example
27. CREATE DATABASE waqas_wtd2;

DROP DATABASE Statement


The DROP DATABASE statement is used to drop an existing SQL database.

Syntax
DROP DATABASE databasename;

Example
28. DROP DATABASE waqas_wtd2;

CREATE TABLE Statement
The CREATE TABLE statement is used to create a new table in a database.
Syntax
CREATE TABLE table_name (
column1 datatype,
column2 datatype,
column3 datatype,
....
);

Example
29.
USE wtddb;
CREATE TABLE borepoint(
HydroID INT,
HydroCode varchar (50),
HGUID INT);

DROP TABLE Statement


The DROP TABLE statement is used to drop an existing table in a database.

Syntax
DROP TABLE table_name;

Example
30. DROP TABLE borepoint;

ALTER TABLE Statement


The ALTER TABLE statement is used to add, delete, or modify columns in an
existing table.

The ALTER TABLE statement is also used to add and drop various constraints on
an existing table.

ALTER TABLE - ADD Column
To add a column in a table, use the following syntax:
ALTER TABLE table_name
ADD column_name datatype;

Example
31.
USE wtddb;
ALTER TABLE boundary
ADD NewCol INT;

ALTER TABLE - DROP COLUMN


To delete a column in a table, use the following syntax (notice that some
database systems don't allow deleting a column):
ALTER TABLE table_name
DROP COLUMN column_name;

Example
32.
ALTER TABLE boundary
DROP COLUMN NewCol;

ALTER TABLE - ALTER/MODIFY COLUMN


To change the data type of a column in a table, use the following syntax:
ALTER TABLE table_name
MODIFY COLUMN column_name datatype;

Example
33.
ALTER TABLE boundary
MODIFY COLUMN NewCol varchar(50);

Create Constraints
Constraints can be specified when the table is created with the CREATE TABLE
statement, or after the table is created with the ALTER TABLE statement.

Syntax
CREATE TABLE table_name (
column1 datatype constraint,
column2 datatype constraint,
column3 datatype constraint,
....
);

SQL Constraints
SQL constraints are used to specify rules for the data in a table.

Constraints are used to limit the type of data that can go into a table. This
ensures the accuracy and reliability of the data in the table. If there is any
violation between the constraint and the data action, the action is aborted.

Constraints can be column level or table level. Column level constraints apply to
a column, and table level constraints apply to the whole table.

The following constraints are commonly used in SQL:

 NOT NULL - Ensures that a column cannot have a NULL value


 UNIQUE - Ensures that all values in a column are different
 PRIMARY KEY - A combination of a NOT NULL and UNIQUE. Uniquely
identifies each row in a table
 FOREIGN KEY - Uniquely identifies a row/record in another table
 CHECK - Ensures that all values in a column satisfy a specific condition
 DEFAULT - Sets a default value for a column when no value is specified (a short sketch of CHECK and DEFAULT follows this list)
 INDEX - Used to create and retrieve data from the database very quickly
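CHECK and DEFAULT do not get their own sections below, so here is a minimal hypothetical sketch of both, modeled on the Admin table used in the following examples (note that MySQL versions before 8.0.16 parse but ignore CHECK constraints):

CREATE TABLE Admin (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int CHECK (Age >= 18),
City varchar(255) DEFAULT 'Jamshoro',
PRIMARY KEY (ID));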

NOT NULL Constraint


By default, a column can hold NULL values.

The NOT NULL constraint enforces a column to NOT accept NULL values.

This enforces a field to always contain a value, which means that you cannot
insert a new record, or update a record without adding a value to this field.

The following SQL ensures that the "ID", "LastName", and "FirstName" columns
will NOT accept NULL values:

Example
34.
CREATE TABLE Admin (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255) NOT NULL,
Age int);

UNIQUE Constraint
The UNIQUE constraint ensures that all values in a column are different.

Both the UNIQUE and PRIMARY KEY constraints provide a guarantee for
uniqueness for a column or set of columns.

A PRIMARY KEY constraint automatically has a UNIQUE constraint.

However, you can have many UNIQUE constraints per table, but only one
PRIMARY KEY constraint per table.

Example

35.
CREATE TABLE Admin (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int,
UNIQUE (ID));

PRIMARY KEY Constraint
The PRIMARY KEY constraint uniquely identifies each record in a database table.

Primary keys must contain UNIQUE values, and cannot contain NULL values.

A table can have only one primary key, which may consist of single or multiple
fields.

PRIMARY KEY on CREATE TABLE


The following SQL creates a PRIMARY KEY on the "ID" column when the
"Admin" table is created:

Example
36.
CREATE TABLE Admin (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int,
PRIMARY KEY (ID) );

PRIMARY KEY on ALTER TABLE


To create a PRIMARY KEY constraint on the "ID" column when the table is
already created, use the following SQL:

Example
37.
ALTER TABLE Admin
ADD PRIMARY KEY (ID);

DROP a PRIMARY KEY Constraint


Example
38.
ALTER TABLE Admin
DROP PRIMARY KEY;

FOREIGN KEY Constraint
A FOREIGN KEY is a key used to link two tables together.

A FOREIGN KEY is a field (or collection of fields) in one table that refers to the
PRIMARY KEY in another table.

The table containing the foreign key is called the child table, and the table
containing the candidate key is called the referenced or parent table.

FOREIGN KEY on CREATE TABLE


The following SQL creates a FOREIGN KEY on the "adminID" column when the
"Sources" table is created:

Example
39.
CREATE TABLE Sources (
SourceID int NOT NULL,
adminID int,
PRIMARY KEY (SourceID),
CONSTRAINT FK_adminSource
FOREIGN KEY (adminID)
REFERENCES admin(ID));

FOREIGN KEY on ALTER TABLE


Example
ALTER TABLE Sources
ADD CONSTRAINT FK_adminSource
FOREIGN KEY (adminID) REFERENCES admin(ID);

DROP a FOREIGN KEY Constraint


Example
40.

ALTER TABLE Sources
DROP FOREIGN KEY FK_adminSource;

EXERCISE 4: Using Structured Query Language [2 pt]

Submission Date: 07.10.2019


Learning Objectives

1. Query, aggregate, and pivot data using Structured Query Language (SQL)
Deliverables
Submit a report that introduces the problem and addresses the questions below. Provide your tables as appendices to your one-page write-up. Additionally, provide an appendix that lists the SQL queries you used to generate the tables (indicating which query produced which table). In your write-up, please also indicate what you have learned.

Computer and Data Requirements

1. Data stored in the database in Exercise 3


2. MySQL workbench and server

Problem

You are working with groundwater quality, depth to water table, and lithology data. The task given to you is to compare the water quality data with the standard FAO guidelines. You will work with the database you created in the previous exercise. Your job is to perform exploratory data analysis using the water quality, water table, and lithology datasets in the database. Perform analyses that may identify potential water quality impairment, soil types, and depth to water table variation in the study area. In your analysis, you should write SQL queries on the database to assemble the following;
1. Report the total number of sites that exceed the water quality standards for salinity, i.e. no problem: EC < 0.75 mmhos/cm; increasing problem: 0.75 < EC < 3.0 mmhos/cm; severe problem: EC > 3.0 mmhos/cm.

2. Create a new column named 'Quality' in the time series table. Based on the EC ranges, assign no problem (NP), increasing problem (IP), or severe problem (SP) to the rows of the column. Report the points that have no problem (NP), increasing problem (IP), and severe problem (SP).
3. Report the shallow, moderate and deep water table points in the study area.

Depth to water table: Type
< 90 cm: Shallow
90 < DWT < 150 cm: Moderate
> 150 cm: Deep

4. Report the subsurface layers that are clay and sand;

HGUCode Description
1 Clay
2 Sand
3 Gravel
4 Shale
5 Silt
6 Lime

Grading Rubric: Assignment-4 Date:
Student (Name and Roll Number):

Category No Doesn’t Meet Nearly Meets Self- Instructor


Meets Standard Exceeds Standard
(Max. Score) Evidence Standard Standard Score Score
Title Absent Evidence of Evidence of three Evidence of Title – can assess main
(1) two or less four point from title alone;
Name, Instructor’s
0 0 0 Name, Course, Date,
1 Neatly finished 1
Introduction Absent, There is no Introduction states The The introduction states
(3) no clear the main topic but introduction the main topic and
introduction either: states the main previews the structure of
evidenc or main topic. 1. Does not topic and the report. Good
e give a full previews the overview of the design
overview, structure of the and strategy. An
1 Or: report. effective summary.
2. Too detailed, Gives enough detail to
leading to interest the reader.
0
annoying 2 3
repetition later.
2
Organizati Not Paragraphs Organization of Paragraph Writer demonstrates
on and applicabl fail to ideas not fully development logic and sequencing of
e develop the developed. present but not ideas through well-
structural main idea. No Paragraphs lack perfected. Each developed paragraphs.
developme evidence of supporting detail paragraph has Each paragraph has
nt of the structure or sentences. No sufficient thoughtful, supporting
idea: organization. transitions. supporting detail sentences that
procedure, detail develop the main idea.
sentences. No The first sentence of
results, transitions. each paragraph is the
discussion 1–5 6-7 summary sentence.
(10) 8 Transitions enhance
structure. 9 - 10
Engineerin Design The writer Sketchy: left out Discussion Provides what was
g point(s) has no clue required design lacks adequate explicitly asked for. The
not what they are points. Did not detail, but all function of each piece is
Calculation addresse talking about. work on this as the necessary demonstrated to the
s and d. 45 – 58% much as you points are reader in adequate, but
Design should have, and it covered and not overwhelming,
(70) 3 – 42% shows. Many nearly all detail. Answers are
important answers answers are correct and reasonable.
are incorrect. correct. 91 – 100%
61 – 79% 82 – 88%
1. Provide requested tables in an Appendix (20)
2. Provide an Appendix with a listing of your SQL queries (20)
3. Provide accurate calculation of water quality exceedance (20)
4. Provide explanatory text and answers to the questions (10)

66
Category No Doesn’t Meet Nearly Meets Self- Instructor
Meets Standard Exceeds Standard
(Max. Score) Evidence Standard Standard Score Score
Word Not Numerous Misspelled words, Almost no Punctuation,
Usage and applicabl and poor English errors in capitalization, spelling,
e distracting grammar and word punctuation, sentence structure, word
Format errors in choice. Main body capitalization, usage, and significant
(10) punctuation, of report is either spelling, figures all correct. Main
capitalization, longer or sentence body of report is one
spelling, significantly less structure, word page or less. Clear,
sentence than one page. usage, consistent fonts. Good
structure, Figures are too significant word processing skills.
word usage, small and/or figures, and Figures have adequate
significant under-labeled, presentation of contrast. Informative
figures, although they are figures, tables, figure and table titles
tables, and usually of and and legends. Figures
figures. Data acceptable quality appendices. have appropriate axis
vomited onto and focus. Tables Main body of tick spacing, labels,
page(s). incoherent or not report is one units, and legends.
Unacceptable cohesive. Bad font page or less Table columns cohesive,
/ sizes. Too much or labeled, and specify
unprofessiona too little data in units. Document is
l at the appendices. Could 8 stapled. Appendices, if
graduate be improved by provided, are separated
level. 1 – 5 being more by topic, and each have
meticulous. a title, discussion, and
6-7 proper formatting and
display of information
9 - 10
Conclusion Absent Incomplete The conclusion The conclusion The conclusion restates
(4) and/or not does not restates the the main results, and is
0 focused. 1 adequately restate main results. 3 an effective summary. 4
the main results. 2
References Absent With many With some errors, With few All cited works; text,
(2) errors, off- appropriate sources errors, good visual, and data sources
0 the-wall were used. sources were are done in the correct
sources used. 1 used format with no errors.
0 2 Uses innovative sources
of information. 2
TOTAL
(100)

CHAPTER 6: INTRODUCTION TO PYTHON PROGRAMMING

Introduction
In this chapter we will learn the fundamentals of Python programming. Python is a programming language that is simple and easy to use; it has a rich set of features and is one of the fastest-growing languages today.

We will learn to use numbers and strings, variables, statements and expressions, and functions and modules; to control program flow using conditional statements, branching, and looping; and to create and call custom functions.

There are various resources that you can look into; a few are given below:

Python Resources

• Beginners Guide to Python (python.org)


• Tutorial (https://docs.python.org/2/tutorial/)
• Think Python by Allen B. Downey
• www.codecademy.com/learn/python
• www.LearnPythonTheHardWay.org

Installation of Python and PyCharm IDE

There are various integrated development environments (IDEs) in which you can write and run Python code. We will use PyCharm in this book. Below are the steps to install and configure the PyCharm IDE.

1. Download Pycharm executable from


https://www.jetbrains.com/pycharm/download/#section=windows
2. Run the pycharm-2017.2.3.exe file that starts the Installation Wizard.
3. Follow all steps suggested by the wizard.
4. Ensure that you install Python along with PyCharm.
5. After installation, run the PyCharm IDE.

6. Your screen will look like below;

7. Now, click on Create New Project, and enter the location where you want to save all the files of the project. You also have to select the interpreter: choose Python 3.4 or whichever version of Python you are using.

8. Now you need to create a Python script file so that you can write your first script. Right-click on the project folder, select Python File, and name it HelloWorld.

9. Write your first code in the script file.

print ("Hello World")

You will notice that when you type in the script file, nothing happens. That is because the script file is just an editor; the code only runs when you execute the script. You can run the script by pressing CTRL+SHIFT+F10 or by right-clicking the script file and selecting Run 'HelloWorld'.

10. Once you execute the script, you will see Hello World printed in the output window. "Process finished with exit code 0" means that there are no errors in the code.

Python Basics
Data Type and Structure
The data type of an object determines what type of values it can hold and what operations can be performed on the object.

Datatypes
 Strings
String values consist of one or more characters, which may include letters, numbers, and other types of data.
 Numbers
There are two numeric data types: integers (whole numbers) and floats (fractional numbers).
 List
A list is analogous to an array. You store the items in square brackets, separated by commas. For example: list1 = [1,2,3,4,5,6]
 Tuple
Tuples are similar to lists, but you cannot modify them; you can only replace them. You store the items in round brackets, separated by commas. For example: tuple1 = (1,2,3,4,5,6)
 Dictionary
A dictionary is similar to a lookup table. In this data type an identification key is assigned, which references the data. For example: dict1 = {'key':[1,2,3,4,5,6]}

Data Structures
A data structure is a collection of data elements that are organized in some way.
 The sequence is the most common data structure; each element of a sequence is assigned an index. Strings, lists and tuples are examples of sequences.

Example for Data types

Here we will see how we can use different data types in Python.

Numbers
Numbers can be integers or floats.
Integer: 1
Float: 1.0

Code
print (7+3)
print (7-3)
print (7*3)
print (7/3)
print (7%3)

Output
10
4
21
2.3333333333333335
1
Process finished with exit code 0

Variables

Variables are used to store information. Variable is basically a name that represents a value.

Code
x = 7 # this is called an assignment statement. You assign 7 to a variable x
y = 3
print (x+y)
print(x-y)
print (x*y)
print (x/y)
print (x%y)
Output
10
4
21
2.3333333333333335
1

Process finished with exit code 0


In Python you do not need to declare a variable and define its data type before using it; you can assign a value directly. Python has dynamic typing, i.e. it infers the type of a variable from the value assigned to it.
For example:
x = 7 # Integer
x = 7.0 # Float
x = "MehranUET" # String
Basic rules for naming variable:
 Variable name can consist of letters, numbers and underscores ( _ )
 Variable name cannot begin with number. 1Var is wrong. Var1 is correct.
 Python keywords, such as import, cannot be used as variable names, and built-in names such as print should also be avoided
 Use descriptive name
 Follow conventions. Python has an official naming convention guide, the "Style Guide for Python Code" (PEP 8)
 Keep it short

Strings

A set of characters surrounded by quotation marks is called a string literal. You can use single (' ') or double (" ") quotes; both work the same.

Code
x = "Fall Semester"
y = "2019"
print (x+y)
Output

Fall Semester2019

Process finished with exit code 0

A string is a sequence, which means that you can also extract individual letters from it. Suppose I want to print only "Fall"; I can specify an index range on the x variable and it will print only those letters. You will notice in the code below that the index starts from 0: in Python, indexing always starts from 0. So what does x[0:4] mean? It means: print the characters at indices 0, 1, 2 and 3.

Code
x = "Fall Semester"
y = "2019"
print (x[0:4])
Output
Fall

Process finished with exit code 0

One important thing to note is that when we are working with strings, everything being joined must be a string. Suppose I want to print "Today temperature is x Celsius", and I input x = 35 as an integer; this will give us an error. The fix is to convert x to a string, which we can do using the str() built-in function.
Code
x = 35
print ("Today temperature is" + x + "Celsius")
Output
Traceback (most recent call last):
File
"E:/PCAS_W_Mehran/1.courses/Hydroinformatics/2019_Fall/Lecture/Lecture7/Ex
ampleBasics.py", line 2, in <module>
print ("Today temperature is" + x + "Celsius")
TypeError: Can't convert 'int' object to str implicitly
Process finished with exit code 1

Code (Fixed)
x = 35
print ("Today temperature is" + str(x) + "Celsius")
Output

Today temperature is 35 Celsius

Process finished with exit code 0

List
The list is one of the most important data types in Python. A list contains items surrounded by square brackets [] and separated by commas. Individual items in a list can be strings, integers, floats, and several other data types.
Suppose I am interested in storing the information of the data model created earlier (Figure 13) in the list datatype. I can do so as described below;
Code
# example of list datatype
# list that store names of entities
entities = ['boundary','well','timeseries','variabledefinition']
# list that store information of attributes for each entity
boundary = ['HydroID','HydroCode','Name','Btype']
well = ['HydroID','HydroCode','Ftype','Longitude','Latitude']
timeseries = ['FeatureID','TsYear','UTCOffset','TSValue','SElevValue']
variabledefinition = ['HydroID','VarName','VarDesc','VarUnits']
print (entities)
print (boundary)
print (well)
print (timeseries)
print(variabledefinition)

Output
['boundary', 'well', 'timeseries', 'variabledefinition']
['HydroID', 'HydroCode', 'Name', 'Btype']
['HydroID', 'HydroCode', 'Ftype', 'Longitude', 'Latitude']
['FeatureID', 'TsYear', 'UTCOffset', 'TSValue', 'SElevValue']
['HydroID', 'VarName', 'VarDesc', 'VarUnits']

Process finished with exit code 0

Now, suppose I also want to store the data for each entity. I can store the data in a 2D list, as shown in the code below;
Code

# example of list datatype


#
# list that store names of entities

entities = ['boundary','well','timeseries','variabledefinition']
# list that store information of attributes for each entity
boundary = ['HydroID','HydroCode','Name','Btype']
well = ['HydroID','HydroCode','Ftype','Longitude','Latitude','BID']
timeseries = ['FeatureID','TsYear','UTCOffset','TSValue','SElevValue']
variabledefinition = ['HydroID','VarName','VarDesc','VarUnits']

dataWell = [[1,'104','obp',68.7603,24.7931,'Ful'],
[2,'107','obp',68.6069,24.8106,'Ful'],
[3,'108','obp',68.5503,24.8106,'Ful']]

# Print the complete data
print (dataWell)

# Print the first row of data
print (dataWell [0])

# Print the first column of the first row of data, which is the HydroID of the well
print (dataWell [0][0])

Output

[[1, '104', 'obp', 68.7603, 24.7931, 'Ful'], [2, '107', 'obp', 68.6069,
24.8106, 'Ful'], [3, '108', 'obp', 68.5503, 24.8106, 'Ful']]

[1, '104', 'obp', 68.7603, 24.7931, 'Ful']

1
Process finished with exit code 0

Tuple
Tuples are sequences of elements, just like lists, but tuples are immutable, meaning that they cannot be changed.

Code
# example of tuple datatype
# Tuple that store names of entities
entities = ('boundary','well','timeseries','variabledefinition')
# Tuple that store information of attributes for each entity
boundary = ('HydroID','HydroCode','Name','Btype')
well = ('HydroID','HydroCode','Ftype','Longitude','Latitude')
timeseries = ('FeatureID','TsYear','UTCOffset','TSValue','SElevValue')
variabledefinition = ('HydroID','VarName','VarDesc','VarUnits')
print (entities)
print (boundary)
print (well)
print (timeseries)
print(variabledefinition)

Output
('boundary', 'well', 'timeseries', 'variabledefinition')
('HydroID', 'HydroCode', 'Name', 'Btype')
('HydroID', 'HydroCode', 'Ftype', 'Longitude', 'Latitude')
('FeatureID', 'TsYear', 'UTCOffset', 'TSValue', 'SElevValue')
('HydroID', 'VarName', 'VarDesc', 'VarUnits')

Process finished with exit code 0
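
Because tuples are immutable, trying to assign to an element raises an error; a quick sketch:

Code
# tuples cannot be modified in place
entities = ('boundary','well','timeseries','variabledefinition')
entities[0] = 'aquifer' # this line raises a TypeError

Output
Traceback (most recent call last):
  ...
TypeError: 'tuple' object does not support item assignment

Process finished with exit code 1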

Dictionary
Dictionaries consist of pairs of keys and their corresponding values. The pairs are referred to as the items of the dictionary.
A dictionary item consists of a key, followed by a colon (:), and then the corresponding value. The dictionary itself is surrounded by {} brackets.
Code
# example of Dictionary datatype
# dictionary that store names of entities and their attributes

datamodel = {'boundary':['HydroID','HydroCode','Name','Btype'],
'well':['HydroID','HydroCode','Ftype','Longitude','Latitude'],
'timeseries':['FeatureID','TsYear','UTCOffset','TSValue','SElevValue'],
'variabledefinition':['HydroID','VarName','VarDesc','VarUnits']}

print (datamodel['boundary'])
print (datamodel['well'])
print (datamodel['timeseries'])
print (datamodel['variabledefinition'])

Output
['HydroID', 'HydroCode', 'Name', 'Btype']
['HydroID', 'HydroCode', 'Ftype', 'Longitude', 'Latitude']
['FeatureID', 'TsYear', 'UTCOffset', 'TSValue', 'SElevValue']
['HydroID', 'VarName', 'VarDesc', 'VarUnits']

Process finished with exit code 0

Conditional statements
So far we have seen simple statements, but in many cases we need to take logical decisions in our program or execute a certain portion of code repeatedly.
One way to do this is with conditional statements, branching the code using if statements.

Code
# example of an if statement
x = 10
if x == 10:
    print ('Number is 10')

Output
Number is 10

Process finished with exit code 0

We can use multiple comparison operators, such as == (equal to), != (not equal to), < (less than), > (greater than), <= (less than or equal to) and >= (greater than or equal to).
Code
# example of if and elif statements
x = 11
if x == 10:
    print ('Number is 10')
elif x < 10:
    print ('Number is less than 10')
else:
    print ('Number is greater than 10')
Output
Number is greater than 10

Process finished with exit code 0

Loops

To repeat a certain action we use a loop structure. There are two types of loops, i.e. the for loop and the while loop.

Code
# example of a while loop
i = 0
while i <= 10:
    print (i)
    i += 1

Output
0
1
2
3
4
5
6
7
8
9
10

Process finished with exit code 0

Code

# example of a for loop
for i in range (0,11):
    print (i)

Output
0
1
2
3
4
5
6
7
8
9
10

Process finished with exit code 0

Functions
Custom functions make repeated programming tasks easy. By creating functions you can organize your code into logical parts and reuse them frequently.
Functions are organized into modules, and modules can be organized into packages.
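
As a quick illustration of using a function from a module, here is a minimal sketch with the built-in math module:

Code
# example of calling a function from a module
import math
print (math.sqrt(16))

Output
4.0

Process finished with exit code 0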

Code
# example of creating a function
# defining a function
# addnumber is the function name
# x, y are the input arguments
# z is returned as the output
def addnumber (x,y):
    z = x + y
    return z
# calling a function
# here we call the function by giving it inputs
ans = addnumber(1,2)
print (ans)

Output
3

Process finished with exit code 0

Examples of some built-in functions in Python
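The original figure listing built-in functions is not reproduced here; as a substitute, a short sketch of a few commonly used built-ins (len, max, min, sum, round):

Code
# a few commonly used built-in functions
values = [30.46, 28.3, 30.97, 28.18]
print (len(values)) # number of items
print (max(values)) # largest value
print (min(values)) # smallest value
print (round(sum(values), 2)) # total, rounded to 2 decimal places

Output
4
30.97
28.18
117.91

Process finished with exit code 0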

Plotting data using matplotlib

There are several libraries available for plotting data in Python. In this book we will use the matplotlib package. The first step is to install the library among your Python packages.

Using the command prompt, the package can be installed with pip:

pip install matplotlib
Using PyCharm
 Press Ctrl+Alt+S or go to File -> Settings…
 Select the Project tab
 Select Project Interpreter
 Click on the + sign
 Search -> matplotlib
 Install Package
 Once installation is complete, you can test it by importing the library. The output should not give any error; then you are good to go.

Code
import matplotlib

Output

Process finished with exit code 0

Data

We will use time series data of DTW measured at different wells to demonstrate plotting in Python. The data is stored as follows;

well = ['MO234','MO273','NC014']
# time in days
time = [0,182.52,365.04,547.56,730.08,912.6,1095.12,1277.64]
# data in meters
data = {'MO234':[30.46,28.3,30.97,28.18,29.19,28.28,29.54,28.75],
'MO273':[23.89,22.51,24.44,22.43,22.7,22.29,24.64,23.34],
'NC014':[36.8,35.55,36.9,36.49,36.93,36.38,36.83,36.63]}

Line graph

Code
import matplotlib
# here we import only the pyplot module from matplotlib
from matplotlib import pyplot as plt
well = ['MO234','MO273','NC014']
time = [0,182.52,365.04,547.56,730.08,912.6,1095.12,1277.64]
data = {'MO234':[30.46,28.3,30.97,28.18,29.19,28.28,29.54,28.75],
'MO273':[23.89,22.51,24.44,22.43,22.7,22.29,24.64,23.34],
'NC014':[36.8,35.55,36.9,36.49,36.93,36.38,36.83,36.63]}
# plots data w.r.t time
plt.plot(time,data['MO234'])
# show the plot on the screen
plt.show()

Output
(Figure: line graph of DTW versus time for well MO234.)
You can assign labels and further customize the plot as shown below. You can also directly save the figure for use in a report.

Code

import matplotlib
# here we import only the pyplot module from matplotlib
from matplotlib import pyplot as plt
well = ['MO234','MO273','NC014']
time = [0,182.52,365.04,547.56,730.08,912.6,1095.12,1277.64]
data = {'MO234':[30.46,28.3,30.97,28.18,29.19,28.28,29.54,28.75],
'MO273':[23.89,22.51,24.44,22.43,22.7,22.29,24.64,23.34],
'NC014':[36.8,35.55,36.9,36.49,36.93,36.38,36.83,36.63]}
# plots data w.r.t time
plt.plot(time,data['MO234'])
plt.xlabel('Time (days)')
plt.ylabel('DTW (meters)')
plt.grid()
# save figure to a file
plt.savefig('fig1.png')
# show the plot on the screen
plt.show()

Output
(Figure: line graph of DTW versus time with axis labels and a grid, saved as fig1.png.)
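Several wells can also be drawn on one line graph by looping over the well list, in the same way the scatter plot example at the end of this chapter does; a sketch using the same data:

Code
import matplotlib
# here we import only the pyplot module from matplotlib
from matplotlib import pyplot as plt
well = ['MO234','MO273','NC014']
time = [0,182.52,365.04,547.56,730.08,912.6,1095.12,1277.64]
data = {'MO234':[30.46,28.3,30.97,28.18,29.19,28.28,29.54,28.75],
'MO273':[23.89,22.51,24.44,22.43,22.7,22.29,24.64,23.34],
'NC014':[36.8,35.55,36.9,36.49,36.93,36.38,36.83,36.63]}
# plot one line per well
for i in well:
    plt.plot(time, data[i])
plt.xlabel('Time (days)')
plt.ylabel('DTW (meters)')
plt.legend(well)
plt.grid()
# show the plot on the screen
plt.show()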
Histograms
Histograms are familiar graphics, and their construction is detailed in numerous introductory texts on statistics. Bars are drawn whose height is the number n_i, or fraction n_i/n, of the data falling into one of several categories or intervals.
Histograms are quite useful for depicting large differences in shape or symmetry, such as
whether a data set appears symmetric or skewed
Limitation
Histograms have one primary deficiency -- their visual impression depends on the number of
categories selected for the plot
For data measured on a continuous scale (such as streamflow or concentration), histograms are
not the best method for graphical analysis. The process of forcing continuous data into discrete
categories may obscure important characteristics of the distribution. However, histograms are
excellent when displaying data which have natural categories or groupings. Examples of such
data would include the number of individual organisms found at a stream site grouped by species
type, or the number of water-supply wells exceeding some critical yield grouped by geologic
unit.
Code

import matplotlib
# here we import only the pyplot module from matplotlib
from matplotlib import pyplot as plt
well = ['MO234','MO273','NC014']
time = [0,182.52,365.04,547.56,730.08,912.6,1095.12,1277.64]
data = {'MO234':[30.46,28.3,30.97,28.18,29.19,28.28,29.54,28.75],
'MO273':[23.89,22.51,24.44,22.43,22.7,22.29,24.64,23.34],
'NC014':[36.8,35.55,36.9,36.49,36.93,36.38,36.83,36.63]}
# plots histogram
plt.hist(data['MO234'])
plt.xlabel('Class')
plt.ylabel('frequency')
plt.grid()
# save figure to a file
plt.savefig('fig2.png')
# show the plot on the screen
plt.show()

Output
(Figure: histogram of the DTW values for well MO234, saved as fig2.png.)
Box Plot
A very useful and concise graphical display for summarizing the distribution of a data set is the
boxplot. Boxplots provide visual summaries of;
 the center of the data (the median--the center line of the box)
 the variation or spread (interquartile range IQR--the box height, i.e. third quartile minus first quartile)
 the skewness (quartile skew--the relative size of the box halves)
 the presence or absence of unusual values ("outside" and "far outside" values). Outliers are values greater than Q3 + 1.5(IQR) or less than Q1 - 1.5(IQR).

Boxplots are even more useful in comparing these attributes among several data sets. Boxplots
are often put side-by-side to visually compare and contrast groups of data.
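
The quartiles and fences can also be computed directly; a sketch using Python's statistics module (available in Python 3.8 and later) on the MO234 data:

Code
# compute quartiles, IQR, and outlier fences for well MO234
import statistics
mo234 = [30.46,28.3,30.97,28.18,29.19,28.28,29.54,28.75]
# quantiles with n=4 returns the three quartiles Q1, Q2 (the median), Q3
q1, q2, q3 = statistics.quantiles(mo234, n=4)
iqr = q3 - q1
print ('Q1 =', q1, ' Median =', q2, ' Q3 =', q3)
print ('IQR =', iqr)
print ('Outlier fences:', q1 - 1.5*iqr, 'to', q3 + 1.5*iqr)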

Code

import matplotlib
# here we import only the pyplot module from matplotlib
from matplotlib import pyplot as plt
well = ['MO234','MO273','NC014']
time = [0,182.52,365.04,547.56,730.08,912.6,1095.12,1277.64]
data = {'MO234':[30.46,28.3,30.97,28.18,29.19,28.28,29.54,28.75],
'MO273':[23.89,22.51,24.44,22.43,22.7,22.29,24.64,23.34],
'NC014':[36.8,35.55,36.9,36.49,36.93,36.38,36.83,36.63]}
# plots boxplot for single dataset
dataBoxP = [data['MO234']]
plt.boxplot(dataBoxP)
plt.grid()
# save figure to a file
plt.savefig('fig3.png')
# show the plot on the screen
plt.show()

Output
(Figure: box plot of the MO234 data, saved as fig3.png.)
Code

import matplotlib
# here we import only the pyplot module from matplotlib
from matplotlib import pyplot as plt
well = ['MO234','MO273','NC014']
time = [0,182.52,365.04,547.56,730.08,912.6,1095.12,1277.64]
data = {'MO234':[30.46,28.3,30.97,28.18,29.19,28.28,29.54,28.75],
'MO273':[23.89,22.51,24.44,22.43,22.7,22.29,24.64,23.34],
'NC014':[36.8,35.55,36.9,36.49,36.93,36.38,36.83,36.63]}
# plots boxplot for multiple dataset
dataBoxP = [data['MO234'],data['MO273'],data['NC014']]
plt.boxplot(dataBoxP)
plt.grid()
# save figure to a file
plt.savefig('fig4.png')
# show the plot on the screen
plt.show()

Output
(Figure: side-by-side box plots of the three wells, saved as fig4.png.)
You can assign labels and further customize the box plot as shown below. You can also directly save the figure for use in a report.

Code

import matplotlib
# here we import only the pyplot module from matplotlib
from matplotlib import pyplot as plt
well = ['MO234','MO273','NC014']
time = [0,182.52,365.04,547.56,730.08,912.6,1095.12,1277.64]
data = {'MO234':[30.46,28.3,30.97,28.18,29.19,28.28,29.54,28.75],
'MO273':[23.89,22.51,24.44,22.43,22.7,22.29,24.64,23.34],
'NC014':[36.8,35.55,36.9,36.49,36.93,36.38,36.83,36.63]}
# plots boxplot for multiple dataset
dataBoxP = [data['MO234'],data['MO273'],data['NC014']]
plt.boxplot(dataBoxP,labels = ['MO234','MO273','NC014'])
plt.ylabel('DTW [m]')
plt.title('Box plot of wells')
plt.grid()
# save figure to a file
plt.savefig('fig5.png')
# show the plot on the screen
plt.show()

Output
(Figure: labeled box plots of the three wells, saved as fig5.png.)
Scatterplots
The two-dimensional scatterplot is one of the most familiar graphical methods for data analysis.
It illustrates the relationship between two variables.
A scatter plot is a plot of the values of Y versus the corresponding values of X:
 Vertical axis: variable Y--usually the response variable
 Horizontal axis: variable X--usually some variable we suspect may be related to the
response

Scatter plots can provide answers to the following questions:


 Are variables X and Y related?
 Are variables X and Y linearly related?
 Are variables X and Y non-linearly related?
 Does the variation in Y change depending on X?
 Are there outliers?

Here we will use one extra dataset of electrical conductivity (EC) to infer the relationship between DTW and EC.
Code
import matplotlib
# here we import only the pyplot module from matplotlib
from matplotlib import pyplot as plt
well = ['MO234','MO273','NC014']
time = [0,182.52,365.04,547.56,730.08,912.6,1095.12,1277.64]
data = {'MO234':[30.46,28.3,30.97,28.18,29.19,28.28,29.54,28.75],
'MO273':[23.89,22.51,24.44,22.43,22.7,22.29,24.64,23.34],
'NC014':[36.8,35.55,36.9,36.49,36.93,36.38,36.83,36.63]}
dataEC = {'MO234':[2000,1400,2100,1500,1800,1800,2000,1700],
'MO273':[3000,2400,3100,1000,1000,2800,2000,2700],
'NC014':[200,500,700,500,800,900,700,700]}
# plots scatter plot for two variables
plt.scatter(data['MO234'],dataEC ['MO234'])
plt.xlabel('DTW [m]')
plt.ylabel('EC[uS/cm]')
plt.title('Scatter plot DTW v/s EC')
plt.grid()
# save figure to a file
plt.savefig('fig6.png')
# show the plot on the screen
plt.show()

Output
(Figure: scatter plot of DTW versus EC for well MO234, saved as fig6.png.)
Now, let’s suppose I want to show all three datasets on one scatter plot. Then I can do as follows;

import matplotlib
# here we import only the pyplot module from matplotlib
from matplotlib import pyplot as plt
well = ['MO234','MO273','NC014']
time = [0,182.52,365.04,547.56,730.08,912.6,1095.12,1277.64]
data = {'MO234':[30.46,28.3,30.97,28.18,29.19,28.28,29.54,28.75],
'MO273':[23.89,22.51,24.44,22.43,22.7,22.29,24.64,23.34],
'NC014':[36.8,35.55,36.9,36.49,36.93,36.38,36.83,36.63]}
dataEC = {'MO234':[2000,1400,2100,1500,1800,1800,2000,1700],
'MO273':[3000,2400,3100,1000,1000,2800,2000,2700],
'NC014':[200,500,700,500,800,900,700,700]}

# plot multiple scatter series on one plot, one per well
for i in well:
    plt.scatter(data[i],dataEC[i])
plt.xlabel('DTW [m]')
plt.ylabel('EC[uS/cm]')
plt.title('Scatter plot DTW v/s EC')
plt.legend(well)
plt.grid()
# save figure to a file
plt.savefig('fig7.png')
# show the plot on the screen
plt.show()

Output
(Figure: scatter plot of DTW versus EC for all three wells, with a legend, saved as fig7.png.)