
4.0 Intended Learning Outcomes and Topics
Intended Learning Outcomes
At the end of this module, the students will be able to:

• Define what relational database modeling and normalization are (Knowledge)
• Outline the procedure of relational database modeling (Comprehension)
• Correlate applications of database normalization (Application and Analysis)
• Design their own company's database and apply normalization (Synthesis)
• Compare each group's work and recommend improvements to their colleagues (Evaluation)

Topics
4.1 | Relational Database Modeling and Normalization
4.1 | Relational Database Modeling

Introduction to Relational Databases

▪ The easiest way to understand a database is as a collection of related files.
Imagine a file (either paper or digital) of sales orders in a shop. Then there's another
file of products, containing stock records. To fulfil an order, you'd need to look up the
product in the order file and then look up and adjust the stock levels for that
particular product in the product file. A database and the software that controls it,
called a database management system (DBMS), help with this kind of task.
▪ Most databases today are relational databases, named such because they deal
with tables of data related by a common field. For example, Table 1 below shows the
product table, and Table 2 shows the invoice table. As you can see, the relation
between the two tables is based on the common field ProductCode . Any two tables
can relate to each other simply by having a field in common.
Database Terminology
Let's take a closer look at the previous two tables to see how they are organized:

• Each table consists of many rows and columns.
• Each new row contains data about one single entity (such as one product or one
order line). This is called a record, or tuple. For example, the first row in Table 1 is a
record; it describes the A150 product, which is a bag of cement that costs Php 220.00.
The terms row and record are interchangeable.
• Each column (also called an attribute) contains one piece of data that relates to the
record. Examples of attributes are the quantity of an item sold or the
price of a product. An attribute, when referring to a database table, is called a field.
For example, the Description column in Table 1 is a field. The
terms attribute and field are interchangeable.
Understanding the Hierarchical Database Model

▪ The earliest model was the hierarchical database model, resembling an upside-down
tree.
▪ Files are related in a parent-child manner, with each parent capable of relating to
more than one child, but each child only being related to one parent. Most of you will
be familiar with this kind of structure—it’s the way most file systems work. There is
usually a root, or top-level, directory that contains various other directories and files.
Each subdirectory can then contain more files and directories, and so on. Each file
or directory can only exist in one directory itself—it only has one parent. As you can
see in the image below, A1 is the root directory, and its children are B1 and B2. B1 is
a parent to C1, C2, and C3, which in turn have children of their own.

This model, although a vast improvement on dealing with unrelated files, has
some serious disadvantages. It represents one-to-many relationships well (one parent
has many children; for example, one company branch has many employees), but it has
problems with many-to-many relationships. Relationships such as that between a
product file and an orders file are difficult to implement in a hierarchical model.
Specifically, an order can contain many products, and a product can appear in many
orders.
Also, the hierarchical model is not flexible because adding new relationships can result
in wholesale changes to the existing structure, which in turn means all existing
applications need to change as well. This is not fun when someone has forgotten a
table and wants it added to the system shortly before the project is due to launch! And
developing the applications is complex because the programmer needs to know the
data structure well in order to traverse the model to access the needed data. As you’ve
seen in the earlier chapters, when accessing data from two related tables, you only
need to know the fields you require from those two tables. In the hierarchical model,
you’d need to know the entire chain between the two. For example, to relate data
from A1 and D4, you’d need to take the route: A1, B1, C3 and D4.
Understanding the Network Database Model
The network database model was a progression from the hierarchical database
model and was designed to solve some of that model's problems, specifically the lack of
flexibility. Instead of only allowing each child to have one parent, this model allows each
child to have multiple parents (it calls the children members and the parents owners). It
addresses the need to model more complex relationships such as the orders/parts
many-to-many relationship mentioned in the section on the hierarchical model. As you can see in the
figure below, A1 has two members, B1 and B2. B1 is the owner of C1, C2, C3 and C4.
However, in this model, C4 has two owners, B1 and B2.

Of course, this model has its problems, or everyone would still be using it. It is more difficult to
implement and maintain, and, although more flexible than the hierarchical model, it still has
flexibility problems. Not all relations can be satisfied by assigning another owner, and the
programmer still has to understand the data structure well in order to make the model efficient.

Understanding the Relational Database Model


The relational database model was a huge leap forward from the network database
model. Instead of relying on a parent-child or owner-member relationship, the relational
model allows any file to be related to any other by means of a common field. Suddenly,
the complexity of the design was greatly reduced because changes could be made to
the database schema without affecting the system's ability to access data. And because
access was not by means of paths to and from files, but from a direct relationship
between files, new relations between these files could easily be added.
In 1970, when E.F. Codd developed the model, it was thought to be impractical. The
increased ease of use comes at a large performance penalty, and the hardware in those
days was not able to implement the model. Since then, of course, hardware has taken
huge strides to where today, even the simplest computers can run sophisticated
relational database management systems.
Relational databases go hand-in-hand with the development of SQL. The simplicity of
SQL - where even a novice can learn to perform basic queries in a short period of time -
is a large part of the reason for the popularity of the relational model.
The two tables below relate to each other through the ProductCode field. Any two tables
can relate to each other simply by creating a field they have in common.
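
As a minimal sketch of how a common field relates two tables, the following uses Python's built-in sqlite3 module with an in-memory database. Only the A150 bag of cement at Php 220.00 comes from the text; the invoice columns, the second product, and the quantities are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        ProductCode TEXT PRIMARY KEY,
        Description TEXT NOT NULL,
        Price       REAL NOT NULL
    );
    CREATE TABLE invoice (
        InvoiceCode INTEGER,
        InvoiceLine INTEGER,
        ProductCode TEXT,       -- the common field shared with product
        Quantity    INTEGER,
        PRIMARY KEY (InvoiceCode, InvoiceLine)
    );
    INSERT INTO product VALUES ('A150', 'Bag of cement', 220.00);
    INSERT INTO product VALUES ('B200', 'Box of nails', 45.00);   -- hypothetical row
    INSERT INTO invoice VALUES (1001, 1, 'A150', 3);              -- hypothetical rows
    INSERT INTO invoice VALUES (1001, 2, 'B200', 10);
""")

# Relate the two tables purely through the shared ProductCode field.
rows = conn.execute("""
    SELECT i.InvoiceCode, i.InvoiceLine, p.Description,
           i.Quantity * p.Price AS LineTotal
    FROM invoice AS i
    JOIN product AS p ON p.ProductCode = i.ProductCode
    ORDER BY i.InvoiceCode, i.InvoiceLine
""").fetchall()

for row in rows:
    print(row)
```

The query names only the fields it needs and the field the tables share; it does not need to know any path through the data, which is the contrast with the hierarchical and network models.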

Relational Databases: Basic Terms


The relational database model uses certain terms to describe its components:

• Data are the values kept in the database. On their own, the data means very
little. CA 684-213 is an example of data in a DMV (Division of Motor Vehicles)
database.
• Information is processed data. For example, CA 684-213 is the car registration
number of a car belonging to Lyndon Manson, in a DMV database.
• A database is a collection of tables, also called entities.
• Each table is made up of records (the horizontal rows in the table, also
called tuples). Each record should be unique, and can be stored in any order in the
table.
• Each record is made up of fields (which are the vertical columns of the table, also
called attributes). Basically, a record is one fact (for example, one customer or one
sale).
• These fields can be of various types. Generally, types fall into three kinds: character,
numeric, and date. For example, a customer name is a character field, a customer's
birthday is a date field, and a customer's number of children is a numeric field.
• The range of allowed values for a field is called the domain (also called a field
specification). For example, a credit card field may be limited to only the
values Mastercard , Visa and Amex .
• A field is said to contain a null value when it contains nothing at all. Null fields can
create complexities in calculations and have consequences for data accuracy. For
this reason, many fields are specifically set not to contain null values.
• A key accesses specific records in a table.
• An index is a mechanism to improve the performance of a database. Indexes are
often confused with keys. Indexes are, strictly speaking, part of the physical
structure, and keys are part of the logical structure. You'll often see the terms used
interchangeably, however, including throughout this module.
• A view is a virtual table made up of a subset of the actual tables.
• A one-to-one (1:1) relationship is where for each instance of the first table in a
relationship, only one instance of the second table exists. An example of this would
be a case where a chain of stores carries a vending machine. Each vending
machine can only be in one store, and each store carries only one vending machine.

• A one-to-many (1:N) relationship is where for each instance of the first table in a
relationship, many instances of the second table exist. This is a common kind of
relationship. An example is the relationship between a sculptor and their sculptures.
Each sculptor may have created many sculptures, but each sculpture has been created
by only one sculptor.

• A many-to-many (M:N) relationship occurs where, for each instance of the first table,
there are many instances of the second table, and for each instance of the second table,
there are many instances of the first. For example, a student can have many lecturers,
and a lecturer can have many students.

• A mandatory relationship exists where for each instance of the first table in a
relationship, one or more instances of the second must exist. For example, for a music
group to exist, there must exist at least one musician in that group.
• An optional relationship is where for each instance of the first table in a relationship,
there may exist instances of the second. For example, if an author can be listed in the
database without having written a book (in other words, a prospective author), that
relationship is optional. The reverse isn't necessarily true though. For example, for a
book to be listed, it must have an author.
• Data integrity refers to the condition where data is accurate, valid, and consistent. An
example of poor integrity would be if a customer telephone number is stored differently in
two different locations. Another is where a course record contains a reference to a
lecturer who is no longer present at the school. Database normalization is a technique
that assists you to minimize the risk of these sorts of problems.
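
Two of the terms above, the domain of a field and the prohibition of null values, can be sketched as table constraints. This is an illustrative example using Python's built-in sqlite3 module; the payment table and its columns are hypothetical, and only the three allowed card values come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payment (
        payment_id INTEGER PRIMARY KEY,
        amount     REAL NOT NULL,          -- null values are not allowed here
        card_type  TEXT NOT NULL
                   CHECK (card_type IN ('Mastercard', 'Visa', 'Amex'))  -- the domain
    )
""")

def try_insert(amount, card_type):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO payment (amount, card_type) VALUES (?, ?)",
            (amount, card_type),
        )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert(150.0, "Visa")          # inside the domain
outside_domain = try_insert(150.0, "Diners")  # value outside the domain
null_amount = try_insert(None, "Amex")        # null in a NOT NULL field

print(accepted, outside_domain, null_amount)
```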

Relational Databases: Table Keys


A key, or index, as the term itself indicates, unlocks access to the tables. If you know
the key, you know how to identify specific records and the relationships between the
tables.
Each key consists of one or more fields, or field prefix. The order of columns in an index
is significant. Each key has a name.
A candidate key is a field, or combination of fields, that uniquely identifies a record. It
cannot contain a null value, and its value must be unique. (With duplicates, you would
no longer be identifying a unique record).
A primary key (PK) is a candidate key that has been designated to identify unique
records in the table throughout the database structure.
A surrogate key is a primary key that contains unique values automatically generated by
the database system - usually, integer numbers. A surrogate key has no meaning,
except uniquely identifying a record. This is the most common type of primary key.
For example, see the following table:

At first glance, there are two possible candidate keys for this table.
Either CustomerCode or a combination of FirstName, Surname and TelNo would
suffice. It is generally better to choose the candidate key with the fewest fields for
the primary key, so you would choose CustomerCode in this example (note that it is a
surrogate key). Upon reflection, there is also the possibility of the second combination
not being unique. The combination of FirstName, Surname and TelNo could in theory
be duplicated, such as where a father has a son of the same name who is contactable
at the same telephone number. This system would have to expressly exclude this
possibility for these three fields to be considered for the status of primary key.
There may be many Ariane Edisons, but you avoid confusion by assigning each a
unique number. Once a primary key has been created, the remaining candidates are
labeled as alternate keys.
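
The customer example can be sketched as follows, assuming Python's built-in sqlite3 module. The telephone numbers are invented for illustration. CustomerCode is declared as a surrogate primary key, and the three-field combination is declared as an alternate key with a UNIQUE constraint, which shows the father-and-son problem described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        CustomerCode INTEGER PRIMARY KEY,    -- surrogate key, generated by the database
        FirstName    TEXT NOT NULL,
        Surname      TEXT NOT NULL,
        TelNo        TEXT NOT NULL,
        UNIQUE (FirstName, Surname, TelNo)   -- alternate key
    )
""")

insert = "INSERT INTO customer (FirstName, Surname, TelNo) VALUES (?, ?, ?)"

# Two people with the same name are fine: each gets its own generated code.
first = conn.execute(insert, ("Ariane", "Edison", "448-2143")).lastrowid
second = conn.execute(insert, ("Ariane", "Edison", "448-9921")).lastrowid

# Same name AND same number (the father-and-son case) violates the alternate key.
try:
    conn.execute(insert, ("Ariane", "Edison", "448-2143"))
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(first, second, duplicate_allowed)
```

If such duplicates are legitimate in the business, the UNIQUE constraint would have to be dropped and the combination would no longer be a candidate key at all.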

Relational Databases: Foreign Keys


You already know that a relationship between two tables is created by assigning a
common field to the two tables (see Relational Databases: Table Keys). This common
field must be the primary key of one of the tables. Consider a relationship between
a customer table and a sale table. The relationship is not much good if, instead of using
the primary key, customer_code, in the sale table, you use another field that is not
unique, such as the customer's first name. You would be unlikely to know for sure which
customer made the purchase in that case. So customer_code is called
a foreign key in the sale table; in other words, it is the primary key in a foreign table.
Foreign keys make referential integrity possible: every value of a foreign key must refer
to an existing record in the related table. Consider the lecturer and course tables below,
where Lecturer Code in the course table is a foreign key referring to Code in the lecturer table.

Lecturer table

Code  First Name  Surname
1     Anne        Cohen
2     Leonard     Clark
3     Vusi        Cave

Course table

Course Title                 Lecturer Code
Introduction to Programming  1
Information Systems          2
Systems Software             3

Referential integrity exists here, as all the lecturers in the course table exist in the lecturer table.
However, let's assume Anne Cohen leaves the institution, and you remove her from the lecturer
table. In a situation where referential integrity is not enforced, she would be removed from the
lecturer table, but not from the course table, as shown below:

Lecturer table

Code  First Name  Surname
2     Leonard     Clark
3     Vusi        Cave

Course table

Course Title                 Lecturer Code
Introduction to Programming  1
Information Systems          2
Systems Software             3

Now, when you look up who lectures Introduction to Programming, you are sent to a
non-existent record. This is called poor data integrity.
Foreign keys also allow cascading deletes and updates. For example, if Anne Cohen
leaves, taking the Introduction to Programming Course with her, all trace of her can be
removed from both the lecturer and course table using one statement. The
delete cascades through the relevant tables, removing all relevant records.
Foreign keys can also contain null values, indicating that no relationship exists.
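
The lecturer and course example can be sketched with Python's built-in sqlite3 module. Two points are assumptions rather than part of the text: SQLite only enforces foreign keys after PRAGMA foreign_keys = ON, and the 'Database Design' and 'Networks' courses are hypothetical rows added to show a null foreign key and a rejected reference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE lecturer (
        code       INTEGER PRIMARY KEY,
        first_name TEXT,
        surname    TEXT
    );
    CREATE TABLE course (
        course_title  TEXT PRIMARY KEY,
        lecturer_code INTEGER REFERENCES lecturer(code) ON DELETE CASCADE
    );
    INSERT INTO lecturer VALUES (1, 'Anne', 'Cohen'), (2, 'Leonard', 'Clark'), (3, 'Vusi', 'Cave');
    INSERT INTO course VALUES
        ('Introduction to Programming', 1),
        ('Information Systems', 2),
        ('Systems Software', 3),
        ('Database Design', NULL);   -- hypothetical: no lecturer assigned yet
""")

# A course that points at a non-existent lecturer is rejected.
try:
    conn.execute("INSERT INTO course VALUES ('Networks', 9)")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False

# Anne Cohen leaves: one statement removes her and, by cascade, her course.
conn.execute("DELETE FROM lecturer WHERE code = 1")
remaining = [row[0] for row in conn.execute(
    "SELECT course_title FROM course ORDER BY course_title")]

print(orphan_allowed, remaining)
```

Whether a delete should cascade, be refused, or set the foreign key to null is a design decision; cascading is shown here only because it is the behaviour the text describes.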

Relational Databases: Views


Views are virtual tables. They are only a structure and contain no data. Their purpose is
to allow a user to see a subset of the actual data. A view can consist of a subset of one
table. For example, the student view, below, is a subset of the student table.

Student View: First name, Surname, Grade

Student Table: Student_id, First name, Surname, Grade, Address, Telephone

This view could be used to allow other students to see their fellow students' marks but
not allow them access to personal information.
Alternatively, a view could be a combination of a number of tables, such as the view
below:
Student Grade View: First name, Surname, Course description, Grade

Student Table: Student_id, First name, Surname, Address, Telephone

Course Table: Course_id, Course description

Grade Table: Student_id, Course_id, Grade

Views are also useful for security. In larger organizations, where many developers may be
working on a project, views allow developers to access only the data they need. What they don't
need, even if it is in the same table, is hidden from them, safe from being seen or manipulated.
It also allows queries to be simplified for developers. For example, without the view, a developer
would have to retrieve the fields in the view with the following sort of query

SELECT first_name, surname, course_description, grade FROM student, grade, course
WHERE grade.student_id = student.student_id AND grade.course_id = course.course_id;

With this view, a developer could do the same with the following:

SELECT first_name, surname, course_description, grade FROM student_grade_view;

Much simpler for a junior developer who hasn't yet learned to do joins, and it's just less
hassle for a senior developer too!
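
A possible definition of student_grade_view, which the text uses but does not define, is sketched below with Python's built-in sqlite3 module. The column types and the sample student, course and grade are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, first_name TEXT, surname TEXT,
                          address TEXT, telephone TEXT);
    CREATE TABLE course  (course_id INTEGER PRIMARY KEY, course_description TEXT);
    CREATE TABLE grade   (student_id INTEGER, course_id INTEGER, grade INTEGER,
                          PRIMARY KEY (student_id, course_id));

    -- Hypothetical sample rows.
    INSERT INTO student VALUES (1, 'Thandi', 'Mokoena', '12 Oak Road', '555-0101');
    INSERT INTO course  VALUES (10, 'Information Systems');
    INSERT INTO grade   VALUES (1, 10, 78);

    -- The view stores only the query, not a copy of the data.
    CREATE VIEW student_grade_view AS
        SELECT first_name, surname, course_description, grade
        FROM student
        JOIN grade  ON grade.student_id = student.student_id
        JOIN course ON course.course_id = grade.course_id;
""")

cur = conn.execute("SELECT * FROM student_grade_view")
columns = [description[0] for description in cur.description]
rows = cur.fetchall()

print(columns)
print(rows)
```

The address and telephone columns never appear in the view, so a user who is given access to the view alone cannot read them through it.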

Database Design

▪ Phase 1: Analysis
▪ Phase 2: Conceptual Design
▪ Phase 3: Logical and Physical Design
▪ Phase 4: Implementation
▪ Phase 5: Testing
▪ Phase 6: Operation
▪ Phase 7: Maintenance
4.2 | Database Normalization

Developed in the 1970s by E.F. Codd, database normalization is a standard requirement
of many database designs.

Normalization is a technique that can help you avoid data anomalies and other
problems with managing your data.
It consists of transforming a table through various stages: 1st normal form, 2nd normal
form, 3rd normal form, and beyond.
It aims to:

• Eliminate data redundancies (and therefore use less space)
• Make it easier to make changes to data, and avoid anomalies when doing so
• Make referential integrity constraints easier to enforce
• Produce an easily comprehensible structure that closely resembles the
situation the data represents, and allows for growth

Let's begin by creating a sample set of data. You'll walk through the process of
normalization first without worrying about the theory, to get an understanding of the
reasons you'd want to normalize. Once you've done that, we'll introduce the theory and
the various stages of normalization, which will make the whole process described below
much simpler the next time you do it.
Imagine you are working on a system that records plants placed in certain locations,
and the soil descriptions associated with them.
The location:

• Location Code: 11
• Location name: Kirstenbosch Gardens

contains the following three plants:

• Plant code: 431
• Plant name: Leucadendron
• Soil category: A
• Soil description: Sandstone

• Plant code: 446
• Plant name: Protea
• Soil category: B
• Soil description: Sandstone/Limestone

• Plant code: 482
• Plant name: Erica
• Soil category: C
• Soil description: Limestone

The location:

• Location Code: 12
• Location name: Karbonkelberg Mountains

contains the following two plants:

• Plant code: 431
• Plant name: Leucadendron
• Soil category: A
• Soil description: Sandstone

• Plant code: 449
• Plant name: Restio
• Soil category: B
• Soil description: Sandstone/Limestone

Tables in a relational database are in a grid, or table format (MariaDB, like most modern
DBMSs, is a relational database), so let's rearrange this data in the form of a tabular
report:

Plant data displayed as a tabular report

Location code  Location name            Plant code  Plant name    Soil category  Soil description
11             Kirstenbosch Gardens     431         Leucadendron  A              Sandstone
                                        446         Protea        B              Sandstone/limestone
                                        482         Erica         C              Limestone
12             Karbonkelberg Mountains  431         Leucadendron  A              Sandstone
                                        449         Restio        B              Sandstone/limestone

How are you to enter this data into a table in the database? You could try to copy the
layout you see above, resulting in a table something like the one below. The NULL fields
reflect the fields where no data was entered.
Trying to create a table with plant data

Location code  Location name            Plant code  Plant name    Soil category  Soil description
11             Kirstenbosch Gardens     431         Leucadendron  A              Sandstone
NULL           NULL                     446         Protea        B              Sandstone/limestone
NULL           NULL                     482         Erica         C              Limestone
12             Karbonkelberg Mountains  431         Leucadendron  A              Sandstone
NULL           NULL                     449         Restio        B              Sandstone/limestone

This table is not much use, though. The first three rows are actually a group, all
belonging to the same location. If you take the third row by itself, the data is incomplete,
as you cannot tell where the Erica is to be found. Also, with the table as it stands,
you cannot use the location code, or any other field, as a primary key (remember, a
primary key is a field, or list of fields, that uniquely identifies one record). There is not
much use in having a table if you can't uniquely identify each record in it.

So, the solution is to make sure each table row can stand alone, and is not part of a
group, or set. To achieve this, remove the groups, or sets of data, and make each row a
complete record in its own right, which results in the table below.

Each record stands alone

Location code  Location name            Plant code  Plant name    Soil category  Soil description
11             Kirstenbosch Gardens     431         Leucadendron  A              Sandstone
11             Kirstenbosch Gardens     446         Protea        B              Sandstone/limestone
11             Kirstenbosch Gardens     482         Erica         C              Limestone
12             Karbonkelberg Mountains  431         Leucadendron  A              Sandstone
12             Karbonkelberg Mountains  449         Restio        B              Sandstone/limestone

Notice that the location code cannot be a primary key on its own. It does not uniquely
identify a row of data. So, the primary key must be a combination of location
code and plant code. Together these two fields uniquely identify a row of data. Think
about it. You would never add the same plant type more than once to a particular
location. Once you have the fact that it occurs in that location, that's enough. If you want
to record quantities of plants at a location - for this example, you're just interested in the
spread of plants - you don't need to add an entire new record for each plant; rather, just
add a quantity field. If for some reason you would be adding more than one instance of
a plant/location combination, you'd need to add something else to the key to make it
unique.
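
The composite primary key can be sketched as below, assuming Python's built-in sqlite3 module; the table and column names are illustrative choices, and the data are the five rows from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE plant_location (
        location_code    INTEGER NOT NULL,
        location_name    TEXT,
        plant_code       INTEGER NOT NULL,
        plant_name       TEXT,
        soil_category    TEXT,
        soil_description TEXT,
        PRIMARY KEY (location_code, plant_code)   -- neither field is unique on its own
    )
""")

rows = [
    (11, "Kirstenbosch Gardens", 431, "Leucadendron", "A", "Sandstone"),
    (11, "Kirstenbosch Gardens", 446, "Protea", "B", "Sandstone/limestone"),
    (11, "Kirstenbosch Gardens", 482, "Erica", "C", "Limestone"),
    (12, "Karbonkelberg Mountains", 431, "Leucadendron", "A", "Sandstone"),
    (12, "Karbonkelberg Mountains", 449, "Restio", "B", "Sandstone/limestone"),
]
insert = "INSERT INTO plant_location VALUES (?, ?, ?, ?, ?, ?)"
conn.executemany(insert, rows)

# Location 11 repeats and plant 431 repeats, yet all five rows were accepted.
# Adding the same plant to the same location a second time is refused.
try:
    conn.execute(insert, rows[0])
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

row_count = conn.execute("SELECT COUNT(*) FROM plant_location").fetchone()[0]
print(duplicate_allowed, row_count)
```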

So, now the data can go in table format, but there are still problems with it. The table
stores the information that code 11 refers to Kirstenbosch Gardens three times! Besides
the waste of space, there is another serious problem. Look carefully at the data below.

Data anomaly

Location code  Location name            Plant code  Plant name    Soil category  Soil description
11             Kirstenbosch Gardens     431         Leucadendron  A              Sandstone
11             Kirstenbosh Gardens      446         Protea        B              Sandstone/limestone
11             Kirstenbosch Gardens     482         Erica         C              Limestone
12             Karbonkelberg Mountains  431         Leucadendron  A              Sandstone
12             Karbonkelberg Mountains  449         Restio        B              Sandstone/limestone

Did you notice anything strange? Congratulations if you did! Kirstenbosch is misspelled
in the second record. Now imagine trying to spot this error in a table with thousands of
records! By using the structure in the table above, the chance of data anomalies
increases dramatically.
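
In a table with thousands of records you would not look for such an error by eye. One way to find it, sketched here with Python's built-in sqlite3 module and illustrative table and column names, is to ask which location codes are stored with more than one spelling of the name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plant_location (location_code INTEGER, location_name TEXT, plant_code INTEGER)"
)
conn.executemany("INSERT INTO plant_location VALUES (?, ?, ?)", [
    (11, "Kirstenbosch Gardens", 431),
    (11, "Kirstenbosh Gardens", 446),    # the misspelled record
    (11, "Kirstenbosch Gardens", 482),
    (12, "Karbonkelberg Mountains", 431),
    (12, "Karbonkelberg Mountains", 449),
])

# A location code should map to exactly one name; report the codes that do not.
conflicts = conn.execute("""
    SELECT location_code, COUNT(DISTINCT location_name) AS spellings
    FROM plant_location
    GROUP BY location_code
    HAVING COUNT(DISTINCT location_name) > 1
""").fetchall()

print(conflicts)
```

A check like this only detects the anomaly after the fact; the restructuring described next removes the possibility of it occurring.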

The solution is simple. You remove the duplication. What you are doing is looking for
partial dependencies - in other words, fields that are dependent on a part of a key and
not the entire key. Because both the location code and the plant code make up the key,
you look for fields that are dependent only on location code or only on plant code.

There are quite a few fields where this is the case. Location name is dependent on
location code (plant code is irrelevant in determining location name), and plant name, soil
category, and soil description are all dependent on plant code. So, take out all these fields, as
shown in the table below:

Removing the fields not dependent on the entire key

Location code  Plant code
11             431
11             446
11             482
12             431
12             449

Clearly, you can't remove the data and leave it out of your database completely. You take it out,
and put it into a new table, consisting of the fields that have the partial dependency and the
fields on which they are dependent. For each of the key fields in the partial dependency, you
create a new table (in this case, both are already part of the primary key, but this doesn't always
have to be the case). So, you identified plant name, soil description and soil category as being
dependent on plant code. The new table will consist of plant code, as a key, as well as plant
name, soil category and soil description, as shown below:

Creating a new table with plant data

Plant code  Plant name    Soil category  Soil description
431         Leucadendron  A              Sandstone
446         Protea        B              Sandstone/limestone
482         Erica         C              Limestone
449         Restio        B              Sandstone/limestone

You do the same process with the location data, shown below:

Creating a new table with location data

Location code  Location name
11             Kirstenbosch Gardens
12             Karbonkelberg Mountains

See how these tables remove the earlier duplication problem? There is only one record
that contains Kirstenbosch Gardens, so the chances of noticing a misspelling are much
higher. And you aren't wasting space storing the name in many different records. Notice
that the location code and plant code fields are repeated in two tables. These are the
fields that create the relation, allowing you to associate the various plants with the
various locations. Obviously there is no way to remove the duplication of these fields
without losing the relation altogether, but it is far more efficient to store a small code
repeatedly than a large piece of text.
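
The split into three tables can be sketched as follows, assuming Python's built-in sqlite3 module. The table names flat, plant_location, plant and location are illustrative. The sketch also checks that joining the three tables on the shared codes gives back exactly the original five rows, so nothing was lost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE flat (location_code, location_name, plant_code,
                                   plant_name, soil_category, soil_description)""")
conn.executemany("INSERT INTO flat VALUES (?, ?, ?, ?, ?, ?)", [
    (11, "Kirstenbosch Gardens", 431, "Leucadendron", "A", "Sandstone"),
    (11, "Kirstenbosch Gardens", 446, "Protea", "B", "Sandstone/limestone"),
    (11, "Kirstenbosch Gardens", 482, "Erica", "C", "Limestone"),
    (12, "Karbonkelberg Mountains", 431, "Leucadendron", "A", "Sandstone"),
    (12, "Karbonkelberg Mountains", 449, "Restio", "B", "Sandstone/limestone"),
])

# Move each group of partially dependent fields into its own table.
conn.executescript("""
    CREATE TABLE plant_location AS
        SELECT DISTINCT location_code, plant_code FROM flat;
    CREATE TABLE plant AS
        SELECT DISTINCT plant_code, plant_name, soil_category, soil_description FROM flat;
    CREATE TABLE location AS
        SELECT DISTINCT location_code, location_name FROM flat;
""")

counts = {
    name: conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
    for name in ("plant_location", "plant", "location")
}

# Joining on the shared codes rebuilds the original rows.
rebuilt = conn.execute("""
    SELECT pl.location_code, l.location_name, pl.plant_code,
           p.plant_name, p.soil_category, p.soil_description
    FROM plant_location AS pl
    JOIN location AS l ON l.location_code = pl.location_code
    JOIN plant    AS p ON p.plant_code = pl.plant_code
    ORDER BY pl.location_code, pl.plant_code
""").fetchall()
original = conn.execute(
    "SELECT * FROM flat ORDER BY location_code, plant_code").fetchall()

print(counts, rebuilt == original)
```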

But the table is still not perfect. There is still a chance for anomalies to slip in. Examine
the table below carefully:

Another anomaly

Plant code  Plant name    Soil category  Soil description
431         Leucadendron  A              Sandstone
446         Protea        B              Sandstone/limestone
482         Erica         C              Limestone
449         Restio        B              Sandstone

The problem in the table above is that the Restio has been associated with Sandstone,
when in fact, having a soil category of B, it should be a mix of sandstone and limestone
(the soil category determines the soil description in this example). Once again you're
storing data redundantly. The soil category to soil description relationship is being
stored in its entirety for each plant. As before, the solution is to take out this excess data
and place it in its own table. What you are in fact doing at this stage is looking for
transitive relationships, or relationships where a nonkey field is dependent on another
nonkey field. Soil description, although in one sense dependent on plant code (it did
seem to be a partial dependency when we looked at it in the previous step), is actually
dependent on soil category. So, soil description must be removed. Once again, take it
out and place it in a new table, along with its actual key (soil category) as shown in the
tables below:
Plant data after removing the soil description

Plant code  Plant name    Soil category
431         Leucadendron  A
446         Protea        B
482         Erica         C
449         Restio        B

Creating a new table with the soil description

Soil category  Soil description
A              Sandstone
B              Sandstone/limestone
C              Limestone

You've cut down on the chances of anomalies once again. It is now impossible to
mistakenly assume soil category B is associated with anything but a mix of sandstone
and limestone. The soil description to soil category relationships are stored in only one
place - the new soil table, where you can be much more certain they are accurate.
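
A short sketch of the effect, assuming Python's built-in sqlite3 module and illustrative table names: once the description lives only in the soil table, every plant in category B reads the same description, and a correction is made in exactly one row. The reworded description in the update is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE soil (
        soil_category    TEXT PRIMARY KEY,
        soil_description TEXT NOT NULL
    );
    CREATE TABLE plant (
        plant_code    INTEGER PRIMARY KEY,
        plant_name    TEXT NOT NULL,
        soil_category TEXT NOT NULL REFERENCES soil(soil_category)
    );
    INSERT INTO soil VALUES ('A', 'Sandstone'), ('B', 'Sandstone/limestone'), ('C', 'Limestone');
    INSERT INTO plant VALUES (431, 'Leucadendron', 'A'), (446, 'Protea', 'B'),
                             (482, 'Erica', 'C'), (449, 'Restio', 'B');
""")

def soil_for(plant_name):
    """Look up a plant's soil description through its soil category."""
    row = conn.execute("""
        SELECT soil_description
        FROM plant
        JOIN soil ON soil.soil_category = plant.soil_category
        WHERE plant_name = ?
    """, (plant_name,)).fetchone()
    return row[0]

before = (soil_for("Protea"), soil_for("Restio"))

# One change in one place; both category B plants pick it up.
conn.execute(
    "UPDATE soil SET soil_description = 'Sandstone and limestone mix' "
    "WHERE soil_category = 'B'"
)
after = (soil_for("Protea"), soil_for("Restio"))

print(before, after)
```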

Often, when you're designing a system you don't yet have a complete set of test data
available, and it's not necessary if you understand how the data relates. This section has
used the tables and their data to demonstrate the consequences of storing data in
tables that were not normalized, but without them you have to rely on dependencies
between fields, which is the key to database normalization.

Now, we will describe the process more formally, starting with moving from
unnormalized data (or zero normal form) to first normal form.
Database Normalization: 1st Normal Form
Tables are in 1st normal form if they follow these rules:

• There are no repeating groups.
• All the key attributes are defined.
• All attributes are dependent on the primary key.

What this means is that data must be able to fit into a tabular format, where each field
contains one value. This is also the stage where the primary key is defined. Some
sources claim that defining the primary key is not necessary for a table to be in first
normal form, but usually it's done at this stage and is necessary before we can progress
to the next stage. Theoretical debates aside, you'll have to define your primary keys at
this point.

Although not always seen as part of the definition of 1st normal form, the principle of atomicity is
usually applied at this stage as well. This means that all columns must contain their smallest
parts, or be indivisible. A common example of this is where someone creates a name field,
rather than first name and surname fields. They usually regret it later.

So far, the plant example has no keys, and there are repeating groups. To get it into 1st
normal form, you'll need to define a primary key and change the structure so that there
are no repeating groups; in other words, each row / column intersection contains one,
and only one, value. Without this, you cannot put the data into the ordinary two-
dimensional table that most databases require. You define location code and plant code
as the primary key together (neither on its own can uniquely identify a record), and
replace the repeating groups with a single-value attribute. After doing this, you are left
with the structure shown in the table below (the primary key fields are marked PK):

Plant location table: Location code (PK), Location name, Plant code (PK), Plant name, Soil category, Soil description

This table is now in 1st normal form.

Database Normalization: 2nd Normal Form


After converting to the first normal form, the following table structure was achieved:

Plant location table: Location code (PK), Location name, Plant code (PK), Plant name, Soil category, Soil description

Is this in 2nd normal form?


A table is in 2nd normal form if:

• it is in 1st normal form
• it includes no partial dependencies (where an attribute is only dependent on part of a
primary key)

For an attribute to be only dependent on part of the primary key, the primary key must
consist of more than one field. If the primary key contains only one field, the table is
automatically in 2nd normal form if it is in 1st normal form.
Let's examine all the fields. Location name is only dependent on location code. Plant
name, soil category, and soil description are only dependent on plant code (this
assumes that each plant only occurs in one soil type, which is the case in this example).
So you remove each of these fields and place them in a separate table, with the key
being that part of the original key on which they are dependent. For example, with plant
name, the key is plant code. This leaves you with the tables below:

Plant location table with partial dependencies removed

Plant location table: Plant code (PK), Location code (PK)

Table resulting from fields dependent on plant code

Plant table: Plant code (PK), Plant name, Soil category, Soil description

Table resulting from fields dependent on location code

Location table: Location code (PK), Location name

The resulting tables are now in 2nd normal form.
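
The search for partial dependencies can be sketched in plain Python. The helper name determines and the field names are illustrative. Note the limitation: sample data can only show that a dependency is not contradicted by the rows available; whether it really holds is a question about the business rules, such as the assumption above that each plant occurs in one soil type.

```python
FIELDS = ("location_code", "location_name", "plant_code",
          "plant_name", "soil_category", "soil_description")

# The 1st normal form rows from the plant example.
ROWS = [dict(zip(FIELDS, values)) for values in [
    (11, "Kirstenbosch Gardens", 431, "Leucadendron", "A", "Sandstone"),
    (11, "Kirstenbosch Gardens", 446, "Protea", "B", "Sandstone/limestone"),
    (11, "Kirstenbosch Gardens", 482, "Erica", "C", "Limestone"),
    (12, "Karbonkelberg Mountains", 431, "Leucadendron", "A", "Sandstone"),
    (12, "Karbonkelberg Mountains", 449, "Restio", "B", "Sandstone/limestone"),
]]

def determines(rows, lhs, rhs):
    """True if no two rows agree on the lhs fields but differ on the rhs field."""
    seen = {}
    for row in rows:
        key = tuple(row[field] for field in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

PRIMARY_KEY = ("location_code", "plant_code")
NON_KEY = [field for field in FIELDS if field not in PRIMARY_KEY]

# For each non-key field, list the single key fields that determine it on their own.
partial = {
    field: [part for part in PRIMARY_KEY if determines(ROWS, (part,), field)]
    for field in NON_KEY
}

for field, parts in partial.items():
    print(field, "depends only on", parts)
```

Every non-key field turns out to depend on just one part of the key, which is why all four are moved out of the plant location table.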

Database Normalization: 3rd Normal Form


After converting to the second normal form, the following table structure was achieved:

Plant location table: Plant code (PK), Location code (PK)

Plant table: Plant code (PK), Plant name, Soil category, Soil description

Location table: Location code (PK), Location name

Are these tables in 3rd normal form?


A table is in 3rd normal form if:

• it is in 2nd normal form
• it contains no transitive dependencies (where a non-key attribute is
dependent on the primary key through another non-key attribute)

If a table contains at most one non-key attribute, it is impossible for a non-key
attribute to be dependent on another non-key attribute. Any such table that is
in 2nd normal form is therefore also in 3rd normal form.
As only the plant table has more than one non-key attribute, you can ignore the others
because they are in 3rd normal form already. All fields are dependent on the primary
key in some way, since the tables are in second normal form. But does this dependency run
through another non-key field? Plant name is not dependent on either soil category or soil
description. Nor is soil category dependent on either soil description or plant name.
However, soil description is dependent on soil category. You use the same procedure
as before, removing it, and placing it in its own table with the attribute that it was
dependent on as the key. You are left with the tables below:

Plant location table remains unchanged

Plant location table

Plant code

Location code

Plant table with soil description removed

Plant table

Plant code

Plant name

Soil category

The new soil table

Soil table

Soil category

Soil description
Location table remains unchanged

Location table

Location code

Location name

All of these tables are now in 3rd normal form. 3rd normal form is usually sufficient for most tables
because it avoids the most common kinds of data anomalies. Aim to get most tables you
work with to 3rd normal form before you implement them, as this will achieve the aims of
normalization listed in Database Normalization Overview in the vast majority of cases.
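
The transitive dependency found above can also be checked mechanically. The sketch below is an illustration only: the helper name determines and the dictionary layout are assumptions made for this example. Note that a dependency that holds in sample data is only evidence; whether it holds in general is a design decision.

```python
# Plant table in 2nd normal form:
# plant code -> (plant name, soil category, soil description)
plant = {
    431: ("Leucadendron", "A", "Sandstone"),
    446: ("Protea", "B", "Sandstone/limestone"),
    482: ("Erica", "C", "Limestone"),
    449: ("Restio", "B", "Sandstone/limestone"),
}


def determines(rows, lhs, rhs):
    """Return True if equal values at index lhs always pair with equal values at index rhs."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


rows = list(plant.values())
soil_depends_on_category = determines(rows, 1, 2)  # soil category -> soil description
name_depends_on_category = determines(rows, 1, 0)  # soil category -> plant name

# Move soil description into its own table, keyed by soil category.
plant_3nf = {code: (name, category) for code, (name, category, _) in plant.items()}
soil = {category: description for _, category, description in plant.values()}

print(soil_depends_on_category, name_depends_on_category)
print(plant_3nf)
print(soil)
```

Here soil description follows from soil category while plant name does not, which is why only soil description moves to the new soil table.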

The normal forms beyond this, such as Boyce-Codd normal form and 4th normal form, are rarely
useful for business applications. In most cases, tables in 3rd normal form are already in these
normal forms anyway. But any skilful database practitioner should know the exceptions, and be able
to normalize to the higher levels when required.

Sample Output of Database Normalization


Live Conference 4
Sep. 20, 2021 @ 10:30 AM - 1:30 PM
Module 4
4.1 Relational Database Modeling
Introduction to Relational Databases
▪ The easiest way to understand a database is as a collection of related files.
Imagine a file (either paper or digital) of sales orders in a shop. Then there's
another file of products, containing stock records. To fulfil an order, you'd
need to look up the product in the order file and then look up and adjust the
stock levels for that particular product in the product file. A database and the
software that controls the database, called a database management
system (DBMS), helps with this kind of task.
▪ Most databases today are relational databases, named such because they
deal with tables of data related by a common field. For example, Table 1
below shows the product table, and Table 2 shows the invoice table. As you
can see, the relation between the two tables is based on the common
field ProductCode . Any two tables can relate to each other simply by having a
field in common.

Table 1
ProductCode   Description   Price (Php)   Unit
A150          Cement        220.00        bag
A125          Sand          700.00        cum

Table 2
InvoiceCode   InvoiceLine   ProductCode   Qty
PO20          1             A150          10
PO21          2             A125          15


Database Terminology
Let's take a closer look at the previous two tables to see how they are
organized:

• Each table consists of many rows and columns.
• Each new row contains data about one single entity (such as one product or
one order line). This is called a record. For example, the first row in Table 1
is a record; it describes the A150 product, which is a bag of cement that
costs Php 220.00. The terms row and record are interchangeable.
• Each column (also called an attribute) contains one piece of data that
relates to the record (a record is also called a tuple). Examples of attributes are the quantity
of an item sold or the price of a product. An attribute, when referring to a
database table, is called a field. For example, the data in
the Description column in Table 1 are fields. The
terms attribute and field are interchangeable.
[Figure: a Student table with the columns Roll_No., Name, and Department, annotated to show an attribute/column, a field, and a tuple/row/record/entity.]
Understanding the Hierarchical Database Model
▪ The earliest model was the hierarchical database model, resembling an
upside-down tree.
▪ Files are related in a parent-child manner, with each parent capable of
relating to more than one child, but each child only being related to one
parent. Most of you will be familiar with this kind of structure; it's the way
most file systems work. There is usually a root, or top-level, directory that
contains various other directories and files. Each subdirectory can then
contain more files and directories, and so on. Each file or directory can only
exist in one directory itself, so it only has one parent. As you can see in the
image below, A1 is the root directory, and its children are B1 and B2. B1 is a
parent to C1, C2, and C3, which in turn have children of their own.
This model, although a vast improvement on dealing with unrelated files,
has some serious disadvantages. It represents one-to-many relationships well
(one parent has many children; for example, one company branch has many
employees), but it has problems with many-to-many relationships.
Relationships such as that between a product file and an orders file are difficult
to implement in a hierarchical model. Specifically, an order can contain many
products, and a product can appear in many orders.
Also, the hierarchical model is not flexible because adding new relationships
can result in wholesale changes to the existing structure, which in turn means
all existing applications need to change as well. This is not fun when someone
has forgotten a table and wants it added to the system shortly before the project
is due to launch! And developing the applications is complex because the
programmer needs to know the data structure well in order to traverse the
model to access the needed data. As you've seen in the earlier chapters, when
accessing data from two related tables, you only need to know the fields you
require from those two tables. In the hierarchical model, you'd need to know the
entire chain between the two. For example, to relate data from A1 and D4,
you'd need to take the route: A1, B1, C3, and D4.
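
To see why the programmer has to know the whole chain, consider a minimal sketch of a hierarchy stored as child-to-parent links. The directory names and the function path_from_root are hypothetical, chosen to echo the file-system analogy above; they are not the nodes from the figure.

```python
# Each child has exactly one parent; the root has none.
parent = {
    "docs": "root",
    "src": "root",
    "guide.txt": "docs",
    "lib": "src",
    "util.py": "lib",
}


def path_from_root(node):
    """Walk the single-parent links upward and return the route from the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return list(reversed(path))


print(path_from_root("util.py"))  # ['root', 'src', 'lib', 'util.py']
```

Reaching one item from the root means following every intermediate link, and any change to the structure changes every route that passes through it.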
Understanding the Network Database Model

The network database model was a progression from the hierarchical database
model and was designed to solve some of that model's problems, specifically
the lack of flexibility. Instead of only allowing each child to have one parent, this
model allows each child to have multiple parents (it calls the
children members and the parents owners). It addresses the need to model
more complex relationships such as the orders/parts many-to-many relationship
mentioned in the discussion of the hierarchical model. As you can see in the figure below, A1 has
two members, B1 and B2. B1 is the owner of C1, C2, C3, and C4. However, in
this model, C4 has two owners, B1 and B2.
Of course, this model has its problems, or everyone would still be using it. It is
more difficult to implement and maintain, and, although more flexible than the
hierarchical model, it still has flexibility problems. Not all relations can be
satisfied by assigning another owner, and the programmer still has to
understand the data structure well in order to make the model efficient.
Understanding the Relational Database Model
The relational database model was a huge leap forward from the network
database model. Instead of relying on a parent-child or owner-member
relationship, the relational model allows any file to be related to any other by
means of a common field. Suddenly, the complexity of the design was greatly
reduced because changes could be made to the database schema without
affecting the system's ability to access data. And because access was not by
means of paths to and from files, but from a direct relationship between files,
new relations between these files could easily be added.
In 1970, when E.F. Codd developed the model, it was thought to be impractical.
The increased ease of use comes at a large performance penalty, and the
hardware in those days was not able to implement the model. Since then, of
course, hardware has taken huge strides to where today, even the simplest
computers can run sophisticated relational database management systems.

Relational databases go hand-in-hand with the development of SQL. The
simplicity of SQL, where even a novice can learn to perform basic queries in a
short period of time, is a large part of the reason for the popularity of the
relational model.
The two tables below relate to each other through the ProductCode field. Any
two tables can relate to each other simply by creating a field they have in
common.

Table 1
ProductCode   Description   Price (Php)   Unit
A150          Cement        220.00        bag
A125          Sand          700.00        cum

Table 2
InvoiceCode   InvoiceLine   ProductCode   Qty
PO20          1             A150          10
PO21          2             A125          15
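
As a rough illustration of relating the two tables through ProductCode, the sketch below loads them into an in-memory SQLite database using Python's standard sqlite3 module. The table names product and invoice and the computed Amount column are assumptions made for this example; the data are the rows shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    ProductCode TEXT PRIMARY KEY,
    Description TEXT,
    Price REAL,
    Unit TEXT
);
CREATE TABLE invoice (
    InvoiceCode TEXT,
    InvoiceLine INTEGER,
    ProductCode TEXT,
    Qty INTEGER
);
INSERT INTO product VALUES ('A150', 'Cement', 220.00, 'bag'), ('A125', 'Sand', 700.00, 'cum');
INSERT INTO invoice VALUES ('PO20', 1, 'A150', 10), ('PO21', 2, 'A125', 15);
""")

# The common field ProductCode is all the query needs to relate the tables.
rows = conn.execute("""
    SELECT i.InvoiceCode, p.Description, i.Qty, i.Qty * p.Price AS Amount
    FROM invoice AS i
    JOIN product AS p ON p.ProductCode = i.ProductCode
    ORDER BY i.InvoiceCode
""").fetchall()

for row in rows:
    print(row)
```

The query names only the fields it needs from each table; no path between the tables has to be spelled out.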
Relational Databases: Basic Terms
The relational database model uses certain terms to describe its components:

• Data are the values kept in the database. On their own, the data mean
very little. CA 684-213 is an example of data in a DMV (Division of Motor
Vehicles) database.
• Information is processed data. For example, CA 684-213 is the car
registration number of a car belonging to Lyndon Manson, in a DMV
database.
• A database is a collection of tables, also called entities.
• Each table is made up of records (the horizontal rows in the table, also
called tuples). Each record should be unique, and can be stored in any
order in the table.
• Each record is made up of fields (which are the vertical columns of the
table, also called attributes). Basically, a record is one fact (for example, one
customer or one sale).
• These fields can be of various types. Generally, types fall into three kinds:
character, numeric, and date. For example, a customer name is a character
field, a customer's birthday is a date field, and a customer's number of
children is a numeric field.
• The range of allowed values for a field is called the domain (also called
a field specification). For example, a credit card field may be limited to only
the values Mastercard, Visa, and Amex.
• A field is said to contain a null value when it contains nothing at all. Null
fields can create complexities in calculations and have consequences for
data accuracy. For this reason, many fields are specifically set not to contain
null values.
• A key accesses specific records in a table.
• An index is a mechanism to improve the performance of a database.
Indexes are often confused with keys. Indexes are, strictly speaking, part of
the physical structure, and keys are part of the logical structure. You'll often
see the terms used interchangeably, however, including throughout this
Knowledge Base.
• A view is a virtual table made up of a subset of the actual tables.
• A one-to-one (1:1) relationship is where for each instance of the first table in
a relationship, only one instance of the second table exists. An example of
this would be a case where a chain of stores carries a vending machine.
Each vending machine can only be in one store, and each store carries only
one vending machine.

[Diagram: Vending machine (appears in) Store, a one-to-one relationship]

• A one-to-many (1:N) relationship is where for each instance of the first table
in a relationship, many instances of the second table exist. This is a
common kind of relationship. An example is the relationship between a
sculptor and their sculptures. Each sculptor may have created many
sculptures, but each sculpture has been created by only one sculptor.

[Diagram: Sculptor (sculpts) Sculpture, a one-to-many relationship]

• A many-to-many (M:N) relationship occurs where, for each instance of the
first table, there are many instances of the second table, and for each
instance of the second table, there are many instances of the first. For
example, a student can have many lecturers, and a lecturer can have many
students.

[Diagram: Student (is taught by) Lecturer, a many-to-many relationship]

• A mandatory relationship exists where for each instance of the first table in
a relationship, one or more instances of the second must exist. For
example, for a music group to exist, there must exist at least one musician
in that group.
• An optional relationship is where for each instance of the first table in a
relationship, there may exist instances of the second. For example, if an
author can be listed in the database without having written a book (in other
words, a prospective author), that relationship is optional. The reverse isn't
necessarily true though. For example, for a book to be listed, it must have
an author.
• Data integrity refers to the condition where data is accurate, valid, and
consistent. An example of poor integrity would be if a customer telephone
number is stored differently in two different locations. Another is where a
course record contains a reference to a lecturer who is no longer present at
the school. Database normalization is a technique that helps you
minimize the risk of these sorts of problems.
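
Two of the terms above, the domain of a field and null values, can be illustrated with a small sketch. It uses SQLite through Python's sqlite3 module; the payment table, its columns, and the helper try_insert are hypothetical names invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payment (
        payment_id INTEGER PRIMARY KEY,
        -- NOT NULL forbids null values; CHECK restricts the field to its domain.
        card_type TEXT NOT NULL CHECK (card_type IN ('Mastercard', 'Visa', 'Amex'))
    )
""")


def try_insert(value):
    """Return True if the value is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO payment (card_type) VALUES (?)", (value,))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("Visa"))    # value inside the domain
print(try_insert("Diners"))  # value outside the domain
print(try_insert(None))      # null value
```

Only the first insert succeeds; the other two are rejected by the database before they can affect data accuracy.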
Relational Databases: Table Keys
A key, or index, as the term itself indicates, unlocks access to the tables. If you
know the key, you know how to identify specific records and the relationships
between the tables.

Each key consists of one or more fields, or a field prefix. The order of columns in
an index is significant. Each key has a name.

A candidate key is a field, or combination of fields, that uniquely identifies a
record. It cannot contain a null value, and its value must be unique. (With
duplicates, you would no longer be identifying a unique record.)

A primary key (PK) is a candidate key that has been designated to identify
unique records in the table throughout the database structure.
A surrogate key is a primary key that contains unique values automatically
generated by the database system - usually, integer numbers. A surrogate key
has no meaning, except uniquely identifying a record. This is the most common
type of primary key.

For example, see the following table:

CustomerCode   FirstName   Surname    TelNo
1              Joseph      Bautista   0928-548-5899
2              Andrew      Soriano    0906-256-5499
3              Vince       Fajardo    0917-586-5844
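
A surrogate key such as CustomerCode can be left for the database to generate. The sketch below shows one way this might look in SQLite (via Python's sqlite3 module), where an INTEGER PRIMARY KEY column is filled in automatically when no value is supplied. The table name customer is an assumption; the rows are those of the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        CustomerCode INTEGER PRIMARY KEY,  -- surrogate key assigned by SQLite
        FirstName TEXT NOT NULL,
        Surname TEXT NOT NULL,
        TelNo TEXT
    )
""")

people = [
    ("Joseph", "Bautista", "0928-548-5899"),
    ("Andrew", "Soriano", "0906-256-5499"),
    ("Vince", "Fajardo", "0917-586-5844"),
]
# No CustomerCode is supplied; the database generates 1, 2, 3.
conn.executemany(
    "INSERT INTO customer (FirstName, Surname, TelNo) VALUES (?, ?, ?)", people
)

codes = [row[0] for row in conn.execute(
    "SELECT CustomerCode FROM customer ORDER BY CustomerCode"
)]
print(codes)  # [1, 2, 3]
```

The generated numbers carry no meaning of their own; they only identify each record.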
Relational Databases: Foreign Keys
You already know that a relationship between two tables is created by
assigning a common field to the two tables (see Relational Databases: Table
Keys). This common field must be a primary key to one table. Consider a
relationship between a customer table and a sale table. The relationship is not
much good if instead of using the primary key, customer_code, in the sale table,
you use another field that is not unique, such as the customer's first name. You
would be unlikely to know for sure which customer made the sale in that case.
So, in the tables below, customer_code is called the foreign key in
the sale table; in other words, it is the primary key in a foreign table.

CUSTOMER: Customer code (primary key), First name, Surname, Telephone number
SALE: Invoice number, Customer code (foreign key), Amount

Foreign keys allow for something called referential integrity. What this means is
that if a foreign key contains a value, this value refers to an existing record in
the related table. For example, take a look at the tables below:
Lecturer table
Code   FirstName   Surname
1      Anne        Cohen
2      Leonard     Clark
3      Vusi        Cave

Course table
Course Title                  Lecturer Code
Introduction to Programming   1
Information Systems           2
Systems Software              3

Referential integrity exists here, as all the lecturers in the course table exist in
the lecturer table. However, let's assume Anne Cohen leaves the institution,
and you remove her from the lecturer table. In a situation where referential
integrity is not enforced, she would be removed from the lecturer table, but not
from the course table, as shown below:

Lecturer table
Code   FirstName   Surname
2      Leonard     Clark
3      Vusi        Cave

Course table
Course Title                  Lecturer Code
Introduction to Programming   1
Information Systems           2
Systems Software              3
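
When referential integrity is enforced, the database refuses the deletion instead of leaving a course that points at a missing lecturer. The following is a minimal sketch using SQLite through Python's sqlite3 module; the column names are assumptions, and SQLite in particular only enforces foreign keys after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
CREATE TABLE lecturer (
    code INTEGER PRIMARY KEY,
    first_name TEXT,
    surname TEXT
);
CREATE TABLE course (
    title TEXT PRIMARY KEY,
    lecturer_code INTEGER REFERENCES lecturer(code)  -- foreign key
);
INSERT INTO lecturer VALUES (1, 'Anne', 'Cohen'), (2, 'Leonard', 'Clark'), (3, 'Vusi', 'Cave');
INSERT INTO course VALUES
    ('Introduction to Programming', 1),
    ('Information Systems', 2),
    ('Systems Software', 3);
""")

# Anne Cohen still teaches a course, so removing her is rejected.
try:
    conn.execute("DELETE FROM lecturer WHERE code = 1")
    deleted = True
except sqlite3.IntegrityError:
    deleted = False

lecturer_count = conn.execute("SELECT COUNT(*) FROM lecturer").fetchone()[0]
print(deleted, lecturer_count)  # False 3
```

To remove the lecturer, the course would first have to be reassigned or deleted, which keeps the two tables consistent.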
Relational Databases: Views
Views are virtual tables. They are only a structure and contain no data. Their
purpose is to allow a user to see a subset of the actual data. A view can consist
of a subset of one table. For example, the student view, below, is a subset of
the student table.
Student view: First name, Surname, Grade
Student table: Student_id, First name, Surname, Grade, Address, Telephone

A view can also draw its fields from more than one table:

Student view: First name, Surname, Grade
Student table: Student_id, First name, Surname, Address, Telephone
Course table: Course_id, Course description
Grade table: Student_id, Course_id, Grade
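
The single-table student view might be defined as in the sketch below, written for SQLite through Python's sqlite3 module. The column names follow the figure, while the sample student row is invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    first_name TEXT,
    surname TEXT,
    grade TEXT,
    address TEXT,
    telephone TEXT
);
-- Hypothetical sample row, not taken from the module.
INSERT INTO student VALUES (1, 'Ana', 'Reyes', 'A', '12 Example St', '555-0100');

-- The view exposes only three of the six columns.
CREATE VIEW student_view AS
    SELECT first_name, surname, grade FROM student;
""")

cur = conn.execute("SELECT * FROM student_view")
columns = [d[0] for d in cur.description]
rows = cur.fetchall()
print(columns, rows)

# The view stores no data of its own: a change to the table shows up in the view.
conn.execute("UPDATE student SET grade = 'B' WHERE student_id = 1")
updated_grade = conn.execute("SELECT grade FROM student_view").fetchone()[0]
print(updated_grade)  # B
```

A user who queries student_view never sees the address or telephone fields, even though they exist in the underlying table.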
Database Design
▪ Phase 1: Analysis
▪ Phase 2: Conceptual Design
▪ Phase 3: Logical and Physical Design
▪ Phase 4: Implementation
▪ Phase 5: Testing
▪ Phase 6: Operation
▪ Phase 7: Maintenance
4.2 Database Normalization
Database Normalization: Edgar F. Codd was the inventor
of the relational model. He also introduced the concept of
database normalization, which was based on his relational
model. He proposed the 1NF, 2NF, and 3NF normal forms.
Developed in the 1970s by E.F. Codd, database normalization is a standard
requirement of many database designs.

Normalization is a technique that can help you avoid data anomalies and other
problems with managing your data.

It consists of transforming a table through various stages: 1st normal form, 2nd
normal form, 3rd normal form, and beyond.

It aims to:

• Eliminate data redundancies (and therefore use less space)
• Make it easier to make changes to data, and avoid anomalies when doing
so
• Make referential integrity constraints easier to enforce
• Produce an easily comprehensible structure that closely resembles the
situation the data represents, and allows for growth
Let's begin by creating a sample set of data. You'll walk through the process of
normalization first without worrying about the theory, to get an understanding of
the reasons you'd want to normalize. Once you've done that, we'll introduce the
theory and the various stages of normalization, which will make the whole
process described below much simpler the next time you do it.

Imagine you are working on a system that records plants placed in certain
locations, and the soil descriptions associated with them.
The location:

• Location code: 11
• Location name: Kirstenbosch Gardens

contains the following three plants:

• Plant code: 431
• Plant name: Leucadendron
• Soil category: A
• Soil description: Sandstone

• Plant code: 446
• Plant name: Protea
• Soil category: B
• Soil description: Sandstone/Limestone

• Plant code: 482
• Plant name: Erica
• Soil category: C
• Soil description: Limestone

The location:

• Location code: 12
• Location name: Karbonkelberg Mountains

contains the following two plants:

• Plant code: 431
• Plant name: Leucadendron
• Soil category: A
• Soil description: Sandstone

• Plant code: 449
• Plant name: Restio
• Soil category: B
• Soil description: Sandstone/Limestone
Plant data displayed as a tabular report

Location code   Location name             Plant code   Plant name     Soil category   Soil description
11              Kirstenbosch Gardens      431          Leucadendron   A               Sandstone
                                          446          Protea         B               Sandstone/limestone
                                          482          Erica          C               Limestone
12              Karbonkelberg Mountains   431          Leucadendron   A               Sandstone
                                          449          Restio         B               Sandstone/limestone

Trying to create a table with plant data

Location code   Location name             Plant code   Plant name     Soil category   Soil description
11              Kirstenbosch Gardens      431          Leucadendron   A               Sandstone
NULL            NULL                      446          Protea         B               Sandstone/limestone
NULL            NULL                      482          Erica          C               Limestone
12              Karbonkelberg Mountains   431          Leucadendron   A               Sandstone
NULL            NULL                      449          Restio         B               Sandstone/limestone

Each record stands alone

Location code   Location name             Plant code   Plant name     Soil category   Soil description
11              Kirstenbosch Gardens      431          Leucadendron   A               Sandstone
11              Kirstenbosch Gardens      446          Protea         B               Sandstone/limestone
11              Kirstenbosch Gardens      482          Erica          C               Limestone
12              Karbonkelberg Mountains   431          Leucadendron   A               Sandstone
12              Karbonkelberg Mountains   449          Restio         B               Sandstone/limestone
Data anomaly

Location code   Location name             Plant code   Plant name     Soil category   Soil description
11              Kirstenbosch Gardens      431          Leucadendron   A               Sandstone
11              Kirstenbosh Gardens       446          Protea         B               Sandstone/limestone
11              Kirstenbosch Gardens      482          Erica          C               Limestone
12              Karbonkelberg Mountains   431          Leucadendron   A               Sandstone
12              Karbonkelberg Mountains   449          Restio         B               Sandstone/limestone
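
An anomaly like the misspelled location name can be found by checking whether one location code ever appears with more than one name. This is a small sketch with hypothetical variable names; the rows are the location code and location name columns of the table above.

```python
from collections import defaultdict

rows = [
    (11, "Kirstenbosch Gardens"),
    (11, "Kirstenbosh Gardens"),  # the misspelled entry
    (11, "Kirstenbosch Gardens"),
    (12, "Karbonkelberg Mountains"),
    (12, "Karbonkelberg Mountains"),
]

# Collect every spelling seen for each location code.
names_by_code = defaultdict(set)
for code, name in rows:
    names_by_code[code].add(name)

# A code with more than one name signals a data anomaly.
conflicts = {
    code: sorted(names)
    for code, names in names_by_code.items()
    if len(names) > 1
}
print(conflicts)
```

The check is only needed because the name is repeated in every row; once the location name is stored in a single place, this kind of disagreement cannot arise.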
Removing the fields not dependent on the entire key

Location code   Plant code
11              431
11              446
11              482
12              431
12              449
Creating a new table with plant data

Plant code   Plant name     Soil category   Soil description
431          Leucadendron   A               Sandstone
446          Protea         B               Sandstone/limestone
482          Erica          C               Limestone
431          Leucadendron   A               Sandstone
449          Restio         B               Sandstone/limestone

Creating a new table with location data

Location code   Location name
11              Kirstenbosch Gardens
12              Karbonkelberg Mountains


Plant data after removing the soil description

Plant code   Plant name     Soil category
431          Leucadendron   A
446          Protea         B
482          Erica          C
449          Restio         B

Creating a new table with the soil description

Soil category   Soil description
A               Sandstone
B               Sandstone/limestone
C               Limestone
Database Normalization: 1st Normal Form
Tables are in 1st normal form if they follow these rules:

• There are no repeating groups.
• All the key attributes are defined.
• All attributes are dependent on the primary key.
Database Normalization: 2nd Normal Form

A table is in 2nd normal form if:

• it is in 1st normal form
• it includes no partial dependencies (where an attribute is only dependent on
part of a primary key)
Database Normalization: 3rd Normal Form

A table is in 3rd normal form if:

• it is in 2nd normal form
• it contains no transitive dependencies (where a non-key attribute is
dependent on the primary key through another non-key attribute)
Thanks!
