Note File Org
A file is any collection of data stored under a file name on a computer storage medium such as disk or tape.
Data in a file may comprise alphabetical, numerical or special characters, or digitized images. A single
word-processed letter stored on the computer under a file name is a file, and so are hundreds of college
staff records stored under a single file name. Records in a file are individually different but share a
common structure. For example, the payroll file of a college might contain the payroll details of all staff
in the college; each record in the payroll file relates to one particular member of staff.
A computer file may contain lines of text (a text file), an arbitrary binary image, or executable code,
depending on the purpose for which it was created. Input to an application is read from files and output
is saved in files. The structure of a file usually depends on the person designing it, although several
standards have been established for different purposes. Most computer files are created and used by the
applications that support them. For example, in Microsoft Word the user creates a document (file), names
it and chooses its location. The content of the document can then be manipulated by the user in ways that
the application understands. In some cases, applications pack all their data files into a single file,
using markers to distinguish the parts according to the information they contain.
Files can be created, moved, modified or deleted. In most cases the applications that support the file
handle these operations, but the user can also manipulate files directly when necessary. For example, in
Microsoft Word files are usually created and modified in response to user commands, but the user can also
move, rename or delete the files directly.
In operating systems such as Unix and Linux, users cannot manipulate files directly; all operations on
files are controlled by the kernel. The kernel is the core of the operating system and performs essential
functions such as managing memory and files and allocating system resources. The operating system creates
an abstraction such that all user interaction with a file goes through a hard link (a directory entry that
names the file). For example, a user cannot delete a file as such; the user can only delete links to the
file. When the kernel finds that no links to the file remain (and no process still has it open), it
deletes the file automatically.
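The link behaviour described above can be observed from a program. The following is a minimal sketch, assuming a Unix-like system whose file system supports hard links; the file names and the function `link_count_demo` are made up for illustration.

```python
import os
import tempfile

def link_count_demo():
    """Create a file, add a second hard link, then remove the first link."""
    counts = []
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "report.txt")   # hypothetical names
        second = os.path.join(d, "alias.txt")
        with open(first, "w") as f:
            f.write("payroll data\n")
        counts.append(os.stat(first).st_nlink)   # one link after creation
        os.link(first, second)                   # a second name for the same file
        counts.append(os.stat(first).st_nlink)
        os.unlink(first)                         # removes one name, not the data
        with open(second) as f:
            still_there = f.read()
        counts.append(os.stat(second).st_nlink)
    return counts, still_there
```

The data stays readable through the remaining link; the storage is only reclaimed once the last link is removed.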
Byte: A byte is a collection of 8 bits. It can represent one character and is the smallest addressable
unit of computer memory and the basic unit for measuring information storage. Conventionally, the byte is
denoted by B.
Some Quantities of Bit
Unit              Decimal equivalent   Binary equivalent
kilobit (kbit)    10^3                 2^10
megabit (Mbit)    10^6                 2^20
gigabit (Gbit)    10^9                 2^30
terabit (Tbit)    10^12                2^40
petabit (Pbit)    10^15                2^50
exabit (Ebit)     10^18                2^60
zettabit (Zbit)   10^21                2^70
yottabit (Ybit)   10^24                2^80
Character: A character is the smallest unit of information. Characters include letters, digits and special
symbols such as +, -, =, *, % and & (i.e. all symbols on a keyboard).
Field: A field is a combination of characters. It is one or more bytes holding a single piece of
information about a single entity (i.e. an attribute or characteristic of the entity). A field can be
assigned a value and is characterized by its length and data type. Examples are First name, Age, Gender
and Date of birth. The content of a field is provided by a user or a program.
Record: A record is a collection of related, organized fields that together can be treated as complete
information about one entity in a file. For example, a staff record would contain fields such as name and
staff identification number.
File: A file is a collection of related records. It is treated as a single entity by users and
applications and may be referenced by name. Files have unique file names and may be manipulated, e.g.
created or deleted.
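The hierarchy of field, record and file can be sketched in code. This is only an illustration: the layout (a 20-byte name, a 6-byte staff ID and a 2-byte age) and the names `STAFF_RECORD`, `pack_staff` and `unpack_staff` are assumptions, not part of these notes.

```python
import struct

# Hypothetical layout: 20-byte name, 6-byte staff ID, 2-byte unsigned age.
# Each field has a length and a data type, as described above.
STAFF_RECORD = struct.Struct("<20s6sH")

def pack_staff(name, staff_id, age):
    """Encode three fields into one fixed-length record (28 bytes)."""
    return STAFF_RECORD.pack(name.encode("ascii"), staff_id.encode("ascii"), age)

def unpack_staff(raw):
    """Decode one record into its fields, dropping the padding bytes."""
    name, staff_id, age = STAFF_RECORD.unpack(raw)
    return (name.rstrip(b"\x00").decode("ascii"),
            staff_id.rstrip(b"\x00").decode("ascii"),
            age)

# A "file" here is simply the related records laid end to end.
staff_file = pack_staff("Ada Obi", "S00123", 41) + pack_staff("Ben Musa", "S00124", 35)
second_record = unpack_staff(staff_file[STAFF_RECORD.size:2 * STAFF_RECORD.size])
```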
Database: A database is a collection of related data. Relationships exist among the elements of data, and
the database is designed for use by a number of different applications. A database may contain all the
information related to an organization or project, and it can contain one or more types of files.
Types of Files
Master File: A master file consists of data fields that are of a permanent nature. The values of these
fields must be kept up to date so that the file always reflects the most recent transactions or state of
affairs in the organization. For instance, an employee file is made up of records whose fields may include
Employee Number, Name, Date of birth, Qualification and Salary grade. These fields are permanent, although
the values of Qualification and Salary grade might need to be updated at a future date.
Transaction File: A transaction file relates to the activities within an organization. It contains
operational data extracted from the transactions of the organization. For example, if a new member of
staff is employed or someone is promoted, the data describing the event is transaction data, and when such
records are collected a transaction file is obtained. A transaction file is often used to update the
master file. Once it has played this role it is no longer needed, although it may be retained or archived
for security and control purposes. Examples of transaction data include invoices, hours worked, new
qualifications, promotions and electricity meter readings.
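The way a transaction file updates a master file can be sketched as follows. This is an illustrative example only: the record layout, the staff numbers and the function `apply_transactions` are assumptions.

```python
def apply_transactions(master, transactions):
    """Return an updated copy of the master records.

    master: dict mapping staff number -> dict of fields
    transactions: list of (staff_number, field, new_value) tuples
    """
    updated = {key: dict(rec) for key, rec in master.items()}
    for staff_no, field, value in transactions:
        if staff_no not in updated:
            updated[staff_no] = {}        # a newly employed member of staff
        updated[staff_no][field] = value  # e.g. a promotion changes the grade
    return updated

master = {"E01": {"name": "Ada Obi", "salary_grade": 7}}
transactions = [("E01", "salary_grade", 8), ("E02", "name", "Ben Musa")]
new_master = apply_transactions(master, transactions)
```

After the update the transaction list is no longer needed for processing, which matches the description above; it could still be kept for audit purposes.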
Work File: A work file is derived from either the master or the transaction file. Work files are often
temporary, as they may not be required once they have served their purpose. For example, a personnel file
can be interrogated to display the records of employees who meet certain criteria; when the records are
extracted from the master file, they are kept in a temporary work file before they are printed. In most
cases, work files are created to hold sorted records.
Reference File: A reference file contains records that may be used in the future or for reference
purposes. A reference file is also called a standing file, and its records are reasonably permanent.
Archive File: An archive file, also referred to as a historical file, contains old files or records that
are no longer in current use. Examples are files containing the particulars of former clients and the
records of graduated students in an institution.
Data File: A data file is a file containing data values, such as a file created within an application
program. Data files are normally organized as sets of records with one or more associated access methods.
For example, a data file may be a word processing document, a spreadsheet, a database file or a chart file.
files etc. The data would then have to be keyed or scanned, record by record, into the computer data file.
It is also possible to modify data previously recorded in the file, or to delete them.
Sorting Records in a File: This involves rearranging the records in a file according to a specified
criterion. For example, the records of staff in the college data file are normally recorded in order of
date of employment and grade. However, the computer could be instructed to sort the records so that they
are arranged in alphabetical order of staff names, or by grade. The advantage of a computer file over a
paper file is that, with paper, a different arrangement would have to be created manually for every
sorting criterion. Imagine the stress of recreating a list of over 500 staff. With a computerized file,
only one file is actually created and stored in the computer, but the computer can display many different
arrangements of the file as required, as easily as one could turn from one page of a book to another.
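A short sketch of this idea: one stored list of records, displayed in several orders without recreating it. The staff records below are made-up sample data.

```python
from operator import itemgetter

# One stored "file" of records, in order of employment (hypothetical data).
staff = [
    {"name": "Musa", "grade": 9, "employed": "2015-03-01"},
    {"name": "Ade", "grade": 7, "employed": "2018-09-15"},
    {"name": "Chika", "grade": 9, "employed": "2020-01-10"},
]

# Each call produces a new arrangement; the stored list itself is unchanged.
by_name = sorted(staff, key=itemgetter("name"))
by_grade_then_name = sorted(staff, key=itemgetter("grade", "name"))
```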
Counting and Calculating with Data in the File: These operations include counting the number of records in
a file and counting the number of times a word or number occurs in any field of the file. For example, one
might be interested in the number of staff employed within a particular period of time. For data fields
containing quantities, one can also calculate the total or average quantity over all records, or a subset
of records, in the file. The computer is an extremely powerful calculator: it can automatically read
billions of numeric values from computer data files and add, subtract, divide and multiply them as
appropriate in a matter of seconds.
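Counting and calculating over records can be sketched as below. The records, field names and the function `count_employed_between` are illustrative assumptions.

```python
from statistics import mean

# Hypothetical staff records with one numeric field (salary).
records = [
    {"name": "Ade", "year_employed": 2018, "salary": 120000},
    {"name": "Musa", "year_employed": 2015, "salary": 150000},
    {"name": "Chika", "year_employed": 2020, "salary": 90000},
]

def count_employed_between(records, first_year, last_year):
    """Count the staff employed within an inclusive range of years."""
    return sum(1 for r in records if first_year <= r["year_employed"] <= last_year)

total_salary = sum(r["salary"] for r in records)
average_salary = mean(r["salary"] for r in records)
```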
Displaying and Browsing Records in a File: After data have been entered into the records of a file, the
records can be browsed by instructing the computer to display them on the screen, in the same way as one
would browse the records in a paper file.
File Querying/Interrogating: This is retrieving specific data from a file according to a set of retrieval
criteria.
File Merging: This is combining multiple data files or sets of records to produce a single set, usually in
an ordered sequence.
Retrieving/Reading: This involves reading existing data from a storage or input medium.
Writing: Writing is the act of recording data onto some form of storage.
Deleting: This means removing a record or item of data from a storage medium such as disk or tape.
File Storage: When a file is created, it is stored on an appropriate storage medium such as a disk, flash
disk, tape or drum.
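File merging of two already sorted sets of records can be sketched with the standard library. The two record lists are made-up sample data, each sorted by staff number.

```python
import heapq

# Two hypothetical files, each already sorted by staff number.
file_a = [(101, "Ade"), (105, "Chika"), (109, "Musa")]
file_b = [(102, "Bola"), (107, "Efe")]

# heapq.merge reads the inputs in step and yields one ordered sequence.
merged = list(heapq.merge(file_a, file_b, key=lambda rec: rec[0]))
```

The merge only works as intended when each input is already in key order; unsorted inputs would have to be sorted first.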
Data Processing
Data Processing is the analysis and organization of data by the repeated use of one or more computer
programs. Data processing is used extensively in business, engineering, and science and to an increasing
extent in nearly all areas in which computers are used. Businesses use data processing for such tasks as
payroll preparation, accounting, record keeping, inventory control, sales analysis, and the processing of
bank and credit card account statements. Engineers and scientists use data processing for a wide variety of
applications, including the processing of seismic data for oil and mineral exploration, the analysis of new
product designs, the processing of satellite imagery, and the analysis of data from scientific experiments.
Data processing is divided into two kinds:
1. Database Processing
2. Transaction Processing.
Database Processing: A database is a collection of common records that can be searched, accessed, and
modified, such as bank account records, school transcripts, and income tax data. In database processing, a
computerized database is used as the central source of reference data for the computations.
Transaction Processing: This refers to an interaction between two computers in which one computer initiates a
transaction and another computer provides the first with the data or computation required for that
function. Most modern data processing uses one or more databases at one or more central sites.
Transaction processing is used to access and update the databases when users need to immediately view
or add information; other data processing programs are used at regular intervals to provide summary
reports of activity and database status. Examples of systems that involve all of these functions are
automated teller machines, credit sales terminals, and airline reservation systems.
v. Human Resource Management e.g. compensation analysis, employee skills inventory, personnel
requirements forecasting
File Systems
A file system provides a mapping between the logical and physical views of a file through a set of
services and an interface. Simply put, it is a system that manages and organizes all computer files,
stores them and makes them available when they are needed. The file system hides all the device-specific
aspects of file manipulation from users. Without a file system, the data kept in storage would be one
large body of data with no way to tell where one file ends and the next starts, which would make files
difficult to identify. The basic services of a file system include:
1) Keeping track of files (knowing their location)
2) I/O support, especially the transmission mechanism to and from main memory
3) Managing of secondary storage
4) Sharing of I/O devices
5) Providing protection mechanisms for information held in the system.
The way a computer organizes, names, stores and manipulates files is globally referred to as its file
system. More formally, a file system is a special-purpose database for the storage, hierarchical
organization, manipulation, navigation, access, and retrieval of data.
Most computers have at least one file system, and some allow the use of several different file systems.
For instance, newer MS Windows computers support the older File Allocation Table (FAT) file systems of
MS-DOS and old versions of Windows, in addition to the New Technology File System (NTFS), which is the
normal file system for recent versions of Windows.
Each file system has its own advantages and disadvantages. Standard FAT allows only eight-character file
names plus a three-character extension and does not allow spaces, while NTFS allows much longer names that
can contain spaces. You can call a file “Payroll records” in NTFS, but in FAT you would be restricted to
something like “payroll.dat” unless you were using VFAT, a FAT extension that allows longer file names.
The most familiar file systems make use of an underlying data storage device that offers access to an
array of fixed-size blocks, sometimes called sectors, generally a power of 2 in size (512 bytes or 1, 2, or 4
KB are most common). The file system is responsible for organizing these sectors into files and
directories, and keeping track of which sectors belong to which file and which are not being used. Most
file systems address data in fixed-sized units called "clusters" or "blocks" which contain a certain number
of disk sectors (usually 1 to 64). This is the smallest logical amount of disk space that can be allocated to
hold a file. However, file systems need not make use of a storage device at all. A file system can be used
to organize and represent access to any data, whether it be stored or dynamically generated (e.g., procfs).
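Because space is allocated in whole clusters, a file usually occupies more disk space than its size. The arithmetic can be sketched as follows; the default sector and cluster sizes and the function `clusters_needed` are assumptions chosen for illustration.

```python
def clusters_needed(file_size_bytes, sector_size=512, sectors_per_cluster=8):
    """Return (clusters, bytes allocated, unused bytes in the last cluster)."""
    cluster_size = sector_size * sectors_per_cluster
    # Round up: a partly filled cluster is still allocated as a whole.
    clusters = -(-file_size_bytes // cluster_size)
    allocated = clusters * cluster_size
    return clusters, allocated, allocated - file_size_bytes
```

For example, with 4096-byte clusters a 10,000-byte file needs three clusters, leaving 2,288 bytes of the last cluster unused.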
Types of File Systems
File systems can be classified into disk file systems, network file systems and special purpose file
systems.
Disk File Systems
A disk file system is a file system designed for storage of files on a data storage device, most commonly a
disk drive, which might be directly or indirectly connected to the computer. Examples of disk file systems
include FAT (FAT12, FAT16, FAT32), NTFS, HFS and HFS+, ext2, ext3, ISO 9660, ODS-5, and UDF.
Some disk file systems are journaling file systems or versioning file systems.
Flash File Systems: A flash file system is a file system designed for storing files on flash memory
devices. These are becoming more prevalent as the number of mobile devices is increasing, and the
capacity of flash memories catches up with hard drives.
Database File Systems: A new concept for file management is the concept of a database-based file
system. Instead of, or in addition to, hierarchical structured management, files are identified by their
characteristics, like type of file, topic, author, or similar metadata.
Transactional File Systems: Each disk operation may involve changes to a number of different files and
disk structures. In many cases these changes are related, meaning that it is important that they are
either all carried out or none of them is. Consider, for example, a bank sending money electronically to
another bank. The bank's computer will "send" the transfer instruction to the other bank and also update
its own records to indicate that the transfer has occurred. If for some reason the computer crashes before
it has had a chance to update its own records, then on restart there will be no record of the transfer but
the bank will be missing some money. Transaction processing introduces the guarantee that, at any point
while it is running, a transaction can either be finished completely or reverted completely (though not
necessarily both at any given point). This means that if there is a crash or power failure, the stored
state will be consistent after recovery: either the money will be transferred or it will not, but it will
never go missing "in transit". This type of file system is designed to be fault tolerant, but may incur
additional overhead to do so.
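The all-or-nothing idea can be approximated at the application level for a single file. The sketch below is not a transactional file system; it is a common technique (write a temporary file, then rename it over the original) shown under the assumption of a POSIX-like system where the rename is atomic. The function name `atomic_write` is made up.

```python
import os
import tempfile

def atomic_write(path, text):
    """Replace the contents of path so readers see the old or new version, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # same directory, same volume
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())    # push the new data to the storage device
        os.replace(tmp_path, path)    # the switch to the new version
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)       # revert: discard the unfinished copy
        raise
```

If the program stops before `os.replace`, the original file is untouched; after it, the new contents are fully in place.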
Network File Systems
A network file system is a file system that acts as a client for a remote file access protocol, providing
access to files on a server. Examples of network file systems include clients for the NFS, AFS, SMB
protocols, and file-system-like clients for FTP and WebDAV.
Special Purpose File Systems
A special purpose file system is basically any file system that is not a disk file system or network file
system. This includes systems where the files are arranged dynamically by software, intended for such
purposes as communication between computer processes or temporary file space. Special purpose file
systems are most commonly used by file-centric operating systems such as Unix. Examples include the
procfs (/proc) file system used by some Unix variants, which grants access to information about processes
and other operating system features.
File Management System
The file management system (FMS), also known as the file manager, is the software responsible for
creating, deleting, modifying and controlling access to files; it also manages the resources
used by files. The FMS has four major responsibilities.
The operation of a file system and file manager can be likened to that of a typical
library, where the file system is the library and the file manager is the librarian who performs
the four major operations in the library:
1) A librarian uses the catalog to keep track of each item in the collection; each entry lists the
call number and the details that help patrons find the books they want.
2) The library relies on a policy to store everything in the collection including oversized books,
magazines, books-on-tape, DVDs, maps, and videos. And they must be physically arranged
so people can find what they need.
3) When it’s requested, the item is retrieved from its shelf and the borrower’s name is noted in
the circulation records.
4) When the item is returned, the librarian makes the appropriate notation in the circulation
records and reshelves it.
In a computer system, the file manager keeps track of its files by storing each file's name, its
physical location in secondary storage and other important information about it in file
directories.
The file manager’s policy determines where each file is stored and how the system and its users
access the file, usually through commands that are independent of device details.
In addition, the policy must determine who will have access to what material. This involves
two factors: flexibility of access to the information and its subsequent protection. The
file manager provides flexibility by allowing access to shared files, providing distributed
access and allowing users to browse through public directories. Meanwhile, the operating system
must protect its files against system malfunctions and provide security checks via account numbers
and passwords to preserve the integrity of the data and safeguard against tampering.
The computer system allocates a file by activating the appropriate secondary storage device and
loading the file into memory while updating its records of who is using which file.
Finally, the file manager deallocates a file by updating the file tables and rewriting the file (if
revised) to the secondary storage device. Any processes waiting to access the file are then
notified of its availability.
Volume configuration
The file manager refers to storage units as volumes, which may be removable or non-removable. A
volume usually contains several files and is then called a multifile volume. In some special
cases, however, a file is extremely large and is spread over more than one volume; such a file is
referred to as a multivolume file. Each volume in the system is recognized by its name, and the
file manager writes this name and other descriptive information about the volume in an
easy-access place: usually the first sector of the outermost track of a disk, the beginning of a
tape, or the innermost part of a CD or DVD. The volume descriptor (for a disk drive) contains the
following information:
i. Volume name: user allocated name
ii. Creation date: Date when the volume was created or added
iii. Pointer to directory area: indicates first sector where directory is stored
iv. Pointer to file area: indicates first sector where file is stored
v. File system code: used to detect volumes with incorrect formats
After the volume descriptor, the next thing stored in the volume is the master file directory
(MFD), the mother of all directories. It stores the names and characteristics of all files and
subdirectories in the volume; the remaining space is used to store files. In early operating
systems, each volume could have only one directory, which was created by the file manager and
contained the filenames, usually arranged in alphabetical or chronological order. This was very
simple to use and maintain but had major disadvantages:
i. It took a long time to search for an individual file, especially when the MFD was organized in
an arbitrary order.
ii. If the user had more than 256 small files stored in the volume, the directory space (with a
maximum of 256 filenames) would fill up before the disk storage space did. The user would
then receive a “disk full” message when only the directory itself was full.
iii. Users could not create subdirectories to group related files.
iv. Multiple users could not safeguard their files from other users because the entire directory was
freely made available to every user in the group on request.
v. Each file in the entire directory needed a unique name, so when the directory
served many users, only one person could save a file with a particular name.
Present-day file managers allow users to create individual subdirectories (folders) to group
related files together. The subdirectories are organized as a tree with the master file directory
as the root. This allows an efficient search of individual directories because there are fewer
entries to search. The point of entry for every file request is the MFD: when a user wants to
access a file, the filename is sent to the file manager. The file manager first searches the MFD
for the subdirectory that contains the file and then searches the subdirectory for the requested
file and its location. Every file entry in every directory contains information describing the
file; this is called the file descriptor. The file descriptor contains the following
information:
i. Filename: within a single directory, filenames must be unique; in some operating systems, the
filenames are case sensitive
ii. File type: the organization and usage are dependent on the system and application on it
iii. File size: although it could be computed from other information, the size is kept here for
convenience
iv. File location: identification of the first physical block (or all blocks) where the file is stored
v. Date and time of creation
vi. Protection information: access restrictions, based on who is allowed to access the file and what
type of access is allowed
vii. Record size: its fixed size or its maximum size, depending on the type of record
Conventionally, the complete name of a file is composed of the directory and subdirectories
containing the file, the filename and the extension of the file. For example, in the Windows
operating system the file path C:\IMFST\COLLEGE\STUDENTS_PAYMENT.DOC indicates that the file
resides on the storage device (volume) C, in the directory IMFST, in the subdirectory COLLEGE. The
name of the file is STUDENTS_PAYMENT, and the extension DOC indicates that the file can be accessed
by a word processing application.
In the Unix or Linux operating system, the file path would be /imfst/college/students_payment.doc.
It starts with a forward slash, which indicates the root directory, followed by the subdirectory
“imfst”, then the subdirectory “college”, and finally the file’s name students_payment.doc (note
that in Unix and Linux filenames are case sensitive and are often written in lowercase).
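The two example paths can be taken apart programmatically. This sketch uses Python's pathlib, which parses both conventions without touching the disk; the helper `describe` is a made-up name.

```python
from pathlib import PureWindowsPath, PurePosixPath

win = PureWindowsPath(r"C:\IMFST\COLLEGE\STUDENTS_PAYMENT.DOC")
nix = PurePosixPath("/imfst/college/students_payment.doc")

def describe(path):
    """Split a complete file name into volume/root, directories, name and extension."""
    return {
        "volume_or_root": path.anchor,
        "directories": list(path.parent.parts[1:]),  # skip the anchor itself
        "name": path.stem,
        "extension": path.suffix,
    }
```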
File Organization
File organization deals with how the records in a file are arranged (how data are stored, added
and accessed), not with how the file itself is arranged in a directory (folder). File
organization is very important because it determines the methods of access, the efficiency, the
flexibility and the storage devices to use. Records in a file can take one of two formats, and
every record must use one of them: fixed-length or variable-length.
Physical File Organization
Records in files can be arranged in one of several ways, depending on the type of storage medium
used. For example, records on magnetic tape are organized serially, while sequential, direct and
indexed sequential organization can be used on a disk drive. The four common methods of file
organization are serial, sequential, direct (random) and indexed sequential.
it is that the user identifies a field (or combination of fields) in the record that uniquely
identifies each record and serves as the key field. This key is transformed into a logical address
by the hashing algorithm, and the logical address is translated into a physical address (i.e. the
cylinder, surface and record numbers) by the file manager.
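The two translation steps (key to logical address, logical address to physical address) can be sketched as follows. The notes do not say which hashing algorithm is used, so division-remainder hashing is an assumed choice here, and the disk geometry (10 records per track, 4 surfaces) is hypothetical.

```python
def logical_address(key, table_size=997):
    """Division-remainder hashing: map a numeric key to a slot in [0, table_size)."""
    return key % table_size

def physical_address(logical, records_per_track=10, surfaces=4):
    """Translate a logical record number into (cylinder, surface, record)."""
    track, record = divmod(logical, records_per_track)
    cylinder, surface = divmod(track, surfaces)
    return cylinder, surface, record
```

Two different keys can hash to the same logical address (a collision); a real file manager needs a policy for that case, which this sketch leaves out.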
[Figure c: unblocked variable-length records R1 to R4, each preceded by a field holding its length.]
There are three types of storage allocation; contiguous, noncontiguous and indexed storage
allocation.
Contiguous Storage:
In contiguous storage, records are stored next to each other in the same compact area. With this
type of allocation, a record can be found and read once its starting address and size are
known. It also makes direct access easy because every part of the file is stored in the same
compact area. The major disadvantage of contiguous storage is that a file cannot be expanded
after creation unless there is empty space available immediately after it. Hence, provision for
expansion must be made when the file is created. If that space is insufficient or becomes
exhausted, the entire file must be copied to a larger section of the disk when more records need
to be added. Fragmentation (slices of unused space) is another disadvantage of contiguous storage;
it can be addressed by compacting and rearranging files. The file manager keeps track of these
empty storage areas by treating them as files: they are entered in the file directory but are
flagged to differentiate them from real files.
Free space | FileA R1 | FileA R2 | FileA R3 | FileA R4 | FileB R1 | FileB R2 | FileB R3 | FileB R4 | Free space | FileC R1 | FileC R2
FileA cannot be expanded, while FileB can be expanded.
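The expansion rule illustrated above can be checked in code. This is a minimal sketch in which the disk is modelled as a list of areas, one entry per record slot, mirroring the layout in the figure; the function `can_expand` is a made-up name.

```python
# Each entry names the owner of one storage area, in disk order.
disk = ["free", "A", "A", "A", "A", "B", "B", "B", "B", "free", "C", "C"]

def can_expand(disk, name):
    """A contiguous file can grow only if the area right after it is free."""
    last = len(disk) - 1 - disk[::-1].index(name)  # index of the file's last area
    return last + 1 < len(disk) and disk[last + 1] == "free"
```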
Noncontiguous Storage:
Noncontiguous storage allows files to be stored in any available storage space on the disk. A
file’s records are stored contiguously as long as there is enough empty space; any remaining
records are stored in other available sections of the disk. These sections are called the extents
of the file and are linked together with pointers. The physical size of each extent is determined
by the operating system and is usually 256 bytes or another power of two. File extents are
linked in one of two ways: linking at the storage level or linking at the directory level.
In linking at the storage level, each extent points to the next one in the sequence on the disk.
The directory entry for the file consists of the filename, the storage location of the first
extent, the location of the last extent and the total number of extents, not counting the first one.
[Figure: linking at the storage level. File F1 occupies disk blocks 2, 7, 13, 20 and 17 of a 20-block disk; each block holds a pointer to the next: 2 -> 7 -> 13 -> 20 -> 17 -> (null).]
Directory
File No.   Start   End   No. of Extents
F1         2       17    4
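Following the chain of pointers from the figure can be sketched as below. The dictionaries mirror the figure's values; their names and the function `extents_of` are made up for illustration.

```python
# Pointer stored in each block of F1 (None marks the last extent).
next_block = {2: 7, 7: 13, 13: 20, 20: 17, 17: None}

# Directory entry: first extent, last extent, extents not counting the first.
directory = {"F1": {"start": 2, "end": 17, "extents": 4}}

def extents_of(name):
    """Walk the chain from the first extent to the last."""
    blocks = []
    block = directory[name]["start"]
    while block is not None:
        blocks.append(block)
        block = next_block[block]
    return blocks
```

Reaching the fourth extent means reading the three before it, which is why this scheme suits sequential rather than direct access.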
In linking at the directory level, each extent is listed with its physical address, its size and
a pointer to the next extent. A null pointer, indicated by a hyphen (-), marks the last extent.
Although the two noncontiguous allocation methods remove external fragmentation and the need for
compaction, they do not support direct access to file records because it is difficult to
determine the exact location of a specific record.
[Figure: linking at the directory level on a 20-block disk. File1 occupies blocks 2, 9, 19 and 18 (extents 1 to 4); File2 occupies blocks 3, 7 and 12 (extents 1 to 3).]
Index Table
File        Address   Size   Next
            1         512
File1 (1)   2         512    9
File2 (1)   3         512    7
            4         512
:           :
File2 (2)   7         512    12
            8         512
File1 (2)   9         512    19
File2 (3)   12        512    -
File1 (4)   18        512    -
File1 (3)   19        512    18
            20        512
At the point of creation, a file is normally declared to be sequential or direct so that the file
manager can select the most appropriate and efficient method of storage allocation.
Contiguous allocation supports direct access, while noncontiguous allocation supports sequential access.
Indexed Storage:
Indexed storage allocation allows direct record access by bringing together the pointers that
link the extents of a particular file into an index block. An index block is created for every
file and contains the address of each disk block that makes up the file (i.e. the location of all
the extents that make up the file). Indexed storage allocation supports both sequential and direct
access, but it does not necessarily improve the use of storage space because each file must have
an index block, which itself occupies space on the disk.
[Figure: indexed storage allocation on a 20-block disk. File1 occupies blocks 2, 9, 19 and 18; File2 occupies blocks 3, 7 and 12.]
Index
File     Index block   Blocks listed in the index block
File 1   15            2, 9, 19, 18
File 2   5             3, 7, 12
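With an index block, the block holding any part of a file can be found without following a chain. The sketch below uses the block numbers from the figure; the 512-byte block size is taken from the earlier table, and the function `locate` is a made-up name.

```python
# Contents of each file's index block, in extent order (from the figure).
index_blocks = {"File1": [2, 9, 19, 18], "File2": [3, 7, 12]}

def locate(name, byte_offset, block_size=512):
    """Return (disk block, offset inside that block) for a byte of the file."""
    extent, offset = divmod(byte_offset, block_size)
    return index_blocks[name][extent], offset
```

For example, byte 1100 of File1 lies in the third extent (1100 = 2 x 512 + 76), which the index maps to disk block 19.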
The file manager accesses files by using the address of the last byte read to locate the next
sequential record. This means that the current byte address (CBA) must be updated every
time a record is accessed, such as when the READ command is executed. The way the CBA is
determined differs for each access method and record format used.
a. Record1 Record2 Record3 Record4 Record5 Record6
For a fixed-length record all records have the same length i.e. RL = x for all
Sequential Access Method
For sequential access of fixed-length records, the CBA is updated simply by incrementing it by
the record length (RL), which is constant for fixed-length records.
CBA = CBA + RL
For sequential access of variable-length records, the file manager adds the length of the record
(RL) plus the number of bytes used to hold the record length (N, where N is m, n, p and q as
illustrated in b above) to the CBA.
CBA = CBA + RL + N
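The variable-length update rule can be traced in code. This sketch assumes every record is preceded by a 2-byte length field (so N = 2 throughout), which is a simplification chosen for illustration; the function names are made up.

```python
import struct

LENGTH_FIELD = struct.Struct("<H")  # assumed: N = 2 bytes hold each record length

def write_records(records):
    """Store each record as its length followed by its bytes."""
    out = bytearray()
    for rec in records:
        out += LENGTH_FIELD.pack(len(rec)) + rec
    return bytes(out)

def read_sequentially(data):
    """Return (CBA, record) pairs, advancing the CBA after each read."""
    cba = 0
    found = []
    while cba < len(data):
        (rl,) = LENGTH_FIELD.unpack_from(data, cba)
        start = cba + LENGTH_FIELD.size
        found.append((cba, data[start:start + rl]))
        cba = cba + rl + LENGTH_FIELD.size  # CBA = CBA + RL + N
    return found
```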
Direct Access Method
A file organized using the direct access method can be accessed by either the direct or the
sequential method if its records are of fixed length. If the file is organized for direct access
but its record format is variable-length, it cannot be accessed by the direct method because the
position of a record cannot be computed directly from its record number. In this case, the
records are accessed by the sequential method.
For direct access of fixed-length records, the CBA can be computed directly from the record
length and the desired record number RN (information provided through the READ command)
minus 1.
CBA = RL * (RN - 1)
For example, if we are looking for the beginning of the fifteenth (15) record with a fixed record
length of 25 bytes, the CBA would be computed as
CBA = 25* (15 - 1) = 350
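The same computation can be used to jump straight to a record. This sketch uses an in-memory stream in place of a disk file; the function names `cba_for` and `read_record` are made up.

```python
import io

def cba_for(record_number, record_length):
    """CBA = RL * (RN - 1), with records numbered from 1."""
    return record_length * (record_number - 1)

def read_record(stream, record_number, record_length):
    """Seek directly to the record and read exactly one record length."""
    stream.seek(cba_for(record_number, record_length))
    return stream.read(record_length)

# Twenty hypothetical 25-byte records, each holding its own number.
stream = io.BytesIO(b"".join(f"{i:025d}".encode("ascii") for i in range(1, 21)))
fifteenth = read_record(stream, 15, 25)
```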
As explained, a file’s organization and the method used to access its records are very closely
intertwined; so when one talks about a specific type of organization, one is almost certainly
implying a specific type of access.
only the names of users who are given permission to use the file are listed; those denied all
access are grouped under a global heading, WORLD. In some systems, the access control list
puts every user into one of four categories: SYSTEM/ADMIN, OWNER, GROUP and WORLD.
SYSTEM/ADMIN is used for system personnel who have unlimited access to all files in the
system. The OWNER has absolute control over all files created in the owner's account. An
owner may create a GROUP file so that all users belonging to the appropriate group have
access to it. WORLD is composed of all other users in the system, that is, those who do not
fall into any of the other three categories.
File    Access
File1   USER1 (RWED), USER2 (R-E-), USER4 (RWE-), USER5 (--E-), WORLD (----)
File2   USER2 (RWED), USER5 (R-E-), USER6 (RWE-), WORLD (----)
File3   USER3 (RWED), USER4 (R-E-), USER6 (RWE-), WORLD (----)
File4   USER5 (RWED), WORLD (----)
File5   USER1 (RWED), USER2 (R-E-), USER3 (RWE-), WORLD (----)
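A lookup against an access control list of this kind might be sketched as follows. The dictionary layout and the function name are assumptions made for illustration, and the letters are assumed to stand for Read, Write, Execute and Delete. Only two rows of the table are reproduced.

```python
# Access control list: one entry per file, listing permitted users only.
ACL = {
    "File1": {"USER1": "RWED", "USER2": "R-E-", "USER4": "RWE-", "USER5": "--E-"},
    "File4": {"USER5": "RWED"},
}

# Anyone not named for a file falls under WORLD, which has no access.
WORLD = "----"


def is_allowed(acl, file_name, user, permission):
    """Return True if `user` holds `permission` (R, W, E or D) on the file."""
    granted = acl.get(file_name, {}).get(user, WORLD)
    return permission in granted


print(is_allowed(ACL, "File1", "USER2", "R"))  # True
print(is_allowed(ACL, "File1", "USER2", "W"))  # False
print(is_allowed(ACL, "File1", "USER3", "R"))  # False (USER3 is in WORLD)
```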
Capability List
A capability list shows the access control information from a different perspective: the
table lists every user together with the files that user may access. It is a newer method
that is gaining popularity, especially in operating systems such as Linux, where it is used to
control access to both devices and files.
Users   Access
User1   File1 (RWED), File2 (R-E-), File4 (RWE-), File5 (--E-)
User2   File2 (RWED), File5 (R-E-), File6 (RWE-)
User3   File3 (RWED), File4 (R-E-), File6 (RWE-)
User4   File5 (RWED)
User5   File1 (RWED), File2 (R-E-), File3 (RWE-)
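The change of perspective from "per file" to "per user" can be sketched in Python as a simple inversion of the table. The function name and the two-file sample data are hypothetical; the sample is not meant to reproduce the Users/Access table above.

```python
def to_capability_lists(acl):
    """Turn {file: {user: permissions}} into {user: {file: permissions}}."""
    capabilities = {}
    for file_name, entries in acl.items():
        for user, permissions in entries.items():
            capabilities.setdefault(user, {})[file_name] = permissions
    return capabilities


# Hypothetical per-file access control entries.
sample_acl = {
    "File1": {"USER1": "RWED", "USER2": "R-E-"},
    "File2": {"USER2": "RWED", "USER5": "R-E-"},
}

capabilities = to_capability_lists(sample_acl)
print(capabilities["USER2"])  # {'File1': 'R-E-', 'File2': 'RWED'}
```

With the per-user table, listing everything a user can reach is a single lookup, whereas the per-file table answers "who may use this file?" more directly.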
Data Compression
Algorithms for data compression fall into two categories: lossless and lossy.
Lossless Algorithm
Lossless algorithms are used for text compression. They retain all the data in the file
throughout the compression-decompression process. There are several methods, including
compression of records with repeated characters, compression of repeated terms, and
front-end compression.
Records with repeated characters: Data in a database (usually in fixed-length fields) might
include entries with fewer characters than the field is designed for, with the remainder
filled with blank characters. Such a field can be replaced with a variable-length field and a
special code indicating how many blank characters were truncated (removed). For example,
the original string ADEOYE looks like this when it is stored uncompressed in a field that is
15 characters wide (b stands for a blank character):
ADEOYEbbbbbbbbb
When it is encoded it looks like this:
ADEOYEb9
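The encoding above can be sketched in Python as follows. The function names are hypothetical, and using the letter b followed by a count as the special code simply mirrors the notation of the example; a real system would need a code that cannot be confused with ordinary data.

```python
import re


def compress_trailing_blanks(field, blank=" ", marker="b"):
    """Replace a run of trailing blanks with the marker and the blank count."""
    stripped = field.rstrip(blank)
    count = len(field) - len(stripped)
    if count == 0:
        return stripped
    return f"{stripped}{marker}{count}"


def expand_trailing_blanks(encoded, blank=" ", marker="b"):
    """Restore the blanks; assumes the data itself never ends in marker+digits."""
    match = re.fullmatch(rf"(.*){re.escape(marker)}(\d+)", encoded)
    if match is None:
        return encoded
    text, count = match.groups()
    return text + blank * int(count)


stored = "ADEOYE".ljust(15)               # 15-character fixed-length field
encoded = compress_trailing_blanks(stored)
print(encoded)                             # ADEOYEb9
print(len(expand_trailing_blanks(encoded)))  # 15
```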
Repeated terms: Frequently used terms in a database can be compressed by using symbols to
represent them. For example, in a college's student database, words like student, course,
teacher, classroom, grade, and department will be common, and each could be represented by a
single character. In this case it is imperative that the system be able to distinguish between
compressed and uncompressed data.
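A small Python sketch of term substitution is given below. The choice of Greek letters as the single-character codes is purely an assumption for the example; it works only because those characters are assumed never to occur in the uncompressed data, which is how the sketch tells compressed from uncompressed text.

```python
import re

# Hypothetical code table: one single-character symbol per frequent term.
TERM_CODES = {"student": "α", "course": "β", "teacher": "γ", "department": "δ"}
CODE_TERMS = {code: term for term, code in TERM_CODES.items()}

# Match whole words only, so "students" or "coursework" are left alone.
_TERM_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, TERM_CODES)) + r")\b")


def compress_terms(text):
    """Replace each frequent term with its one-character code."""
    return _TERM_PATTERN.sub(lambda m: TERM_CODES[m.group(1)], text)


def expand_terms(text):
    """Replace each code character with the term it stands for."""
    return "".join(CODE_TERMS.get(ch, ch) for ch in text)


original = "each student takes one course per department"
packed = compress_terms(original)
print(packed)                            # each α takes one β per δ
print(expand_terms(packed) == original)  # True
```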
Front-end compression: In this method, the compression of each data element builds on the
previous data element. Each entry records only how many leading characters it shares with the
previous entry, followed by the characters that differ. For example, a student database in
which the students' names are kept in alphabetical order could be compressed as described in
the table below. Although this method saves space, it requires additional processing time to
decompress, and the system must be able to differentiate between compressed and
uncompressed data.
[Table: alphabetical list of student names alongside the front-end compressed list]
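Front-end compression can be sketched in Python as follows. The names in the sample list are hypothetical, and storing each entry as a (shared count, remainder) pair is a design choice for the sketch rather than the format of any specific database.

```python
def front_end_compress(sorted_names):
    """Store each name as (characters shared with previous name, remainder)."""
    compressed = []
    previous = ""
    for name in sorted_names:
        shared = 0
        for a, b in zip(previous, name):
            if a != b:
                break
            shared += 1
        compressed.append((shared, name[shared:]))
        previous = name
    return compressed


def front_end_expand(compressed):
    """Rebuild each name from the previous name and the stored remainder."""
    names = []
    previous = ""
    for shared, remainder in compressed:
        name = previous[:shared] + remainder
        names.append(name)
        previous = name
    return names


# Hypothetical alphabetical list of student names.
names = ["Smith, Betty", "Smith, Donald", "Smith, Gino", "Smithberger, John"]
packed_names = front_end_compress(names)
print(packed_names)
# [(0, 'Smith, Betty'), (7, 'Donald'), (7, 'Gino'), (5, 'berger, John')]
print(front_end_expand(packed_names) == names)  # True
```

Decompression has to walk the list from the start, because every entry depends on the one before it; this is the extra processing time the method trades for space.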
Lossy Algorithm
In lossy compression, some data from the original file is lost to allow significant compression.
This means the compression process is irreversible, as the original file cannot be reconstructed.
The specifics of the compression algorithm are highly dependent on the type of file being
compressed and are usually governed by standards from the International Organization for
Standardization (ISO).
Storage Media
Storage media can be categorized into two groups: sequential access media (which support
sequential storage and access to records in files) and direct access storage devices (which
support both sequential and direct storage and access to records in files). Their major
differences are speed and sharability.
For example, if records of 160 characters each are stored on a tape with a density of 1600 bpi,
then theoretically at most 10 records can be stored on one inch of the tape. In practice this
depends on how the records are stored. There are two ways of storing records on tape:
individually and grouped in blocks. When records are stored individually, each record is
separated from the next by a space that marks its starting and ending points. If the records
are stored in blocks, the entire block is preceded by a space and followed by a space, but the
individual records are still stored sequentially within the block.
Magnetic tape moves under the read/write head only when there is a need to access or write a
record; at all other times it stands still. So the tape moves in jerks as it stops, reads, and
moves on at high speed, or stops, writes, and starts again, and so on. Records are written in
the same way. The tape needs time and space to stop, so a gap is inserted between each pair of
records. This is called the inter-record gap (IRG) and is about 0.5 inch long regardless of the
sizes of the records it separates. Therefore, if 10 records are stored individually, there will
be nine 0.5-inch IRGs between them. Assume each record occupies only 0.1 inch of the tape.
Then 5.5 inches (10 * 0.1 + 9 * 0.5) are needed to store 10 records that would otherwise
occupy one inch, which is not an efficient use of the storage medium.
The second method is to group records into blocks before storing them on tape. This is called
blocking, and it is performed when the file is created (the records must be deblocked
accurately when they are retrieved). The number of records in a block is usually determined by
the application program, and it is often set to take advantage of the transfer rate, which is
the density of the tape (measured in bpi) multiplied by the tape drive speed (transport speed),
which is measured in inches per second (ips).
Transfer rate (bytes per second) = density * transport speed
For example, if the transport speed is 200 ips and the tape stores 1600 bytes per inch (bpi), a
total of 320,000 bytes can be transferred in one second. This implies that the optimal size of
a block is 320,000 bytes. In this method, the entire block must be read into a buffer, so the
buffer must be at least as large as the block.
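The transfer-rate calculation can be sketched in Python as below. The function name is hypothetical. The last step, which divides the block size by the 160-character record length from the earlier tape example, is an added illustration and assumes one character occupies one byte.

```python
def transfer_rate(density_bytes_per_inch, transport_speed_ips):
    """Transfer rate in bytes per second = density * transport speed."""
    return density_bytes_per_inch * transport_speed_ips


rate = transfer_rate(1600, 200)
print(rate)  # 320000 bytes per second

# Assumption: 160-byte records, as in the earlier tape example.
records_per_block = rate // 160
print(records_per_block)  # 2000 records fit in a 320,000-byte block
```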
The gap, which in this case is called the inter-block gap (IBG), is still 0.5 inch long, but
the data from each block of 10 records is now stored on just 1 inch of tape, followed by a
0.5-inch IBG. This makes a total of 1.5 inches to store 10 records, compared with the
5.5 inches needed when the records are stored individually.
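The comparison between the two storage methods can be sketched in Python. The function name is hypothetical, and the gap counts follow the text: nine IRGs between ten individual records, and one IBG for the single block.

```python
def tape_inches(data_inches, gap_count, gap_inches=0.5):
    """Tape used = space taken by the data + space taken by the gaps."""
    return data_inches + gap_count * gap_inches


record_inches = 0.1   # each record occupies 0.1 inch, as assumed in the text
records = 10

# Stored individually: a gap between each pair of neighbouring records.
individual = tape_inches(records * record_inches, gap_count=records - 1)

# Stored as one block of 10 records with a single inter-block gap.
blocked = tape_inches(records * record_inches, gap_count=1)

print(individual)  # 5.5
print(blocked)     # 1.5
```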
There are two disadvantages of blocking:
i. Overhead and software routines are needed for blocking, deblocking, and recordkeeping.
ii. Buffer space may be wasted if you need only one logical record but must read an entire
block to get it.
Access time to a block or an individual record on magnetic tape depends on where the record is
located. For example, suppose a tape is 2400 feet long and the transport speed is 200 ips.
Calculate the time taken to read the last record on the tape.
Solution:
1 foot = 12 inches, so 2,400 feet is equivalent to 28,800 inches (2400 * 12). If the drive
reads 200 inches in one second, how many seconds will it take to read 28,800 inches?
28,800 / 200 = 144 seconds (about 2.4 minutes) to read the last record on the tape.
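The same calculation can be written as a short Python sketch. The function name is hypothetical; the figures are those of the worked example.

```python
def seconds_to_reach(position_feet, transport_speed_ips):
    """Time for the tape to travel from its start to the given position."""
    position_inches = position_feet * 12   # 1 foot = 12 inches
    return position_inches / transport_speed_ips


seconds = seconds_to_reach(2400, 200)
print(seconds)       # 144.0
print(seconds / 60)  # 2.4 (minutes)
```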
A hard disk is a stack of one or more metal platters that spin on a spindle. Each platter is
coated with iron oxide, and the entire unit is encased in a sealed chamber. Hard disks hold
more data than other direct access storage devices because they often include several
platters, stacked one on top of another, to provide a large storage capacity. The number of
read/write heads specifies the number of sides that the disk uses to store data. If a
particular hard disk drive has 12 platters, the number of heads is most likely to be 22, which
implies that two sides (the top of the uppermost platter and the bottom of the lowest) will
not be used for data storage. Each platter (except those at the top and bottom of the stack)
has two surfaces for recording, and each surface is formatted with a specific number of
concentric tracks where the data are recorded. The number of tracks varies from manufacturer
to manufacturer, but typically there are a thousand or more on a high-capacity hard disk. Each
track on each surface is numbered: Track 0 identifies the outermost concentric circle on each
surface; the highest-numbered track is in the center.
The read/write arm that moves between each pair of platters carries two read/write heads, one
for the surface above it and the other for the surface below. The arm moves all of the heads
in unison, so if one head is on Track 36, then all of the heads are on Track 36. That is, they
are all positioned on the same track but on their respective surfaces, creating a virtual
cylinder. Because of this, it is more efficient to fill the disk cylinder by cylinder (the same
track on every surface) than to fill it surface by surface. When Track 0 of every surface is
filled, the result is a virtual cylinder of data, so the number of cylinders is equal to the
number of tracks on a surface. This makes access to every portion of the disk easier. To access
any portion of the disk, the system needs three things: first, the cylinder number (i.e. the
track number), so the arm can move the read/write heads to it; second, the surface number, so
the system activates the appropriate read/write head; and last, the sector number, where the
head will read or store the data. This tells the read/write head where and when it should
begin reading or writing.
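The three-part address and the cylinder-by-cylinder fill order can be sketched in Python. The type, the function name and the geometry (4 surfaces, 8 sectors per track) are all hypothetical and chosen only to keep the numbers small.

```python
from typing import NamedTuple


class DiskAddress(NamedTuple):
    cylinder: int   # which track position the arm moves to
    surface: int    # which read/write head is activated
    sector: int     # where on the track the data starts


def locate(sector_index, surfaces, sectors_per_track):
    """Map the n-th sector written to its address when the disk is
    filled cylinder by cylinder (all surfaces of one track first)."""
    sectors_per_cylinder = surfaces * sectors_per_track
    cylinder, rest = divmod(sector_index, sectors_per_cylinder)
    surface, sector = divmod(rest, sectors_per_track)
    return DiskAddress(cylinder, surface, sector)


# Assumed geometry: 4 recording surfaces, 8 sectors per track.
print(locate(0, 4, 8))   # DiskAddress(cylinder=0, surface=0, sector=0)
print(locate(8, 4, 8))   # DiskAddress(cylinder=0, surface=1, sector=0)
print(locate(32, 4, 8))  # DiskAddress(cylinder=1, surface=0, sector=0)
```

In this fill order the arm moves only after every surface of the current cylinder is full; switching from one surface to the next needs only a different head to be activated.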
A fixed-head magnetic disk looks like a large CD or DVD covered with magnetic film that has
been formatted, usually on both sides, into concentric circles (each circle is a track). Data
is recorded serially on each track by the fixed read/write head positioned over that track.
This makes fixed-head magnetic disks much faster than movable-head disks. The major
disadvantages of this type of disk are that it is very expensive and that the volume of data
it can store is very small compared with a movable-head magnetic disk, because the tracks are
positioned farther apart to allow for the width of the read/write heads.
Access time of magnetic disks is determined by three parameters: seek time, search time
(rotational delay) and transfer time.
Seek time is the time required to position the read/write heads on the track where the data is.
Seek time is not required for fixed-head magnetic disks, since each track has its own
read/write head. Seek time is the slowest of the three factors.
Search time (rotational delay) is the time it takes to rotate the disk until the requested
record is placed beneath the read/write head.
Transfer time is the time required to move the data from secondary storage to primary
storage. It is the fastest of the three.
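How the three parameters combine can be sketched in Python. The function name and the millisecond figures are hypothetical and are not measurements of any real drive; they only illustrate that a fixed-head disk drops the seek component.

```python
def access_time_ms(seek_ms, search_ms, transfer_ms, fixed_head=False):
    """Access time = seek time + search time + transfer time.

    A fixed-head disk has one head per track, so it needs no seek time.
    """
    if fixed_head:
        seek_ms = 0.0
    return seek_ms + search_ms + transfer_ms


# Hypothetical timings in milliseconds.
movable = access_time_ms(seek_ms=25.0, search_ms=8.5, transfer_ms=0.25)
fixed = access_time_ms(seek_ms=25.0, search_ms=8.5, transfer_ms=0.25, fixed_head=True)
print(movable)  # 33.75
print(fixed)    # 8.75
```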