
File Concept in Computing

A file is any collection of data stored under a file name on a computer storage medium such as a disk or
tape. Data in a file might comprise alphabetical, numerical or special characters, or digitized images. A
single word-processed letter stored on the computer under a file name would be regarded as a file, just as
would hundreds of college staff records stored under a single file name. Records in a file are individually
different but share a common structure. For example, the payroll file of a college might contain
information about all staff in the college and their payroll details. Each record in the payroll file relates
to a particular staff member.
A computer file may contain lines of text (a text file), an arbitrary binary image, or executable code,
depending on the purpose for which it was created. Applications typically read their input from files and
save their output in files. The structure of a file usually depends on the person or program designing it,
although several standard formats have been established for different purposes. Most computer files are
created and used by the applications that support them. For example, in Microsoft Word the user creates
a document (file), names it and chooses its location. The content of the document file can be manipulated
by the user in ways that the application understands. In some cases, applications pack all their data files
into a single file, using markers to distinguish the parts based on the information they contain.
Files can be created, moved, modified or deleted. In most cases, the applications supporting the file
handle these operations, but users can also manipulate files directly when necessary. For example, in
Microsoft Word, files are usually created and modified in response to user commands, but the user can
also move, rename or delete the files directly.
In operating systems such as Unix and Linux, users do not manipulate stored files directly; all operations
on files are controlled by the kernel. The kernel is the core of the operating system and performs essential
functions such as managing memory and files and allocating system resources. In this case, the operating
system creates an abstraction such that all user interaction with a file goes through a hard link, that is, a
directory entry (name) that refers to the file. For example, a user cannot delete a file directly; the user
can only delete links to it. When the kernel finds that no link to the file remains and no process still has
it open, it releases the file's storage automatically.
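
The link behaviour described above can be observed from a program. The following is a minimal sketch in
Python, assuming a file system that supports hard links; the file names report.txt and alias.txt and the
function name link_demo are made up for illustration.

```python
import os
import tempfile

def link_demo():
    """Create a file, add a second hard link, remove the first name,
    and return the link counts observed along the way."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "report.txt")   # hypothetical file names
        second = os.path.join(d, "alias.txt")
        with open(first, "w") as f:
            f.write("payroll data")
        before = os.stat(first).st_nlink        # one name refers to the file
        os.link(first, second)                  # add a second hard link
        during = os.stat(first).st_nlink        # two names, same file
        os.remove(first)                        # removes a link, not the data
        with open(second) as f:
            content = f.read()                  # data still reachable
        after = os.stat(second).st_nlink
        return before, during, after, content
```

Removing the first name only lowers the link count; the data stays reachable through the remaining name
until that link is removed as well.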

Components of a File and their Relationship


Bits: A bit is a binary digit that takes the value 0 or 1. The bit is the basic unit of information storage,
and bits are the only symbols that computers recognize and communicate with. In transmission, a bit is
seen as an electrical pulse timed by a clock signal. In some logic circuits, for example, logical 1 (true) is
represented by about 5 volts and logical 0 (false) by about 0 volts. Conventionally, the bit is denoted
by b.

Byte: A byte is a collection of 8 bits. It can represent a single character, and it is described as the
smallest addressable unit of computer memory and the basic unit for measuring information storage.
Conventionally, the byte is denoted by B.
Some Quantities of Bits
Unit              Decimal equivalent   Binary equivalent
kilobit (kbit)    10^3                 2^10
megabit (Mbit)    10^6                 2^20
gigabit (Gbit)    10^9                 2^30
terabit (Tbit)    10^12                2^40
petabit (Pbit)    10^15                2^50
exabit (Ebit)     10^18                2^60
zettabit (Zbit)   10^21                2^70
yottabit (Ybit)   10^24                2^80
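
As a small worked illustration of the table, the sketch below computes the decimal and binary value of
each prefix and converts bits to bytes. The function names are made up for this example.

```python
# Prefixes in the order used in the table: each step multiplies the
# decimal value by 10^3 and the binary value by 2^10.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

def decimal_bits(prefix):
    """Number of bits using the decimal (power of ten) interpretation."""
    return 10 ** (3 * (PREFIXES.index(prefix) + 1))

def binary_bits(prefix):
    """Number of bits using the binary (power of two) interpretation."""
    return 2 ** (10 * (PREFIXES.index(prefix) + 1))

def bits_to_bytes(bits):
    """One byte is a collection of 8 bits."""
    return bits // 8
```

For example, decimal_bits("kilo") is 1000 while binary_bits("kilo") is 1024, and 1024 bits occupy 128 bytes.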

Character: A character is the smallest unit of information meaningful to a user. Characters include
letters, digits and special symbols such as +, -, =, *, %, & (i.e. all symbols on a keyboard).
Field: A field is a combination of characters. It is one or more bytes holding a single piece of
information about a single entity (i.e. an attribute or characteristic of the entity). An attribute can be
assigned a value and is characterized by its length and data type. Examples are First name, Age, Gender
and Date of birth; each can be assigned a value and has a length and a data type. The content of a field is
provided by a user or a program.
Record: A record is a collection of related, organized fields that can be treated as complete information
about an entity in a file. For example, a staff record would contain fields such as name and staff
identification number.
File: A file is a collection of related records. Files are treated as single entities by users and applications,
and may be referenced by name. Files have unique file names and may be manipulated, e.g. created or
deleted.
Database: A database is a collection of related data. Relationships exist among the elements of data, and
the database is designed for use by a number of different applications. A database may contain all the
information related to an organization or project, and can contain one or more types of files.
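
The field, record and file hierarchy can be sketched in Python as follows. This is only an illustration:
the record layout (staff_id, first_name, age) and the use of CSV as the file format are assumptions, not
something the notes prescribe.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class StaffRecord:       # a record: a collection of related fields
    staff_id: int        # each field has a name and a data type
    first_name: str
    age: int

def write_file(records, stream):
    """Write the records to a file-like object, one record per line."""
    names = [f.name for f in fields(StaffRecord)]
    writer = csv.DictWriter(stream, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))

def read_file(stream):
    """Read the records back, restoring each field's data type."""
    return [StaffRecord(int(row["staff_id"]), row["first_name"], int(row["age"]))
            for row in csv.DictReader(stream)]
```

Here io.StringIO can stand in for a real file when trying the functions out.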

Types of Files
Master File: A master file consists of data fields which are of a relatively permanent nature. The values
of these fields must be kept up to date so that the file always reflects the most recent state of affairs in
the organization. For instance, an employee file is made up of records whose fields may include
Employee Number, Name, Date of birth, Qualification, Salary grade, etc. These fields are permanent,
although Qualification and Salary grade might need to be updated at a future date.

Transaction File: A transaction file relates to the activities within an organization. It contains
operational data extracted from the organization's transactions. For example, if a new staff member is
employed or anyone is promoted, the data is transaction data, and when such records are collected, a
transaction file is obtained. A transaction file is often used to update the master file. Once it has played
this role, it is no longer needed for processing. It may, however, be retained or archived for security and
control purposes. Examples of transaction data include invoices, hours worked, new qualifications,
promotions, electricity meter readings, etc.
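
A minimal sketch of how a transaction file might be applied to a master file is given here. The
representation (a dictionary keyed by employee number, and transactions as tuples with the actions
"add", "update" and "delete") is an assumption made for the example.

```python
def apply_transactions(master, transactions):
    """Return an updated copy of the master file.

    master: dict mapping employee number -> record (dict of fields)
    transactions: list of (action, employee_number, data) tuples
    """
    updated = {key: dict(record) for key, record in master.items()}
    for action, emp_no, data in transactions:
        if action == "add":          # e.g. a new staff member is employed
            updated[emp_no] = dict(data)
        elif action == "update":     # e.g. a promotion changes the grade
            updated[emp_no].update(data)
        elif action == "delete":
            del updated[emp_no]
        else:
            raise ValueError(f"unknown action: {action}")
    return updated
```

The original master is left untouched, so the old version and the transaction file can both be archived
after the update.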

Work File: A work file is derived from either the master or the transaction files. Work files are often
temporary in nature, as they may not be needed once they have served their purpose. For example, a
personnel file can be interrogated to display the records of employees who meet certain criteria, and when
the records are extracted from the master file, they are kept in a temporary work file before they are
printed. In many cases, work files are created to hold sorted records.
Reference File: A reference file is a file which contains records that may be used in the future or for
reference purposes. A reference file is also called a standing file, and its records are reasonably permanent.
Archive File: This file is also referred to as a historical file, as it contains old files or records which are
no longer in current use. Examples are files containing particulars of former clients, records of graduated
students in an institution, etc.
Data File: A data file is a file containing data values, such as a file created within an application
program. Data files are normally organized as sets of records with one or more associated access methods.
For example, a data file may be a word processing document, a spreadsheet, a database file or a chart file.

File Processing Operations


File Creation: This involves specifying a name for the file and defining the file’s record structure.
Application programs that manage files include facilities for creating a file and its record structure.
File Searching: This involves locating data in a file by reference to a special field of each record,
called the key. The key is a unique field used to identify a particular record in a file. If a record is to be
inserted into a file, it must be given a unique key value.
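
Searching by key can be sketched as follows. This assumes records are held as dictionaries and that an
in-memory index is acceptable; the function names are illustrative.

```python
def build_index(records, key_field):
    """Map each key value to its record, rejecting duplicate keys."""
    index = {}
    for record in records:
        key = record[key_field]
        if key in index:
            raise ValueError(f"duplicate key: {key}")
        index[key] = record
    return index

def search(index, key):
    """Return the record with the given key, or None if it is absent."""
    return index.get(key)
```

The duplicate check reflects the requirement that every record inserted into the file carries a unique
key value.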
File Update: This involves entering data into records of the file, modifying existing records and deleting
existing records as might be required. Data can be entered into the computer after the file’s record
structure has been defined. Data entry may be performed as the data becomes available; for example, as
the college employs more staff, the file needs to be updated. Alternatively, the data might already exist
in other formats, such as completed (filled) paper forms, notebooks, completed questionnaires, paper
files, etc. The data would then have to be keyed or scanned, record by record, into the computer data file.
It is also possible to modify data that has previously been recorded in the file, or to delete it.
Sorting Records in a File: This involves rearranging the records in a file according to a specified
criterion. For example, the records of staff in the college data file are normally recorded in order of date
of employment and grade. However, the computer could be instructed to sort the records so that they are
arranged in alphabetical order of the names of staff or by their grades. The advantage of a computer file
over a paper file in terms of ease of sorting is that, with a paper file, a separate arrangement would have
to be created manually for every sorting criterion. Imagine the stress of recreating a list of over 500
staff. With a computerized file, only one file is actually created and stored in the computer, but the
computer can display many different arrangements of the file as required, as easily as one could turn
from one page of a book to another.
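
The idea that one stored file can be displayed in many orders can be sketched like this. The field names
in the usage note are assumptions for the example.

```python
from operator import itemgetter

def sort_records(records, *sort_fields):
    """Return a new list of records ordered by the given fields.

    The stored list is not modified, so the same file can be shown
    in as many different orders as required.
    """
    return sorted(records, key=itemgetter(*sort_fields))
```

For instance, sort_records(staff, "name") gives an alphabetical listing, while
sort_records(staff, "grade", "name") orders by grade and then by name within each grade.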
Counting and Calculating with Data in the File: This includes counting the number of records in a file,
counting the number of times a word or number occurs in any field of the file, etc. For example, one
might be interested in the number of staff employed within a particular period of time. Also, for data
fields containing quantities, one can calculate the total or average quantity over all or a subset of the
records in the file. The computer is an extremely powerful calculator: it can automatically read large
volumes of numeric data from data files and add, subtract, divide and multiply them as appropriate in a
matter of seconds.
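
A minimal sketch of counting and calculating over records follows. The field names used in the usage
note (year_employed, salary) are hypothetical.

```python
def count_where(records, predicate):
    """Count the records for which predicate(record) is true."""
    return sum(1 for record in records if predicate(record))

def total(records, field):
    """Sum a numeric field over all records."""
    return sum(record[field] for record in records)

def average(records, field):
    """Mean of a numeric field, or None for an empty set of records."""
    return total(records, field) / len(records) if records else None
```

The number of staff employed between 2015 and 2020 would then be
count_where(staff, lambda r: 2015 <= r["year_employed"] <= 2020).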
Displaying and Browsing Records in a File: After data are entered into records in the file, it is possible
to browse the records by instructing the computer to display them on the screen. In the same manner as
one can browse the records in a paper file, one can also browse a file on a computer.
File Querying/Interrogating: This is retrieving specific data from a file according to a set of retrieval
criteria.
File Merging: Combining multiple sets of data files or records to produce a single set, usually in an
ordered sequence, is referred to as file merging.
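
Merging can be sketched with the standard library, assuming each input file is already sorted on the
same key field. The function name merge_files is made up for the example.

```python
import heapq
from operator import itemgetter

def merge_files(*sorted_files, key_field):
    """Merge several record lists, each already sorted on key_field,
    into one list in the same order."""
    return list(heapq.merge(*sorted_files, key=itemgetter(key_field)))
```

Because the inputs are already ordered, the merge only ever compares the next unread record of each
file, which is how large files can be merged without sorting everything again.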
Retrieving/Reading: This involves reading existing data from some form of storage or input medium.
Writing: Writing is the act of recording data onto some form of storage.
Deleting: This means removing a record or item of data from a storage medium such as disk or tape.
File Storage: When a file is created, it is stored on an appropriate storage medium such as disk, flash
disk, tape, drum, etc.

Data Processing
Data Processing is the analysis and organization of data by the repeated use of one or more computer
programs. Data processing is used extensively in business, engineering, and science and to an increasing
extent in nearly all areas in which computers are used. Businesses use data processing for such tasks as
payroll preparation, accounting, record keeping, inventory control, sales analysis, and the processing of
bank and credit card account statements. Engineers and scientists use data processing for a wide variety of
applications, including the processing of seismic data for oil and mineral exploration, the analysis of new
product designs, the processing of satellite imagery, and the analysis of data from scientific experiments.
Data processing is divided into two kinds of processing:
1. Database Processing
2. Transaction Processing
Database Processing: A database is a collection of related records that can be searched, accessed, and
modified, such as bank account records, school transcripts, and income tax data. In database processing, a
computerized database is used as the central source of reference data for the computations.
Transaction Processing: This refers to an interaction between two computers in which one computer
initiates a transaction and another computer provides the first with the data or computation required for
that function. Most modern data processing uses one or more databases at one or more central sites.
Transaction processing is used to access and update the databases when users need to immediately view
or add information; other data processing programs are used at regular intervals to provide summary
reports of activity and database status. Examples of systems that involve all of these functions are
automated teller machines, credit sales terminals, and airline reservation systems.

Reasons for Data Processing in Organizations


Organizations undertake data processing activities to obtain information with which to control and
support business activities such as:
i. Financial activities, e.g. cash management, credit management, investment management, capital
budgeting
ii. Accounting activities, e.g. order processing, inventory control, accounts receivable, accounts payable,
payroll, general ledger
iii. Production/operations activities, e.g. manufacturing resource planning, manufacturing execution
systems, process control
iv. Marketing activities, e.g. customer relationship management, interactive marketing, sales force
automation
v. Human resource management activities, e.g. compensation analysis, employee skills inventory,
personnel requirements forecasting

Data Processing Cycle


The Data processing cycle is a series of steps/activities carried out to extract information from raw data.
The data-processing cycle represents the chain of processing events in most data-processing applications.
It consists of data recording, transmission, reporting, storage, and retrieval in this order.
Data Recording: The original data is first recorded in a form readable by a computer. This can be
accomplished in several ways: by manually entering information into some form of computer memory
using a keyboard, by using a sensor to transfer data onto a magnetic tape or floppy disk, by filling in ovals
on a computer-readable paper form, or by swiping a credit card through a reader.
Data Transmission: After recording, the data are transmitted to a computer that performs the data
processing functions. This step may involve physically moving the recorded data to the computer or
transmitting it electronically over a network such as the Internet. Once the data reach the computer’s
memory, the computer processes them. The operations the computer performs can include accessing and
updating a database and creating or modifying existing information. Data processing can take two forms:
batch and real-time.
Data Reporting: After processing the data, the computer reports summary results to the program’s
operator.
Data Storage: As the computer processes the data, it stores both the modifications and the original data.
This storage can be both in the original data-entry form and in carefully controlled computer data forms
such as magnetic tape. Data are often stored in more than one place for both legal and practical reasons.
Computer systems can malfunction and lose all stored data, and the original data may be needed to
recreate the database as it existed before the crash.
Data Retrieval: The final step in the data-processing cycle is the retrieval of stored information at a later
time. This is usually done to access records contained in a database, to apply new data-processing
functions to the data, or, in the event that some part of the data has been lost, to recreate portions of a
database. An example of data retrieval in the data-processing cycle is the analysis of stored sales
receipts to reveal new customer spending patterns.

File Systems
A file system provides a mapping between the logical and physical views of a file, through a set of
services and an interface. Simply put, it is a system that manages and organizes all computer files, stores
them and makes them available when they are needed. The file system hides all the device-specific
aspects of file manipulation from users. Without a file system, the data kept in storage would be one
large body of data with no way to tell where one file ends and the next starts, which would make it
difficult to identify files. The basic services of a file system include:
1) Keeping track of files (knowing their locations)
2) I/O support, especially the transmission mechanism to and from main memory
3) Management of secondary storage
4) Sharing of I/O devices
5) Providing protection mechanisms for information held in the system

The way a computer organizes, names, stores and manipulates files is globally referred to as its file
system. More formally, a file system is a special-purpose database for the storage, hierarchical
organization, manipulation, navigation, access, and retrieval of data.

Most computers have at least one file system, while some allow the use of several different file systems.
For instance, on newer MS Windows computers, the older File Allocation Table (FAT) file systems of
MS-DOS and old versions of Windows are supported, in addition to the New Technology File System
(NTFS), which is the normal file system for recent versions of Windows.

Each system has its own advantages and disadvantages. Standard FAT allows file names of only up to
eight characters plus a three-character extension and does not allow spaces, while NTFS allows much
longer names that can contain spaces. You can call a file “Payroll records” in NTFS, but in FAT you
would be restricted to something like “payroll.dat” unless you were using VFAT, a FAT extension
allowing longer file names.
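
The naming restriction can be expressed as a small check. This is a simplified sketch that only tests the
length and space rules stated above; real FAT also restricts which characters may appear, which is not
modelled here. The function name is hypothetical.

```python
def is_8_3_name(name):
    """Simplified test for a standard FAT (8.3) file name:
    up to 8 characters, an optional extension of up to 3, no spaces."""
    base, dot, ext = name.partition(".")
    if " " in name or "." in ext:       # no spaces, only one dot
        return False
    if not 1 <= len(base) <= 8:         # base name: 1 to 8 characters
        return False
    if dot and not 1 <= len(ext) <= 3:  # extension: 1 to 3 characters
        return False
    return True
```

With this check, "payroll.dat" passes while "Payroll records" does not.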
The most familiar file systems make use of an underlying data storage device that offers access to an
array of fixed-size blocks, sometimes called sectors, generally a power of 2 in size (512 bytes or 1, 2, or 4
KB are most common). The file system is responsible for organizing these sectors into files and
directories, and keeping track of which sectors belong to which file and which are not being used. Most
file systems address data in fixed-sized units called "clusters" or "blocks" which contain a certain number
of disk sectors (usually 1 to 64). This is the smallest logical amount of disk space that can be allocated to
hold a file. However, file systems need not make use of a storage device at all. A file system can be used
to organize and represent access to any data, whether it be stored or dynamically generated (e.g., procfs).
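
Because space is allocated in whole clusters, a file usually occupies slightly more disk space than its
size. The sketch below works through the arithmetic; the default of 512-byte sectors and 8 sectors per
cluster is an assumption chosen for illustration.

```python
def allocation(file_size, sector_size=512, sectors_per_cluster=8):
    """Return (clusters, allocated_bytes, slack_bytes) for a file.

    Space is handed out in whole clusters, so the last cluster is
    usually only partly used; the unused part is the slack.
    """
    cluster_size = sector_size * sectors_per_cluster
    clusters = -(-file_size // cluster_size)   # ceiling division
    allocated = clusters * cluster_size
    return clusters, allocated, allocated - file_size
```

With these defaults a cluster is 4096 bytes, so a 10000-byte file needs 3 clusters (12288 bytes) and
leaves 2288 bytes unused.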

Types of File Systems
File systems can be classified into disk file systems, network file systems and special purpose file
systems.
Disk File Systems
A disk file system is a file system designed for storage of files on a data storage device, most commonly a
disk drive, which might be directly or indirectly connected to the computer. Examples of disk file systems
include FAT (FAT12, FAT16, FAT32), NTFS, HFS and HFS+, ext2, ext3, ISO 9660, ODS-5, and UDF.
Some disk file systems are journaling file systems or versioning file systems.

Flash File Systems: A flash file system is a file system designed for storing files on flash memory
devices. These are becoming more prevalent as the number of mobile devices is increasing, and the
capacity of flash memories catches up with hard drives.

Database File Systems: A new concept for file management is the concept of a database-based file
system. Instead of, or in addition to, hierarchical structured management, files are identified by their
characteristics, like type of file, topic, author, or similar metadata.

Transactional File Systems: Each disk operation may involve changes to a number of different files and
disk structures. In many cases, these changes are related, meaning that it is important that they all be
applied together as a single unit. Consider, for example, a bank sending another bank some money
electronically. The bank's computer will "send" the transfer instruction to the other bank and also update
its own records to indicate that the transfer has occurred. If for some reason the computer crashes before
it has had a chance to update its own records, then on reset there will be no record of the transfer, but
the bank will be missing some money. Transaction processing introduces the guarantee that, at any point
while it is running, a transaction can either be finished completely or reverted completely (though not
necessarily both at any given point). This means that if there is a crash or power failure, the stored state
will be consistent after recovery (either the money will be transferred or it will not be transferred, but it
will never go missing "in transit"). This type of file system is designed to be fault tolerant, but may incur
additional overhead to do so.
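
The all-or-nothing idea can be imitated at application level for a single file. The sketch below is not a
transactional file system; it is a common technique, shown here under the assumption that the temporary
file and the target are on the same file system, in which the new content is written to a temporary file
and then renamed over the original in one step. The function name atomic_write is made up.

```python
import os
import tempfile

def atomic_write(path, text):
    """Replace the content of path so that a reader sees either the
    old content or the new content, never a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())     # push the data to the device
        os.replace(tmp_path, path)     # single rename step
    except BaseException:
        if os.path.exists(tmp_path):   # revert: discard the partial file
            os.remove(tmp_path)
        raise
```

If the program fails before the rename, the original file is untouched; if it fails after, the new
version is complete.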
Network File Systems
A network file system is a file system that acts as a client for a remote file access protocol, providing
access to files on a server. Examples of network file systems include clients for the NFS, AFS and SMB
protocols, and file-system-like clients for FTP and WebDAV.

Special Purpose File Systems
A special purpose file system is basically any file system that is not a disk file system or network file
system. This includes systems where the files are arranged dynamically by software, intended for such
purposes as communication between computer processes or temporary file space. Special purpose file
systems are most commonly used by file-centric operating systems such as Unix. Examples include the
procfs (/proc) file system used by some Unix variants, which grants access to information about processes
and other operating system features.

File Systems Under Microsoft Windows


Windows makes use of the FAT and NTFS (New Technology File System) file systems. The File
Allocation Table (FAT) file system, supported by all versions of Microsoft Windows, was an evolution
of that used in Microsoft's earlier operating system (MS-DOS, which in turn was based on 86-DOS). FAT
ultimately traces its roots back to the short-lived M-DOS project and Standalone Disk BASIC before it.
Older versions of the FAT file system (FAT12 and FAT16) had file name length limits, a limit on the
number of entries in the root directory of the file system, and restrictions on the maximum size of
FAT-formatted disks or partitions. Specifically, FAT12 and FAT16 had a limit of 8 characters for the file
name and 3 characters for the extension. This is commonly referred to as the 8.3 filename limit. VFAT,
an extension to FAT12 and FAT16 introduced in Windows NT 3.5 and subsequently included in
Windows 95, allowed long file names (LFN). FAT32 also addressed many of the limits in FAT12 and
FAT16, but remains limited compared to NTFS. NTFS was introduced with the Windows NT operating
system. It supports features such as hard links, multiple file streams, attribute indexing, quota tracking,
compression, and mount points for other file systems (called "junctions").
Unlike many other operating systems, Windows uses a drive letter abstraction at the user level to
distinguish one disk or partition from another. For example, the path C:\WINDOWS represents a
directory WINDOWS on the partition represented by the letter C. The C drive is most commonly used for
the primary hard disk partition, on which Windows is usually installed and from which it boots. This
"tradition" has become so firmly ingrained that bugs came about in older versions of Windows which
made assumptions that the drive that the operating system was installed on was C. The tradition of using
"C" for the drive letter can be traced to MS-DOS, where the letters A and B were reserved for up to two
floppy disk drives.

File Management System
A file management system (FMS), also known as a file manager, is the software responsible for
creating, deleting, modifying and controlling access to files; it also manages the resources
used by files. The four major responsibilities of the FMS are:

1) Keep track of where each file is stored.
2) Use a policy that will determine where and how the files will be stored, making sure to
efficiently use the available storage space and provide efficient access to the files.
3) Allocate each file when a user has been cleared for access to it, then record its use.
4) Deallocate the file when the file is to be returned to storage, and communicate its availability
to others who may be waiting for it.

The operation of a file system and file manager can be likened to what happens in a typical
library, where the file system is the library and the file manager is the librarian who performs
the four major operations in the library.
1) A librarian uses the catalog to keep track of each item in the collection; each entry lists the
call number and the details that help patrons find the books they want.
2) The library relies on a policy to store everything in the collection, including oversized books,
magazines, books-on-tape, DVDs, maps, and videos, and the items must be physically arranged
so that people can find what they need.
3) When an item is requested, it is retrieved from its shelf and the borrower’s name is noted in
the circulation records.
4) When the item is returned, the librarian makes the appropriate notation in the circulation
records and reshelves it.

In a computer system, the file manager keeps track of its files by storing each file’s name, its
physical location in secondary storage, and other important information about it in file
directories.
The file manager’s policy determines where each file is stored and how the system and its users
will access the file, usually through commands that are independent of device details.
In addition, the policy must determine who will have access to what material, and this involves
two factors: flexibility of access to the information and its subsequent protection. The
file manager provides flexibility by allowing access to shared files, providing distributed access,
and allowing users to browse through public directories. Meanwhile, the operating system must
protect its files against system malfunctions and provide security checks via account numbers
and passwords to preserve the integrity of the data and safeguard against tampering.
The computer system allocates files by activating the appropriate secondary storage device and
loading the file into memory while updating its records of who is using what file.
Finally, the file manager deallocates files by updating the file tables and rewriting the file (if
revised) to the secondary storage device. Any processes waiting to access the file are then
notified of its availability.

User and Program interaction with the file manager


Users usually interact with the file manager by issuing commands (mostly by just clicking) such
as OPEN, DELETE, COPY, PASTE and RENAME. For example, when a user clicks to save a file
for the first time, the user is actually commanding the file manager to create a new file. These and
other commands are designed to be free of the detailed instructions required to run the device
where the files are stored, which makes the operations device independent. Therefore, to access a
file, the user does not need to know the exact physical location of the file on the disk (cylinder,
surface and sector) or even the medium type (tape, magnetic disk, optical disk or flash storage).
This matters because the details of a file access operation are somewhat complex. Each logical
command is broken down into a step-by-step sequence that triggers actions in the device.
For instance, when a user’s program issues a command to read a record from a disk, the READ
command needs to be decomposed into the following steps:
i. Move the read/write heads to the cylinder or track where the record is to be found.
ii. Wait until the sector containing the desired record passes under the read/write head (i.e.
the rotational delay).
iii. Activate the appropriate read/write head and read the record.
iv. Transfer the record to main memory.
v. Set a flag to indicate that the device is free to satisfy another request.
The file manager does all this without requiring users to include in each program the low-level
instructions for operating every device. Imagine if users needed to include instructions to operate
all the different types and models of devices in their programs.
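
The decomposition of READ into steps i to v can be illustrated with a toy model. Everything here is
hypothetical: SimulatedDisk and FileManager are made-up classes that only record the steps, and no real
device is driven.

```python
class SimulatedDisk:
    """Toy disk that records the low-level steps it is asked to perform."""

    def __init__(self, blocks):
        self.blocks = blocks       # maps (cylinder, sector) -> record
        self.busy = False
        self.log = []

    def read(self, cylinder, sector):
        self.busy = True
        self.log.append(f"seek to cylinder {cylinder}")      # step i
        self.log.append(f"wait for sector {sector}")         # step ii
        self.log.append("activate head and read record")     # step iii
        record = self.blocks[(cylinder, sector)]
        self.log.append("transfer record to main memory")    # step iv
        self.busy = False
        self.log.append("mark device free")                  # step v
        return record


class FileManager:
    """Hides the device details behind a single logical command."""

    def __init__(self, disk, directory):
        self.disk = disk
        self.directory = directory  # maps filename -> (cylinder, sector)

    def read_record(self, filename):
        cylinder, sector = self.directory[filename]
        return self.disk.read(cylinder, sector)
```

The caller only names the file; the cylinder and sector numbers never appear in the user's program.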

Volume configuration
The file manager refers to storage units as volumes, which may be removable or non-removable.
Each volume usually contains several files; such volumes are called multifile volumes. In some
special cases, a file might be extremely large and be contained in more than one volume; such a
file is referred to as a multivolume file. Each volume in the system is recognized by its name,
and the file manager writes this name and the volume’s other descriptive information in an
easy-access place, usually the first sector of the outermost track of a disk, the beginning of a
tape, or the innermost part of a CD or DVD. The following is the volume descriptive
information (for a disk drive):
i. Volume name: user allocated name
ii. Creation date: Date when the volume was created or added
iii. Pointer to directory area: indicates first sector where directory is stored
iv. Pointer to file area: indicates first sector where file is stored
v. File system code: used to detect volumes with incorrect formats

After the volume descriptor, the next thing stored in the volume is the master file directory.
The master file directory (MFD) is the mother of all directories. It stores the names and
characteristics of all files as well as subdirectories in the volume, and the remaining space is
used to store files. In the early operating systems, each volume could have only one directory,
which was created by the file manager and contained the filenames, usually arranged in
alphabetical or chronological order. This was very simple to use and maintain but had major
disadvantages:

i. It required a long time to search for an individual file, especially when the MFD was organized
in an arbitrary order.
ii. If the user had more than 256 small files stored in the volume, the directory space (with a
maximum of 256 filenames) would fill up before the disk storage space filled up. The user would
then receive a message of “disk full” when only the directory itself was full.
iii. Users could not create subdirectories to group the files that were related.
iv. Multiple users could not safeguard their files from other users because the entire directory was
freely made available to every user in the group on request.
v. Each program in the entire directory needed a unique name, so where the directory was serving
many users, only one person could save a file with a particular name.
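
Disadvantages ii and v can be reproduced with a toy single-level directory. The class below is a sketch
for illustration only; the 256-entry limit comes from the text, everything else is assumed.

```python
class DirectoryFull(Exception):
    """Raised when the directory has no free entries left."""


class FlatDirectory:
    """Toy single-level directory with a fixed number of entries."""

    def __init__(self, max_entries=256):
        self.max_entries = max_entries
        self.entries = {}                     # filename -> owner

    def create(self, filename, owner):
        if filename in self.entries:          # names are unique volume-wide
            raise FileExistsError(filename)
        if len(self.entries) >= self.max_entries:
            # reported to the user as "disk full" even if space remains
            raise DirectoryFull("disk full")
        self.entries[filename] = owner
```

A second user trying to create a file with a name already in use is refused, and the 257th file is
refused no matter how small the files are.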

Present-day file managers allow users to create individual subdirectories (folders) so that
related files can be grouped together. The subdirectories are organized as a tree with the master
file directory as the root. This allows efficient searching of individual directories because each
has fewer entries to search. The point of entry for every file request is the MFD: when a user
wants to access a file, the filename is sent to the file manager. The file manager first
searches the MFD for the subdirectory that contains the file and then searches the subdirectory
for the requested file and its location. Every file entry in every directory contains information
describing the file; this is called the file descriptor. The following information is held in the
file descriptor:

i. Filename: within a single directory, filenames must be unique; in some operating systems, the
filenames are case sensitive
ii. File type: the organization and usage, which depend on the system and the applications that use the file
iii. File size: although it could be computed from other information, the size is kept here for
convenience
iv. File location: identification of the first physical block (or all blocks) where the file is stored
v. Date and time of creation
vi. Protection information: access restrictions, based on who is allowed to access the file and what
type of access is allowed
vii. Record size: its fixed size or its maximum size, depending on the type of record
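
The descriptor fields listed above can be pictured as one small record per file. The following is only a minimal sketch: the class names, field names and sample values are hypothetical and do not describe how any particular operating system lays out its directories.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class FileDescriptor:
    filename: str        # must be unique within one directory
    file_type: str       # e.g. "doc"
    size_bytes: int      # kept for convenience
    first_block: int     # location of the first physical block
    created: datetime    # date and time of creation
    protection: dict = field(default_factory=dict)  # user -> rights, e.g. "R-E-"
    record_size: int = 0  # fixed size or maximum size


class Directory:
    """One directory: maps each filename to its descriptor."""

    def __init__(self):
        self.entries = {}

    def add(self, descriptor):
        # Enforce the rule that filenames are unique within a directory.
        if descriptor.filename in self.entries:
            raise ValueError(f"{descriptor.filename!r} already exists")
        self.entries[descriptor.filename] = descriptor

    def lookup(self, filename):
        return self.entries.get(filename)


# Hypothetical sample entry.
college = Directory()
college.add(FileDescriptor("students_payment.doc", "doc", 2048, 17,
                           datetime(2020, 1, 15, 9, 30), {"USER1": "RWED"}, 128))
print(college.lookup("students_payment.doc").first_block)
```

Because Python dictionary keys are compared exactly, this sketch behaves like a case-sensitive file system.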

File Naming Conventions


A file name usually has two or more components, depending on the file manager. The two major
components are the filename and its extension. The extension is usually two or three characters
long and is separated from the filename by a dot (.); its purpose is to indicate the file type and
the content of the file. For example, a file with a .doc extension is different from one with an .MP3
extension. Note that some extensions, such as .EXE, .BAT, .COB and .FOR, are reserved by certain operating
systems because they signal the system to use a specific compiler or program to run the file.

Conventionally, the complete name of a file is composed of the directory and subdirectories containing
the file, the filename and the extension. For example, in the Windows operating system, the
file path C:\IMFST\COLLEGE\STUDENTS_PAYMENT.DOC indicates that this file resides on
the storage device (volume) C, in the directory IMFST, in the subdirectory COLLEGE. The name
of the file is STUDENTS_PAYMENT, and the .DOC extension indicates that it can be opened by a word
processing application. In a Unix or Linux operating system, the equivalent file path is
/imfst/college/students_payment.doc. It starts with a forward slash, which indicates the root
directory, followed by the subdirectory “imfst”, then the subdirectory “college”, and finally the
file’s name, students_payment.doc (note that in Unix and Linux filenames are case sensitive and are
often written in lowercase).
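
As an illustration of the two naming conventions, the sketch below takes the example paths apart with Python's standard pathlib module. It only parses the strings; no file is opened, so the paths do not need to exist.

```python
from pathlib import PurePosixPath, PureWindowsPath

win = PureWindowsPath(r"C:\IMFST\COLLEGE\STUDENTS_PAYMENT.DOC")
nix = PurePosixPath("/imfst/college/students_payment.doc")

# Windows style: volume, directories, filename and extension.
print(win.drive, win.parent.parts, win.stem, win.suffix)

# Unix/Linux style: the path starts at the root directory "/".
print(nix.root, nix.parent.parts, nix.stem, nix.suffix)

# Windows paths compare without regard to case; Unix/Linux paths do not.
print(PureWindowsPath("a.DOC") == PureWindowsPath("A.doc"))
print(PurePosixPath("a.DOC") == PurePosixPath("A.doc"))
```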

File Organization
File organization deals with how the records in a file are arranged (how data are stored, added and
accessed), not with how the file itself is placed in a directory (folder). File organization is very
important because it determines the methods of access, the efficiency, the flexibility and the storage
devices that can be used. Records in a file can take one of two formats, and every record must use
one of them: fixed-length or variable-length.

Fixed-length record format:


In this record format, the size of every record is predetermined and is the same across all
records in the file. Because the size is fixed in advance, if the size given to a record is larger
than the number of characters stored in it, storage space is wasted; if it is too small
for the data, some characters are truncated, which causes loss of data. In the example record below,
“Samson” and “Technology” are cut short because their fields are too small:

Akinade Tope Sams NDSLT/16/324 ND II Science Laboratory Tec Male Ogun
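
The padding and truncation behaviour of fixed-length fields can be sketched as follows. The field widths are assumptions chosen for illustration; the notes do not state the widths used in the example record.

```python
def fit(value, width):
    """Force a value into a fixed-width field: truncate if too long, pad if too short."""
    return value[:width].ljust(width)


def make_record(values, widths):
    """Build one fixed-length record; its length is always sum(widths)."""
    return "".join(fit(v, w) for v, w in zip(values, widths))


# Hypothetical widths: surname 10, first name 6, other name 4.
record = make_record(["Akinade", "Tope", "Samson"], [10, 6, 4])
print(repr(record))       # "Samson" loses its last two characters
print(len(record))        # always 20, whatever the data

# Wasted space when the data is shorter than the field.
print(len(fit("Ogun", 8)) - len("Ogun"))
```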

Variable-length record format:


This format does not predetermine the size of each record; each record takes only as many characters
as it needs. Although it neither leaves empty storage space nor truncates characters,
it does not support direct access to the records stored in it, because it is difficult to compute the
exact location of a record on the storage medium.

Akinade Tope Samson NDSLT/16/324 ND II Science Laboratory Technology Male Ogun

Physical File Organization
Records in files can be arranged in one of several ways, depending on the type of storage medium
used. For example, serial organization is used for records on magnetic tape, while
sequential, direct and indexed sequential organization can be used on disk drives. There are four
common methods of file organization: serial, sequential, direct (random) and indexed
sequential.

Serial File Organization:


In this file organization method, records are stored and accessed in
chronological order (i.e. in the order in which they arrive). The records are not sorted in any other
way on the storage medium. Although this is the simplest of all file organization methods, accessing
a record is very cumbersome because the records must be read one by one until the record
being searched for is found.

Sequential File Organization:


This is the most popular of all file organization methods. Records are stored in a particular order
and are accessed or retrieved one after the other in that order. Retrieving a record requires
searching sequentially through the file, record by record, in the order in which the records were
stored. There are, however, improved methods of searching a sequential file, such as searching by
key field and binary search. With a key field, only that particular field of each record is examined
instead of the entire content, and once the key matches the field the corresponding record is
retrieved. Binary search relies on the records being sorted by a key field: the records are divided
into two halves and only the half that can contain the record is searched, and this halving is
repeated until the record is found. For example, if a file holds records 1 to 10 and record 8 is
needed, the records are divided into two, records 1 to 5 are ignored and only records 6 to 10 are
searched.
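
The halving idea in the example can be sketched as a standard binary search on the key field. The record layout (a tuple of key and data) is an assumption made for illustration.

```python
def binary_search(records, key):
    """Search records sorted by key (first tuple item).

    Returns (record or None, number of records examined).
    """
    low, high = 0, len(records) - 1
    probes = 0
    while low <= high:
        mid = (low + high) // 2
        probes += 1
        mid_key = records[mid][0]
        if mid_key == key:
            return records[mid], probes
        if mid_key < key:
            low = mid + 1    # ignore the lower half
        else:
            high = mid - 1   # ignore the upper half
    return None, probes


# Ten records with keys 1 to 10, as in the example above.
records = [(n, f"record {n}") for n in range(1, 11)]
found, probes = binary_search(records, 8)
print(found, probes)
```

Looking for record 8 first examines record 5, discards records 1 to 5, and then finds record 8 in the remaining half, so only two records are examined instead of eight.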

Direct File organization:


This is also known as random file organization. It stores records in no particular order and allows
them to be accessed in any order, without searching from the beginning of the file to the end. Records
are accessed directly using their relative addresses (logical addresses), which are
computed when the records are stored and again when they are retrieved. The principle behind
it is that the user identifies a field (or combination of fields) in the record that uniquely identifies
each record and serves as the key field. This key is transformed into a logical address by a
hashing algorithm, and the logical address is translated into a physical address (the cylinder,
surface and record numbers) by the file manager.
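
The two translation steps can be sketched as below. The notes do not say which hashing algorithm is used; division-remainder hashing is assumed here purely for illustration, and the disk geometry figures are hypothetical. Collisions (two keys hashing to the same address) are not handled in this sketch.

```python
def logical_address(key, n_slots):
    """Hash a numeric key to a relative record number (division-remainder method)."""
    return key % n_slots


def physical_address(logical, records_per_track, surfaces):
    """Translate a relative record number into (cylinder, surface, record)."""
    records_per_cylinder = records_per_track * surfaces
    cylinder, rest = divmod(logical, records_per_cylinder)
    surface, record = divmod(rest, records_per_track)
    return cylinder, surface, record


# Hypothetical file with 101 record slots, 4 records per track, 2 surfaces.
slot = logical_address(324, 101)
print(slot)
print(physical_address(slot, records_per_track=4, surfaces=2))
```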

Indexed Sequential File Organization:


This file organization method combines the best features of the sequential and direct file
organization methods. It creates an index table and uses it to locate records on the storage medium.
The result of the hashing algorithm is not used directly as the record’s address; instead, it is used
to build the index table, whose entries serve as pointers to the records. To access a record, the file
manager first searches the index table and then goes directly to the physical location indicated in
the entry.

Physical Storage Allocation


Physical storage allocation refers to how storage space is allocated to records. Records within a
file must have the same format, although they may vary in length, and they may be stored blocked
or unblocked, as illustrated below.

a. Unblocked fixed-length records:
   R1 | R2 | R3 | R4 | R5 | R6 | R7

b. Blocked fixed-length records:
   Block 1 | No. of records | R1 | R2 | R3 | Block 2 | R5 | R6 | R7

c. Unblocked variable-length records:
   R1 Length | R1 | R2 Length | R2 | R3 Length | R3 | R4 Length | R4

There are three types of storage allocation: contiguous, noncontiguous and indexed storage
allocation.

Contiguous Storage:
In contiguous storage, the records of a file are stored next to each other in one compact area. A
record can be found and read once its starting address and size are
known. Contiguous storage also makes direct access easy because every part of the file is stored in
the same compact area. The major disadvantage of contiguous storage is that a file cannot be expanded
after creation unless there is empty space available immediately after it. Hence, room for
expansion must be provided when the file is created. If that space is insufficient or becomes exhausted,
the entire file must be copied to a larger section of the disk whenever more records need to be
added. Fragmentation (slivers of unused space between files) is another disadvantage of contiguous
storage; it can be addressed by compacting and rearranging the files. The file manager keeps track of
these empty storage areas by treating them as files: they are entered in the file directory but are
flagged to differentiate them from real files.

Free space | FileA R1 | FileA R2 | FileA R3 | FileA R4 | FileB R1 | FileB R2 | FileB R3 | FileB R4 | Free space | FileC R1 | FileC R2
FileA cannot be expanded, while FileB can be expanded.

Noncontiguous Storage:
This allows a file to be stored in any available storage space on the disk. Records are stored
contiguously as long as there is enough empty space, and the remaining records are stored in other
available sections of the disk. These sections are called the extents of
the file and are linked together with pointers. The physical size of each extent is determined
by the operating system and is usually 256 bytes or another power of two. File extents are usually
linked in one of two ways: linking at the storage level and linking at the directory level.
In linking at the storage level, each extent points to the next one in the sequence on the disk. The
directory entry for the file consists of the filename, the storage location of the first extent, the
location of the last extent and the total number of extents, not counting the first one.
Figure: linking at the storage level. Of disk blocks 1 to 20, file F1 occupies blocks 2, 7, 13, 20
and 17, and each block holds a pointer to the next: 2 -> 7 -> 13 -> 20 -> 17 -> - (end).

Directory
File No.   Start   End   No. of Extents
F1         2       17    4
In linking at the directory level, each extent is
listed with its physical address, its size and a pointer to the next extent. A null pointer,
indicated by a hyphen (-), marks the last extent. Although both noncontiguous
allocation methods eliminate external storage fragmentation and the need for compaction, they do not
support direct access to file records because it is difficult to determine the exact location of a
specific record.
Figure: linking at the directory level. Of disk blocks 1 to 20, block 2 holds File1(1), block 3
File2(1), block 7 File2(2), block 9 File1(2), block 12 File2(3), block 18 File1(4) and block 19
File1(3); the other blocks are free.

Index Table
File       Address   Size   Next
           1         512
File1(1)   2         512    9
File2(1)   3         512    7
           4         512
:          :
File2(2)   7         512    12
           8         512
File1(2)   9         512    19
File2(3)   12        512    -
File1(4)   18        512    -
File1(3)   19        512    18
           20        512
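
The table above can be followed mechanically, as in this minimal sketch. The dictionary simply copies the Address and Next columns for the occupied blocks; the function name is hypothetical.

```python
# block address -> (extent label, address of next extent or None for "-")
index_table = {
    2: ("File1(1)", 9),
    3: ("File2(1)", 7),
    7: ("File2(2)", 12),
    9: ("File1(2)", 19),
    12: ("File2(3)", None),
    18: ("File1(4)", None),
    19: ("File1(3)", 18),
}


def extents(first_block, table):
    """Follow the Next pointers from the first extent to the null pointer."""
    chain = []
    block = first_block
    while block is not None:
        chain.append(block)
        block = table[block][1]
    return chain


print(extents(2, index_table))  # blocks of File1, in order
print(extents(3, index_table))  # blocks of File2, in order
```

Reaching the fourth extent of File1 requires following three pointers first, which is why this scheme suits sequential access rather than direct access.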

Usually, a file is declared to be sequential or direct at the point of creation, so that
the file manager can select the most appropriate and efficient method of storage allocation.
Contiguous allocation supports direct access, while noncontiguous allocation supports sequential access.

Index Storage:
Indexed storage allocation allows direct record access by bringing together the pointers linking
every extent of a particular file into an index block. An index block is created for every file
and contains the address of each disk block that makes up the file (i.e. the location of all extents
that make up the file). Indexed storage allocation supports both sequential and direct access, but it
does not necessarily improve the use of storage space because each file must have an index block
(which itself occupies a disk block).

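Figure: indexed storage allocation. Of disk blocks 1 to 20, File1 occupies blocks 2, 9, 19 and 18,
and File2 occupies blocks 3, 7 and 12. Block 15 is the index block of File 1 and lists 2, 9, 19, 18;
block 5 is the index block of File 2 and lists 3, 7, 12.

Index
File     Index block
File 1   15
File 2   5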

File Access Methods


The method of accessing records in a file depends on the file organization method used to
store the records. The most flexible is the indexed sequential method, which allows
both direct and sequential access, while the least flexible is the sequential organization
method, which allows only sequential access.

The file manager accesses files by using the address of the last byte read to reach the next
sequential record. This means the current byte address (CBA) must be updated every
time a record is accessed, such as when a READ command is executed. The way the CBA is
determined differs for each access method and record format.
a. Record1 | Record2 | Record3 | Record4 | Record5 | Record6
For fixed-length records, all records have the same length, i.e. RL = x for every record.

b. m | Record1 | n | Record2 | p | Record3 | q | Record4

For variable-length records, the record lengths differ, and each length (m, n, p, q) is stored
immediately before its record. The system therefore uses some bytes to store this information about
each record.

Sequential Access Method
For sequential access of fixed-length records, the CBA is updated simply by incrementing it by
the record length (RL), which is constant for fixed-length records.
CBA = CBA + RL
For sequential access of variable-length records, the file manager adds the length of the record
(RL) plus the number of bytes used to hold the record length (N, the size of the fields that hold
m, n, p and q in illustration (b) above) to the CBA.
CBA = CBA + RL + N
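
The two update rules can be traced with a short sketch. The record lengths and the 2-byte length field used below are assumed values for illustration only.

```python
def next_cba_fixed(cba, rl):
    """CBA = CBA + RL for fixed-length records."""
    return cba + rl


def next_cba_variable(cba, rl, n):
    """CBA = CBA + RL + N for variable-length records."""
    return cba + rl + n


def record_starts(lengths, n):
    """Byte address at which each variable-length record (with its length field) begins."""
    cba, starts = 0, []
    for rl in lengths:
        starts.append(cba)
        cba = next_cba_variable(cba, rl, n)
    return starts


# Three records of 20, 35 and 12 bytes, each preceded by a 2-byte length field.
print(record_starts([20, 35, 12], n=2))
# Three READs of 25-byte fixed-length records starting from address 0.
cba = 0
for _ in range(3):
    cba = next_cba_fixed(cba, 25)
print(cba)
```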
Direct Access Method
A file organized for direct access can be accessed using either the direct or the sequential
method if its records are of fixed length. If the records are of variable length, the file cannot
be accessed by the direct method, because the starting address of a record cannot be computed
directly from its record number. In that case, the records are accessed by the sequential method.

For direct access of fixed-length records, the CBA can be computed directly from the record
length (RL) and the desired record number (RN, supplied through the READ command)
minus 1.
CBA = RL * (RN - 1)
For example, if we are looking for the beginning of the fifteenth (15) record with a fixed record
length of 25 bytes, the CBA would be computed as
CBA = 25* (15 - 1) = 350
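
The same computation can be checked against an in-memory "file" of fixed-length records. The record contents (REC01, REC02, ...) are made up for this sketch.

```python
def direct_cba(rl, rn):
    """CBA = RL * (RN - 1): byte address where record number RN begins."""
    return rl * (rn - 1)


RL = 25
# Twenty 25-byte records, each padded with spaces to the fixed length.
data = b"".join(f"REC{n:02d}".encode().ljust(RL) for n in range(1, 21))

cba = direct_cba(RL, 15)
print(cba)
print(data[cba:cba + RL].rstrip())
```

Slicing the data at the computed address returns the fifteenth record without reading the fourteen records before it.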

Indexed Sequential Access Method


Records in an indexed sequential file can be accessed either sequentially or directly, so either of
the procedures to compute the CBA presented above would apply but with one extra step. The
index file must be searched for the pointer to the block where the data are stored. Because the
index file is smaller than the data file, it can be kept in main memory, and a quick search can be
performed to locate the block where the desired record is located. Then, the block can be
retrieved from secondary storage, and the beginning byte address of the record can be calculated.
In systems that support several levels of indexing to improve access to very large files, the index
at each level must be searched before the computation of the CBA can be done. The entry point
to this type of data file is usually through the index file.
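
The extra index-search step can be sketched as follows. The index layout assumed here (one entry per block, holding the highest key stored in that block) is a common arrangement but is not specified in these notes, and the keys and block numbers are invented.

```python
import bisect

# Hypothetical index kept in main memory: highest key in each block, and the block number.
index_keys = [100, 200, 300]
index_blocks = [4, 9, 17]


def find_block(key):
    """Return the block that would hold the key, or None if the key is beyond the last block."""
    i = bisect.bisect_left(index_keys, key)
    if i == len(index_keys):
        return None
    return index_blocks[i]


print(find_block(150))  # falls in the block whose highest key is 200
print(find_block(100))
print(find_block(301))
```

Once the block is known, it is read from secondary storage and the record's address within it is computed as described above.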

As explained, a file’s organization and the method used to access its records are very closely
intertwined; so when one talks about a specific type of organization, one is almost certainly
implying a specific type of access.

Access Control Verification


File sharing among many users (and programs) requires control measures to safeguard the
integrity of files. There are five possible kinds of access to a file: read only,
write only, execute only, delete only, or some combination of the four. There are several methods of
controlling these operations.

Access control matrix


An access control matrix is simple and easy to implement, but it requires more space as the number
of files and users grows: the matrix can become extremely
large, sometimes too large to store in main memory. It works well only for systems with few
files and users. Each column of the matrix (table) identifies a user and each row identifies a file.
The intersection of a row and a column contains the access rights given to that user for that file.
For example, the table below shows the access rights given to each user for each file, where R =
Read Access, W = Write Access, E = Execute Access, D = Delete Access, and a dash (-) =
Access Not Allowed. User 1 is allowed unlimited access to File 1, is allowed only to read and
execute File 4, and is denied access to the three other files.
Access Control Matrix
User 1 User 2 User 3 User 4 User 5 User 6
File 1 RWED R-E- ---- RWE- --E- RW--
File 2 ---- R-E- R-E- --E- ---- -WE-
File 3 ---- RWED ---- --E- ---- --ED
File 4 R-E- ---- ---- ---- RWED RWE-
File 5 ---- ---- ---- ---- RWED RWED
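
A lookup in the matrix above can be sketched with a dictionary of rows. The data is copied from the table; the function name is hypothetical.

```python
# One row per file; position i holds the rights of User i+1.
MATRIX = {
    "File 1": ["RWED", "R-E-", "----", "RWE-", "--E-", "RW--"],
    "File 2": ["----", "R-E-", "R-E-", "--E-", "----", "-WE-"],
    "File 3": ["----", "RWED", "----", "--E-", "----", "--ED"],
    "File 4": ["R-E-", "----", "----", "----", "RWED", "RWE-"],
    "File 5": ["----", "----", "----", "----", "RWED", "RWED"],
}


def allowed(filename, user_no, right):
    """True if the user holds the given right (one of R, W, E, D) on the file."""
    return right in MATRIX[filename][user_no - 1]


print(allowed("File 1", 1, "D"))  # User 1 has unlimited access to File 1
print(allowed("File 4", 1, "W"))  # but may only read and execute File 4
print(allowed("File 2", 1, "R"))  # and has no access to File 2
```

The matrix holds one cell for every file and user pair (30 here), including the cells that grant nothing, which is why it grows quickly.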

Access Control List


The access control list is a modification of the access control matrix and is presently the most
popular access control scheme. Each file is entered in the list together with the names of the users
who are allowed to access it and the type of access given to each user. To shorten the list,
only the names of users who are permitted to use the file are listed; those denied all access are
grouped under the global heading WORLD. In some systems, the access control list puts every user
into a category: SYSTEM/ADMIN, OWNER, GROUP or WORLD. SYSTEM/ADMIN is used
for system personnel, who have unlimited access to all files in the system. The OWNER has
absolute control over all files created in the owner’s account. An owner may create a GROUP
file so that all users belonging to the appropriate group have access to it. WORLD is
composed of all other users in the system, that is, those who do not fall into any of the other three
categories.
File Access
File1 USER1 (RWED), USER2 (R-E-), USER4 (RWE-), USER5 (--E-), WORLD (----)
File2 USER2 (RWED), USER5 (R-E-), USER6 (RWE-), WORLD (----)
File3 USER3 (RWED), USER4 (R-E-), USER6 (RWE-), WORLD (----)
File4 USER5 (RWED), WORLD (----)
File5 USER1 (RWED), USER2 (R-E-), USER3 (RWE-), WORLD (----)

Capability List
A capability list presents the access control information from a different perspective: the table
lists every user together with the files that user may access. It is a newer method that is
gaining popularity, especially in operating systems such as Linux, where it can control access to
devices as well as files.
Users   Access
User1   File1 (RWED), File2 (R-E-), File4 (RWE-), File5 (--E-)
User2   File2 (RWED), File5 (R-E-), File6 (RWE-)
User3   File3 (RWED), File4 (R-E-), File6 (RWE-)
User4   File5 (RWED)
User5   File1 (RWED), File2 (R-E-), File3 (RWE-)
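
An access control list and a capability list hold the same information, indexed by file in one case and by user in the other. The sketch below inverts a small access control list into a capability list; the two sample entries are taken from the File1 and File4 rows of the access control list shown earlier, and the function names are hypothetical.

```python
WORLD = "----"  # rights of every user not named in a file's list

acl = {
    "File1": {"USER1": "RWED", "USER2": "R-E-", "USER4": "RWE-", "USER5": "--E-"},
    "File4": {"USER5": "RWED"},
}


def rights(acl, filename, user):
    """Rights of a user on a file; unlisted users fall under WORLD."""
    return acl[filename].get(user, WORLD)


def capability_list(acl):
    """Re-index the same information by user instead of by file."""
    caps = {}
    for filename, users in acl.items():
        for user, r in users.items():
            caps.setdefault(user, {})[filename] = r
    return caps


print(rights(acl, "File4", "USER1"))
print(capability_list(acl)["USER5"])
```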

Data Compression
Algorithms for data compression fall into two categories: lossless and lossy.
Lossless Algorithm
A lossless algorithm is used for text compression. It retains all the data in the file throughout the
compression-decompression process. There are several methods, including compression of records
with repeated characters, compression of repeated terms, and front-end compression.
Records with repeated characters: Data in a database (usually in fixed-length fields) may
include entries with fewer characters than the field was designed for, with the remainder filled
with blank characters. Such a field can be replaced with a variable-length field and a special code
that indicates how many blank characters were removed. For example, the
string ADEOYE looks like this when it is stored uncompressed in a field that is 15
characters wide (where b stands for a blank character):
ADEOYEbbbbbbbbb
When it is encoded it looks like this:
ADEOYEb9
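
The encoding above can be sketched as follows, using the letter b as the stand-in for a blank exactly as in the example. This is an illustration only: a real implementation would work on actual space characters and would need an escape rule for data that itself ends in the marker followed by digits.

```python
def compress_blanks(field, blank="b"):
    """Replace trailing blanks with one blank marker and a count."""
    stripped = field.rstrip(blank)
    count = len(field) - len(stripped)
    return f"{stripped}{blank}{count}" if count else stripped


def expand_blanks(code, blank="b"):
    """Restore the trailing blanks from the marker and count."""
    head, sep, count = code.rpartition(blank)
    if sep and count.isdigit():
        return head + blank * int(count)
    return code


packed = compress_blanks("ADEOYEbbbbbbbbb")
print(packed)
print(expand_blanks(packed))
```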

Repeated terms: Frequently used terms in a database can be compressed by using symbols to
represent them. For example, in a college’s student database, words like student, course,
teacher, classroom, grade and department are common, and each could be represented by a
single character. In this case the system must be able to distinguish between
compressed and uncompressed data.

Front-end compression: In this method, each data element is compressed relative to the previous
one: the characters it shares with the start of the previous entry are replaced by a count.
For example, a student database in which the students’ names are kept in alphabetical order could
be compressed as shown in the table below. Although this method saves space, it
requires additional processing time to decompress, and the system must be able to differentiate
between compressed and uncompressed data.

Original List              Compressed List
Akinade, Mosunmola         Akinade, Mosunmola
Akinadeolu, Fatimah        7olu, Fatimah
Akinadesope, Victor        7sope, Victor
Akindamola, Timilehin      4damola, Timilehin
Akinsola, Rafiat           4sola, Rafiat
Akolawole, Titilope        2olawole, Titilope
Akorede, Titilayo          3rede, 5ayo
Etc.
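
Front-end compression can be sketched as below. This version compares each whole entry with the previous one, so it shortens surnames as in most rows of the table but does not separately compress the first name as the last row does. It also assumes that names never begin with a digit, since a leading number is read as the shared-prefix count.

```python
import os
import re


def front_compress(names):
    """Replace the prefix shared with the previous entry by its length."""
    out, prev = [], ""
    for name in names:
        n = len(os.path.commonprefix([prev, name]))
        out.append(f"{n}{name[n:]}" if n else name)
        prev = name
    return out


def front_decompress(items):
    """Rebuild each entry from the previous one and the stored count."""
    out, prev = [], ""
    for item in items:
        m = re.match(r"(\d+)(.*)", item, re.S)
        name = prev[:int(m.group(1))] + m.group(2) if m else item
        out.append(name)
        prev = name
    return out


names = ["Akinade, Mosunmola", "Akinadeolu, Fatimah", "Akinadesope, Victor",
         "Akindamola, Timilehin", "Akinsola, Rafiat", "Akolawole, Titilope"]
packed = front_compress(names)
print(packed)
print(front_decompress(packed) == names)
```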

Lossy Algorithm
In lossy compression, some data from the original file are lost to allow significant compression.
This means the compression process is irreversible, as the original file cannot be reconstructed exactly.
The specifics of the compression algorithm depend heavily on the type of file being
compressed and are usually governed by standards of the International Organization for Standardization (ISO).

Storage Media
Storage media can be categorized into two groups: sequential access media (which support only
sequential storage of and access to records in files) and direct access storage devices (which
support both sequential and direct storage and access). Their major differences are speed
and sharability.

Sequential Access Storage Media


This type of storage medium stores records serially, one after the other. The length of the records is
usually determined by the application program, and each record can be identified by its position
on the tape. To access a single record, the tape must be mounted and fast-forwarded from its
beginning until the desired position is located, which is time consuming. To explain how records
are stored on magnetic tape, consider a tape that is 1800 feet long. If it has nine tracks,
eight of them are used for data and one is used for the parity bit (the parity bit is used for routine
error checking). The number of characters that can be stored per inch is determined by
the density of the tape, for example 1600 bytes per inch (bpi).

For example, if records of 160 characters each are stored on a tape with a density
of 1600 bpi, then theoretically 10 records fit on one inch of
tape. In practice, this depends on how the records are stored. There are two ways of storing
records on tape: individually or grouped in blocks. When records are stored individually, each
record is separated from the next by a gap that marks where it starts and ends. If the
records are stored in blocks, the entire block is preceded by a gap and followed by a
gap, but the individual records are still stored sequentially within the block.

A magnetic tape moves under the read/write head only when a record needs to be read or written;
at all other times it stands still. The tape therefore
moves in jerks as it stops, reads and moves on at high speed, or stops, writes and starts again,
and so on. The tape needs time and space to stop, so
a gap is inserted between records. This is called the inter-record gap (IRG) and is about 0.5 inch
long regardless of the sizes of the records it separates. Therefore, if 10 records are stored
individually, there are nine 0.5-inch IRGs between them. Assume each
record occupies only 0.1 inch of tape. Then 5.5 inches (10 * 0.1 + 9 * 0.5) are needed to store 10
records whose data occupy only one inch, which is not an efficient use of the storage medium.

Record 1 IRG Record 2 IRG Record 3 IRG Record 4 IRG

The second method is to group records into blocks before storing them on tape. This is called blocking,
and it is performed when the file is created (the records must be deblocked accurately when they are
retrieved). The number of records in a block is usually determined by the application program,
and it is often set to take advantage of the transfer rate, which is the density of the tape
(measured in bpi) multiplied by the tape drive speed (transport speed), which is measured in
inches per second (ips).
Transfer rate (bytes per second) = density * transport speed
For example, if the transport speed is 200 ips and the tape stores 1600 bytes per inch (bpi), a total
of 320,000 bytes can be transferred in one second. Theoretically, then, the optimal size of a block is
320,000 bytes. In this method, the entire block must be read into a buffer, so the buffer must be at
least as large as the block.
The gap, which in this case is called the inter-block gap (IBG), is still 0.5 inch long, but the data from
each block of 10 records are now stored on just 1 inch of tape followed by one 0.5-inch IBG. This makes a
total of 1.5 inches to store 10 records, compared with the 5.5 inches needed by the individual method.

IBG R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 IBG R1 R2 R3 R4 R5 ….


Block of 10 records
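
The tape figures used in this section can be recomputed with a short sketch. The function names are hypothetical, and the gap counting follows the text: nine gaps between ten individual records, and one gap per block.

```python
DENSITY_BPI = 1600   # bytes per inch
SPEED_IPS = 200      # inches per second
GAP_INCHES = 0.5     # length of an IRG or IBG
RECORD_BYTES = 160


def individual_inches(n_records, record_inches, gap=GAP_INCHES):
    """Tape used by n records stored individually, with a gap between each pair."""
    return n_records * record_inches + (n_records - 1) * gap


def block_inches(records_per_block, record_inches, gap=GAP_INCHES):
    """Tape used by one block of records plus its inter-block gap."""
    return records_per_block * record_inches + gap


record_inches = RECORD_BYTES / DENSITY_BPI   # inches of tape per record
transfer_rate = DENSITY_BPI * SPEED_IPS      # bytes per second

print(record_inches)
print(individual_inches(10, record_inches))
print(block_inches(10, record_inches))
print(transfer_rate)
```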
Blocking has two distinct advantages:
i. Fewer I/O operations are needed because a single READ command can move an entire
block, the physical record that includes several logical records, into main memory.
ii. Less tape (storage space) is wasted because the size of the physical record exceeds the size
of the gap.

Blocking also has two disadvantages:
i. Overhead and software routines are needed for blocking, deblocking, and recordkeeping.
ii. Buffer space may be wasted if you need only one logical record but must read an entire
block to get it.

Access time to a block or an individual record on magnetic tape depends on where the record is
located. For example, suppose a tape is 2400 feet long and the transport speed is 200 ips. Calculate
the time taken to read the last record on the tape.
Solution:
1 foot = 12 inches, so 2,400 feet is equivalent to 28,800 inches (2400 * 12). If the drive
reads 200 inches in one second, the time needed to pass 28,800 inches is
28,800 / 200 = 144 seconds (about 2.4 minutes) to read the last record on the tape.
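
The same calculation, written as a small function so that other tape lengths and speeds can be tried. The function name is hypothetical, and it assumes the tape is read from the very beginning at a constant transport speed.

```python
def tape_access_seconds(tape_feet, speed_ips):
    """Time to reach the end of the tape: length in inches divided by inches per second."""
    inches = tape_feet * 12
    return inches / speed_ips


seconds = tape_access_seconds(2400, 200)
print(seconds, "seconds, or", seconds / 60, "minutes")
```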

Direct Access Storage Devices


Direct access storage devices (DASDs) are devices that can directly read or write to any
available space on disk. DASDs can be grouped into three categories: magnetic disks, optical
discs, and flash memory.
Magnetic disk:
The magnetic disk is the most popular direct access medium for storing data. Direct access
means that a record can be accessed without having to move through the preceding records.
Magnetic disks are of two types: fixed-head magnetic disks and movable-head magnetic disks (also
known as hard disks).

A hard disk is a stack of one or more metal platters that spin on a spindle. Each platter is coated
with iron oxide, and the entire unit is encased in a sealed chamber. A hard disk holds more data than
other direct access storage devices because it often includes several platters, stacked one on top of
another, to provide a large storage capacity. The number of read/write heads specifies the
number of surfaces that the disk uses to store data. If a hard disk drive has 12 platters
(24 surfaces), it is most likely to have 22 read/write heads, which implies that two surfaces (the top
of the top platter and the bottom of the bottom platter) are not used for data storage. Each platter
(except those at the top and bottom of the stack, which use only one) has two surfaces for recording,
and each surface is formatted with a specific number of
concentric tracks where the data are recorded. The number of tracks varies from manufacturer to
manufacturer, but typically there are a thousand or more on a high-capacity hard disk. Each track
on each surface is numbered: Track 0 identifies the outermost concentric circle on each surface, and
the highest-numbered track is nearest the center. The read-write arm carries two read-write heads
between each pair of platters, one for the surface above and one for the surface below. The
arm moves all of the heads in unison, so if one head is on Track 36, then all of the heads are on
Track 36. That is, they are all positioned on the same track but on their respective surfaces,
creating a virtual cylinder. For this reason, it is more efficient to fill the disk cylinder by
cylinder (the same track on every surface) than surface by surface, because the heads do not have
to move until the cylinder is full. When Track 0 of every surface is filled, the result is a virtual
cylinder of data, so the number of cylinders is equal to the number of tracks on a surface. To access
any portion of the disk, the system needs three things: first,
the cylinder number (i.e. the track number), so that the arm can move the read-write heads to it; second,
the surface number, so that the system can activate the appropriate read-write head; and last, the sector
number, which tells the read-write
head where to begin reading or writing.

A fixed-head magnetic disk looks like a large CD or DVD covered with magnetic film that has
been formatted, usually on both sides, into concentric circles (each circle is a track). Data are
recorded serially on each track by a fixed read/write head positioned over that track.
This makes fixed-head magnetic disks much faster than movable-head disks. The major
disadvantages of this type of disk are that it is very expensive and that it stores much less data
than a movable-head magnetic disk, because the tracks are positioned
farther apart to allow for the width of the read/write heads.

Storage capacity of a disk


The storage capacity of a disk is computed from the number of recording platters, the number of
tracks per platter, the number of sectors per track and the number of bytes per sector.
Storage capacity = No. of recording platters * No. of tracks per platter * No. of sectors per track
* No. of bytes per sector
Question: What is the storage capacity of a disk that has 24 plates, each with 3000 tracks, if each
track has 200 sectors and each sector can store 512 bytes?
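
One way to work the question is to apply the formula directly, as in this sketch. It assumes, as the formula does, that each of the 24 plates counts as one recording unit; if both surfaces of every plate were used for recording, the result would double.

```python
def storage_capacity(platters, tracks_per_platter, sectors_per_track, bytes_per_sector):
    """Capacity in bytes = platters * tracks * sectors * bytes per sector."""
    return platters * tracks_per_platter * sectors_per_track * bytes_per_sector


capacity = storage_capacity(24, 3000, 200, 512)
print(capacity, "bytes")
print(capacity / 10**9, "GB (decimal)")
```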

Magnetic disk access time

Access time on a magnetic disk is determined by three components: seek time, search
time (rotational delay) and transfer time.
Seek time is the time required to position the read-write head on the track where the data is located.
It is the slowest of the three. Seek time does not apply to fixed-head magnetic disks, since each
track has its own read-write head.
Search time (rotational delay) is the time it takes to rotate the disk until the requested record is
placed beneath the read-write head.
Transfer time is the time required to move the data from secondary storage to primary
storage. It is the fastest of the three.
