Text Representation
Text is a collection of characters that can be represented in binary, which is the language that
computers use to process information.
To represent text in binary, a computer uses a character set.
A character set is a collection of all the characters and symbols that can be represented by a
computer system and the corresponding binary codes that represent them. Each
character and symbol is assigned a unique value.
In other words, a character set is a list of characters that has been defined by computer
hardware and software; this method of coding is necessary so that the computer can
interpret human-readable characters.
The most commonly used character sets are:
i. The standard ASCII character set, which assigns a unique 7-bit binary code to each
character, including uppercase and lowercase letters, digits, punctuation marks, and control
characters, e.g. the ASCII code for the uppercase letter 'A' is 01000001 (65 in denary), while
the code for the character '?' is 00111111 (63 in denary).
ii. The Extended ASCII character set, which uses 8-bit codes. This allows characters from
non-English alphabets and some graphical characters to be included.
ASCII has limitations in terms of the number of characters it can represent, and it does not
support characters from languages other than English, for example Chinese characters.
iii. Unicode, which addresses these limitations by allowing a far greater range of characters
and symbols, including those of different languages and emojis; it is therefore supported by
the many operating systems, search engines and internet browsers used globally.
Unicode uses a variable-length encoding scheme (such as UTF-8) that assigns a unique code to
each character, which can be represented in binary form using one or more bytes. (ASCII uses
1 byte, whilst Unicode uses up to 4 bytes.) E.g. the Unicode code for the heart symbol is U+2665,
which is represented in UTF-8 binary form as 11100010 10011001 10100101.
As Unicode requires more bits per character than ASCII, it can result in larger file sizes and
slower processing times when working with text-based data.
There is an overlap with the standard ASCII code, since the first 128 (English) characters are the
same, but Unicode can support many thousands of different characters in total.
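The values above can be checked directly; this is an illustrative sketch (Python is used here only for demonstration, and UTF-8 is assumed as the variable-length Unicode encoding):

```python
# Standard ASCII: each character maps to a number from 0 to 127.
print(ord('A'))                  # 65
print(format(ord('A'), '08b'))   # 01000001
print(ord('?'))                  # 63

# Unicode: code points beyond 127 need more than one byte in UTF-8.
heart = '\u2665'                 # the heart symbol, U+2665
utf8 = heart.encode('utf-8')     # three bytes
print(' '.join(format(b, '08b') for b in utf8))
# 11100010 10011001 10100101

# The first 128 Unicode code points match standard ASCII exactly,
# so 'A' still needs only a single byte in UTF-8.
print(len('A'.encode('utf-8')))  # 1
```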
Representation of Images
- Data that provides information about other data, such as the dimensions and resolution
of an image, is called metadata.
Most images use a lot more colours than black and white. Each colour has its own binary values.
Colours are created by computer screens using the Red Green Blue (RGB) colour system. This
system mixes the colours red, green and blue in different amounts to achieve each colour.
Each image has a resolution and a colour depth.
Image resolution is the number of pixels in an image, i.e. the number of pixels wide by the number
of pixels high.
Colour depth is the number of bits used to represent each colour, e.g. each colour could be
represented using an 8-bit, 16-bit or 32-bit binary number.
The greater the number of bits, the greater the range of colours that can be represented. If the
colour depth of an image is reduced, the quality of the image is often reduced.
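The relationship between colour depth and the range of colours can be shown with a short sketch: n bits give 2^n possible values per pixel (Python is used here purely for illustration):

```python
# Each pixel stores one binary number; n bits allow 2**n distinct colours.
for depth in (1, 8, 16, 24):
    print(f"{depth}-bit colour depth: {2 ** depth} colours")
# 1-bit  -> 2 colours (e.g. black and white)
# 24-bit -> 16777216 colours (8 bits each for red, green and blue)
```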
If the image resolution or the colour depth of an image is changed, this will have an effect on the size of
the image file.
If the resolution is increased, the image will be created using more pixels, so more data will need to be
stored.
If the colour depth of the image is increased, each pixel will need more data to display a greater range of
colours so more data will need to be stored. Both will result in a larger file size for the image.
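The effect of resolution and colour depth on file size can be estimated with a simple calculation (a sketch of the raw, uncompressed size; the function name is illustrative, not a standard API):

```python
# Raw image size in bits = width * height * colour depth.
# Divide by 8 to convert bits to bytes.
def image_size_bytes(width_px, height_px, colour_depth_bits):
    return width_px * height_px * colour_depth_bits // 8

# A 1920 x 1080 image at 24-bit colour depth:
print(image_size_bytes(1920, 1080, 24))   # 6220800 bytes (about 6 MB)

# Halving the colour depth halves the file size:
print(image_size_bytes(1920, 1080, 12))   # 3110400 bytes
```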
Representation of Sound
Converting sound to binary
Soundwaves are vibrations in the air. The human ear senses these vibrations and interprets them as sound.
Each sound wave has a frequency, wavelength and amplitude. The amplitude specifies the loudness of the
sound.
Sound waves vary continuously. This means that sound is analogue. Computers cannot work with
analogue data, so sound waves need to be sampled in order to be stored in a computer. Sampling means
measuring the amplitude of the sound wave. This is done using an analogue to digital converter (ADC).
i.e. when sound is recorded, this is done at set intervals, this is known as sampling.
The sample rate/sampling rate is the number of samples taken in a second. Sample rate is measured in
Hertz; 1 Hertz is equal to 1 sample per second. A common sample rate is 44.1 kHz, which requires
44,100 samples to be taken each second. That is a lot of data. If the sample rate is increased, the
amount of data required for the recording is increased. This increases the size of the file that
stores the sound.
The sample resolution is the number of bits used to represent each sample, e.g. a common sample
resolution is 16-bit.
The higher the sample resolution, the greater the variations in amplitude that can be stored for each
sample. This means that aspects such as loudness of the sound can be recorded more accurately. This will
also increase the amount of data that needs to be stored for each sample, and so the size of the
file that stores the sound.
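The combined effect of sample rate and sample resolution on file size can be estimated in the same way as for images (a sketch of the raw, uncompressed size; the function name and single-channel assumption are illustrative):

```python
# Raw sound size in bits = sample rate * sample resolution * duration
# (* number of channels). Divide by 8 to convert bits to bytes.
def sound_size_bytes(sample_rate_hz, resolution_bits, seconds, channels=1):
    return sample_rate_hz * resolution_bits * seconds * channels // 8

# One minute of mono sound at 44.1 kHz with 16-bit resolution:
print(sound_size_bytes(44_100, 16, 60))   # 5292000 bytes (about 5 MB)

# Doubling the sample rate doubles the amount of data stored:
print(sound_size_bytes(88_200, 16, 60))   # 10584000 bytes
```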
Data Compression
Compression is a method that uses an algorithm to reduce the size of a file.
Sound and image files can be very large. It is therefore often necessary to reduce (or compress)
the size of a file for the following reasons:
- to save storage space on devices such as the hard disk drive/solid state drive
- it will take less time to transmit the file from one device to another
- it will be quicker to upload or download the file
- to use less network bandwidth when transferring files across the network / internet
- reduced file size also reduces costs. For example, when using cloud storage, the cost is
based on the size of the files stored. Also, an internet service provider (ISP) may charge a
user based on the amount of data downloaded.
There are two types of compression that can be used: lossy and lossless compression.
Lossy compression
- uses a compression algorithm that finds and permanently removes unnecessary and redundant
data in the file. This means the original file cannot be reconstructed once it has been compressed.
- Mainly used on an image file or sound file.
- In an image, it may reduce the resolution and/or the bit/colour depth. Unnecessary data that
can be removed includes colours that the human eye cannot distinguish, or the number of
pixels (image resolution) used to create the image can be reduced.
- In a sound file, it may reduce the sampling rate and/or the sample resolution. Unnecessary
data in the sound file that can be removed includes sounds that cannot be heard by the human ear.
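One of the lossy techniques mentioned above, reducing colour depth, can be sketched as follows (an illustrative simplification, not a real image-compression algorithm; the function name is made up for this example):

```python
# Quantise an 8-bit colour channel value down to fewer bits.
# The discarded low-order bits are lost permanently, so the
# original value can never be recovered - this is why it is lossy.
def reduce_depth(value_8bit, new_bits=4):
    step = 2 ** (8 - new_bits)            # keep only the top new_bits bits
    return (value_8bit // step) * step

pixel = (201, 57, 130)                    # one RGB pixel
compressed = tuple(reduce_depth(c) for c in pixel)
print(compressed)                         # (192, 48, 128)
# Both 201 and 198 quantise to 192: colours the eye can barely
# distinguish are merged, and the distinction is lost for good.
```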