Video Telephony Quality Measurement with VTQI

Technical Paper


1. Introduction
2. Human Visual Perception
3. Assessing Video Quality with Objective Measures
   3.1. Reference vs. No-reference Methods
   3.2. Perceptual vs. Non-perceptual Input
   3.3. Technical Properties of Video That Affect Perceived Quality
   3.4. Degradations during Data Transfer
4. VTQI
   4.1. What VTQI Is Based On
   4.2. What VTQI Does Not Consider
   4.3. Output from the VTQI Algorithm
5. Future Extensions

Introduction



The high data rates of 3G networks enable video services such as video telephony and streaming. Like the voice service, video services need to be monitored to ensure that users experience them as being of adequate quality. For voice, automated quality assurance has reached a mature state, and standardized methods and tools exist for objective speech quality monitoring and for troubleshooting of the service. Video services, on the other hand, are not yet mature in this regard. Assessing the quality of video is also more difficult, because both the signal itself and the way humans perceive it are more complex. For multimedia, which combines video and audio, this complexity is compounded. The present paper deals with video telephony and describes an algorithm called VTQI, which has been developed by Ericsson for objectively judging the quality of the video telephony service.


Human Visual Perception

Visual perception by humans is a highly complex affair that involves multiple mechanisms and is influenced by expectations and prior knowledge. Quality judgments are inextricably tied to perceptual mechanisms. The perceived quality of a video transmission will therefore depend not only on its technical quality but also on other factors, such as its content. The viewer's emotional involvement is another major factor. Expectations of quality are naturally also dependent on the equipment used. People will accept lower quality when watching a mobile phone screen than they do when watching a DVD movie at home.


Assessing Video Quality with Objective Measures

The term quality of experience (QoE) has been coined to differentiate between user-perceived quality and technical quality measures relating to data transport, commonly denoted quality of service (QoS). QoE could be defined as the overall acceptability of an application or service, as perceived subjectively by the end-user. To measure QoE as such with objective methods is patently impossible, since it is dependent on the factors mentioned in chapter 2, and on many other things besides. Fortunately, it is possible to obtain a fair approximation of QoE by studying technical properties of the transferred video. What we are then measuring is not QoE itself, but aspects of video quality that are related to QoE. Chapter 3 takes a look at some of these aspects, and against that background chapter 4 sketches the workings of VTQI.

Reference vs. No-reference Methods




Some methods of objective quality assessment compare the signal presented to the end-user with the original, undistorted signal. The original then serves as a reference against which the end-user's signal is measured. If a video frame presented to the viewer is identical to the original, the highest possible score is obtained for that frame. The more the original has been distorted, the lower the score. A synchronization algorithm is required to align the two signals correctly before the comparison is made.

A no-reference method, in contrast, deals only with the received signal. Consequently, it does not measure degradation but judges the quality of the received signal on its own merits, extracting and assessing some judiciously chosen properties of the signal. Between these two extremes we find reduced-reference methods, where the quality assessment algorithm does not consider the reference as such but does receive some information about it.

Full-reference methods have the advantage of greater precision: the correlation with subjective perceived quality is normally somewhat higher than for a no-reference method. This is to be expected, since the reference method has more information to go on. No-reference methods, on the other hand, are more generally applicable: access to the original may be difficult, or its capture may be impractical. No-reference methods also do not require synchronization.
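To make the full-reference idea concrete, the following sketch computes the peak signal-to-noise ratio (PSNR) between an original frame and a received frame. PSNR is used here only as a simple, well-known full-reference measure; the paper does not name it, and it says nothing about how any particular product scores frames. The function name, the frame representation (lists of rows of grayscale values) and the sample pixel values are assumptions made for illustration. The two frames are assumed to be already synchronized.

```python
import math


def psnr(reference, received, max_value=255):
    """Return the PSNR in dB between two equally sized grayscale frames.

    Frames are lists of rows of pixel values (an assumed representation).
    Identical frames give the highest possible score (infinity).
    """
    if len(reference) != len(received) or any(
        len(ref_row) != len(rec_row)
        for ref_row, rec_row in zip(reference, received)
    ):
        raise ValueError("frames must have identical dimensions")

    squared_error = 0
    pixel_count = 0
    for ref_row, rec_row in zip(reference, received):
        for ref_px, rec_px in zip(ref_row, rec_row):
            squared_error += (ref_px - rec_px) ** 2
            pixel_count += 1

    if squared_error == 0:
        return math.inf  # no distortion relative to the reference

    mean_squared_error = squared_error / pixel_count
    return 10 * math.log10(max_value ** 2 / mean_squared_error)


# Hypothetical 2 x 2 frames: one pixel differs by 10 grey levels.
original = [[10, 20], [30, 40]]
distorted = [[10, 20], [30, 50]]
print(psnr(original, original))
print(round(psnr(original, distorted), 2))
```

A no-reference method could not be written this way, since it never sees `original`; it has to work from properties of the received signal alone.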


Perceptual vs. Non-perceptual Input

Input to quality assessment algorithms can be perceptual or non-perceptual. Perceptual input is related to what humans perceive; in the case at hand, video and audio. Non-perceptual input is data that cannot be perceived by a human, such as throughput or block errors over a radio link.

An algorithm that uses perceptual input normally extracts artifact properties such as blockiness, blurriness and jerkiness from the video images. These properties are then used to estimate perceived quality on a scale analogous to that of MOS (Mean Opinion Score), usually with reference to a model of the human visual system. Algorithms taking perceptual input are optimally suited to detect artifacts in individual video frames.

Algorithms with non-perceptual input estimate perceived quality based on parameters such as the choice of codecs and frame rate, the block error rate, and the achieved throughput. Such an algorithm will not be as versatile as an algorithm with perceptual input; rather, it must be tuned for a specific setup, say, for a specific video codec and a limited set of bit rates. Properly trained, however, the algorithm can perform excellently within its application area. It will not detect single-image artifacts, but it will report the same performance on average as an algorithm using perceptual input. Furthermore, the average performance is usually the focus of interest, as opposed to detailed information on transient phenomena. An indisputable advantage of dispensing with perceptual input is that it permits computationally more efficient implementations.
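As an illustration of what "extracting an artifact property" from perceptual input can mean, the sketch below computes a crude blockiness indicator for one grayscale frame. It compares the average luminance step across assumed 8-pixel block boundaries with the average step elsewhere in each row. This heuristic, the function name, the block size and the frame representation are all assumptions chosen for illustration; the paper does not describe any specific blockiness measure, and practical algorithms are considerably more elaborate.

```python
def blockiness(frame, block_size=8):
    """Return a crude horizontal blockiness indicator for a grayscale frame.

    The value is the mean absolute pixel step across vertical block
    boundaries minus the mean absolute step inside blocks. Values well
    above zero suggest visible block edges. Returns None if the frame is
    too narrow to contain both kinds of pixel pair.
    """
    boundary_sum = boundary_count = 0
    inner_sum = inner_count = 0
    for row in frame:
        for x in range(len(row) - 1):
            step = abs(row[x + 1] - row[x])
            if (x + 1) % block_size == 0:
                # Pixel pair straddles a block boundary.
                boundary_sum += step
                boundary_count += 1
            else:
                inner_sum += step
                inner_count += 1
    if boundary_count == 0 or inner_count == 0:
        return None
    return boundary_sum / boundary_count - inner_sum / inner_count


# Two hypothetical 16-pixel-wide rows: a hard edge at the block boundary
# versus a smooth ramp.
blocky_frame = [[0] * 8 + [50] * 8]
smooth_frame = [list(range(16))]
print(blockiness(blocky_frame))
print(blockiness(smooth_frame))
```

Note that every pixel of every frame has to be visited, which hints at why algorithms with perceptual input are computationally more demanding than those without.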




Technical Properties of Video That Affect Perceived Quality



In spite of the difficulties mentioned in chapter 2, there exist concrete properties of video footage that correlate closely with perceived video quality. The following are some key parameters relating to the process of recording and encoding the video signal:

- Codecs
- Frame rate
- Quantization
- Picture resolution

When a video is transmitted over a wireless link, the limited bandwidth imposes constraints on these parameters. Normally the video needs to be encoded with a lossy compression algorithm, which irreversibly degrades the video quality to some extent: a trade-off has to be made between frame rate, quantization, and picture resolution.
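The trade-off can be illustrated with simple arithmetic: at a fixed bit rate, a higher frame rate or a larger picture leaves fewer bits for each pixel, so quantization has to become coarser. The sketch below computes the average bit budget per pixel. The function name and the 48,000 bit/s video bit rate are assumptions made for this example; the default picture size is QCIF (176 x 144 pixels).

```python
def bits_per_pixel(video_bit_rate, frame_rate, width=176, height=144):
    """Average number of encoded bits available per pixel per frame.

    video_bit_rate is in bit/s, frame_rate in frames/s. The default
    width and height correspond to the QCIF format.
    """
    if frame_rate <= 0:
        raise ValueError("frame_rate must be positive")
    bits_per_frame = video_bit_rate / frame_rate
    return bits_per_frame / (width * height)


# Hypothetical video bit rate of 48,000 bit/s at two frame rates.
for fps in (10, 15):
    print(fps, round(bits_per_pixel(48_000, fps), 4))
```

Raising the frame rate from 10 to 15 frames per second cuts the per-pixel budget by a third, which the encoder must absorb through coarser quantization or a smaller picture.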


Degradations during Data Transfer

Difficult network conditions, such as high interference and/or bad coverage, may result in radio block errors. Such errors may in turn result in visible artifacts at the application level, i.e. in the video replay. Examples of such phenomena are blockiness in the video and corrupted audio.
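The block error rate that summarizes such errors is the fraction of received blocks that failed their integrity check. A minimal sketch follows, assuming the receiver exposes one pass/fail flag per block; the function name and the input format are hypothetical.

```python
def block_error_rate(block_ok_flags):
    """Return the fraction of blocks received in error.

    block_ok_flags is an iterable of booleans, True for a block that
    passed its integrity check and False for an erroneous block.
    """
    flags = list(block_ok_flags)
    if not flags:
        raise ValueError("at least one block is required")
    errored = sum(1 for ok in flags if not ok)
    return errored / len(flags)


# Hypothetical reporting period: one of four blocks was received in error.
print(block_error_rate([True, True, False, True]))
```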

VTQI




VTQI is short for Video Telephony Quality Index. Like SQI, the corresponding TEMS objective quality measure for voice, VTQI is a no-reference method (compare section 3.1). The main motive for not using a reference is to keep the computational complexity down. VTQI is also non-perceptual (see section 4.1 below). This approach, too, has been chosen in order to limit the computational complexity and allow frequent updating of the VTQI score (see section 4.3). The kind of subjective test which VTQI strives to imitate is one where viewers are instructed to assess both video and audio and combine their perception of each into an overall multimedia quality score. The output from the VTQI algorithm is expressed as a value between 1 and 5, conforming to the MOS (Mean Opinion Score) scale which is frequently used in subjective quality tests. The unit for VTQI is called MOS-VTQI. VTQI has been tuned for the QCIF video format (176 x 144 pixels) which is commonly used in video telephony.


What VTQI Is Based On

VTQI is based entirely on non-perceptual input (see section 3.2), essentially the following:

1. The quality of the encoded (compressed) signal prior to transmission. This quality is straightforwardly a function of the codecs used and the bit rate. Compare section 3.3. However, since the radio bearer currently used in UMTS for video telephony is always a 64 kbit/s bearer, bit rate variation is in fact not an issue. This leaves the codecs: For the H.263 and MPEG-4 video codecs, the clean quality in terms of VTQI has been computed in advance. In practice, what codec is used in the video call is deduced from the signaling between server and client. (In current implementations of VTQI in TEMS products, the video codec is assumed to be H.263, but a VTQI model for MPEG-4 also exists.) The audio codec is assumed always to be AMR-NB operating at 12.2 kbit/s.

2. BLER (block error rate). This is the most important single cause of poor quality in video telephony. Focusing on BLER means that VTQI will faithfully reflect the impact of air interface conditions on QoE.

Bit error rate (BER), on the other hand, is not reported by current WCDMA user terminals and so is not available for use in the VTQI model.

The content that has been used to tune the VTQI algorithm consists of footage typical of video telephony, including the following: close-up of person talking; slow panning in shopping mall by stationary user; street scene filmed by walking user (with phone held relatively still).
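The general shape of an estimator built on these two inputs can be sketched as follows. This is not the VTQI model: the paper publishes neither the precomputed clean-quality values nor the mapping from BLER to score. The clean-quality numbers, the exponential decay, the `sensitivity` parameter and all names below are invented placeholders, chosen only so that the score equals the codec's clean quality at zero BLER and falls toward 1 as BLER grows.

```python
import math

# Placeholder clean-quality values per video codec. The real values were
# computed in advance by the VTQI developers and are not given in the paper.
CLEAN_QUALITY = {"H.263": 3.5, "MPEG-4": 3.7}


def estimate_score(codec, bler, sensitivity=20.0):
    """Illustrative non-perceptual quality estimate on a 1-to-5 scale.

    codec selects a precomputed clean quality; bler is the block error
    rate as a fraction between 0 and 1. The exponential decay and the
    sensitivity constant are assumptions, not part of VTQI.
    """
    if not 0.0 <= bler <= 1.0:
        raise ValueError("bler must be between 0 and 1")
    clean = CLEAN_QUALITY[codec]
    return 1.0 + (clean - 1.0) * math.exp(-sensitivity * bler)


for bler in (0.0, 0.01, 0.1):
    print(bler, round(estimate_score("H.263", bler), 2))
```

The point of the sketch is the structure: a table lookup plus a cheap function of BLER, with no inspection of the video frames themselves.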


What VTQI Does Not Consider

VTQI does not use perceptual input to detect specific visible artifacts as described in section 3.2. The transferred video is not analyzed frame by frame in any way. Thanks to the monitoring of BLER (see section 4.1 above), however, even slight degradations impacting video and audio perception will still be noticed by the algorithm and affect the VTQI score.


Output from the VTQI Algorithm

A new VTQI rating (a single score for video and audio combined) is produced at intervals of 1–2 s (depending on the phone model). A scale from 1 to 5 is used, conforming to that of the Mean Opinion Score (MOS) obtained in subjective tests. Each VTQI score is a time average taken over the last 8 seconds; the first score is thus obtained 8 s into the video call. This windowing procedure prevents short block error bursts from impacting the VTQI score in a disproportionate manner.
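The windowing can be sketched as a sliding average over the most recent 8 seconds of samples. The class below is an illustration only: the class and method names, the idea of feeding it one instantaneous quality sample per reporting interval, and the sample values are assumptions, and the paper does not say what quantity is actually averaged.

```python
from collections import deque


class WindowedScore:
    """Sliding time average over the last window_seconds of samples."""

    def __init__(self, window_seconds=8.0, call_start=0.0):
        self.window_seconds = window_seconds
        self.call_start = call_start
        self._samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        """Add a sample and return the windowed average, or None if the
        call has not yet lasted a full window."""
        self._samples.append((timestamp, value))
        # Discard samples that fall outside the window ending at timestamp.
        while self._samples[0][0] <= timestamp - self.window_seconds:
            self._samples.popleft()
        if timestamp - self.call_start < self.window_seconds:
            return None
        return sum(v for _, v in self._samples) / len(self._samples)


# Hypothetical call: one sample per second, with a single bad second at t = 9.
window = WindowedScore()
for t in range(1, 8):
    window.add(t, 4.0)          # no score yet, returns None
print(window.add(8, 4.0))       # first score, 8 s into the call
print(window.add(9, 1.0))       # the burst is damped by the other samples
```

In this example the single bad sample lowers the reported score from 4.0 to 3.625 rather than to 1.0, which is the damping effect the windowing is meant to achieve.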


Future Extensions

Possible refinements of VTQI include:

- Implementation in TEMS products of a VTQI model for the MPEG-4 video codec.
- Adding further input to the algorithm:
  - Taking into account the allocation of bits in the video codec quantization (image compression) process. The quantization is continuously adapted to the amount of movement in the picture. Simply put, the more movement there is, the more bits (out of a fixed total) must be allocated to coding the movement, and the fewer bits remain for describing the basic appearance of the picture. In other words, more movement means worse picture quality (other things being equal), and information on the quantizer's current bit allocation scheme can be used by an algorithm such as VTQI as a factor in predicting quality.
  - Taking into account other video and audio codec parameters, such as frame rate.
  - Considering BLER for audio and video separately.

