
© CFA Institute. For candidate use only. Not for distribution.

READING

2
Organizing, Visualizing,
and Describing Data
by Pamela Peterson Drake, PhD, CFA, and Jian Wu, PhD
Pamela Peterson Drake, PhD, CFA, is at James Madison University (USA). Jian Wu, PhD,
is at State Street (USA).

LEARNING OUTCOMES
Mastery The candidate should be able to:

a. identify and compare data types;


b. describe how data are organized for quantitative analysis;
c. interpret frequency and related distributions;
d. interpret a contingency table;
e. describe ways that data may be visualized and evaluate uses of
specific visualizations;
f. describe how to select among visualization types;
g. calculate and interpret measures of central tendency;
h. evaluate alternative definitions of mean to address an investment
problem;
i. calculate quantiles and interpret related visualizations;
j. calculate and interpret measures of dispersion;
k. calculate and interpret target downside deviation;
l. interpret skewness;
m. interpret kurtosis;
n. interpret correlation between two variables.

© 2020 CFA Institute. All rights reserved.



1 INTRODUCTION
Data have always been a key input for securities analysis and investment management,
but the acceleration in the availability and the quantity of data has also been driving
the rapid evolution of the investment industry. With the rise of big data and machine
learning techniques, investment practitioners are embracing an era featuring large
volume, high velocity, and a wide variety of data—allowing them to explore and exploit
this abundance of information for their investment strategies.
While this data-­rich environment offers potentially tremendous opportunities for
investors, turning data into useful information is not so straightforward. Organizing,
cleaning, and analyzing data are crucial to the development of successful investment
strategies; otherwise, we end up with “garbage in and garbage out” and failed invest-
ments. It is often said that 80% of an analyst’s time is spent on finding, organizing, cleaning, and analyzing data, while just 20% is taken up by model development. So, the importance of a properly organized, cleansed, and well-analyzed dataset cannot be over-emphasized. With this essential requirement met,
an appropriately executed data analysis can detect important relationships within
data, uncover underlying structures, identify outliers, and extract potentially valuable
insights. Utilizing both visual tools and quantitative methods, like the ones covered
in this reading, is the first step in summarizing and understanding data that will be
crucial inputs to an investment strategy.
This reading provides a foundation for understanding important concepts that are
an indispensable part of the analytical tool kit needed by investment practitioners,
from junior analysts to senior portfolio managers. These basic concepts pave the way
for more sophisticated tools that will be developed as the quantitative methods topic
unfolds and that are integral to gaining competencies in the investment management
techniques and asset classes that are presented later in the CFA curriculum.
Section 2 covers core data types, including continuous and discrete numerical
data, nominal and ordinal categorical data, and structured versus unstructured data.
Organizing data into arrays and data tables and summarizing data in frequency dis-
tributions and contingency tables are discussed in Sections 3–5. Section 6 introduces
the important topic of data visualization using a range of charts and graphics to
summarize, explore, and better understand data. Section 7 covers the key measures
of central tendency, including several variants of mean that are especially useful in
investments. Quantiles and their investment applications are the focus of Section
8. Key measures of dispersion are discussed in Sections 9 and 10. The shape of data
distributions—specifically, skewness and kurtosis—are covered in Section 11. Section
12 provides a graphical introduction to covariance and correlation between two vari-
ables. The reading concludes with a Summary.

2 DATA TYPES

a Identify and compare data types


Data can be defined as a collection of numbers, characters, words, and text (as well as images, audio, and video) in a raw or organized format to represent facts or information. To choose the appropriate statistical methods for summarizing and analyzing data and to select suitable charts for visualizing data, we need to distinguish among different data types. We will discuss data types from three classification perspectives: numerical versus categorical data; cross-sectional versus time-series versus panel data; and structured versus unstructured data.

2.1  Numerical versus Categorical Data


From a statistical perspective, data can be classified into two basic groups: numerical
data and categorical data.

2.1.1  Numerical Data


Numerical data are values that represent measured or counted quantities as a number
and are also called quantitative data. Numerical (quantitative) data can be split into
two types: continuous data and discrete data.
Continuous data are data that can be measured and can take on any numerical
value in a specified range of values. For example, the future value of a lump-­sum
investment measures the amount of money to be received after a certain period of
time bearing an interest rate. The future value could take on a range of values depend-
ing on the time period and interest rate. Another common example of continuous
data is the price returns of a stock that measures price change over a given period in
percentage terms.
Discrete data are numerical values that result from a counting process. So,
practically speaking, the data are limited to a finite number of values. For example,
the frequency of discrete compounding, m, counts the number of times that interest
is accrued and paid out in a given year. The frequency could be monthly (m = 12),
quarterly (m = 4), semi-­yearly (m = 2), or yearly (m = 1).
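The distinction can be made concrete with a short Python sketch (the principal, rate, and horizon below are illustrative, not from the curriculum): the compounding frequency m is discrete data, while the resulting future value is continuous.

```python
# Future value of a lump sum under discrete compounding:
#   FV = PV * (1 + r/m) ** (m * t)
# m (compounding periods per year) is discrete data; FV is continuous.
def future_value(pv, r, m, t):
    return pv * (1 + r / m) ** (m * t)

pv, r, t = 1000.0, 0.05, 2            # illustrative: $1,000 at 5% for 2 years
for m in (1, 2, 4, 12):               # yearly, semi-yearly, quarterly, monthly
    print(m, round(future_value(pv, r, m, t), 2))
```

More frequent compounding raises the future value toward the continuous-compounding limit, PV·e^(rt), while m itself can take only a finite set of counted values.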

2.1.2  Categorical Data


Categorical data (also called qualitative data) are values that describe a quality
or characteristic of a group of observations and therefore can be used as labels to
divide a dataset into groups to summarize and visualize. Usually they can take only
a limited number of values that are mutually exclusive. Examples of categorical data
for classifying companies include bankrupt vs. not bankrupt and dividends increased
vs. no dividend action.
Nominal data are categorical values that are not amenable to being organized
in a logical order. An example of nominal data is the classification of publicly
listed stocks into 11 sectors, as shown in Exhibit  1, that are defined by the Global
Industry Classification Standard (GICS). GICS, developed by Morgan Stanley Capital
International (MSCI) and Standard & Poor’s (S&P), is a four-­tiered, hierarchical indus-
try classification system consisting of 11 sectors, 24 industry groups, 69 industries,
and 158 sub-­industries. Each sector is defined by a unique text label, as shown in the
column named “Sector.”

Exhibit 1  Equity Sector Classification by GICS


Sector (Text Label)    Code (Numerical Label)

Energy 10
Materials 15
Industrials 20
Consumer Discretionary 25
Consumer Staples 30
Health Care 35
Financials 40
Information Technology 45
Communication Services 50
Utilities 55
Real Estate 60

Source: S&P Global Market Intelligence.

Text labels are a common format to represent nominal data, but nominal data
can also be coded with numerical labels. As shown in Exhibit 1, the column named “Code”
contains a corresponding GICS code of each sector as a numerical value. However,
the nominal data in numerical format do not indicate ranking, and any arithmetic
operations on nominal data are not meaningful. In this example, the energy sector
with the code 10 does not represent a lower or higher rank than the real estate sector
with the code 60. Often, financial models, such as regression models, require input
data to be numerical; so, nominal data in the input dataset must be coded numerically
before applying an algorithm (that is, a process for problem solving) for performing
the analysis. This would be mainly to identify the category (here, sector) in the model.
Ordinal data are categorical values that can be logically ordered or ranked. For
example, the Morningstar and Standard & Poor’s star ratings for investment funds
are ordinal data in which one star represents a group of funds judged to have had
relatively the worst performance, with two, three, four, and five stars representing
groups with increasingly better performance or quality as evaluated by those firms.
Ordinal data may also involve numbers to identify categories. For example, in
ranking growth-­oriented investment funds based on their five-­year cumulative returns,
we might assign the number 1 to the top performing 10% of funds, the number 2 to
next best performing 10% of funds, and so on; the number 10 represents the bottom
performing 10% of funds. Despite the fact that categories represented by ordinal
data can be ranked higher or lower compared to each other, they do not necessarily
establish a numerical difference between each category. Importantly, such investment
fund ranking tells us nothing about the difference in performance between funds
ranked 1 and 2 compared with the difference in performance between funds ranked
3 and 4 or 9 and 10.
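A minimal sketch of such an ordinal ranking, using hypothetical fund names and five-year cumulative returns (with ten funds, each decile holds exactly one fund):

```python
# Assign ordinal decile ranks by five-year cumulative return:
# rank 1 = top-performing 10%, rank 10 = bottom-performing 10%.
# Fund names and returns are hypothetical.
returns = {"Fund_A": 0.82, "Fund_B": 0.15, "Fund_C": 0.47, "Fund_D": -0.05,
           "Fund_E": 0.61, "Fund_F": 0.33, "Fund_G": 0.09, "Fund_H": 0.72,
           "Fund_I": 0.28, "Fund_J": 0.55}

ranked = sorted(returns, key=returns.get, reverse=True)   # best first
n = len(ranked)
decile = {fund: (i * 10) // n + 1 for i, fund in enumerate(ranked)}
print(decile["Fund_A"], decile["Fund_D"])   # 1 10
```

The ranks order the funds, but as noted above, the gap between ranks 1 and 2 says nothing about the gap in the underlying returns.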
Identifying data types may seem straightforward at first glance, but in some situations, categorical data coded in numerical format must be distinguished from numerical data. A sound rule of thumb: Meaningful arithmetic operations can be performed on numerical data but not on categorical data.

EXAMPLE 1 

Identifying Data Types (I)


Identify the data type for each of the following kinds of investment-­related
information:
1 Number of coupon payments for a corporate bond. As background, a
corporate bond is a contractual obligation between an issuing corporation
(i.e., borrower) and bondholders (i.e., lenders) in which the issuer agrees
to pay interest—in the form of fixed coupon payments—on specified
dates, typically semi-­annually, over the life of the bond (i.e., to its maturity
date) and to repay principal (i.e., the amount borrowed) at maturity.

2 Cash dividends per share paid by a public company. Note that cash
dividends are a distribution paid to shareholders based on the number of
shares owned.
3 Credit ratings for corporate bond issues. As background, credit ratings
gauge the bond issuer’s ability to meet the promised payments on the
bond. Bond rating agencies typically assign bond issues to discrete catego-
ries that are in descending order of credit quality (i.e., increasing probabil-
ity of non-­payment or default).
4 Hedge fund classification types. Note that hedge funds are investment
vehicles that are relatively unconstrained in their use of debt, derivatives,
and long and short investment strategies. Hedge fund classification types
group hedge funds by the kind of investment strategy they pursue.

Solution to 1
The number of coupon payments is discrete data. For example, a newly issued
5-­year corporate bond paying interest semi-­annually (quarterly) will make 10
(20) coupon payments during its life. In this case, coupon payments are limited
to a finite number of values; so, they are discrete.
Solution to 2
Cash dividends per share are continuous data since they can take on any non-­
negative values.
Solution to 3
Credit ratings are ordinal data. A rating places a bond issue in a category, and
the categories are ordered with respect to the expected probability of default.
But arithmetic operations cannot be done on credit ratings, and the difference in
the expected probability of default between categories of highly rated bonds, for
example, is not necessarily equal to that between categories of lowly rated bonds.
Solution to 4
Hedge fund classification types are nominal data. Each type groups together
hedge funds with similar investment strategies. In contrast to credit ratings for
bonds, however, hedge fund classification schemes do not involve a ranking.
Thus, such classification schemes are not ordinal data.

2.2  Cross-­Sectional versus Time-­Series versus Panel Data


Another data classification standard is based on how data are collected, and it cate-
gorizes data into three types: cross-­sectional, time series, and panel.
Prior to the description of the data types, we need to explain two data-­related
terminologies: variable and observation. A variable is a characteristic or quantity that
can be measured, counted, or categorized and is subject to change. A variable can also
be called a field, an attribute, or a feature. For example, stock price, market capital-
ization, dividend and dividend yield, earnings per share (EPS), and price-­to-­earnings
ratio (P/E) are basic data variables for the financial analysis of a public company. An
observation is the value of a specific variable collected at a point in time or over a
specified period of time. For example, last year DEF, Inc. recorded EPS of $7.50. This
value represented a 15% annual increase.

Cross-­sectional data are a list of the observations of a specific variable from


multiple observational units at a given point in time. The observational units can be
individuals, groups, companies, trading markets, regions, etc. For example, January
inflation rates (i.e., the variable) for each of the euro-­area countries (i.e., the observa-
tional units) in the European Union for a given year constitute cross-­sectional data.
Time-­series data are a sequence of observations for a single observational unit
of a specific variable collected over time and at discrete and typically equally spaced
intervals of time, such as daily, weekly, monthly, annually, or quarterly. For example,
the daily closing prices (i.e., the variable) of a particular stock recorded for a given
month constitute time-­series data.
Panel data are a mix of time-­series and cross-­sectional data that are frequently
used in financial analysis and modeling. Panel data consist of observations through
time on one or more variables for multiple observational units. The observations in
panel data are usually organized in a matrix format called a data table. Exhibit 2 is
an example of panel data showing quarterly earnings per share (i.e., the variable) for
three companies (i.e., the observational units) in a given year by quarter. Each column
is a time series of data that represents the quarterly EPS observations from Q1 to Q4
of a specific company, and each row is cross-­sectional data that represent the EPS of
all three companies of a particular quarter.

Exhibit 2  Earnings per Share in Euros of Three Eurozone Companies in a Given Year

Time Period Company A Company B Company C

Q1 13.53 0.84 −0.34
Q2 4.36 0.96 0.08
Q3 13.16 0.79 −2.72
Q4 12.95 0.19 0.09

2.3  Structured versus Unstructured Data


Categorizing data into structured and unstructured types is based on whether or not
the data are in a highly organized form.
Structured data are highly organized in a pre-­defined manner, usually with
repeating patterns. The typical forms of structured data are one-­dimensional arrays,
such as a time series of a single variable, or two-­dimensional data tables, where each
column represents a variable or an observation unit and each row contains a set of
values for the same columns. Structured data are relatively easy to enter, store, query,
and analyze without much manual processing. Typical examples of structured com-
pany financial data are:
■■ Market data: data issued by stock exchanges, such as intra-­day and daily closing
stock prices and trading volumes.
■■ Fundamental data: data contained in financial statements, such as earnings per
share, price to earnings ratio, dividend yield, and return on equity.
■■ Analytical data: data derived from analytics, such as cash flow projections or
forecasted earnings growth.
Unstructured data, in contrast, are data that do not follow any conventionally
organized forms. Some common types of unstructured data are text—such as financial
news, posts in social media, and company filings with regulators—and also audio/
video, such as managements’ earnings calls and presentations to analysts.

Unstructured data are a relatively new classification driven by the rise of alterna-
tive data (i.e., data generated from unconventional sources, like electronic devices,
social media, sensor networks, and satellites, but also by companies in the normal
course of business) and its growing adoption in the financial industry. Unstructured
data are typically alternative data as they are usually collected from unconventional
sources. By indicating the source from which the data are generated, such data can
be classified into three groups:
■■ Produced by individuals (i.e., via social media posts, web searches, etc.);
■■ Generated by business processes (i.e., via credit card transactions, corporate
regulatory filings, etc.); and
■■ Generated by sensors (i.e., via satellite imagery, foot traffic by mobile devices,
etc.).
Unstructured data may offer new market insights not normally contained in data
from traditional sources and may provide potential sources of returns for investment
processes. Unlike structured data, however, utilizing unstructured data in investment
analysis is challenging. Typically, financial models are able to take only structured data
as inputs; therefore, unstructured data must first be transformed into structured data
that models can process.
Exhibit 3 shows an excerpt from Form 10-­Q (Quarterly Report) filed by Company
XYZ with the US Securities and Exchange Commission (SEC) for the fiscal quarter
ended 31 March 20XX. The form is an unstructured mix of text and tables, so it can-
not be directly used by computers as input to financial models. The SEC has utilized
eXtensible Business Reporting Language (XBRL) to structure such data. The data
extracted from the XBRL submission can be organized into five tab-­delimited TXT
format files that contain information about the submission, including taxonomy tags
(i.e., financial statement items), dates, units of measure (uom), values (i.e., for the tag
items), and more—making it readable by computer. Exhibit 4 shows an excerpt from
one of the now structured data tables downloaded from the SEC’s EDGAR (Electronic
Data Gathering, Analysis, and Retrieval) database.

Exhibit 3  Excerpt from 10-­Q of Company XYZ for Fiscal Quarter Ended 31 March 20XX
Company XYZ
Form 10-­Q
Fiscal Quarter Ended 31 March 20XX
Table of Contents
Part I
Page
Item 1 Financial Statements 1
Item 2 Management’s Discussion and Analysis of Financial Condition and Results of Operations 21
Item 3 Quantitative and Qualitative Disclosures About Market Risk 32
Item 4 Controls and Procedures 32
Part II
Item 1 Legal Proceedings 33
Item 1A Risk Factors 33
Item 2 Unregistered Sales of Equity Securities and Use of Proceeds 43
Item 3 Defaults Upon Senior Securities 43
Item 4 Mine Safety Disclosures 43


Item 5 Other Information 43
Item 6 Exhibits 44

Condensed Consolidated Statements of Operations (Unaudited)


(in millions, except number of shares, which are reflected in thousands and per share
amounts)
31 March 20XX
Net sales:
 Products $46,565
 Services 11,450
   Total net sales 58,015

Cost of sales:
 Products 32,047
 Services 4,147
   Total cost of sales 36,194
   Gross margin 21,821

Operating expenses:
  Research and development 3,948
  Selling, general and administrative 4,458
   Total operating expenses 8,406

Operating income 13,415


Other income/(expense), net 378
Income before provision for income taxes 13,793
Provision for income taxes 2,232
Net income $11,561

Source: EDGAR.

Exhibit 4  Structured Data Extracted from Form 10-Q of Company XYZ for Fiscal Quarter Ended 31 March 20XX

adsh tag ddate uom value

0000320193-19-000066 RevenueFromContractWithCustomerExcludingAssessedTax 20XX0331 USD $58,015,000,000
0000320193-19-000066 GrossProfit 20XX0331 USD $21,821,000,000
0000320193-19-000066 OperatingExpenses 20XX0331 USD $8,406,000,000
0000320193-19-000066 OperatingIncomeLoss 20XX0331 USD $13,415,000,000
0000320193-19-000066 NetIncomeLoss 20XX0331 USD $11,561,000,000

Source: EDGAR.

EXAMPLE 2 

Identifying Data Types (II)


1 Which of the following is most likely to be structured data?
A Social media posts where consumers are commenting on what they
think of a company’s new product.
B Daily closing prices during the past month for all companies listed on
Japan’s Nikkei 225 stock index.
C Audio and video of a CFO explaining her company’s latest earnings
announcement to securities analysts.
2 Which of the following statements describing panel data is most accurate?
A It is a sequence of observations for a single observational unit of a
specific variable collected over time at discrete and equally spaced
intervals.
B It is a list of observations of a specific variable from multiple observa-
tional units at a given point in time.
C It is a mix of time-­series and cross-­sectional data that are frequently
used in financial analysis and modeling.
3 Which of the following data series is least likely to be sortable by values?
A Daily trading volumes for stocks listed on the Shanghai Stock
Exchange.
B EPS for a given year for technology companies included in the S&P
500 Index.
C Dates of first default on bond payments for a group of bankrupt
European manufacturing companies.
4 Which of the following best describes a time series?
A Daily stock prices of the XYZ stock over a 60-­month period.
B Returns on four-­star rated Morningstar investment funds at the end of
the most recent month.
C Stock prices for all stocks in the FTSE100 on 31 December of the most
recent calendar year.

Solution to 1
B is correct as daily closing prices constitute structured data. A is incorrect as
social media posts are unstructured data. C is incorrect as audio and video are
unstructured data.
Solution to 2
C is correct as it most accurately describes panel data. A is incorrect as it
describes time-­series data. B is incorrect as it describes cross-­sectional data.
Solution to 3
C is correct as dates are ordinal data that can be sorted by chronological order but
not by value. A and B are incorrect as both daily trading volumes and earnings
per share (EPS) are numerical data, so they can be sorted by values.
Solution to 4
A is correct since a time series is a sequence of observations of a specific variable
(XYZ stock price) collected over time (60 months) and at discrete intervals of
time (daily). B and C are both incorrect as they are cross-­sectional data.

2.4  Data Summarization


b Describe how data are organized for quantitative analysis
Raw data (data available in their original form as collected) come in a wide variety of formats and typically cannot be used directly by humans or computers to extract information and insights. Organizing data into a one-dimensional array or a two-dimensional array is typically the first step in data analytics and modeling. In this section, we will illustrate the construction of these typical data
organization formats. We will also introduce two useful tools that can efficiently sum-
marize one-­variable and two-­variable data: frequency distributions and contingency
tables, respectively. Both of them can give us a quick snapshot of the data and allow
us to find patterns in the data and associations between variables.

3 ORGANIZING DATA FOR QUANTITATIVE ANALYSIS

b Describe how data are organized for quantitative analysis


Quantitative analysis and modeling typically require input data to be in a clean and
formatted form, so raw data are usually not suitable for use directly by analysts.
Depending upon the number of variables, raw data can be organized into two typ-
ical formats for quantitative analysis: one-­dimensional arrays and two-­dimensional
rectangular arrays.
A one-­dimensional array is the simplest format for representing a collection of
data of the same data type, so it is suitable for representing a single variable. Exhibit 5
is an example of a one-­dimensional array that shows the closing price for the first 10
trading days for ABC Inc. stock after the company went public. Closing prices are
time-­series data collected at daily intervals, so it is natural to organize them into a
time-­ordered sequence. The time-­series format also facilitates future data updates
to the existing dataset. In this case, closing prices for future trading sessions can be
easily added to the end of the array with no alteration of previously formatted data.

More importantly, in contrast to compiling the data randomly in an unorganized manner, organizing such data by their time-series nature preserves valuable information beyond the basic descriptive statistics that summarize central tendency and spread (variation) in the data’s distribution. For example, by simply plotting the data against time,
we can learn whether the data demonstrate any increasing or decreasing trends over
time or whether the time series repeats certain patterns in a systematic way over time.

Exhibit 5  One-Dimensional Array: Daily Closing Price of ABC Inc. Stock
Observation by Day Stock Price ($)

1 57.21
2 58.26
3 58.64
4 56.19
5 54.78
6 54.26
7 56.88
8 54.74
9 52.42
10 50.14
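A one-dimensional array maps directly onto a Python list; the sketch below uses Exhibit 5's prices (the appended day-11 close is hypothetical).

```python
# Exhibit 5's daily closing prices as a one-dimensional, time-ordered array.
closing_prices = [57.21, 58.26, 58.64, 56.19, 54.78,
                  54.26, 56.88, 54.74, 52.42, 50.14]

# A new session's close is appended at the end, leaving all previously
# recorded observations untouched -- the update property noted above.
closing_prices.append(51.03)        # hypothetical day-11 close
print(len(closing_prices))          # 11
```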

A two-­dimensional rectangular array (also called a data table) is one of the most
popular forms for organizing data for processing by computers or for presenting data
visually for consumption by humans. Similar to the structure in an Excel spreadsheet,
a data table comprises columns and rows to hold multiple variables and multiple
observations, respectively. When a data table is used to organize the data of one single
observational unit (i.e., a single company), each column represents a different variable
(feature or attribute) of that observational unit, and each row holds an observation for
the different variables; successive rows represent the observations for successive time
periods. In other words, observations of each variable are a time-­series sequence that
is sorted in either ascending or descending time order. Consequently, observations
of different variables must be sorted and aligned to the same time scale. Example 3
shows how to organize a raw dataset for a company collected online into a machine-­
readable data table.

EXAMPLE 3 

Organizing a Company’s Raw Data into a Data Table


Suppose you are conducting a valuation analysis of ABC Inc., which has been
listed on the stock exchange for two years. The metrics to be used in your val-
uation include revenue, earnings per share (EPS), and dividends paid per share
(DPS). You have retrieved the last two years of ABC’s quarterly data from the
exchange’s website, which is shown in Exhibit 6. The data available online are
pre-­organized into a tabular format, where each column represents a fiscal year
and each row represents a particular quarter with values of the three measures
clustered together.

Exhibit 6  Metrics of ABC Inc. Retrieved Online


Fiscal Quarter    Year 1 (Fiscal Year)    Year 2 (Fiscal Year)

March
 Revenue $3,784(M) $4,097(M)
 EPS 1.37 −0.34
 DPS N/A N/A
June
 Revenue $4,236(M) $5,905(M)
 EPS 1.78 3.89
 DPS N/A 0.25
September
 Revenue $4,187(M) $4,997(M)
 EPS −3.38 −2.88
 DPS N/A 0.25
December
 Revenue $3,889(M) $4,389(M)
 EPS −8.66 −3.98
 DPS N/A 0.25

Use the data to construct a two-­dimensional rectangular array (i.e., data table)
with the columns representing the metrics for valuation and the observations
arranged in a time-­series sequence.
Solution:
To construct a two-­dimensional rectangular array, we first need to determine
the data table structure. The columns have been specified to represent the three
valuation metrics (i.e., variables): revenue, EPS, and DPS. The rows should be
the observations for each variable in a time ordered sequence. In this example,
the data for the valuation measures will be organized in the same quarterly
intervals as the raw data retrieved online, starting from Q1 Year 1 to Q4 Year 2.
Then, the observations from the original table can be placed accordingly into the
data table by variable name and by filing quarter. Exhibit 7 shows the raw data
reorganized in the two-­dimensional rectangular array (by date and associated
valuation metric), which can now be used in financial analysis and is readable
by a computer.
When values are missing while organizing data, how to handle them depends largely on why the data are missing. In this example, dividends (DPS) in the first five quarters are missing because ABC Inc.
did not authorize (and pay) any dividends. So, filling the dividend column with
zeros is appropriate. If revenue, EPS, and DPS of a given quarter are missing
due to particular data source issues, however, these missing values cannot be
simply replaced with zeros; this action would result in incorrect interpretation.
Instead, the missing values might be replaced with the latest available data or with
interpolated values, depending on how the data will be consumed or modeled.

Exhibit 7  Data Table for ABC Inc.

Revenue ($ Million) EPS ($) DPS ($)

Q1 Year 1 3,784 1.37 0
Q2 Year 1 4,236 1.78 0
Q3 Year 1 4,187 −3.38 0
Q4 Year 1 3,889 −8.66 0
Q1 Year 2 4,097 −0.34 0
Q2 Year 2 5,905 3.89 0.25
Q3 Year 2 4,997 −2.88 0.25
Q4 Year 2 4,389 −3.98 0.25
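The reorganization in Example 3, including the zero-fill for the missing DPS values, can be sketched without any data-frame library (pandas would be the usual tool in practice):

```python
# Exhibit 6 rows rearranged into the Exhibit 7 data table.
# DPS is None where ABC paid no dividend; since no dividend was
# authorized, replacing None with 0 is the appropriate fill here.
raw = [
    ("Q1 Year 1", 3784, 1.37, None), ("Q2 Year 1", 4236, 1.78, None),
    ("Q3 Year 1", 4187, -3.38, None), ("Q4 Year 1", 3889, -8.66, None),
    ("Q1 Year 2", 4097, -0.34, None), ("Q2 Year 2", 5905, 3.89, 0.25),
    ("Q3 Year 2", 4997, -2.88, 0.25), ("Q4 Year 2", 4389, -3.98, 0.25),
]
table = [(q, rev, eps, 0.0 if dps is None else dps)
         for q, rev, eps, dps in raw]
print(table[0])   # ('Q1 Year 1', 3784, 1.37, 0.0)
```

If the values were instead missing because of a data-source issue, the same code would be wrong: carrying forward the latest available value or interpolating would be more appropriate fills.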

4 SUMMARIZING DATA USING FREQUENCY DISTRIBUTIONS
c Interpret frequency and related distributions
We now discuss various tabular formats for describing data based on the count of
observations. These tables are a necessary step toward building a true visualization of
a dataset. Later, we shall see how bar charts, tree-­maps, and heat maps, among other
graphic tools, are used to visualize important properties of a dataset.
A frequency distribution (also called a one-­way table) is a tabular display of
data constructed either by counting the observations of a variable by distinct values
or groups or by tallying the values of a numerical variable into a set of numerically
ordered bins. It is an important tool for initially summarizing data by groups or bins
for easier interpretation.
Constructing a frequency distribution of a categorical variable is relatively
straightforward and can be stated in the following two basic steps:
1 Count the number of observations for each unique value of the variable.
2 Construct a table listing each unique value and the corresponding counts, and
then sort the records by number of counts in descending or ascending order to
facilitate the display.
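These two steps can be sketched in a few lines of Python (the sector labels below are a hypothetical miniature, not the 479-stock portfolio discussed next):

```python
from collections import Counter

# Hypothetical sector labels, one per stock in a small portfolio.
sectors = ["Industrials", "Financials", "Industrials", "Energy",
           "Financials", "Industrials"]

# Step 1: count the observations for each unique value.
counts = Counter(sectors)

# Step 2: list each unique value with its count, sorted by count
# in descending order to facilitate the display.
freq_table = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(freq_table)  # [('Industrials', 3), ('Financials', 2), ('Energy', 1)]
```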
Exhibit 8 shows a frequency distribution of a portfolio's stock holdings by sectors
(the variables), which are defined by GICS. The portfolio contains a total of 479 stocks
that have been individually classified into 11 GICS sectors (first column). The stocks
are counted by sector and are summarized in the second column, absolute frequency.
The absolute frequency, or simply the raw frequency, is the actual number of
observations counted for each unique value of the variable (i.e., each sector). Often it is
desirable to express the frequencies in terms of percentages, so we also show the
relative frequency (in the third column), which is calculated as the absolute frequency
of each unique value of the variable divided by the total number of observations. The
relative frequency provides a normalized measure of the distribution of the data,
allowing comparisons between datasets with different numbers of total observations.

Exhibit 8  Frequency Distribution for a Portfolio by Sector

Sector (Variable)         Absolute Frequency   Relative Frequency
Industrials                       73                 15.2%
Information Technology            69                 14.4%
Financials                        67                 14.0%
Consumer Discretionary            62                 12.9%
Health Care                       54                 11.3%
Consumer Staples                  33                  6.9%
Real Estate                       30                  6.3%
Energy                            29                  6.1%
Utilities                         26                  5.4%
Materials                         26                  5.4%
Communication Services            10                  2.1%
Total                            479                100.0%

A frequency distribution table provides a snapshot of the data, and it facilitates
finding patterns. Examining the distribution of absolute frequency in Exhibit 8, we
see that the largest number of stocks (73), accounting for 15.2% of the stocks in the
portfolio, are held in companies in the industrials sector. The sector with the least
number of stocks (10) is communication services, which represents just 2.1% of the
stocks in the portfolio.
It is also easy to see that the top four sectors (i.e., industrials, information
technology, financials, and consumer discretionary) have very similar relative frequencies,
between 15.2% and 12.9%. Similar relative frequencies, between 6.9% and 5.4%, are also
seen among several other sectors. Note that the absolute frequencies add up to the
total number of stocks in the portfolio (479), and the sum of the relative frequencies
should be equal to 100%.
Frequency distributions also help in the analysis of large amounts of numerical
data. The procedure for summarizing numerical data is a bit more involved than that
for summarizing categorical data because it requires creating non-overlapping bins
(also called intervals or buckets) and then counting the observations falling into each
bin. One procedure for constructing a frequency distribution for numerical data can
be stated as follows:
1 Sort the data in ascending order.
2 Calculate the range of the data, defined as Range = Maximum value −
Minimum value.
3 Decide on the number of bins (k) in the frequency distribution.
4 Determine bin width as Range/k.
5 Determine the first bin by adding the bin width to the minimum value. Then,
determine the remaining bins by successively adding the bin width to the prior
bin’s end point and stopping after reaching a bin that includes the maximum
value.

6 Determine the number of observations falling into each bin by counting the
number of observations whose values are equal to or exceed the bin minimum
value yet are less than the bin’s maximum value. The exception is in the last bin,
where the maximum value is equal to the last bin’s maximum, and therefore, the
observation with the maximum value is included in this bin’s count.
7 Construct a table of the bins listed from smallest to largest that shows the num-
ber of observations falling into each bin.
In Step 4, when rounding the bin width, round up (rather than down) to ensure
that the final bin includes the maximum value of the data.
These seven steps are basic guidelines for constructing frequency distributions. In
practice, however, we may want to refine the above basic procedure. For example, we
may want the bins to begin and end with whole numbers for ease of interpretation.
Another practical refinement that promotes interpretation is to start the first bin at
the nearest whole number below the minimum value.
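The seven steps can be sketched as a small Python helper (the function name is ours; for simplicity this sketch divides the range exactly, without the rounding and whole-number refinements just described):

```python
def frequency_distribution(data, k):
    """Simplified sketch of the seven-step binning procedure."""
    xs = sorted(data)                                  # Step 1
    width = (xs[-1] - xs[0]) / k                       # Steps 2-4
    edges = [xs[0] + i * width for i in range(k + 1)]  # Step 5
    counts = [0] * k
    for x in xs:                                       # Step 6
        # The last bin is closed on the right so the maximum is included.
        i = min(int((x - xs[0]) / width), k - 1)
        counts[i] += 1
    return edges, counts                               # Step 7

# The 12 observations used in the worked example below.
obs = [-4.57, -4.04, -1.64, 0.28, 1.34, 2.35, 2.38,
       4.28, 4.42, 4.68, 7.16, 11.43]
edges, counts = frequency_distribution(obs, k=4)
print(counts)  # [3, 4, 4, 1]
```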
As this procedure implies, a frequency distribution groups data into a set of bins,
where each bin is defined by a unique set of values (i.e., beginning and ending points).
Each observation falls into only one bin, and the total number of bins covers all the
values represented in the data. The frequency distribution is the list of the bins together
with the corresponding measures of frequency.
To illustrate the basic procedure, suppose we have 12 observations sorted in
ascending order (Step 1):
−4.57, −4.04, −1.64, 0.28, 1.34, 2.35, 2.38, 4.28, 4.42, 4.68, 7.16, and 11.43.
The minimum observation is −4.57, and the maximum observation is +11.43. So,
the range is +11.43 − (−4.57) = 16 (Step 2).
If we set k = 4 (Step 3), then the bin width is 16/4 = 4 (Step 4).
Exhibit 9 shows the repeated addition of the bin width of 4 to determine the
endpoint for each of the bins (Step 5).

Exhibit 9  Determining Endpoints of the Bins


−4.57 + 4.0 = −0.57
−0.57 + 4.0 = 3.43
3.43 + 4.0 = 7.43
7.43 + 4.0 = 11.43

Thus, the bins are [−4.57 to −0.57), [−0.57 to 3.43), [3.43 to 7.43), and [7.43 to
11.43], where the notation [−4.57 to −0.57) indicates −4.57 ≤ observation < −0.57. The
parentheses indicate that the endpoints are not included in the bins, and the square
brackets indicate that the beginning points and the last endpoint are included in the
bin. Exhibit 10 summarizes Steps 5 through 7.

Exhibit 10  Frequency Distribution

Bin                                   Absolute Frequency
A  −4.57 ≤ observation < −0.57               3
B  −0.57 ≤ observation <  3.43               4
C   3.43 ≤ observation <  7.43               4
D   7.43 ≤ observation ≤ 11.43               1

Note that the bins do not overlap, so each observation can be placed uniquely into
one bin, and the last bin includes the maximum value.
We turn to these issues in discussing the construction of frequency distributions
for daily returns of the fictitious Euro-Asia-Africa (EAA) Equity Index. The dataset
of daily returns of the EAA Equity Index spans a five-year period and consists of
1,258 observations with a minimum value of −4.1% and a maximum value of 5.0%.
Thus, the range of the data is 5% − (−4.1%) = 9.1%, approximately. (The mean daily
return—mean as a measure of central tendency will be discussed shortly—is 0.04%.)
The decision on the number of bins (k) into which we should group the
observations often involves inspecting the data and exercising judgment. How much detail
should we include? If we use too few bins, we will summarize too much and may lose
pertinent characteristics. Conversely, if we use too many bins, we may not summarize
enough and may introduce unnecessary noise.
We can establish an appropriate value for k by evaluating the usefulness of the
resulting bin width. A large number of empty bins may indicate that we are attempting
to over-organize the data to present too much detail. Starting with a relatively small
bin width, we can see whether or not the bins are mostly empty and whether or not
the value of k associated with that bin width is too large. If the bins are mostly empty,
implying that k is too large, we can consider increasingly larger bins (i.e., smaller values
of k) until we have a frequency distribution that effectively summarizes the distribution.
Suppose that for ease of interpretation we want to use a bin width stated in whole
rather than fractional percentages. In the case of the daily EAA Equity Index returns,
a 1% bin width would be associated with 9.1/1 = 9.1 bins, which can be rounded up to
k = 10 bins. That number of bins will cover a range of 1% × 10 = 10%. By constructing
the frequency distribution in this manner, we will also have bins that end and begin
at a value of 0%, thereby allowing us to count the negative and positive returns in the
data. Without too much work, we have found an effective way to summarize the data.
Exhibit 11 shows the frequency distribution for the daily returns of the EAA Equity
Index using return bins of 1%, where the first bin includes returns from −5.0% to −4.0%
(exclusive, meaning < −4%) and the last bin includes daily returns from 4.0% to 5.0%
(inclusive, meaning ≤ 5%). Note that to facilitate interpretation, the first bin starts at
the nearest whole number below the minimum value (so, at −5.0%).
Exhibit 11 includes two other useful ways to present the data (which can be
computed in a straightforward manner once we have established the absolute and relative
frequency distributions): the cumulative absolute frequency and the cumulative relative
frequency. The cumulative absolute frequency cumulates (meaning, adds up) the
absolute frequencies as we move from the first bin to the last bin. Similarly, the
cumulative relative frequency is a sequence of partial sums of the relative frequencies. For
the last bin, the cumulative absolute frequency will equal the number of observations
in the dataset (1,258), and the cumulative relative frequency will equal 100%.

Exhibit 11  Frequency Distribution for Daily Returns of EAA Equity Index

Return          Absolute    Relative        Cumulative Absolute   Cumulative Relative
Bin             Frequency   Frequency (%)   Frequency             Frequency (%)
−5.0 to −4.0        1           0.08              1                     0.08
−4.0 to −3.0        7           0.56              8                     0.64
−3.0 to −2.0       23           1.83             31                     2.46
−2.0 to −1.0       77           6.12            108                     8.59
−1.0 to 0.0       470          37.36            578                    45.95
0.0 to 1.0        555          44.12          1,133                    90.06
1.0 to 2.0        110           8.74          1,243                    98.81
2.0 to 3.0         13           1.03          1,256                    99.84
3.0 to 4.0          1           0.08          1,257                    99.92
4.0 to 5.0          1           0.08          1,258                   100.00

As Exhibit 11 shows, the absolute frequencies vary widely, ranging from 1 to
555. The bin encompassing returns between 0% and 1% has the most observations
(555), and the corresponding relative frequency tells us these observations account
for 44.12% of the total number of observations. The frequency distribution gives us
a sense of not only where most of the observations lie but also whether the
distribution is evenly spread. It is easy to see that the vast majority of observations (37.36%
+ 44.12% = 81.48%) lie in the middle two bins spanning −1% to 1%. We can also see
that not many observations are greater than 3% or less than −4%. Moreover, as there
are bins with 0% as ending or beginning points, we are able to count positive and
negative returns in the data. Looking at the cumulative relative frequency in the last
column, we see that the bin of −1% to 0% shows a cumulative relative frequency of
45.95%. This indicates that 45.95% of the observations lie below the daily return of
0% and that 54.05% of the observations are positive daily returns.
It is worth noting that, in addition to being summarized in tables, frequency
distributions can also be effectively represented in visuals, which will be discussed shortly
in the section on data visualization.
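The cumulative columns of Exhibit 11 can be reproduced with a few lines of Python (the per-bin counts are copied from the table):

```python
# Per-bin absolute frequencies from Exhibit 11, first bin to last.
counts = [1, 7, 23, 77, 470, 555, 110, 13, 1, 1]
total = sum(counts)  # 1,258 observations

# Cumulative absolute frequency: running (partial) sums of the counts.
cum_abs, running = [], 0
for c in counts:
    running += c
    cum_abs.append(running)

# Cumulative relative frequency, in percent.
cum_rel = [round(100 * c / total, 2) for c in cum_abs]

print(cum_abs[-1])  # 1258
print(cum_rel[4])   # 45.95, the share of observations below a 0% return
```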

EXAMPLE 4

Constructing a Frequency Distribution of Country Index Returns
Suppose we have the annual equity index returns of a given year for 18 different
countries, as shown in Exhibit 12, and we are asked to summarize the data.

Exhibit 12  Annual Equity Index Returns for 18 Countries

             Market Index Return (%)
Country A            7.7
Country B            8.5
Country C            9.1
Country D            5.5
Country E            7.1
Country F            9.9
Country G            6.2
Country H            6.8
Country I            7.5
Country J            8.9
Country K            7.4
Country L            8.6
Country M            9.6
Country N            7.7
Country O            6.8
Country P            6.1
Country Q            8.8
Country R            7.9

Construct a frequency distribution table from these data and state some key
findings from the summarized data.
Solution:
The first step in constructing a frequency distribution table is to sort the return
data in ascending order:
Market Index Return (%)

Country D 5.5
Country P 6.1
Country G 6.2
Country H 6.8
Country O 6.8
Country E 7.1
Country K 7.4
Country I 7.5
Country A 7.7
Country N 7.7
Country R 7.9
Country B 8.5
Country L 8.6
Country Q 8.8
Country J 8.9
Country C 9.1
Country M 9.6
Country F 9.9

The second step is to calculate the range of the data, which is 9.9% − 5.5% = 4.4%.
The third step is to decide on the number of bins. Here, we will use k = 5.
The fourth step is to determine the bin width. Here, it is 4.4%/5 = 0.88%, which
we will round up to 1.0%.
The fifth step is to determine the bins, which are as follows:
5.0% + 1.0% = 6.0%
6.0% + 1.0% = 7.0%
7.0% + 1.0% = 8.0%
8.0% + 1.0% = 9.0%
9.0% + 1.0% = 10.0%
For ease of interpretation, the first bin is set to begin with the nearest whole
number (5.0%) below the minimum value (5.5%) of the data series.
The sixth step requires counting the return observations falling into each bin,
and the seventh (last) step is to use these results to construct the final frequency
distribution table.
Exhibit 13 presents the frequency distribution table, which summarizes the data
in Exhibit 12 into five bins spanning 5% to 10%. Note that with 18 countries, the
relative frequency for one observation is calculated as 1/18 = 5.56%.

Exhibit 13  Frequency Distribution of Equity Index Returns

Return         Absolute    Relative        Cumulative Absolute   Cumulative Relative
Bin (%)        Frequency   Frequency (%)   Frequency             Frequency (%)
5.0 to 6.0         1           5.56              1                     5.56
6.0 to 7.0         4          22.22              5                    27.78
7.0 to 8.0         6          33.33             11                    61.11
8.0 to 9.0         4          22.22             15                    83.33
9.0 to 10.0        3          16.67             18                   100.00

As Exhibit 13 shows, there is substantial variation in these equity index
returns. One-third of the observations fall in the 7.0 to 8.0% bin, making it the
bin with the most observations. The 6.0 to 7.0% bin and the 8.0 to 9.0% bin
each hold four observations, with each bin accounting for 22.22% of the total
number of observations. The two remaining bins hold fewer observations: one
and three observations, respectively.

5 SUMMARIZING DATA USING A CONTINGENCY TABLE
d Interpret a contingency table
We have shown that the frequency distribution table is a powerful tool to summarize
data for one variable. How can we summarize data for two variables simultaneously?
A contingency table provides a solution to this question.
A contingency table is a tabular format that displays the frequency distributions
of two or more categorical variables simultaneously and is used for finding patterns
between the variables. A contingency table for two categorical variables is also known
as a two-way table. Contingency tables are constructed by listing all the levels (i.e.,
categories) of one variable as rows and all the levels of the other variable as columns
in the table. A contingency table having R levels of one variable in rows and C levels
of the other variable in columns is referred to as an R × C table. Note that each
variable in a contingency table must have a finite number of levels, which can be either
ordered (ordinal data) or unordered (nominal data). Importantly, the data displayed
in the cells of the contingency table can be either a frequency (count) or a relative
frequency (percentage) based on either overall total, row totals, or column totals.
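As a sketch, a contingency table can be produced directly from record-level data with pandas (the five-stock sample below is hypothetical, not the 1,000-stock portfolio discussed next):

```python
import pandas as pd

# Hypothetical record-level data: one row per stock, with its sector
# and market-cap level.
stocks = pd.DataFrame({
    "sector": ["Energy", "Energy", "Utilities", "Energy", "Utilities"],
    "cap": ["Small", "Mid", "Small", "Small", "Mid"],
})

# crosstab counts the joint frequencies; margins=True appends the
# marginal frequencies (labeled "All") for rows and columns.
table = pd.crosstab(stocks["sector"], stocks["cap"], margins=True)
print(table)
```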

Exhibit 14 presents a 5 × 3 contingency table that summarizes the number of
stocks (i.e., frequency) in a particular portfolio of 1,000 stocks by two variables,
sector and company market capitalization. Sector has five levels, with each one being a
GICS-defined sector. Market capitalization (commonly referred to as "market cap") is
defined for a company as the number of shares outstanding times the price per share.
The stocks in this portfolio are categorized by three levels of market capitalization:
large cap, more than $10 billion; mid cap, between $2 billion and $10 billion; and
small cap, less than $2 billion.

Exhibit 14  Portfolio Frequencies by Sector and Market Capitalization

                              Market Capitalization Variable (3 Levels)
Sector Variable (5 Levels)     Small    Mid    Large    Total
Communication Services           55      35      20       110
Consumer Staples                 50      30      30       110
Energy                          175      95      20       290
Health Care                     275     105      55       435
Utilities                        20      25      10        55
Total                           575     290     135     1,000

The entries in the cells of the contingency table show the number of stocks of each
sector with a given level of market cap. For example, there are 275 small-cap health
care stocks, making it the portfolio's largest subgroup in terms of frequency. These
data are also called joint frequencies because you are joining one variable from the
row (i.e., sector) and the other variable from the column (i.e., market cap) to count
observations. The joint frequencies are then added across rows and across columns,
and these corresponding sums are called marginal frequencies. For example, the
marginal frequency of health care stocks in the portfolio is the sum of the joint
frequencies across all three levels of market cap, so 435 (= 275 + 105 + 55). Similarly,
adding the joint frequencies of small-cap stocks across all five sectors gives the marginal
frequency of small-cap stocks of 575 (= 55 + 50 + 175 + 275 + 20).
Clearly, health care stocks and small-cap stocks have the largest marginal
frequencies among sector and market cap, respectively, in this portfolio. Note the marginal
frequencies represent the frequency distribution for each variable. Finally, the marginal
frequencies for each variable must sum to the total number of stocks (overall total)
in the portfolio—here, 1,000 (shown in the lower right cell).
Similar to the one-way frequency distribution table, we can express frequency in
percentage terms as relative frequency by using one of three options. We can divide
the joint frequencies by: a) the total count; b) the marginal frequency on a row; or c)
the marginal frequency on a column.
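These three options can be sketched as follows (the 2 × 2 joint-frequency table is hypothetical and kept small for clarity):

```python
import pandas as pd

# Hypothetical joint frequencies (rows: sectors, columns: cap levels).
joint = pd.DataFrame({"Small": [2, 1], "Mid": [1, 1]},
                     index=["Energy", "Utilities"])

grand_total = joint.to_numpy().sum()

by_total = joint / grand_total                  # option (a): overall total
by_row = joint.div(joint.sum(axis=1), axis=0)   # option (b): row marginals
by_col = joint.div(joint.sum(axis=0), axis=1)   # option (c): column marginals

print(by_total.loc["Energy", "Small"])  # 0.4 (= 2/5 of all observations)
```

Each choice answers a different question: option (a) gives each cell's share of the whole dataset, while (b) and (c) give its share within a row or column category.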
Exhibit 15 shows the contingency table using relative frequencies based on total
count. It is readily apparent that small-cap health care and energy stocks comprise the
largest portions of the total portfolio, at 27.5% (= 275/1,000) and 17.5% (= 175/1,000),
respectively, followed by mid-cap health care and energy stocks, at 10.5% and 9.5%,
respectively. Together, these two sectors make up nearly three-quarters of the portfolio
(43.5% + 29.0% = 72.5%).

Exhibit 15  Relative Frequencies as Percentage of Total

                              Market Capitalization Variable (3 Levels)
Sector Variable (5 Levels)     Small     Mid     Large     Total
Communication Services          5.5%     3.5%     2.0%     11.0%
Consumer Staples                5.0%     3.0%     3.0%     11.0%
Energy                         17.5%     9.5%     2.0%     29.0%
Health Care                    27.5%    10.5%     5.5%     43.5%
Utilities                       2.0%     2.5%     1.0%      5.5%
Total                          57.5%    29.0%    13.5%    100%

Exhibit 16 shows relative frequencies based on marginal frequencies of market
cap (i.e., columns). From this perspective, it is clear that the health care and energy
sectors dominate the other sectors at each level of market capitalization: 78.3% (=
275/575 + 175/575), 69.0% (= 105/290 + 95/290), and 55.6% (= 55/135 + 20/135), for
small, mid, and large caps, respectively. Note that there may be a small rounding error
difference between these results and the numbers shown in Exhibit 15.

Exhibit 16  Relative Frequencies: Sector as Percentage of Market Cap

                              Market Capitalization Variable (3 Levels)
Sector Variable (5 Levels)     Small     Mid     Large     Total
Communication Services          9.6%    12.1%    14.8%     11.0%
Consumer Staples                8.7%    10.3%    22.2%     11.0%
Energy                         30.4%    32.8%    14.8%     29.0%
Health Care                    47.8%    36.2%    40.7%     43.5%
Utilities                       3.5%     8.6%     7.4%      5.5%
Total                         100.0%   100.0%   100.0%    100.0%

In conclusion, the findings from these contingency tables using frequencies and
relative frequencies indicate that in terms of the number of stocks, the portfolio can
be generally described as a small- to mid-cap-oriented health care and energy sector
portfolio that also includes stocks of several other defensive sectors.
As an analytical tool, contingency tables can be used in different applications. One
application is for evaluating the performance of a classification model (in this case,
the contingency table is called a confusion matrix). Suppose we have a model for
classifying companies into two groups: those that default on their bond payments and
those that do not default. The confusion matrix for displaying the model's results will
be a 2 × 2 table showing the frequency of actual defaults versus the model's predicted
frequency of defaults. Exhibit 17 shows such a confusion matrix for a sample of 2,000
non-investment-grade bonds. Using company characteristics and other inputs, the
model correctly predicts 300 cases of bond defaults and 1,650 cases of no defaults.

Exhibit 17  Confusion Matrix for Bond Default Prediction Model

Predicted          Actual Default
Default          Yes        No      Total
Yes              300        40        340
No                10     1,650      1,660
Total            310     1,690      2,000

We can also observe that this classification model incorrectly predicts default in
40 cases where no default actually occurred and also incorrectly predicts no default
in 10 cases where default actually did occur. Later in the CFA Program curriculum
you will learn how to construct a confusion matrix, how to calculate related model
performance metrics, and how to use them to evaluate and tune a classification model.
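As a preview, a few of those standard metrics (the definitions below are conventional ones supplied here for illustration, not derived in this reading) follow directly from the counts in Exhibit 17:

```python
# Counts from Exhibit 17's confusion matrix.
tp, fp = 300, 40    # predicted default: actually defaulted / did not
fn, tn = 10, 1650   # predicted no default: actually defaulted / did not

total = tp + fp + fn + tn            # 2,000 bonds
accuracy = (tp + tn) / total         # share of all predictions that are correct
precision = tp / (tp + fp)           # of predicted defaults, share correct
recall = tp / (tp + fn)              # of actual defaults, share caught

print(accuracy)                      # 0.975
print(round(precision, 3), round(recall, 3))
```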
Another application of contingency tables is to investigate a potential association
between two categorical variables. For example, revisiting Exhibit 14, one may ask
whether the distribution of stocks by sector is independent of the level of market
capitalization. Given the dominance of small-cap and mid-cap health care and energy
stocks, the answer is likely no.
One way to test for a potential association between categorical variables is to
perform a chi-square test of independence. Essentially, the procedure involves using
the marginal frequencies in the contingency table to construct a table with expected
values of the observations. The actual values and expected values are used to derive
the chi-square test statistic. This test statistic is then compared to a value from the
chi-square distribution for a given level of significance. If the test statistic is greater
than the chi-square distribution value, then there is evidence to reject the claim of
independence, implying a significant association exists between the categorical
variables. The following example describes how a contingency table is used to set up this
test of independence.
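The computation of the test statistic can be sketched in plain Python for the data in Exhibit 14 (the 15.51 comparison value is the standard 5% critical value of the chi-square distribution with 8 degrees of freedom):

```python
# Observed joint frequencies from Exhibit 14 (sectors x cap levels).
observed = [[55, 35, 20],
            [50, 30, 30],
            [175, 95, 20],
            [275, 105, 55],
            [20, 25, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)  # 1,000 stocks

# Expected count in each cell: (row total x column total) / overall total;
# the statistic sums (observed - expected)^2 / expected over all cells.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / grand
        chi2 += (obs - exp) ** 2 / exp

dof = (len(observed) - 1) * (len(observed[0]) - 1)  # (5-1) x (3-1) = 8
print(dof, round(chi2, 1))  # 8 49.3
```

Since the statistic far exceeds 15.51, the data are consistent with the informal observation above that sector and market cap are not independent in this portfolio.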

EXAMPLE 5

Contingency Tables and Association between Two Categorical Variables
Suppose we randomly pick 315 investment funds and classify them two ways:
by fund style, either a growth fund or a value fund; and by risk level, either
low risk or high risk. Growth funds primarily invest in stocks whose earnings
are expected to grow at a faster rate than earnings for the broad stock market.
Value funds primarily invest in stocks that appear to be undervalued relative to
their fundamental values. Risk here refers to volatility in the return of a given
investment fund, so low (high) volatility implies low (high) risk. The data are
summarized in a 2 × 2 contingency table shown in Exhibit 18.

Exhibit 18  Contingency Table by Investment Fund Style and Risk Level

           Low Risk    High Risk
Growth         73           26
Value         183           33

1 Calculate the number of growth funds and number of value funds out of
the total funds.
2 Calculate the number of low-­risk and high-­risk funds out of the total
funds.
3 Describe how the contingency table is used to set up a test for independence
between fund style and risk level.

Solution to 1
The task is to calculate the marginal frequencies by fund style, which is done by
adding joint frequencies across the rows. Therefore, the marginal frequency for
growth is 73 + 26 = 99, and the marginal frequency for value is 183 + 33 = 216.
Solution to 2
The task is to calculate the marginal frequencies by fund risk, which is done by
adding joint frequencies down the columns. Therefore, the marginal frequency
for low risk is 73 + 183 = 256, and the marginal frequency for high risk is 26 +
33 = 59.
Solution to 3
Based on the procedure mentioned for conducting a chi-square test of
independence, we would perform the following three steps.
Step 1: Add the marginal frequencies and overall total to the contingency
table. We have also included the relative frequency table for observed values.

Exhibit 19a  Observed Marginal Frequencies and Relative Frequencies

Observed Values (Counts)
           Low Risk    High Risk    Total
Growth         73           26         99
Value         183           33        216
Total         256           59        315

Observed Values (Relative Frequencies, by Row)
           Low Risk    High Risk    Total
Growth        74%          26%       100%
Value         85%          15%       100%

Step 2: Use the marginal frequencies in the contingency table to construct
a table with expected values of the observations. To determine expected values
for each cell, multiply the respective row total by the respective column total,
then divide by the overall total. So, for cell i,j (in the ith row and jth column):
Expected Value i,j = (Total Row i × Total Column j)/Overall Total   (1)
For example,
Expected value for Growth/Low Risk is: (99 × 256)/315 = 80.46; and

Expected value for Value/High Risk is: (216 × 59) / 315 = 40.46.
The table of expected values and the accompanying relative frequency table are:

Exhibit 19b  Expected Marginal Frequencies and Relative Frequencies

Expected Values (Counts)
           Low Risk    High Risk    Total
Growth      80.457       18.543        99
Value      175.543       40.457       216
Total      256           59           315

Expected Values (Relative Frequencies, by Row)
           Low Risk    High Risk    Total
Growth        81%          19%       100%
Value         81%          19%       100%

Step 3: Use the actual values and the expected values of observation counts
to derive the chi-square test statistic, which is then compared to a value from
the chi-square distribution for a given level of significance. If the test statistic
is greater than the chi-square distribution value, then there is evidence of a
significant association between the categorical variables.
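The expected values produced by Equation (1) for Exhibit 18 can be checked with a few lines of Python:

```python
# Observed joint frequencies from Exhibit 18, by fund style.
observed = {"Growth": [73, 26], "Value": [183, 33]}

row_totals = {k: sum(v) for k, v in observed.items()}   # 99 and 216
col_totals = [sum(c) for c in zip(*observed.values())]  # 256 and 59
grand = sum(row_totals.values())                        # 315

# Equation (1): expected cell = (row total x column total) / overall total.
expected = {k: [row_totals[k] * c / grand for c in col_totals]
            for k in observed}

print([round(x, 3) for x in expected["Growth"]])  # [80.457, 18.543]
print([round(x, 3) for x in expected["Value"]])   # [175.543, 40.457]
```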

6 DATA VISUALIZATION

e Describe ways that data may be visualized and evaluate uses of specific
visualizations
Visualization is the presentation of data in a pictorial or graphical format for the
purpose of increasing understanding and gaining insights into the data. As has
been said, "a picture is worth a thousand words." In this section, we discuss a variety
of charts that are useful for understanding distributions, making comparisons, and
exploring potential relationships among data. Specifically, we will cover visualizing
frequency distributions of numerical and categorical data, using plots that represent
multi-dimensional data for discovering relationships, and interpreting visuals that
display unstructured data.

6.1  Histogram and Frequency Polygon
A histogram is a chart that presents the distribution of numerical data by using the
height of a bar or column to represent the absolute frequency of each bin or interval
in the distribution.
To construct a histogram from a continuous variable, we first need to split the
data into bins and summarize the data into a frequency distribution table, such as
the one we constructed in Exhibit 11. In a histogram, the y-axis generally represents
the absolute frequency or the relative frequency in percentage terms, while the x-axis
usually represents the bins of the variable. Using the frequency distribution table in
Exhibit 11, we plot the histogram of daily returns of the EAA Equity Index, as shown
in Exhibit 20. The bars are of equal width, representing the bin width of 1% for each
return interval. The bars are usually drawn with no spaces in between, but small gaps
can also be added between adjacent bars to increase readability, as in this exhibit. In
this case, the height of each bar represents the absolute frequency for each return

bin. A quick glance can tell us that the return bin 0% to 1% (exclusive) has the highest
frequency, with more than 500 observations (555, to be exact), and it is represented
by the tallest bar in the histogram.
An advantage of the histogram is that it can effectively present a large amount of
numerical data that has been grouped into a frequency distribution and can allow a
quick inspection of the shape, center, and spread of the distribution to better
understand it. For example, in Exhibit 20, despite the histogram of daily EAA Equity Index
returns appearing bell-shaped and roughly symmetrical, most bars to the right side
of the origin (i.e., zero) are taller than those on the left side, indicating that more
observations lie in the bins in positive territory. Remember that in the earlier
discussion of this return distribution, it was noted that 54.1% of the observations are
positive daily returns.
As mentioned, histograms can also be created with relative frequencies—the choice
of using absolute versus relative frequency depends on the question being answered.
An absolute frequency histogram best answers the question of how many items are
in each bin, while a relative frequency histogram gives the proportion or percentage
of the total observations in each bin.
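As a sketch, the binning that underlies a histogram can be computed with NumPy (the nine sample returns are hypothetical; the bins follow Exhibit 11's 1% convention from −5% to 5%):

```python
import numpy as np

# Hypothetical daily returns, in percent.
returns = np.array([-4.1, -0.8, -0.3, 0.2, 0.4, 0.9, 1.6, 2.1, 4.9])

# Ten 1%-wide bins spanning -5% to 5%; counts are the bar heights of
# an absolute-frequency histogram (the last bin is closed on the right).
counts, edges = np.histogram(returns, bins=10, range=(-5.0, 5.0))

# Dividing by the total converts to a relative-frequency histogram.
rel = counts / counts.sum()

print(counts.tolist())  # [1, 0, 0, 0, 2, 3, 1, 1, 0, 1]
```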

Exhibit 20  Histogram Overlaid with Frequency Polygon for Daily Returns of EAA Equity Index

[Figure: vertical bars for 1% return bins (x-axis: Index Return (%), −5 to 5; y-axis: Frequency, 0 to 500+), overlaid with a frequency polygon connecting the bin midpoints.]

Another graphical tool for displaying frequency distributions is the frequency
polygon. To construct a frequency polygon, we plot the midpoint of each return
bin on the x-axis and the absolute frequency for that bin on the y-axis. We then
connect neighboring points with a straight line. Exhibit 20 shows the frequency polygon
that overlays the histogram. In the graph, for example, the return interval 1% to 2%
(exclusive) has a frequency of 110, so we plot the return-interval midpoint of 1.50%
on the x-axis and a frequency of 110 on the y-axis. Importantly,
the frequency polygon can quickly convey a visual understanding of the distribution
since it displays frequency as an area under the curve.
Another form for visualizing frequency distributions is the cumulative frequency
distribution chart. Such a chart can plot either the cumulative absolute frequency or
the cumulative relative frequency on the y-axis against the upper limit of the interval.
The cumulative frequency distribution chart allows us to see the number or the
percentage of the observations that lie below a certain value. To construct the cumulative
frequency distribution, we graph the returns in the fourth (i.e., Cumulative Absolute
Frequency) or fifth (i.e., Cumulative Relative Frequency) column of Exhibit 11 against
the upper limit of each return interval.
Exhibit 21 presents the graph of the cumulative absolute frequency distribution
for the daily returns on the EAA Equity Index. Notice that the cumulative distribu-
tion tends to flatten out when returns are extremely negative or extremely positive
because the frequencies in these bins are quite small. The steep slope in the middle
of Exhibit 21 reflects the fact that most of the observations—[(470 + 555)/1,258], or
81.5%—lie in the neighborhood of −1.0% to 1.0%.
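Deriving the cumulative columns from per-bin counts is straightforward; the following Python sketch uses illustrative counts rather than the actual Exhibit 11 figures:

```python
from itertools import accumulate

def cumulative_frequencies(bin_counts):
    """Running totals (cumulative absolute frequency) and their share of
    all observations (cumulative relative frequency), one value per bin."""
    cum_abs = list(accumulate(bin_counts))
    total = cum_abs[-1]
    cum_rel = [c / total for c in cum_abs]
    return cum_abs, cum_rel

cum_abs, cum_rel = cumulative_frequencies([3, 12, 470, 555, 15, 5])
# cum_abs -> [3, 15, 485, 1040, 1055, 1060]; cum_rel ends at 1.0
# Plot either series against each bin's upper limit to get the chart.
```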

Exhibit 21  Cumulative Absolute Frequency Distribution of Daily Returns of EAA Equity Index
[Chart: x-axis, Index Return (%), from −4 to 5; y-axis, Cumulative Frequency, up to 1,200.]

6.2  Bar Chart


As we have demonstrated, the histogram is an efficient graphical tool to present the
frequency distribution of numerical data. The frequency distribution of categorical
data can be plotted in a similar type of graph called a bar chart. In a bar chart, each
bar represents a distinct category, with the bar’s height proportional to the frequency
of the corresponding category.
Similar to plotting a histogram, the construction of a bar chart with one categorical
variable first requires a frequency distribution table summarized from the variable.
Note that the bars can be plotted vertically or horizontally. In a vertical bar chart,
the y-axis still represents the absolute frequency or the relative frequency. Different
from the histogram, however, is that the x-axis in a bar chart represents the mutually
exclusive categories to be compared rather than bins that group numerical data.
For example, using the marginal frequencies for the five GICS sectors shown in
the last column in Exhibit 14, we plot a horizontal bar chart in Exhibit 22 to show
the frequency of stocks by sector in the portfolio. The bars are of equal width to rep-
resent each sector, and sufficient space should be between adjacent bars to separate
them from each other. Because this is a horizontal bar chart—in this case, the x-axis
shows the absolute frequency and the y-axis represents the sectors—the length of

each bar represents the absolute frequency of each sector. Since sectors are nominal
data with no logical ordering, the bars representing sectors may be arranged in any
order. However, in the particular case where the categories in a bar chart are ordered
by frequency in descending order and the chart includes a line displaying cumulative
relative frequency, then it is called a Pareto Chart. The chart is often used to highlight
dominant categories or the most important groups.
Bar charts provide a snapshot to show the comparison between categories of data.
As shown in Exhibit 22, the sector in which the portfolio holds most stocks is the
health care sector, with 435 stocks, followed by the energy sector, with 290 stocks.
The sector in which the portfolio has the least number of stocks is utilities, with 55
stocks. To compare categories more accurately, in some cases we may add the fre-
quency count to the right end of each bar (or the top end of each bar in the case of
a vertical bar chart).
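As a minimal sketch of the idea, the three sector counts quoted above can be rendered as text-based horizontal bars (a charting library would be used in practice):

```python
# Text sketch of a horizontal bar chart: bar length is proportional to
# each category's frequency, and the count labels the end of each bar.
def text_bar_chart(freq, width=40):
    peak = max(freq.values())
    rows = []
    for name, count in freq.items():
        bar = "#" * round(width * count / peak)
        rows.append(f"{name:<12}| {bar} {count}")
    return rows

for row in text_bar_chart({"Health Care": 435, "Energy": 290, "Utilities": 55}):
    print(row)
```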

Exhibit 22  Frequency by Sector for Stocks in a Portfolio
[Horizontal bar chart: y-axis lists the sectors (Communication Services, Consumer Staples, Energy, Health Care, Utilities); x-axis, Frequency, from 0 to 400.]

The bar chart shown in Exhibit 22 can present the frequency distribution of only
one categorical variable. In the case of two categorical variables, we need an enhanced
version of the bar chart, called a grouped bar chart (also known as a clustered bar
chart), to show joint frequencies. Using the joint frequencies by sector and by level
of market capitalization given in Exhibit  14, for example, we show how a grouped
bar chart is constructed in Exhibit 23. While the y-axis still represents the same cat-
egorical variable (the distinct GICS sectors as in Exhibit 22), in Exhibit 23 three bars
are clustered side-­by-­side within the same sector to represent the three respective
levels of market capitalization. The bars within each cluster should be colored differ-
ently to distinguish between them, but the color schemes for the sub-­groups must
be identical across the sector clusters, as shown by the legend at the upper right of
Exhibit 23. Additionally, the bars in each sector cluster must always be placed in the
same order throughout the chart. It is easy to see that the small-cap health care stocks
are the sub-­group with the highest frequency (275), and we can also see that small-­
cap stocks are the largest sub-­group within each sector—except for utilities, where
mid cap is the largest.

Exhibit 23  Frequency by Sector and Level of Market Capitalization for Stocks in a Portfolio
[Grouped bar chart: y-axis lists the sectors; x-axis, Frequency, from 0 to 250; within each sector, three side-by-side bars for Small Cap, Mid Cap, and Large Cap.]

An alternative form for presenting the joint frequency distribution of two cat-
egorical variables is a stacked bar chart. In the vertical version of a stacked bar
chart, the bars representing the sub-­groups are placed on top of each other to form
a single bar. Each subsection of the bar is shown in a different color to represent the
contribution of each sub-­group, and the overall height of the stacked bar represents
the marginal frequency for the category. Exhibit 23 can be replotted in a stacked bar
chart, as shown in Exhibit 24.

Exhibit 24  Frequency by Sector and Level of Market Capitalization in a Stacked Bar Chart
[Vertical stacked bar chart: x-axis lists the sectors (Communication Services, Consumer Staples, Energy, Health Care, Utilities); y-axis, Frequency, from 0 to 400; each bar stacks the Small Cap, Mid Cap, and Large Cap sub-groups.]
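A quick check of this property, using the utilities and energy sub-group counts shown in Exhibit 25:

```python
# The overall height of each stacked bar equals the marginal frequency:
# the sum of the sub-group (market-cap) counts within the sector.
joint = {
    "Utilities": {"Small Cap": 20, "Mid Cap": 25, "Large Cap": 10},
    "Energy":    {"Small Cap": 175, "Mid Cap": 95, "Large Cap": 20},
}
marginal = {sector: sum(caps.values()) for sector, caps in joint.items()}
print(marginal)  # {'Utilities': 55, 'Energy': 290}
```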



We have shown that the frequency distribution of categorical data can be clearly
and efficiently presented by using a bar chart. However, it is worth noting that appli-
cations of bar charts may be extended to more general cases when categorical data are
associated with numerical data. For example, suppose we want to show a company’s
quarterly profits over the past one year. In this case, we can plot a vertical bar chart
where each bar represents one of the four quarters in a time order and its height
indicates the value of profits for that quarter.

6.3  Tree-­Map
In addition to bar charts and grouped bar charts, another graphical tool for displaying
categorical data is a tree-­map. It consists of a set of colored rectangles to represent
distinct groups, and the area of each rectangle is proportional to the value of the
corresponding group. For example, referring back to the marginal frequencies by
GICS sector in Exhibit 14, we plot a tree-­map in Exhibit 25 to represent the frequency
distribution by sector for stocks in the portfolio. The tree-­map clearly shows that
health care is the sector with the largest number of stocks in the portfolio, which is
represented by the rectangle with the largest area.

Exhibit 25  Tree-Map for Frequency Distribution by Sector in a Portfolio
[Tree-map of nested rectangles, with each rectangle's area proportional to frequency. Counts by sector and market-cap sub-group: Health Care: Small (275), Mid (105), Large (55); Energy: Small (175), Mid (95), Large (20); Communication Services: Small (55), Mid (35), Large (20); Consumer Staples: Small (50), Mid (30), Large (30); Utilities: Small (20), Mid (25), Large (10).]

Note that this example also depicts one more categorical variable (i.e., level of
market capitalization). The tree-­map can represent data with additional dimensions
by displaying a set of nested rectangles. To show the joint frequencies of sub-­groups
by sector and level of market capitalization, as given in Exhibit 14, we can split each
existing rectangle for sector into three sub-­rectangles to represent small-­cap, mid-­cap,
and large-­cap stocks, respectively. In this case, the area of each nested rectangle would
be proportional to the number of stocks in each market capitalization sub-­group.
The exhibit clearly shows that small-­cap health care is the sub-­group with the largest
number of stocks. It is worth noting a caveat for using tree-­maps: Tree-­maps become
difficult to read if the hierarchy involves more than three levels.
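The layout logic behind a tree-map can be sketched with a simple one-level "slice" partition; production tree-maps typically use the squarified algorithm instead, which keeps rectangle aspect ratios closer to one:

```python
# Slice layout sketch: split a unit square into vertical strips whose
# widths (and therefore areas) are proportional to group frequency.
def slice_layout(freq, width=1.0, height=1.0):
    total = sum(freq.values())
    rects, x = {}, 0.0
    for name, count in freq.items():
        w = width * count / total
        rects[name] = (x, 0.0, w, height)  # (x, y, width, height)
        x += w
    return rects

rects = slice_layout({"Health Care": 435, "Energy": 290, "Utilities": 55})
# Health Care gets the widest strip: 435/780 of the total area.
```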

6.4  Word Cloud


So far, we have shown how to visualize the frequency distribution of numerical data
or categorical data. However, can we find a chart to depict the frequency of unstruc-
tured data—particularly, textual data? A word cloud (also known as tag cloud) is a
visual device for representing textual data. A word cloud consists of words extracted
from a source of textual data, with the size of each distinct word being proportional
to the frequency with which it appears in the given text. Note that common words
(e.g., “a,” “it,” “the”) are generally stripped out to focus on key words that convey the
most meaningful information. This format allows us to quickly perceive the most
frequent terms among the given text to provide information about the nature of the
text, including topic and whether or not the text conveys positive or negative news.
Moreover, words conveying different sentiment may be displayed in different colors.
For example, “profit” typically indicates positive sentiment so might be displayed in
green, while “loss” typically indicates negative sentiment and may be shown in red.
Exhibit 26 is an excerpt from the Management’s Discussion and Analysis (MDA)
section of the 10-­Q filing for QXR Inc. for the quarter ended 31 March 20XX. Taking
this text, we can create a word cloud, as shown in Exhibit 27. A quick glance at the word
cloud tells us that the following words stand out (i.e., they were used most frequently
in the MDA text): “billion,” “revenue,” “year,” “income,” “growth,” and “financial.” Note
that specific words, such as “income” and “growth,” typically convey positive sentiment,
as contrasted with such words as “loss” and “decline,” which typically convey negative
sentiment. In conclusion, word clouds are a useful tool for visualizing textual data that
can facilitate understanding the topic of the text as well as the sentiment it may convey.
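The preprocessing behind a word cloud (tokenizing, stripping common stop words, and counting) can be sketched as follows; the stop-word list here is a small illustrative sample, not a standard one:

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "was", "for",
              "our", "with", "in", "on", "over"}

def word_frequencies(text, stop=STOP_WORDS):
    """Lowercase the text, extract alphabetic tokens, drop stop words,
    and count the remaining words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop)

freqs = word_frequencies(
    "Revenues of $36.3 billion and revenue growth of 17% year over year."
)
# In the rendered cloud, font size would be proportional to each count,
# and words could be colored by sentiment (e.g., green for "profit").
print(freqs["year"])  # 2
```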

Exhibit 26  Excerpt of MDA Section in Form 10-­Q of QXR Inc. for Quarter
Ended 31 March 20XX
MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL
CONDITION AND RESULTS OF OPERATIONS
Please read the following discussion and analysis of our financial condition and
results of operations together with our consolidated financial statements and
related notes included under Part I, Item 1 of this Quarterly Report on Form 10-Q.
Executive Overview of Results
Below are our key financial results for the three months ended March 31, 20XX
(consolidated unless otherwise noted):
■■ Revenues of $36.3 billion and revenue growth of 17% year over year, con-
stant currency revenue growth of 19% year over year.
■■ Major segment revenues of $36.2 billion with revenue growth of 17% year
over year and other segments’ revenues of $170 million with revenue
growth of 13% year over year.
■■ Revenues from the United States, EMEA, APAC, and Other Americas
were $16.5 billion, $11.8 billion, $6.1 billion, and $1.9 billion, respectively.
■■ Cost of revenues was $16.0 billion, consisting of TAC of $6.9 billion and
other cost of revenues of $9.2 billion. Our TAC as a percentage of adver-
tising revenues were 22%.
■■ Operating expenses (excluding cost of revenues) were $13.7 billion,
including the EC AFS fine of $1.7 billion.
■■ Income from operations was $6.6 billion.
■■ Other income (expense), net, was $1.5 billion.
■■ Effective tax rate was 18%.
■■ Net income was $6.7 billion with diluted net income per share of $9.50.
■■ Operating cash flow was $12.0 billion.
■■ Capital expenditures were $4.6 billion.

Exhibit 27  Word Cloud Visualizing Excerpted Text in MDA Section in Form 10-Q of QXR Inc.
[Word cloud image: the largest (most frequent) words include "billion," "revenue," "year," "income," "growth," and "financial."]
6.5  Line Chart


A line chart is a type of graph used to visualize ordered observations. Often a line
chart is used to display the change of data series over time. Note that the frequency
polygon in Exhibit 20 and the cumulative frequency distribution chart in Exhibit 21
are also line charts but used particularly in those instances for representing data
frequency distributions.
Constructing a line chart is relatively straightforward: We first plot all the data
points against horizontal and vertical axes and then connect the points by straight
line segments. For example, to show the 10-­day daily closing prices of ABC Inc. stock
presented in Exhibit 5, we first construct a chart with the x-axis representing time (in
days) and the y-axis representing stock price (in dollars). Next, plot each closing price
as points against both axes, and then use straight line segments to join the points
together, as shown in Exhibit 28.

An important benefit of a line chart is that it facilitates showing changes in the data and underlying trends in a clear and concise way. This helps to understand the
current data and also helps with forecasting the data series. In Exhibit 28, for example,
it is easy to spot the price changes over the first 10 trading days since ABC’s initial
public offering (IPO). We see that the stock price peaked on Day 3 and then traded
lower. Following a partial recovery on Day 7, it declined steeply to around $50 on Day
10. In contrast, although the one-­dimensional data array table in Exhibit 5 displays
the same values as the line chart, the data table by itself does not provide a quick
snapshot of changes in the data or facilitate understanding underlying trends. This is
why line charts are helpful for visualization, particularly in cases of large amounts of
data (i.e., hundreds, or even thousands, of data points).

Exhibit 28  Daily Closing Prices of ABC Inc.'s Stock and Its Sector Index
[Line chart: x-axis, Day (1 to 10); left y-axis, Price ($), from 50 to 58; right y-axis, Sector Index, from about 6,240 to 6,380; one line per series, identified in the legend.]

A line chart is also capable of accommodating more than one set of data points,
which is especially helpful for making comparisons. We can add a line to represent
each group of data (e.g., a competitor’s stock price or a sector index), and each
line would have a distinct color or line pattern identified in a legend. For example,
Exhibit 28 also includes a plot of ABC’s sector index (i.e., the sector index for which
ABC stock is a member, like health care or energy) over the same period. The sector
index is displayed with its own distinct color to facilitate comparison. Note also that
because the sector index has a different range (approximately 6,230 to 6,390) than
ABC’s stock ($50 to $59 per share), we need a secondary y-axis to correctly display
the sector index, which is on the right-­hand side of the exhibit.
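In matplotlib, for example, a secondary y-axis is created with twinx(); the sketch below uses made-up price and index values, not the Exhibit 5 data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

days = list(range(1, 11))
price = [57, 58, 59, 56, 54, 53, 55, 53, 51, 50]   # peaks Day 3, ends near $50
sector = [6240, 6250, 6270, 6290, 6300, 6320, 6310, 6340, 6360, 6380]

fig, ax_price = plt.subplots()
ax_price.plot(days, price, color="tab:blue")
ax_price.set_xlabel("Day")
ax_price.set_ylabel("Price ($)")

ax_index = ax_price.twinx()  # secondary y-axis sharing the same x-axis
ax_index.plot(days, sector, color="tab:orange")
ax_index.set_ylabel("Sector Index")

fig.savefig("abc_vs_sector.png")
```

Because each axis is scaled independently, the two series remain readable even though their ranges differ by two orders of magnitude.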
This comparison can help us understand whether ABC’s stock price movement
over the period is due to potential mispricing of its share issuance or instead due to
industry-­specific factors that also affect its competitors’ stock prices. The comparison
shows that over the period, the sector index moved in a nearly opposite trend versus
ABC’s stock price movement. This indicates that the steep decline in ABC’s stock price
is less likely attributable to sector-­specific factors and more likely due to potential
over-­pricing of its IPO or to other company-­specific factors.
When an observational unit (here, ABC Inc.) has more than two features (or
variables) of interest, it would be useful to show the multi-­dimensional data all in
one chart to gain insights from a more holistic view. How can we add an additional
dimension to a two-­dimensional line chart? We can replace the data points with

varying-sized bubbles to represent a third dimension of the data. Moreover, these bubbles may even be color-coded to present additional information. This version of
a line chart is called a bubble line chart.
Exhibit 7, for example, presented three types of quarterly data for ABC Inc. for
use in a valuation analysis. We would like to plot two of them, revenue and earnings
per share (EPS), over the two-­year period. As shown in Exhibit 29, with the x-axis
representing time (i.e., quarters) and the y-axis representing revenue in millions of
dollars, we can plot the revenue data points against both axes to form a typical line
chart. Next, each marker representing a revenue data point is replaced by a circular
bubble with its size proportional to the magnitude of the EPS in the corresponding
quarter. Moreover, the bubbles are colored in a binary scheme with green represent-
ing profits and red representing losses. In this way, the bubble line chart reflects the
changes for both revenue and EPS simultaneously, and it also shows whether the EPS
represents a profit or a loss.

Exhibit 29  Quarterly Revenue and EPS of ABC Incorporated
[Bubble line chart: x-axis, quarters Q1 Year 1 through Q4 Year 2; y-axis, Revenue ($ millions), from roughly 3,500 to 6,000; each revenue marker is a bubble sized by EPS magnitude (green for profit, red for loss), with labeled EPS values ranging from $3.89 to −$8.66.]

As depicted, ABC’s earnings were quite volatile during its initial two years as a public company. Earnings started off as a profit of $1.37/share but finished the first year with a big loss of −$8.66/share, during which time revenue experienced only small fluctuations. Furthermore, while revenues and earnings both subsequently recovered sharply—peaking in Q2 of Year 2—revenues then declined, and the company returned to significant losses (−$3.98/share) by the end of Year 2.

6.6  Scatter Plot


A scatter plot is a type of graph for visualizing the joint variation in two numerical
variables. It is a useful tool for displaying and understanding potential relationships
between the variables.
A scatter plot is constructed with the x-axis representing one variable and the
y-axis representing the other variable. It uses dots to indicate the values of the two
variables for a particular point in time, which are plotted against the corresponding
axes. Suppose an analyst is investigating potential relationships between sector index
returns and returns for the broad market, such as the S&P 500 Index. Specifically, he
or she is interested in the relative performance of two sectors, information technology
(IT) and utilities, compared to the market index over a specific five-­year period. The
analyst has obtained the sector and market index returns for each month over the

five years under investigation and plotted the data points in the scatter plots, shown
in Exhibit 30 for IT versus the S&P 500 returns and in Exhibit 31 for utilities versus
the S&P 500 returns.
Despite their relatively straightforward construction, scatter plots convey lots of
valuable information. First, it is important to inspect for any potential association
between the two variables. The pattern of the scatter plot may indicate no apparent
relationship, a linear association, or a non-­linear relationship. A scatter plot with
randomly distributed data points would indicate no clear association between the
two variables. However, if the data points seem to align along a straight line, then
there may exist a significant relationship among the variables. A positive (negative)
slope for the line of data points indicates a positive (negative) association, meaning
the variables move in the same (opposite) direction. Furthermore, the strength of the
association can be determined by how closely the data points are clustered around
the line. Tight (loose) clustering signals a potentially stronger (weaker) relationship.
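The direction and strength of the association that a scatter plot suggests can also be quantified with the sample correlation coefficient; a sketch with illustrative return series:

```python
# Correlation sketch: the sign gives the direction of association, and the
# magnitude (between 0 and 1) reflects how tightly the points cluster
# around a line.
def correlation(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

market = [-2.0, -1.0, 0.5, 1.5, 3.0]      # illustrative monthly returns (%)
it_sector = [-2.5, -0.8, 0.7, 1.9, 3.4]   # moves with the market
print(correlation(market, it_sector) > 0.9)  # True: strong positive
```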

Exhibit 30  Scatter Plot of Information Technology Sector Index Return vs. S&P 500 Index Return
[Scatter plot: x-axis, S&P 500 return (%), from −7.5 to 7.5; y-axis, Information Technology sector return (%), from −7.5 to 10.0; an arrow marks one outlier observation.]

Exhibit 31  Scatter Plot of Utilities Sector Index Return vs. S&P 500 Index Return
[Scatter plot: x-axis, S&P 500 return (%), from −7.5 to 7.5; y-axis, Utilities sector return (%).]

Examining Exhibit 30, we can see the returns of the IT sector are highly positively
associated with S&P 500 Index returns because the data points are tightly clustered
along a positively sloped line. Exhibit 31 tells a different story for relative performance
of the utilities sector and S&P 500 index returns: The data points appear to be distrib-
uted in no discernable pattern, indicating no clear relationship among these variables.
Second, observing the data points located toward the ends of each axis, which represent
the maximum or minimum values, provides a quick sense of the data range. Third,
assuming that a relationship among the variables is apparent, inspecting the scatter
plot can help to spot extreme values (i.e., outliers). For example, an outlier data point
is readily detected in Exhibit 30, as indicated by the arrow. As you will learn later in
the CFA Program curriculum, finding these extreme values and handling them with
appropriate measures is an important part of the financial modeling process.
Scatter plots are a powerful tool for finding patterns between two variables, for
assessing data range, and for spotting extreme values. In practice, however, there
are situations where we need to inspect for pairwise associations among many vari-
ables—for example, when conducting feature selection from dozens of variables to
build a predictive model.
A scatter plot matrix is a useful tool for organizing scatter plots between pairs
of variables, making it easy to inspect all pairwise relationships in one combined
visual. For example, suppose the analyst would like to extend his or her investigation
by adding another sector index. He or she can use a scatter plot matrix, as shown in
Exhibit  32, which now incorporates four variables, including index returns for the
S&P 500 and for three sectors: IT, utilities, and financials.

Exhibit 32  Pairwise Scatter Plot Matrix
[4 × 4 grid of panels for S&P 500, Information Technology, Utilities, and Financials index returns: univariate histograms along the diagonal and a pairwise scatter plot in every off-diagonal panel.]

The scatter plot matrix contains each combination of bivariate scatter plot (i.e.,
S&P 500 vs. each sector, IT vs. utilities, IT vs. financials, and financials vs. utilities) as
well as univariate frequency distribution histograms for each variable plotted along
the diagonal. In this way, the scatter plot matrix provides a concise visual summary of
each variable and of potential relationships among them. Importantly, the construction
of the scatter plot matrix is typically a built-­in function in most major statistical soft-
ware packages, so it is relatively easy to implement. It is worth pointing out that the
upper triangle of the matrix is the mirror image of the lower triangle, so the compact
form of the scatter plot matrix that uses only the lower triangle is also appropriate.
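The size of the grid grows quickly: n variables produce n(n − 1)/2 distinct pairwise panels plus n diagonal histograms. In pandas, for instance, pandas.plotting.scatter_matrix(df) builds the full grid in one call; the stdlib sketch below simply enumerates the panels:

```python
from itertools import combinations

variables = ["S&P 500", "Information Technology", "Utilities", "Financials"]
pairs = list(combinations(variables, 2))  # one panel per pair (lower triangle)

print(len(pairs))  # 6 distinct pairwise scatter plots for 4 variables
for a, b in pairs:
    print(f"{a} vs. {b}")
```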
With the addition of the financial sector, the bottom panel of Exhibit 32 reveals the
following additional information, which can support sector allocation in the portfolio
construction process:
■■ Strong positive relationship between returns of financials and the S&P 500;

■■ Positive relationship between returns of financials and IT; and
■■ No clear relationship between returns of financials and utilities.
It is important to note that despite their usefulness, scatter plots and scatter plot
matrixes should not be considered as a substitute for robust statistical tests; rather,
they should be used alongside such tests for best results.

6.7  Heat Map


A heat map is a type of graphic that organizes and summarizes data in a tabular
format and represents them using a color spectrum. For example, given a portfolio,
we can create a contingency table that summarizes the joint frequencies of the stock
holdings by sector and by level of market capitalization, as in Exhibit 33.

Exhibit 33  Frequencies by Sector and Market Capitalization in Heat Map

                         Small Cap   Mid Cap   Large Cap
Communication Services       21         43         83
Consumer Staples             36         81         45
Energy                       99         95         29
Health Care                   4          8         18
Utilities                    81         37         58

[Cells are shaded from low (around 20) to high (around 80 and above) counts per the color spectrum at the right of the chart.]

Cells in the chart are color-­coded to differentiate high values from low values by
using the color scheme defined in the color spectrum on the right side of the chart.
As shown by the heat map, this portfolio has the largest exposure (in terms of num-
ber of stocks) to small- and mid-­cap energy stocks. It has substantial exposures to
large-­cap communications services, mid-­cap consumer staples, and small-­cap utilities;
however, exposure to the health care sector is limited. In sum, the heat map reveals
this portfolio to be relatively well-­diversified among sectors and market-­cap levels.
Besides their use in displaying frequency distributions, heat maps are commonly used
for visualizing the degree of correlation among different variables.
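Because a heat map is simply a color-coded contingency table, the underlying counts remain easy to query programmatically; for example, using the Exhibit 33 data:

```python
# Find the cell with the largest joint frequency in Exhibit 33's table.
table = {
    "Communication Services": [21, 43, 83],
    "Consumer Staples":       [36, 81, 45],
    "Energy":                 [99, 95, 29],
    "Health Care":            [4, 8, 18],
    "Utilities":              [81, 37, 58],
}
caps = ["Small Cap", "Mid Cap", "Large Cap"]

sector, col = max(
    ((s, i) for s, row in table.items() for i in range(len(caps))),
    key=lambda cell: table[cell[0]][cell[1]],
)
print(sector, caps[col], table[sector][col])  # Energy Small Cap 99
```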

EXAMPLE 6 

Evaluating Data Visuals


1 You have a cumulative absolute frequency distribution graph (similar to
the one in Exhibit 21) of daily returns over a five-­year period for an index
of Asian equity markets.
Interpret the meaning of the slope of such a graph.

2 You are creating a word cloud for a visual representation of text on a company’s quarterly earnings announcements over the past three years.
The word cloud uses font size to indicate word frequency. This particular
company has experienced both quarterly profits and losses during the
period under investigation.
Describe how the word cloud might be used to convey information
besides word frequency.
3 You are examining a scatter plot of monthly stock returns, similar to the
one in Exhibit 30, for two technology companies: one is a hardware man-
ufacturer, and the other is a software developer. The scatter plot shows a
strong positive association among their returns.
Describe what other information the scatter plot can provide.
4 You are reading a vertical bar chart displaying the sales of a company over
the past five years. The sales of the first four years seem nearly flat as the
corresponding bars are nearly the same height, but the bar representing
the sales of the most recent year is approximately three times as high as
the other bars.
Explain whether we can conclude that the sales of the fifth year tripled
compared to sales in the earlier years.

Solution 1
The slope of the graph of a cumulative absolute frequency distribution reflects
the change in the number of observations between two adjacent return bins. A
steep (flat) slope indicates a large (small) change in the frequency of observations
between adjacent return bins.
Solution 2
Color can add an additional dimension to the information conveyed in the word
cloud. For example, red can be used for “losses” and other words conveying neg-
ative sentiment, and green can be used for “profit” and other words indicative
of positive sentiment.
Solution 3
Besides the sign and degree of association of the stocks’ returns, the scatter
plot can provide a visual representation of whether the association is linear or
non-­linear, the maximum and minimum values for the return observations, and
an indication of which observations may have extreme values (i.e., are potential
outliers).
Solution 4
Typically, the heights of bars in a vertical bar chart are proportional to the values
that they represent. However, if the graph is using a truncated y-axis (i.e., one
that does not start at zero), then values are not accurately represented by the
height of bars. Therefore, we need to examine the y-axis of the bar chart before
concluding that sales in the fifth year were triple the sales of the prior years.

6.8  Guide to Selecting among Visualization Types


We have introduced and discussed a variety of different visualization types that are
regularly used in investment practice. When it comes to selecting a chart for visualizing
data, the intended purpose is the key consideration: Is it for exploring and/or presenting

distributions or relationships, or is it for making comparisons? Given your intended purpose, the best selection is typically the simplest visual that conveys the message
or achieves the specific goal. Exhibit 34 presents a flow chart for facilitating selection
among the visualization types we have discussed. Finally, note that some visualization
types, such as bar chart and heat map, may be suitable for several different purposes.

Exhibit 34  Flow Chart of Selecting Visualization Types
[Flow chart, starting from the question "What to explore or present?":
■■ Relationship: Scatter Plot (two variables), Scatter Plot Matrix (multiple variables), or Heat Map (multiple variables)
■■ Distribution of numerical data: Histogram, Frequency Polygon, or Cumulative Distribution Chart
■■ Distribution of categorical data: Bar Chart, Tree-Map, or Heat Map
■■ Distribution of unstructured data: Word Cloud
■■ Comparison among categories: Bar Chart, Tree-Map, or Heat Map
■■ Comparison over time: Line Chart (two variables) or Bubble Line Chart (three variables)]

Data visualization is a powerful tool to show data and gain insights into data.
However, we need to be cautious that a graph could be misleading if data are mispre-
sented or the graph is poorly constructed. There are numerous different ways that may
lead to a misleading graph. We list four typical pitfalls here that analysts should avoid.
First, an improper chart type is selected to present data, which would hinder the
accurate interpretation of data. For example, to investigate the correlation between
two data series, we can construct a scatter plot to visualize the joint variation between
two variables. In contrast, plotting the two data series separately in a line chart would
make it rather difficult to examine the relationship.
Second, data are selectively plotted in favor of the conclusion an analyst intends
to draw. For example, data presented for an overly short time period may appear to
show a trend that is actually noise—that is, variation within the data’s normal range
if examining the data over a longer time period. So, presenting data for too short a
time window may mistakenly point to a non-existent trend.
Third, data are improperly plotted in a truncated graph that has a y-axis that
does not start at zero. In some situations, the truncated graph can create the false
impression of significant differences when there is actually only a small difference.
For example, suppose a vertical bar chart is used to compare annual revenues of two
companies, one with $9 billion and the other with $10 billion. If the y-axis starts at
$8  billion, then the bar heights would inaccurately imply that the latter company’s
revenue is twice the former company’s revenue.
Last, but not least, is the improper scaling of axes. For example, given a line chart,
setting a higher than necessary maximum on the y-axis tends to compress the graph
into an area close to the x-axis. This causes the graph to appear to be less steep and
less volatile than if it was properly plotted. In sum, analysts need to avoid these misuses
of visualization when charting data and must ensure the ethical use of data visuals.

EXAMPLE 7 

Selecting Visualization Types


1 A portfolio manager plans to buy several stocks traded on a small emerg-
ing market exchange but is concerned whether the market can provide
sufficient liquidity to support her purchase order size. As the first step,
she wants to analyze the daily trading volumes of one of these stocks over
the past five years.
Explain which type of chart can best provide a quick view of trading vol-
ume for the given period.
2 An analyst is building a model to predict stock market downturns.
According to the academic literature and his practitioner knowledge and
expertise, he has selected 10 variables as potential predictors. Before
continuing to construct the model, the analyst would like to get a sense
of how closely these variables are associated with the broad stock market
index and whether any pair of variables are associated with each other.
Describe the most appropriate visual to select for this purpose.
3 Central Bank members meet regularly to assess the economy and decide
on any interest rate changes. Minutes of their meetings are published on
the Central Bank’s website. A quantitative researcher wants to analyze the
meeting minutes for use in building a model to predict future economic
growth.
Explain which type of chart is most appropriate for creating an overview
of the meeting minutes.
4 A private investor wants to add a stock to her portfolio, so she asks her
financial adviser to compare the three-­year financial performances (by
quarter) of two companies. One company experienced consistent revenue
and earnings growth, while the other experienced volatile revenue and
earnings growth, including quarterly losses.
Describe the chart the adviser should use to best show these performance
differences.

Solution to 1
The five-­year history of daily trading volumes contains a large amount of
numerical data. Therefore, a histogram is the best chart for grouping these data
into frequency distribution bins and for showing a quick snapshot of the shape,
center, and spread of the data’s distribution.
Solution to 2
To inspect for a potential relationship between two variables, a scatter plot is
a good choice. But with 10 variables, plotting individual scatter plots is not an
efficient approach. Instead, utilizing a scatter plot matrix would give the analyst
a good overview in one comprehensive visual of all the pairwise associations
between the variables.
Solution to 3
Since the meeting minutes consist of textual data, a word cloud would be the
most suitable tool to visualize the textual data and facilitate the researcher’s
understanding of the topic of the text as well as the sentiment, positive or neg-
ative, it may convey.

Solution to 4
The best chart for making this comparison would be a bubble line chart using
two different color lines to represent the quarterly revenues for each company.
The bubble sizes would then indicate the magnitude of each company’s quarterly
earnings, with green bubbles signifying profits and red bubbles indicating losses.

7 MEASURES OF CENTRAL TENDENCY

g Calculate and interpret measures of central tendency
So far, we have discussed methods we can use to organize and present data so that
they are more understandable. The frequency distribution of an asset return series,
for example, reveals much about the nature of the risks that investors may encounter
in a particular asset. Although frequency distributions, histograms, and contingency
tables provide a convenient way to summarize a series of observations, these methods
are just a first step toward describing the data. In this section, we discuss the use of
quantitative measures that explain characteristics of data. Our focus is on measures
of central tendency and other measures of location. A measure of central tendency
specifies where the data are centered. Measures of central tendency are probably more
widely used than any other statistical measure because they can be computed and
applied relatively easily. Measures of location include not only measures of central
tendency but other measures that illustrate the location or distribution of data.
In the following subsections, we explain the common measures of central ten-
dency—the arithmetic mean, the median, the mode, the weighted mean, the geometric
mean, and the harmonic mean. We also explain other useful measures of location,
including quartiles, quintiles, deciles, and percentiles.
A statistic is a summary measure of a set of observations, and descriptive statistics
summarize the central tendency and spread (variation) in the distribution of data.
If the statistic summarizes the set of all possible observations of a population, we
refer to the statistic as a parameter. If the statistic summarizes a set of observations
that is a subset of the population, we refer to the statistic as a sample statistic, often
leaving off the word “sample” and simply referring to it as a statistic. While measures
of central tendency and location can be calculated for populations and samples, our
focus is on sample measures (i.e., sample statistics) as it is rare that an investment
manager would be dealing with an entire population of data.

7.1  The Arithmetic Mean


Analysts and portfolio managers often want one number that describes a representative
possible outcome of an investment decision. The arithmetic mean is one of the most
frequently used measures of the center of data.
Definition of Arithmetic Mean. The arithmetic mean is the sum of the values
of the observations divided by the number of observations.

7.1.1  The Sample Mean


The sample mean is the arithmetic mean or arithmetic average computed for a sam-
ple. As you will see, we use the terms “mean” and “average” interchangeably. Often,
we cannot observe every member of a population; instead, we observe a subset or
sample of the population.

Sample Mean Formula. The sample mean or average, $\bar{X}$ (read “X-­bar”), is the
arithmetic mean value of a sample:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{2}$$

where n is the number of observations in the sample.
Equation 2 tells us to sum the values of the observations (Xi) and divide the sum
by the number of observations. For example, if a sample of market capitalizations for
six publicly traded Australian companies contains the values (in AUD billions) 35, 30,
22, 18, 15, and 12, the sample mean market cap is 132/6 = A$22 billion. As previously
noted, the sample mean is a statistic (that is, a descriptive measure of a sample).
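As an illustration, Equation 2 can be sketched in a few lines of Python using the market-cap sample above:

```python
# Sample mean (Equation 2): sum the observations and divide by their count.
market_caps = [35, 30, 22, 18, 15, 12]   # AUD billions
sample_mean = sum(market_caps) / len(market_caps)
print(sample_mean)  # 22.0, i.e., A$22 billion
```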
Means can be computed for individual units or over time. For instance, the sample
might be the return on equity (ROE) in a given year for a sample of 25 companies in
the FTSE Eurotop 100, an index of Europe’s 100 largest companies. In this case, we
calculate the mean ROE in that year as an average across 25 individual units. When
we examine the characteristics of some units at a specific point in time (such as ROE
for the FTSE Eurotop 100), we are examining cross-­sectional data; the mean of these
observations is the cross-­sectional mean. If the sample consists of the historical
monthly returns on the FTSE Eurotop 100 for the past five years, however, then we
have time-­series data; the mean of these observations is the time-­series mean. We
will examine specialized statistical methods related to the behavior of time series in
the reading on time-­series analysis.
Except in cases of large datasets with many observations, we should not expect any
of the actual observations to equal the mean; sample means provide only a summary
of the data being analyzed. Also, although in some cases the number of values below
the mean is quite close to the number of values above the mean, this need not be
the case. As an analyst, you will often need to find a few numbers that describe the
characteristics of the distribution, and we will consider more later. The mean is gen-
erally the statistic that you use as a measure of the typical outcome for a distribution.
You can then use the mean to compare the performance of two different markets.
For example, you might be interested in comparing the stock market performance of
investments in Asia Pacific with investments in Europe. You can use the mean returns
in these markets to compare investment results.

EXAMPLE 8 

Calculating a Cross-­Sectional Mean


Suppose we want to examine the performance of a sample of selected stock
indexes from 11 different countries. The 52-­week percentage change is reported
in Exhibit 35 for Year 1, Year 2, and Year 3 for the sample of indexes.

Exhibit 35  Annual Returns for Years 1 to 3 for Selected Countries’ Stock Indexes

                 52-­Week Return (%)
Index         Year 1    Year 2    Year 3
Country A     −15.6      −5.4       6.1
Country B       7.8       6.3      −1.5
Country C       5.3       1.2       3.5
Country D      −2.4      −3.1       6.2
Country E      −4.0      −3.0       3.0
Country F       5.4       5.2      −1.0
Country G      12.7       6.7      −1.2
Country H       3.5       4.3       3.4
Country I       6.2       7.8       3.2
Country J       8.1       4.1      −0.9
Country K      11.5       3.4       1.2

[Bar chart: annual returns (%) in Years 1, 2, and 3 for each of the 11 country indexes, A through K, plotted on an axis from −20% to 15%.]
Using the data provided, calculate the sample mean return for the 11 indexes
for each year.
Solution:
For Year 3, the calculation applies Equation 2 to the returns for Year 3: (6.1 − 1.5 +
3.5 + 6.2 + 3.0 − 1.0 − 1.2 + 3.4 + 3.2 − 0.9 + 1.2)/11 = 22.0/11 = 2.0%. Using
a similar calculation, the sample mean is 3.5% for Year 1 and 2.5% for Year 2.
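The Year 3 cross-sectional mean in the solution can be reproduced with a short Python sketch:

```python
# Cross-sectional mean of the 11 country indexes' Year 3 returns (Exhibit 35).
year3_returns = [6.1, -1.5, 3.5, 6.2, 3.0, -1.0, -1.2, 3.4, 3.2, -0.9, 1.2]
mean_year3 = sum(year3_returns) / len(year3_returns)
print(mean_year3)  # 2.0 (%), matching the solution above
```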

7.1.2  Properties of the Arithmetic Mean


The arithmetic mean can be likened to the center of gravity of an object. Exhibit 36
expresses this analogy graphically by plotting nine hypothetical observations on a
bar. The nine observations are 2, 4, 4, 6, 10, 10, 12, 12, and 12; the arithmetic mean is
72/9 = 8. The observations are plotted on the bar with various heights based on their
frequency (that is, 2 is one unit high, 4 is two units high, and so on). When the bar is
placed on a fulcrum, it balances only when the fulcrum is located at the point on the
scale that corresponds to the arithmetic mean.

Exhibit 36  Center of Gravity Analogy for the Arithmetic Mean

[Diagram: the nine observations are stacked by frequency along a scale from 1 to 12 on a bar that balances on a fulcrum placed at 8, the arithmetic mean.]

As analysts, we often use the mean return as a measure of the typical outcome for
an asset. As in Example 8, however, some outcomes are above the mean and some are
below it. We can calculate the distance between the mean and each outcome, which is
the deviation. Mathematically, it is always true that the sum of the deviations around
the mean equals 0. We can see this by using the definition of the arithmetic mean
shown in Equation 2, multiplying both sides of the equation by n: $n\bar{X} = \sum_{i=1}^{n} X_i$. The
sum of the deviations from the mean is calculated as follows:

$$\sum_{i=1}^{n}\left(X_i - \bar{X}\right) = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\bar{X} = \sum_{i=1}^{n} X_i - n\bar{X} = 0$$
Deviations from the arithmetic mean are important information because they
indicate risk. The concept of deviations around the mean forms the foundation for the
more complex concepts of variance, skewness, and kurtosis, which we will discuss later.
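This zero-sum property is easy to verify numerically; a minimal Python sketch using the nine observations from Exhibit 36:

```python
# Deviations around the arithmetic mean always sum to zero.
data = [2, 4, 4, 6, 10, 10, 12, 12, 12]   # the observations from Exhibit 36
mean = sum(data) / len(data)              # 72 / 9 = 8.0
deviation_sum = sum(x - mean for x in data)
print(deviation_sum)  # 0.0
```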
A property and potential drawback of the arithmetic mean is its sensitivity to
extreme values, or outliers. Because all observations are used to compute the mean and
are given equal weight (i.e., importance), the arithmetic mean can be pulled sharply
upward or downward by extremely large or small observations, respectively. For
example, suppose we compute the arithmetic mean of the following seven numbers:
1, 2, 3, 4, 5, 6, and 1,000. The mean is 1,021/7 = 145.86, or approximately 146. Because
the magnitude of the mean, 146, is so much larger than most of the observations (the
first six), we might question how well it represents the location of the data. Perhaps
the most common approach in such cases is to report the median, or middle value,
in place of or in addition to the mean.

7.1.3  Outliers
In practice, although an extreme value or outlier in a financial dataset may just repre-
sent a rare value in the population, it may also reflect an error in recording the value
of an observation or an observation generated from a different population from that
producing the other observations in the sample. In the latter two cases, in particu-
lar, the arithmetic mean could be misleading. So, what do we do? The first step is to
examine the data, either by inspecting the sample observations if the sample is not
too large or by using visualization approaches. Once we are comfortable that we have
identified and eliminated errors (that is, we have “cleaned” the data), we can then
address what to do with extreme values in the sample. When dealing with a sample
that has extreme values, there may be a possibility of transforming the variable (e.g.,
a log transformation) or of selecting another variable that achieves the same purpose.
However, if alternative model specifications or variable transformations are not pos-
sible, then here are three options for dealing with extreme values:
Option 1 Do nothing; use the data without any adjustment.
Option 2 Delete all the outliers.
Option 3 Replace the outliers with another value.

The first option is appropriate if the values are legitimate, correct observations, and
it is important to reflect the whole of the sample distribution. Outliers may contain
meaningful information, so excluding or altering these values may reduce valuable
information. Further, because identifying a data point as extreme leaves it up to the
judgment of the analyst, leaving in all observations eliminates that need to judge a
value as extreme.
The second option excludes the extreme observations. One measure of central
tendency in this case is the trimmed mean, which is computed by excluding a stated
small percentage of the lowest and highest values and then computing an arithmetic
mean of the remaining values. For example, a 5% trimmed mean discards the lowest
2.5% and the highest 2.5% of values and computes the mean of the remaining 95%
of values. A trimmed mean is used in sports competitions when judges’ lowest and
highest scores are discarded in computing a contestant’s score.
The third option involves substituting values for the extreme values. A measure
of central tendency in this case is the winsorized mean. It is calculated by assigning
a stated percentage of the lowest values equal to one specified low value and a stated
percentage of the highest values equal to one specified high value, and then it com-
putes a mean from the restated data. For example, a 95% winsorized mean sets the
bottom 2.5% of values equal to the value at or below which 2.5% of all the values lie
(as will be seen shortly, this is called the “2.5th percentile” value) and the top 2.5%
of values equal to the value at or below which 97.5% of all the values lie (the “97.5th
percentile” value).
In Exhibit 37, we show the differences among these options for handling outliers
using daily returns for the fictitious Euro-­Asia-­Africa (EAA) Equity Index in Exhibit 11.

Exhibit 37  Handling Outliers: Daily Returns to an Index


Consider the fictitious EAA Equity Index. Using daily returns on the EAA Equity
Index for the period of five years, consisting of 1,258 trading days, we can see
the effect of trimming and winsorizing the data:
                          Arithmetic Mean (%)   Trimmed Mean [5%] (%)   Winsorized Mean [95%] (%)
Mean                            0.035                  0.048                    0.038
Number of Observations          1,258                  1,194                    1,258

The trimmed mean eliminates the lowest 2.5% of returns, which in this
sample is any daily return less than −1.934%, and it eliminates the highest 2.5%,
which in this sample is any daily return greater than 1.671%. The result of this
trimming is that the mean is calculated using 1,194 observations instead of the
original sample’s 1,258 observations.
The winsorized mean substitutes −1.934% for any return below −1.934% and
substitutes 1.671% for any return above 1.671%. The result in this case is that the
trimmed and winsorized means are above the arithmetic mean.

7.2  The Median


A second important measure of central tendency is the median.
Definition of Median. The median is the value of the middle item of a set
of items that has been sorted into ascending or descending order. In an odd-­
numbered sample of n items, the median is the value of the item that occupies

the (n + 1)/2 position. In an even-­numbered sample, we define the median as


the mean of the values of items occupying the n/2 and (n + 2)/2 positions (the
two middle items).
Suppose we have a return on assets (in %) for each of three companies: 0.0, 2.0,
and 2.1. With an odd number of observations (n = 3), the median occupies the (n +
1)/2 = 4/2 = 2nd position. The median is 2.0%. The value of 2.0% is the “middlemost”
observation: One lies above it, and one lies below it. Whether we use the calculation
for an even- or odd-­numbered sample, an equal number of observations lie above
and below the median. A distribution has only one median.
A potential advantage of the median is that, unlike the mean, extreme values do
not affect it. For example, if a sample consists of the observations of 1, 2, 3, 4, 5, 6 and
1,000, the median is 4. The median is not influenced by the extremely large outcome
of 1,000. In other words, the median is affected less by outliers than the mean and
therefore is useful in describing data that follow a distribution that is not symmetric,
such as revenues.
The median, however, does not use all the information about the size of the obser-
vations; it focuses only on the relative position of the ranked observations. Calculating
the median may also be more complex. To do so, we need to order the observations
from smallest to largest, determine whether the sample size is even or odd, and then
on that basis, apply one of two calculations. Mathematicians express this disadvantage
by saying that the median is less mathematically tractable than the mean.
We use the data from Exhibit 35 to demonstrate finding the median, reproduced
in Exhibit 38 in ascending order of the return for Year 3, with the ranked position
from 1 (lowest) to 11 (highest) indicated. Because this sample has 11 observations,
the median is the value in the sorted array that occupies the (11 + 1)/2 = 6th position.
Country E’s index occupies the sixth position and is the median. The arithmetic mean
for Year 3 for this sample of indexes is 2.0%, whereas the median is 3.0%.

Exhibit 38  Returns on Selected Country Stock Indexes for Year 3 in Ascending Order

Index         Year 3 Return (%)    Position
Country B          −1.5                1
Country G          −1.2                2
Country F          −1.0                3
Country J          −0.9                4
Country K           1.2                5
Country E           3.0                6  ← median
Country I           3.2                7
Country H           3.4                8
Country C           3.5                9
Country A           6.1               10
Country D           6.2               11

[Bar chart: the 11 country indexes’ Year 3 returns (%) shown as horizontal bars, with Country E marked as the median.]
If a sample has an even number of observations, the median is the mean of the two
values in the middle. For example, if our sample in Exhibit 38 had 12 indexes instead
of 11, the median would be the mean of the values in the sorted array that occupy
the sixth and the seventh positions.
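The odd/even rule can be sketched in Python (the standard library’s `statistics.median` applies the same logic), using the Year 3 returns from Exhibit 38 as the sample:

```python
def median(data):
    """Middle item of the sorted data, or the mean of the two middle items."""
    s = sorted(data)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                    # the item in position (n + 1)/2
    return (s[n // 2 - 1] + s[n // 2]) / 2  # mean of positions n/2 and (n + 2)/2

year3 = [6.1, -1.5, 3.5, 6.2, 3.0, -1.0, -1.2, 3.4, 3.2, -0.9, 1.2]
print(median(year3))  # 3.0 -- Country E's return, the 6th of 11 sorted values
```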

7.3  The Mode


The third important measure of central tendency is the mode.
Definition of Mode. The mode is the most frequently occurring value in a
distribution.
A distribution can have more than one mode, or even no mode. When a distribu-
tion has a single value that is most frequently occurring, the distribution is said to be
unimodal. If a distribution has two most frequently occurring values, then it has two
modes and is called bimodal. If the distribution has three most frequently occurring
values, then it is trimodal. When all the values in a dataset are different, the distri-
bution has no mode because no value occurs more frequently than any other value.
Stock return data and other data from continuous distributions may not have a
modal outcome. When such data are grouped into bins, however, we often find an
interval (possibly more than one) with the highest frequency: the modal interval (or
intervals). Consider the frequency distribution of the daily returns for the EAA Equity
Index over five years that we looked at in Exhibit 11. A histogram for the frequency
distribution of these daily returns is shown in Exhibit 39. The modal interval always
has the highest bar in the histogram; in this case, the modal interval is 0.0 to 0.9%,
and this interval has 493 observations out of a total of 1,258 observations.
Notice that this histogram in Exhibit 39 looks slightly different from the one in
Exhibit 11, since this one has 11 bins and follows the seven-­step procedure exactly.
Thus, the bin width is 0.828 [= (5.00 − (−4.11))/11], and the first bin begins at the
minimum value of −4.11%. It was noted previously that for ease of interpretation, in
practice bin width is often rounded up to the nearest whole number; the first bin can
start at the nearest whole number below the minimum value. These refinements and
the use of 10 bins were incorporated into the histogram in Exhibit 11, which has a
modal interval of 0.0% to 1.0%.

Exhibit 39  Histogram of Daily Returns on the EAA Equity Index

Daily Return Range (%)    Number of Observations
−4.1 to −3.3                       5
−3.3 to −2.5                       7
−2.5 to −1.6                      37
−1.6 to −0.8                     122
−0.8 to 0                        470
0 to 0.9                         493
0.9 to 1.7                        94
1.7 to 2.5                        27
2.5 to 3.3                         1
3.3 to 4.2                         1
4.2 to 5.0                         1

[Histogram: vertical bars of these frequencies; the modal interval, 0.0 to 0.9%, is the tallest bar.]

The mode is the only measure of central tendency that can be used with nominal
data. For example, when we categorize investment funds into different styles and
assign a number to each style, the mode of these categorized data is the most frequent
investment fund style.
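In Python, `statistics.multimode` (available since Python 3.8) returns every most-frequent value, so it handles unimodal and bimodal cases alike; a brief sketch, where the fund-style labels are illustrative:

```python
from statistics import multimode

# Unimodal numerical data (the Exhibit 36 observations): 12 occurs most often.
print(multimode([2, 4, 4, 6, 10, 10, 12, 12, 12]))  # [12]

# Nominal data: two fund styles tie, so the data are bimodal.
print(multimode(["value", "growth", "growth", "value", "income"]))
# ['value', 'growth']
```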

7.4  Other Concepts of Mean


Earlier we explained the arithmetic mean, which is a fundamental concept for describ-
ing the central tendency of data. An advantage of the arithmetic mean over two other
measures of central tendency, the median and mode, is that the mean uses all the
information about the size of the observations. The mean is also relatively easy to
work with mathematically.
However, other concepts of mean are very important in investments. In the fol-
lowing sections, we discuss such concepts.

7.4.1  The Weighted Mean


The concept of weighted mean arises repeatedly in portfolio analysis. In the arithmetic
mean, all sample observations are equally weighted by the factor 1/n. In working with
portfolios, we often need the more general concept of weighted mean to allow for
different (i.e., unequal) weights on different observations.
To illustrate the weighted mean concept, an investment manager with $100 million
to invest might allocate $70 million to equities and $30 million to bonds. The portfolio,
therefore, has a weight of 0.70 on stocks and 0.30 on bonds. How do we calculate the
return on this portfolio? The portfolio’s return clearly involves an averaging of the
returns on the stock and bond investments. The mean that we compute, however,
must reflect the fact that stocks have a 70% weight in the portfolio and bonds have a
30% weight. The way to reflect this weighting is to multiply the return on the stock
investment by 0.70 and the return on the bond investment by 0.30, then sum the two
results. This sum is an example of a weighted mean. It would be incorrect to take an
arithmetic mean of the return on the stock and bond investments, equally weighting
the returns on the two asset classes.

Weighted Mean Formula. The weighted mean $\bar{X}_w$ (read “X-bar sub-w”), for a
set of observations X1, X2, …, Xn with corresponding weights of w1, w2, …, wn,
is computed as:

$$\bar{X}_w = \sum_{i=1}^{n} w_i X_i \tag{3}$$

where the sum of the weights equals 1; that is, $\sum_{i} w_i = 1$.
In the context of portfolios, a positive weight represents an asset held long and a
negative weight represents an asset held short.
The formula for the weighted mean can be compared to the formula for the arithmetic
mean. For a set of observations X1, X2, …, Xn, let the weights w1, w2, …, wn all
equal 1/n. Under this assumption, the formula for the weighted mean is $\frac{1}{n}\sum_{i=1}^{n} X_i$. This
is the formula for the arithmetic mean. Therefore, the arithmetic mean is a special
case of the weighted mean in which all the weights are equal.

EXAMPLE 9  

Calculating a Weighted Mean


Using the country index data shown in Exhibit  35, consider a portfolio that
consists of three funds that track three countries’ indexes: County C, Country
G, and Country K. The portfolio weights and index returns are as follows:

                            Allocation         Annual Return (%)
Index Tracked by Fund          (%)         Year 1    Year 2    Year 3
Country C                      25            5.3       1.2       3.5
Country G                      45           12.7       6.7      −1.2
Country K                      30           11.5       3.4       1.2

Using the information provided, calculate the returns on the portfolio for
each year.
Solution
Converting the percentage asset allocation to decimal form, we find the mean
return as the weighted average of the funds’ returns. We have:
Mean portfolio return for Year 1 = 0.25(5.3) + 0.45(12.7) + 0.30(11.5)
= 10.49%
Mean portfolio return for Year 2 = 0.25 (1.2) + 0.45 (6.7) + 0.30 (3.4)
= 4.34%
Mean portfolio return for Year 3 = 0.25 (3.5) + 0.45 (−1.2) + 0.30 (1.2)
= 0.70%

This example illustrates the general principle that a portfolio return is a weighted
sum. Specifically, a portfolio’s return is the weighted average of the returns on the
assets in the portfolio; the weight applied to each asset’s return is the fraction of the
portfolio invested in that asset.
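A sketch of Equation 3 in Python, using the Example 9 allocations and the Year 2 returns:

```python
# Portfolio return as a weighted mean: the weights must sum to 1.
weights = [0.25, 0.45, 0.30]     # allocations to the C, G, and K index funds
returns_y2 = [1.2, 6.7, 3.4]     # Year 2 annual returns (%)

assert abs(sum(weights) - 1.0) < 1e-12
portfolio_return = sum(w * r for w, r in zip(weights, returns_y2))
print(portfolio_return)  # approximately 4.335, i.e., the 4.34% of Example 9
```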

Market indexes are computed as weighted averages. For market-­capitalization


weighted indexes, such as the CAC-­40 in France, the TOPIX in Japan, or the S&P
500 in the United States, each included stock receives a weight corresponding to its
market value divided by the total market value of all stocks in the index.
Our illustrations of weighted mean use past data, but they might just as well use
forward-­looking data. When we take a weighted average of forward-­looking data, the
weighted mean is the expected value. Suppose we make one forecast for the year-­
end level of the S&P 500 assuming economic expansion and another forecast for the
year-­end level of the S&P 500 assuming economic contraction. If we multiply the first
forecast by the probability of expansion and the second forecast by the probability of
contraction and then add these weighted forecasts, we are calculating the expected
value of the S&P 500 at year-­end. If we take a weighted average of possible future
returns on the S&P 500, where the weights are the probabilities, we are computing the
S&P 500’s expected return. The probabilities must sum to 1, satisfying the condition
on the weights in the expression for weighted mean, Equation 3.

7.4.2  The Geometric Mean


The geometric mean is most frequently used to average rates of change over time or
to compute the growth rate of a variable. In investments, we frequently use the geo-
metric mean to either average a time series of rates of return on an asset or a portfolio
or to compute the growth rate of a financial variable, such as earnings or sales. The
geometric mean is defined by the following formula.
Geometric Mean Formula. The geometric mean, $\bar{X}_G$, of a set of observations
X1, X2, …, Xn is:

$$\bar{X}_G = \sqrt[n]{X_1 X_2 X_3 \cdots X_n} \quad \text{with } X_i \geq 0 \text{ for } i = 1, 2, \ldots, n. \tag{4}$$


Equation 4 has a solution, and the geometric mean exists only if the product under
the square root sign is non-­negative. Therefore, we must impose the restriction that all
the observations Xi are greater than or equal to zero. We can solve for the geometric
mean directly with any calculator that has an exponentiation key (on most calculators,
yx). We can also solve for the geometric mean using natural logarithms. Equation 4
can also be stated as
$$\ln \bar{X}_G = \frac{1}{n}\ln\left(X_1 X_2 X_3 \cdots X_n\right)$$

or, because the logarithm of a product of terms is equal to the sum of the logarithms
of each of the terms, as

$$\ln \bar{X}_G = \frac{\sum_{i=1}^{n} \ln X_i}{n}$$

When we have computed $\ln \bar{X}_G$, then $\bar{X}_G = e^{\ln \bar{X}_G}$ (on most calculators, the key for
this step is $e^x$).
Risky assets can have negative returns up to −100% (if their price falls to zero), so
we must take some care in defining the relevant variables to average in computing a
geometric mean. We cannot just use the product of the returns for the sample and
then take the nth root because the returns for any period could be negative. We must
recast the returns to make them positive. We do this by adding 1.0 to the returns
expressed as decimals, where Rt represents the return in period t. The term (1 + Rt)
represents the year-­ending value relative to an initial unit of investment at the begin-
ning of the year. As long as we use (1 + Rt), the observations will never be negative

because the biggest negative return is −100%. The result is the geometric mean of
1 + Rt; by then subtracting 1.0 from this result, we obtain the geometric mean of the
individual returns Rt.
An equation that summarizes the calculation of the geometric mean return, RG, is
a slightly modified version of Equation 4 in which Xi represents “1 + return in decimal
form.” Because geometric mean returns use time series, we use a subscript t indexing
time as well. We calculate one plus the geometric mean return as:
$$1 + R_G = \sqrt[T]{(1 + R_1)(1 + R_2)\cdots(1 + R_T)}$$

We can represent this more compactly as:

$$1 + R_G = \left[\prod_{t=1}^{T}(1 + R_t)\right]^{1/T}$$
where the capital Greek letter ‘pi,’ Π, denotes the arithmetical operation of mul-
tiplication of the T terms. Once we subtract one, this becomes the formula for the
geometric mean return.
For example, the returns on Country B’s index are given in Exhibit 35 as 7.8, 6.3,
and −1.5%. Putting the returns into decimal form and adding 1.0 produces 1.078,
1.063, and 0.985. Using Equation 4, we have $\sqrt[3]{1.078 \times 1.063 \times 0.985} = \sqrt[3]{1.128725} =$
1.041189. This number is 1 plus the geometric mean rate of return. Subtracting 1.0
from this result, we have 1.041189 − 1.0 = 0.041189, or approximately 4.12%. This is
lower than the arithmetic mean for Country B’s index of 4.2%.
Geometric Mean Return Formula. Given a time series of holding period
returns Rt, t = 1, 2, …, T, the geometric mean return over the time period
spanned by the returns R1 through RT is:

$$R_G = \left[\prod_{t=1}^{T}(1 + R_t)\right]^{1/T} - 1 \tag{5}$$
We can use Equation 5 to solve for the geometric mean return for any return data
series. Geometric mean returns are also referred to as compound returns. If the returns
being averaged in Equation 5 have a monthly frequency, for example, we may call the
geometric mean monthly return the compound monthly return. The next example
illustrates the computation of the geometric mean while contrasting the geometric
and arithmetic means.
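Equation 5 can be sketched in a few lines of code. This is a minimal, illustrative sketch: the function name and the convention of passing returns in decimal form are assumptions for the example, not part of the reading.

```python
# Minimal sketch of the geometric mean return (Equation 5); the function
# name and decimal-return convention are illustrative assumptions.
def geometric_mean_return(returns):
    """Compound (geometric mean) return of a series of periodic returns.

    Each return must exceed -1 (that is, -100%), so that every
    1 + R_t term is positive.
    """
    product = 1.0
    for r in returns:
        product *= 1.0 + r
    return product ** (1.0 / len(returns)) - 1.0

# Country B's index returns from Exhibit 35: 7.8%, 6.3%, and -1.5%
rg = geometric_mean_return([0.078, 0.063, -0.015])
print(round(rg, 6))   # about 0.041189, i.e., roughly 4.12% per year
```

Note that a return of exactly −100% makes the product zero, so the geometric mean is undefined below that bound.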

EXAMPLE 10  

Geometric and Arithmetic Mean Returns


Using the data in Exhibit 35, calculate the arithmetic mean and the geometric
mean returns over the three years for each of the three stock indexes: those of
Country D, Country E, and Country F.
Solution
The arithmetic mean returns calculations are:

                Annual Return (%)                         Arithmetic
                Year 1     Year 2     Year 3     Sum      Mean (%)

Country D       −2.4       −3.1        6.2       0.7       0.233
Country E       −4.0       −3.0        3.0      −4.0      −1.333
Country F        5.4        5.2       −1.0       9.6       3.200

Geometric mean returns calculations are:

                1 + Return in Decimal       Product         3rd Root             Geometric
                Form (1 + Rt)               ∏(1 + Rt)       [∏(1 + Rt)]^(1/3)    Mean
                Year 1   Year 2   Year 3                                         Return (%)

Country D       0.976    0.969    1.062     1.00438         1.00146              0.146
Country E       0.960    0.970    1.030     0.95914         0.98619             −1.381
Country F       1.054    1.052    0.990     1.09772         1.03157              3.157

In Example 10, the geometric mean return is less than the arithmetic mean return
for each country’s index returns. In fact, the geometric mean is always less than or
equal to the arithmetic mean. The only time that the two means will be equal is when
there is no variability in the observations—that is, when all the observations in the
series are the same.
In general, the difference between the arithmetic and geometric means increases
with the variability within the sample; the more dispersed the observations, the greater
the difference between the arithmetic and geometric means. Casual inspection of the
returns in Exhibit 35 and the associated graph of means suggests a greater variability for
Country A’s index relative to the other indexes, and this is confirmed with the greater
deviation of the geometric mean return (−5.38%) from the arithmetic mean return
(−4.97%), as we show in Exhibit 40. How should the analyst interpret these results?

Exhibit 40  Arithmetic and Geometric Mean Returns for Country Stock Indexes: Years 1 to 3

[Bar chart comparing the arithmetic mean and geometric mean returns (%) for the stock indexes of Countries A through K.]

The geometric mean return represents the growth rate or compound rate of return
on an investment. One unit of currency invested in a fund tracking the Country B
index at the beginning of Year 1 would have grown to (1.078)(1.063)(0.985) = 1.128725
units of currency, which is equal to 1 plus the geometric mean return compounded
over three periods: (1 + 0.041189)^3 = 1.128725, confirming that the geometric mean
is the compound rate of return. With its focus on the profitability of an investment
over a multi-­period horizon, the geometric mean is of key interest to investors. The
arithmetic mean return, focusing on average single-­period performance, is also of
interest. Both arithmetic and geometric means have a role to play in investment
management, and both are often reported for return series.
For reporting historical returns, the geometric mean has considerable appeal
because it is the rate of growth or return we would have to earn each year to match
the actual, cumulative investment performance. Suppose we purchased a stock for
€100 and two years later it was worth €100, with an intervening year at €200. The
geometric mean of 0% is clearly the compound rate of growth during the two years,
which we can confirm by compounding the returns: [(1 + 1.00)(1 − 0.50)]^(1/2) − 1 =
0%. Specifically, the ending amount is the beginning amount times (1 + RG)^2. The
geometric mean is an excellent measure of past performance.
The arithmetic mean, which is [100% + (−50%)]/2 = 25% in the above example,
can distort our assessment of historical performance. As we noted previously, the
arithmetic mean is always greater than or equal to the geometric mean. If we want to
estimate the average return over a one-­period horizon, we should use the arithmetic
mean because the arithmetic mean is the average of one-­period returns. If we want
to estimate the average returns over more than one period, however, we should use
the geometric mean of returns because the geometric mean captures how the total
returns are linked over time. In a forward-­looking context, a financial analyst calcu-
lating expected risk premiums may find that the weighted mean is appropriate, with
the probabilities of the possible outcomes used as the weights.
Dispersion in cash flows or returns causes the arithmetic mean to be larger than
the geometric mean. The more dispersion in the sample of returns, the more diver-
gence exists between the arithmetic and geometric means. If there is zero variance in
a sample of observations, the geometric and arithmetic return are equal.

7.4.3  The Harmonic Mean


The arithmetic mean, the weighted mean, and the geometric mean are the most
frequently used concepts of mean in investments. A fourth concept, the harmonic
mean, X̄H, is another measure of central tendency. The harmonic mean is appropriate
in cases in which the variable is a rate or a ratio. The terminology “harmonic” arises
from its use of a type of series involving reciprocals known as a harmonic series.
Harmonic Mean Formula. The harmonic mean of a set of observations X1, X2,
…, Xn is:

X̄H = n / Σ(i=1 to n) (1/Xi), with Xi > 0 for i = 1, 2, …, n.   (6)
The harmonic mean is the value obtained by summing the reciprocals of the observa-
tions—terms of the form 1/Xi—then averaging that sum by dividing it by the number
of observations n, and, finally, taking the reciprocal of the average.
The harmonic mean may be viewed as a special type of weighted mean in which an
observation’s weight is inversely proportional to its magnitude. For example, if there
is a sample of observations of 1, 2, 3, 4, 5, 6, and 1,000, the harmonic mean is 2.8560.
Compared to the arithmetic mean of 145.8571, we see the influence of the outlier (the
1,000) to be much less than in the case of the arithmetic mean. So, the harmonic mean
is quite useful as a measure of central tendency in the presence of outliers.
The harmonic mean is used most often when the data consist of rates and ratios,
such as P/Es. Suppose three peer companies have P/Es of 45, 15, and 15. The arithmetic
mean is 25, but the harmonic mean, which gives less weight to the P/E of 45, is 19.3.
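Equation 6 translates directly into code. The sketch below applies it to the three-stock P/E example just given; the function name is an illustrative assumption.

```python
# Minimal sketch of the harmonic mean (Equation 6); the function name
# is illustrative, not from the reading.
def harmonic_mean(values):
    """Harmonic mean of strictly positive observations."""
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean requires all values > 0")
    return len(values) / sum(1.0 / v for v in values)

pes = [45.0, 15.0, 15.0]
print(round(harmonic_mean(pes), 1))   # 19.3, versus an arithmetic mean of 25
```

Running the same function on the outlier example in the text, `harmonic_mean([1, 2, 3, 4, 5, 6, 1000])`, gives roughly 2.856, showing how little weight the 1,000 receives.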

EXAMPLE 11  

Harmonic Mean Returns and the Returns on Selected


Country Stock Indexes
Using data in Exhibit 35, calculate the harmonic mean return over the three-year
period for three stock indexes: Country D, Country E, and Country F.

Calculating the Harmonic Mean for the Indexes

                Inverse of 1 + Return, 1/Xi,
                where Xi is 1 plus the return                              Harmonic
                in decimal form                 Σ(1/Xi)     n/Σ(1/Xi)      Mean (%)
Index           Year 1     Year 2     Year 3

Country D       1.02459    1.03199    0.94162   2.99820     1.00060        0.05999
Country E       1.04167    1.03093    0.97087   3.04347     0.98572       −1.42825
Country F       0.94877    0.95057    1.01010   2.90944     1.03113        3.11270

Comparing the three types of means, we see the arithmetic mean is higher
than the geometric mean return, and the geometric mean return is higher than
the harmonic mean return. We can see the differences in these means in the
following graph:
Harmonic, Geometric, and Arithmetic Means of Selected Country Indexes

[Bar chart of mean returns (%) by index: Country D — harmonic 0.060, geometric 0.146, arithmetic 0.233; Country E — harmonic −1.428, geometric −1.381, arithmetic −1.333; Country F — harmonic 3.113, geometric 3.157, arithmetic 3.200.]

The harmonic mean is a relatively specialized concept of the mean that is appro-
priate for averaging ratios (“amount per unit”) when the ratios are repeatedly applied
to a fixed quantity to yield a variable number of units. The concept is best explained
through an illustration. A well-­known application arises in the investment strategy
known as cost averaging, which involves the periodic investment of a fixed amount
of money. In this application, the ratios we are averaging are prices per share at
different purchase dates, and we are applying those prices to a constant amount of
money to yield a variable number of shares. An illustration of applying the harmonic
mean to cost averaging is provided in Example 12.

EXAMPLE 12  

Cost Averaging and the Harmonic Mean


Suppose an investor purchases €1,000 of a security each month for n = 2 months.
The share prices are €10 and €15 at the two purchase dates. What is the average
price paid for the security?
Purchase in the first month = €1,000/€10 = 100 shares
Purchase in the second month = €1,000/€15 = 66.67 shares
The purchases are 166.67 shares in total, and the price paid per share is
€2,000/166.67 = €12.
The average price paid is in fact the harmonic mean of the asset’s prices at
the purchase dates. Using Equation 6, the harmonic mean price is 2/[(1/10) +
(1/15)] = €12. The value €12 is less than the arithmetic mean purchase price
(€10 + €15)/2 = €12.5.
However, we could find the correct value of €12 using the weighted mean
formula, where the weights on the purchase prices equal the shares purchased
at a given price as a proportion of the total shares purchased. In our example,

the calculation would be (100/166.67)€10.00 + (66.67/166.67)€15.00 = €12. If


we had invested varying amounts of money at each date, we could not use the
harmonic mean formula. We could, however, still use the weighted mean formula.
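The cost-averaging arithmetic in Example 12, including the equivalent weighted-mean calculation, can be checked with a short sketch. The euro amounts follow the example, but the function name and variable names are illustrative.

```python
# Sketch of Example 12: investing a fixed amount at each price makes the
# average cost per share equal the harmonic mean of the purchase prices.
# Names are illustrative; the 1,000 and prices follow the example.
def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

prices = [10.0, 15.0]   # price per share at each purchase date (EUR)
amount = 1000.0         # fixed cash invested at each date (EUR)

shares = [amount / p for p in prices]        # 100 and 66.67 shares
total_shares = sum(shares)
avg_cost = amount * len(prices) / total_shares

# Same answer via the weighted mean: weight each price by the fraction
# of total shares bought at that price.
weighted = sum((s / total_shares) * p for s, p in zip(shares, prices))

print(avg_cost, weighted)   # both approximately 12.0, the harmonic mean of 10 and 15
```

If the invested amounts varied by date, only the weighted-mean version would remain correct, as the text notes.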

h. Evaluate alternative definitions of mean to address an investment problem


Since they use the same data but involve different progressions in their respective
calculations (that is, arithmetic, geometric, and harmonic progressions), the arithmetic,
geometric, and harmonic means are mathematically related to one another. While we
will not go into the proof of this relationship, the basic result follows:

Arithmetic mean × Harmonic mean = (Geometric mean)^2,

a relationship that holds exactly for two observations and approximately for larger samples.
However, the key question is: Which mean to use in what circumstances?

EXAMPLE 13  

Calculating the Arithmetic, Geometric, and Harmonic


Means for P/Es
Each year in December, a securities analyst selects her 10 favorite stocks for the
next year. Exhibit 41 gives the P/E, the ratio of share price to projected earnings
per share (EPS), for her top-­10 stock picks for the next year.

Exhibit 41  Analyst’s 10 Favorite Stocks for Next Year


Stock P/E

Stock 1 22.29
Stock 2 15.54
Stock 3 9.38
Stock 4 15.12
Stock 5 10.72
Stock 6 14.57
Stock 7 7.20
Stock 8 7.97
Stock 9 10.34
Stock 10 8.35

For these 10 stocks,


1 Calculate the arithmetic mean P/E.
2 Calculate the geometric mean P/E.
3 Calculate the harmonic mean P/E.

Solution

Stock          P/E        Natural Log of the P/E, ln(Xi)        Inverse of the P/E, 1/Xi

Stock 1 22.29 3.1041 0.0449


Stock 2 15.54 2.7434 0.0644
Stock 3 9.38 2.2386 0.1066
Stock 4 15.12 2.7160 0.0661
Stock 5 10.72 2.3721 0.0933
Stock 6 14.57 2.6790 0.0686
Stock 7 7.20 1.9741 0.1389
Stock 8 7.97 2.0757 0.1255
Stock 9 10.34 2.3360 0.0967
Stock 10 8.35 2.1223 0.1198
Sum 121.48 24.3613 0.9247

1 The arithmetic mean is 121.48/10 = 12.1480.


2 The geometric mean is e^(24.3613/10) = e^2.43613 = 11.4287.
3 The harmonic mean is 10/0.9247 = 10.8142.
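The three means for the P/E data can be verified numerically. The sketch below uses the Exhibit 41 values; the variable names are illustrative, not from the reading.

```python
# Numerical check of Example 13 using the P/Es from Exhibit 41;
# variable names are illustrative.
import math

pes = [22.29, 15.54, 9.38, 15.12, 10.72, 14.57, 7.20, 7.97, 10.34, 8.35]
n = len(pes)

arithmetic = sum(pes) / n
geometric = math.exp(sum(math.log(x) for x in pes) / n)
harmonic = n / sum(1.0 / x for x in pes)

# For positive data, harmonic <= geometric <= arithmetic, with equality
# only when all observations are identical.
print(round(arithmetic, 4), round(geometric, 4), round(harmonic, 4))
```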

A mathematical fact concerning the harmonic, geometric, and arithmetic means is


that unless all the observations in a dataset have the same value, the harmonic mean
is less than the geometric mean, which, in turn, is less than the arithmetic mean. The
choice of which mean to use depends on many factors, as we describe in Exhibit 42:
■■ Are there outliers that we want to include?
■■ Is the distribution symmetric?
■■ Is there compounding?
■■ Are there extreme outliers?

Exhibit 42  Deciding Which Central Tendency Measure to Use

Collect the sample, then work through the following decisions:
■■ Include all values, including outliers? If yes, use the arithmetic mean.
■■ Is compounding involved? If yes, use the geometric mean.
■■ Are there extreme outliers? If yes, use the harmonic mean, a trimmed mean, or a winsorized mean.

8 QUANTILES

i. Calculate quantiles and interpret related visualizations


Having discussed measures of central tendency, we now examine an approach to
describing the location of data that involves identifying values at or below which
specified proportions of the data lie. For example, establishing that 25, 50, and 75%
of the annual returns on a portfolio are at or below the values −0.05, 0.16, and 0.25,
respectively, provides concise information about the distribution of portfolio returns.
Statisticians use the word quantile (or fractile) as the most general term for a value at
or below which a stated fraction of the data lies. In the following section, we describe
the most commonly used quantiles—quartiles, quintiles, deciles, and percentiles—and
their application in investments.

8.1  Quartiles, Quintiles, Deciles, and Percentiles


We know that the median divides a distribution of data in half. We can define other
dividing lines that split the distribution into smaller sizes. Quartiles divide the dis-
tribution into quarters, quintiles into fifths, deciles into tenths, and percentiles
into hundredths. Given a set of observations, the yth percentile is the value at or
below which y% of observations lie. Percentiles are used frequently, and the other
measures can be defined with respect to them. For example, the first quartile (Q1)
divides a distribution such that 25% of the observations lie at or below it; therefore,
the first quartile is also the 25th percentile. The second quartile (Q2) represents the
50th percentile, and the third quartile (Q3) represents the 75th percentile (i.e., 75% of
the observations lie at or below it). The interquartile range (IQR) is the difference
between the third quartile and the first quartile, or IQR = Q3 − Q1.

When dealing with actual data, we often find that we need to approximate the
value of a percentile. For example, if we are interested in the value of the 75th percen-
tile, we may find that no observation divides the sample such that exactly 75% of the
observations lie at or below that value. The following procedure, however, can help us
determine or estimate a percentile. The procedure involves first locating the position
of the percentile within the set of observations and then determining (or estimating)
the value associated with that position.
Let Py be the value at or below which y% of the distribution lies, or the yth
percentile. (For example, P18 is the point at or below which 18% of the observations
lie; this implies that 100 − 18 = 82% of the observations are greater than P18.) The
formula for the position (or location) of a percentile in an array with n entries sorted
in ascending order is:

Ly = (n + 1) × (y/100)   (7)

where y is the percentage point at which we are dividing the distribution, and Ly is the
location (L) of the percentile (Py) in the array sorted in ascending order. The value of
Ly may or may not be a whole number. In general, as the sample size increases, the
percentile location calculation becomes more accurate; in small samples it may be
quite approximate.
To summarize:
■■ When the location, Ly, is a whole number, the location corresponds to an actual
observation. For example, if we are determining the third quartile (Q3) in a
sample of size n = 11, then Ly would be L75 = (11 + 1)(75/100) = 9, and the third
quartile would be P75 = X9, where Xi is defined as the value of the observation
in the ith (i = L75, so 9th), position of the data sorted in ascending order.
■■ When Ly is not a whole number or integer, Ly lies between the two closest
integer numbers (one above and one below), and we use linear interpolation
between those two places to determine P y. Interpolation means estimating an
unknown value on the basis of two known values that surround it (i.e., lie above
and below it); the term “linear” refers to a straight-­line estimate.
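The two cases above, a whole-number location versus interpolation, can be sketched as follows. The function name and the sample data are illustrative assumptions.

```python
# Sketch of the percentile procedure: Equation 7 for the location,
# with linear interpolation between adjacent sorted observations.
# Function name and data are illustrative.
def percentile_from_sorted(sorted_data, y):
    """Return P_y from data sorted ascending, using L_y = (n + 1)y/100."""
    n = len(sorted_data)
    loc = (n + 1) * y / 100.0
    if loc <= 1:
        return sorted_data[0]
    if loc >= n:
        return sorted_data[-1]
    lower = int(loc)     # whole-number position just below L_y (1-based)
    frac = loc - lower   # fractional part used to interpolate
    # positions are 1-based, list indexes are 0-based
    return sorted_data[lower - 1] + frac * (sorted_data[lower] - sorted_data[lower - 1])

data = sorted([4.2, 1.1, 3.7, 2.5, 5.9, 2.0, 4.8, 3.1, 1.6, 5.0, 3.9])
q3 = percentile_from_sorted(data, 75)   # n = 11, so L_75 = 9: the 9th value
print(q3)   # 4.8
```

With n = 11 and y = 75, the location is a whole number and no interpolation is needed; with y = 80, L_80 = 9.6, and the result interpolates 60% of the way from the 9th to the 10th value.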
Example 14 illustrates the calculation of various quantiles for the daily return on the
EAA Equity Index.

EXAMPLE 14  

Percentiles, Quintiles, and Quartiles for the EAA Equity


Index
Using the daily returns on the fictitious EAA Equity Index over five years and
ranking them by return, from lowest to highest daily return, we show the return
bins from 1 (the lowest 5%) to 20 (the highest 5%) as follows:

Exhibit 43  EAA Equity Index Daily Returns Grouped by Size of Return

         Cumulative Percentage of      Daily Return (%) Between*
Bin      Sample Trading Days (%)       Lower Bound      Upper Bound      Number of Observations

1 5 −4.108 −1.416 63
2 10 −1.416 −0.876 63
3 15 −0.876 −0.629 63
4 20 −0.629 −0.432 63
5 25 −0.432 −0.293 63
6 30 −0.293 −0.193 63
7 35 −0.193 −0.124 62
8 40 −0.124 −0.070 63
9 45 −0.070 −0.007 63
10 50 −0.007 0.044 63
11 55 0.044 0.108 63
12 60 0.108 0.173 63
13 65 0.173 0.247 63
14 70 0.247 0.343 62
15 75 0.343 0.460 63
16 80 0.460 0.575 63
17 85 0.575 0.738 63
18 90 0.738 0.991 63
19 95 0.991 1.304 63
20 100 1.304 5.001 63

Note that because of the continuous nature of returns, it is not likely for a
return to fall on the boundary for any bin other than the minimum (Bin = 1)
and maximum (Bin = 20).
1 Identify the 10th and 90th percentiles.
2 Identify the first, second, and third quintiles.
3 Identify the first and third quartiles.
4 Identify the median.
5 Calculate the interquartile range.

Solution to 1
The 10th and 90th percentiles correspond to the bins or ranked returns that
include 10% and 90% of the daily returns, respectively. The 10th percentile
corresponds to the return of −0.876% (and includes returns of that much and
lower), and the 90th percentile corresponds to the return of 0.991% (and lower).
Solution to 2
The first quintile corresponds to the lowest 20% of the ranked data, or −0.432%
(and lower).
The second quintile corresponds to the lowest 40% of the ranked data, or
−0.070% (and lower).

The third quintile corresponds to the lowest 60% of the ranked data, or
0.173% (and lower).
Solution to 3
The first quartile corresponds to the lowest 25% of the ranked data, or −0.293%
(and lower).
The third quartile corresponds to the lowest 75% of the ranked data, or
0.460% (and lower).
Solution to 4
The median is the return for which 50% of the data lies on either side, which is
0.044%, the highest daily return in the 10th bin out of 20.
Solution to 5
The interquartile range is the difference between the third and first quartiles,
0.460% and −0.293%, or 0.753%.

One way to visualize the dispersion of data across quartiles is to use a diagram,
such as a box and whisker chart. A box and whisker plot consists of a “box” with
“whiskers” connected to the box, as shown in Exhibit 44. The “box” represents the
lower bound of the second quartile and the upper bound of the third quartile, with
the median or arithmetic average noted as a measure of central tendency of the entire
distribution. The whiskers are the lines that run from the box and are bounded by the
“fences,” which represent the lowest and highest values of the distribution.

Exhibit 44  Box and Whisker Plot

[Diagram of a box and whisker plot: the whiskers run from the lowest value to the highest value; the box spans from the lower boundary of Q2 to the upper boundary of Q3 (the interquartile range), with the median and the arithmetic average (×) marked inside the box.]

There are several variations for box and whisker displays. For example, for ease
in detecting potential outliers, the fences of the whiskers may be a function of the
interquartile range instead of the highest and lowest values like that in Exhibit 44.
In Exhibit  44, visually, the interquartile range is the height of the box and the
fences are set at extremes. But another form of box and whisker plot typically uses
1.5 times the interquartile range for the fences. Thus, the upper fence is 1.5 times the
interquartile range added to the upper bound of Q3, and the lower fence is 1.5 times
the interquartile range subtracted from the lower bound of Q2. Observations beyond
the fences (i.e., outliers) may also be displayed.
We can see the role of outliers in such a box and whisker plot using the EAA
Equity Index daily returns, as shown in Exhibit 45. Referring back to Exhibit 43
(Example 14), we know:
■■ The maximum and minimum values of the distribution are 5.001 and −4.108,
respectively, while the median (50th percentile) value is 0.044.

■■ The interquartile range is 0.753 [= 0.460 − (−0.293)], and when multiplied by


1.5 and added to the Q3 upper bound of 0.460 gives an upper fence of 1.589 [=
(1.5 × 0.753) + 0.460].
■■ The lower fence is determined in a similar manner, using the Q2 lower bound,
to be −1.422 [= −(1.5 × 0.753) + (−0.293)].
As noted, any observation above (below) the upper (lower) fence is deemed to
be an outlier.
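The fence arithmetic above can be restated in a few lines, using the EAA quartile bounds from Exhibit 43; the variable names are illustrative.

```python
# The 1.5 x IQR fence rule applied to the EAA Equity Index quartile
# bounds from Exhibit 43; variable names are illustrative.
q1 = -0.293   # 25th percentile of daily returns (%)
q3 = 0.460    # 75th percentile of daily returns (%)

iqr = q3 - q1                    # interquartile range
upper_fence = q3 + 1.5 * iqr     # observations above are flagged as outliers
lower_fence = q1 - 1.5 * iqr     # observations below are flagged as outliers

print(f"IQR={iqr:.3f}, upper={upper_fence:.4f}, lower={lower_fence:.4f}")
# approximately IQR=0.753 with fences near 1.589 and -1.422, as in the text
```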

Exhibit 45  Box and Whisker Chart for EAA Equity Index Daily Returns

[Box and whisker chart of daily returns (%): minimum of −4.108%, lower fence at −1.422%, Q2 lower bound of −0.293%, median of 0.044%, Q3 upper bound of 0.460%, upper fence at 1.589%, and maximum of 5.001%; observations beyond the fences appear as outliers.]

EXAMPLE 15 

Quantiles
Consider the results of an analysis focusing on the market capitalizations of a
sample of 100 firms:

         Cumulative                    Market Capitalization (in billions of €)
Bin      Percentage of Sample (%)     Lower Bound      Upper Bound      Number of Observations

1 5 0.28 15.45 5
2 10 15.45 21.22 5
3 15 21.22 29.37 5
4 20 29.37 32.57 5
5 25 32.57 34.72 5
6 30 34.72 37.58 5
7 35 37.58 39.90 5
8 40 39.90 41.57 5
9 45 41.57 44.86 5
10 50 44.86 46.88 5
11 55 46.88 49.40 5
12 60 49.40 51.27 5
13 65 51.27 53.58 5
14 70 53.58 56.66 5
15 75 56.66 58.34 5
16 80 58.34 63.10 5
17 85 63.10 67.06 5
18 90 67.06 73.00 5
19 95 73.00 81.62 5
20 100 81.62 96.85 5

Using this information, answer the following five questions.


1 The tenth percentile corresponds to observations in bins:
A 2.
B 1 and 2.
C 19 and 20.
2 The second quintile corresponds to observations in bins:
A 8
B 5, 6, 7, and 8.
C 6, 7, 8, 9, and 10.
3 The fourth quartile corresponds to observations in bins:
A 17.
B 17, 18, 19, and 20.
C 16, 17, 18, 19, and 20.
4 The median is closest to:
A 44.86.
B 46.88.
C 49.40.
5 The interquartile range is closest to:
A 20.76.
B 23.62.
C 25.52.

Solution to 1
B is correct because the tenth percentile corresponds to the lowest 10% of the
observations in the sample, which are in bins 1 and 2.
Solution to 2
B is correct because the second quintile corresponds to the second 20% of
observations. The first 20% consists of bins 1 through 4. The second 20% of
observations consists of bins 5 through 8.
Solution to 3
C is correct because a quartile consists of 25% of the data, and the last 25% of
the 20 bins are 16 through 20.
Solution to 4
B is correct because this is the center of the 20 bins. The market capitalization
of 46.88 is the highest value of the 10th bin and the lowest value of the 11th bin.

Solution to 5
B is correct because the interquartile range is the difference between the third
quartile and the first quartile, IQR = Q3 − Q1. The first quartile (the 25th percentile)
is 34.72, the upper bound of bin 5, and the third quartile (the 75th percentile) is
58.34, the upper bound of bin 15. Therefore, the interquartile range is 58.34 − 34.72 = 23.62.

8.2  Quantiles in Investment Practice


In this section, we briefly discuss the use of quantiles in investments. Quantiles are
used in portfolio performance evaluation as well as in investment strategy develop-
ment and research.
Investment analysts use quantiles every day to rank performance—for example,
the performance of portfolios. The performance of investment managers is often
characterized in terms of the percentile or quartile in which they fall relative to the
performance of their peer group of managers. The Morningstar investment fund star
rankings, for example, associate the number of stars with percentiles of performance
relative to similar-­style investment funds.
Another key use of quantiles is in investment research. For example, analysts often
refer to the set of companies with returns falling below the 10th percentile cutoff point
as the bottom return decile. Dividing data into quantiles based on some characteristic
allows analysts to evaluate the impact of that characteristic on a quantity of interest.
For instance, empirical finance studies commonly rank companies based on the mar-
ket value of their equity and then sort them into deciles. The first decile contains the
portfolio of those companies with the smallest market values, and the tenth decile
contains those companies with the largest market values. Ranking companies by decile
allows analysts to compare the performance of small companies with large ones.

9 MEASURES OF DISPERSION

j. Calculate and interpret measures of dispersion


Few would disagree with the importance of expected return or mean return in invest-
ments: The mean return tells us where returns, and investment results, are centered.
To more completely understand an investment, however, we also need to know how
returns are dispersed around the mean. Dispersion is the variability around the central
tendency. If mean return addresses reward, then dispersion addresses risk.
In this section, we examine the most common measures of dispersion: range,
mean absolute deviation, variance, and standard deviation. These are all measures of
absolute dispersion. Absolute dispersion is the amount of variability present without
comparison to any reference point or benchmark.
These measures are used throughout investment practice. The variance or standard
deviation of return is often used as a measure of risk pioneered by Nobel laureate
Harry Markowitz. Other measures of dispersion, mean absolute deviation and range,
are also useful in analyzing data.

9.1  The Range


We encountered range earlier when we discussed the construction of frequency dis-
tributions. It is the simplest of all the measures of dispersion.

Definition of Range. The range is the difference between the maximum and
minimum values in a dataset:

Range = Maximum value − Minimum value.   (8)


As an illustration of range, consider Exhibit  35, our example of annual returns for
countries’ stock indexes. The range of returns for Year 1 is the difference between
the returns of Country G’s index and Country A’s index, or 12.7 − (−15.6) = 28.3%.
The range of returns for Year 3 is the difference between the returns for the Country
D index and the Country B index, or 6.2 − (−1.5) = 7.7%.
An alternative definition of range specifically reports the maximum and minimum
values. This alternative definition provides more information than does the range as
defined in Equation 8. In other words, in the above-­mentioned case for Year 1, the
range is reported as “from 12.7% to −15.6%.”
One advantage of the range is ease of computation. A disadvantage is that the
range uses only two pieces of information from the distribution. It cannot tell us how
the data are distributed (that is, the shape of the distribution). Because the range is
the difference between the maximum and minimum returns, it can reflect extremely
large or small outcomes that may not be representative of the distribution.

9.2  The Mean Absolute Deviation


Measures of dispersion can be computed using all the observations in the distribution
rather than just the highest and lowest. But how should we measure dispersion? Our
previous discussion on properties of the arithmetic mean introduced the notion of
distance or deviation from the mean, (Xi − X̄), as a fundamental piece of information
used in statistics. We could compute measures of dispersion as the arithmetic average
of the deviations around the mean, but we would encounter a problem: The deviations
around the mean always sum to 0. If we computed the mean of the deviations, the
result would also equal 0. Therefore, we need to find a way to address the problem of
negative deviations canceling out positive deviations.
One solution is to examine the absolute deviations around the mean as in the mean
absolute deviation. This is also known as the average absolute deviation.
Mean Absolute Deviation Formula. The mean absolute deviation (MAD) for a
sample is:

MAD = Σ(i=1 to n) |Xi − X̄| / n   (9)

where X̄ is the sample mean, n is the number of observations in the
sample, and the | | indicate the absolute value of what is contained within these
bars.
In calculating MAD, we ignore the signs of the deviations around the mean. For
example, if Xi = −11.0 and X̄ = 4.5, the absolute value of the difference is |−11.0 − 4.5|
= |−15.5| = 15.5. The mean absolute deviation uses all of the observations in the sample
and is thus superior to the range as a measure of dispersion. One technical drawback
of MAD is that it is difficult to manipulate mathematically compared with the next
measure we will introduce, sample variance. Example  16 illustrates the use of the
range and the mean absolute deviation in evaluating risk.

EXAMPLE 16  

Mean Absolute Deviation for Selected Countries’ Stock


Index Returns
Using the country stock index returns in Exhibit 35, calculate the mean absolute
deviation of the index returns for each year. Note the sample mean returns (X̄)
are 3.5%, 2.5%, and 2.0% for Years 1, 2, and 3, respectively.
Solution

Absolute Value of Deviation from the Mean, |Xi − X̄|

               Year 1     Year 2     Year 3

Country A 19.1 7.9 4.1


Country B 4.3 3.8 3.5
Country C 1.8 1.3 1.5
Country D 5.9 5.6 4.2
Country E 7.5 5.5 1.0
Country F 1.9 2.7 3.0
Country G 9.2 4.2 3.2
Country H 0.0 1.8 1.4
Country I 2.7 5.3 1.2
Country J 4.6 1.6 2.9
Country K 8.0 0.9 0.8
Sum 65.0 40.6 26.8

MAD 5.91 3.69 2.44

For Year 3, for example, the sum of the absolute deviations from the arithmetic
mean (X̄ = 2.0) is 26.8. We divide this by 11, with the resulting MAD of 2.44.
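The range and MAD calculations (Equations 8 and 9) can be sketched as follows, on a small, purely hypothetical sample; the data and function names are not from the reading.

```python
# Sketch of the range (Equation 8) and mean absolute deviation
# (Equation 9); hypothetical data, illustrative names.
def value_range(data):
    """Range = maximum value - minimum value."""
    return max(data) - min(data)

def mean_absolute_deviation(data):
    """Average absolute deviation of the observations around the sample mean."""
    n = len(data)
    mean = sum(data) / n
    return sum(abs(x - mean) for x in data) / n

returns = [6.2, 3.0, -1.0, -1.5, 2.0]   # hypothetical annual returns (%)
print(value_range(returns))              # 7.7
print(round(mean_absolute_deviation(returns), 3))
```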

9.3  Sample Variance and Sample Standard Deviation


The mean absolute deviation addressed the issue that the sum of deviations from the
mean equals zero by taking the absolute value of the deviations. A second approach
to the treatment of deviations is to square them. The variance and standard deviation,
which are based on squared deviations, are the two most widely used measures of
dispersion. Variance is defined as the average of the squared deviations around the
mean. Standard deviation is the positive square root of the variance. The following
discussion addresses the calculation and use of variance and standard deviation.

9.3.1  Sample Variance


In investments, we often do not know the mean of a population of interest, usually
because we cannot practically identify or take measurements from each member of
the population. We then estimate the population mean using the mean from a sample
drawn from the population, and we calculate a sample variance or standard deviation.

Sample Variance Formula. The sample variance, s2, is:

$$s^2 = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n - 1} \tag{10}$$

where X̄ is the sample mean and n is the number of observations in the sample.
Given knowledge of the sample mean, we can use Equation 10 to calculate the sum
of the squared differences from the mean, taking account of all n items in the sample,
and then to find the mean squared difference by dividing the sum by n − 1. Whether
a difference from the mean is positive or negative, squaring that difference results in
a positive number. Thus, variance takes care of the problem of negative deviations
from the mean canceling out positive deviations by the operation of squaring those
deviations.
For the sample variance, by dividing by the sample size minus 1 (or n − 1) rather
than n, we improve the statistical properties of the sample variance. In statistical terms,
the sample variance defined in Equation 10 is an unbiased estimator of the population
variance (a concept covered later in the curriculum on sampling). The quantity n − 1 is
also known as the number of degrees of freedom in estimating the population variance.
To estimate the population variance with s2, we must first calculate the sample mean,
which itself is an estimated parameter. Therefore, once we have computed the sample
mean, there are only n − 1 independent pieces of information from the sample; that
is, if you know the sample mean and n − 1 of the observations, you could calculate
the missing sample observation.
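The degrees-of-freedom point can be checked directly: given the sample mean and n − 1 of the observations, the remaining observation is implied rather than free. A small sketch with made-up numbers:

```python
# Hypothetical sample of n = 5 with a known sample mean of 3.06;
# four observations are known, so the fifth is fully determined.
known = [6.1, -1.5, 3.5, 6.2]
sample_mean, n = 3.06, 5

missing = n * sample_mean - sum(known)  # total of all n values minus the known ones
print(round(missing, 1))  # 1.0
```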

9.3.2  Sample Standard Deviation


Because the variance is measured in squared units, we need a way to return to the
original units. We can solve this problem by using standard deviation, the square root
of the variance. Standard deviation is more easily interpreted than the variance because
standard deviation is expressed in the same unit of measurement as the observations.
By taking the square root, we return the values to the original unit of measurement.
Suppose we have a sample with values in euros. Interpreting the standard deviation
in euros is easier than interpreting the variance in squared euros.
Sample Standard Deviation Formula. The sample standard deviation, s, is:

$$s = \sqrt{\frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n - 1}} \tag{11}$$

where X̄ is the sample mean and n is the number of observations in the sample.
To calculate the sample standard deviation, we first compute the sample variance.
We then take the square root of the sample variance. The steps for computing the
sample variance and the standard deviation are provided in Exhibit 46.

Exhibit 46  Steps to Calculate Sample Standard Deviation and Variance

Step   Description                                                              Notation
1      Calculate the sample mean.                                               X̄
2      Calculate the deviations from the sample mean.                           Xi − X̄
3      Calculate each observation’s squared deviation from the sample mean.     (Xi − X̄)²
4      Sum the squared deviations from the mean.                                Σ(Xi − X̄)²
5      Divide the sum of squared deviations from the mean by n − 1.             Σ(Xi − X̄)²/(n − 1)
       This is the variance (s²).
6      Take the square root of the sum of the squared deviations divided        √[Σ(Xi − X̄)²/(n − 1)]
       by n − 1. This is the standard deviation (s).

We illustrate the process of calculating the sample variance and standard deviation
in Example 17 using the returns of the selected country stock indexes presented in
Exhibit 35.

EXAMPLE 17

Calculating Sample Variance and Standard Deviation for Returns on Selected Country Stock Indexes
Using the sample information on country stock indexes in Exhibit 35, calculate
the sample variance and standard deviation of the sample of index returns for
Year 3.
Solution

Index         Sample Observation    Deviation from the Sample Mean    Squared Deviation
Country A            6.1                       4.1                         16.810
Country B           −1.5                      −3.5                         12.250
Country C            3.5                       1.5                          2.250
Country D            6.2                       4.2                         17.640
Country E            3.0                       1.0                          1.000
Country F           −1.0                      −3.0                          9.000
Country G           −1.2                      −3.2                         10.240
Country H            3.4                       1.4                          1.960
Country I            3.2                       1.2                          1.440
Country J           −0.9                      −2.9                          8.410
Country K            1.2                      −0.8                          0.640
Sum                 22.0                       0.0                         81.640

Sample variance = 81.640/10 = 8.164



Sample standard deviation = √8.164 = 2.857
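Python’s statistics module uses the n − 1 divisor for its sample statistics, so it reproduces Example 17’s results directly:

```python
import statistics

# Year 3 index returns (%) for Countries A through K
returns = [6.1, -1.5, 3.5, 6.2, 3.0, -1.0, -1.2, 3.4, 3.2, -0.9, 1.2]

sample_var = statistics.variance(returns)  # sum of squared deviations / (n - 1)
sample_std = statistics.stdev(returns)     # square root of the sample variance
print(round(sample_var, 3), round(sample_std, 3))  # 8.164 2.857
```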

In addition to looking at the cross-­sectional standard deviation as we did in


Example 17, we could also calculate the standard deviation of a given country’s returns
across time (that is, the three years). Consider Country F, which has an arithmetic
mean return of 3.2%. The sample standard deviation is calculated as:

$$s = \sqrt{\frac{(0.054 - 0.032)^2 + (0.052 - 0.032)^2 + (-0.010 - 0.032)^2}{2}}$$
$$= \sqrt{\frac{0.000484 + 0.000400 + 0.001764}{2}} = \sqrt{0.001324} = 3.6387\%.$$
Because the standard deviation is a measure of dispersion about the arithmetic
mean, we usually present the arithmetic mean and standard deviation together when
summarizing data. When we are dealing with data that represent a time series of
percentage changes, presenting the geometric mean—representing the compound
rate of growth—is also very helpful.

9.3.3  Dispersion and the Relationship between the Arithmetic and the Geometric Means
We can use the sample standard deviation to help us understand the gap between the
arithmetic mean and the geometric mean. The relation between the arithmetic
mean (X̄) and geometric mean (X̄G) is approximately:

$$\bar{X}_G \approx \bar{X} - \frac{s^2}{2}$$
In other words, the larger the variance of the sample, the wider the difference
between the geometric mean and the arithmetic mean.
Using the data for Country F from Example 8, the geometric mean return is 3.1566%,
the arithmetic mean return is 3.2%, and the factor s²/2 is 0.001324/2 = 0.0662%:

3.1566% ≈ 3.2% − 0.0662%
3.1566% ≈ 3.1338%.
This relation informs us that the more disperse or volatile the returns, the larger
the gap between the geometric mean return and the arithmetic mean return.
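A quick numerical check of this approximation, using Country F’s three annual returns in decimal form (5.4%, 5.2%, and −1.0%, the values implied by the deviations above):

```python
import math

r = [0.054, 0.052, -0.010]  # Country F annual returns, decimal form
n = len(r)

arith = sum(r) / n                                # arithmetic mean: 0.032
geo = math.prod(1 + x for x in r) ** (1 / n) - 1  # geometric mean (compound growth)
s2 = sum((x - arith) ** 2 for x in r) / (n - 1)   # sample variance: 0.001324

print(round(geo, 4), round(arith - s2 / 2, 4))  # 0.0316 0.0313
```

The approximation (3.13%) lands close to, though not exactly on, the true geometric mean (3.16%).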

10 DOWNSIDE DEVIATION AND COEFFICIENT OF VARIATION
k Calculate and interpret target downside deviation
An asset’s variance or standard deviation of returns is often interpreted as a measure
of the asset’s risk. Variance and standard deviation of returns take account of returns
above and below the mean, or upside and downside risks, respectively. However,
investors are typically concerned only with downside risk—for example, returns
below the mean or below some specified minimum target return. As a result, analysts
have developed measures of downside risk.

In practice, we may be concerned with values of return (or another variable) below
some level other than the mean. For example, if our return objective is 6.0% annually
(our minimum acceptable return), then we may be concerned particularly with returns
below 6.0% a year. The 6.0% is the target. The target downside deviation, also referred
to as the target semideviation, is a measure of dispersion of the observations (here,
returns) below the target. To calculate a sample target semideviation, we first specify
the target. After identifying observations below the target, we find the sum of the
squared negative deviations from the target, divide that sum by the total number of
observations in the sample minus 1, and, finally, take the square root.
Sample Target Semideviation Formula. The target semideviation, sTarget, is:

$$s_{\text{Target}} = \sqrt{\sum_{\text{for all } X_i \le B}^{n} \frac{\left(X_i - B\right)^2}{n - 1}} \tag{12}$$

where B is the target and n is the total number of sample observations. We illustrate this in Example 18.
this in Example 18.

EXAMPLE 18 

Calculating Target Downside Deviation


Suppose the monthly returns on a portfolio are as shown:

Monthly Portfolio Returns


Month Return (%)

January 5
February 3
March −1
April −4
May 4
June 2
July 0
August 4
September 3
October 0
November 6
December 5

1 Calculate the target downside deviation when the target return is 3%.
2 If the target return were 4%, would your answer be different from that for question 1? Without using calculations, explain how it would be different.

Solution to 1

Month        Observation    Deviation from the 3% Target    Deviation below the Target    Squared Deviation below the Target

January 5 2 — —
February 3 0 — —
March −1 −4 −4 16
April −4 −7 −7 49
May 4 1 — —
June 2 −1 −1 1
July 0 −3 −3 9
August 4 1 — —
September 3 0 — —
October 0 −3 −3 9
November 6 3 — —
December 5 2 — —
Sum 84

Target semideviation = √(84/11) = 2.7634%
Solution to 2
If the target return is higher, then the deviations of the below-target observations would be larger in magnitude, and more observations would fall below the target; so, the target semideviation would be larger.
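Solution 1 can be sketched in Python; note that the divisor is n − 1 = 11 (based on all observations), not the count of below-target months:

```python
import math

# Monthly portfolio returns (%) from the table above
returns = [5, 3, -1, -4, 4, 2, 0, 4, 3, 0, 6, 5]
target = 3
n = len(returns)

# Square only the deviations of observations below the target
ssd_below = sum((x - target) ** 2 for x in returns if x < target)
target_semideviation = math.sqrt(ssd_below / (n - 1))
print(round(target_semideviation, 4))  # 2.7634
```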

How does the target downside deviation relate to the sample standard deviation?
We illustrate the differences between the target downside deviation and the standard
deviation in Example 19, using the data in Example 18.

EXAMPLE 19

Comparing the Target Downside Deviation with the Standard Deviation
1 Given the data in Example 18, calculate the sample standard deviation.
2 Given the data in Example 18, calculate the target downside deviation if
the target is 2%.
3 Compare the standard deviation, the target downside deviation if the
target is 2%, and the target downside deviation if the target is 3%.

Solution to 1

Month        Observation    Deviation from the Mean    Squared Deviation
January           5               2.75                     7.5625
February          3               0.75                     0.5625
March            −1              −3.25                    10.5625
April            −4              −6.25                    39.0625
May               4               1.75                     3.0625
June              2              −0.25                     0.0625
July              0              −2.25                     5.0625
August            4               1.75                     3.0625
September         3               0.75                     0.5625
October           0              −2.25                     5.0625
November          6               3.75                    14.0625
December          5               2.75                     7.5625
Sum              27                                       96.2500

The sample standard deviation is √(96.2500/11) = 2.958%.
Solution to 2

Month        Observation    Deviation from the 2% Target    Deviation below the Target    Squared Deviation below the Target

January 5 3 — —
February 3 1 — —
March −1 −3 −3 9
April −4 −6 −6 36
May 4 2 — —
June 2 0 — —
July 0 −2 −2 4
August 4 2 — —
September 3 1 — —
October 0 −2 −2 4
November 6 4 — —
December 5 3 — —
Sum 53

The target semideviation with the 2% target = √(53/11) = 2.195%.
Solution to 3
The standard deviation is based on the deviation from the mean, which is 2.25%.
The standard deviation includes all deviations from the mean, not just those
below it. This results in a sample standard deviation of 2.958%.
Considering just the four observations below the 2% target, the target
semideviation is 2.195%. It is less than the sample standard deviation since
target semideviation captures only the downside risk (i.e., deviations below the
target). Considering target semideviation with a 3% target, there are now five
observations below 3%, so the target semideviation is higher, at 2.763%.

10.1  Coefficient of Variation


We noted earlier that the standard deviation is more easily interpreted than variance
because standard deviation uses the same units of measurement as the observations.
We may sometimes find it difficult to interpret what standard deviation means in terms
of the relative degree of variability of different sets of data, however, either because
the datasets have markedly different means or because the datasets have different
units of measurement. In this section, we explain a measure of relative dispersion, the coefficient of variation, which can be useful in such situations. Relative dispersion is the amount of dispersion relative to a reference value or benchmark.
The coefficient of variation is helpful in such situations as that just described (i.e.,
datasets with markedly different means or different units of measurement).
Coefficient of Variation Formula. The coefficient of variation, CV, is the ratio of the standard deviation of a set of observations to their mean value:

$$\mathrm{CV} = \frac{s}{\bar{X}} \tag{13}$$

where s is the sample standard deviation and X̄ is the sample mean.


When the observations are returns, for example, the coefficient of variation measures the amount of risk (standard deviation) per unit of reward (mean return). An issue that may arise, especially when dealing with returns, is that if X̄ is negative, the statistic is meaningless.
The CV may be stated as a multiple (e.g., 2 times) or as a percentage (e.g., 200%).
Expressing the magnitude of variation among observations relative to their average
size, the coefficient of variation permits direct comparisons of dispersion across
different datasets. Reflecting the correction for scale, the coefficient of variation is a
scale-­free measure (that is, it has no units of measurement).
We illustrate the usefulness of coefficient of variation for comparing datasets with
markedly different standard deviations using two hypothetical samples of companies
in Example 20.

EXAMPLE 20 

Coefficient of Variation of Returns on Assets


Suppose an analyst collects the return on assets (in percentage terms) for ten
companies for each of two industries:
Company Industry A Industry B

1 −5 −10
2 −3 −9
3 −1 −7
4 2 −3
5 4 1
6 6 3
7 7 5
8 9 18
9 10 20
10 11 22

These data can be represented graphically as the following:



[Dot plots of ROA along the number line:]

Industry A: −5, −3, −1, 2, 4, 6, 7, 9, 10, 11
Industry B: −10, −9, −7, −3, 1, 3, 5, 18, 20, 22

1 Calculate the average return on assets (ROA) for each industry.


2 Calculate the standard deviation of ROA for each industry.
3 Calculate the coefficient of variation of ROA for each industry.

Solution to 1
The arithmetic mean for both industries is the sum divided by 10, or 40/10 = 4%.
Solution to 2
The standard deviation using Equation 11 for Industry A is 5.60, and for Industry
B the standard deviation is 12.12.
Solution to 3

The coefficient of variation for Industry A = 5.60/4 = 1.40.


The coefficient of variation for Industry B = 12.12/4 = 3.03.
Though the two industries have the same arithmetic mean ROA, the dispersion is different, with Industry B's returns on assets being much more disperse than those of Industry A. The coefficients of variation for these two industries reflect this, with Industry B having the larger coefficient of variation. The interpretation is that the risk per unit of mean return is more than two times (2.16 = 3.03/1.40) greater for Industry B than for Industry A.
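Example 20 in code form, using the ROA figures from the table above:

```python
import statistics

roa = {
    "Industry A": [-5, -3, -1, 2, 4, 6, 7, 9, 10, 11],
    "Industry B": [-10, -9, -7, -3, 1, 3, 5, 18, 20, 22],
}
for industry, data in roa.items():
    mean = statistics.mean(data)  # 4 for both industries
    s = statistics.stdev(data)    # sample standard deviation (n - 1 divisor)
    cv = s / mean                 # risk per unit of mean return
    print(industry, round(s, 2), round(cv, 2))
# Industry A 5.6 1.4
# Industry B 12.12 3.03
```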

11 THE SHAPE OF THE DISTRIBUTIONS

l. Interpret skewness
Mean and variance may not adequately describe an investment’s distribution of returns.
In calculations of variance, for example, the deviations around the mean are squared,
so we do not know whether large deviations are likely to be positive or negative.
We need to go beyond measures of central tendency and dispersion to reveal other
important characteristics of the distribution. One important characteristic of interest
to analysts is the degree of symmetry in return distributions.
If a return distribution is symmetrical about its mean, each side of the distribution
is a mirror image of the other. Thus, equal loss and gain intervals exhibit the same
frequencies. If the mean is zero, for example, then losses from −5% to −3% occur with
about the same frequency as gains from 3% to 5%.

One of the most important distributions is the normal distribution, depicted


in Exhibit 47. This symmetrical, bell-­shaped distribution plays a central role in the
mean–variance model of portfolio selection; it is also used extensively in financial risk
management. The normal distribution has the following characteristics:
■■ Its mean, median, and mode are equal.
■■ It is completely described by two parameters—its mean and variance (or stan-
dard deviation).
But with any distribution other than a normal distribution, more information than
the mean and variance is needed to characterize its shape.

Exhibit 47  The Normal Distribution

[Figure: symmetric, bell-shaped density curve; vertical axis is density of probability, horizontal axis is standard deviation, from −5 to +5.]

A distribution that is not symmetrical is skewed. A return distribution with positive


skew has frequent small losses and a few extreme gains. A return distribution with
negative skew has frequent small gains and a few extreme losses. Exhibit 48 shows
continuous positively and negatively skewed distributions. The continuous positively
skewed distribution shown has a long tail on its right side; the continuous negatively
skewed distribution shown has a long tail on its left side.
For a continuous positively skewed unimodal distribution, the mode is less than the
median, which is less than the mean. For the continuous negatively skewed unimodal
distribution, the mean is less than the median, which is less than the mode. For a given
expected return and standard deviation, investors should be attracted by a positive
skew because the mean return lies above the median. Relative to the mean return,
positive skew amounts to limited, though frequent, downside returns compared with
somewhat unlimited, but less frequent, upside returns.

Exhibit 48  Properties of Skewed Distributions

[Figure, Panel A, Positively Skewed: density curve with a long right tail; from left to right along the horizontal axis, the mode, then the median, then the mean.]

[Figure, Panel B, Negatively Skewed: density curve with a long left tail; from left to right along the horizontal axis, the mean, then the median, then the mode.]

Skewness is the name given to a statistical measure of skew. (The word “skewness” is
also sometimes used interchangeably for “skew.”) Like variance, skewness is computed
using each observation’s deviation from its mean. Skewness (sometimes referred
to as relative skewness) is computed as the average cubed deviation from the mean
standardized by dividing by the standard deviation cubed to make the measure free
of scale. A symmetric distribution has skewness of 0, a positively skewed distribution
has positive skewness, and a negatively skewed distribution has negative skewness,
as given by this measure.
We can illustrate the principle behind the measure by focusing on the numera-
tor. Cubing, unlike squaring, preserves the sign of the deviations from the mean. If
a distribution is positively skewed with a mean greater than its median, then more
than half of the deviations from the mean are negative and less than half are positive.
However, for the sum of the cubed deviations to be positive, the losses must be small
and likely and the gains less likely but more extreme. Therefore, if skewness is positive,
the average magnitude of positive deviations is larger than the average magnitude of
negative deviations.
The approximation for computing sample skewness when n is large (100 or more) is:

$$\text{Skewness} \approx \left(\frac{1}{n}\right) \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^3}{s^3}$$

A simple example illustrates that a symmetrical distribution has a skewness mea-


sure equal to 0. Suppose we have the following data: 1, 2, 3, 4, 5, 6, 7, 8, and 9. The
mean outcome is 5, and the deviations are −4, −3, −2, −1, 0, 1, 2, 3, and 4. Cubing the
deviations yields −64, −27, −8, −1, 0, 1, 8, 27, and 64, with a sum of 0. The numerator
of skewness (and so skewness itself ) is thus equal to 0, supporting our claim.
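The large-sample approximation can be written as a small helper; applied to the symmetric data above, it returns exactly zero (note that n = 9 is far below the large-n guideline, so the sketch is for intuition only):

```python
def sample_skewness(data):
    """Approximate sample skewness for large n: mean cubed deviation over s**3."""
    n = len(data)
    mean = sum(data) / n
    s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    return sum((x - mean) ** 3 for x in data) / (n * s ** 3)

# Symmetric data: the cubed deviations cancel exactly
print(sample_skewness([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # 0.0
```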
As you will learn as the CFA Program curriculum unfolds, different investment
strategies may tend to introduce different types and amounts of skewness into returns.

11.1  The Shape of the Distributions: Kurtosis


m Interpret kurtosis
In the previous section, we discussed how to determine whether a return distribution
deviates from a normal distribution because of skewness. Another way in which a
return distribution might differ from a normal distribution is its relative tendency
to generate large deviations from the mean. Most investors would perceive a greater
chance of extremely large deviations from the mean as increasing risk.
Kurtosis is a measure of the combined weight of the tails of a distribution relative
to the rest of the distribution—that is, the proportion of the total probability that is
outside of, say, 2.5 standard deviations of the mean. A distribution that has fatter tails
than the normal distribution is referred to as leptokurtic or fat-­tailed; a distribution
that has thinner tails than the normal distribution is referred to as being platykurtic or
thin-­tailed; and a distribution similar to the normal distribution as concerns relative
weight in the tails is called mesokurtic. A fat-­tailed (thin-­tailed) distribution tends
to generate more-­frequent (less-­frequent) extremely large deviations from the mean
than the normal distribution.
Exhibit 49 illustrates a fat-­tailed distribution. It has fatter tails than the normal
distribution. By construction, the fat-­tailed and normal distributions in this exhibit
have the same mean, standard deviation, and skewness. Note that this fat-­tailed dis-
tribution is more likely than the normal distribution to generate observations in the
tail regions defined by the intersection of graphs near a standard deviation of about
±2.5. This fat-­tailed distribution is also more likely to generate observations that are
near the mean, defined here as the region ±1 standard deviation around the mean.
In compensation, to have probabilities sum to 1, this distribution generates fewer
observations in the regions between the central region and the two tail regions.

Exhibit 49  Fat-Tailed Distribution Compared to the Normal Distribution

[Figure: density curves for a normal distribution and a fat-tailed distribution with the same mean, standard deviation, and skewness; vertical axis is density of probability (0 to 0.6), horizontal axis is standard deviation (−5 to +5). The fat-tailed curve lies above the normal near the mean and beyond about ±2.5 standard deviations, and below it in between.]

The calculation for kurtosis involves finding the average of deviations from the
mean raised to the fourth power and then standardizing that average by dividing by
the standard deviation raised to the fourth power. A normal distribution has kurtosis
of 3.0, so a fat-tailed distribution has kurtosis above 3.0 and a thin-tailed distribution
has kurtosis below 3.0.
Excess kurtosis is the kurtosis relative to the normal distribution. For a large sample size (n = 100 or more), sample excess kurtosis (KE) is approximately as follows:

$$K_E \approx \left[\left(\frac{1}{n}\right) \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^4}{s^4}\right] - 3$$

As with skewness, this measure is free of scale. Many statistical packages report
estimates of sample excess kurtosis, labeling this as simply “kurtosis.”
Excess kurtosis thus characterizes kurtosis relative to the normal distribution. A
normal distribution has excess kurtosis equal to 0. A fat-­tailed distribution has excess
kurtosis greater than 0, and a thin-­tailed distribution has excess kurtosis less than 0. A
return distribution with positive excess kurtosis—a fat-­tailed return distribution—has
more frequent extremely large deviations from the mean than a normal distribution.
Summarizing:

If kurtosis is …    then excess kurtosis is …    Therefore, the distribution is …                And we refer to the distribution as being …
above 3.0           above 0.                     fatter-tailed than the normal distribution.     fat-tailed (leptokurtic).
equal to 3.0        equal to 0.                  similar in tails to the normal distribution.    mesokurtic.
less than 3.0       less than 0.                 thinner-tailed than the normal distribution.    thin-tailed (platykurtic).
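A parallel sketch for excess kurtosis; evenly spaced data has thinner tails than the normal, so its excess kurtosis is negative (again, n here is far below the large-sample guideline and the example is for intuition only):

```python
def excess_kurtosis(data):
    """Approximate sample excess kurtosis for large n:
    mean fourth-power deviation over s**4, minus 3 (the normal's kurtosis)."""
    n = len(data)
    mean = sum(data) / n
    s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    return sum((x - mean) ** 4 for x in data) / (n * s ** 4) - 3

# Evenly spaced data: thin-tailed (platykurtic), so excess kurtosis < 0
print(round(excess_kurtosis([1, 2, 3, 4, 5, 6, 7, 8, 9]), 3))  # -1.601
```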

Most equity return series have been found to be fat-­tailed. If a return distribution
is fat-­tailed and we use statistical models that do not account for the distribution,
then we will underestimate the likelihood of very bad or very good outcomes. Using
the data on the daily returns of the fictitious EAA Equity Index, we see the skewness
and kurtosis of these returns in Exhibit 50.

Exhibit 50  Skewness and Kurtosis of EAA Equity Index Daily Returns


Daily Return (%)

Arithmetic mean 0.0347


Standard deviation 0.8341

Measure of Symmetry
Skewness −0.4260
Excess kurtosis 3.7962

We can see this graphically, comparing the distribution of the daily returns with
a normal distribution with the same mean and standard deviation:

Exhibit 50  (Continued)

[Figure: histogram of the number of observations of EAA Equity Index daily returns by standard deviation from the mean (−5 to +5), overlaid with a normal distribution having the same mean and standard deviation.]
Using both the statistics and the graph, we see the following:
■■ The distribution is negatively skewed, as indicated by the negative calcu-
lated skewness of −0.4260 and the influence of observations below the
mean of 0.0347%.
■■ The highest frequency of returns occurs within the −0.5 to 0.0 standard
deviations from the mean (i.e., negatively skewed).
■■ The distribution is fat-­tailed, as indicated by the positive excess kurtosis of
3.7962. We can see fat tails, a concentration of returns around the mean,
and fewer observations in the regions between the central region and the
two-­tail regions.

EXAMPLE 21 

Interpreting Skewness and Kurtosis


Consider the daily trading volume for a stock for one year, as shown in the graph
below. In addition to the count of observations within each bin or interval, the
number of observations anticipated based on a normal distribution (given the
sample arithmetic average and standard deviation) is provided in the chart as
well. The average trading volume per day for this stock in this year is 8.6 million
shares, and the standard deviation is 4.9 million shares.

Histogram of Daily Trading Volume for a Stock for One Year

[Figure: histogram of the number of trading days by trading volume, in 20 bins of roughly 1.5 million shares spanning 3.1 to 33.7 million shares; the vertical axis runs from 0 to 70 trading days. Bars based on the sample are shown alongside bars based on a normal distribution with the same mean and standard deviation.]

1 Describe whether or not this distribution is skewed. If so, what could


account for this situation?
2 Describe whether or not this distribution displays kurtosis. How would
you make this determination?

Solution to 1
The distribution appears to be skewed to the right, or positively skewed. This is
likely due to: (1) no possible negative trading volume on a given trading day, so the
distribution is truncated at zero; and (2) greater-­than-­typical trading occurring
relatively infrequently, such as when there are company-­specific announcements.
The actual skewness for this distribution is 2.1090, which supports this
interpretation.
Solution to 2
The distribution appears to have excess kurtosis, with a right-side fat tail and with the peak frequency, in the 4.6 to 6.1 million range, exceeding what would be expected if the distribution were normal. There are also fewer observations than expected between the central region and the tail.
The actual excess kurtosis for this distribution is 5.2151, which supports
this interpretation.

12 CORRELATION BETWEEN TWO VARIABLES

n Interpret correlation between two variables



Now that we have some understanding of sample variance and standard deviation, we
can more formally consider the concept of correlation between two random variables
that we previously explored visually in the scatter plots in Section 6. Correlation is a
measure of the linear relationship between two random variables.
The first step is to consider how two variables vary together, their covariance.
Definition of Sample Covariance. The sample covariance (sXY) is a measure of how two variables in a sample move together:

$$s_{XY} = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{n - 1} \tag{14}$$
Equation 14 indicates that the sample covariance is the average value of the product
of the deviations of observations on two random variables (Xi and Yi) from their sample
means. If the random variables are returns, the units would be returns squared. Also,
note the use of n − 1 in the denominator, which ensures that the sample covariance
is an unbiased estimate of population covariance.
Stated simply, covariance is a measure of the joint variability of two random vari-
ables. If the random variables vary in the same direction—for example, X tends to be
above its mean when Y is above its mean, and X tends to be below its mean when Y is
below its mean—then their covariance is positive. If the variables vary in the opposite
direction relative to their respective means, then their covariance is negative.
By itself, the size of the covariance measure is difficult to interpret as it is not
normalized and so depends on the magnitude of the variables. This brings us to the
normalized version of covariance, which is the correlation coefficient.
Definition of Sample Correlation Coefficient. The sample correlation coefficient is a standardized measure of how two variables in a sample move together. The sample correlation coefficient (rXY) is the ratio of the sample covariance to the product of the two variables’ standard deviations:

$$r_{XY} = \frac{s_{XY}}{s_X s_Y} \tag{15}$$
Importantly, the correlation coefficient expresses the strength of the linear rela-
tionship between the two random variables.
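Equations 14 and 15 translate directly into code; the two series below are made up to show the perfect positive linear case:

```python
def sample_covariance(x, y):
    """Equation 14: average product of paired deviations, with an n - 1 divisor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Equation 15: covariance scaled by the product of the standard deviations."""
    s_x = sample_covariance(x, x) ** 0.5  # cov(x, x) is the sample variance of x
    s_y = sample_covariance(y, y) ** 0.5
    return sample_covariance(x, y) / (s_x * s_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # y = 2x: a perfect positive linear relationship
print(round(sample_correlation(x, y), 6))  # 1.0
```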

12.1  Properties of Correlation


We now discuss the correlation coefficient, or simply correlation, and its properties
in more detail, as follows:
1 Correlation ranges from −1 to +1 for two random variables, X and Y:
−1 ≤ rXY ≤ +1.
2 A correlation of 0 (uncorrelated variables) indicates an absence of any linear


(that is, straight-­line) relationship between the variables.
3 A positive correlation close to +1 indicates a strong positive linear relationship.
A correlation of 1 indicates a perfect linear relationship.
4 A negative correlation close to −1 indicates a strong negative (that is, inverse)
linear relationship. A correlation of −1 indicates a perfect inverse linear
relationship.

We will make use of scatter plots, similar to those used previously in our discussion
of data visualization, to illustrate correlation. In contrast to the correlation coefficient,
which expresses the relationship between two data series using a single number, a
scatter plot depicts the relationship graphically. Therefore, scatter plots are a very
useful tool for the sensible interpretation of a correlation coefficient.
Exhibit  51 shows examples of scatter plots. Panel A shows the scatter plot of
two variables with a correlation of +1. Note that all the points on the scatter plot in
Panel A lie on a straight line with a positive slope. Whenever variable X increases by
one unit, variable Y increases by two units. Because all of the points in the graph lie
on a straight line, an increase of one unit in X is associated with exactly a two-­unit
increase in Y, regardless of the level of X. Even if the slope of the line were different
(but positive), the correlation between the two variables would still be +1 as long as
all the points lie on that straight line. Panel B shows a scatter plot for two variables
with a correlation coefficient of −1. Once again, the plotted observations all fall on a
straight line. In this graph, however, the line has a negative slope. As X increases by
one unit, Y decreases by two units, regardless of the initial value of X.

Exhibit 51  Scatter Plots Showing Various Degrees of Correlation

[Figure: four scatter plots of Variable Y against Variable X. Panel A, Variables with a Correlation of +1: all points lie on an upward-sloping straight line. Panel B, Variables with a Correlation of −1: all points lie on a downward-sloping straight line. Panel C, Variables with a Correlation of 0: the points show no linear pattern. Panel D, Variables with a Strong Nonlinear Association: the points trace a curve rather than a straight line.]

Panel C shows a scatter plot of two variables with a correlation of 0; they have no
linear relation. This graph shows that the value of variable X tells us nothing about
the value of variable Y. Panel D shows a scatter plot of two variables that have a
nonlinear relationship. Because the correlation coefficient is a measure of the linear
association between two variables, it would not be appropriate to use the correlation
coefficient in this case.
Example 22 is meant to reinforce your understanding of how to interpret covari-
ance and correlation.

EXAMPLE 22 

Interpreting the Correlation Coefficient


Consider the statistics for the returns over twelve months for three funds, A,
B, and C, shown in Exhibit 52.

Exhibit 52 
Fund A Fund B Fund C

Arithmetic average 2.9333 3.2250 2.6250


Standard deviation 2.4945 2.4091 3.6668

The covariances are represented in the upper-­triangle (shaded area) of the matrix
shown in Exhibit 53.

Exhibit 53 
Fund A Fund B Fund C

Fund A 6.2224 5.7318 −3.6682


Fund B 5.8039 −2.3125
Fund C 13.4457

The covariance of Fund A and Fund B returns, for example, is 5.7318.


Why show just the upper-­triangle of this matrix? Because the covariance of
Fund A and Fund B returns is the same as the covariance of Fund B and Fund
A returns.
The diagonal of the matrix in Exhibit 53 is the variance of each fund’s return.
For example, the variance of Fund A returns is 6.2224, but the covariance of
Fund A and Fund B returns is 5.7318.
The correlations among the funds’ returns are given in Exhibit  54, where
the correlations are reported in the upper-­triangle (shaded area) of the matrix.
Note that the correlation of a fund’s returns with itself is +1, so the diagonal in
the correlation matrix consists of 1.000.
© CFA Institute. For candidate use only. Not for distribution.
146 Reading 2 ■ Organizing, Visualizing, and Describing Data

Exhibit 54 
Fund A Fund B Fund C

Fund A 1.0000 0.9538 −0.4010


Fund B 1.0000 −0.2618
Fund C 1.0000

1 Interpret the correlation between Fund A’s returns and Fund B’s returns.
2 Interpret the correlation between Fund A’s returns and Fund C’s returns.
3 Describe the relationship of the covariance of these returns and the cor-
relation of returns.

Solutions

1 The correlation of Fund A and Fund B returns is 0.9538, which is positive
and close to 1.0. This means that when returns of Fund A tend to
be above their mean, Fund B’s returns also tend to be above their mean.
Graphically, we would observe a positive, but not perfect, linear relation-
ship between the returns for the two funds.
2 The correlation of Fund A’s returns and Fund C’s returns is −0.4010,
which indicates that when Fund A’s returns are above their mean, Fund C’s
returns tend to be below their mean. This implies a negative slope when
graphing the returns of these two funds, but it would not be a perfect
inverse relationship.
3 There are two negative correlations: Fund A returns with Fund C returns,
and Fund B returns with Fund C returns. What determines the sign of the
correlation is the sign of the covariance, which in each of these cases is
negative. When the covariance between fund returns is positive, such as
between Fund A and Fund B returns, the correlation is positive. This fol-
lows from the fact that the correlation coefficient is the ratio of the covari-
ance of the two funds’ returns to the product of their standard deviations.
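The relationship described in Solution 3 can be verified directly from the exhibits: dividing each covariance in Exhibit 53 by the product of the relevant standard deviations in Exhibit 52 reproduces the correlations in Exhibit 54. A quick sketch in Python (variable names are ours):

```python
# Correlation = covariance / (product of standard deviations),
# using the summary statistics from Exhibits 52 and 53.
std_a, std_b, std_c = 2.4945, 2.4091, 3.6668       # Funds A, B, C
cov_ab, cov_ac, cov_bc = 5.7318, -3.6682, -2.3125  # from the covariance matrix

corr_ab = cov_ab / (std_a * std_b)
corr_ac = cov_ac / (std_a * std_c)
corr_bc = cov_bc / (std_b * std_c)

print(round(corr_ab, 4))  # 0.9538, as in Exhibit 54
print(round(corr_ac, 4))  # -0.401
print(round(corr_bc, 4))  # -0.2618
```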

12.2  Limitations of Correlation Analysis


Exhibit  51 illustrates that correlation measures the linear association between two
variables, but it may not always be reliable. Two variables can have a strong nonlinear
relation and still have a very low correlation. For example, the relation Y = (X − 4)² is a
nonlinear relation contrasted to the linear relation Y = 2X − 4. The nonlinear relation
between variables X and Y is shown in Panel D. Below a level of 4 for X, Y increases
with decreasing values of X. When X is 4 or greater, however, Y increases whenever X
increases. Even though these two variables are perfectly associated, there is no linear
association between them (hence, no meaningful correlation).
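This can be checked numerically. In the sketch below (our own illustrative values for X, chosen symmetric around 4), Y is perfectly determined by X, yet the Pearson correlation is essentially zero:

```python
# Perfect nonlinear association, but no linear association:
# Y = (X - 4)^2 with X symmetric around 4.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
y = [(xi - 4) ** 2 for xi in x]   # perfectly determined by x

print(pearson(x, y))  # ~0 (no linear association)
```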
Correlation may also be an unreliable measure when outliers are present in one
or both of the variables. As we have seen, outliers are small numbers of observations
at either extreme (small or large) of a sample. The correlation may be quite sensitive
to outliers. In such a situation, we should consider whether it makes sense to exclude
those outlier observations and whether they are noise or news. As a general rule, we
must determine whether a computed sample correlation changes greatly by removing
outliers. We must also use judgment to determine whether those outliers contain
information about the two variables’ relationship (and should thus be included in the
correlation analysis) or contain no information (and should thus be excluded). If they
are to be excluded from the correlation analysis, as we have seen previously, outlier
observations can be handled by trimming or winsorizing the dataset.
Importantly, keep in mind that correlation does not imply causation. Even if two
variables are highly correlated, one does not necessarily cause the other in the sense
that certain values of one variable bring about the occurrence of certain values of
the other.
Moreover, with visualizations too, including scatter plots, we must be on guard
against unconsciously making judgments about causal relationships that may or may
not be supported by the data.
The term spurious correlation has been used to refer to: 1) correlation between
two variables that reflects chance relationships in a particular dataset; 2) correlation
induced by a calculation that mixes each of two variables with a third variable; and
3) correlation between two variables arising not from a direct relation between them
but from their relation to a third variable.
As an example of the chance relationship, consider the monthly US retail sales of
beer, wine, and liquor and the atmospheric carbon dioxide levels from 2000–2018.
The correlation is 0.824, indicating that there is a positive relation between the two.
However, there is no reason to suspect that the levels of atmospheric carbon dioxide
are related to the retail sales of beer, wine, and liquor.
As an example of the second kind of spurious correlation, two variables that are
uncorrelated may be correlated if divided by a third variable. For example, consider a
cross-­sectional sample of companies’ dividends and total assets. While there may be
a low correlation between these two variables, dividing each by market capitalization
may increase the correlation.
As an example of the third kind of spurious correlation, height may be positively
correlated with the extent of a person’s vocabulary, but the underlying relationships
are between age and height and between age and vocabulary.
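The second of these mechanisms is easy to simulate. In the sketch below (simulated, hypothetical data; not from the reading), two independent series are essentially uncorrelated, but dividing both by a common third series induces a strong positive correlation:

```python
# Spurious correlation induced by a common divisor (simulated data).
import random
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(42)
n = 1000
dividends = [rng.gauss(10, 1) for _ in range(n)]        # independent series
assets = [rng.gauss(10, 1) for _ in range(n)]           # independent series
market_cap = [rng.uniform(0.5, 2.0) for _ in range(n)]  # common divisor

raw = pearson(dividends, assets)                        # near zero
scaled = pearson([d / m for d, m in zip(dividends, market_cap)],
                 [a / m for a, m in zip(assets, market_cap)])  # strongly positive

print(round(raw, 3), round(scaled, 3))
```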
Investment professionals must be cautious in basing investment strategies on high
correlations. Spurious correlations may suggest investment strategies that appear
profitable but actually would not be if implemented.
A further issue is that correlation does not tell the whole story about the data.
Consider Anscombe’s Quartet, discussed in Exhibit 55, where very dissimilar graphs
can be developed with variables that have the same mean, same standard deviation,
and same correlation.

Exhibit 55  Anscombe’s Quartet


Francis Anscombe, a British statistician, developed datasets that illustrate why
just looking at summary statistics (that is, mean, standard deviation, and cor-
relation) does not fully describe the data. He created four datasets (designated
I, II, III, and IV), each with two variables, X and Y, such that:
■■ The Xs in each dataset have the same mean and standard deviation, 9.00
and 3.32, respectively.
■■ The Ys in each dataset have the same mean and standard deviation, 7.50
and 2.03, respectively.
■■ The Xs and Ys in each dataset have the same correlation of 0.82.
(continued)

Exhibit 55  (Continued)

I II III IV
Observation X Y X Y X Y X Y

1 10 8.04 10 9.14 10 7.46 8 6.6


2 8 6.95 8 8.14 8 6.77 8 5.8
3 13 7.58 13 8.74 13 12.74 8 7.7
4 9 8.81 9 8.77 9 7.11 8 8.8
5 11 8.33 11 9.26 11 7.81 8 8.5
6 14 9.96 14 8.1 14 8.84 8 7
7 6 7.24 6 6.13 6 6.08 8 5.3
8 4 4.26 4 3.1 4 5.39 19 13
9 12 10.8 12 9.13 12 8.15 8 5.6
10 7 4.82 7 7.26 7 6.42 8 7.9
11 5 5.68 5 4.74 5 5.73 8 6.9

N 11 11 11 11 11 11 11 11
Mean 9.00 7.50 9.00 7.50 9.00 7.50 9.00 7.50
Standard
deviation 3.32 2.03 3.32 2.03 3.32 2.03 3.32 2.03
Correlation 0.82 0.82 0.82 0.82

While the X variable has the same values for I, II, and III in the quartet of
datasets, the Y variables are quite different, creating different relationships.
The four datasets are:
I An approximate linear relationship between X and Y.
II A curvilinear relationship between X and Y.
III A linear relationship except for one outlier.
IV A constant X with the exception of one outlier.
Depicting the quartet visually (Exhibit 55, continued) as four scatter plots of
Variable Y against Variable X, one for each of datasets I, II, III, and IV, makes
the differences plain. The bottom line? Knowing the means and standard deviations
of the two variables, as well as the correlation between them, does not tell the
entire story.

Source: Francis John Anscombe, “Graphs in Statistical Analysis,” The American Statistician 27
(February 1973): 17–21.
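Anscombe’s point is straightforward to verify. Using dataset I exactly as tabulated above, the following sketch reproduces the stated mean, standard deviation, and correlation, even though only a plot reveals the shape of the data:

```python
# Verify the summary statistics for Anscombe dataset I (values from Exhibit 55).
from math import sqrt

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.8, 4.82, 5.68]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
std_x = sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))  # sample std dev
std_y = sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
corr = cov / (std_x * std_y)

print(round(mean_x, 2), round(std_x, 2))  # 9.0 3.32
print(round(mean_y, 2), round(std_y, 2))  # 7.5 2.03
print(round(corr, 2))                     # 0.82
```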

SUMMARY
In this reading, we have presented tools and techniques for organizing, visualizing,
and describing data that permit us to convert raw data into useful information for
investment analysis.
■■ Data can be defined as a collection of numbers, characters, words, and text—as
well as images, audio, and video—in a raw or organized format to represent
facts or information.
■■ From a statistical perspective, data can be classified as numerical data and
categorical data. Numerical data (also called quantitative data) are values that
represent measured or counted quantities as a number. Categorical data (also
called qualitative data) are values that describe a quality or characteristic of a
group of observations and usually take only a limited number of values that are
mutually exclusive.

■■ Numerical data can be further split into two types: continuous data and discrete
data. Continuous data can be measured and can take on any numerical value in
a specified range of values. Discrete data are numerical values that result from a
counting process and therefore are limited to a finite number of values.
■■ Categorical data can be further classified into two types: nominal data and
ordinal data. Nominal data are categorical values that are not amenable to being
organized in a logical order, while ordinal data are categorical values that can be
logically ordered or ranked.
■■ Based on how they are collected, data can be categorized into three types:
cross-­sectional, time series, and panel. Time-­series data are a sequence of
observations for a single observational unit on a specific variable collected
over time and at discrete and typically equally spaced intervals of time. Cross-­
sectional data are a list of the observations of a specific variable from multiple
observational units at a given point in time. Panel data are a mix of time-­series
and cross-­sectional data that consists of observations through time on one or
more variables for multiple observational units.
■■ Based on whether or not data are in a highly organized form, they can be classi-
fied into structured and unstructured types. Structured data are highly orga-
nized in a pre-­defined manner, usually with repeating patterns. Unstructured
data do not follow any conventionally organized forms; they are typically alter-
native data as they are usually collected from unconventional sources.
■■ Raw data are typically organized into either a one-­dimensional array or a two-­
dimensional rectangular array (also called a data table) for quantitative analysis.
■■ A frequency distribution is a tabular display of data constructed either by
counting the observations of a variable by distinct values or groups or by tal-
lying the values of a numerical variable into a set of numerically ordered bins.
Frequency distributions permit us to evaluate how data are distributed.
■■ The relative frequency of observations in a bin (interval or bucket) is the num-
ber of observations in the bin divided by the total number of observations. The
cumulative relative frequency cumulates (adds up) the relative frequencies as
we move from the first bin to the last, thus giving the fraction of the observa-
tions that are less than the upper limit of each bin.
■■ A contingency table is a tabular format that displays the frequency distributions
of two or more categorical variables simultaneously. One application of contin-
gency tables is for evaluating the performance of a classification model (using a
confusion matrix). Another application of contingency tables is to investigate a
potential association between two categorical variables by performing a chi-­
square test of independence.
■■ Visualization is the presentation of data in a pictorial or graphical format for
the purpose of increasing understanding and for gaining insights into the data.
■■ A histogram is a bar chart of data that have been grouped into a frequency
distribution. A frequency polygon is a graph of frequency distributions obtained
by drawing straight lines joining successive midpoints of bars representing the
class frequencies.
■■ A bar chart is used to plot the frequency distribution of categorical data, with
each bar representing a distinct category and the bar’s height (or length) pro-
portional to the frequency of the corresponding category. Grouped bar charts
or stacked bar charts can present the frequency distribution of multiple cate-
gorical variables simultaneously.

■■ A tree-­map is a graphical tool to display categorical data. It consists of a set of
colored rectangles to represent distinct groups, and the area of each rectangle is
proportional to the value of the corresponding group. Additional dimensions of
categorical data can be displayed by nested rectangles.
■■ A word cloud is a visual device for representing textual data, with the size of
each distinct word being proportional to the frequency with which it appears in
the given text.
■■ A line chart is a type of graph used to visualize ordered observations and often
to display the change of data series over time. A bubble line chart is a special
type of line chart that uses varying-­sized bubbles as data points to represent an
additional dimension of data.
■■ A scatter plot is a type of graph for visualizing the joint variation in two numer-
ical variables. It is constructed by drawing dots to indicate the values of the two
variables plotted against the corresponding axes. A scatter plot matrix organizes
scatter plots between pairs of variables into a matrix format to inspect all pair-
wise relationships between more than two variables in one combined visual.
■■ A heat map is a type of graphic that organizes and summarizes data in a tabular
format and represents it using a color spectrum. It is often used in displaying
frequency distributions or visualizing the degree of correlation among different
variables.
■■ The key consideration when selecting among chart types is the intended pur-
pose of visualizing data (i.e., whether it is for exploring/presenting distributions
or relationships or for making comparisons).
■■ A population is defined as all members of a specified group. A sample is a sub-
set of a population.
■■ A parameter is any descriptive measure of a population. A sample statistic (sta-
tistic, for short) is a quantity computed from or used to describe a sample.
■■ Sample statistics—such as measures of central tendency, measures of disper-
sion, skewness, and kurtosis—help with investment analysis, particularly in
making probabilistic statements about returns.
■■ Measures of central tendency specify where data are centered and include the
mean, median, and mode (i.e., the most frequently occurring value).
■■ The arithmetic mean is the sum of the observations divided by the number of
observations. It is the most frequently used measure of central tendency.
■■ The median is the value of the middle item (or the mean of the values of the two
middle items) when the items in a set are sorted into ascending or descending
order. The median is not influenced by extreme values and is most useful in the
case of skewed distributions.
■■ The mode is the most frequently observed value and is the only measure of
central tendency that can be used with nominal data. A distribution may be
unimodal (one mode), bimodal (two modes), trimodal (three modes), or have
even more modes.
■■ A portfolio’s return is a weighted mean return computed from the returns on
the individual assets, where the weight applied to each asset’s return is the frac-
tion of the portfolio invested in that asset.
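A minimal sketch of the weighted mean portfolio return, using hypothetical weights and returns:

```python
# Portfolio return as a weighted mean of asset returns (hypothetical data).
weights = [0.50, 0.30, 0.20]   # fraction of the portfolio in each asset
returns = [10.0, 4.0, -2.0]    # each asset's return, in percent

portfolio_return = sum(w * r for w, r in zip(weights, returns))
print(portfolio_return)  # 5.8
```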
■■ The geometric mean, $\bar{X}_G$, of a set of observations X1, X2, …, Xn, is
$\bar{X}_G = \sqrt[n]{X_1 X_2 X_3 \cdots X_n}$, with Xi ≥ 0 for i = 1, 2, …, n. The geometric mean is
especially important in reporting compound growth rates for time-­series data.
The geometric mean will always be less than the arithmetic mean whenever
there is variance in the observations.

■■ The harmonic mean, $\bar{X}_H$, is a type of weighted mean in which an observation’s
weight is inversely proportional to its magnitude.
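The three means can be compared on a small hypothetical sample; for positive values with any variation, the ordering is arithmetic ≥ geometric ≥ harmonic:

```python
# Arithmetic, geometric, and harmonic means of a hypothetical sample.
from math import prod

x = [2.0, 4.0, 8.0]
n = len(x)

arithmetic = sum(x) / n                 # (2 + 4 + 8) / 3
geometric = prod(x) ** (1 / n)          # cube root of 2 * 4 * 8
harmonic = n / sum(1 / xi for xi in x)  # reciprocal of the mean reciprocal

print(arithmetic, round(geometric, 4), round(harmonic, 4))
```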
■■ Quantiles—such as the median, quartiles, quintiles, deciles, and percentiles—
are location parameters that divide a distribution into halves, quarters, fifths,
tenths, and hundredths, respectively.
■■ A box and whiskers plot illustrates the interquartile range (the “box”) as well as
a range outside of the box that is based on the interquartile range, indicated by
the “whiskers.”
■■ Dispersion measures—such as the range, mean absolute deviation (MAD),
variance, standard deviation, target downside deviation, and coefficient of varia-
tion—describe the variability of outcomes around the arithmetic mean.
■■ The range is the difference between the maximum value and the minimum
value of the dataset. The range has only a limited usefulness because it uses
information from only two observations.
■■ The MAD for a sample is the average of the absolute deviations of observations
from the mean, $\mathrm{MAD} = \frac{\sum_{i=1}^{n} \lvert X_i - \bar{X} \rvert}{n}$, where $\bar{X}$ is the sample mean and $n$ is the number
of observations in the sample.
■■ The variance is the average of the squared deviations around the mean, and the
standard deviation is the positive square root of variance. In computing sample
variance (s2) and sample standard deviation (s), the average squared deviation is
computed using a divisor equal to the sample size minus 1.
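A minimal sketch of the range, MAD, sample variance, and sample standard deviation for a small hypothetical sample:

```python
# Range, MAD, sample variance, and sample standard deviation (hypothetical data).
from math import sqrt

x = [4.0, 6.0, 8.0, 2.0, 10.0]
n = len(x)
mean = sum(x) / n                                   # 6.0

data_range = max(x) - min(x)                        # 8.0
mad = sum(abs(xi - mean) for xi in x) / n           # (2+0+2+4+4)/5 = 2.4
var = sum((xi - mean) ** 2 for xi in x) / (n - 1)   # (4+0+4+16+16)/4 = 10.0
std = sqrt(var)                                     # positive square root

print(data_range, mad, var, round(std, 4))
```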
■■ The target downside deviation, or target semideviation, is a measure of the risk
of being below a given target. It is calculated as the square root of the average
squared deviations from the target, but it includes only those observations
below the target (B), or $s_{\mathrm{target}} = \sqrt{\sum_{\text{all } X_i \le B} \frac{(X_i - B)^2}{n - 1}}$.
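A sketch of the target semideviation calculation for hypothetical returns with a target of B = 2%. Only observations at or below the target enter the sum, while the divisor remains n − 1:

```python
# Target downside deviation (target semideviation); hypothetical returns in %.
from math import sqrt

returns = [-2.0, 4.0, 1.0, 6.0, -3.0]
target = 2.0
n = len(returns)

# Sum squared deviations only for observations at or below the target;
# divide by the full sample size minus 1.
downside_sq = sum((r - target) ** 2 for r in returns if r <= target)
target_semidev = sqrt(downside_sq / (n - 1))  # sqrt(42 / 4)

print(round(target_semidev, 4))  # 3.2404
```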
■■ The coefficient of variation, CV, is the ratio of the standard deviation of a set
of observations to their mean value. By expressing the magnitude of variation
among observations relative to their average size, the CV permits direct com-
parisons of dispersion across different datasets. Reflecting the correction for
scale, the CV is a scale-­free measure (i.e., it has no units of measurement).
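As a brief sketch with hypothetical figures, a dataset with the larger standard deviation can still have the smaller relative dispersion:

```python
# Coefficient of variation: dispersion relative to the mean (hypothetical data).
mean_a, std_a = 4.0, 2.0     # smaller absolute dispersion
mean_b, std_b = 50.0, 10.0   # larger absolute dispersion

cv_a = std_a / mean_a        # 0.5
cv_b = std_b / mean_b        # 0.2: less relative dispersion despite larger std
print(cv_a, cv_b)
```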
■■ Skew or skewness describes the degree to which a distribution is asymmetric
about its mean. A return distribution with positive skewness has frequent small
losses and a few extreme gains compared to a normal distribution. A return
distribution with negative skewness has frequent small gains and a few extreme
losses compared to a normal distribution. Zero skewness indicates a symmetric
distribution of returns.
■■ Kurtosis measures the combined weight of the tails of a distribution relative
to the rest of the distribution. A distribution with fatter tails than the normal
distribution is referred to as fat-­tailed (leptokurtic); a distribution with thin-
ner tails than the normal distribution is referred to as thin-­tailed (platykurtic).
Excess kurtosis is kurtosis minus 3, since 3 is the value of kurtosis for all normal
distributions.
■■ The correlation coefficient is a statistic that measures the association between
two variables. It is the ratio of covariance to the product of the two variables’
standard deviations. A positive correlation coefficient indicates that the two
variables tend to move together, whereas a negative coefficient indicates that
© CFA Institute. For candidate use only. Not for distribution.
Summary 153

the two variables tend to move in opposite directions. Correlation does not
imply causation, simply association. Issues that arise in evaluating correlation
include the presence of outliers and spurious correlation.

PRACTICE PROBLEMS
1 Published ratings on stocks ranging from 1 (strong sell) to 5 (strong buy) are
examples of which measurement scale?
A Ordinal
B Continuous
C Nominal
2 Data values that are categorical and not amenable to being organized in a logi-
cal order are most likely to be characterized as:
A ordinal data.
B discrete data.
C nominal data.
3 Which of the following data types would be classified as being categorical?
A Discrete
B Nominal
C Continuous
4 A fixed-­income analyst uses a proprietary model to estimate bankruptcy proba-
bilities for a group of firms. The model generates probabilities that can take any
value between 0 and 1. The resulting set of estimated probabilities would most
likely be characterized as:
A ordinal data.
B discrete data.
C continuous data.
5 An analyst uses a software program to analyze unstructured data—specifically,
management’s earnings call transcript for one of the companies in her research
coverage. The program scans the words in each sentence of the transcript and
then classifies the sentences as having negative, neutral, or positive sentiment.
The resulting set of sentiment data would most likely be characterized as:
A ordinal data.
B discrete data.
C nominal data.

Use the following information to answer Questions 6 and 7
An equity analyst gathers total returns for three country equity indexes over the past
four years. The data are presented below.


Time Period Index A Index B Index C

Year t–3 15.56% 11.84% −4.34%


Year t–2 −4.12% −6.96% 9.32%
Year t–1 11.19% 10.29% −12.72%
Year t 8.98% 6.32% 21.44%

6 Each individual column of data in the table can be best characterized as:
A panel data.
B time-­series data.
C cross-­sectional data.
7 Each individual row of data in the table can be best characterized as:
A panel data.
B time-­series data.
C cross-­sectional data.

8 A two-­dimensional rectangular array would be most suitable for organizing a collection of raw:
A panel data.
B time-­series data.
C cross-­sectional data.
9 In a frequency distribution, the absolute frequency measure:
A represents the percentages of each unique value of the variable.
B represents the actual number of observations counted for each unique value
of the variable.
C allows for comparisons between datasets with different numbers of total
observations.
10 An investment fund has the return frequency distribution shown in the follow-
ing exhibit.
Return Interval (%) Absolute Frequency

−10.0 to −7.0 3
−7.0 to −4.0 7
−4.0 to −1.0 10
−1.0 to +2.0 12
+2.0 to +5.0 23
+5.0 to +8.0 5

Which of the following statements is correct?


A The relative frequency of the bin “−1.0 to +2.0” is 20%.
B The relative frequency of the bin “+2.0 to +5.0” is 23%.
C The cumulative relative frequency of the bin “+5.0 to +8.0” is 91.7%.
11 An analyst is using the data in the following exhibit to prepare a statistical
report.

Portfolio’s Deviations from Benchmark Return for a 12-­Year Period (%)


Year 1 2.48 Year 7 −9.19
Year 2 −2.59 Year 8 −5.11
Year 3 9.47 Year 9 1.33
Year 4 −0.55 Year 10 6.84
Year 5 −1.69 Year 11 3.04
Year 6 −0.89 Year 12 4.72

The cumulative relative frequency for the bin −1.71% ≤ x < 2.03% is closest to:
A 0.250.
B 0.333.
C 0.583.

Use the following information to answer Questions 12 and 13
A fixed-­income portfolio manager creates a contingency table of the number of bonds
held in her portfolio by sector and bond rating. The contingency table is presented here:

Bond Rating
Sector A AA AAA

Communication Services 25 32 27
Consumer Staples 30 25 25
Energy 100 85 30
Health Care 200 100 63
Utilities 22 28 14

12 The marginal frequency of energy sector bonds is closest to:


A 27.
B 85.
C 215.
13 The relative frequency of AA rated energy bonds, based on the total count, is
closest to:
A 10.5%.
B 31.5%.
C 39.5%.

The following information relates to Questions 14–15
The following histogram shows a distribution of the S&P 500 Index annual returns
for a 50-­year period:

[Histogram: frequency of annual returns, with return intervals (%) of width 5
running from −37 to −32 through 33 to 38.]

14 The bin containing the median return is:


A 3% to 8%.
B 8% to 13%.
C 13% to 18%.
15 Based on the previous histogram, the distribution is best described as being:
A unimodal.
B bimodal.
C trimodal.

16 The following is a frequency polygon of monthly exchange rate changes in the
US dollar/Japanese yen spot exchange rate for a four-­year period. A positive
change represents yen appreciation (the yen buys more dollars), and a negative
change represents yen depreciation (the yen buys fewer dollars).

Monthly Changes in the US Dollar/Japanese Yen Spot Exchange Rate


[Frequency polygon: frequency on the vertical axis; return interval midpoints (%)
of −5, −3, −1, 1, and 3 on the horizontal axis.]

Based on the chart, yen appreciation:


A occurred more than 50% of the time.
B was less frequent than yen depreciation.
C in the 0.0 to 2.0 interval occurred 20% of the time.
17 A bar chart that orders categories by frequency in descending order and
includes a line displaying cumulative relative frequency is referred to as a:
A Pareto Chart.
B grouped bar chart.
C frequency polygon.
18 Which visualization tool works best to represent unstructured, textual data?
A Tree-­Map
B Scatter plot
C Word cloud
19 A tree-­map is best suited to illustrate:
A underlying trends over time.
B joint variations in two variables.
C value differences of categorical groups.
20 A line chart with two variables—for example, revenues and earnings per
share—is best suited for visualizing:
A the joint variation in the variables.
B underlying trends in the variables over time.
C the degree of correlation between the variables.
21 A heat map is best suited for visualizing the:
A frequency of textual data.
B degree of correlation between different variables.
C shape, center, and spread of the distribution of numerical data.
22 Which valuation tool is recommended to be used if the goal is to make compar-
isons of three or more variables over time?

A Heat map
B Bubble line chart
C Scatter plot matrix
23 The annual returns for three portfolios are shown in the following exhibit.
Portfolios P and R were created in Year 1, Portfolio Q in Year 2.

Annual Portfolio Returns (%)


Year 1 Year 2 Year 3 Year 4 Year 5

Portfolio P −3.0 4.0 5.0 3.0 7.0


Portfolio Q −3.0 6.0 4.0 8.0
Portfolio R 1.0 −1.0 4.0 4.0 3.0

The median annual return from portfolio creation to Year 5 for:


A Portfolio P is 4.5%.
B Portfolio Q is 4.0%.
C Portfolio R is higher than its arithmetic mean annual return.
24 At the beginning of Year X, an investor allocated his retirement savings in the
asset classes shown in the following exhibit and earned a return for Year X as
also shown.
Asset Class    Asset Allocation (%)    Asset Class Return for Year X (%)

Large-­cap US equities 20.0 8.0


Small-­cap US equities 40.0 12.0
Emerging market equities 25.0 −3.0
High-­yield bonds 15.0 4.0

The portfolio return for Year X is closest to:


A 5.1%.
B 5.3%.
C 6.3%.
25 The following exhibit shows the annual returns for Fund Y.

Fund Y (%)

Year 1 19.5
Year 2 −1.9
Year 3 19.7
Year 4 35.0
Year 5 5.7

The geometric mean return for Fund Y is closest to:


A 14.9%.
B 15.6%.
C 19.5%.
26 A portfolio manager invests €5,000 annually in a security for four years at the
prices shown in the following exhibit.

Purchase Price of Security (€ per unit)

Year 1 62.00
Year 2 76.00
Year 3 84.00
Year 4 90.00

The average price is best represented as the:


A harmonic mean of €76.48.
B geometric mean of €77.26.
C arithmetic average of €78.00.

The following information relates to Questions 27–28
The following exhibit shows the annual MSCI World Index total returns for a 10-­year
period.
Year 1 15.25% Year 6 30.79%
Year 2 10.02% Year 7 12.34%
Year 3 20.65% Year 8 −5.02%
Year 4 9.57% Year 9 16.54%
Year 5 −40.33% Year 10 27.37%

27 The fourth quintile return for the MSCI World Index is closest to:
A 20.65%.
B 26.03%.
C 27.37%.
28 For Year 6–Year 10, the mean absolute deviation of the MSCI World Index total
returns is closest to:
A 10.20%.
B 12.74%.
C 16.40%.

29 Annual returns and summary statistics for three funds are listed in the follow-
ing exhibit:

Annual Returns (%)


Year Fund ABC Fund XYZ Fund PQR

Year 1 −20.0 −33.0 −14.0


Year 2 23.0 −12.0 −18.0
Year 3 −14.0 −12.0 6.0
Year 4 5.0 −8.0 −2.0
Year 5 −14.0 11.0 3.0
Mean −4.0 −10.8 −5.0
Standard deviation 17.8 15.6 10.5

The fund with the highest absolute dispersion is:


A Fund PQR if the measure of dispersion is the range.
B Fund XYZ if the measure of dispersion is the variance.
C Fund ABC if the measure of dispersion is the mean absolute deviation.
30 The mean monthly return and the standard deviation for three industry sectors
are shown in the following exhibit.
Sector    Mean Monthly Return (%)    Standard Deviation of Return (%)

Utilities (UTIL) 2.10 1.23


Materials (MATR) 1.25 1.35
Industrials (INDU) 3.01 1.52

Based on the coefficient of variation, the riskiest sector is:


A utilities.
B materials.
C industrials.
31 The average return for Portfolio A over the past twelve months is 3%, with a
standard deviation of 4%. The average return for Portfolio B over this same
period is also 3%, but with a standard deviation of 6%. The geometric mean
return of Portfolio A is 2.85%. The geometric mean return of Portfolio B is:
A less than 2.85%.
B equal to 2.85%.
C greater than 2.85%.
32 An analyst calculated the excess kurtosis of a stock’s returns as −0.75. From this
information, we conclude that the distribution of returns is:
A normally distributed.
B thin-­tailed compared to the normal distribution.
C fat-­tailed compared to the normal distribution.
33 When analyzing investment returns, which of the following statements is
correct?
A The geometric mean will exceed the arithmetic mean for a series with non-­
zero variance.
B The geometric mean measures an investment’s compound rate of growth
over multiple periods.
C The arithmetic mean measures an investment’s terminal value over multiple
periods.

The following information relates to Questions 34–38
A fund had the following experience over the past 10 years:
Year Return

1 4.5%
2 6.0%
3 1.5%
4 −2.0%
5 0.0%
6 4.5%
7 3.5%
8 2.5%
9 5.5%
10 4.0%

34 The arithmetic mean return over the 10 years is closest to:


A 2.97%.
B 3.00%.
C 3.33%.
35 The geometric mean return over the 10 years is closest to:
A 2.94%.
B 2.97%.
C 3.00%.
36 The harmonic mean return over the 10 years is closest to:
A 2.94%.
B 2.97%.
C 3.00%.
37 The standard deviation of the 10 years of returns is closest to:
A 2.40%.
B 2.53%.
C 7.58%.
38 The target semideviation of the returns over the 10 years if the target is 2% is
closest to:
A 1.42%.
B 1.50%.
C 2.01%.

39 A correlation of 0.34 between two variables, X and Y, is best described as:


A changes in X causing changes in Y.
B a positive association between X and Y.
C a curvilinear relationship between X and Y.

40 Which of the following is a potential problem with interpreting a correlation


coefficient?
A Outliers
B Spurious correlation
C Both outliers and spurious correlation

The following information relates to Questions 41–42


An analyst is evaluating the tendency of returns on the portfolio of stocks she man-
ages to move along with bond and real estate indexes. She gathered monthly data on
returns and the indexes:
Returns (%)

                         Portfolio     Bond Index     Real Estate Index
                         Returns       Returns        Returns
Arithmetic average         5.5            3.2               7.8
Standard deviation         8.2            3.4              10.3

                         Portfolio Returns and     Portfolio Returns and
                         Bond Index Returns        Real Estate Index Returns
Covariance                      18.9                       −55.9

41 Without calculating the correlation coefficient, the correlation of the portfolio


returns and the bond index returns is:
A negative.
B zero.
C positive.
42 Without calculating the correlation coefficient, the correlation of the portfolio
returns and the real estate index returns is:
A negative.
B zero.
C positive.

43 Consider two variables, A and B. If variable A has a mean of −0.56, variable B


has a mean of 0.23, and the covariance between the two variables is positive, the
correlation between these two variables is:
A negative.
B zero.
C positive.

The following information relates to Questions 44–45
[Box and whisker plot (vertical axis from 40 to 180): upper whisker ending at 154.45; box spanning 79.74 to 114.25, with a line inside the box at 100.49; lower whisker ending at 51.51.]

44 The median is closest to:


A 34.51.
B 100.49.
C 102.98.
45 The interquartile range is closest to:
A 13.76.
B 25.74.
C 34.51.

The following information relates to Questions 46–48
An analyst examined a cross-­section of annual returns for 252 stocks and calculated
the following statistics:
Arithmetic Average 9.986%
Geometric Mean 9.909%
Variance 0.001723
Skewness 0.704
Excess Kurtosis 0.503

46 The coefficient of variation is closest to:


A 0.02.
B 0.42.
C 2.41.
47 This distribution is best described as:
A negatively skewed.
B having no skewness.
C positively skewed.

48 Compared to the normal distribution, this sample’s distribution is best


described as having tails of the distribution with:
A less probability than the normal distribution.
B the same probability as the normal distribution.
C more probability than the normal distribution.

SOLUTIONS
1 A is correct. Ordinal scales sort data into categories that are ordered with
respect to some characteristic and may involve numbers to identify categories
but do not assure that the differences between scale values are equal. The buy
rating scale indicates that a stock ranked 5 is expected to perform better than a
stock ranked 4, but it tells us nothing about the performance difference between
stocks ranked 4 and 5 compared with the performance difference between
stocks ranked 1 and 2, and so on.
2 C is correct. Nominal data are categorical values that are not amenable to being
organized in a logical order. A is incorrect because ordinal data are categorical
data that can be logically ordered or ranked. B is incorrect because discrete
data are numerical values that result from a counting process; thus, they can be
ordered in various ways, such as from highest to lowest value.
3 B is correct. Categorical data (or qualitative data) are values that describe a
quality or characteristic of a group of observations and therefore can be used
as labels to divide a dataset into groups to summarize and visualize. The two
types of categorical data are nominal data and ordinal data. Nominal data are
categorical values that are not amenable to being organized in a logical order,
while ordinal data are categorical values that can be logically ordered or ranked.
A is incorrect because discrete data would be classified as numerical data (not
categorical data). C is incorrect because continuous data would be classified as
numerical data (not categorical data).
4 C is correct. Continuous data are data that can be measured and can take on
any numerical value in a specified range of values. In this case, the analyst is
estimating bankruptcy probabilities, which can take on any value between 0 and
1. Therefore, the set of bankruptcy probabilities estimated by the analyst would
likely be characterized as continuous data. A is incorrect because ordinal data
are categorical values that can be logically ordered or ranked. Therefore, the
set of bankruptcy probabilities would not be characterized as ordinal data. B is
incorrect because discrete data are numerical values that result from a counting
process, and therefore the data are limited to a finite number of values. The pro-
prietary model used can generate probabilities that can take any value between
0 and 1; therefore, the set of bankruptcy probabilities would not be character-
ized as discrete data.
5 A is correct. Ordinal data are categorical values that can be logically ordered or
ranked. In this case, the classification of sentences in the earnings call transcript
into three categories (negative, neutral, or positive) describes ordinal data,
as the data can be logically ordered from positive to negative. B is incorrect
because discrete data are numerical values that result from a counting process.
In this case, the analyst is categorizing sentences (i.e., unstructured data) from
the earnings call transcript as having negative, neutral, or positive sentiment.
Thus, these categorical data do not represent discrete data. C is incorrect
because nominal data are categorical values that are not amenable to being
organized in a logical order. In this case, the classification of unstructured data
(i.e., sentences from the earnings call transcript) into three categories (negative,
neutral, or positive) describes ordinal (not nominal) data, as the data can be
logically ordered from positive to negative.
6 B is correct. Time-­series data are a sequence of observations of a specific vari-
able collected over time and at discrete and typically equally spaced intervals of
time, such as daily, weekly, monthly, annually, and quarterly. In this case, each

column is a time series of data that represents annual total return (the specific
variable) for a given country index, and it is measured annually (the discrete
interval of time). A is incorrect because panel data consist of observations
through time on one or more variables for multiple observational units. The
entire table of data is an example of panel data showing annual total returns
(the variable) for three country indexes (the observational units) by year. C is
incorrect because cross-­sectional data are a list of the observations of a specific
variable from multiple observational units at a given point in time. Each row
(not column) of data in the table represents cross-­sectional data.
7 C is correct. Cross-­sectional data are observations of a specific variable from
multiple observational units at a given point in time. Each row of data in the
table represents cross-­sectional data. The specific variable is annual total return,
the multiple observational units are the three countries’ indexes, and the given
point in time is the time period indicated by the particular row. A is incor-
rect because panel data consist of observations through time on one or more
variables for multiple observational units. The entire table of data is an exam-
ple of panel data showing annual total returns (the variable) for three country
indexes (the observational units) by year. B is incorrect because time-­series data
are a sequence of observations of a specific variable collected over time and
at discrete and typically equally spaced intervals of time, such as daily, weekly,
monthly, annually, and quarterly. In this case, each column (not row) is a time
series of data that represents annual total return (the specific variable) for a
given country index, and it is measured annually (the discrete interval of time).
8 A is correct. Panel data consist of observations through time on one or more
variables for multiple observational units. A two-­dimensional rectangular array,
or data table, would be suitable here as it is comprised of columns to hold
the variable(s) for the observational units and rows to hold the observations
through time. B is incorrect because a one-­dimensional (not a two-­dimensional
rectangular) array would be most suitable for organizing a collection of data
of the same data type, such as the time-­series data from a single variable. C is
incorrect because a one-­dimensional (not a two-­dimensional rectangular) array
would be most suitable for organizing a collection of data of the same data type,
such as the same variable for multiple observational units at a given point in
time (cross-­sectional data).
9 B is correct. In a frequency distribution, the absolute frequency, or simply the
raw frequency, is the actual number of observations counted for each unique
value of the variable. A is incorrect because the relative frequency, which is
calculated as the absolute frequency of each unique value of the variable divided
by the total number of observations, presents the absolute frequencies in terms
of percentages. C is incorrect because the relative (not absolute) frequency
provides a normalized measure of the distribution of the data, allowing compar-
isons between datasets with different numbers of total observations.
10 A is correct. The relative frequency is the absolute frequency of each bin
divided by the total number of observations. Here, the relative frequency is cal-
culated as: (12/60) × 100 = 20%. B is incorrect because the relative frequency of
this bin is (23/60) × 100 = 38.33%. C is incorrect because the cumulative relative
frequency of the last bin must equal 100%.
11 C is correct. The cumulative relative frequency of a bin identifies the fraction of
observations that are less than the upper limit of the given bin. It is determined
by summing the relative frequencies from the lowest bin up to and including
the given bin. The following exhibit shows the relative frequencies for all the
bins of the data from the previous exhibit:

Bin (%)                  Absolute Frequency    Relative Frequency    Cumulative Relative Frequency

−9.19 ≤ x < −5.45               1                   0.083                   0.083
−5.45 ≤ x < −1.71               2                   0.167                   0.250
−1.71 ≤ x <  2.03               4                   0.333                   0.583
 2.03 ≤ x <  5.77               3                   0.250                   0.833
 5.77 ≤ x ≤  9.47               2                   0.167                   1.000

The bin −1.71% ≤ x < 2.03% has a cumulative relative frequency of 0.583.
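As a sanity check, the relative and cumulative relative frequencies in the exhibit can be reproduced with a few lines of Python (a minimal sketch; the absolute frequencies are taken from the exhibit):

```python
# Relative frequency = absolute frequency / total observations;
# cumulative relative frequency = running sum of the relative frequencies.
abs_freq = [1, 2, 4, 3, 2]                 # absolute frequencies, lowest bin first
total = sum(abs_freq)                      # 12 observations in all

rel_freq = [f / total for f in abs_freq]

cum_rel = []
running = 0.0
for r in rel_freq:
    running += r
    cum_rel.append(running)

print([round(r, 3) for r in rel_freq])     # matches the Relative Frequency column
print([round(c, 3) for c in cum_rel])      # matches the Cumulative Relative Frequency column
```

The third cumulative value, 0.583, confirms the answer to question 11.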
12 C is correct. The marginal frequency of energy sector bonds in the portfolio is
the sum of the joint frequencies across all three levels of bond rating, so 100 +
85 + 30 = 215. A is incorrect because 27 is the relative frequency for energy
sector bonds based on the total count of 806 bonds, so 215/806 = 26.7%, not
the marginal frequency. B is incorrect because 85 is the joint frequency for AA
rated energy sector bonds, not the marginal frequency.
13 A is correct. The relative frequency for any value in the table based on the total
count is calculated by dividing that value by the total count. Therefore, the rela-
tive frequency for AA rated energy bonds is calculated as 85/806 = 10.5%.
B is incorrect because 31.5% is the relative frequency for AA rated energy
bonds, calculated based on the marginal frequency for all AA rated bonds, so
85/(32 + 25 + 85 + 100 + 28), not based on total bond counts. C is incorrect
because 39.5% is the relative frequency for AA rated energy bonds, calculated
based on the marginal frequency for all energy bonds, so 85/(100 + 85 + 30),
not based on total bond counts.
14 C is correct. Because 50 data points are in the histogram, the median return
would be the mean of the 50/2 = 25th and (50 + 2)/2 = 26th positions. The sum
of the return bin frequencies to the left of the 13% to 18% interval is 24. As a
result, the 25th and 26th returns will fall in the 13% to 18% interval.
15 C is correct. The mode of a distribution with data grouped in intervals is the
interval with the highest frequency. The three intervals of 3% to 8%, 18% to 23%,
and 28% to 33% all have a high frequency of 7.
16 A is correct. Twenty observations lie in the interval “0.0 to 2.0,” and six observa-
tions lie in the “2.0 to 4.0” interval. Together, they represent 26/48, or 54.17%, of
all observations, which is more than 50%.
17 A is correct. A bar chart that orders categories by frequency in descending
order and includes a line displaying cumulative relative frequency is called a
Pareto Chart. A Pareto Chart is used to highlight dominant categories or the
most important groups. B is incorrect because a grouped bar chart or clustered
bar chart is used to present the frequency distribution of two categorical vari-
ables. C is incorrect because a frequency polygon is used to display frequency
distributions.
18 C is correct. A word cloud, or tag cloud, is a visual device for representing
unstructured, textual data. It consists of words extracted from text with the size
of each word being proportional to the frequency with which it appears in the
given text. A is incorrect because a tree-­map is a graphical tool for displaying
and comparing categorical data, not for visualizing unstructured, textual data. B
is incorrect because a scatter plot is used to visualize the joint variation in two
numerical variables, not for visualizing unstructured, textual data.

19 C is correct. A tree-­map is a graphical tool used to display and compare


categorical data. It consists of a set of colored rectangles to represent distinct
groups, and the area of each rectangle is proportional to the value of the cor-
responding group. A is incorrect because a line chart, not a tree-­map, is used
to display the change in a data series over time. B is incorrect because a scatter
plot, not a tree-­map, is used to visualize the joint variation in two numerical
variables.
20 B is correct. An important benefit of a line chart is that it facilitates showing
changes in the data and underlying trends in a clear and concise way. Often a
line chart is used to display the changes in data series over time. A is incorrect
because a scatter plot, not a line chart, is used to visualize the joint variation in
two numerical variables. C is incorrect because a heat map, not a line chart, is
used to visualize the values of joint frequencies among categorical variables.
21 B is correct. A heat map is commonly used for visualizing the degree of cor-
relation between different variables. A is incorrect because a word cloud, or
tag cloud, not a heat map, is a visual device for representing textual data with
the size of each distinct word being proportional to the frequency with which
it appears in the given text. C is incorrect because a histogram, not a heat map,
depicts the shape, center, and spread of the distribution of numerical data.
22 B is correct. A bubble line chart is a version of a line chart in which data points are replaced with varying-sized bubbles to represent a third dimension of the data. Such a chart is very effective at visualizing trends in three or more variables over time. A is incorrect because a heat map differentiates high values
from low values and reflects the correlation between variables but does not
help in making comparisons of variables over time. C is incorrect because a
scatterplot matrix is a useful tool for organizing scatterplots between pairs of
variables, making it easy to inspect all pairwise relationships in one combined
visual. However, it does not help in making comparisons of these variables over
time.
23 C is correct. The median of Portfolio R is 0.8% higher than the mean for
Portfolio R.
24 C is correct. The portfolio return must be calculated as the weighted mean
return, where the weights are the allocations in each asset class:
(0.20 × 8%) + (0.40 × 12%) + (0.25 × −3%) + (0.15 × 4%) = 6.25%, or ≈ 6.3%.

25 A is correct. The geometric mean return for Fund Y is found as follows:


Fund Y = [(1 + 0.195) × (1 − 0.019) × (1 + 0.197) × (1 + 0.350) × (1 + 0.057)]^(1/5) − 1
= 14.9%.

26 A is correct. The harmonic mean is appropriate for determining the average


price per unit. It is calculated by summing the reciprocals of the prices, then
averaging that sum by dividing by the number of prices, then taking the recip-
rocal of the average:
4/[(1/62.00) + (1/76.00) + (1/84.00) + (1/90.00)] = €76.48.
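The same harmonic mean can be verified in a couple of lines of Python (a minimal sketch):

```python
# Harmonic mean = n / (sum of reciprocals of the observations).
prices = [62.00, 76.00, 84.00, 90.00]      # purchase price per unit, Years 1-4

harmonic_mean = len(prices) / sum(1 / p for p in prices)
print(round(harmonic_mean, 2))             # 76.48
```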

27 B is correct. Quintiles divide a distribution into fifths, with the fourth quintile
occurring at the point at which 80% of the observations lie below it. The fourth
quintile is equivalent to the 80th percentile. To find the yth percentile (Py),

we first must determine its location. The formula for the location (Ly) of a yth
percentile in an array with n entries sorted in ascending order is Ly = (n + 1) ×
(y/100). In this case, n = 10 and y = 80%, so
L80 = (10 + 1) × (80/100) = 11 × 0.8 = 8.8.

With the data arranged in ascending order (−40.33%, −5.02%, 9.57%, 10.02%,
12.34%, 15.25%, 16.54%, 20.65%, 27.37%, and 30.79%), the 8.8th position would
be between the 8th and 9th entries, 20.65% and 27.37%, respectively. Using
linear interpolation, P80 = X8 + (Ly − 8) × (X9 − X8),
P80 = 20.65 + (8.8 − 8) × (27.37 − 20.65)
= 20.65 + (0.8 × 6.72) = 20.65 + 5.38
= 26.03%.
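The location formula and linear interpolation can be wrapped in a small helper (a sketch of this particular percentile convention; other conventions exist):

```python
def percentile(data, y):
    """yth percentile using L_y = (n + 1) * (y / 100) with linear interpolation."""
    s = sorted(data)
    n = len(s)
    loc = (n + 1) * (y / 100)      # 1-based location in the sorted array
    lo = int(loc)                  # 1-based index of the entry at or below loc
    if lo < 1:                     # location before the first observation
        return s[0]
    if lo >= n:                    # location at or beyond the last observation
        return s[-1]
    frac = loc - lo
    return s[lo - 1] + frac * (s[lo] - s[lo - 1])

msci = [15.25, 10.02, 20.65, 9.57, -40.33, 30.79, 12.34, -5.02, 16.54, 27.37]
print(round(percentile(msci, 80), 2))      # 26.03
```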

28 A is correct. The formula for mean absolute deviation (MAD) is

MAD = (Σ|Xi − X̄|)/n, with the sum taken over i = 1 to n.

Column 1: Sum the annual returns and divide by n to find the arithmetic mean (X̄) of 16.40%.
Column 2: Calculate the absolute value of the difference between each year's return and the mean from Column 1. Sum the results and divide by n to find the MAD.
These calculations are shown in the following exhibit:

            Column 1      Column 2
Year        Return        |Xi − X̄|

Year 6      30.79%        14.39%
Year 7      12.34%         4.06%
Year 8      −5.02%        21.42%
Year 9      16.54%         0.14%
Year 10     27.37%        10.97%

Sum:        82.02%        50.98%
n:           5
X̄:         16.40%        MAD: 10.20%
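The MAD arithmetic can be sketched directly (the returns are Years 6–10 from the exhibit):

```python
# MAD = average absolute deviation from the arithmetic mean.
returns = [30.79, 12.34, -5.02, 16.54, 27.37]    # Year 6 through Year 10, in %

mean = sum(returns) / len(returns)               # 16.404%
mad = sum(abs(r - mean) for r in returns) / len(returns)
print(round(mad, 2))                             # about 10.20%
```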

29 C is correct. The mean absolute deviation (MAD) of Fund ABC’s returns is


greater than the MAD of both of the other funds.
MAD = (Σ|Xi − X̄|)/n, where X̄ is the arithmetic mean of the series.

MAD for Fund ABC
= (|−20 − (−4)| + |23 − (−4)| + |−14 − (−4)| + |5 − (−4)| + |−14 − (−4)|)/5 = 14.4%

MAD for Fund XYZ
= (|−33 − (−10.8)| + |−12 − (−10.8)| + |−12 − (−10.8)| + |−8 − (−10.8)| + |11 − (−10.8)|)/5 = 9.8%

MAD for Fund PQR
= (|−14 − (−5)| + |−18 − (−5)| + |6 − (−5)| + |−2 − (−5)| + |3 − (−5)|)/5 = 8.8%
A and B are incorrect because the range and variance of the three funds are as
follows:

Fund ABC Fund XYZ Fund PQR

Range 43% 44% 24%


Variance 317 243 110

The numbers shown for variance are understood to be in “percent squared”


terms so that when taking the square root, the result is standard deviation in
percentage terms. Alternatively, by expressing standard deviation and variance
in decimal form, one can avoid the issue of units. In decimal form, the vari-
ances for Fund ABC, Fund XYZ, and Fund PQR are 0.0317, 0.0243, and 0.0110,
respectively.
30 B is correct. The coefficient of variation (CV) is the ratio of the standard devia-
tion to the mean, where a higher CV implies greater risk per unit of return.
CVUTIL = s/X̄ = 1.23%/2.10% = 0.59

CVMATR = s/X̄ = 1.35%/1.25% = 1.08

CVINDU = s/X̄ = 1.52%/3.01% = 0.50
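A quick check of the three ratios (a minimal sketch; the sector keys are illustrative):

```python
# Coefficient of variation = standard deviation / mean;
# a higher CV implies greater risk per unit of return.
sectors = {"UTIL": (2.10, 1.23), "MATR": (1.25, 1.35), "INDU": (3.01, 1.52)}  # (mean, s), in %

cv = {name: s / mean for name, (mean, s) in sectors.items()}
riskiest = max(cv, key=cv.get)
print({k: round(v, 2) for k, v in cv.items()})
print(riskiest)                                  # MATR
```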
31 A is correct. The more dispersed a distribution, the greater the difference between the arithmetic mean and the geometric mean.
32 B is correct. The distribution is thin-­tailed relative to the normal distribution
because the excess kurtosis is less than zero.
33 B is correct. The geometric mean compounds the periodic returns of every
period, giving the investor a more accurate measure of the terminal value of an
investment.
34 B is correct. The sum of the returns is 30.0%, so the arithmetic mean is
30.0%/10 = 3.0%.
35 B is correct.
Year Return 1+ Return

1 4.5% 1.045
2 6.0% 1.060
3 1.5% 1.015
4 −2.0% 0.980
5 0.0% 1.000
6 4.5% 1.045
7 3.5% 1.035
8 2.5% 1.025
9 5.5% 1.055
10 4.0% 1.040

The product of the (1 + Return) values is 1.3402338.

Therefore, X̄G = (1.3402338)^(1/10) − 1 = 2.9717%.
36 A is correct.
Year Return 1+ Return 1/(1+Return)

1 4.5% 1.045 0.957


2 6.0% 1.060 0.943
3 1.5% 1.015 0.985
4 −2.0% 0.980 1.020
5 0.0% 1.000 1.000
6 4.5% 1.045 0.957
7 3.5% 1.035 0.966
8 2.5% 1.025 0.976
9 5.5% 1.055 0.948
10 4.0% 1.040 0.962
Sum 9.714

The harmonic mean return = n/(sum of reciprocals) − 1 = (10/9.714) − 1 = 2.9442%.

37 B is correct.
Year Return Deviation Deviation Squared

1 4.5% 0.0150 0.000225


2 6.0% 0.0300 0.000900
3 1.5% −0.0150 0.000225
4 −2.0% −0.0500 0.002500
5 0.0% −0.0300 0.000900
6 4.5% 0.0150 0.000225
7 3.5% 0.0050 0.000025
8 2.5% −0.0050 0.000025
9 5.5% 0.0250 0.000625
10 4.0% 0.0100 0.000100
Sum 0.0000 0.005750

The standard deviation is the square root of the sum of the squared deviations
divided by n − 1:
s = √(0.005750/9) = 2.5276%.
38 B is correct.

Year      Return      Squared Deviation below Target of 2%

1 4.5%
2 6.0%
3 1.5% 0.000025
4 −2.0% 0.001600
5 0.0% 0.000400
6 4.5%
7 3.5%
8 2.5%
9 5.5%
10 4.0%
Sum 0.002025

The target semideviation is the square root of the sum of the squared deviations from the target, divided by n − 1:

sTarget = √(0.002025/9) = 1.5%.
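All five statistics in questions 34–38 can be verified together with the standard library (a minimal sketch; returns are in decimal form):

```python
import math

returns = [0.045, 0.060, 0.015, -0.020, 0.000,
           0.045, 0.035, 0.025, 0.055, 0.040]   # the fund's ten annual returns
n = len(returns)

arithmetic = sum(returns) / n                                            # 3.00%
geometric = math.prod(1 + r for r in returns) ** (1 / n) - 1             # ~2.97%
harmonic = n / sum(1 / (1 + r) for r in returns) - 1                     # ~2.94%
stdev = math.sqrt(sum((r - arithmetic) ** 2 for r in returns) / (n - 1))  # ~2.53%

target = 0.02
below = [(r - target) ** 2 for r in returns if r < target]
semidev = math.sqrt(sum(below) / (n - 1))                                # 1.50%

for name, v in [("arithmetic mean", arithmetic), ("geometric mean", geometric),
                ("harmonic mean", harmonic), ("standard deviation", stdev),
                ("target semideviation", semidev)]:
    print(f"{name}: {v:.4%}")
```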
39 B is correct. The correlation coefficient is positive, indicating that the two series
move together.
40 C is correct. Both outliers and spurious correlation are potential problems with
interpreting correlation coefficients.
41 C is correct. The correlation coefficient is positive because the covariation is
positive.
42 A is correct. The correlation coefficient is negative because the covariation is
negative.
43 C is correct. The correlation coefficient is positive because the covariance is
positive. The fact that one or both variables have a negative mean does not
affect the sign of the correlation coefficient.
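For questions 41 and 42, the sign check can be extended to the full correlation coefficient, which scales the covariance by both standard deviations (values from the exhibit):

```python
# Correlation = covariance / (s_x * s_y); the denominator is always positive,
# so the sign of the correlation is the sign of the covariance.
s_port, s_bond, s_re = 8.2, 3.4, 10.3        # standard deviations of returns, in %

corr_bond = 18.9 / (s_port * s_bond)         # positive covariance -> positive correlation
corr_re = -55.9 / (s_port * s_re)            # negative covariance -> negative correlation
print(round(corr_bond, 2), round(corr_re, 2))
```

Both results fall in the [−1, +1] interval, as a correlation coefficient must.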
44 B is correct. The median is indicated within the box, which is the 100.49 in this
diagram.
45 C is correct. The interquartile range is the difference between 114.25 and 79.74,
which is 34.51.
46 B is correct. The coefficient of variation is the ratio of the standard deviation to
the arithmetic average: √0.001723/0.09986 = 0.0415/0.09986 = 0.416.
47 C is correct. The skewness is positive, so it is right-­skewed (positively skewed).
48 C is correct. The excess kurtosis is positive, indicating that the distribution is
“fat-­tailed”; therefore, there is more probability in the tails of the distribution
relative to the normal distribution.
