EJ1165803


A peer-reviewed electronic journal.

Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission
is granted to distribute this article for nonprofit, educational purposes if it is copied in its entirety and the journal is credited. PARE has the
right to authorize third party reproduction of this article in print, electronic and database forms.
Volume 22 Number 13, December 2017 ISSN 1531-7714

An Evaluation of Normal Versus Lognormal Distribution in Data Description and Empirical Analysis
Rekha Diwakar, University of Sussex

Many existing methods of statistical inference and analysis rely heavily on the assumption that the data are normally distributed. However, the normality assumption is not fulfilled when dealing with data that do not contain negative values or are otherwise skewed, a common occurrence in diverse disciplines such as finance, economics, political science, sociology, philology, biology and physical and industrial processes. In this situation, a lognormal distribution may better represent the data than the normal distribution. In this paper, I revisit the key attributes of the normal and lognormal distributions, and demonstrate through an empirical analysis of the ‘number of political parties’ in India how logarithmic transformation can help in bringing lognormally distributed data closer to a normal distribution. The paper also provides further empirical evidence to show that many variables of interest to political and other social scientists could be better modelled using the lognormal distribution. More generally, the paper emphasises the potential for improved description and empirical analysis of quantitative data by paying more attention to its distribution, and complements previous publications in Practical Assessment, Research & Evaluation (PARE) on this subject.

Statistical analysis of empirical data is widespread in the literature, and is particularly useful in analysing and characterising random variations of the variables being studied. The frequency distribution of the data used in statistical analysis is a crucial factor which underpins the quality of the inference drawn from such an exercise. The normal or Gaussian distribution is the most well-known distribution in probability and statistics, and existing methods such as t-tests, ANOVA (analysis of variance) and linear regression rely heavily on the assumption of data being normally distributed[1]. Despite the importance of the normality assumption, many empirical studies do not explicitly test whether the data used is sufficiently close to being normally distributed before applying standard statistical techniques and methods. Further, a common practice is to use mean ± standard deviation to summarise and describe empirical data, even though the underlying principles or the data may suggest a skewed distribution.

Based on analysis of empirical data from various branches of science, Limpert et al. (2001:342) state that although it is commonly assumed that quantitative variability is generally bell shaped and symmetrical, in a number of cases the variability is clearly asymmetrical because subtracting three standard deviations from the mean produces negative values. Since many variables across diverse disciplines show a standard deviation

[1] These ‘parametric’ statistical procedures rely on assumptions about the shape of the distribution (for example a normal distribution). Statistical procedures whose validity does not depend on the underlying random variables having a special form are known as non-parametric. In general, non-parametric procedures are considered to be less powerful than parametric methods.
Practical Assessment, Research & Evaluation, Vol 22 No 13 Page 2
Diwakar, Evaluation of Normal versus Lognormal Distribution

that is higher than the mean, it follows that they can take negative values, if one assumes a normal distribution. However, the quality of such a fit is poor, given that the normal curve extends into the negative region, while the data do not (Taagepera, 1999:424). Some research has shown that parametric tests can be robust to modest violations of normality, but almost all analyses benefit from improving the normality of variables, particularly where substantial non-normality is present (Osborne, 2010). Log-transformation of data is a viable method available to researchers for improving normality of variables in data description and empirical analysis.

This paper examines the key attributes of the normal and lognormal distributions, and discusses their use in empirical research that is based on statistical inference. Through an empirical analysis of a large data set of the number of (political) parties in India (as an example of a much wider occurrence), it is shown that this variable is lognormally distributed, and how log-transformation can help in bringing the original data closer to a normal distribution. The paper also provides further empirical evidence to show that many variables of interest to political and other social scientists could be better modelled using the lognormal distribution. More generally, it stresses that scholars across disciplines can gain from paying more attention to the distribution of data before assuming normality.

Normal and Lognormal Distributions

The normal or the Gaussian distribution represents the well-known bell-shaped curve, which is characterised by arithmetic mean μ and the standard deviation σ. Its density function is symmetrical relative to the vertical axis passing through the mean μ, and the area under a normal distribution can be described in terms of μ ± σ. As with any continuous probability density function, the area under the curve must equal 1, and the area between two values of variable X, which follows the distribution, represents the probability that it lies between those two values. Since the normal distribution is symmetric, a known percentage of all possible values of X lie within ± a certain number of standard deviations of the mean. For example, 68.3% of the values of any normally distributed variable lie within the interval (μ - 1σ, μ + 1σ). Theoretically, the normal distribution covers the entire real number line running from minus infinity to plus infinity.

The probability of a value occurring within a certain interval in a normal distribution is more easily estimated by translating the X values into the standard normal distribution, which has a mean of 0 and a standard deviation of 1[2]. Any point x from a normal distribution can be converted to the standard normal distribution with the formula Z = (x - μ)/σ. The Z value for any value of x shows how many standard deviations it is away from the mean[3].

Naturally occurring distributions are rarely normal in shape, but the Central Limit Theorem (CLT) states that the sum of a large number of independent, identically distributed random variables with finite variance will be approximately normally distributed. Most theoretical arguments for the use of normal distribution are based on forms of central limit theorems, stating conditions under which the distribution of standardised sums of random variables tends to a unit normal distribution as the number of variables in the sum increases, that is, with conditions sufficient to ensure an asymptotic unit normal distribution (Johnson et al., 1994:85).

The CLT refers to the sum of independent random variables, but how do we address variables that represent products of variables? The logarithm of a product is the sum of the logs of the factors, and thus the log of a product of random variables that take only positive values tends to have a normal distribution, which makes the product itself follow a lognormal distribution. A key difference between the normal and the lognormal distribution is that the former is based on additive, and the latter on multiplicative, underlying effects, and taking logarithms enables us to change multiplication into addition[4]. As Limpert & Stahel (2011:5) point out, ‘Whereas additive effects lead

[2] For a discussion on the history of the normal and lognormal distributions, refer to Johnson et al. (1994).
[3] Tables of areas under the standard normal distribution are widely published, so that areas under any normal distribution can be found by translating the X values to Z values and then using the table for the standardised normal.
[4] Limpert et al. (2001:342) demonstrate the distinction between additive and multiplicative effects by throw of dice. Thus, adding the numbers on two dice leads to values from 2 to 12 with a mean of 7, and a symmetrical distribution: the additive effect. On the other hand, multiplying the two numbers leads to values between 1 and 36 with a highly skewed distribution: the multiplicative effect.
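The dice illustration in the footnote above can be checked by enumerating all 36 equally likely outcomes of two fair dice. The sketch below is a minimal illustration; the helper name `summarise` and the use of the population skewness formula are choices made here for the example, not part of Limpert et al.'s presentation.

```python
from itertools import product
from statistics import mean, median, pstdev

def summarise(values):
    """Return (min, max, mean, median, skewness) for a list of numbers."""
    m = mean(values)
    sd = pstdev(values)
    # Population skewness: average cubed standardised deviation.
    skew = sum(((v - m) / sd) ** 3 for v in values) / len(values)
    return min(values), max(values), m, median(values), skew

# All 36 equally likely outcomes of throwing two fair dice.
throws = list(product(range(1, 7), repeat=2))

sums = [a + b for a, b in throws]      # additive effect
products = [a * b for a, b in throws]  # multiplicative effect

sum_stats = summarise(sums)
product_stats = summarise(products)

print("sum:     min=%d max=%d mean=%.2f median=%.1f skew=%.2f" % sum_stats)
print("product: min=%d max=%d mean=%.2f median=%.1f skew=%.2f" % product_stats)
```

For the sums, the mean and median coincide and the skewness is zero up to rounding; for the products, the mean lies above the median and the skewness is positive, which is the right skew the footnote describes.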

to the normal distribution according to the Central Limit Theorem (CLT) in its additive form, …the superposition of many small random multiplicative effects results in a log-normally distributed random variable according to the multiplicative CLT that needs to be better known, and understood.’

Lognormal distribution is not new. Crow & Shimizu (1988:2) point out that Galton (1879) and McAlister (1879) initiated the study of the lognormal distribution in their papers relating it to the use of the geometric mean as an estimate of location. Aitchison & Brown (1957:100-105) provide many examples of lognormal distributions found in diverse disciplines such as economics (e.g. bank deposits), sociology (e.g. number of inhabitants of a town), biology (e.g. biological size), anthropometry (e.g. bodyweight), philology (e.g. number of words in a sentence) and physical and industrial processes (e.g. effective length of life of a material). Cabral & Mata (2003) found that the size distribution of Portuguese manufacturing firms was significantly right-skewed, evolving over time towards a lognormal distribution.

The features and mathematics of lognormal distribution have been described in detail by scholars (for example Aitchison & Brown, 1957; Shimizu & Crow, 1988): it is a distribution which is skewed to the right, whose probability density function starts at zero, increases to its mode and decreases thereafter. Formally, a random variable X is said to follow a lognormal distribution if log(X) follows a normal distribution. When a variable X can only take positive values, the arithmetic mean, median and mode may not be the same, and in particular, the arithmetic mean is affected heavily by the presence of large values in the data. When such a variable follows the lognormal distribution, the geometric mean typically represents the median value, while the arithmetic mean exceeds the median, reflecting the right skew in the distribution. When we use normal distribution, using arithmetic mean as a measure of central tendency is acceptable because in a symmetric distribution the arithmetic mean is the same as the median. However, for lognormally distributed data, geometric mean is more suitable because it represents the centre of the distribution of the logarithms (which is symmetric) and corresponds to the median (Taagepera, 1999:424).

Limpert et al. (2001:341) note that ‘Skewed distributions are particularly common when mean values are low, variances large, and values cannot be negative…Such skewed distributions often closely fit the log-normal distribution.’ Since many political and other social science variables can only take positive values, and some cannot take a value below a certain positive threshold, using normal distribution to describe and analyse these variables can lead to misleading interpretation. This issue can be addressed by taking logarithms of the data, since the logarithm tends to minus infinity as its argument approaches zero. Therefore, wherever our data can take values between 0 and +∞, taking logarithms transforms this range to -∞ to +∞, which is the range of normal distribution. Limpert & Stahel (2011:6) show that the use of lognormal distribution also enables savings in sample size and experimental effort that can be considerable.

In many cases, both normal and lognormal distributions can fit data that can only take positive values. This is likely to be the case where arithmetic mean is much larger than the standard deviation and coefficient of variation (CV) is low (Limpert et al., 2001:351)[5]. For example, refer to Figure 1, which plots the distribution of voter turnout across 199 countries for elections held during 1945-2014. The figure uses a kernel density smoothed curve to depict empirical probabilities, whereby each point of the estimated density function represents a weighted sum of the data frequencies in the vicinity of the point being estimated[6]. As can be seen, because the mean turnout at 70.8 is much higher than the standard deviation of 16.7, the distribution is reasonably close to normality and does not cause concern; this is also evident from a low CV of 0.24[7].

Logarithmic transformation

According to Limpert et al. (2001), the difficulty in interpreting and understanding logarithms and

[5] CV is standard deviation divided by the mean.
[6] Kernel density estimators approximate the density f(x) from observations on x. A kernel density curve represents a smoothed histogram, calculating the density at each point as it moves along the x-axis. It is also independent of the choice of origin corresponding to the location of the bins in a histogram (Stata Graphics Reference Manual, 2017).
[7] The probability of negative values occurring in a normal distribution is greater for higher values of CV.
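The kernel density footnote above describes each density value as a weighted sum of nearby observations. A minimal sketch of that idea follows, assuming a Gaussian kernel and a hand-picked bandwidth; the function name `make_kde`, the bandwidth value and the sample turnout figures are illustrative assumptions, not the settings or data behind Figure 1.

```python
import math

def make_kde(data, bandwidth):
    """Return a function estimating the density of `data` at a point x.

    Each observation contributes a Gaussian bump centred on itself;
    the estimate at x is the average height of those bumps at x.
    """
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

# Hypothetical turnout percentages, for illustration only.
turnout = [48.0, 55.5, 61.0, 66.2, 70.1, 72.4, 73.0, 75.8, 81.3, 88.9]
density = make_kde(turnout, bandwidth=6.0)

# Approximate the area under the curve with a simple Riemann sum.
step = 0.1
grid = [i * step for i in range(0, 1501)]  # 0.0 to 150.0
area = sum(density(x) for x in grid) * step

print("estimated density at 70%%: %.4f" % density(70.0))
print("area under curve: %.3f" % area)
```

The area check reflects the requirement that a density integrates to 1. A wider bandwidth gives a smoother curve and a narrower one follows the data more closely, so the choice affects how a plot such as Figure 1 looks.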

[Figure 1: kernel density estimate of voter turnout plotted against a normal density curve; x-axis: Voter Turnout (%), 0 to 100; y-axis: Density. N = 2509, Mean = 70.8, Median = 73.1, Std. Deviation = 16.7. Source: IDEA database.]
Figure 1. Voter turnout in 199 countries 1945-2014

inadequate methods of describing lognormal distribution might have led to an aversion to its use and adoption as against normal distribution[8]. They point out that most people prefer to think in terms of the original rather than the log-transformed data, and demonstrate the use of parameters allowing for characterisation of the data in the original (non-transformed) scale. To describe a lognormal distribution of X, usually the mean and the standard deviation of log(X) are used. Limpert et al. (2001:344) argue that there are clear advantages to using ‘back-transformed values’, which are in terms of the measured and not log-transformed data. They describe μ* = e^μ and σ* = e^σ, which are referred to as the median and multiplicative standard deviation of X. While μ*, the median of the lognormal distribution, is also the geometric mean of the untransformed distribution, σ* represents the multiplicative standard deviation which determines the shape of the distribution[9]. Since both μ* and σ* are in the units of the original measurement, they are easier to interpret, and the lognormal distribution can be described in terms of these parameters: 68.3% of the distribution is contained between μ*/σ* and μ*·σ*, 95.5% is contained between μ*/(σ*)^2 and μ*·(σ*)^2, and 99.7% is contained between μ*/(σ*)^3 and μ*·(σ*)^3.

Thus, by using multiplication and division of μ* and σ*, it is possible to describe a lognormal distribution in the same way as addition and subtraction of μ and σ helps in describing a normal distribution. According to Limpert et al. (2001:345), ‘…the most precise method for estimating the parameters μ* and σ* relies on log transformation. The mean and empirical standard deviation of the logarithms of the data are calculated and then back-transformed. These estimators are called x̄* and s*, where x̄* is the geometric mean of the data.’[10]

The question then is why we should care about choosing between normal and lognormal distributions in data description and empirical research. Firstly, many variables of interest to us across diverse disciplines represent multiplicative or interaction effects, and therefore may be better modelled using lognormal rather than normal distribution. For example, Brambor et al. (2005:2) state ‘Multiplicative interaction models are common in the quantitative political science literature. This is so for good reason. Institutional arguments frequently imply that the relationship between political inputs and outcomes varies depending on the institutional context.’ Similarly, Osborne (2010:3) notes that ‘Log-normal variables seem to be more common when outcomes are influenced by many independent factors (e.g., biological outcomes), also common in the social sciences.’ Secondly, since many variables of interest to scholars cannot take negative values, normal distribution, which ranges from minus to plus infinity, is usually not a good fit for the data. As Taagepera (1999:423) points out, ‘In principle, a lognormal distribution can be expected to yield a better fit than normal distribution wherever a variable faces a conceptual lower limit at zero.’ Thirdly, it has been reported that both parametric and nonparametric statistical tests tend to benefit from normally distributed data (Osborne, 2010; Zimmerman, 1998). Lastly, since normality is usually achievable by a simple

[8] Appendix A2 provides a comparison of the main properties of normal and lognormal distributions.
[9] Limpert et al. (2001:344-45) show that σ* is related to the coefficient of variation (CV) by a monotonic, increasing transformation. Thus, CV is a function of σ only.
[10] s* is referred to as multiplicative standard deviation (Limpert & Stahel, 2011).
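The back-transformed estimators described on this page can be sketched in a few lines. The example below assumes that the "empirical standard deviation" of the logs is the sample (n - 1) standard deviation; the function names and the three-value sample are hypothetical and serve only to make the arithmetic easy to follow.

```python
import math
from statistics import mean, stdev

def multiplicative_summary(data):
    """Return (x_star, s_star) for strictly positive data.

    x_star is the geometric mean (back-transformed mean of the logs);
    s_star is the multiplicative standard deviation (back-transformed
    sample standard deviation of the logs).
    """
    if any(x <= 0 for x in data):
        raise ValueError("log transformation needs strictly positive values")
    logs = [math.log(x) for x in data]
    return math.exp(mean(logs)), math.exp(stdev(logs))

def multiplicative_interval(x_star, s_star, k=1):
    """Interval from x_star / s_star**k to x_star * s_star**k.

    For lognormal data, k = 1, 2, 3 cover about 68.3%, 95.5% and 99.7%.
    """
    return x_star / s_star ** k, x_star * s_star ** k

# Hypothetical skewed sample, for illustration only.
sample = [1.0, 10.0, 100.0]
x_star, s_star = multiplicative_summary(sample)
low, high = multiplicative_interval(x_star, s_star, k=1)
print(x_star, s_star, low, high)
```

Because the interval is built by dividing and multiplying rather than subtracting and adding, its lower end can never fall below zero for positive data.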

logarithmic transformation, we can use measures of log-transformed data in respect of original values, which are relatively easy to estimate and interpret.

Below, I provide an empirical analysis of a large data set of the number of political parties (in India), an important variable of interest to political scientists, to demonstrate that this variable is lognormally distributed, and that a logarithmic transformation helps in bringing it closer to a normal distribution.

Modelling the Distribution of the Number of Parties in India

According to Taagepera (1999:427), ‘if one had to give a single number to characterize the politics of any country that employs competitive elections, it would be the number of parties active in its national assembly.’ Since the conceptual range of this variable extends from 1 to infinity, its logarithms are likely to be normally distributed[11]. India is the world’s largest democracy, where members of the lower house of the national parliament (the Lok Sabha) are elected from single member districts in different Indian states following the first-past-the-post electoral system (Diwakar, 2016). Table 1 presents summary statistics of the number of (contesting and effective) political parties in Indian national elections held between 1952 and 2004[12].

Table 1 shows that the number of contesting parties at state level has a mean of 103.5 and a standard deviation of 217.9, and assuming a normal distribution, its 95% data range (mean ± 2SD) would be -332.2 to 539.2, and about 32% of the distribution will be negative, which is theoretically impossible. Similarly, the 95% data range for the other two ‘number of parties’ variables also

Table 1. Number of parties in India 1952-2004

Variable | Description | N | Mean ± SD | 95% range (mean ± 2SD) | 99% range (mean ± 3SD)
1. Number of contesting parties, state level | Raw number of parties | 401 | 103.5 ± 217.9 | -332.2* to 539.2 | -550.2* to 757.2
2. Number of contesting parties, district level | Raw number of parties | 7187 | 9.3 ± 11.5 | -13.7* to 32.3 | -25.2* to 43.8
3. Effective number of parties, district level | Weighted by votes | 7187 | 2.7 ± 0.9 | 0.9* to 4.5 | 0.0* to 5.4

Notes:
(1) State level: The number of states in India has varied in different elections (as a result of reorganisation of state boundaries and creation of new states). Currently, there are 29 states and 7 centrally administered union territories. Each data point represents number of parties at the state level.
(2) District level: The number of electoral districts (constituencies) has varied in different elections. Currently, there are 543 electoral districts in India. Each data point represents number of parties at the district level.
(3) Anomalies: Values outside theoretically possible values (<1) are marked with an asterisk (bold italics in the original layout).
(4) SD is standard deviation.
Source: Author’s analysis of data sourced from Election Commission of India reports. Data sources and definitions of the variables are provided in Appendix A1.

[11] Since log 1 = 0.
[12] Two more national elections have taken place in India in 2009 and 2014. However, for the purpose of showing the distribution of the data, we have a large enough data set: greater than 7000 data points at the district level and 401 data points at the state level from the Indian general elections held during 1952-2004.
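The normal-theory ranges for the state-level number of contesting parties follow directly from the reported mean and standard deviation. The sketch below recomputes them and the share of a fitted normal distribution that falls below zero; the function names are hypothetical, and because the inputs are the rounded summary figures, results can differ from published bounds in the last decimal place.

```python
from statistics import NormalDist

def normal_range(mean, sd, k):
    """Symmetric interval from mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd

def share_below(mean, sd, threshold=0.0):
    """Probability mass a fitted normal places below `threshold`."""
    return NormalDist(mu=mean, sigma=sd).cdf(threshold)

# Rounded summary statistics for state-level contesting parties (Table 1).
mean, sd = 103.5, 217.9

low2, high2 = normal_range(mean, sd, 2)
low3, high3 = normal_range(mean, sd, 3)
negative_share = share_below(mean, sd)

print("mean ± 2SD: %.1f to %.1f" % (low2, high2))
print("mean ± 3SD: %.1f to %.1f" % (low3, high3))
print("share below zero: %.1f%%" % (100 * negative_share))
```

The share below zero depends only on the ratio of the mean to the standard deviation, that is, on the inverse of the CV, which is why a high CV signals a poor normal fit for a variable that cannot be negative.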

extends beyond the theoretically possible boundaries, if normal distribution is assumed[13][14].

Below, I show graphically that the distribution of the three variables shown in Table 1 is skewed, and demonstrate how logarithmic transformation can help in bringing it closer to a normal distribution. To do so, I use the empirical density distribution for these variables and contrast them to a normal distribution. In addition to kernel density curves, I also use probability-probability (P-P) charts to depict the respective distributions’ deviation from a normal distribution. The P-P chart compares an empirical cumulative distribution function of a variable with a specific theoretical cumulative distribution function. The closer the empirical observations are to the predicted diagonal line, the closer the distribution is to normal.

Figure 2(a) shows the distribution of the number of contesting parties measured at the state level in India. The distribution’s minimum point is 1, but it has many outliers towards the right tail. It is important to note that the highest value of the series is 2643, and the

[Figure 2: four panels, each comparing the data with a normal distribution. (a) Kernel density, original values (x-axis: Number of contesting parties; y-axis: Density). (b) Kernel density, log transformed values (x-axis: Log of number of contesting parties; y-axis: Density). (c) P-P plot, original values (x-axis: Empirical probability; y-axis: Normal distribution probability). (d) P-P plot, log transformed values (same axes).]
N = 401, Mean = 103.5, Median = 33.0, Std. Deviation = 217.9, x̄* = 29.9, s* = 5.2
Source: Author’s analysis of data sourced from Election Commission of India reports. Further details on definition of variables and data sources are provided in Appendix A1.
Figure 2. Number of Contesting Parties in India at State Level 1952-2004

[13] If 95% data interval contains these values, the 99% data range will also contain these theoretically infeasible values.
[14] The theoretical lower bound for number of parties is 1.
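The P-P comparison described on this page can be sketched as follows. The example assumes the plotting position (i + 0.5)/n for the empirical probabilities and a normal distribution fitted with the sample mean and standard deviation; other conventions exist, and the synthetic data are constructed for illustration rather than taken from the election data set.

```python
import math
from statistics import NormalDist, mean, stdev

def pp_points(data):
    """Return (empirical, theoretical) probability pairs for a normal P-P plot.

    The theoretical probabilities come from a normal distribution fitted
    with the sample mean and sample standard deviation of `data`.
    """
    ordered = sorted(data)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    return [((i + 0.5) / n, fitted.cdf(x)) for i, x in enumerate(ordered)]

def max_gap(points):
    """Largest vertical distance of any point from the diagonal."""
    return max(abs(emp - theo) for emp, theo in points)

# Synthetic right-skewed data: exponentials of evenly spaced normal
# quantiles, so the logs are normal by construction. Illustration only.
n = 200
std_normal = NormalDist()
raw = [math.exp(std_normal.inv_cdf((i + 0.5) / n)) for i in range(n)]
logged = [math.log(x) for x in raw]

raw_gap = max_gap(pp_points(raw))
log_gap = max_gap(pp_points(logged))
print("max gap, original values: %.3f" % raw_gap)
print("max gap, log values:      %.3f" % log_gap)
```

A small maximum gap means the points hug the diagonal; for this constructed sample the log-transformed values sit much closer to it than the original values do.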

standard deviation at 217.9 is much higher than the mean of 103.5. The series’ median is 33.0, and therefore the distribution is far from being normally distributed. The geometric mean or the transformed mean x̄* at 29.9 is much closer to the median, and s* at 5.2 is smaller than x̄*. Figure 2(b) shows the distribution of log of number of contesting parties at the state level, and it can be seen that log-transformation makes the distribution a more symmetric one[15]. The effect of log-transformation can be seen more clearly in the P-P charts, Figures 2(c) and 2(d), which show that while the original data deviates from a normal distribution, the log of the distribution is very close to being normally distributed.

Figure 3(a) shows the distribution of the number of contesting parties measured at the district level in India. The distribution is tall with a long right tail, and deviates from a normal fit, which is also visible from the P-P plot in Figure 3(c). The mean of the series is 9.3, the median 6.0, while the standard deviation is higher than the mean at 11.5. The geometric mean or the transformed mean x̄* is 6.9, which is much closer to the median of the distribution, and s* is 2.1, which is smaller than x̄*. Figure 3(b)

[Figure 3: four panels, each comparing the data with a normal distribution. (a) Kernel density, original values (x-axis: Number of contesting parties; y-axis: Density). (b) Kernel density, log transformed values (x-axis: Log of number of contesting parties; y-axis: Density). (c) P-P plot, original values (x-axis: Empirical probability; y-axis: Normal distribution probability). (d) P-P plot, log transformed values (same axes).]
N = 7187, Mean = 9.3, Median = 6.0, Std. Deviation = 11.5, x̄* = 6.9, s* = 2.1
Source: Author’s analysis of data sourced from Election Commission of India reports. Further details on definition of variables and data sources are provided in Appendix A1.
Figure 3. Number of Contesting Parties in India at District Level 1952-2004

[15] In this paper, I use natural logarithm to log-transform the data.

shows the distribution of natural log of the number of contesting parties at the district level, and we can see that the log-transformation makes the distribution almost a normal distribution. The P-P plot in Figure 3(d) shows that the log-transformed distribution lies almost fully on the diagonal representing proximity to the normal distribution, and as seen in the case of number of contesting parties at the state level, there is a marked improvement of the distribution’s fit with a normal distribution after log-transformation.

Taagepera (2008:127) points out that for some distributions with a lower conceptual limit of 1, a single log-transformation might not be enough to make it normal, and we might need to take a double log (or log of log) of the distribution to achieve normality. When the conceptual lower limit of a variable is not 0 but 1, taking logarithms once moves this limit at 1 to 0, and taking it twice would shift it to minus infinity, as is required for normal distribution[16]. For example, Taagepera (2008:128) finds that the estimator s* devised by Limpert et al. (2001), which must be at least 1 by definition, requires double log transformation to transform it to a fairly symmetrical distribution that approximates the normal distribution[17]. Below, I use the example of effective number of parties at the district level (referred to in Table 1) in India to demonstrate the effect of double log-transformation.

Figure 4 shows the distribution of effective number of parties in India at district level in terms of original values, log of original values, and log of log of original values. Figure 4(a) shows that the distribution of the original series deviates from being normal, is taller than the normal distribution, and has a long right tail. The log-transformed series in Figure 4(b) moves closer to the normal distribution, but is still taller than the normal distribution and has a mild right skew. Figure 4(c) shows that by using log of log of original values, the series becomes more symmetrical and resembles a normal distribution. The P-P plots in Figures 4(d) to 4(f) confirm this proposition, as the P-P plot of the log of the original values is closer to the normal distribution diagonal line, and the log of log of original values becomes almost a perfect normal distribution.

Other Examples of Lognormal Distributions

The analysis of the ‘number of parties’ is only one illustrative example of an important political science variable which is lognormally distributed. In Appendix A3, I provide further evidence that many other variables of interest to political and other social scientists could be better represented by lognormal rather than normal distribution. This evidence has been collated and analysed from data presented in published articles and databases (details are provided in Appendix A4)[18]. These variables cannot theoretically take negative values, and in some cases, cannot be less than 1 (for example size of a country’s population or legislature). However, as can be seen, the 95% data range for these variables, assuming a normal distribution, includes negative values or values which are outside theoretical limits. This indicates that the distribution for these variables will be skewed, and could be better represented by lognormal rather than a normal distribution.

Appendix A3 also shows the parameters x̄* and s* for the log transformed data for these variables, and where data was available, the resultant data range for the log transformed distribution. As can be seen, the transformed distribution does not contain theoretically impossible values, and therefore represents a better fit for the data. For example, for the variable District Magnitude in Appendix A3, the 95% interval for the original data assuming a normal distribution is -214 to 373, which includes theoretically impossible negative values, and the CV is relatively high at 1.85. After log-transformation, the 95% interval does not contain negative values, and represents a better fit with s* of 6.8. Similar improvements are seen for other variables, where log transformation brings the data within the permissible theoretical limits. Overall, this analysis shows that it is important to examine our data prior to undertaking statistical analysis and inference.

[16] Taagepera (2008:127) alerts us that for double log transformation, only natural logarithms should be used.
[17] Taagepera’s (2008:127) conclusion is based on graphing 61 values of s* presented in Limpert et al. (2001).
[18] The log-transformation was undertaken by the author using replication data, where available.
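The way a double log moves a lower limit of 1 towards minus infinity can be seen numerically. The sketch below uses natural logarithms throughout, in line with the footnote above; the function name `double_log` and the chosen test values are illustrative assumptions.

```python
import math

def double_log(x):
    """Return ln(ln(x)) for a value strictly greater than 1.

    A single log maps the lower limit 1 to 0; the second log maps that 0
    to minus infinity, so the limit itself cannot be transformed.
    """
    if x <= 1:
        raise ValueError("double log is only defined for values above 1")
    return math.log(math.log(x))

# Values approaching the conceptual lower limit of 1 from above.
for value in (5.0, 2.7, math.e, 1.5, 1.01, 1.0001):
    once = math.log(value)
    twice = double_log(value)
    print("x=%-8g ln(x)=%8.4f ln(ln(x))=%8.4f" % (value, once, twice))
```

One practical consequence is that an observation exactly equal to 1 has no double log, so any such values would need separate handling before the transformation is applied.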

[Figure 4: six panels, each comparing the data with a normal distribution. (a) Kernel density, original values (x-axis: Effective number of parties). (b) Kernel density, log transformed values (x-axis: Log of Effective number of parties). (c) Kernel density, log log transformed values (x-axis: Log of Log of Effective number of parties). (d) P-P plot, original values. (e) P-P plot, log transformed values. (f) P-P plot, log log transformed values. Kernel density panels plot Density on the y-axis; P-P panels plot Empirical probability on the x-axis against Normal distribution probability on the y-axis.]
N = 7187, Mean = 2.7, Median = 2.5, Std. Deviation = 0.9, x̄* = 2.6, s* = 1.3
Source: Author’s analysis of data sourced from Election Commission of India reports. Further details on definition of variables and data sources are provided in Appendix A1.
Figure 4. Effective Number of Parties in India at District Level 1952-2004

Can the choice of distribution affect regression results?¹⁹

Technically, Ordinary Least Squares (OLS) regression does not require the variables to be normally distributed; only the residuals or prediction errors need to be normally distributed.²⁰ Although normality is not required to obtain unbiased estimates of the regression coefficients, it ensures that hypothesis testing, i.e. the p-values for the t-test and F-test, is valid. The violation of normality of the regression residuals can often result from the distribution of the variables being significantly non-normal. Further, a significant violation of the normal distribution of the variables can indicate an inappropriate model specification, and also distort relationships and statistical tests of significance (Osborne & Waters, 2002). As Cohen et al. (2002:141) point out, one of the primary reasons for examining normality of residuals is to identify model misspecification or inappropriately influential cases rather than the normality or non-normality of the residuals themselves.

¹⁹ This discussion focuses on OLS regression. In other types of regression, there may not be requirements regarding distribution of the residuals or the variables.
²⁰ The residuals are defined as the differences between the observed response variable values and the values predicted by the estimated regression model.

Conclusion

In this paper, I have presented evidence, and provided arguments in favour of the use of lognormal rather than normal distribution in describing and interpreting skewed data in empirical research. This is consistent with Limpert et al. (2001:351) who state that wider knowledge of the lognormal distribution ‘would lead to a general preference for the log-normal, or multiplicative normal, distribution over the Gaussian distribution when describing original data.’ Our general preference for the normal distribution may be because it has been around for a longer time, and is considered easier to describe and interpret compared to the lognormal distribution. As Aitchison and Brown (1957:2) state, ‘Man has found addition an easier operation than multiplication, and so it is not surprising that an additive law of errors was the first to be formulated.’ However, as has been stressed in this paper, the characterisation of the lognormal distribution by parameters x̄* and s* (Limpert et al., 2001) offers several advantages to facilitate its use in data description and empirical analysis.

In principle, a lognormal distribution can be expected to yield a better fit than normal distribution whenever a variable faces a conceptual lower limit at zero. However, lognormal and normal distributions become quite similar when the latter’s standard deviation is many times smaller than the mean, in which case, for simplicity, we can shift to normal distribution (Taagepera, 1999). Researchers can benefit from visually inspecting their data (e.g. using kernel density or P-P plots), carrying out more formal statistical tests (e.g. the Kolmogorov-Smirnov test) to check for significant deviations from normality, and considering log transformation as part of the routine data cleaning process. For some variables with a conceptual lower limit of 1, taking logarithms not once, but twice may be required to bring the data closer to a normal distribution.

It is however important to acknowledge that the lognormal distribution may not always be the best model for skewed data, and it is appropriate to select a model that describes the variation of data, and use the corresponding optimal statistical procedures (Limpert & Stahel, 2011:6). While discussing various traditional transformation methods (e.g. square root, log, inverse), Osborne (2010) states that the Box-Cox transformation (Box & Cox, 1964) incorporates and extends the traditional options to help researchers find the optimal normalising transformation for their data. The Box-Cox transformation is based on the idea of having a range of power transformations to improve the efficacy of normalising and variance equalising for skewed data (Osborne, 2010).

Beyond encouraging a more active consideration of the lognormal distribution for describing and modelling variables, the intention of this paper is to motivate a more rigorous examination of data prior to undertaking empirical analysis. Taagepera (2008:125-126) provides some rules of thumb to help decide between normal and lognormal distributions, but in general it can be said that we can gain from paying more attention to the distribution of our empirical data.

Recommended Text

Taagepera, R (2008). Making Social Sciences More Scientific. New York. Oxford University Press.

References

Aitchison, J. & Brown, J.A.C (1957). The lognormal distribution with special reference to its use in economics. Cambridge University Press. Cambridge.
Box, G.E.P., & Cox, D.R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B (Methodological), 26(2):211-252.
Brambor, T., Clark, W.R & Golder, M (2001). Understanding Interaction Models: Improving Empirical Analyses. Political Analysis 13:1-20.
Cabral, L. M. B. & Mata, J (2003). On the Evolution of the Firm Size Distribution: Facts and Theory. The American Economic Review 93(4):1075-1090.

Cohen, J., Cohen, P., West, S. & Aiken, L. S (2002). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. NJ: Lawrence Erlbaum.
Diwakar, R. (2016). Change and Continuity in Indian Politics and the Indian Party System. Asian Journal of Comparative Politics. 2(4):327-346.
Election Commission of India election results reports – various years. Available at http://eci.nic.in/eci_main1/ElectionStatistics.aspx .
Galton, F. (1879). The geometric mean in vital and social statistics. Proceedings of the Royal Society of London 29:365-367.
International Institute for Democracy and Electoral Assistance (IDEA) database. http://www.idea.int/vt/viewdata.cfm , accessed 15 April 2016.
Johnson, N. L., Kotz, S. & Balakrishnan, N. (1994). Continuous Univariate Distributions – Volume I. New York. John Wiley & Sons.
Laakso, M. & Taagepera, R. (1979). Effective Number of Parties: A measure with Application to West Europe. Comparative Political Studies 12:3-27.
Lijphart, A. (1994). Electoral Systems and Party Systems: A Study of Twenty-Seven Democracies, 1945-1990. New York. Oxford University Press.
Limpert, E., Stahel, W.A., & Abbt, M. (2001). Log-normal Distributions across the Sciences: Keys and Clues. BioScience 51(5), 341-352.
Limpert, E. & Stahel, W. A. (2011). Problems with Using the Normal Distribution – and Ways to Improve Quality and Efficiency of Data Analysis. PLoS ONE 6(7): e21403.
McAlister, D. (1879). The law of the geometric mean. Proceedings of the Royal Society of London 29:367-376.
Osborne, J. W. (2010). Improving your data transformations: Applying Box-Cox transformations as a best practice. Practical Assessment Research & Evaluation, 15(12), 1-9. http://pareonline.net/getvn.asp?v=15&n=12
Osborne, J. W. (2013). Normality of residuals is a continuous variable, and does seem to influence the trustworthiness of confidence intervals. Practical Assessment, Research & Evaluation, 18(12). http://pareonline.net/getvn.asp?v=18&n=12
Osborne, J. W. & Waters, E. (2002). Four assumptions of multiple regression that researchers should always test. Practical Assessment, Research, & Evaluation, 8(2). http://pareonline.net/getvn.asp?v=8&n=2
Stata Graphics Reference Manual (2017). kdensity – Univariate kernel density estimation. Available at https://www.stata.com/manuals/rkdensity.pdf#rkdensity .
Taagepera, R. (1999). Ignorance-based quantitative models and their practical implications. Journal of Theoretical Politics 11(3): 421 – 431.
Zimmerman, D. W. (1998). Invalidation of parametric and nonparametric statistical tests by concurrent violation of two assumptions. Journal of Experimental Education, 67(1), 55-68.
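
The checks recommended in the Conclusion (comparing a sample with a fitted normal distribution before and after transformation, and trying a Box-Cox power transformation) might be sketched as follows. This is an illustration only, written with the Python standard library rather than the software used for the paper. The simulated sample, the function names and the chosen values of the power parameter are all assumptions made for the example. The function reports only the Kolmogorov-Smirnov distance; because the normal parameters are estimated from the same sample, standard Kolmogorov-Smirnov p-values would not apply directly.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def ks_distance_to_normal(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Compare the fitted CDF with the empirical CDF just before and just after x.
        gap = max(gap, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return gap


def box_cox(values, lam):
    """Box-Cox power transform for positive data; lam = 0 is the natural log."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# Hypothetical skewed, strictly positive sample (simulated, not the paper's data).
random.seed(0)
sample = [random.lognormvariate(1.0, 0.8) for _ in range(2000)]

d_raw = ks_distance_to_normal(sample)                 # no transformation
d_sqrt = ks_distance_to_normal(box_cox(sample, 0.5))  # square-root-type transform
d_log = ks_distance_to_normal(box_cox(sample, 0))     # log transform

print(f"KS distance, raw:  {d_raw:.3f}")
print(f"KS distance, sqrt: {d_sqrt:.3f}")
print(f"KS distance, log:  {d_log:.3f}")
```

A smaller distance indicates a sample closer to the fitted normal distribution. For data that are truly lognormal, the log transform (power parameter 0) would be expected to give the smallest distance of the three.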

Appendix A1 Description of variables and data sources for number of parties in India

Variable | Description | Data Source
Contesting parties at state level in India | Number of parties that contested elections at the state level. | Election Commission of India reports for parliamentary elections – various years.
Contesting parties at district level in India | Number of parties that contested elections at the district level. | Election Commission of India reports for parliamentary elections – various years.
Effective number of parties at district level in India | Effective number of parties at district level, calculated following Laakso & Taagepera’s (1979) method using share of votes: $1/\sum_i p_i^2$, where $p_i$ represents the vote or seat share of the ith party. | Raw data sourced from Election Commission of India reports for parliamentary elections – various years.
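
The effective number of parties formula in Appendix A1 can be illustrated with a short sketch. The function name and the example vote counts are hypothetical; the function accepts raw votes or shares and normalises them so that the shares sum to one before applying 1/Σp_i².

```python
def effective_number_of_parties(votes):
    """Laakso-Taagepera index: 1 / sum of squared shares.

    `votes` may hold raw vote (or seat) counts or shares; they are
    normalised here so that the shares sum to 1.
    """
    total = sum(votes)
    if total <= 0:
        raise ValueError("votes must sum to a positive number")
    shares = [v / total for v in votes]
    return 1.0 / sum(p * p for p in shares)


# Hypothetical district with three parties (illustrative numbers only).
example = effective_number_of_parties([45_000, 40_000, 15_000])
print(round(example, 2))
```

Two parties of equal size give a value of exactly 2, and a single party taking every vote gives 1, which is why the index has a conceptual lower limit of 1.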

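Appendix A2 Comparison of Normal and Lognormal distributions (Limpert et al., 2001: 345-46; Johnson et al., 1994:207)

Property | Normal distribution | Lognormal distribution
Functional form | $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ | $f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x-\mu)^2}{2\sigma^2}\right)$, $x > 0$
Shape | Symmetrical | Skewed
Effects (central limit theorem) | Additive | Multiplicative
Description: Mean | x̄, Arithmetic | x̄*, Geometric
Description: Standard deviation | SD, Additive | s*, Multiplicative
Measure of dispersion | CV = SD/x̄ | s*
Confidence interval: 68.3% | x̄ ± SD | x̄*/s* to x̄* × s*
Confidence interval: 95.5% | x̄ ± 2SD | x̄*/(s*)² to x̄* × (s*)²
Confidence interval: 99.7% | x̄ ± 3SD | x̄*/(s*)³ to x̄* × (s*)³

Note: (1) CV is Coefficient of Variation. (2) In the lognormal functional form, μ and σ are the mean and standard deviation of ln x.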
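
The lognormal column of Appendix A2 can be computed from raw data along the following lines. This is a minimal sketch: the function name is hypothetical, the sample standard deviation (n − 1 denominator) is an assumption, and the three input values are chosen only so that the arithmetic is easy to follow.

```python
import math
from statistics import mean, stdev


def multiplicative_summary(values):
    """Return x* (geometric mean), s* (multiplicative SD) and the k = 1, 2, 3 intervals.

    All values must be strictly positive, since logarithms are taken.
    """
    logs = [math.log(v) for v in values]
    x_star = math.exp(mean(logs))   # exponential of the mean of the logs
    s_star = math.exp(stdev(logs))  # exponential of the SD of the logs
    intervals = {
        k: (x_star / s_star ** k, x_star * s_star ** k)  # x* divided and multiplied by (s*)^k
        for k in (1, 2, 3)
    }
    return x_star, s_star, intervals


# Illustrative data: the logs are evenly spaced, so x* = 10 and s* = 10.
x_star, s_star, intervals = multiplicative_summary([1, 10, 100])
print(x_star, s_star, intervals[1])
```

Unlike x̄ ± k·SD, the lower bound x̄*/(s*)^k is always positive, so the interval cannot extend below a conceptual lower limit of zero.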

Appendix A3 Other Examples of Social Science Variables Used in Literature – Original and Log Transformed Data

Columns N to CV describe the original data, assuming normal distribution; columns x̄* to the final 99% range describe the log-transformed data; the last column gives the source of information (Appendix A4 reference).

Category/variables | N | x̄ ± SD | 95% range (x̄ ± 2SD) | 99% range (x̄ ± 3SD) | CV | x̄* | s* | 95% range (x̄* ×/ (s*)²) | 99% range (x̄* ×/ (s*)³) | Source
A. Government features
1. Government duration (days) | 1242 | 633±506 | -378 to 1644 | -884 to 2150 | 0.80 | 415.7 | 2.9 | 50 to 3481 | 17 to 10072 | A4.1
2. Government duration (days) | 1005 | 606±488 | -370 to 1582 | -858 to 2171 | 0.81 | 399.4 | 2.9 | 48 to 3341 | 17 to 9662 | A4.2
3. Women ministers in cabinet (%) | 723 | 7.3±6.7 | -6.2 to 20.8 | -12.9 to 27.5 | 0.92 | 8.8 | 1.8 | 2.9 to 27.1 | 1.6 to 47.5 | A4.3
4. Executive years in Office | 723 | 10.6±7.8 | -5.0 to 26.2 | -12.8 to 34.0 | 0.74 | 6.4 | 2.6 | 0.9 to 44.4 | 0.3 to 117.2 | A4.4
B. Electoral system and legislature size
5. Electoral disproportionality index | 69 | 6.1±5.56 | -5.0 to 22.68 | -10.4 to 22.6 | 0.90 | 4.2 | 2.3 | 0.8 to 22.1 | 0.3 to 50.9 | A4.5
6. Effective electoral threshold | 69 | 11.5±11.7 | -12.0 to 35.0 | -23.7 to 46.7 | 1.02 | 6.4 | 3.4 | 0.6 to 74.2 | 0.2 to 252.4 | A4.6
7. District Magnitude | 69 | 80±147 | -214 to 373 | -361 to 520 | 1.85 | 17.3 | 6.8 | 0.4 to 798 | 0.1 to 5425 | A4.7
8. District Magnitude | 2449 | 11.6±22.8 | -34 to 57 | -57 to 80 | 1.97 | na | na | na | na | A4.8
9. Assembly Size | 69 | 223±187 | -150 to 783 | -337 to 783 | 0.84 | 144 | 2.9 | 17 to 1190 | 6 to 3422 | A4.9
10. Electoral competitiveness | 266 | 0.2±0.1 | -0.1 to 0.4 | -0.2 to 0.5 | 0.64 | na | na | na | na | A4.10
C. Political parties and party system
11. Average age of parties | 65 | 45.1±35.6 | -26 to 116 | -62 to 152 | 0.79 | na | na | na | na | A4.11
12. Effective number of legislative parties | 330 | 2.4±1.3 | -0.1 to 5.0 | -1.4 to 6.2 | 0.52 | 2.2 | 1.6 | 0.8 to 5.6 | 0.5 to 9.1 | A4.12
13. Effective number of parliamentary parties | 684 | 3.3±1.4 | 0.5 to 6.1 | -0.9 to 7.5 | 0.42 | na | na | na | na | A4.13
14. Effective number of parliamentary parties | 2288 | 4.4±1.9 | 0.7 to 8.1 | -1.2 to 10.0 | 0.42 | na | na | na | na | A4.14
D. Demographic / economic
15. Effective number of ethnic groups | 684 | 0.3±0.2 | -0.1 to 0.7 | -0.4 to 0.9 | 0.75 | na | na | na | na | A4.15
16. Number of registered voters (m) | 2531 | 15.6±45.5 | -75 to 107 | -121 to 152 | 2.91 | 2.7 | 9.5 | 0.03 to 247 | 0.001 to 2345 | A4.16
17. County Population (000) | 28272 | 82±271 | -460 to 624 | -732 to 896 | 3.30 | 25.8 | 4.0 | 1.6 to 411 | 0.4 to 1639 | A4.17
18. GDP per capita | 65 | 19.1±12.8 | -6.5 to 44.6 | -19.2 to 57.3 | 0.67 | na | na | na | na | A4.18

Notes: (1) x̄ is the arithmetic mean and SD is the standard deviation of the original data. (2) CV is the Coefficient of Variation, defined as the standard deviation divided by the mean of the original data. (3) Figures in bold and italics represent theoretical anomalies in the original data assuming normal distribution. (4) x̄* is the exponential of the mean of the log transformed data (geometric mean of the original data). (5) s* is the exponential of the standard deviation of the log transformed distribution. (6) na means that data for calculating log-transformed variables was not available. (7) s* for variables 16 and 17 are absolute values.
Source: Author’s analysis based on data sourced from published journal articles or database. See Appendix A4 for details.

Appendix A4 – Sources of information for variables shown in Appendix A3


A4.1 Government Duration
Seki, K., and L.K. Williams (2014). Updating the Party Government data set. Electoral Studies 34. 270-
79.
A4.2 Government Duration
Woldendorp, J., H. Keman and I. Budge (2011). Party Government in 40 Democracies 1945-2008.
Composition-Duration-Personnel.
A4.3 Share of women ministers in cabinet
Arriola, L, R., and M. C Johnson (2014). Ethnic Politics and Women’s Empowerment in Africa: Ministerial
Appointments to Executive Cabinets. American Journal of Political Science 58(2): 495–510.
A4.4 Executive Years in Office
Arriola, L, R., and M. C Johnson (2014). Ethnic Politics and Women’s Empowerment in Africa:
Ministerial Appointments to Executive Cabinets. American Journal of Political Science 58(2): 495–510.
A4.5 Electoral disproportionality (largest-deviation) index
Lijphart, A. (1994). Electoral Systems and Party Systems: A Study of Twenty-Seven Democracies,
1945-1990. Oxford University Press.

A4.6 Effective electoral threshold


Lijphart, A. (1994). Electoral Systems and Party Systems: A Study of Twenty-Seven Democracies,
1945-1990. Oxford University Press.
A4.7 District Magnitude
Lijphart, A. (1994). Electoral Systems and Party Systems: A Study of Twenty-Seven Democracies,
1945-1990. Oxford University Press.
A4.8 District Magnitude
West, K. J., and J. J. Spoon (2012). Credibility Versus Competition: The Impact of Party Size on
Decisions to Enter Presidential Elections in South America and Europe. Comparative Political Studies.
46(4) 513–539.
A4.9 Assembly Size
Lijphart, A. (1994). Electoral Systems and Party Systems: A Study of Twenty-Seven Democracies,
1945-1990. Oxford University Press.
A4.10 Electoral Competitiveness
Canes-Wrone, B., and J. Park (2012). Electoral Business Cycles in OECD Countries. American Political
Science Review 106(1):102-122.
A4.11 Average age of parties
Wang, Ching-Hsing (2014). The effects of party fractionalization and party polarization on democracy.
Party Politics 20(5): 687–699.
A4.12 Effective number of legislative parties
Arriola, L, R., and M. C Johnson (2014). Ethnic Politics and Women’s Empowerment in Africa:
Ministerial Appointments to Executive Cabinets. American Journal of Political Science. 58(2) 495–510.
A4.13 Effective number of parliamentary parties
Mukherjee, N. (2011). Party systems and human well-being. Party Politics 19(4): 601–623.
A4.14 Effective number of parliamentary parties
West, K. J., and J. J. Spoon (2012). Credibility Versus Competition: The Impact of Party Size on
Decisions to Enter Presidential Elections in South America and Europe. Comparative Political Studies.
46(4) 513–539.
A4.15 Effective number of ethnic groups
Mukherjee, N. (2011). Party systems and human well-being. Party Politics 19(4) 601–623.
A4.16 Number of registered voters
International Institute for Democracy and Electoral Assistance (IDEA) database.
A4.17 County population
Burden, B. C., and A. Wichowsky (2014). Economic Discontent as a Mobilizer: Unemployment and
Voter Turnout. Journal of Politics 76(4). 887-898
A4.18 GDP per capita
Wang, Ching-Hsing (2014). The effects of party fractionalization and party polarization on democracy.
Party Politics 20(5):687–699.

Acknowledgement
The author wishes to thank two anonymous reviewers of this journal for their comments on the earlier version
of this paper.
Citation:
Diwakar, R. (2017). An Evaluation of Normal Versus Lognormal Distribution in Data Description and
Empirical Analysis. Practical Assessment, Research & Evaluation, 22(13). Available online:
http://pareonline.net/getvn.asp?v=22&n=13

Corresponding Author
Rekha Diwakar
Department of Politics
School of Law, Politics and Sociology
Freeman Building
University of Sussex
Brighton BN1 9QE

email: r.diwakar [at] sussex.ac.uk
