OBJECTIVE
Analyze the
distribution of one or several variables in a data set.
DESCRIPTION
In statistical analysis the first step is to
analyze the available data. This step is also useful to check for outliers or
for the assumption of normality to use these data for a particular statistical
model or test (see 36.
INTRODUCTION TO REGRESSIONS). Since the analysis of these assumptions is
included in the chapter introducing regressions, here I will focus on the
descriptive statistics that are useful for describing numeric variables:
Statistic
|
Description
|
Mean
|
Arithmetic mean of the data
|
Standard Error
|
Represents the difference between the expected
value and the actual value
|
Median
|
Central value (the value that divides the data
in two – in the case of an even number of values, the median is the mean of
the two central values)
|
Mode
|
Most frequent value
|
Standard
Deviation
|
A measure of how values are spread out.
Mathematically, it is the square root of the variance
|
Sample
Variance
|
Average of the squared differences between
each value and the mean (it is also a measure of how values are spread out)
|
Kurtosis
|
A measure of the “peakedness” and flatness of
the distribution *. “0” means that the shape is that of a normal distribution,
a flatter distribution has negative kurtosis, and a more peaked distribution
has positive kurtosis
|
Skewness
|
A measure of the symmetry of the distribution.
“0” means that the distribution is symmetrical. If the value is negative, the
distribution has a long tail on the left, and if it is positive, it has a
long tail on the right. As a rule of thumb, a distribution is considered to
be symmetrical if the kurtosis is between 1 and -1
|
Range
|
The difference between the largest and the
smallest value
|
Minimum
|
The smallest value
|
Maximum
|
The largest value
|
Sum
|
The sum of values
|
Count
|
The number of values
|
As shown in the
template, these statistics can be calculated either using the Excel complement
“Data Analysis” or using the Excel functions. The same is valid for creating a
histogram, with which we can analyze the frequency of values and gain an idea
of the type of distribution. In Figure
30 a sample including age data is represented in a
histogram. On the right a box plot provides more information, dividing our data
into quartiles (grouping the values into 4 groups containing 25% of the
values). The plot shows that 50% of people are aged approximately between 33
and 46 years, while the rest are spread across a bigger range of ages (25% from
46 to 64 and 25% from 18 to 33).
Histogram and Box Plot
In the template we can
see how the two graphs have been created. For the histogram we need to decide which
age groups we want to use and fill a table with them. Then we can use the
formula “=FREQUENCY” by selecting all the cells on the right of the age groups
and pressing “SHIFT + CONTROL + ENTER,” and the formula will provide the
frequencies. For the box plot we have to make some calculations and perform some
tricks using a normal column chart if we have an older version than Excel 2016.
The template and several tutorials can be consulted on the Internet.
Finally, we may have
to identify which kind of distribution our data approximate the most (for
example to conduct a Monte Carlo simulation). There is no specific method, but we
can start by using a histogram and comparing the shape of our data with the
shapes of theoretical distributions. The following URL provides 22 Excel
templates with graphs and data of different distributions: http://www.quantitativeskills.com/sisa/rojo/distribs.htm.
If our variables are categorical, we can
analyze them using a frequency table (count and percentage frequencies). We can
also analyze the distribution of frequencies. In the case that our variables are ordinal, we should use
the same method for categorical variables (for example if the categories are
the answer to a satisfaction question with ordinal answers like “very bad,”
“bad,” etc.). However, in some cases we may want to analyze ordinal variables
with statistics used for numerical ones (for example, if we are analyzing answers
to a question about the quality of services on a scale from 1 to 10, it can be
interesting to calculate the average score, range, etc.).
TEMPLATE
* Even if Kurtosis has
been traditionally explained in terms of peakedness/flatness it has been
demonstrated that this is incorrect since it’s the tails that mostly account
for it, not the central peak. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4321753/
But kurtosis does not measure peakedness or flatness at all. You can have an infinitely peaked distribution with negative excess kurtosis, and you can have a flat peak wit infinite kurtosis. Kurtosis only measures outliers, or outlier-producing potential, of distribution. See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4321753/
ReplyDeleteThank you Peter for correcting this, in fact even if this is a widespread concept about kurtosis, I think you made a good case about why this is not correct
ReplyDelete