Thursday, March 16, 2017

35. DESCRIPTIVE STATISTICS

OBJECTIVE

Analyze the distribution of one or several variables in a data set.


DESCRIPTION

In statistical analysis the first step is to analyze the available data. This step is also useful to check for outliers or for the assumption of normality to use these data for a particular statistical model or test (see 36. INTRODUCTION TO REGRESSIONS). Since the analysis of these assumptions is included in the chapter introducing regressions, here I will focus on the descriptive statistics that are useful for describing numeric variables:


Statistic
Description
Mean
Arithmetic mean of the data
Standard Error
Represents the difference between the expected value and the actual value
Median
Central value (the value that divides the data in two – in the case of an even number of values, the median is the mean of the two central values)
Mode
Most frequent value
Standard Deviation
A measure of how values are spread out. Mathematically, it is the square root of the variance
Sample Variance
Average of the squared differences between each value and the mean (it is also a measure of how values are spread out)
Kurtosis
A measure of the “peakedness” and flatness of the distribution *. “0” means that the shape is that of a normal distribution, a flatter distribution has negative kurtosis, and a more peaked distribution has positive kurtosis
Skewness
A measure of the symmetry of the distribution. “0” means that the distribution is symmetrical. If the value is negative, the distribution has a long tail on the left, and if it is positive, it has a long tail on the right. As a rule of thumb, a distribution is considered to be symmetrical if the kurtosis is between 1 and -1
Range
The difference between the largest and the smallest value
Minimum
The smallest value
Maximum
The largest value
Sum
The sum of values
Count
The number of values

As shown in the template, these statistics can be calculated either using the Excel complement “Data Analysis” or using the Excel functions. The same is valid for creating a histogram, with which we can analyze the frequency of values and gain an idea of the type of distribution. In Figure 30 a sample including age data is represented in a histogram. On the right a box plot provides more information, dividing our data into quartiles (grouping the values into 4 groups containing 25% of the values). The plot shows that 50% of people are aged approximately between 33 and 46 years, while the rest are spread across a bigger range of ages (25% from 46 to 64 and 25% from 18 to 33).


Descriptive Statistics Graphs

Histogram and Box Plot

In the template we can see how the two graphs have been created. For the histogram we need to decide which age groups we want to use and fill a table with them. Then we can use the formula “=FREQUENCY” by selecting all the cells on the right of the age groups and pressing “SHIFT + CONTROL + ENTER,” and the formula will provide the frequencies. For the box plot we have to make some calculations and perform some tricks using a normal column chart if we have an older version than Excel 2016. The template and several tutorials can be consulted on the Internet.

Finally, we may have to identify which kind of distribution our data approximate the most (for example to conduct a Monte Carlo simulation). There is no specific method, but we can start by using a histogram and comparing the shape of our data with the shapes of theoretical distributions. The following URL provides 22 Excel templates with graphs and data of different distributions: http://www.quantitativeskills.com/sisa/rojo/distribs.htm.

If our variables are categorical, we can analyze them using a frequency table (count and percentage frequencies). We can also analyze the distribution of frequencies. In the case that our variables are ordinal, we should use the same method for categorical variables (for example if the categories are the answer to a satisfaction question with ordinal answers like “very bad,” “bad,” etc.). However, in some cases we may want to analyze ordinal variables with statistics used for numerical ones (for example, if we are analyzing answers to a question about the quality of services on a scale from 1 to 10, it can be interesting to calculate the average score, range, etc.).



TEMPLATE




* Even if Kurtosis has been traditionally explained in terms of peakedness/flatness it has been demonstrated that this is incorrect since it’s the tails that mostly account for it, not the central peak. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4321753/



2 comments:

  1. But kurtosis does not measure peakedness or flatness at all. You can have an infinitely peaked distribution with negative excess kurtosis, and you can have a flat peak wit infinite kurtosis. Kurtosis only measures outliers, or outlier-producing potential, of distribution. See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4321753/

    ReplyDelete
  2. Thank you Peter for correcting this, in fact even if this is a widespread concept about kurtosis, I think you made a good case about why this is not correct

    ReplyDelete