OBJECTIVE
Verify whether two (or more) groups are significantly different from each other, usually by comparing their means or medians.
DESCRIPTION
Generally speaking, statistical hypothesis testing covers all the techniques that test a null hypothesis against an alternative hypothesis. Although it also includes regressions, I will focus only on the testing performed on samples.
There are three main steps in hypothesis testing:
- Definition: identify the problem, study it, and formulate hypotheses;
- Experiment: choose and define the data collection technique and sampling method;
- Results and conclusion: check the data, choose the most appropriate test, analyze the results, and draw conclusions.
DEFINITION
The first step in hypothesis testing is to identify the problem and analyze it. The three main categories of hypothesis testing are:
- to test whether two samples are significantly different; for example, after conducting a survey in two hotels of the same hotel chain, we want to check whether the difference in average satisfaction is significant or not;
- to test whether a change in a factor has a significant impact on the sample by conducting an experiment (for example, to check whether a new therapy has better results than the traditional one);
- to test whether a sample taken from a population truly represents it (if the population’s parameters, e.g. the mean, are known); for example, if a production line is expected to produce objects with a specific weight, this can be checked by taking random samples and weighing them. If the average difference from the expected weight is statistically significant, it means that the machines should be revised (a code sketch of this case follows the list).
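As a minimal sketch of this third case, the following Python code runs a one-sample t-test with SciPy; the expected weight and the measured weights are invented for illustration.

# Sketch: one-sample t-test for the production-line example
# (illustrative numbers only; expected_weight is an assumption).
import numpy as np
from scipy import stats

expected_weight = 500.0                      # nominal weight in grams (hypothetical)
sample = np.array([498.2, 501.1, 497.5, 499.0, 502.3,
                   496.8, 500.4, 498.9, 497.1, 499.6])

# H0: the true mean weight equals expected_weight; Ha: it differs.
t_stat, p_value = stats.ttest_1samp(sample, popmean=expected_weight)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Significant difference: the machines should be revised.")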
After defining and studying the problem, we need to define the null hypothesis (H0) and the alternative hypothesis (Ha), which are mutually exclusive and represent the whole range of possibilities. We usually compare the means of the two samples, or the sample mean with the expected population mean. There are three possible hypothesis settings (all three are shown in the code sketch after the list):
- To test any kind of difference (positive or negative), the H0 is that there is no difference in the means (H0: μ = μ0 and Ha: μ ≠ μ0);
- To test just one kind of difference:
  - positive (H0: μ ≤ μ0 and Ha: μ > μ0);
  - negative (H0: μ ≥ μ0 and Ha: μ < μ0).
EXPERIMENT
The sampling technique is extremely important: it must be certain that the sample is randomly chosen (in general) and, in the case of an experiment, the participants must not know which group they have been placed in. Depending on the problem being tested and the test to be performed, different techniques are used to calculate the required sample size (check www.powerandsamplesize.com, which allows the calculation of the sample size for different kinds of tests).
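As a sketch of such a calculation, statsmodels can solve for the sample size of a two-sample t-test given an assumed effect size, α, and power; the values below are illustrative assumptions, not recommendations.

# Sketch: required sample size per group for a two-sample t-test,
# using statsmodels power analysis (effect size and power are assumptions).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.5,    # medium effect (Cohen's d)
                         alpha=0.05,         # significance level
                         power=0.8,          # 1 - beta
                         alternative="two-sided")
print(f"Required sample size per group: {n:.0f}")   # about 64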
RESULTS AND CONCLUSIONS
Once the data have been collected, it is necessary to check for outliers and missing data (see 36. INTRODUCTION TO REGRESSIONS) and choose the most appropriate test depending on the problem studied, the kind of variables, and their distribution. There are two main approaches to testing hypotheses:
- The frequentist approach: this makes assumptions about the population distribution and uses a null hypothesis and p-value to draw conclusions (almost all the methods presented here are frequentist);
- The Bayesian approach: this approach needs prior knowledge about the population or the sample, and the result is the probability of a hypothesis (see 42. BAYESIAN APPROACH TO HYPOTHESIS TESTING).
DEPENDENT VARIABLE | 1 SAMPLE | 2 SAMPLES, INDEPENDENT | 2 SAMPLES, DEPENDENT | >2 SAMPLES, INDEPENDENT | >2 SAMPLES, DEPENDENT | CORRELATION
DICHOTOMOUS | Test of proportions | X² | McNemar test | X² | Cochran's Q | Phi coefficient, contingency tables
CATEGORICAL | X² | X² | | X² | |
ORDINAL | X² | Mann‒Whitney U test | Wilcoxon signed-rank test | Kruskal‒Wallis test, Wilcoxon rank sum test | Scheirer‒Ray‒Hare test (two-way), Friedman test (one-way) | Spearman’s correlation
INTERVAL OR RATIO | One-sample z-test or t-test | Two-sample t-test | Paired t-test | One-way ANOVA, two-way ANOVA | Repeated measures ANOVA | Pearson’s correlation
Summary of Parametric and Non-parametric Tests
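As an illustration of reading the table, the sketch below applies two of its tests to the same pair of independent samples (invented satisfaction scores): the two-sample t-test for interval data, and the Mann‒Whitney U test when the data are ordinal or normality is doubtful.

# Sketch: two of the tests from the table applied to the same
# two independent samples (invented satisfaction scores).
import numpy as np
from scipy import stats

hotel_a = np.array([7.8, 8.1, 6.9, 7.4, 8.3, 7.0, 7.7, 8.0])
hotel_b = np.array([6.5, 7.2, 6.8, 6.1, 7.0, 6.6, 6.9, 6.4])

# Interval data, two independent samples -> two-sample t-test
print(stats.ttest_ind(hotel_a, hotel_b))
# Ordinal data (or doubtful normality) -> Mann-Whitney U test
print(stats.mannwhitneyu(hotel_a, hotel_b, alternative="two-sided"))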
Tests usually analyze the difference in means, and the result is whether or not the difference is significant. When we make these conclusions, we have two types of possible errors:
- α: the null hypothesis is true (there is no difference) but we reject it (false positive);
- β: the null hypothesis is false (there is a difference) but we do not reject it (false negative).
POSSIBLE OUTCOMES OF HYPOTHESIS TESTING | DO NOT REJECT NULL HYPOTHESIS | REJECT NULL HYPOTHESIS
THE NULL HYPOTHESIS IS TRUE | 1-α | Type I error: α
THE NULL HYPOTHESIS IS FALSE | Type II error: β | 1-β
Possible Outcomes of Hypothesis Testing
The significance of the test depends on the size of α, that is, the probability of rejecting the null hypothesis when it is true. Usually we use 0.05 or 0.01 as the critical value and reject the null hypothesis when the p-value is smaller than α. The p-value is the probability, assuming that the null hypothesis is true, of observing a result at least as extreme as the one that we have (i.e. the actual mean difference).
It is important to remember that, if we are running several tests, the likelihood of committing a type I error (false positive) increases. For this reason we should use a corrected α, for example by applying the Bonferroni correction (divide α by the number of tests), as in the sketch below.
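A minimal sketch of the correction with invented p-values; the statsmodels call is equivalent to the manual division of α.

# Sketch: Bonferroni correction for multiple tests (p-values invented).
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.034, 0.002, 0.047]
alpha = 0.05

# Manual version: compare each p-value with alpha / number of tests
corrected_alpha = alpha / len(p_values)
print([p < corrected_alpha for p in p_values])

# Equivalent via statsmodels
reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method="bonferroni")
print(reject, p_adjusted)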
In addition, it is necessary to remember that, for a given sample size, the smaller the chosen α, the larger β will be (more false negatives).
If the test is significant, we should also compute the effect size: it is important to know not only whether the difference is significant but also how large it is. The effect size can be calculated by dividing the difference between the means by the standard deviation of the control group (to be precise, we should use a pooled standard deviation, but that requires some calculation). As a rule of thumb, an effect size of 0.2 is considered small, 0.5 medium, and above 0.8 large. However, in other contexts the effect size can be given by other statistics, such as the odds ratio or a correlation coefficient.
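A small sketch of this calculation with invented data:

# Sketch: effect size as described above -- mean difference divided by
# the control group's standard deviation (data invented).
import numpy as np

treatment = np.array([12.1, 13.4, 11.8, 14.0, 12.9, 13.2])
control = np.array([10.9, 11.5, 10.2, 11.8, 11.1, 10.6])

effect_size = (treatment.mean() - control.mean()) / control.std(ddof=1)
print(f"Effect size: {effect_size:.2f}")     # above 0.8 would count as large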
Confidence intervals are also usually calculated to obtain a probable range of values from which to draw a conclusion: for example, there is 95% confidence that the true value of the parameter lies within the confidence interval X‒Y. The confidence interval reflects a specific significance level; for example, a 95% interval corresponds to a significance level of 5% (or 0.05). When comparing the difference between two means, if 0 is within the confidence interval, the test is not significant.
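A sketch of such an interval for the difference between two means, using a simple pooled-degrees-of-freedom approximation and invented data:

# Sketch: 95% confidence interval for the difference between two means
# (simple approximation; data invented).
import numpy as np
from scipy import stats

a = np.array([7.8, 8.1, 6.9, 7.4, 8.3, 7.0, 7.7, 8.0])
b = np.array([6.5, 7.2, 6.8, 6.1, 7.0, 6.6, 6.9, 6.4])

diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1)/len(a) + b.var(ddof=1)/len(b))
df = len(a) + len(b) - 2                     # simple approximation
t_crit = stats.t.ppf(0.975, df)

low, high = diff - t_crit*se, diff + t_crit*se
print(f"95% CI for the mean difference: ({low:.2f}, {high:.2f})")
# If 0 falls inside the interval, the difference is not significant at 5%.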
ALTERNATIVE METHODS
In the following chapters I will present several methods for hypothesis testing, some of which have specific requirements or assumptions (type of variables, distribution, variance, etc.). However, there is also an alternative that we can use when we have numerical variables but are not sure about the population distribution or variance. This alternative method uses two simulations (both are sketched in code after this list):
- Shuffling (an alternative to the significance test): we randomize the groups’ elements (we mix the elements of the two groups randomly, each time creating a new pair of groups) and compute the mean difference in each simulation. After several iterations we calculate the percentage of trials in which the difference in the means is higher than the one calculated between the two original groups. This can be compared with the significance test; for example, if fewer than 5% of the iterations indicate a larger difference, the test is significant with α < 0.05.
- Bootstrapping (an alternative to confidence intervals): we resample each of our groups by drawing randomly with replacement from the groups’ elements. In other words, with the members of a group, we create new groups that can contain an element multiple times and not contain another one at all. (An alternative resampling method is to resample the original groups into smaller subgroups: jackknifing.) After calculating the difference in means of the new pairs of samples, we have a distribution of mean differences and can compute our confidence interval (i.e. 95% of the computed mean differences are between X and Y).
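A plain numpy sketch of both simulations; the group data and the number of iterations (10,000) are illustrative assumptions.

# Sketch: shuffling (permutation test) and bootstrapping with plain numpy
# (group data invented; the iteration count is an assumption).
import numpy as np

rng = np.random.default_rng(42)
group_a = np.array([7.8, 8.1, 6.9, 7.4, 8.3, 7.0, 7.7, 8.0])
group_b = np.array([6.5, 7.2, 6.8, 6.1, 7.0, 6.6, 6.9, 6.4])
observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])
n_iter = 10_000

# Shuffling: randomly reassign elements to two new groups and count how
# often the shuffled mean difference is at least as large as the observed one.
count = 0
for _ in range(n_iter):
    shuffled = rng.permutation(pooled)
    new_a, new_b = shuffled[:len(group_a)], shuffled[len(group_a):]
    if abs(new_a.mean() - new_b.mean()) >= abs(observed):
        count += 1
print(f"Permutation p-value: {count / n_iter:.3f}")   # < 0.05 -> significant

# Bootstrapping: resample each group with replacement and take the
# 2.5th and 97.5th percentiles of the mean differences as a 95% CI.
boot_diffs = np.empty(n_iter)
for i in range(n_iter):
    resampled_a = rng.choice(group_a, size=len(group_a), replace=True)
    resampled_b = rng.choice(group_b, size=len(group_b), replace=True)
    boot_diffs[i] = resampled_a.mean() - resampled_b.mean()
print("95% bootstrap CI:", np.percentile(boot_diffs, [2.5, 97.5]))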