Wednesday, November 30, 2016

41. INTRODUCTION TO HYPOTHESIS TESTING

OBJECTIVE

Verify whether two (or more) groups are significantly different from each other, usually by comparing their means or medians.


DESCRIPTION
Generally speaking, statistical hypothesis testing covers all the techniques that test a null hypothesis against an alternative hypothesis. Although it also includes regressions, I will focus only on the testing performed on samples.
There are three main steps in hypothesis testing:
- Definition: identify the problem, study it, and formulate hypotheses;
- Experiment: choose and define the data collection technique and sampling method;
- Results and conclusion: check the data, choose the most appropriate test, analyze the results, and draw conclusions.


DEFINITION

The first step in hypothesis testing is to identify the problem and analyze it. The three main categories of hypothesis testing are:
- to test whether two samples are significantly different; for example, after conducting a survey in two hotels of the same hotel chain, we want to check whether the difference in average satisfaction is significant or not;
- to test whether a change in a factor has a significant impact on the sample by conducting an experiment (for example, to check whether a new therapy has better results than the traditional one);
- to test whether a sample taken from a population truly represents it (if the population's parameters, e.g. the mean, are known); for example, if a production line is expected to produce objects with a specific weight, this can be checked by taking random samples and weighing them. If the average difference from the expected weight is statistically significant, it means that the machines need to be adjusted.

After defining and studying the problem, we need to define the null hypothesis (H0) and the alternative hypothesis (Ha), which are mutually exclusive and together cover the whole range of possibilities. We usually compare the means of the two samples or the sample mean with the expected population mean. There are three possible hypothesis settings (a short code sketch follows the list):
- to test any kind of difference (positive or negative): the H0 is that there is no difference in the means (H0: μ = μ0 and Ha: μ ≠ μ0);
- to test just one kind of difference:
  - positive (H0: μ ≤ μ0 and Ha: μ > μ0);
  - negative (H0: μ ≥ μ0 and Ha: μ < μ0).
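As a minimal illustration of these settings (assuming Python with SciPy 1.6 or later for the alternative argument, and made-up data), a one-sample t-test can be run against a two-sided or a one-sided alternative:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=5.3, scale=1.0, size=30)  # hypothetical measurements
mu0 = 5.0                                         # expected population mean

# Ha: mu != mu0 (any kind of difference)
t_two, p_two = stats.ttest_1samp(sample, mu0, alternative="two-sided")
# Ha: mu > mu0 (only a positive difference)
t_pos, p_pos = stats.ttest_1samp(sample, mu0, alternative="greater")
print(f"two-sided: t={t_two:.2f}, p={p_two:.3f}")
print(f"one-sided: t={t_pos:.2f}, p={p_pos:.3f}")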


EXPERIMENT

The sampling technique is extremely important; we must ensure that the sample is randomly chosen (in general) and, in the case of an experiment, that the participants do not know in which group they have been placed. Depending on the problem to be tested and the test to be performed, different techniques are used to calculate the required sample size (check www.powerandsamplesize.com, which allows the calculation of the sample size for different kinds of tests).
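For example, a minimal sketch of a power calculation for a two-sample t-test (assuming Python with statsmodels; the effect size and power values are illustrative assumptions):

from statsmodels.stats.power import TTestIndPower

# required sample size per group for an assumed medium effect (Cohen's d = 0.5),
# alpha = 0.05 and power = 0.8 (i.e. beta = 0.2)
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required sample size per group: {n_per_group:.0f}")  # roughly 64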

RESULTS AND CONCLUSIONS
Once the data have been collected, it is necessary to check for outliers and missing data (see 36. INTRODUCTION TO REGRESSIONS) and choose the most appropriate test depending on the problem studied, the kind of variables, and their distribution. There are two main approaches to testing hypotheses:
- The frequentist approach: this makes assumptions about the population distribution and uses a null hypothesis and a p-value to draw conclusions (almost all the methods presented here are frequentist);
- The Bayesian approach: this approach needs prior knowledge about the population or the sample, and the result is the probability of a hypothesis (see 42. BAYESIAN APPROACH TO HYPOTHESIS TESTING).

Summary of Parametric and Non-parametric Tests, by type of dependent variable and sample characteristics (independent variables):

DICHOTOMOUS
- 1 sample: test of proportions
- 2 dependent samples: McNemar test
- more than 2 dependent samples: Cochran's Q
- correlation: phi coefficient, contingency tables

ORDINAL (CATEGORICAL)
- 2 independent samples: Mann-Whitney U test (Wilcoxon rank sum test)
- 2 dependent samples: Wilcoxon signed-rank test
- more than 2 independent samples: Kruskal-Wallis test
- more than 2 dependent samples: Scheirer-Ray-Hare test (two-way), Friedman test (one-way)
- correlation: Spearman's correlation

INTERVAL OR RATIO
- 1 sample: one-sample z-test or t-test
- 2 independent samples: two-sample t-test
- 2 dependent samples: paired t-test
- more than 2 independent samples: one-way ANOVA (with two factors: two-way ANOVA)
- more than 2 dependent samples: repeated measures ANOVA
- correlation: Pearson's correlation
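As a minimal sketch of two cells of this table (assuming Python with SciPy and made-up data), here are the parametric two-sample t-test and its non-parametric counterpart, the Mann-Whitney U test, applied to the same two independent groups:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=7.0, scale=1.5, size=40)  # hypothetical satisfaction scores
group_b = rng.normal(loc=7.6, scale=1.5, size=40)

t_stat, p_t = stats.ttest_ind(group_a, group_b)  # interval/ratio data
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")  # ordinal or non-normal data
print(f"t-test: p={p_t:.3f}; Mann-Whitney: p={p_u:.3f}")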

Tests usually analyze the difference in means, and the result is whether or not the difference is significant. When we draw these conclusions, there are two types of possible errors:
- α: the null hypothesis is true (there is no difference) but we reject it (false positive);
- β: the null hypothesis is false (there is a difference) but we do not reject it (false negative).

Possible Outcomes of Hypothesis Testing:

                                DO NOT REJECT THE NULL HYPOTHESIS   REJECT THE NULL HYPOTHESIS
THE NULL HYPOTHESIS IS TRUE     1-α                                 Type I error: α
THE NULL HYPOTHESIS IS FALSE    Type II error: β                    1-β
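These outcomes can be simulated; in the following minimal sketch (assuming Python with SciPy), both groups are drawn from the same population, so the null hypothesis is true and roughly α = 5% of the tests come out as false positives:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_trials, false_positives = 2000, 0
for _ in range(n_trials):
    a = rng.normal(0, 1, 30)  # H0 is true: both groups share the same distribution
    b = rng.normal(0, 1, 30)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1
print(f"empirical type I error rate: {false_positives / n_trials:.3f}")  # close to 0.05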

The significance of the test depends on the size of α, that is, the probability of rejecting the null hypothesis when it is true. Usually we use 0.05 or 0.01 as the significance level and reject the null hypothesis when the p-value is smaller than α. The p-value is the probability, assuming that the null hypothesis is true, of observing a result at least as extreme as the one that we have (i.e. the actual mean difference).
It is important to remember that, if we are running several tests, the likelihood of committing a type I error (false positive) increases. For this reason we should use a corrected α, for example by applying the Bonferroni correction (divide α by the number of tests).[1]
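A minimal sketch of such a correction (assuming Python with statsmodels; the p-values are made up):

from statsmodels.stats.multitest import multipletests

p_values = [0.010, 0.020, 0.030, 0.400]  # hypothetical p-values from four tests
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(reject)      # which null hypotheses are still rejected after the correction
print(p_adjusted)  # Bonferroni-adjusted p-values, i.e. min(p * 4, 1)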
In addition, it is necessary to remember that, for a given sample size, the smaller the chosen α, the larger β (the probability of a false negative) will be.

If the test is significant, we should also compute the effect size: it matters not only whether the difference is significant but also how large it is. The effect size can be calculated by dividing the difference between the means by the standard deviation of the control group (to be precise, we should use a pooled standard deviation, as in Cohen's d, but this requires some extra calculation). As a rule of thumb, an effect size of 0.2 is considered small, 0.5 medium, and above 0.8 large. However, in other contexts the effect size can be given by other statistics, such as the odds ratio or a correlation coefficient.
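A minimal sketch of this calculation with the pooled standard deviation (assuming Python with NumPy and made-up groups):

import numpy as np

def cohens_d(x, y):
    # pooled standard deviation, weighted by the degrees of freedom of each group
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
treatment = rng.normal(7.5, 1.5, 50)  # hypothetical groups
control = rng.normal(7.0, 1.5, 50)
print(f"effect size (Cohen's d): {cohens_d(treatment, control):.2f}")  # 0.2 small, 0.5 medium, 0.8 large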

Confidence intervals are also usually calculated to obtain a probable range of values from which to draw a conclusion: for example, there is 95% confidence that the true value of the parameter lies within the confidence interval X‒Y. The confidence interval reflects a specific confidence level; for example, a 95% confidence interval corresponds to a significance level of 5% (or 0.05). When comparing the difference between two means, if 0 is within the confidence interval, the test is not significant.
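A minimal sketch of a 95% confidence interval for the difference between two means (assuming Python with SciPy, made-up data, and a rough approximation for the degrees of freedom):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(7.5, 1.5, 50)
b = rng.normal(7.0, 1.5, 50)

diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))  # standard error of the difference
df = len(a) + len(b) - 2                                       # rough degrees of freedom
t_crit = stats.t.ppf(0.975, df)                                # two-sided 95% critical value
print(f"95% CI for the difference: [{diff - t_crit * se:.2f}, {diff + t_crit * se:.2f}]")
# if 0 falls inside this interval, the difference is not significant at alpha = 0.05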


ALTERNATIVE METHODS

In the following chapters I will present several methods for hypothesis testing, some of which have specific requirements or assumptions (type of variables, distribution, variance, etc.). However, there is also an alternative that we can use when we have numerical variables but are not sure about the population distribution or variance. This alternative method uses two simulations:

- Shuffling (an alternative to the significance test): we randomize the groups' elements (we mix the elements of the two groups randomly, each time creating a new pair of groups) and compute the mean difference in each simulation. After several iterations we calculate the percentage of trials in which the difference in the means is larger than the one calculated between the two original groups. This can be compared with the significance test; for example, if fewer than 5% of the iterations show a larger difference, the test is significant with α < 0.05 (see the sketch after this list).

- Bootstrapping (an alternative to confidence intervals): we resample each of our groups by drawing randomly with replacement from the groups' elements. In other words, with the members of a group, we create new groups that can contain an element multiple times and not contain another one at all. (An alternative resampling method is to resample the original groups in smaller subgroups: jackknifing.) After calculating the difference in means of the new pairs of samples, we have a distribution of mean differences and can compute our confidence interval (e.g. 95% of the computed mean differences lie between X and Y); see the sketch after this list.
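A minimal sketch of both simulations (assuming Python with NumPy and made-up groups):

import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(7.5, 1.5, 50)  # hypothetical groups
b = rng.normal(7.0, 1.5, 50)
observed = a.mean() - b.mean()
n_iter = 10_000

# Shuffling: pool the elements, split them randomly into two new groups and
# count how often the shuffled difference is at least as extreme as the observed one.
pooled = np.concatenate([a, b])
count = 0
for _ in range(n_iter):
    rng.shuffle(pooled)
    diff = pooled[:len(a)].mean() - pooled[len(a):].mean()
    if abs(diff) >= abs(observed):
        count += 1
print(f"shuffling p-value: {count / n_iter:.3f}")  # significant if below 0.05

# Bootstrapping: resample each group with replacement and take the 2.5th and
# 97.5th percentiles of the resampled mean differences as a 95% confidence interval.
boot_diffs = [rng.choice(a, size=len(a)).mean() - rng.choice(b, size=len(b)).mean()
              for _ in range(n_iter)]
low, high = np.percentile(boot_diffs, [2.5, 97.5])
print(f"95% bootstrap CI: [{low:.2f}, {high:.2f}]")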






[1] There are also other methods that can be more or less conservative, for example the Šidák correction or the false discovery rate controlling procedure.
