Tuesday, August 22, 2017

43. t-TEST

OBJECTIVE

Verify whether two groups are significantly different.


DESCRIPTION

There are three main applications of the t-test:
  • One-sample t-test: compare a sample mean with the mean of its population;
  • Two-sample t-test: compare two sample means;
  • Paired t-test: compare two means of the same sample in different situations (e.g. before and after a treatment).[1]

To perform a t-test, it is necessary to check the normality assumption (see 36.INTRODUCTION TO REGRESSIONS); however, the t-test tolerates deviations from normality as long as the sample size is large and the two samples have a similar number of elements. In the case of important normality deviations, we can either transform the data or use a non-parametric test (see Figure 40 in chapter 41.INTRODUCTION TO HYPOTHESIS TESTING).

An alternative to the t-test is the z-test; however, besides the normality assumption, it needs a larger sample size (usually > 30) and the standard deviation of the population.
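
For comparison, a two-tailed one-sample z-test can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name and the numbers are hypothetical and assume that the population standard deviation is known.

```python
from math import sqrt
from statistics import NormalDist


def z_test_two_tailed(sample_mean, population_mean, population_sd, n):
    """Return (z, p) for a two-tailed one-sample z-test."""
    standard_error = population_sd / sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    # Two-tailed p-value: probability of a |z| at least this large under H0.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical numbers: 36 observations and a known population sd of 10.
z, p = z_test_two_tailed(sample_mean=295, population_mean=300,
                         population_sd=10, n=36)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The only structural difference from the t-test is that the standard error uses the population standard deviation and the p-value comes from the normal distribution instead of the t distribution.
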
Each of the three kinds of t-tests described above has two variations depending on the kind of hypothesis to be tested. If the alternative hypothesis is that the two means are different, then a two-tailed test is necessary. If the hypothesis is that one mean is higher or lower than the other one, then a one-tailed test is required. It is also possible to specify in the hypothesis that the difference will be larger than a certain number (in two-sample and paired t-tests).

After performing the test, we can reject the null hypothesis (there are no differences) if the p-value is lower than the chosen alpha (α, usually 0.05) or, equivalently for a two-tailed test, if the t-stat value falls outside the range between the negative and the positive t-critical value (see the template). The critical value of t for a two-tailed test (t-critical two-tail) is also used to calculate the confidence interval, whose confidence level is 1 minus the chosen α (if we choose 0.05, we will have a 95% confidence interval).
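
The decision rule can be illustrated with a short sketch that turns a t statistic into one-tailed and two-tailed p-values. It is not the template's method: the template relies on spreadsheet formulas, while this sketch integrates the density of the t distribution numerically (Simpson's rule) so that it needs only the Python standard library. The function names and the example statistic are hypothetical.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail(t_stat, df, steps=2000):
    """P(T > |t_stat|), integrating the density on [0, |t_stat|] (Simpson)."""
    b = abs(t_stat)
    if b == 0:
        return 0.5
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def p_values(t_stat, df):
    """Return (one_tailed, two_tailed) p-values for a t statistic.

    The one-tailed value assumes the observed difference points in the
    direction stated by the alternative hypothesis.
    """
    one_tailed = upper_tail(t_stat, df)
    return one_tailed, 2 * one_tailed


alpha = 0.05
one_tailed, two_tailed = p_values(t_stat=-2.8, df=9)  # hypothetical statistic
print(f"one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
print("reject H0 (two-tailed):", two_tailed < alpha)
```

Because the two-tailed p-value is twice the one-tailed one, a result can be significant in a one-tailed test and not in a two-tailed test at the same alpha, which is why the hypothesis should be fixed before looking at the data.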


ONE-SAMPLE T-TEST

With this test we compare a sample mean with the mean of the population. For example, we have a shampoo factory and we know that each bottle has to be filled with 300 ml of shampoo. To control the quality of the final product, we take random samples from the production line and measure the amount of shampoo.



Figure 43: Input Data of a One-Sample t-Test

Since we want to stop and fix the production line if the amount of shampoo is smaller or larger than the expected quantity (300 ml), we have to run a two-tailed test. Figure 43 shows the input data as well as the calculated standard deviation and sample mean, which is 295. A significance level (alpha) of 0.05 is chosen. We then calculate the t-critical value and p-value (the formulas can be checked in the template).


Figure 44: Results of a One-Sample t-Test

Since the p-value is lower than the alpha (0.05), we conclude that the difference in the means is significant and that we should fix our production line. The results also include the 95% confidence interval, which means that we are 95% confident that the mean amount of shampoo per bottle lies between 292 ml and 298 ml.
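
The calculation behind a one-sample test can be sketched in a few lines. The fill volumes below are invented for illustration (they are not the data behind Figure 43), and the critical value is taken from a standard t table instead of being computed, so the code needs only the Python standard library.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, target, t_critical):
    """Return the t statistic, degrees of freedom, and confidence interval."""
    n = len(sample)
    x_bar = mean(sample)
    standard_error = stdev(sample) / sqrt(n)  # stdev divides by n - 1
    t_stat = (x_bar - target) / standard_error
    margin = t_critical * standard_error
    return t_stat, n - 1, (x_bar - margin, x_bar + margin)


# Hypothetical fill volumes in ml.
fills = [293, 297, 295, 290, 299, 296, 294, 292, 298, 296]
# 2.262 is the two-tailed t-critical value for alpha = 0.05 and 9 degrees
# of freedom, read from a t table.
T_CRITICAL = 2.262
t_stat, df, ci = one_sample_t(fills, target=300, t_critical=T_CRITICAL)
reject = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.2f}, df = {df}, "
      f"95% CI = ({ci[0]:.1f}, {ci[1]:.1f}), reject H0: {reject}")
```

With these made-up values the sample mean is 295 ml and the t statistic is about -5.67, far outside the range of plus or minus 2.262, so the null hypothesis of a 300 ml mean would be rejected.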


TWO-SAMPLE T-TEST

A practical example would be to determine whether male clients buy more or less than female ones.
First of all, we should define our hypothesis. In our example, the hypothesis is that male and female clients do not buy the same amount of goods, so we should use a two-tailed test; that is, we do not assume that one group buys more than the other. If, on the other hand, we wanted to test whether males buy more, we would use a one-tailed test.
In the Excel add-in “Data Analysis,” we choose the option “t-Test: Two-Sample Assuming Unequal Variances” by default: if the variances are in fact equal, the results will be practically the same, but if we assume equal variances and they turn out to be unequal, then the results will be unreliable. We select the two-sample data and specify our significance level (alpha, by default 0.05).


Figure 45: Output of a Two-Sample t-test Assuming Unequal Variances

Since we are testing the difference, either positive or negative, in the output, we have to use the two-tailed p-value and two-tailed t-critical value. In this example the difference is significant, since the p-value is smaller than the chosen alpha (0.05).
Confidence intervals are also calculated in the template, concluding that we are 95% confident that women buy, on average, between 8 and 37 more products than men.
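
For readers working outside Excel, the unequal-variances test (often called Welch's t-test) can be sketched as follows. The purchase counts are invented and are not the data behind Figure 45; the function name is hypothetical. The degrees of freedom use the Welch-Satterthwaite approximation, which generally gives a non-integer value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for a two-sample t-test assuming unequal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, df


# Hypothetical number of products bought per client.
men = [12, 15, 9, 20, 14, 11, 17, 13]
women = [25, 31, 18, 40, 22, 35, 28, 30]
t_stat, df = welch_t(men, women)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

The sign of t depends only on the order in which the two samples are passed; for a two-tailed test it is the absolute value that is compared with the two-tailed t-critical value.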


PAIRED T-TEST

We want to test two different products on several potential consumers to decide which one is better by asking participants to try each one and rate them on a scale from 1 to 10. Since we have decided to use the same group to test both products, we are going to run a paired two-tailed t-test. The alpha chosen is 0.05.


Figure 46: Output of a Paired t-Test

The results of the example show that there is no significant difference in the rating of the two products, since the p-value (two-tail) is larger than the alpha (0.05). The template also contains the confidence interval of the mean difference, which in this case includes 0, as expected when there is no significant difference.
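
A paired t-test is a one-sample t-test applied to the differences within each pair, which the sketch below makes explicit. The ratings are invented (they are not the data behind Figure 46), the function name is hypothetical, and the critical value is read from a t table.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second, t_critical):
    """Return t, degrees of freedom, and the CI of the mean difference."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    d_bar = mean(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    t_stat = d_bar / standard_error
    margin = t_critical * standard_error
    return t_stat, n - 1, (d_bar - margin, d_bar + margin)


# Hypothetical ratings (1 to 10) given by the same six participants.
product_a = [7, 8, 6, 9, 5, 8]
product_b = [6, 8, 7, 8, 6, 7]
# 2.571 is the two-tailed t-critical value for alpha = 0.05 and 5 degrees
# of freedom, read from a t table.
t_stat, df, ci = paired_t(product_a, product_b, t_critical=2.571)
print(f"t = {t_stat:.2f}, df = {df}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

With these made-up ratings the t statistic is about 0.42 and the interval runs from roughly -0.87 to 1.20, so it contains 0 and the difference would not be significant.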



TEMPLATE






[1] In some cases it is also possible to use two samples and match each component on a certain dimension.

Tuesday, August 8, 2017

70. TIME SERIES ANALYSIS

OBJECTIVE

Forecast the demand for the next periods.


DESCRIPTION

Time series analysis is useful for forecasting based on the patterns underlying the past data. There are four main components:

  • Trend: a long-term movement of the time series that can be upward, downward, or stationary (an example can be the upward trend in population growth);
  • Cyclical: a pattern that is usually observed over two or more years and is caused by circumstances that repeat in cycles (for example economic cycles, which present four phases: prosperity, decline, depression, and recovery);
  • Seasonal: variations within a year that usually depend on the weather, customers’ habits, and so on;
  • Irregular components: random events with unpredictable influences on the time series.


There are two main types of models depending on how the previous four components are included:

(1)  Y(t) = T(t) × S(t) × C(t) × I(t)
Multiplicative models: the four components are multiplied, and in this case we assume that the components can affect each other.

(2)  Y(t) = T(t) + S(t) + C(t) + I(t)
Additive models: the four components are added, and in this case we assume that the components are independent.
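
The two formulas can be compared with a small sketch that builds a series from its components. All component values are invented for illustration. Note the difference in units: in the additive model every component is expressed in the units of the series, while in the multiplicative model only the trend is, and the other components are factors around 1.

```python
def additive(trend, seasonal, cyclical, irregular):
    """Y(t) = T(t) + S(t) + C(t) + I(t) for each period t."""
    return [t + s + c + i
            for t, s, c, i in zip(trend, seasonal, cyclical, irregular)]


def multiplicative(trend, seasonal, cyclical, irregular):
    """Y(t) = T(t) x S(t) x C(t) x I(t) for each period t."""
    return [t * s * c * i
            for t, s, c, i in zip(trend, seasonal, cyclical, irregular)]


# Hypothetical components for four quarters.
trend = [100, 102, 104, 106]

# Additive: seasonal, cyclical, and irregular parts are in sales units.
y_add = additive(trend, [10, -5, -10, 5], [0, 0, 0, 0], [1, -1, 0, 2])

# Multiplicative: the same parts are factors applied to the trend.
y_mul = multiplicative(trend, [1.10, 0.95, 0.90, 1.05],
                       [1.0, 1.0, 1.0, 1.0], [1.01, 0.99, 1.00, 1.02])

print("additive:      ", y_add)
print("multiplicative:", [round(y, 1) for y in y_mul])
```

In the multiplicative version a seasonal factor of 1.10 adds more units when the trend is high than when it is low, which is one way the components "affect each other".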


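Two other important concepts in time series are autocorrelation and stationarity. A series is autocorrelated when an observation is influenced by a previous observation or observations. For example, if today the temperature is quite high, it is more likely that tomorrow it will be quite high as well. A process is stationary when its statistical properties, such as its mean and variance, do not change over time.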
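
The dependence between consecutive observations in the temperature example can be measured with the lag-1 autocorrelation. The sketch below uses the usual sample estimator with only the Python standard library; the function name and the temperatures are hypothetical.

```python
from statistics import mean


def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    m = mean(series)
    deviations = [x - m for x in series]
    denominator = sum(d * d for d in deviations)
    # Products of deviations that are `lag` periods apart.
    numerator = sum(deviations[i] * deviations[i + lag]
                    for i in range(len(series) - lag))
    return numerator / denominator


# Hypothetical daily temperatures in degrees Celsius.
temperatures = [21, 23, 24, 26, 27, 27, 25, 24, 22, 21]
print(f"lag-1 autocorrelation: {autocorrelation(temperatures):.2f}")
```

A value close to 1 means that a warm day tends to be followed by another warm day, a value close to 0 means that consecutive observations are unrelated, and a negative value means that high and low observations tend to alternate.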

There are many models for time series analysis, but one of the most used is ARIMA (autoregressive integrated moving average). There are some variations of it as well as non-linear models. However, linear models such as ARIMA are widely used due to their simplicity of implementation and understanding.

A good time series analysis involves several exploratory analyses and model validation, which require statistical knowledge and experience. The template contains a simplification of a time series model in which seasonality and trends are isolated to forecast future sales.
The data can be collected at every instant of time (continuous time series), for example temperature readings, or at discrete points of time (discrete time series), when they are observed daily, weekly, monthly, and so on.
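
One way to isolate trend and seasonality for a forecast is sketched below. It is an assumption about how such a simplification could work, not a description of the template: it fits a straight-line trend by least squares, computes a multiplicative seasonal index for each position in the cycle, and multiplies the extended trend by those indices. The function names and the sales figures are hypothetical, and the approach assumes positive values such as sales.

```python
def fit_linear_trend(series):
    """Least-squares line through (0, y0), (1, y1), ...: (intercept, slope)."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def seasonal_indices(series, period, intercept, slope):
    """Average ratio of actual to trend value for each position in the cycle."""
    ratios = [[] for _ in range(period)]
    for t, y in enumerate(series):
        ratios[t % period].append(y / (intercept + slope * t))
    return [sum(r) / len(r) for r in ratios]


def forecast(series, period, horizon):
    """Extend the trend and apply the seasonal index of each future period."""
    intercept, slope = fit_linear_trend(series)
    indices = seasonal_indices(series, period, intercept, slope)
    n = len(series)
    return [(intercept + slope * t) * indices[t % period]
            for t in range(n, n + horizon)]


# Hypothetical quarterly sales for two years.
sales = [120, 90, 80, 130, 132, 99, 88, 143]
next_year = forecast(sales, period=4, horizon=4)
print([round(value, 1) for value in next_year])
```

The sketch ignores the cyclical and irregular components and does no validation, so it illustrates the idea of separating trend from seasonality and does not replace a properly fitted model such as ARIMA.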



TEMPLATE