Tuesday, April 25, 2017

37. PEARSON CORRELATION

OBJECTIVE

Find out which quantitative variables are related to each other and define the degree of correlation between pairs of variables.


DESCRIPTION

This method estimates the Pearson correlation coefficient, which quantifies the strength and direction of the linear association of two variables. It is useful when we have several variables that may be correlated with each other and we want to select the ones with the strongest relationship. Correlation can be performed to choose the variables for a predictive linear regression.

Pearson Correlation

Correlation Matrix

With the Excel Data Analysis complement, we can perform a correlation analysis resulting in a double-entry table with correlation coefficients (Pearson’s coefficients). We can also calculate correlations using the Excel formula “=CORREL().” The sign of the coefficient (Pearson correlation coefficient) represents the direction (if x increases then y increases = positive correlation; if x increases then y decreases = negative correlation), while the absolute value from 0 to 1 represents the strength of the correlation. Usually above 0.8 it is very strong, from 0.6 to 0.8 it is strong, and when it is lower than 0.4 there is no correlation (or it is very weak).

The figure above shows that there is a very strong positive correlation between X1 and Y and a strong positive correlation between X1–X3 and X3–Y. X3 and X4 have a weak negative correlation.



TEMPLATE


Wednesday, April 12, 2017

36. INTRODUCTION TO REGRESSIONS

Regressions are parametric models that predict a quantitative outcome (dependent variable) from one or more quantitative predictor variables (independent variable). The model to be applied depends on the kind of relationship that the variables exhibit. 

Regressions take the form of equations in which “y” is the response variables that represent the outcome and “x” is the input variable, that is to say the explanatory variable. Before undertaking the analysis, it is important that several conditions are met:
  • -          Y values must have a normal distribution: this can be analyzed with a standardized residual plot, in which most of the values should be close to 0 (in samples larger than 50, this is less important), or a probability residual plot, in which there should be an approximate straight line (Figure 31);
  • -          Y values must have a similar variance around each x value: we can use a best-fit line in a scatter plot (Figure 32);
  • -          Residuals must be independent; specifically, in the residual plot (Figure 33), the points must be equally distributed around the 0 line and not show any pattern (randomly distributed).


Normal probability plot

Figure 31: Normal Probability Plot

Best-Fit Line Scatter Plot

Figure 32: Best-Fit Line Scatter Plot

If the conditions are not met, we can either transform the variables[1] or perform a non-parametric analysis (see 47. INTRODUCTION TO NON-PARAMETRIC MODELS).

In addition, regressions are sensitive to outliers, so it is important to deal with them properly. We can detect outliers using a standardized residual plot, in which data that fall outside +3 and -3 (standard deviations) are usually considered to be outliers. In this case we should first check whether it was a mistake in collecting the data (for example a 200-year-old person is a mistake) and eliminate the outlier from the data set or replace it (see below how to deal with missing data). If it is not a mistake, a common practice is to carry out the regression with and without the outliers and present both results or to transform the data. For example, we may apply a log transformation or a rank transformation. In any case we should be aware of the implications of these transformations.

Standardized Residuals Plot with an Outlier

Figure 33: Standardized Residuals Plot with an Outlier

Another problem with regressions is that records with missing data are excluded from the analysis. First of all we should understand the meaning of a missing piece of information: does it mean 0 or does it mean that the interviewee preferred not to respond? In the second case, if it is important to include this information, we can substitute the missing data with a value:
  • -          With central tendency measures, if we think that the responses have a normal distribution, meaning that there is no specific reason for not responding to this question, we can use the mean or median of the existing data;
  • -          Predict the missing values using other variables; for example, if we have some missing data for the variable “income,” maybe we can use age and profession to for the prediction.

Check the linear regression template (see 38. LINEAR REGRESSION), which provides an example of how to generate the standardized residuals plot.





[1] To improve normality and variance conditions, we can try applying a log transformation to the response (dependent) variable. We can also use other types of transformations, but we must remember that the interpretation of the results is more complex when we transform our variables.