Sunday, October 29, 2017

63. PRINCIPAL COMPONENT ANALYSIS

OBJECTIVE

Analyze the interrelations among several variables and explain them with a reduced number of variables.


DESCRIPTION
A principal component analysis (PCA) analyzes the interrelations among a large number of variables to find a small number of variables (components) that explain the variance of the original variables. This method is usually performed as the first step in a series of analyses; for example, it can be used when there are too many predictor variables compared with the number of observations or to avoid multicollinearity.

Suppose that a company is obtaining responses about many characteristics of a product, say a new shampoo: color, smell, cleanliness, and shine. After a PCA it finds out that the four original variables can be reduced to two components[1]:
  • Component “quality”: color and smell;
  • Component “effect on hair”: cleanliness and shine.

Even though it is possible to run a PCA in Excel with complex calculations or special add-ins,[2] I suggest using a proper statistical tool. Here I will only give some guidelines for performing a PCA.

First, the analysis starts with a covariance or correlation matrix. I suggest using a correlation matrix, since a covariance matrix gives misleading results when the variables are measured on different scales or have very different variances: the variables with the largest variances dominate the components. Then eigenvectors (the directions of the variance) and eigenvalues (the amount of variance along each direction) are calculated. This yields as many components as there are variables, each one with its own eigenvalue.
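The step described above can be sketched in a few lines of Python with NumPy. This is only a minimal illustration under stated assumptions: the function name `pca_from_correlation` and the simulated shampoo ratings are hypothetical, and a dedicated statistical package will do more (validity tests, rotation, scoring).

```python
import numpy as np

def pca_from_correlation(data):
    """Eigen-decompose the correlation matrix of `data` (rows = observations).

    Returns eigenvalues and eigenvectors sorted from the largest
    eigenvalue to the smallest.
    """
    corr = np.corrcoef(data, rowvar=False)            # variables are in columns
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # eigh: for symmetric matrices
    order = np.argsort(eigenvalues)[::-1]             # eigh returns ascending order
    return eigenvalues[order], eigenvectors[:, order]

# Hypothetical data: 200 respondents rating color, smell, cleanliness, shine.
rng = np.random.default_rng(0)
quality = rng.normal(size=200)
effect = rng.normal(size=200)
data = np.column_stack([
    quality + 0.3 * rng.normal(size=200),  # color
    quality + 0.3 * rng.normal(size=200),  # smell
    effect + 0.8 * rng.normal(size=200),   # cleanliness
    effect + 0.8 * rng.normal(size=200),   # shine
])

eigenvalues, eigenvectors = pca_from_correlation(data)
print("Eigenvalues:", eigenvalues)
print("Share of variance:", eigenvalues / eigenvalues.sum())
```

Because a correlation matrix has ones on its diagonal, the eigenvalues add up to the number of variables, so each eigenvalue divided by that number is the share of total variance the component explains.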

Figure 74: Results of a PCA

The more variance a component explains (the larger its eigenvalue), the more important it is. There are several approaches for choosing the number of components to retain:
- Define a threshold before the analysis:
  • choose all the components with an eigenvalue above a certain value (usually > 1);
  • choose a priori a specific number of components (then verify the total variance explained and run other validity tests);
  • choose the first x components that explain at least X% of the variance, for example 80% if the results are used for descriptive purposes, or higher if they will be used in other statistical analyses (Figure 74);
- Use a scree plot (Figure 75) and “cut” the line at the main inflexion point, or at one of the main inflexion points where the total variance explained is acceptable (for example, in Figure 75 the first four components could be chosen, since there is an important inflexion point there, but they explain only 60% of the variance).

 Figure 75: Scree Plot
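The threshold rules listed above are simple to express in code. The sketch below is only illustrative: the eigenvalues are made-up numbers (not the ones behind Figures 74 and 75), and the helper names `n_by_eigenvalue` and `n_by_cumulative_variance` are hypothetical.

```python
import numpy as np

def explained_variance_ratio(eigenvalues):
    """Share of total variance explained by each component."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return eigenvalues / eigenvalues.sum()

def n_by_eigenvalue(eigenvalues, threshold=1.0):
    """Number of components whose eigenvalue exceeds the threshold."""
    return int(np.sum(np.asarray(eigenvalues, dtype=float) > threshold))

def n_by_cumulative_variance(eigenvalues, target=0.80):
    """Smallest number of leading components explaining at least `target`."""
    ordered = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(explained_variance_ratio(ordered))
    # Small tolerance so floating-point rounding does not add a component.
    index = int(np.searchsorted(cumulative, target - 1e-12))
    return min(index + 1, len(ordered))

# Made-up eigenvalues for six variables (they sum to 6 + 2 = 8 only for
# easy arithmetic; with a correlation matrix they would sum to 6).
eigenvalues = [3.2, 1.8, 1.1, 0.9, 0.6, 0.4]

print("Eigenvalue > 1 rule:", n_by_eigenvalue(eigenvalues))
print("80% variance rule:", n_by_cumulative_variance(eigenvalues, 0.80))
print("Cumulative variance:", np.cumsum(explained_variance_ratio(eigenvalues)))
```

The two rules need not agree, which is why it is worth checking the total variance explained whichever rule is applied. Plotting the eigenvalues against the component number gives the scree plot used for the visual approach.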

The next step is to analyze the correlation coefficients between variables and components (the component loadings), arranged in a matrix of variables by components. Ideally, each variable has a high correlation with only one component, so that each component can be defined conceptually (smell and color = component “quality”). However, even if we cannot explain the resulting components conceptually, we have to bear in mind that the main objective of a PCA is to reduce a large number of variables to a manageable number of components; interpreting the components is not strictly necessary. In chapter 64. EXPLORATORY FACTOR ANALYSIS, PCA will be used as the method for a factor analysis, and I will introduce optimization methods, factor scoring, and validity tests.
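As a rough illustration of the variable-by-component correlation matrix, the sketch below computes it for simulated data. It assumes the PCA is run on the correlation matrix, in which case the correlation between a variable and a component equals the eigenvector entry multiplied by the square root of the eigenvalue. The function name `component_loadings` and the data are hypothetical.

```python
import numpy as np

def component_loadings(data, n_components):
    """Correlations between each variable (rows) and each retained component (columns)."""
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1][:n_components]  # largest eigenvalues first
    retained = np.clip(eigenvalues[order], 0.0, None)     # guard against tiny negatives
    return eigenvectors[:, order] * np.sqrt(retained)

# Hypothetical shampoo survey: two underlying traits drive four ratings.
rng = np.random.default_rng(1)
n = 2000
quality = rng.normal(size=n)
effect = rng.normal(size=n)
data = np.column_stack([
    quality + 0.3 * rng.normal(size=n),  # color
    quality + 0.3 * rng.normal(size=n),  # smell
    effect + 0.8 * rng.normal(size=n),   # cleanliness
    effect + 0.8 * rng.normal(size=n),   # shine
])

loadings = component_loadings(data, n_components=2)
for name, row in zip(["color", "smell", "cleanliness", "shine"], loadings):
    print(f"{name:12s}", np.round(row, 2))
```

The sign of each column is arbitrary, so only the absolute size of a correlation matters when deciding which variables define a component. The simulated traits were deliberately given different strengths; when two components explain almost the same variance, the variables may not separate this cleanly.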


TEMPLATE






[1] In spite of this example, PCA is usually performed when we have a larger number of variables.