Thursday, February 2, 2017

59. BINARY CLASSIFICATION

OBJECTIVE

Classify the elements of a data set into two groups.


DESCRIPTION

In binary classification the goal is to classify the elements of a data set into two groups according to a more or less complex classification rule. The example proposed in the template concerns a company that wants to promote a very exclusive perfume by giving a free sample to some of its customers. The cost of providing a free sample is €50, but in the case of reaching the right customer, the expected return is worth €950 (€1000 minus the cost of the sample). To classify the clients, the company decides to use a score that is calculated from the number of purchases and the average amount spent by each customer (see 33.SCORING MODELS for more information about the creation of score indicators). Our classification rule is “the higher the score, the more likely it is that the free sample will work.”

The first step is to check whether the chosen classification rule is valid, that is to say how efficiently it allows us to classify the elements (customers). The ROC curve measures the efficiency of the classification, and it is the combination of:

  • -          Sensitivity: TRUE POSITIVE RATIO = TP / (TP+FN)[1]
  • -          1-specificity: FALSE POSITIVE RATIO = 1 - TN / (TN + FP)

The curve is obtained by ordering our sample of 20 clients from the highest to the lowest score and reporting the results of the experiment (1 = the client bought the product, 0 = the client did not buy the product; see Figure below).

Calculation of the ROC Curve (Binary Classification)


Table for the Calculation of the ROC Curve

The ROC curve is the continuous line in the following graph (Figure below), while the dashed line is the theoretical ROC curve in a model in which the classification is random. Graphically, we understand that our model is more efficient than a random classification method, since the continuous line is above the dashed line. The area under the curve (AUC) is the measure of classification efficiency and represents the probability of a positive event being classified as positive. A random model (red line) has an AUC of 0.5, while a good model has an AUC higher than 0.7. Our model has an AUC of approximately 0.82, so we can conclude that our classification rule classifies customers efficiently.

ROC Curve (Binary Classification)

ROC Curve

The next step is to find the optimal threshold, which is the minimum score that a client should have to receive a free sample. For that we have to assign the costs and gains of the four possible results of the classification:
  • -        True positive: if we give a free sample to the right customer, we have a cost of €50 for the sample and a return of €1000, so we assign to the TP a revenue of €950;
  • -         False positive: if we give the sample to the wrong customer, we have a cost of €50;
  • -        True negative: we correctly establish that this customer will not buy in spite of the free sample, so we have neither costs nor revenues;
  • -        False negative: we fail to identify a customer who would have bought the product, but still we have neither costs nor revenues.

When we define this matrix, we do not have to include either opportunity costs (the potential revenues lost in a false negative) or opportunity gains (the money saved from not sending a free sample to the wrong customer in a true negative). If we included them, we would double the costs and revenues. In our example the optimal threshold is a score of 65, which means that the company has to send a free sample to customers with at least this score.

This cost/revenue matrix can be used to establish the probability threshold in a logistic regression (see 60. LOGISTIC REGRESSION). Using the same example, we can perform a logistic regression with several predictor variables (number of purchases, amount spent, location, marital status, etc.) and calculate the probability of individual customers buying a product. On the basis of the cost of incentives and the gains from reaching the right customers, the probability threshold may be higher or lower than 0.5.

TEMPLATE




[1] T = true, F = false, P = positive, and N = negative. 

No comments:

Post a Comment