The Case: Predict your churn
Customer churn is one of the most relevant metrics for businesses, as it indicates how good a company is at retaining its customers. In this case study, we aim to gain insight into the factors involved in a bank customer's decision to churn, and to build accurate, explainable models that can anticipate churn before it actually happens.
With this use case, we present a toy example of building a simple data analysis solution for behavior analysis using Shapelets. In particular, we are interested in the factors that influence a bank customer's decision to churn.
While this case study focuses on bank customers, the approach is applicable to any other sector involving customer retention, as long as large customer databases are available. Furthermore, while churn reduction is the objective of this use case, the same techniques can serve other objectives, such as live marketing strategies, understanding customer habits to improve service quality based on demand, or reducing product failures based on user profiles.
The use case is based on a dataset of 10k anonymized bank customers with 20 commonly available features covering demographic, customer-relationship and transactional information. Some of the customers in this dataset have already churned, and this information is used as ground truth to figure out what differentiates churning from non-churning customers and to build a model that performs this classification task while minimizing classification errors.
Furthermore, the models obtained in this study can provide the churn probability for a given customer. This is valuable because it allows prioritizing actions on the customers most likely to churn, for example by offering them special discounts or promotions.
The use case is organized as follows. First, a high-level dataset review is performed to understand the data available and its quality. Then, an exploratory data analysis (EDA) is performed in order to quickly discover relevant features or engineer them. Next comes the data modelling stage, in which predictive models are built and their performance on new, unseen data is estimated. Finally, we draw the most relevant conclusions from the analysis.
Several challenges arise in this case study, some of which are quite common to many data science studies:
The methodology to predict churn is based on three main steps commonly followed in Data Science studies:
A high-level dataset review to understand the data available and its quality. Here, three tasks are performed: learning which features and labels are available, detecting missing data, and characterizing the features to see whether they are binary or categorical.
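With pandas, these three review tasks can be sketched in a few lines. The column names below are hypothetical stand-ins; the real dataset's fields may differ:

```python
import pandas as pd

# Hypothetical sample resembling the bank dataset; real column names may differ.
df = pd.DataFrame({
    "age": [34, 51, 29, None],
    "gender": ["F", "M", "M", "F"],
    "credit_card": [1, 0, 1, 1],
    "churned": [0, 1, 0, 1],
})

# 1) Which features/labels are available, and their types
print(df.dtypes)

# 2) Missing values per feature
print(df.isna().sum())

# 3) Cardinality helps spot binary/categorical features
print(df.nunique())
```

A feature with exactly two unique values is a candidate binary feature; a low cardinality relative to the number of rows suggests a categorical one.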
An exploratory data analysis (EDA) in order to quickly discover important features or engineer them. In this case, we simply visualize each of the relevant features using the right plot according to its nature, in order to learn whether that feature is biased for churning customers.
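As a sketch of this step, a couple of count plots with seaborn can reveal such biases for binary features (column names and values here are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample; the real dataset has ~10k rows and more features.
df = pd.DataFrame({
    "credit_card": [1, 1, 0, 1, 0, 1],
    "active_member": [0, 1, 1, 0, 0, 1],
    "churned": [1, 0, 0, 1, 1, 0],
})

# Count plots suit binary/categorical features; histograms or box plots
# would be the right choice for continuous ones.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
sns.countplot(data=df, x="credit_card", hue="churned", ax=axes[0])
sns.countplot(data=df, x="active_member", hue="churned", ax=axes[1])
fig.savefig("eda_counts.png")
```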
A data modelling stage, in which predictive models are built and their performance on new, unseen data is estimated. In this example, we simply split the dataset into train/test sets to train and evaluate three state-of-the-art models with an arbitrary choice of hyperparameters. A more elaborate approach to guarantee correct model generalization and to obtain reliable classification metrics would involve considering a validation dataset or using some cross-validation procedure in order to select the best model and its hyperparameters.
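A minimal sketch of this stage with scikit-learn, using synthetic data in place of the real dataset and an assumed choice of three popular classifiers (the text does not specify which models were used):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~24% positive class, 20 features, like the bank dataset.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.76], random_state=0)

# Stratified hold-out split so the churn rate is preserved in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Three illustrative models with default (arbitrary) hyperparameters.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "recall:", recall_score(y_test, model.predict(X_test)))
```

Swapping `train_test_split` for `cross_val_score` or `GridSearchCV` would give the more elaborate validation scheme mentioned above.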
Since the problem is posed in the form of a classification problem, the chosen metrics are precision and recall, which are defined next:
Recall or True Positive Rate (TPR) – The number of predicted positives that are actual positives, divided by the number of actual positives.
Precision or Positive Predictive Value (PPV) – The number of predicted positives that are actual positives, divided by the number of predicted positives.
Another relevant metric is the probability of false alarm or False Positive Rate (FPR) – The number of false positives divided by the number of negatives.
Recall is more relevant in this case, as it penalizes missing actual positives. A model may flag only a handful of samples it is very confident about and thus obtain a precision as high as desired, but in order to make sure the right customers are addressed, the number of correct guesses should be compared against the number of actual positives. This is exactly what recall does.
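All three metrics can be computed directly from the entries of the confusion matrix, as this toy example with made-up labels shows:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy ground truth and predictions (1 = churn), purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)      # TPR: correct positives / actual positives
precision = tp / (tp + fp)   # PPV: correct positives / predicted positives
fpr = fp / (fp + tn)         # false-alarm rate: false positives / actual negatives

print(recall, precision, fpr)  # → 0.75 0.75 0.25
```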
The receiver operating characteristics (ROC) curve is another common way of visualizing the performance of classification models. It helps visualize the different ways in which a model can be used to provide a more or less conservative behaviour in the predictions, helping to define the right trade-off between the number of predicted positives and the probability of false alarms. This allows for choosing the right model threshold.
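A minimal sketch of building a ROC curve with scikit-learn, using synthetic data and a logistic regression purely for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Each point on the curve is an (FPR, TPR) pair for one decision threshold;
# sliding along the curve makes the model more or less conservative.
scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.savefig("roc.png")
print("AUC:", roc_auc_score(y_test, scores))
```

Picking the point on the curve with an acceptable FPR fixes the model threshold used from then on.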
Finally, the confusion matrix is a very straightforward way of visualizing classification performance once the model threshold has been chosen. It basically summarizes the classification responses against the ground truth data.
An immediate indicator obtained from the dataset is that about 24% of the customers have churned. This is a static figure that could, and should, be recomputed periodically to monitor how it is affected by the actions of the company:
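Assuming the label is stored in a column such as "churned" (the real column name may differ), this indicator is a one-liner on the labelled dataset:

```python
import pandas as pd

# Toy sample; the real dataset yields a churn rate of about 24%.
df = pd.DataFrame({"churned": [1, 0, 0, 1, 0, 0, 0, 0]})

# The mean of a 0/1 label column is exactly the churn rate.
churn_rate = df["churned"].mean()
print(f"Churn rate: {churn_rate:.0%}")
```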
Relevant information can be concluded simply by drawing adequate plots of the available data. In the next figure, for example, it can be quickly seen that most churning customers have credit cards and that the bank has a very large number of inactive customers. Again, these metrics could be monitored frequently to try to learn more about their drivers.
As a conclusion of the use case, we obtain a model capable of classifying any customer, old or new, and of providing a probability of churn. With this probability, the customers most likely to churn can be immediately addressed with the right retention strategy in order to prevent them from churning. Of course, the model will make mistakes in its predictions, but overall it performs quite well, as can be observed in the following confusion matrix: when the model predicts that a customer will not churn, it is wrong for only 7% of customers, and it is already able to remove more than 70% of the customers from the analysis, letting the company focus on those more likely to churn.
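Ranking customers by their predicted churn probability is straightforward once a fitted classifier is available; here a random forest on synthetic data stands in for the real model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real, fitted churn model and customer data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Probability of churn per customer; column 1 is the positive class.
proba = clf.predict_proba(X)[:, 1]

# Indices of the 10 most at-risk customers, highest probability first.
top_risk = np.argsort(proba)[::-1][:10]
print(top_risk, proba[top_risk])
```

The retention team would then target `top_risk` first, e.g. with discounts or promotions.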
With Shapelets, relevant metrics/KPIs can be monitored frequently, and insights like the aforementioned ones can be instantly and seamlessly shared by the data scientist with all relevant departments in the organization.
Several interesting results arise from this study:
How does Shapelets help solve this challenge?
Shapelets is great for solving data science and data analysis problems and for easily sharing the resulting solutions across the organization. Access to databases and distributed processing is immediate, seamless and fully scalable. No skills in web development or development operations are needed to build fully-featured data apps and to effortlessly share them across the organization. To build use cases, the user does not need to learn new ways to solve data science problems, since Shapelets relies on several native tools commonly used by data scientists. In this particular use case, we rely on matplotlib and seaborn for visualization and scikit-learn for machine learning.
Lead Data Scientist
Adrián Carrio received his degree in Industrial Engineering from the University of Oviedo and his PhD in Automation and Robotics (Cum Laude) from the Technical University of Madrid. He has also worked as a researcher at Arizona State University and the Massachusetts Institute of Technology.