Predict your churn with Shapelets



Use Case


17 October



The case: Predict your churn

With this use case, we present a toy example of building a simple data analysis solution for behaviour analysis using Shapelets. In particular, we are interested in the factors that influence a bank customer's decision to churn.

See the full dataset

Customer churn is one of the most relevant metrics for a business, because it shows how good the company is at retaining customers. In this case study we aim to gain insight into the factors behind a bank customer's decision to churn, and to build accurate, explainable models that predict churn before it actually happens.

While this case study focuses on bank customers, the approach applies to any other sector concerned with customer retention, as long as large customer databases are available. Furthermore, while reducing churn is the objective here, the same approach could serve other objectives, such as driving live marketing strategies, understanding customer habits to improve service quality based on demand, or reducing product failures based on usage profiles.

The use case is based on a dataset of 10k anonymized bank customers, described by 20 commonly available features covering demographic, customer relationship and transactional information. Some of the customers in this dataset have already churned, and this information is used as ground truth to find out what differentiates churning from non-churning customers and to build a model that performs this classification with as few errors as possible.
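A first pass over such a table usually counts missing values and checks which columns are binary, categorical or numeric. The sketch below does this with the standard library only; the sample rows and the column names (`country`, `gender`, `tenure`, `is_active`, `churned`) are made up for illustration and are not taken from the real dataset.

```python
import csv
import io

# Tiny made-up sample; the column names are hypothetical.
RAW = """customer_id,country,gender,tenure,is_active,churned
1,Spain,Female,2,1,1
2,France,Male,,0,0
3,Spain,Male,7,1,0
4,Germany,Female,1,0,1
5,France,,5,1,0
"""

rows = list(csv.DictReader(io.StringIO(RAW)))


def is_number(text):
    """Return True if the text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def review(rows):
    """Summarize each column: missing count, distinct values, rough kind."""
    summary = {}
    for column in rows[0]:
        # Empty strings are treated as missing values.
        present = [row[column] for row in rows if row[column] != ""]
        distinct = set(present)
        if len(distinct) == 2:
            kind = "binary"
        elif all(is_number(value) for value in present):
            kind = "numeric"
        else:
            kind = "categorical"
        summary[column] = {
            "missing": len(rows) - len(present),
            "distinct": len(distinct),
            "kind": kind,
        }
    return summary


summary = review(rows)
for column, info in summary.items():
    print(column, info)
```

On a real table the same checks are typically done with a dataframe library, but the questions asked of each column stay the same.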

Furthermore, the models obtained in this study can provide the churn probability for a given customer. This makes it possible to prioritize actions on the customers most likely to churn, for example by offering them special discounts or promotions.

The use case is organized as follows. First, a high-level dataset review is performed to understand the data available and its quality. Then, an exploratory data analysis (EDA) is performed in order to quickly discover relevant features or engineer them. Next comes the data modelling stage, in which predictive models are built and the performance of the model on new, unseen data is estimated. Finally, the most relevant conclusions from the analysis are summarized.


Several challenges arise in this case study, some of which are quite common to many data science studies: 

Working with datasets for which limited background information is available. This should not be the case in real applications, though.

Dealing with missing data or, more generally, with datasets that were produced without later data analysis in mind.

Identifying biases in the data that help distinguish churning from non-churning customers, which can be used to select the relevant features for predictive models.

Understanding and choosing the right metrics for the specific problem being solved.

Understanding, selecting, training and using predictive models efficiently, maximizing their expected performance on the chosen metrics for unseen data.

Coming up with useful insights for the business and helping to prioritize customer-related activities and their targets.


The methodology to predict churn is based on three main steps commonly followed in Data Science studies:  

A high-level dataset review to understand the data available and its quality. Here, three tasks are performed: learning which features and labels are available, finding missing values, and learning about the characteristics of the features, for example whether they are binary or categorical.

An exploratory data analysis (EDA) in order to quickly discover important features or engineer them. In this case, we simply visualize each relevant feature with a plot suited to its nature, in order to learn whether that feature is distributed differently for customers who churn.

A data modelling stage, in which predictive models are built and the performance of the model on new, unseen data is estimated. In this example, we simply split the dataset into train/test sets to train and evaluate three state-of-the-art models with an arbitrary choice of hyperparameters. A more elaborate approach, which would better check that the model generalizes and would yield more reliable classification metrics, would use a validation dataset or a cross-validation procedure to select the best model and its hyperparameters.
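The more elaborate approach mentioned above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`, not the bank dataset), and the three candidate models are an arbitrary pick, since only random forest is named in this use case.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic, imbalanced stand-in for the customer table (not the real data).
X, y = make_classification(
    n_samples=600,
    n_features=10,
    n_informative=5,
    weights=[0.76, 0.24],
    random_state=0,
)

# Hold out a test set that is never used for model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate models; the choice of these three is an assumption.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Stratified folds keep the churn proportion similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_recall = {
    name: cross_val_score(model, X_train, y_train, cv=cv, scoring="recall").mean()
    for name, model in candidates.items()
}

# Select on cross-validated recall, then evaluate once on the held-out test set.
best_name = max(cv_recall, key=cv_recall.get)
best_model = candidates[best_name].fit(X_train, y_train)
test_recall = recall_score(y_test, best_model.predict(X_test))

print(cv_recall)
print(best_name, round(test_recall, 3))
```

Selecting on cross-validated scores and touching the test set only once keeps the final estimate from being biased by the selection itself.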


Since the problem is posed in the form of a classification problem, the chosen metrics are precision and recall, which are defined next:  

Recall or True Positive Rate (TPR) – Number of predicted positives that are actual positives, divided by the number of actual positives. 

Precision or Positive Predictive Value (PPV) – Number of predicted positives that are actual positives, divided by the number of predicted positives. 

Another relevant metric is the probability of false alarm or False Positive Rate (FPR) – Number of false positives divided by the number of actual negatives.
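The three definitions above translate directly into code. The sketch below computes them from the four cells of a confusion matrix; the counts are made up for illustration.

```python
def recall(tp, fn):
    """True positive rate: correctly flagged positives over actual positives."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Positive predictive value: correct positives over predicted positives."""
    return tp / (tp + fp)


def false_positive_rate(fp, tn):
    """False alarms over actual negatives."""
    return fp / (fp + tn)


# Made-up counts for 100 customers: 25 churners, 75 non-churners.
tp, fp, fn, tn = 10, 5, 15, 70

print(f"recall    = {recall(tp, fn):.3f}")
print(f"precision = {precision(tp, fp):.3f}")
print(f"FPR       = {false_positive_rate(fp, tn):.3f}")
```

With these counts, recall is 10/25, precision is 10/15 and the false positive rate is 5/75.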

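Recall is more relevant in this case as it penalizes the wrong classification of actual positives, that is, churners the model fails to flag. A model may flag only the few customers it is most certain about and thus obtain a precision as high as desired while missing most churners. To make sure the right customers are addressed, the number of correct guesses should be compared against the number of actual positives, which is exactly what recall does. Recall alone can in turn be inflated by flagging every customer, which is why precision and the false positive rate are reported alongside it.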
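A small example on made-up labels shows why precision and recall need to be read together. It compares two extreme strategies: flagging only the single customer the model is most certain about, and flagging every customer.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, with 1 meaning churn."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up ground truth: 4 churners out of 10 customers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Strategy A: flag only one customer, who happens to be a churner.
cautious = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# Strategy B: flag everyone.
flag_all = [1] * 10

cautious_scores = precision_recall(y_true, cautious)
flag_all_scores = precision_recall(y_true, flag_all)

print("cautious:", cautious_scores)   # precision 1.0, recall 0.25
print("flag all:", flag_all_scores)   # precision 0.4, recall 1.0
```

The cautious strategy reaches perfect precision but finds only a quarter of the churners, while flagging everyone reaches perfect recall at a precision equal to the churn rate.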

The receiver operating characteristic (ROC) curve is another common way of visualizing the performance of classification models. It shows how the same model can be made more or less conservative in its predictions, which helps to define the right trade-off between the number of predicted positives and the probability of false alarm. This makes it possible to choose the right model threshold.
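The points of a ROC curve can be obtained by sweeping the decision threshold over the model scores. The sketch below does this on made-up scores and then picks the threshold with the highest recall under a false alarm budget; the `pick_threshold` helper and the budget value are illustrative assumptions, not part of the original study.

```python
def roc_points(y_true, scores):
    """Return (threshold, FPR, TPR) for each distinct score used as threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p)
        fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def pick_threshold(points, max_fpr):
    """Pick the threshold with the highest TPR whose FPR stays within budget."""
    allowed = [point for point in points if point[1] <= max_fpr]
    return max(allowed, key=lambda point: point[2])[0]


# Made-up labels (1 = churn) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

points = roc_points(y_true, scores)
for threshold, fpr, tpr in points:
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")

chosen = pick_threshold(points, max_fpr=0.34)
print("chosen threshold:", chosen)
```

Lowering the threshold moves along the curve towards more detected churners and more false alarms; the budget expresses how many false alarms the business is willing to accept.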

Finally, the confusion matrix is a straightforward way of visualizing classification performance once the model threshold has been chosen. It summarizes the model's predictions against the ground truth.
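As a minimal sketch, a confusion matrix for a chosen threshold can be built by counting each (actual, predicted) pair. The labels, scores and the threshold of 0.5 below are made up for illustration.

```python
from collections import Counter


def confusion_matrix(y_true, scores, threshold):
    """Count true/false negatives and positives at the given threshold."""
    counts = Counter()
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        counts[(truth, predicted)] += 1
    return {
        "tn": counts[(0, 0)],
        "fp": counts[(0, 1)],
        "fn": counts[(1, 0)],
        "tp": counts[(1, 1)],
    }


# Made-up labels (1 = churn) and model scores.
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.7, 0.2, 0.3, 0.1, 0.4, 0.05]

matrix = confusion_matrix(y_true, scores, threshold=0.5)

# Rows are the ground truth, columns are the predictions.
print("              pred stay  pred churn")
print(f"actual stay   {matrix['tn']:9d}  {matrix['fp']:10d}")
print(f"actual churn  {matrix['fn']:9d}  {matrix['tp']:10d}")
```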


See the full solution in Python

An immediate indicator to obtain from the dataset is that about 24% of the customers have churned. This is a static figure, which could and should be recomputed periodically to monitor how it is affected by the actions of the company.

Relevant information can be obtained just by drawing adequate plots of the available data. In the next figure, for example, it can be quickly seen that most churning customers have credit cards and that the bank has a very large number of inactive customers. Again, these metrics could be monitored frequently to learn more about their drivers.
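The same kind of bias can also be checked numerically before plotting. The sketch below uses made-up records and a hypothetical activity flag to compute the churn rate per group; a clear gap between the groups and the overall rate is what a bar plot of the feature would reveal.

```python
from collections import defaultdict

# Made-up records: (hypothetical feature value, churned flag).
records = [
    ("active", 0), ("active", 0), ("active", 1), ("active", 0),
    ("inactive", 1), ("inactive", 0), ("inactive", 1), ("inactive", 1),
]


def churn_rate_by_group(records):
    """Return the fraction of churned customers for each feature value."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for group, flag in records:
        totals[group] += 1
        churned[group] += flag
    return {group: churned[group] / totals[group] for group in totals}


rates = churn_rate_by_group(records)
overall = sum(flag for _, flag in records) / len(records)

print("overall churn rate:", overall)
for group, rate in rates.items():
    print(f"{group}: {rate:.2f}")
```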

As a conclusion of the use case, we obtain a model capable of classifying any customer, old or new, and providing a probability of churn. With this probability, the customers most likely to churn can be addressed immediately with the right retention strategy in order to prevent the possible churn. Of course, the model will make mistakes in its predictions, but overall it performs reasonably well, as can be observed in the following confusion matrix: only about 7% of the customers who do not churn are wrongly flagged as churners, and the model is already able to remove more than 70% of the customers from the analysis, letting the company focus on those more likely to churn.
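Prioritizing on the churn probability amounts to ranking customers by their predicted probability and acting on the top of the list. The sketch below assumes the probabilities have already been produced by a fitted model; the customer ids, the values and the `top_at_risk` helper are hypothetical.

```python
# Hypothetical churn probabilities, as a fitted model might return them.
churn_probability = {
    "C001": 0.08,
    "C002": 0.81,
    "C003": 0.35,
    "C004": 0.67,
    "C005": 0.12,
}


def top_at_risk(probabilities, k):
    """Return the k customer ids with the highest churn probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [customer_id for customer_id, _ in ranked[:k]]


retention_targets = top_at_risk(churn_probability, k=2)
print(retention_targets)  # ['C002', 'C004']
```

The value of `k` would in practice follow from the retention budget, for example the number of discounts that can be offered in a campaign.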

With Shapelets, relevant metrics/KPIs can be monitored frequently, and insights like the ones above can be shared instantly and seamlessly by the data scientist with all relevant departments in the organization.


Several interesting results arise from this study:  


  • The first result is probably already known to the business, since it is straightforward to obtain: the churn rate. In this example, about 24% of the customers have churned.
  • Issues in international branches can be discovered by comparing churn ratios across countries. In this case, the churn ratio remains constant across countries.
  • Gender appears to be a relevant feature for churn. The proportion of female customers who churn is greater than that of male customers but, overall, most churning customers are male.
  • The overall proportion of inactive members is quite high, suggesting that the bank may need a program to turn this group into active customers.
  • Customers with extreme salaries churn more.
  • With regard to tenure, churn is less common among customers that have been with the bank for several years. A retention effort during the first 2-3 years could reduce churn.
  • Random forest appears to be a good model for this classification problem. However, the use of validation techniques is recommended in order to select the best type of model and its hyperparameters.
  • The best model obtained a precision of 50% (half of the customers predicted to churn actually churn), a recall of around 21% (the fraction of churning customers that are correctly identified) and a false positive rate of 7% (the fraction of non-churning customers that are wrongly predicted to churn).
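The reported figures can be cross-checked with a little arithmetic. The sketch below rebuilds the confusion matrix as fractions of all customers from the churn rate, recall and precision quoted above. It assumes that the 24% churn rate also holds in the evaluation set, which is not stated in the text.

```python
churn_rate = 0.24   # share of customers who churned (from the text)
recall = 0.21       # reported recall
precision = 0.50    # reported precision

# Confusion matrix cells as fractions of all customers.
tp = recall * churn_rate                 # churners correctly flagged
fn = churn_rate - tp                     # churners missed
fp = tp * (1 - precision) / precision    # non-churners wrongly flagged
tn = (1 - churn_rate) - fp               # non-churners correctly left alone

# FPR: wrongly flagged non-churners over all non-churners.
false_positive_rate = fp / (fp + tn)
# False omission rate: missed churners over all predicted non-churners.
false_omission_rate = fn / (fn + tn)

print(f"TP={tp:.4f} FP={fp:.4f} FN={fn:.4f} TN={tn:.4f}")
print(f"false positive rate = {false_positive_rate:.3f}")
print(f"false omission rate = {false_omission_rate:.3f}")
```

Under these assumptions the false positive rate comes out at roughly 6.6%, in line with the reported 7%, and true negatives make up about 71% of all customers. The share of predicted non-churners who do churn would be closer to 21%, so the 7% figure is best read as a false positive rate.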


Shapelets is well suited to solving data science and data analysis problems and to sharing the resulting solutions easily across the organization. Access to databases and distributed processing is immediate, seamless and fully scalable. No web development or DevOps skills are needed to build fully featured data apps and share them across the organization. When building use cases, users do not need to learn new ways of solving data science problems, since Shapelets relies on native tools commonly used by data scientists. In this particular use case, we rely on matplotlib and seaborn for visualization and on scikit-learn for machine learning.