Client Churn Predictor - Comparison of Classification Models

Author: Paul A. Beata
GitHub: pbeata


_The data set is available in the project repository under the "data" folder: Link to Repository_

The goal of this project is to analyze and visualize important aspects of the data using exploratory data analysis, then create various predictive models for customer churn. The main focus will be to use tree-based machine learning methods.

Part 1: The Data

The data set contains 7032 rows and 21 columns. It has already been cleaned, so there should be no missing values; our focus here is on using machine learning models to predict whether or not a customer will churn, where churn is defined as leaving or stopping the service. In this case, we are analyzing customer data from a telecommunications company.

Rows: Each row is an observation, i.e., a customer of the company.

Columns: Each column represents a different feature of the data set, and the final column is the target, i.e., whether the client "churned". There are both numerical (continuous) and categorical data types, totaling 20 features. Each column is labelled with a descriptive name that clearly describes the feature. One important column to mention is "tenure"; this refers to the number of months that the client has been with this service.

Note: The feature "SeniorCitizen" is actually a categorical feature since the values are either 1 or 0 (yes or no). We will leave it as is here since there's no need to make a dummy variable out of this feature: it is already encoded with 0 (not a senior citizen) or 1 (senior citizen).

Part 2: Exploratory Data Analysis

Check the balance of the target values ("Churn"): there are only two unique values in the Churn column, "Yes" (the customer left) and "No" (the customer stayed). For our ML models, we may need to resample in order to obtain a balanced training data set with an equal number of "Yes" and "No" churn results.
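As a minimal sketch of this balance check (using a tiny hypothetical sample in place of the real data set), `value_counts` gives both absolute counts and class fractions:

```python
import pandas as pd

# Hypothetical mini-sample standing in for the real data set;
# the real "Churn" column uses the same "Yes"/"No" labels.
df = pd.DataFrame({"Churn": ["No", "No", "No", "No", "Yes", "Yes"]})

# Absolute counts and relative frequencies of each target class
counts = df["Churn"].value_counts()
fractions = df["Churn"].value_counts(normalize=True)
print(counts)
print(fractions)
```

The `normalize=True` output makes the degree of imbalance immediately visible as a percentage of the whole data set.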

Explore the distribution of Total Charges between Churn categories with a violin plot:
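A sketch of this violin plot, using synthetic stand-in data with the project's column names (the distribution parameters below are invented for illustration only):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs in scripts
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data with the project's column names
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Churn": ["No"] * 50 + ["Yes"] * 50,
    "TotalCharges": np.concatenate([
        rng.normal(2500, 600, 50),   # invented values for non-churners
        rng.normal(1200, 400, 50),   # invented values for churners
    ]),
})

# Violin plot: full distribution of Total Charges per Churn class
ax = sns.violinplot(data=df, x="Churn", y="TotalCharges")
ax.set_title("Total Charges by Churn class")
plt.tight_layout()
```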

Display the distributions of the tenure (i.e., months using the service) split by Churn class (yes or no):

Create a box plot showing the distribution of Total Charges per Contract type, also add in a hue coloring based on the Churn class (churn yes or no):
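One way to sketch this grouped box plot, again with synthetic stand-in data (the contract labels match the telecom data set's usual categories, but the values are invented):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in with the project's column names
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], n),
    "Churn": rng.choice(["No", "Yes"], n),
    "TotalCharges": rng.uniform(20, 8000, n),
})

# Box plot of Total Charges per contract type, split by the Churn hue
ax = sns.boxplot(data=df, x="Contract", y="TotalCharges", hue="Churn")
ax.legend(title="Churn", loc="upper left")
plt.tight_layout()
```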

Visualize the Features Correlated with Churn

Here we will create a bar plot showing the correlation of a subset of the features to the churn class label (i.e., features that have correlation with customer churn). For the categorical features, we will need to convert them into dummy variables first, as we can only calculate correlation for numerical features. Note that we specifically list the features below: we do not want to check the correlation for every feature in the data set as some features have too many unique instances for such an analysis (such as customerID, which has no predictive power anyway).

['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'InternetService',
 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
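The dummy-encoding and correlation step can be sketched as follows, using a tiny hypothetical sample with just two of the listed features (the real notebook would use all 16):

```python
import pandas as pd

# Hypothetical mini-sample; the real data set has all 16 categorical
# features listed above plus the "Churn" target.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "Month-to-month", "One year"],
    "PaperlessBilling": ["Yes", "No", "Yes", "No"],
    "Churn": ["Yes", "No", "Yes", "No"],
})

cat_feats = ["Contract", "PaperlessBilling"]  # subset for illustration

# Convert categoricals to dummy variables, then correlate each
# dummy column with a 0/1 encoding of the churn label
dummies = pd.get_dummies(df[cat_feats]).astype(int)
churn_numeric = (df["Churn"] == "Yes").astype(int)
corr = dummies.corrwith(churn_numeric).sort_values()
print(corr)
```

Sorting the correlations first means a horizontal bar plot of `corr` reads cleanly from the strongest negative association to the strongest positive one.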

Observations

Part 3: Cohort Analysis

This section focuses on segmenting customers into "cohorts" based on their tenure, allowing us to examine differences between the cohorts.

Histogram displaying the distribution of the 'tenure' column, which is the number of months a customer has been (or was) using the service:

Now we will create histograms separated by two additional features: Churn and Contract Type.

Scatter plot of the Total Charges incurred by each customer versus their current Monthly Charges, colored by churn class (yes or no):
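A sketch of this scatter plot on synthetic data; note that Total Charges roughly equals Monthly Charges times tenure, which is what produces the fan-shaped pattern in the real data:

```python
import matplotlib
matplotlib.use("Agg")
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in: total charges grow with tenure
rng = np.random.default_rng(7)
n = 200
monthly = rng.uniform(20, 120, n)
tenure = rng.integers(1, 73, n)
df = pd.DataFrame({
    "MonthlyCharges": monthly,
    "TotalCharges": monthly * tenure,
    "Churn": rng.choice(["No", "Yes"], n),
})

# Scatter of total vs. monthly charges, colored by churn class
ax = sns.scatterplot(data=df, x="MonthlyCharges", y="TotalCharges",
                     hue="Churn", alpha=0.5)
```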

Create Cohorts Based on Tenure

Here we convert each unique tenure length (1 month, 2 months, 3 months, ..., N months) into a simple cohort grouping: essentially we are putting each customer into a "bin" according to their tenure in months. By treating each unique tenure group as a cohort, we can calculate the Churn rate (percentage that have "yes" Churn) per cohort.
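The per-tenure churn rate described above can be computed with a groupby, sketched here on a tiny hypothetical sample:

```python
import pandas as pd

# Hypothetical mini-sample with one row per customer
df = pd.DataFrame({
    "tenure": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Churn":  ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "Yes"],
})

# Percentage of customers with "Yes" churn within each tenure value
churn_rate = df.groupby("tenure")["Churn"].apply(
    lambda s: 100.0 * (s == "Yes").mean()
)
print(churn_rate)
```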

Now we can plot the Churn Rate (percent) per tenure group 1-72 months:

Create Broader Cohort Groups

Based on the tenure column values, we can create a new column called "Tenure Cohort" with 4 separate categories:
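One way to bin tenure into four cohorts is `pd.cut`; the exact bin edges and labels below are an assumption for illustration and may differ from the original notebook's:

```python
import pandas as pd

df = pd.DataFrame({"tenure": [3, 12, 18, 30, 60]})

# Assumed bin edges and labels for the four cohorts (illustrative only)
bins = [0, 12, 24, 48, 72]
labels = ["0-12 Months", "12-24 Months", "24-48 Months", "Over 48 Months"]

# pd.cut uses right-closed intervals by default, so tenure 12
# falls in the (0, 12] bin
df["Tenure Cohort"] = pd.cut(df["tenure"], bins=bins, labels=labels)
print(df)
```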

Count plot showing the churn count per cohort:

Now we can create a grid of count plots showing the number of customers per Tenure Cohort, separated out by contract type and colored by the Churn hue:


Part 4: Predictive Modeling

Let's explore 4 different tree-based methods to try to predict customer churn:

Data Preparation for ML Models

Balanced Target Labels?

In this data set, we do NOT have balanced data for the target values. To check this, we can see the number of observations in the training and testing data sets here:

Ideally, we would have a 50/50 split such that each data set contains an equal number of customers who churned and who did not. This is an issue that we will address later in the notebook. For now, let's assume it's not a problem and continue with the classification task.

UPDATE: Using Balanced Target Data

Without balancing the data, we achieve a maximum accuracy of 81% when using either a single decision tree with a max depth of 6 or AdaBoost with a max depth of 17. However, our recall results are too low, and in this application recall is important.

Now we will use a balanced data set for the machine learning models:
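One simple manual way to balance the classes, sketched on a small hypothetical sample, is to downsample the majority class to the size of the minority class and then shuffle (the source balances "manually" but does not specify the exact method, so this is one plausible approach):

```python
import pandas as pd

# Hypothetical imbalanced stand-in: 8 "No" vs 3 "Yes"
df = pd.DataFrame({
    "Churn": ["No"] * 8 + ["Yes"] * 3,
    "tenure": range(11),
})

# Downsample the majority class to the minority class size, then shuffle
yes = df[df["Churn"] == "Yes"]
no = df[df["Churn"] == "No"].sample(n=len(yes), random_state=42)
balanced = pd.concat([yes, no]).sample(frac=1, random_state=42)
balanced = balanced.reset_index(drop=True)
print(balanced["Churn"].value_counts())
```

Downsampling discards majority-class rows; upsampling the minority class (or techniques like SMOTE) would be alternatives when the data set is small.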

Do we need to scale the data as well?

We will test with and without scaling the features.

Single Decision Tree Classifier

Decision tree performance evaluation:

  1. Train a single decision tree model
  2. Evaluate its performance metrics, including the classification report and a confusion matrix plot
  3. Calculate feature importances from the decision tree

We notice here that the most important features for the single decision tree model are related to the contract type. Since we only use a max depth of 4, we can also plot the decision tree below:
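The steps above can be sketched as follows; synthetic features stand in for the prepared (dummy-encoded) churn data, and the tree is kept shallow so it can be drawn (the text mentions a max depth of 4):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in for the prepared (dummy-encoded) churn features
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
feature_names = [f"feat_{i}" for i in range(8)]  # placeholder names

# Shallow tree, as in the text, so the plot stays readable
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X, y)

# Feature importances, sorted for a bar plot
importances = pd.Series(tree.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

# With max_depth=4 the fitted tree is small enough to draw in full
fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(tree, feature_names=feature_names, filled=True, ax=ax)
```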

Random Forest Classifier

Boosted Trees

AdaBoost

Gradient Boosted Classifier

Additional Classification Methods

Support Vector Machines

K-Nearest Neighbors

For the KNN method, we must also use the scaled features, as with SVM.

Since we don't know the best value for K, we will compare the results for K=1 to K=20 by looping over these values and finding the best accuracy:
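This elbow-style search over K can be sketched as a simple loop; synthetic data stands in for the scaled churn features here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# KNN is distance-based, so fit the scaler on the training set only
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Loop over K = 1..20 and record the test accuracy for each
accuracies = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_s, y_train)
    accuracies[k] = knn.score(X_test_s, y_test)

best_k = max(accuracies, key=accuracies.get)
print(f"best K = {best_k} with accuracy {accuracies[best_k]:.3f}")
```

Fitting the scaler on the training split alone avoids leaking information from the test set into the model.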

Conclusion

In this analysis, we have built several tree-based classifier models and also explored support vector machines (SVM) and K-nearest neighbors (KNN). After balancing the data manually and using a 90%/10% train-test split, we achieved the best overall performance with an AdaBoost classifier optimized via grid search. To assess each model, we recorded the accuracy and recall.

| Model Type | Recall (0) | Recall (1) | Accuracy |
| --- | --- | --- | --- |
| single decision tree (default) | 0.64 | 0.68 | 0.68 |
| single decision tree (max_depth=4) | 0.78 | 0.76 | 0.77 |
| random forest (default) | 0.81 | 0.75 | 0.78 |
| random forest (max_depth=7) | 0.78 | 0.80 | 0.79 |
| ada-boosted (default) | 0.80 | 0.80 | 0.80 |
| ada-boosted (n_estimators=16) | 0.79 | 0.82 | 0.81 |
| ada-boosted (grid search CV) | 0.80 | 0.83 | 0.82 |
| gradient-boosted (default) | 0.80 | 0.79 | 0.79 |
| support vector machines (default) | 0.76 | 0.78 | 0.77 |

Observations:

Here we can check the individual results to understand how well our model is performing:

Of the 187 customers in our balanced (test) data set who churned, we predicted 155 of them correctly: 83%. The overall accuracy was about 82% as we also correctly predicted 150 of the customers who did not churn: (150 + 155)/374 = 81.6%.

Part 5: Next Steps ...

Feature Selection and Improving the Models

Additionally, we can try the other ML models again in order to improve performance. However, this time we could include only the most important features by performing feature engineering and investigating how each aspect of the data affects customer churn.