Telco Churn Rate

This is a data mining project carried out mainly in SAS Miner. The dataset was obtained from Kaggle.

The objective of the research is to predict whether Telco customers will churn based on their age, region, gender, payment method and transaction value, and to identify the factors that contribute to customer churn by examining the patterns in which churn occurs.

The details of the project are as follows:

Data Transformation

There are five datasets concerning a telco company based in Germany. Two derived attributes were created from the datasets to allow a more meaningful analysis. The first, region, was created in Excel by filtering on the postal code. Because the data comes from Germany, some research showed that the first two digits of the postal code indicate a wider area, comparable to a state. There are 7 states in the dataset: Bavaria, Berlin, Baden-Württemberg, Hamburg, Hesse, Lower Saxony and North Rhine-Westphalia. The second derived attribute, age, was derived from the customers' birthdates. The datasets were merged for easier analysis using a full join in R. After merging, the dataset has 13 attributes and 4324 rows. The attributes are customer, transaction, payment method, payment date, firstname, gender, postal code, region, birthday, age, termination, churn and churn date.
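The two derived attributes and the merge can be sketched in Python as follows. This is only an illustration, not the Excel and R steps actually used in the project: the prefix lookup table is a hypothetical, incomplete example (postal prefixes do not map one-to-one to states in general), the function names are made up, and the join assumes each key appears at most once per table.

```python
from datetime import date

# Hypothetical, incomplete prefix-to-state lookup used only for illustration.
PREFIX_TO_REGION = {
    "10": "Berlin",
    "20": "Hamburg",
    "30": "Lower Saxony",
    "40": "North Rhine-Westphalia",
    "60": "Hesse",
    "70": "Baden-Württemberg",
    "80": "Bavaria",
}


def region_from_postal_code(postal_code):
    """Return the region for a 5-digit postal code, or None if it is invalid or unknown."""
    code = str(postal_code).strip()
    if len(code) != 5 or not code.isdigit():
        return None
    return PREFIX_TO_REGION.get(code[:2])


def age_on(birthdate, reference):
    """Age in completed years on the reference date."""
    years = reference.year - birthdate.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years


def full_join(left, right, key):
    """Full outer join of two lists of dicts on `key` (assumes unique keys per list).

    Attributes missing on one side are simply absent from the merged row.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    merged = []
    for k in sorted(set(left_by_key) | set(right_by_key)):
        row = {key: k}
        row.update(left_by_key.get(k, {}))
        row.update(right_by_key.get(k, {}))
        merged.append(row)
    return merged
```

A code that is not five digits returns `None`, which makes invalid postal codes easy to spot in a later cleaning step.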

Data Cleaning

The dataset is relatively clean, with few impurities. There was, however, a single record with a missing birth date, which was deleted because the number of missing values was so small. After transforming the dataset, there were six invalid postal codes and one state that contained only a single customer. Both problems were handled by deleting the affected records.
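The deletion rules described above could look like this in Python. It is a minimal sketch under assumptions: the field names `customer`, `birthday` and `region` are illustrative, and an invalid postal code is assumed to have already been turned into a missing (`None`) region.

```python
from collections import defaultdict


def clean_by_deletion(rows):
    """Drop rows with a missing birthday or region, then drop single-customer regions."""
    complete = [
        r for r in rows
        if r.get("birthday") is not None and r.get("region") is not None
    ]
    # Count distinct customers per region among the remaining rows.
    customers_per_region = defaultdict(set)
    for r in complete:
        customers_per_region[r["region"]].add(r["customer"])
    return [r for r in complete if len(customers_per_region[r["region"]]) > 1]
```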

Data Selection

Based on the problem statement, only the Age, Churn, CustomerID, Gender, PaymentMethod, Region and TransactionValue attributes were selected; Birthday, ChurnDate, Date, Firstname, PostalCode and Termination were rejected in SAS Miner. The Churn attribute (Yes or NA) was chosen as the target.

Testing parameters

Several testing parameters were tried during the modelling phase to find the model with the lowest misclassification rate. First, the decision tree models were tested with different data partition ratios (train:validation): 50:50, 60:40, 70:30, 80:20 and 90:10. Then an interactive tree was built with a data partition ratio of 60:40, using different splitting rules based on the variables' logworth values. The 60:40 ratio was chosen initially because it produced the best autonomous tree. However, underfitting occurred, so the 90:10 partition was used instead, and it gave a more desirable subtree assessment plot. The same data partition ratio was also used for logistic regression. Three regression models were generated: one with the default method, one stepwise with the selection defaults, and one stepwise without the selection defaults. For the model without the selection defaults, the entry significance level was set to 1.0 and the stay significance level to 0.5.
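The train:validation ratios listed above can be illustrated with a simple seeded random split. This is a sketch only and does not reproduce how SAS Miner partitions data; the function name and the seed value are arbitrary assumptions.

```python
import random

# Training fractions for the 50:50, 60:40, 70:30, 80:20 and 90:10 partitions.
TRAIN_FRACTIONS = [0.5, 0.6, 0.7, 0.8, 0.9]


def partition(rows, train_fraction, seed=12345):
    """Shuffle a copy of `rows` and split it into (train, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    data = list(range(1000))  # stand-in for the cleaned rows
    for fraction in TRAIN_FRACTIONS:
        train, validation = partition(data, fraction)
        print(fraction, len(train), len(validation))
```

With a rare target such as Churn=Y, a larger training share gives the tree more positive examples to learn from, which is the motivation given for moving to 90:10.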

Choosing the best model

Based on the model comparison in Figure 1, Decision Tree 1 (Tree), the model chosen by SAS, uses the 60:40 data partition and produced the lowest validation misclassification rate at 0.0756. The other model, Decision Tree 7, has a slightly higher misclassification rate at 0.0806. Between these two models, Decision Tree 7 (Tree) was chosen because Decision Tree 1 was insufficiently trained, which indicates underfitting. The plot for Decision Tree 1 shows no intersection between the Train and Validation lines, which indicates that underfitting occurs (refer to Figure 5). A possible explanation is that Churn=Y, the target of this analysis, occurs in only a small number of observations in the dataset. One general remedy for underfitting is to assign more data to training, so the 90:10 partition decision tree model was chosen even though SAS suggested otherwise.

Also, the difference in the misclassification rate is small, at only 0.5 percentage points. In addition, Decision Tree 7 is less complex, using 6 leaves to predict customer churn compared with the 12 leaves of the model chosen for Decision Tree 1. Figure 2 shows the comparison between the logistic regression models. Although most of the regression models produced the same validation misclassification rate of approximately 0.0826, Reg2 was chosen as the best regression model by SAS Miner. The best decision tree model was then compared with the best regression model. The best model is the one with the lowest validation misclassification rate, which in this scenario is Decision Tree 7 (Tree).
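The arithmetic behind this comparison can be checked with a few lines of Python. The rates are the ones quoted above (Reg2's is approximate); the helper names are illustrative.

```python
# Validation misclassification rates quoted in the text (Reg2 is approximate).
validation_misclassification = {
    "Decision Tree 1": 0.0756,
    "Decision Tree 7": 0.0806,
    "Reg2": 0.0826,
}


def best_model(rates):
    """Name of the model with the lowest validation misclassification rate."""
    return min(rates, key=rates.get)


def accuracy(misclassification_rate):
    """Accuracy is the complement of the misclassification rate."""
    return 1.0 - misclassification_rate


# Gap between the two decision trees, in percentage points.
gap_points = (
    validation_misclassification["Decision Tree 7"]
    - validation_misclassification["Decision Tree 1"]
) * 100

# Final comparison: Decision Tree 1 is set aside for underfitting.
finalists = {
    name: rate
    for name, rate in validation_misclassification.items()
    if name != "Decision Tree 1"
}
```

On rates alone, `best_model(validation_misclassification)` picks Decision Tree 1, which matches the SAS suggestion. Decision Tree 7 wins only after Decision Tree 1 is set aside, as in `finalists`.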

Interpretation of the best model

As seen in Figure 3, Decision Tree 7 was chosen because it has a lower complexity and shows sufficient training. It predicts correctly with an accuracy of 91.94%, which corresponds to a misclassification rate of 8.06%. Referring to Figure 4, Node 28 has the largest separation between Churn=Y and Churn=NA. It is deemed the best node because it has the highest churn percentage, at 67%. In other words, customers whose transaction value is more than 0.015, whose region is Hesse, whose payment method is either cheque or cash and whose age is less than 24.5 have a 67% chance of churning.

On the other hand, the node rules and the decision tree show that the model predicts non-churn better than churn. For example, in Node 29, customers whose transaction value is more than 0.015, whose region is North Rhine-Westphalia, whose payment method is either cheque or cash and whose age is less than 24.5 have a 90% chance of not churning. The only difference between Nodes 28 and 29 is the region. Hence, the telco company should investigate Hesse in depth to find out why customers in this region are more likely to churn. One possibility is that the region has a worse telecommunication signal than North Rhine-Westphalia. Based on Node 3, the company should work to retain customers who are above age 44.5 with initiatives such as a free data plan for a week or discounted bills during the company anniversary.
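The rules for Nodes 28 and 29 can be written out as a small Python function to make the conditions explicit. This sketch covers only those two nodes, not the full tree; the function name and the lowercase payment method labels are assumptions.

```python
def churn_probability(age, region, payment_method, transaction_value):
    """Churn probability for Nodes 28 and 29 only; None for any other path."""
    # Conditions shared by both nodes.
    shared_path = (
        transaction_value > 0.015
        and payment_method in {"cheque", "cash"}
        and age < 24.5
    )
    if not shared_path:
        return None
    if region == "Hesse":
        return 0.67  # Node 28: 67% chance of churning
    if region == "North Rhine-Westphalia":
        return 0.10  # Node 29: 90% chance of not churning
    return None
```

For instance, `churn_probability(22, "Hesse", "cash", 0.02)` falls into Node 28, while the same customer in North Rhine-Westphalia falls into Node 29.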

Based on the subtree assessment plot, the largest decision tree generated has more than 20 leaves. Overfitting starts at 14 leaves. At 6 leaves, the validation data reaches its lowest misclassification rate of 0.0806, and the number of leaves is at a compromise point. Beyond 6 leaves, the validation line stays flat at the same rate. SAS Miner chose the 6-leaf tree as the optimum because fewer leaves mean a less complex model, in line with the principle of parsimony, so the model is a good compromise.
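The selection rule described here, picking the smallest tree that reaches the lowest validation misclassification rate, can be sketched as below. Only the 0.0806 rate at 6 leaves and the flat line beyond it come from the text; the remaining points of the curve are made-up placeholders, and this is not SAS Miner's actual subtree selection procedure.

```python
def select_subtree(validation_by_leaves):
    """Smallest number of leaves that attains the minimum validation rate."""
    lowest = min(validation_by_leaves.values())
    return min(
        leaves
        for leaves, rate in validation_by_leaves.items()
        if rate == lowest
    )


# Hypothetical curve: values below 6 leaves are placeholders for illustration.
example_curve = {
    1: 0.0830,
    2: 0.0830,
    4: 0.0815,
    6: 0.0806,
    10: 0.0806,
    14: 0.0806,
    20: 0.0806,
}
chosen_leaves = select_subtree(example_curve)
```

Breaking ties toward fewer leaves is what encodes the parsimony argument: when larger trees do no better on validation data, the simpler one is preferred.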

MagdaleneHo / Telco-Churn-Rate