Telecom Churn - Indian & Southeast Asian Market
For many incumbent operators, retaining highly profitable customers is the number one business goal. To reduce customer churn, telecom companies need to predict which customers are at high risk of churning.
In this project, we will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.
In the postpaid model, when customers want to switch to another operator, they usually inform the existing operator to terminate the services, and we directly know that this is an instance of churn.
However, in the prepaid model, customers who want to switch to another network can simply stop using the services without any notice, and it is hard to know whether someone has actually churned or is simply not using the services temporarily (e.g. someone may be on a trip abroad for a month or two and then intend to resume using the services again).
Thus, churn prediction is usually more critical (and non-trivial) for prepaid customers, and the term ‘churn’ should be defined carefully. Also, prepaid is the most common model in India and southeast Asia, while postpaid is more common in Europe and North America. This project is based on the Indian and Southeast Asian market.
- Revenue-based churn: Customers who have not utilised any revenue-generating facilities such as mobile internet, outgoing calls, SMS etc. over a given period of time. One could also use aggregate metrics such as ‘customers who have generated less than INR 4 per month in total/average/median revenue’. The main shortcoming of this definition is that there are customers who only receive calls/SMSes from their wage-earning counterparts, i.e. they don’t generate revenue but still use the services. For example, many users in rural areas only receive calls from their wage-earning siblings in urban areas.
- Usage-based churn: Customers who have not shown any usage, incoming or outgoing, in terms of calls, internet etc. over a period of time (see the sketch after this list). A potential shortcoming of this definition is that by the time the customer has stopped using the services, it may be too late to take any corrective action to retain them. For example, if churn is defined over a ‘two-months zero usage’ period, predicting churn could be useless, since by that time the customer would have already switched to another operator.
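As a minimal sketch of the usage-based definition (assuming a dataframe df with usage columns named as in the dataset used later in this notebook, e.g. total_ic_mou_9 for total incoming minutes in month 9), the churn flag could be derived as:
# Usage-based churn: no incoming calls, outgoing calls or data usage over the chosen window (month 9 here)
usage_cols = ['total_ic_mou_9', 'total_og_mou_9', 'vol_2g_mb_9', 'vol_3g_mb_9']
df['churn'] = (df[usage_cols].sum(axis=1) == 0).astype(int)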
In this project, we will define high-value customers based on a certain metric and predict churn only on high-value customers.
- The ‘good’ phase: In this phase, the customer is happy with the service and behaves as usual.
- The ‘action’ phase: The customer experience starts to sour in this phase, e.g. they get a compelling offer from a competitor, face unjust charges, become unhappy with service quality etc. In this phase, the customer usually shows different behaviour than in the ‘good’ months. It is crucial to identify high-churn-risk customers in this phase, since corrective actions can still be taken at this point (such as matching the competitor’s offer, improving the service quality etc.)
- The ‘churn’ phase: In this phase, the customer is said to have churned. We define churn based on this phase. Also, it is important to note that at the time of prediction (i.e. the action months), this data is not available to us for prediction. Thus, after tagging churn as 1/0 based on this phase, we discard all data corresponding to this phase.
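To make the phase tagging concrete, a minimal sketch of the month-to-phase mapping used in this notebook (months 6 and 7 form the good phase, month 8 the action phase and month 9 the churn phase):
# Month suffixes in the column names map onto the customer life-cycle phases
phase_map = {
    6: 'good',    # good phase
    7: 'good',    # good phase
    8: 'action',  # action phase - prediction happens here
    9: 'churn',   # churn phase - used only to tag churn, then dropped
}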
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import time
import warnings
warnings.filterwarnings('ignore')  ## Suppress Warnings
from sklearn.metrics import recall_score, accuracy_score, confusion_matrix, f1_score, classification_report
from sklearn.metrics import precision_score, auc, roc_auc_score, roc_curve, precision_recall_curve
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
tel_df = pd.read_csv('telecom_churn_data.csv')
tel_df.head()
tel_df.shape
tel_df.info(verbose= True)
The data has 99999 rows and 226 columns. The column data types are float, integer and object: 179 float columns, 35 integer columns and 12 object columns. Some of the object columns are actually dates and need to be converted to datetime format. There are also null values in several columns; let's inspect those first.
tel_df.describe().T
Some columns contain only a single value. These will not contribute anything to the model, so we identify and drop them.
single_val_columns = []
for i in tel_df.columns:
    if tel_df[i].nunique() == 1:
        single_val_columns.append(i)
tel_df.drop(single_val_columns, axis=1, inplace=True)
print('dropped columns: ', single_val_columns)
((tel_df.isnull().sum()/tel_df.shape[0])*100).round(2).sort_values(ascending=False)
- There are 40 columns with more than 70% null values. Ideally we could remove these columns, since imputing columns with such a high proportion of nulls would not give reliable predictor values. But first we need to analyse what types of columns these are.
- Some of them, like night_pck_user_X or fb_user_X (where X signifies the month 6/7/8/9), have high null counts. In these cases we can assume the customer simply did not take those packs or did not opt for social media utilities, so the nulls can be imputed with 0.
- We also have to identify high-value customers: those who have recharged with an amount greater than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase).
Thus, we first have to treat the columns related to recharge amount in order to identify high-value customers.
- total_rech_data_6 - 74.85 % missing values
- total_rech_data_7 - 74.43 % missing values
- av_rech_amt_data_6 - 74.85 % missing values
- av_rech_amt_data_7 - 74.43 % missing values
- date_of_last_rech_data_6 - 74.85 % missing values
- date_of_last_rech_data_7 - 74.43 % missing values
The datetime values are being read incorrectly as object
date_cols = tel_df.select_dtypes(include=['object'])
for i in date_cols.columns:
    tel_df[i] = pd.to_datetime(tel_df[i])
tel_df.shape
Rechecking the data types
tel_df.info(verbose=True)
The last data-recharge columns look meaningful, so let's check them and impute accordingly.
tel_df[['date_of_last_rech_data_6','total_rech_data_6','max_rech_data_6']].head()
A null in the last data-recharge columns simply means the customer has not recharged mobile internet in that period, which is itself meaningful. Hence we impute the corresponding recharge counts with 0.
# Where the last data-recharge date is missing, the customer made no data recharge
# that month, so the corresponding recharge counts are set to 0.
for m in ['6', '7', '8', '9']:
    no_data_rech = tel_df['date_of_last_rech_data_' + m].isnull()
    for col in ['total_rech_data_' + m, 'max_rech_data_' + m]:
        tel_df.loc[no_data_rech, col] = tel_df.loc[no_data_rech, col].fillna(0)
tel_df[['count_rech_2g_6','count_rech_3g_6','total_rech_data_6']].head()
Since total_rech_data is simply count_rech_2g + count_rech_3g, keeping all three would introduce multicollinearity. Hence we drop count_rech_2g and count_rech_3g for every month.
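A quick sanity check (a sketch on month 6, run before the drop below) to confirm that relationship:
# Verify that total_rech_data_6 equals count_rech_2g_6 + count_rech_3g_6 wherever all three are present
check = tel_df[['count_rech_2g_6', 'count_rech_3g_6', 'total_rech_data_6']].dropna()
print((check['count_rech_2g_6'] + check['count_rech_3g_6'] == check['total_rech_data_6']).all())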
tel_df.drop(['count_rech_2g_6','count_rech_3g_6','count_rech_2g_7','count_rech_3g_7','count_rech_2g_8','count_rech_3g_8','count_rech_2g_9','count_rech_3g_9'],axis=1, inplace=True)
tel_df.shape
Checking and handling the ARPU columns now
tel_df[['arpu_3g_6','arpu_2g_6','av_rech_amt_data_6']].head()
for i in range(6, 10):
    print("\n\nCorrelation Month: ", i, "\n")
    arpu_3g_i = 'arpu_3g_' + str(i)
    arpu_2g_i = 'arpu_2g_' + str(i)
    av_rech_amt_data_i = 'av_rech_amt_data_' + str(i)
    print(tel_df[[arpu_3g_i, arpu_2g_i, av_rech_amt_data_i]].corr())
Dropping the arpu_2g and arpu_3g columns, as they are very highly correlated with av_rech_amt_data, as seen above.
tel_df.drop(['arpu_3g_6','arpu_2g_6','arpu_3g_7','arpu_2g_7','arpu_3g_8','arpu_2g_8','arpu_3g_9','arpu_2g_9'],axis=1, inplace=True)
tel_df.shape
tel_df.drop(['fb_user_6','fb_user_7','fb_user_8','fb_user_9', 'night_pck_user_6','night_pck_user_7','night_pck_user_8','night_pck_user_9'], axis=1, inplace=True)
tel_df.shape
Handling av_rech_amt_data for every month
tel_df[['av_rech_amt_data_7','max_rech_data_7','total_rech_data_7']].head()
From the table above, we can deduce that the missing av_rech_amt_data values for each month (6 to 9) can be replaced with 0 whenever the corresponding total_rech_data is 0: if no data recharge was done, the average data-recharge amount must also be 0.
# Where no data recharge was made in a month, the average data-recharge amount is 0
for m in ['6', '7', '8', '9']:
    no_rech = tel_df['av_rech_amt_data_' + m].isnull() & (tel_df['total_rech_data_' + m] == 0)
    tel_df.loc[no_rech, 'av_rech_amt_data_' + m] = 0
((tel_df.isnull().sum()/tel_df.shape[0])*100).round(0).sort_values(ascending=False)
tel_df.info()
The date_of_last_rech_data columns are no longer needed for any month, since total_rech_data and max_rech_data carry the relevant information. Given their high null percentages, we drop them, along with the date_of_last_rech columns.
tel_df.drop(["date_of_last_rech_data_6","date_of_last_rech_data_7","date_of_last_rech_data_8","date_of_last_rech_data_9"], axis=1, inplace=True)
tel_df.drop(["date_of_last_rech_6","date_of_last_rech_7","date_of_last_rech_8","date_of_last_rech_9"], axis=1, inplace=True)
tel_df.shape
Since the columns used to determine high-value customers are now free of null values, we can filter the overall data and then handle the remaining missing values in the other columns.
tel_df['total_rech_amt_data_6'] = tel_df['av_rech_amt_data_6'] * tel_df['total_rech_data_6']
tel_df['total_rech_amt_data_7'] = tel_df['av_rech_amt_data_7'] * tel_df['total_rech_data_7']
tel_df['overall_rech_amt_6'] = tel_df['total_rech_amt_data_6'] + tel_df['total_rech_amt_6']
tel_df['overall_rech_amt_7'] = tel_df['total_rech_amt_data_7'] + tel_df['total_rech_amt_7']
tel_df['avg_rech_amt_6_7'] = (tel_df['overall_rech_amt_6'] + tel_df['overall_rech_amt_7'])/2
Finding the 70th percentile of the overall good-phase recharge amount, which defines the high-value customer cut-off for the company.
cut_off = tel_df['avg_rech_amt_6_7'].quantile(0.7)
print("The 70th percentile value to determine the high-value customer is: ", cut_off, '\n')
tel_df = tel_df[tel_df['avg_rech_amt_6_7']>=cut_off]
tel_df.shape
The data is now limited to roughly 30,000 customers who meet the high-value criterion. We will build our models on these.
((tel_df.isnull().sum()/tel_df.shape[0])*100).round(0).sort_values(ascending=False)
num_col = tel_df.select_dtypes(include=['int64','float64']).columns.tolist()
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
tel_df[num_col]=scaler.fit_transform(tel_df[num_col])
knn = KNNImputer(n_neighbors=3)
tel_df_knn = pd.DataFrame(knn.fit_transform(tel_df[num_col]))
tel_df_knn.columns = tel_df[num_col].columns
tel_df_knn.isnull().sum().sum()
The KNN imputer has successfully replaced all null values in the numerical columns using the k-nearest-neighbours algorithm.
tel_df[num_col] = scaler.inverse_transform(tel_df_knn)
tel_df.head()
((tel_df.isnull().sum()/tel_df.shape[0])*100).round(0).sort_values(ascending=False)
tel_df.isnull().sum().sum()
churn_col = ['total_ic_mou_9', 'total_og_mou_9', 'vol_2g_mb_9', 'vol_3g_mb_9']
tel_df[churn_col].info()
tel_df['churn'] =0
tel_df['churn'] = np.where(tel_df[churn_col].sum(axis=1)==0,1,0)
tel_df.head()
print((tel_df['churn'].value_counts() / len(tel_df)) * 100)
((tel_df['churn'].value_counts() / len(tel_df)) * 100).plot(kind="pie")
plt.show()
churn_phase_cols = [col for col in tel_df.columns if '_9' in col]
tel_df.drop(churn_phase_cols, axis=1, inplace=True)
tel_df.shape
We can also drop the good-phase variables, since we have already created the derived variables from them.
tel_df.drop(['total_rech_amt_data_6','av_rech_amt_data_6','total_rech_data_6','total_rech_amt_6','total_rech_amt_data_7','av_rech_amt_data_7','total_rech_data_7','total_rech_amt_7'], axis=1, inplace=True)
Let's check the collinearity of these independent variables to understand their inter-dependencies before dropping redundant ones.
mon_6_cols = [col for col in tel_df.columns if '_6' in col]
mon_7_cols = [col for col in tel_df.columns if '_7' in col]
mon_8_cols = [col for col in tel_df.columns if '_8' in col]
tel_df_corr = tel_df.corr()
plt.figure(figsize=(20, 20))
sns.heatmap(tel_df_corr)
plt.show()
tel_df_corr.loc[:, :] = np.tril(tel_df_corr, k=-1)
tel_df_corr = tel_df_corr.stack()
tel_df_corr[(tel_df_corr > 0.80) | (tel_df_corr < -0.80)].sort_values(ascending=False)
tel_df.drop(['total_rech_amt_8','isd_og_mou_8','isd_og_mou_7','sachet_2g_8','total_ic_mou_6', 'total_ic_mou_8','total_ic_mou_7','std_og_t2t_mou_6','std_og_t2t_mou_8','std_og_t2t_mou_7', 'std_og_t2m_mou_7','std_og_t2m_mou_8'],axis=1,inplace=True)
tel_df.shape
tel_df['tenure'] = (tel_df['aon']/30).round(0)
tel_df.drop('aon',axis=1, inplace=True)
sns.distplot(tel_df['tenure'], bins=30)
plt.show()
tn_range = [0, 6, 12, 24, 60, 61]
tn_label = ['0-6 Months', '6-12 Months', '1-2 Yrs', '2-5 Yrs', '5 Yrs and above']
tel_df['tenure_range'] = pd.cut(tel_df['tenure'], tn_range, labels=tn_label)
tel_df['tenure_range'].head()
plt.figure(figsize=[12, 7])
sns.barplot(x='tenure_range', y='churn', data=tel_df)
plt.show()
The average revenue per user in the good phase is given by arpu for months 6 and 7. Since we have two separate values, we can average them and drop the originals.
tel_df['avg_arpu_6_7'] = (tel_df['arpu_6'] + tel_df['arpu_7']) / 2
tel_df['avg_arpu_6_7'].head()
tel_df.drop(['arpu_6', 'arpu_7'], axis=1, inplace=True)
tel_df.shape
sns.distplot(tel_df['avg_arpu_6_7'])
plt.show()
plt.figure(figsize=(10, 50))
heatmap_churn = sns.heatmap(tel_df.corr()[['churn']].sort_values(ascending=False, by='churn'), annot=True, cmap='crest')
heatmap_churn.set_title('Features correlating to churn variable', fontsize=12)
Average outgoing-call and roaming-call minutes for months 6 and 7 are positively correlated with churn, while average revenue and the number of recharges in month 8 are negatively correlated with churn.
tel_df[['total_rech_num_8', 'arpu_8']].plot.scatter(x='total_rech_num_8', y='arpu_8')
plt.show()
sns.boxplot(x=tel_df.churn, y=tel_df.tenure)
plt.show()
ax = sns.kdeplot(tel_df.max_rech_amt_8[(tel_df["churn"] == 0)], color='Red', shade=True)
ax = sns.kdeplot(tel_df.max_rech_amt_8[(tel_df["churn"] == 1)], ax=ax, color='Blue', shade=True)
ax.legend(["No-Churn", 'Churn'], loc="lower right")
ax.set_ylabel('Density')
ax.set_xlabel('Max recharge amount (month 8)')
ax.set_title('Distribution of Max Rech Amt vs Churn')
plt.show()
tel_df.shape
tel_df['total_rech_data_group_8'] = pd.cut(tel_df['total_rech_data_8'], [-1, 0, 10, 25, 100], labels=["No_Recharge", "<=10_Recharges", "10-25_Recharges", ">25_Recharges"])
tel_df['total_rech_num_group_8'] = pd.cut(tel_df['total_rech_num_8'], [-1, 0, 10, 25, 1000], labels=["No_Recharge", "<=10_Recharges", "10-25_Recharges", ">25_Recharges"])
plt.figure(figsize=[12, 5])
sns.countplot(data=tel_df, x='total_rech_data_group_8', hue='churn')
print(tel_df['total_rech_data_group_8'].value_counts())
plt.show()
plt.figure(figsize=[12, 5])
sns.countplot(data=tel_df, x="total_rech_num_group_8", hue='churn')
print(tel_df['total_rech_num_group_8'].value_counts())
plt.show()
dummy_df = pd.get_dummies(tel_df[['total_rech_data_group_8', 'total_rech_num_group_8', 'tenure_range']], drop_first=True)
dummy_df.head()
df = tel_df[:].copy()
df.drop(['tenure_range','mobile_number','total_rech_data_group_8','total_rech_num_group_8','sep_vbc_3g','tenure'], axis=1, inplace=True)
df.head()
Creating the X dataset for model building
X= df.drop(['churn'],axis=1)
X.head()
y = df['churn']
y.head()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, train_size=0.7, random_state=1)
print("Dimension of X_train:", X_train.shape)
print("Dimension of X_test:", X_test.shape)
X_train.info(verbose=True)
num_col = X_train.select_dtypes(include = ['int64','float64']).columns.tolist()
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train[num_col] = scaler.fit_transform(X_train[num_col])
X_train.head()
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)
print((y_train_sm.value_counts() / len(y_train_sm)) * 100)
((y_train_sm.value_counts() / len(y_train_sm)) * 100).plot(kind="pie")
plt.show()
print("Dimension of X_train_sm Shape:", X_train_sm.shape)
print("Dimension of y_train_sm Shape:", y_train_sm.shape)
Importing necessary libraries for Model creation
import statsmodels.api as sm
logm1 = sm.GLM(y_train_sm, (sm.add_constant(X_train_sm)), family=sm.families.Binomial())
logm1.fit().summary()
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
from sklearn.feature_selection import RFE
rfe = RFE(logreg,n_features_to_select=20)
rfe = rfe.fit(X_train_sm, y_train_sm)
rfe.support_
rfe_columns = X_train_sm.columns[rfe.support_]
print("The selected columns by RFE for modelling are: \n\n", rfe_columns)
list(zip(X_train_sm.columns, rfe.support_, rfe.ranking_))
X_train_SM = sm.add_constant(X_train_sm[rfe_columns])
logm2 = sm.GLM(y_train_sm, X_train_SM, family=sm.families.Binomial())
res = logm2.fit()
res.summary()
# we can drop the column 'loc_ic_t2t_mou_8' as it has high p-value of 0.80
rfe_columns_1 = rfe_columns.drop('loc_ic_t2t_mou_8')
print("\nThe new set of selected features is:\n", rfe_columns_1)
X_train_SM = sm.add_constant(X_train_sm[rfe_columns_1])
logm2 = sm.GLM(y_train_sm, X_train_SM, family=sm.families.Binomial())
res = logm2.fit()
res.summary()
# we can drop the column 'loc_ic_t2m_mou_8' as it has high p-value of 0.80
rfe_columns_2 = rfe_columns_1.drop('loc_ic_t2m_mou_8')
print("\nThe new set of selected features is:\n", rfe_columns_2)
X_train_SM = sm.add_constant(X_train_sm[rfe_columns_2])
logm2 = sm.GLM(y_train_sm, X_train_SM, family=sm.families.Binomial())
res = logm2.fit()
res.summary()
y_train_sm_pred = res.predict(X_train_SM)
y_train_sm_pred = y_train_sm_pred.values.reshape(-1)
y_train_sm_pred[:10]
y_train_sm_pred_final = pd.DataFrame({'Converted': y_train_sm.values, 'Converted_prob': y_train_sm_pred})
y_train_sm_pred_final.head()
y_train_sm_pred_final['churn_pred'] = y_train_sm_pred_final.Converted_prob.map(lambda x: 1 if x > 0.5 else 0)
y_train_sm_pred_final.head()
from sklearn import metrics
confusion = metrics.confusion_matrix(y_train_sm_pred_final.Converted, y_train_sm_pred_final.churn_pred)
print(confusion)
print("The overall accuracy of the model is:",metrics.accuracy_score(y_train_sm_pred_final.Converted, y_train_sm_pred_final.churn_pred))
from statsmodels.stats.outliers_influence import variance_inflation_factor
Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_sm[rfe_columns_2].columns
vif['VIF'] = [variance_inflation_factor(X_train_sm[rfe_columns_2].values, i) for i in range(X_train_sm[rfe_columns_2].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by="VIF", ascending=False)
vif
Metrics beyond simple accuracy
TP = confusion[1, 1]  # true positives
TN = confusion[0, 0]  # true negatives
FP = confusion[0, 1]  # false positives
FN = confusion[1, 0]  # false negatives
print("Sensitivity = ",TP / float(TP+FN))
print("Specificity = ",TN / float(TN+FP))
print("False Positive Rate = ",FP/ float(TN+FP))
print ("Precision = ",TP / float(TP+FP))
print("Negative Predictive Value = ", TN / float(TN + FN))
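As a cross-check (a sketch using standard sklearn scorers on the same predictions), the sensitivity/recall and precision can also be obtained directly:
# Cross-check of the manually derived metrics using sklearn's built-in scorers
print("Recall (sensitivity):", recall_score(y_train_sm_pred_final.Converted, y_train_sm_pred_final.churn_pred))
print("Precision:", precision_score(y_train_sm_pred_final.Converted, y_train_sm_pred_final.churn_pred))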
def draw_roc(actual, probs):
    fpr, tpr, thresholds = metrics.roc_curve(actual, probs, drop_intermediate=False)
    auc_score = metrics.roc_auc_score(actual, probs)
    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return None
fpr, tpr, thresholds = metrics.roc_curve( y_train_sm_pred_final.Converted, y_train_sm_pred_final.Converted_prob, drop_intermediate = False )
draw_roc(y_train_sm_pred_final.Converted, y_train_sm_pred_final.Converted_prob)
numbers = [float(x) / 10 for x in range(10)]
for i in numbers:
    y_train_sm_pred_final[i] = y_train_sm_pred_final.Converted_prob.map(lambda x: 1 if x > i else 0)
y_train_sm_pred_final.head()
cutoff_df = pd.DataFrame(columns=['probability', 'accuracy', 'sensitivity', 'specificity'])
from sklearn.metrics import confusion_matrix
num = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_sm_pred_final.Converted, y_train_sm_pred_final[i])
    total1 = sum(sum(cm1))
    accuracy = (cm1[0, 0] + cm1[1, 1]) / total1
    specificity = cm1[0, 0] / (cm1[0, 0] + cm1[0, 1])
    sensitivity = cm1[1, 1] / (cm1[1, 0] + cm1[1, 1])
    cutoff_df.loc[i] = [i, accuracy, sensitivity, specificity]
print(cutoff_df)
cutoff_df.plot.line(x='probability', y=['accuracy', 'sensitivity', 'specificity'])
plt.show()
Initially we selected 0.5 as the classification cutoff. From the graph above, the optimum cutoff appears to lie slightly above 0.5 but below 0.6, so let's tweak it within this range.
numbers = [0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59]
for i in numbers:
    y_train_sm_pred_final[i] = y_train_sm_pred_final.Converted_prob.map(lambda x: 1 if x > i else 0)
y_train_sm_pred_final.head()
cutoff_df = pd.DataFrame(columns=['probability', 'accuracy', 'sensitivity', 'specificity'])
from sklearn.metrics import confusion_matrix
num = [0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_sm_pred_final.Converted, y_train_sm_pred_final[i])
    total1 = sum(sum(cm1))
    accuracy = (cm1[0, 0] + cm1[1, 1]) / total1
    specificity = cm1[0, 0] / (cm1[0, 0] + cm1[0, 1])
    sensitivity = cm1[1, 1] / (cm1[1, 0] + cm1[1, 1])
    cutoff_df.loc[i] = [i, accuracy, sensitivity, specificity]
print(cutoff_df)
cutoff_df.plot.line(x='probability', y=['accuracy', 'sensitivity', 'specificity'])
plt.show()
From the graph above, we can conclude that the optimal probability cutoff for the predicted churn variable is around 0.54.
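As a cross-check (a sketch based on the cutoff_df built above), the sensitivity-specificity crossover can also be read off programmatically:
# Pick the probability at which sensitivity and specificity are closest to each other
best_cutoff = cutoff_df.loc[(cutoff_df['sensitivity'] - cutoff_df['specificity']).abs().idxmin(), 'probability']
print("Sensitivity/specificity crossover cutoff:", best_cutoff)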
y_train_sm_pred_final['final_churn_pred'] = y_train_sm_pred_final.Converted_prob.map( lambda x: 1 if x > 0.54 else 0)
y_train_sm_pred_final.head()
print("The overall accuracy of the model now is:",metrics.accuracy_score(y_train_sm_pred_final.Converted, y_train_sm_pred_final.final_churn_pred))
confusion2 = metrics.confusion_matrix(y_train_sm_pred_final.Converted, y_train_sm_pred_final.final_churn_pred)
print(confusion2)
TP2 = confusion2[1, 1]  # true positives
TN2 = confusion2[0, 0]  # true negatives
FP2 = confusion2[0, 1]  # false positives
FN2 = confusion2[1, 0]  # false negatives
print("Sensitivity = ",TP2 / float(TP2+FN2))
print("Specificity = ",TN2 / float(TN2+FP2))
print("False Positive Rate = ",FP2/ float(TN2+FP2))
print ("Precision = ",TP2 / float(TP2+FP2))
print("Negative Predictive Value = ", TN2 / float(TN2 + FN2))
Precision and recall tradeoff
from sklearn.metrics import precision_recall_curve
p, r, thresholds = precision_recall_curve(y_train_sm_pred_final.Converted, y_train_sm_pred_final.Converted_prob)
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()
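One way to read the trade-off (a sketch using the arrays returned by precision_recall_curve above) is to find the threshold where precision and recall intersect:
# Threshold at which precision and recall are (approximately) equal
crossover_idx = np.argmin(np.abs(p[:-1] - r[:-1]))
print("Precision/recall crossover threshold:", thresholds[crossover_idx])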
X_test[num_col] = scaler.transform(X_test[num_col])
X_test.head()
X_test = X_test[rfe_columns_2]
X_test.head()
X_test_SM = sm.add_constant(X_test)
y_test_pred = res.predict(X_test_SM)
print("\n The first ten probability values of the prediction are:\n", y_test_pred[:10])
y_pred = pd.DataFrame(y_test_pred)
y_pred.head()
y_pred=y_pred.rename(columns = {0:"Conv_prob"})
y_test_df = pd.DataFrame(y_test)
y_test_df.head()
y_pred_final = pd.concat([y_test_df, y_pred], axis=1)
y_pred_final.head()
y_pred_final['test_churn_pred'] = y_pred_final.Conv_prob.map(lambda x: 1 if x > 0.54 else 0)
y_pred_final.head()
metrics.accuracy_score(y_pred_final.churn, y_pred_final.test_churn_pred)
confusion2_test = metrics.confusion_matrix(y_pred_final.churn, y_pred_final.test_churn_pred)
print("Confusion Matrix\n", confusion2_test)
TP3 = confusion2_test[1, 1]  # true positives
TN3 = confusion2_test[0, 0]  # true negatives
FP3 = confusion2_test[0, 1]  # false positives
FN3 = confusion2_test[1, 0]  # false negatives
print("Sensitivity = ",TP3 / float(TP3+FN3))
print("Specificity = ",TN3 / float(TN3+FP3))
print("False Positive Rate = ",FP3/ float(TN3+FP3))
print ("Precision = ",TP3 / float(TP3+FP3))
print("Negative Predictive Value = ", TN3 / float(TN3 + FN3))
print("The accuracy of the predicted model is: ", round(metrics.accuracy_score(y_pred_final.churn, y_pred_final.test_churn_pred), 2) * 100, "%")
print("The sensitivity of the predicted model is: ", round(TP3 / float(TP3 + FN3), 2) * 100, "%")
print("\nThe model is built with a focus on sensitivity, i.e. the true positive rate is given more importance, since correctly predicting customers who actually churn is the primary goal.\n")
fpr, tpr, thresholds = metrics.roc_curve(y_pred_final.churn,y_pred_final.Conv_prob, drop_intermediate = False )
draw_roc(y_pred_final.churn,y_pred_final.Conv_prob)
The AUC score is 0.90 on the train dataset and 0.87 on the test dataset, so this can be considered a good model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, train_size=0.7, random_state=1)
print("Dimension of X_train:", X_train.shape)
print("Dimension of X_test:", X_test.shape)
scaler = MinMaxScaler()
X_train[num_col] = scaler.fit_transform(X_train[num_col])
X_test[num_col] = scaler.transform(X_test[num_col])
sm = SMOTE(random_state=42)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)
print("Dimension of X_train_sm Shape:", X_train_sm.shape)
print("Dimension of y_train_sm Shape:", y_train_sm.shape)
X_train_sm.head()
from sklearn.decomposition import PCA
pca = PCA(random_state=42)
pca.fit(X_train_sm)
X_train_sm_pca = pca.transform(X_train_sm)
print("Dimension of X_train_sm_pca: ", X_train_sm_pca.shape)
X_test_pca = pca.transform(X_test)
print("Dimension of X_test_pca: ", X_test_pca.shape)
# Viewing the PCA components
pca.components_
Performing Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg_pca = LogisticRegression()
logreg_pca.fit(X_train_sm_pca, y_train_sm)
y_pred = logreg_pca.predict(X_test_pca)
y_pred_df = pd.DataFrame(y_pred)
print("Dimension of y_pred_df:", y_pred_df.shape)
from sklearn.metrics import confusion_matrix, accuracy_score
print("Confusion Matrix for y_test & y_pred\n", confusion_matrix(y_test, y_pred), "\n")
print("Accuracy of the logistic regression model with PCA: ",accuracy_score(y_test,y_pred))
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_)
plt.show()
var_cumu = np.cumsum(pca.explained_variance_ratio_)
fig = plt.figure(figsize=[12, 7])
plt.plot(var_cumu)
plt.xlabel('no of principal components')
plt.ylabel('explained variance - cumulative')
plt.show()
np.cumsum(np.round(pca.explained_variance_ratio_, decimals=3)*100)
90% of the variance can be explained with about 30 principal components
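This can also be verified programmatically (a quick sketch using the cumulative variance array computed above):
# Smallest number of components whose cumulative explained variance reaches 90%
n_comp_90 = np.argmax(var_cumu >= 0.90) + 1
print("Components needed for 90% explained variance:", n_comp_90)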
Fitting the dataset using these 30 principal components
pca_30 = PCA(n_components=30)
train_pca_30 = pca_30.fit_transform(X_train_sm)
print("Dimension for Train dataset using PCA: ", train_pca_30.shape)
test_pca_30 = pca_30.transform(X_test)
print("Dimension for Test dataset using PCA: ", test_pca_30.shape)
def classification_algo_metrics(y_actual, y_pred):
    print("Classification report:\n", classification_report(y_actual, y_pred))
    accuracy = round(accuracy_score(y_actual, y_pred), 4)
    precision = round(precision_score(y_actual, y_pred), 4)
    recall = round(recall_score(y_actual, y_pred), 4)
    f1 = round(f1_score(y_actual, y_pred), 4)
    conf_matrix = confusion_matrix(y_actual, y_pred)  # confusion matrix
    model_roc_auc = round(roc_auc_score(y_actual, y_pred), 4)  # roc_auc_score
    print("Accuracy Score : ", accuracy)
    print("Precision Score : ", precision)
    print("Recall Score : ", recall)
    print("F1 Score : ", f1)
    print("Area under curve : ", model_roc_auc, "\n")
    # Confusion matrix heatmap
    cm = metrics.confusion_matrix(y_actual, y_pred, labels=[0, 1])
    sns.heatmap(cm, annot=True, fmt='.0f', cmap="PuBu",
                xticklabels=["Not Churned", "Churned"],
                yticklabels=["Not Churned", "Churned"])
    plt.ylabel('True labels')
    plt.xlabel('Predicted labels')
    plt.show()
    return (accuracy, precision, recall, f1, model_roc_auc)
logreg_pca_30 = LogisticRegression()
logreg_pca_30.fit(train_pca_30, y_train_sm)
y_pred_30 = logreg_pca_30.predict(test_pca_30)
y_pred_df_30 = pd.DataFrame(y_pred_30)
print("Dimension of y_pred_df_30: ", y_pred_df_30.shape)
print("Confusion Matrix for y_test & y_pred\n", confusion_matrix(y_test, y_pred_30), "\n")
print("Accuracy of the logistic regression model with PCA: ",accuracy_score(y_test,y_pred_30))
accuracy, precision, recall, f1, model_roc_auc = classification_algo_metrics(y_test,y_pred_30)
from sklearn.ensemble import RandomForestClassifier
clf_rf1 = RandomForestClassifier(random_state=0)
clf_rf1.fit(X_train_sm, y_train_sm)
y_rf_pred = clf_rf1.predict(X_test)
print("Confusion Matrix for y_test & y_pred\n", confusion_matrix(y_test, y_rf_pred), "\n")
print("Accuracy of the Random Forest model: ", accuracy_score(y_test, y_rf_pred))
accuracy, precision, recall, f1, model_roc_auc = classification_algo_metrics(y_test,y_rf_pred)
feature_imp = pd.DataFrame(sorted(zip(clf_rf1.feature_importances_, X.columns)), columns=['Value', 'Feature'])
feature_imp_top10 = feature_imp.sort_values(by="Value", ascending=False).head(10)
plt.figure(figsize=(15, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp_top10.sort_values(by="Value", ascending=False))
plt.title('Random Forest Features')
plt.tight_layout()
plt.show()
Now let's do hyperparameter tuning for the Random Forest and find the best parameters to improve the precision and recall scores:
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
params = {
    'n_estimators': [500, 1000],
    'max_features': ['auto', 'sqrt'],
    'max_depth': [10, 20],
    'min_samples_leaf': [50, 100],
    'min_samples_split': [100, 150]
}
g_search_rf = GridSearchCV(estimator=RandomForestClassifier(random_state=100, n_jobs=-1), scoring='recall', cv=3, param_grid=params)
g_search_rf.fit(X_train_sm, y_train_sm)
print("Random Forest Best Score : ", g_search_rf.best_score_)
print("Random Forest Best Params : ", g_search_rf.best_params_)
from sklearn.ensemble import RandomForestClassifier
clf_rf_o = RandomForestClassifier(max_depth=20, max_features='auto', min_samples_leaf=50, min_samples_split=100, n_estimators=1000)  ### best parameters from the grid search
clf_rf_o.fit(X_train_sm, y_train_sm)
y_rf_pred_o = clf_rf_o.predict(X_test)
print("Confusion Matrix for y_test & y_pred\n", confusion_matrix(y_test, y_rf_pred_o), "\n")
print("Accuracy of the tuned Random Forest model: ", accuracy_score(y_test, y_rf_pred_o))
accuracy, precision, recall, f1, model_roc_auc = classification_algo_metrics(y_test, y_rf_pred_o)
feature_imp = pd.DataFrame(sorted(zip(clf_rf_o.feature_importances_, X.columns)), columns=['Value', 'Feature'])
feature_imp_top10 = feature_imp.sort_values(by="Value", ascending=False).head(10)
plt.figure(figsize=(15, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp_top10.sort_values(by="Value", ascending=False))
plt.title('Random Forest (tuned) Features')
plt.tight_layout()
plt.show()
from lightgbm import LGBMClassifier
model_lgbm = LGBMClassifier(boosting_type='gbdt', n_estimators=600, learning_rate=0.1, max_depth=10,
                            min_child_samples=20, num_leaves=100, objective='binary', random_state=100,
                            subsample=.1, colsample_bytree=1, n_jobs=-1, silent=True)
model_lgbm.fit(X_train_sm, y_train_sm)
y_pred_lgbm = model_lgbm.predict(X_test)
print("Evaluation on the test data set: \n")
print("Confusion Matrix for y_test & y_pred\n", confusion_matrix(y_test, y_pred_lgbm), "\n")
print("Accuracy of the LightGBM model: ", accuracy_score(y_test, y_pred_lgbm))
accuracy, precision, recall, f1, model_roc_auc = classification_algo_metrics(y_test, y_pred_lgbm)
feature_imp = pd.DataFrame(sorted(zip(model_lgbm.feature_importances_, X.columns)), columns=['Value', 'Feature'])
feature_imp_top10 = feature_imp.sort_values(by="Value", ascending=False).head(10)
plt.figure(figsize=(15, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp_top10.sort_values(by="Value", ascending=False))
plt.title('LightGBM Features')
plt.tight_layout()
plt.show()
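To summarise the three approaches side by side, a small sketch that reuses the predictions already computed above and collects the metric tuples returned by classification_algo_metrics into one comparison table (note that calling the function again re-prints the reports and re-plots the confusion matrices as a side effect):
# Tabulate (accuracy, precision, recall, f1, roc_auc) for each model on the test set
results = {
    'Logistic Regression (PCA)': classification_algo_metrics(y_test, y_pred_30),
    'Random Forest (tuned)': classification_algo_metrics(y_test, y_rf_pred_o),
    'LightGBM': classification_algo_metrics(y_test, y_pred_lgbm),
}
comparison = pd.DataFrame(results, index=['accuracy', 'precision', 'recall', 'f1', 'roc_auc']).T
print(comparison)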
- Incoming Calls: Total incoming minutes and local & STD incoming minutes are among the top 10 features. This shows that more incoming calls are strongly associated with retaining the customer. Incoming calls (local & STD) should be made free for all customers.
- Total Recharge Number: Total Recharge Number (total_rech_num) has the 4th highest coefficient. It shows that customers with a high frequency of recharges tend to stay. The company should launch small-value top-up options for calls, since prepaid customers may hesitate to top up with a higher amount; for retaining these customers, small-value top-ups would be ideal. The same logic applies to Total Recharge Number for data (total_rech_data), which has the 6th largest coefficient, so similar small top-up values for data might be useful. A cost-efficient combined calling and data pack might also encourage customers to stay with the telecom company.
- Outgoing Calls: Local, STD and total outgoing call minutes are among the top 20 features. The higher the outgoing call usage, the better the chances that the customer stays with the telecom company. To encourage this, the company should launch discounted offers on outgoing calls; a lower price per minute, or free outgoing calls to connections on the same network, will also help retain this customer group.
- Roaming: Incoming and outgoing calls while roaming are critical factors, as they appear among the top 20 coefficients. Incoming-call roaming charges should be made free within the country, and STD outgoing calls should cost the same and attract no extra charges when roaming within the country. For international roaming (incoming, outgoing and data), the company should launch country-wise roaming packages valid for 7 days, 15 days, 1 month etc. These benefits will encourage customers to stay with the telecom company for a longer period.
- Total Recharge Amount: The total recharge amounts on calls and data also have high coefficient values. The company should offer discounted prices for higher-value or longer-term (days/months etc.) recharge options on calls and data. A customer willing to commit to long-term plans is also more likely to stay longer.
- Introducing more services to existing customers, or informing them about available ones, and thereby generating more average revenue, would reduce their chances of churning.
If the goal is to engage and talk to customers to prevent them from churning, it is acceptable to also engage with customers who are mistakenly flagged as likely to churn (false positives), as this causes no harm; the extra attention could even make them happier.