by: Paige Rackley
[Project Description] [Project Planning] [Data Dictionary] [Data Acquire and Prep] [Data Exploration] [Modeling] [Conclusion]
- Create functions (acquire.py and prepare.py) that bring in the Telco database and clean it up in order to explore the data for churn.
- Run statistical tests to help understand the drivers of churn.
- Construct models using train, validate, and test splits, and make predictions for churn using classification methods.
- Find drivers for customer churn at Telco. Why are customers churning?
- Construct a ML classification model that accurately predicts customer churn.
- Deliver a report that a non-data scientist can read through and understand.
- My target audience is fellow Codeup students and staff.
- A final report notebook
- A final report notebook presentation
- All necessary modules to make my project reproducible
- On your best model, a chart visualizing how it performed on test would be valuable.
Initial Hypothesis: Churn is most directly associated with 4 factors: Senior citizens, electronic checks, fiber optic internet, and tech support
- H0: Rate of churn is not dependent on being a senior citizen.
- H1: Rate of churn is dependent on being a senior citizen.
- H0: Churn is not dependent on having fiber optic internet.
- H1: Churn is dependent on having fiber optic internet.
- H0: Churn is not dependent on electronic check payment type.
- H1: Churn is dependent on electronic check payment type.
- H0: Churn is not dependent on if a customer receives tech support.
- H1: Churn is dependent on if a customer receives tech support.
Target | Datatype | Definition |
---|---|---|
churn | 7043 non-null: object | customer churn Yes or No |
Feature | Datatype | Definition |
---|---|---|
internet_service_type_id | 7043 non-null: int64 | id referring to type of internet service used |
payment_type_id | 7043 non-null: int64 | id referring to type of payment used |
contract_type_id | 7043 non-null: int64 | id referring to type of contract used |
customer_id | 7043 non-null: object | individual customer id string |
gender | 7043 non-null: object | customer male or female |
senior_citizen | 7043 non-null: int64 | is customer senior |
partner | 7043 non-null: object | does customer have a partner |
dependents | 7043 non-null: object | does customer have dependents |
tenure | 7043 non-null: int64 | length customer with company in months |
phone_service | 7043 non-null: object | uses phone service Yes or No |
multiple_lines | 7043 non-null: object | Yes, No, or No phone service |
online_security | 7043 non-null: object | Yes, No, No internet service |
online_backup | 7043 non-null: object | Yes, No, No internet service |
device_protection | 7043 non-null: object | Yes, No, No internet service |
tech_support | 7043 non-null: object | Yes, No, No internet service |
streaming_tv | 7043 non-null: object | Yes, No, No internet service |
streaming_movies | 7043 non-null: object | Yes, No, No internet service |
paperless_billing | 7043 non-null: object | uses paperless billing Yes or No |
monthly_charges | 7043 non-null: float64 | monthly bill amount in USD |
total_charges | 7043 non-null: object | lifetime total charged to customer in USD |
contract_type | 7043 non-null: object | One Year, Two Year, Month-to-month |
payment_type | 7043 non-null: object | Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic) |
internet_service_type | 7043 non-null: object | Fiber optic, DSL, None |
Data is acquired from the company SQL database using MySQL Workbench. The acquisition functions are stored in the acquire.py file, which allows quick access to the data. Once acquire.py is imported, its functions can be reused whenever the data is needed.
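One way an acquire function could provide this quick access is to cache the query result in a local CSV. The sketch below is an assumption about how that might look, not the contents of the project's acquire.py: `get_telco_data`, the `fetch` callable, and the `telco.csv` cache path are all hypothetical names introduced for illustration.

```python
import os

import pandas as pd


def get_telco_data(fetch, cache_path="telco.csv"):
    """Return the Telco data, preferring a local CSV cache when one exists.

    `fetch` is a hypothetical zero-argument callable that runs the SQL
    query against the company database and returns a DataFrame.
    """
    if os.path.isfile(cache_path):
        # Cache hit: skip the database round trip entirely
        return pd.read_csv(cache_path)

    # Cache miss: query the database once, then save the result for next time
    df = fetch()
    df.to_csv(cache_path, index=False)
    return df
```

With this shape, the database is only queried the first time the function runs; later calls read the CSV.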
Within the prepare.py file:
- Removed any duplicate observations.
- Converted the total charges column to a float value.
- Changed all binary columns to numeric.
  - For example, columns that were 'Yes'/'No' became 1/0.
- Stored non-binary data in a 'dummies dataframe'.
- Added the dummies dataframe to the original.
- Assigned more readable names to columns that needed it.
- Dropped duplicate columns.
  - All '_id' categories (all of these are covered in different columns that can be encoded).
- Split the data into the 3 needed dataframes: train, validate, and test. We stratify on 'churn' since this is our main target.
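The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual prepare.py: the function name `prep_telco`, the choice of which columns count as Yes/No columns, and filling blank total charges with 0 are assumptions. Column renaming is left out because the new names are not listed here, and `customer_id` is kept because no other column covers it.

```python
import pandas as pd


def prep_telco(df: pd.DataFrame) -> pd.DataFrame:
    """Clean a raw Telco frame (hypothetical sketch of the listed steps)."""
    df = df.drop_duplicates().copy()

    # total_charges arrives as text; blanks become NaN, then 0 (an assumption)
    df["total_charges"] = pd.to_numeric(
        df["total_charges"].str.strip(), errors="coerce"
    ).fillna(0.0)

    # Encode Yes/No columns as 1/0 (assumed list of binary columns)
    binary_cols = ["partner", "dependents", "phone_service",
                   "paperless_billing", "churn"]
    for col in binary_cols:
        df[col] = df[col].map({"Yes": 1, "No": 0})

    # One-hot encode multi-valued columns and attach them to the original
    cat_cols = ["contract_type", "payment_type", "internet_service_type"]
    dummies = pd.get_dummies(df[cat_cols], dtype=int)
    df = pd.concat([df, dummies], axis=1)

    # Drop the '_id' columns already covered by the descriptive columns
    id_cols = ["internet_service_type_id", "payment_type_id", "contract_type_id"]
    return df.drop(columns=id_cols)
```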
- Finding which features have the highest correlation to churn
- Testing hypothesis with Chi-Squared Tests
- Visualizing churn with plots
  - Using matplotlib bar charts, since these items have been encoded as categorical values
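As a rough sketch of the first exploration step, the numeric columns can be ranked by the strength of their correlation with churn. The helper name `churn_correlations` and the toy values below are made up for illustration and do not come from the Telco data.

```python
import pandas as pd


def churn_correlations(df: pd.DataFrame, target: str = "churn") -> pd.Series:
    """Rank numeric features by absolute Pearson correlation with the target."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    # Order by strength of association, keeping the sign of each correlation
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy, already-encoded frame (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "senior_citizen": [0, 0, 1, 1, 0, 1],
    "tech_support":   [1, 1, 0, 0, 1, 1],
    "churn":          [0, 0, 1, 1, 0, 0],
})
ranked = churn_correlations(toy)
print(ranked)
```

Sorting on the absolute value keeps strongly negative relationships near the top of the list instead of burying them at the bottom.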
The null hypothesis was rejected for every feature tested, so these features will be the focal points in the models. All other columns will be excluded to keep the models focused on the tested drivers.
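To make the hypothesis tests concrete, here is a minimal hand computation of a chi-squared test of independence on a 2x2 table, using only the standard library. The counts and the 0.05 significance level are assumptions chosen for illustration and are not results from this project. In practice a library routine such as `scipy.stats.chi2_contingency` would normally be used; it applies Yates' continuity correction to 2x2 tables by default, so its statistic would differ slightly from this uncorrected version.

```python
import math


def chi2_independence_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts, NOT taken from the Telco data:
# rows = senior citizen (no, yes), columns = churned (no, yes)
observed = [[90, 30], [20, 20]]
alpha = 0.05  # assumed significance level
stat, p = chi2_independence_2x2(observed)
print(f"chi2 = {stat:.3f}, p = {p:.4f}, reject H0: {p < alpha}")
```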
After splitting and exploring the data, we move on to modeling.
With the train data set, try four different classification models, determining which data features and model parameters create better predictions
- Decision Tree
- Random Forest
- KNN
- Logistic Regression

Evaluate the top 3 models on the validate data set.

Evaluate the best model on the test data set.
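The workflow above could look roughly like this with scikit-learn. Everything specific here is an assumption for illustration: the data is randomly generated rather than the real Telco data, the feature names are placeholders, the 20% and 30% split sizes and the hyperparameters are arbitrary, and accuracy is used as the ranking metric.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared data (random values, illustration only)
rng = np.random.default_rng(123)
n = 500
df = pd.DataFrame({
    "senior_citizen": rng.integers(0, 2, n),
    "fiber_optic": rng.integers(0, 2, n),
    "electronic_check": rng.integers(0, 2, n),
    "tech_support": rng.integers(0, 2, n),
})
signal = df.senior_citizen + df.fiber_optic + df.electronic_check - df.tech_support
df["churn"] = (signal + rng.normal(0, 1, n) > 1).astype(int)

# Two stratified splits give train / validate / test (sizes are assumptions)
train_validate, test = train_test_split(
    df, test_size=0.2, random_state=123, stratify=df["churn"])
train, validate = train_test_split(
    train_validate, test_size=0.3, random_state=123,
    stratify=train_validate["churn"])

features = [c for c in df.columns if c != "churn"]
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=123),
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=3, random_state=123),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic_regression": LogisticRegression(),
}

# Fit every model on train and record its train accuracy
train_acc = {}
for name, model in models.items():
    model.fit(train[features], train["churn"])
    train_acc[name] = model.score(train[features], train["churn"])

# Keep the top 3 by train accuracy and compare them on validate
top3 = sorted(train_acc, key=train_acc.get, reverse=True)[:3]
validate_acc = {
    name: models[name].score(validate[features], validate["churn"])
    for name in top3
}

# Only the single best validate performer ever sees the test set
best = max(validate_acc, key=validate_acc.get)
test_acc = models[best].score(test[features], test["churn"])
print(f"best model: {best}, test accuracy: {test_acc:.3f}")
```

Holding the test set back until one model has been chosen keeps the final accuracy an honest estimate of performance on unseen customers.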
The factors that were explored and tested were all found to be associated with churn rather than independent of it.
- Market to non-senior citizens.
- Create marketing to keep senior citizens, such as discounts or promotional deals for staying.
- There could be potential issues with the fiber optic service, so performing an investigation would be insightful.
- Create incentives to switch to different payment types to potentially reduce churn
- Create promotions for switching payment types.
- Increase tech support coverage and make tech support resources more available.
- Prioritize making it easier to reach tech support on the website.
Next steps: With more time, I would like to investigate the fiber optic churn further. Fiber optic is usually the faster internet option, so the reason for churn could be connectivity issues rather than speed.
How to Reproduce
- Read this README.md
- Download acquire.py and prepare.py into your working directory
- Have fun doing your own exploring, modeling, and more!