ColtAllen / btyd

Buy Till You Die and Customer Lifetime Value statistical models in Python.

Home Page: https://btyd.readthedocs.io/

CLV value too high in btyd

SSMK-wq opened this issue

I am trying to predict the CLTV of a customer over the next 365 days (a 12-month window). I encountered the same issue with the lifetimes package as well (and I see there is an open bug there). The issues can be found here and here.

I use 4 years of data for calibration and 1 year of data for holdout.

I wrote the piece of code below:

from btyd import GammaGammaFitter  # or: from lifetimes import GammaGammaFitter

ggf = GammaGammaFitter(penalizer_coef=0.01)  # model object
ggf.fit(monetary_cal_df['frequency_cal'], monetary_cal_df['avg_monetary_value_cal'])  # model fitting
# Prediction of the expected average profit per transaction
monetary_cal_df["expct_avg_spend"] = ggf.conditional_expected_average_profit(monetary_cal_df['frequency_cal'], monetary_cal_df['avg_monetary_value_cal'])

# CLV over the next 12 months, using the BG/NBD model (bgf) fitted earlier
monetary_cal_df["cltv_12m"] = ggf.customer_lifetime_value(bgf,
                                   monetary_cal_df['frequency_cal'],
                                   monetary_cal_df['recency_cal'],
                                   monetary_cal_df['T_cal'],
                                   monetary_cal_df['avg_monetary_value_cal'],
                                   time=12,  # 12 months
                                   freq="D",  # frequency/recency/T are in days
                                   discount_rate=0.01)
monetary_cal_df.sort_values("cltv_12m", ascending=False).head()

But the problem is that the CLTV values are too high. For instance, see the screenshot below (don't worry about the column header names in the screenshot; both the expected purchases and the CLV are for 12 months only).

[screenshot: table of customers with expected purchases, expected average spend, and 12-month CLV]

When I multiply "expected_avg_spend" by "expected_purchase", the product is more than 100K lower than the CLTV value.

Can you guide me on whether it is normal to see such a huge difference (and have you encountered cases like this in your own work with this project)? How is CLTV calculated? Am I correct to think that it should be reasonably close to the product of expected_avg_spend and expected_purchase?
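For reference, the kind of sanity check I have in mind looks roughly like this (just a sketch; the derived column names and the 365-day horizon for bgf are only illustrative):

monetary_cal_df["expected_purchase_12m"] = bgf.conditional_expected_number_of_purchases_up_to_time(
    365, monetary_cal_df["frequency_cal"], monetary_cal_df["recency_cal"], monetary_cal_df["T_cal"])
# naive, undiscounted CLV: expected purchases times expected average spend
monetary_cal_df["naive_clv_12m"] = monetary_cal_df["expected_purchase_12m"] * monetary_cal_df["expct_avg_spend"]
# compare against the package's CLV output
print((monetary_cal_df["cltv_12m"] - monetary_cal_df["naive_clv_12m"]).describe())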

Apart from R2 and RMSE, I also plotted the graph below to see whether the results look okay visually. I think they do, but are there any red flags I should be aware of when doing assessments like this?

[plot: calibration vs. holdout visual assessment]

Accordingly, a few related questions on the theory of CLTV:

a) I see that we compute CLTV for a specific time horizon such as 6 months, 12 months, etc. So when we say Customer Lifetime Value over the next 6, 12, or t months, am I right to interpret this as the amount/revenue we expect from the customer during those next 6, 12, or t months? The keyword "lifetime" doesn't mean the total revenue from the customer over their entire lifetime with us (a customer could stay with us for 10 years or more, which we can't know in advance); the CLV we predict is specifically for the next 6/12/t months. Is my understanding right?

b) Can the time parameter in ggf.customer_lifetime_value() accept values only in months, as shown in the code snippet comments? I ask because the freq parameter applies to a different column (T). So CLTV can only be computed over month-based intervals. Am I right to understand that?

c) For the GGM, the correlation between avg_monetary_cal and frequency_cal, which is supposed to be low, is 0.28 in my data. Is 0.28 weak enough to proceed? In the tutorials I see online it is always very low, such as 0.03 or 0.07. So, is 0.28 low enough to proceed further?

d) When I compute R2 between monetary_val_cal and expected_avg_spend, 92% of the variance is explained. But when I compute R2 between monetary_val_holdout and expected_avg_spend (which is what we are expected to do, I guess), I get only negative values. I suspect overfitting. Do you have any pointers on how I can improve the performance? The R2 is around 72% for the expected future transaction count (when I compare it with holdout_frequency); any suggestions on how that can be improved? I already set the penalizer to 0 (as it resulted in a better R2 score) and subset my dataframe for the GGM to include only customers with monetary_value_cal > 0; I added that filter because the package raised an error when there are values equal to 0. Basically, my BG/NBD model works well, but it is the GGM that is causing issues.

e) I use 4 years of data for calibration and 1 year of data for holdout. Not all customers are present in both sets: some appear only in calibration and some only in holdout (because they became customers during the holdout period). At the moment we restrict the analysis to customers who are present in both sets. Is that the right thing to do? Or should we let the model predict for customers who are missing from the holdout set (perhaps they left during the holdout period) but present in calibration? Or should this package be used only for customers who are present in both calibration and holdout sets?

f) Do you think it would be wise to reduce the dataset size? Maybe this problem doesn't require 5 years of data (the older years may just add noise). Would modeling on only the past 2 years be a good route to take for this problem?

Currently, the metrics for monetary_holdout vs. expct_avg_spend look like the screenshot below. I drop NAs because the Gamma-Gamma model returned NaN as the CLV value for rows where frequency_holdout is zero, so I dropped those records.

[screenshot: performance metrics for monetary_holdout vs. expct_avg_spend]
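For reference, this is roughly how I compute these metrics (a sketch; the holdout column name is an assumption about what the calibration/holdout summary produced):

from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# keep only rows where both the prediction and the holdout value exist
eval_df = monetary_cal_df.dropna(subset=["expct_avg_spend", "monetary_value_holdout"])
r2 = r2_score(eval_df["monetary_value_holdout"], eval_df["expct_avg_spend"])
rmse = np.sqrt(mean_squared_error(eval_df["monetary_value_holdout"], eval_df["expct_avg_spend"]))
print(f"R2: {r2:.3f}  RMSE: {rmse:.2f}")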

Hey @SSMK-wq,

I don't know the details of your use case, but those are enormous monetary values. Are you sure this is the average spend for individual daily transactions? Can you also run a ggf.summary so we can see what the parameters look like?

a) I see that we compute CLTV for a specific time horizon such as 6 months, 12 months, etc. So when we say Customer Lifetime Value over the next 6, 12, or t months, am I right to interpret this as the amount/revenue we expect from the customer during those next 6, 12, or t months? The keyword "lifetime" doesn't mean the total revenue from the customer over their entire lifetime with us (a customer could stay with us for 10 years or more, which we can't know in advance); the CLV we predict is specifically for the next 6/12/t months. Is my understanding right?

Yes

b) Can the time parameter in ggf.customer_lifetime_value() accept values only in months, as shown in the code snippet comments? I ask because the freq parameter applies to a different column (T). So CLTV can only be computed over month-based intervals. Am I right to understand that?

Although the time parameter is expressed in months, freq is a multiplier that converts it to the specified time unit. In the case of 'D', it will multiply the number of months by 30 to convert it to days. It is important that recency, T, and freq all use the same time unit.
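In other words (a rough illustration of that behavior, not the library source):

# time is always given in months; freq describes the unit of recency and T.
# With freq="D", each month in the horizon is treated as roughly 30 days:
time_in_months = 12
days_per_month = 30
horizon_in_days = time_in_months * days_per_month  # 360 days of predicted transactions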

c) For the GGM, the correlation between avg_monetary_cal and frequency_cal, which is supposed to be low, is 0.28 in my data. Is 0.28 weak enough to proceed? In the tutorials I see online it is always very low, such as 0.03 or 0.07. So, is 0.28 low enough to proceed further?

I've found in practice that anything below 0.30 is usually fine, but you're right at that threshold.
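For reference, a quick way to check that correlation (column names follow the code earlier in the thread):

# Pearson correlation between purchase frequency and average order value;
# the Gamma-Gamma model assumes these are close to independent.
corr = monetary_cal_df["frequency_cal"].corr(monetary_cal_df["avg_monetary_value_cal"])
print(f"frequency vs. monetary value correlation: {corr:.2f}")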

d) When I compute R2 between monetary_val_cal and expected_avg_spend, 92% of the variance is explained. But when I compute R2 between monetary_val_holdout and expected_avg_spend (which is what we are expected to do, I guess), I get only negative values. I suspect overfitting. Do you have any pointers on how I can improve the performance? The R2 is around 72% for the expected future transaction count (when I compare it with holdout_frequency); any suggestions on how that can be improved? I already set the penalizer to 0 (as it resulted in a better R2 score) and subset my dataframe for the GGM to include only customers with monetary_value_cal > 0; I added that filter because the package raised an error when there are values equal to 0. Basically, my BG/NBD model works well, but it is the GGM that is causing issues.

Sounds like your calibration and holdout sets are not identically distributed. I'm still brainstorming how to add an easy way to check for this, but if you want you can look into the Mann-Whitney U rank test:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html#scipy.stats.mannwhitneyu
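A minimal sketch of such a check on the average spend distributions (the holdout column name here is an assumption based on the rest of the thread):

from scipy import stats

cal_spend = monetary_cal_df["avg_monetary_value_cal"].dropna()
holdout_spend = monetary_cal_df["monetary_value_holdout"].dropna()

# Null hypothesis: the two samples come from the same distribution.
stat, p_value = stats.mannwhitneyu(cal_spend, holdout_spend, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")  # a small p-value suggests the distributions differ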

e) I use 4 years of data for calibration and 1 year of data for holdout. Not all customers are present in both sets: some appear only in calibration and some only in holdout (because they became customers during the holdout period). At the moment we restrict the analysis to customers who are present in both sets. Is that the right thing to do? Or should we let the model predict for customers who are missing from the holdout set (perhaps they left during the holdout period) but present in calibration? Or should this package be used only for customers who are present in both calibration and holdout sets?

These are population-based models, so it's not an issue if some customers are in calibration and not holdout, and vice-versa.

f) Do you think it would be wise to reduce the dataset size? Maybe this problem doesn't require 5 years of data (the older years may just add noise). Would modeling on only the past 2 years be a good route to take for this problem?

I have found in practice that going too far back in time will skew transaction rates, so if your business audience is fine with just 2 years, that may help improve model performance.
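If you go that route, one way is to filter the raw transaction log before rebuilding the RFM summaries (a sketch; transactions_df and its column names are assumptions about your raw data):

import pandas as pd

# keep only the last two years of transactions before re-running the
# calibration/holdout summary step
cutoff = transactions_df["order_date"].max() - pd.DateOffset(years=2)
recent_transactions = transactions_df.loc[transactions_df["order_date"] >= cutoff]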

It's worth doing a bgf.summary on your BG/NBD model as well. Any parameter with a high value is usually a bad sign.
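For reference, summary on a fitted model returns the estimated parameters along with their standard errors and confidence bounds:

# after fitting, each model exposes a summary of its estimated parameters
print(bgf.summary)   # BG/NBD: r, alpha, a, b
print(ggf.summary)   # Gamma-Gamma: p, q, v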

The issue can be found at CamDavidsonPilon#313

Thanks for posting this. That discussion has me wondering if CLV is being calculated properly in the legacy function. I'll do some research and mark this as a bug until I get a definitive answer.

@ColtAllen - Thanks for your patience and help. My bgf parameters look like the screenshot below. And yes, I guess I am using the old GammaGammaFitter and an older version of btyd, but I installed it only a week or so ago.

[screenshot: bgf parameter summary]

Thanks; alpha is the scaling parameter for the transaction rate. It's a little high, but that's normal when modeling across long time periods, which is the case here.

Can you share a summary on the GammaGammaFitter as well?

I also did some digging, and it seems the CLV calculation is derived from pages 6-8 of this paper:

http://brucehardie.com/papers/rfm_clv_2005-02-16.pdf

I think I see where the problem may lie, but I need to do some more testing.
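To give a sense of what that calculation looks like, here is a rough paraphrase of the discounted-expected-transactions idea from the paper (a sketch for intuition only, not the actual library source): step through the horizon one month at a time, take the incremental expected transactions in that month, multiply by the expected average spend, and discount each step back to the present.

import pandas as pd

def clv_sketch(bgf, frequency, recency, T, expected_avg_spend,
               months=12, discount_rate=0.01, days_per_month=30):
    # discounted sum of (incremental expected transactions x expected average spend)
    clv = pd.Series(0.0, index=frequency.index)
    predict = bgf.conditional_expected_number_of_purchases_up_to_time
    for i in range(1, months + 1):
        incremental_txns = (predict(i * days_per_month, frequency, recency, T)
                            - predict((i - 1) * days_per_month, frequency, recency, T))
        clv += (incremental_txns * expected_avg_spend) / (1 + discount_rate) ** i
    return clv

# e.g.: clv_sketch(bgf, monetary_cal_df['frequency_cal'], monetary_cal_df['recency_cal'],
#                  monetary_cal_df['T_cal'], monetary_cal_df['expct_avg_spend'], months=12)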

@ColtAllen - Please find my bgf and ggf summaries below for a dataset of 5 years, from Jan 2017 to May 2022.

[screenshot: bgf summary, 5-year dataset]

[screenshot: ggf summary, 5-year dataset]

You can also see a table below which shows how my expected_avg_spend is much closer to avg_monetary_cal (calibration) than to avg_monetary_holdout (holdout).

[screenshot: table comparing expected_avg_spend with avg_monetary_cal and avg_monetary_holdout]

My code looks like this:

ggf = GammaGammaFitter() # model object
ggf.fit(monetary_cal_df['frequency_cal'],monetary_cal_df['avg_monetary_value_cal']) # model fitting
# Prediction of expected amount of average profit
monetary_cal_df["expct_avg_spend"] = ggf.conditional_expected_average_profit(monetary_cal_df['frequency_cal'], monetary_cal_df['avg_monetary_value_cal'])

monetary_cal_df["cltv_5m"] = ggf.customer_lifetime_value(bgf,
                                   monetary_cal_df['frequency_cal'],
                                   monetary_cal_df['recency_cal'],
                                   monetary_cal_df['T_cal'],
                                   monetary_cal_df['avg_monetary_value_cal'],
                                   time=5,  # 5 months
                                   freq="D",  # frequency of T
                                   discount_rate=0.01)

monetary_cal_df.sort_values("cltv_5m",ascending=False).head()

My performance metrics look like the below (avg_monetary_holdout vs. expected_avg_spend):

[screenshot: performance metrics, avg_monetary_holdout vs. expected_avg_spend]

However, if I reduce my dataset to include only the past two years of data, performance improves by a few points, as shown below:

[screenshot: performance metrics with 2-year dataset]

The summaries of the bgf and ggf models are shown below:

[screenshot: bgf summary, 2-year dataset]

[screenshot: ggf summary, 2-year dataset]

Looks like additional exploratory data analysis is needed to determine why the calibration and holdout datasets aren't identically distributed. You can try plotting histograms of monetary values for comparison, as well as a time series of the combined dataset to see if there are any changepoints.
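For example, something along these lines (a sketch; the holdout column name is an assumption):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
monetary_cal_df["avg_monetary_value_cal"].plot.hist(bins=50, ax=axes[0], title="Calibration avg spend")
monetary_cal_df["monetary_value_holdout"].plot.hist(bins=50, ax=axes[1], title="Holdout avg spend")
plt.show()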

I've summed up my CLV findings in this issue: #76

If there's nothing else, I'm gonna close this issue.

Apologies for the delay. I have been traveling and couldn't attend to this earlier.