open-spaced-repetition / fsrs-optimizer

FSRS Optimizer Package

Home Page: https://pypi.org/project/FSRS-Optimizer/


[BUG] Initial stability for "Good" will be larger than for "Easy" if "Good" has more datapoints

L-M-Sherlock opened this issue · comments

@L-M-Sherlock I think that in the current version of the optimizer it's possible for the S0 value for "Good" to end up larger than for "Easy" if "Good" has more datapoints.
params, _ = curve_fit(power_forgetting_curve, delta_t, recall, sigma=1/np.sqrt(count), bounds=((0.1), (30 if total_count < 1000 else 365)))
You should probably add some kind of extra cap to ensure that S0 for "Good" cannot be greater than S0 for "Easy" even if total_count is greater than 1000 for "Good" and less than 1000 for "Easy".

Originally posted by @Expertium in open-spaced-repetition/fsrs4anki#348 (comment)
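
For context, here is a minimal sketch of how that count-dependent bound can produce the inversion. The fit_s0 helper is hypothetical, and the forgetting curve is written in the FSRS v4 power-curve style; treat both as illustrative rather than the optimizer's exact code:

    import numpy as np
    from scipy.optimize import curve_fit

    def power_forgetting_curve(t, s):
        # illustrative power forgetting curve (FSRS v4 style); the optimizer's
        # actual definition may differ
        return (1 + t / (9 * s)) ** -1

    def fit_s0(delta_t, recall, count):
        # hypothetical helper mirroring the quoted call: the upper bound on S0
        # switches from 30 to 365 once a first-rating group has >= 1000 reviews
        total_count = count.sum()
        params, _ = curve_fit(
            power_forgetting_curve, delta_t, recall,
            sigma=1 / np.sqrt(count),
            bounds=(0.1, 30 if total_count < 1000 else 365),
        )
        return params[0]

So if "Good" has at least 1000 datapoints while "Easy" has fewer, S0 for Good is fitted with an upper bound of 365 while S0 for Easy is capped at 30, and the fitted S0_good can end up above S0_easy.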

There are 2 simple ways to solve this:

if S0_good > S0_easy:
    S0_good = S0_easy

or

if S0_good > S0_easy:
    S0_easy = S0_good 

In the first method, we artificially decrease S0 for Good; in the second, we artificially increase S0 for Easy. I don't know which one makes more sense, but probably the latter: if S0 for Good is based on a larger number of reviews, then it is estimated more accurately than S0 for Easy, so we should leave it alone and change the less accurate S0 for Easy instead.

In my opinion, the second approach makes more sense.

Maybe decide this based on the number of datapoints in each case?

if S0_good > S0_easy:
    if n_datapoints_good > n_datapoints_easy:
        S0_easy = S0_good
    else:
        S0_good = S0_easy

However, if you look at the table in #5 (comment), there are cases where S0_again > S0_hard or S0_hard > S0_good, so this issue is not limited to the Good/Easy pair.

I suppose the idea above should be applied to all pairs: Again-Hard, Hard-Good and Good-Easy.
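
A minimal sketch of that pairwise clamping, assuming the fitted S0 values and the per-rating datapoint counts are available as dicts keyed by rating (the helper name is hypothetical, not the optimizer's):

    def enforce_s0_monotonicity(s0: dict, counts: dict) -> dict:
        """Clamp initial stabilities so S0 never decreases from Again to Easy.

        For each adjacent pair, the value backed by more datapoints is kept
        and the other one is moved to it, as discussed above.
        """
        order = ["again", "hard", "good", "easy"]
        s0 = dict(s0)
        for lower, higher in zip(order, order[1:]):
            if s0[lower] > s0[higher]:
                if counts[lower] > counts[higher]:
                    # the lower rating's estimate is better supported,
                    # so pull the higher rating up to it
                    s0[higher] = s0[lower]
                else:
                    # otherwise pull the lower rating down
                    s0[lower] = s0[higher]
        return s0

Note that a single forward pass like this can reintroduce an earlier violation when a value gets pulled down, so a strict guarantee may require repeating the pass until nothing changes.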

@L-M-Sherlock here are some good ideas:

  1. The one above by nb9618, but apply it to all pairs: Again-Hard, Hard-Good and Good-Easy. There will likely be issues with that, though. I don't expect it to work on the first try without creating new problems.
  2. When using additive smoothing, instead of using retention of the entire collection/deck, only use retention based on second reviews to calculate p0 (the initial guess).
  3. When using the outlier filter based on IQR, use ln(delta_t) rather than delta_t itself. Filtering based on IQR doesn't work well on data that isn't normally distributed, and delta_t certainly isn't.

Of course, all of these changes should be evaluated with statistical significance tests; I hope that by now you have set up an automated system to run the tests on all 66 collections.

Oh, also: in the scheduler code change // recommended setting: 0.8 ~ 0.9 to // recommended values: 0.75 ~ 0.97

@L-M-Sherlock you've been inactive for a couple of days, so there is a good chance you missed my comment above. I'm pinging you just to remind you about it.

I am just tired of maintaining the optimizer module. You can check these parameters in the batch training on the collected data: open-spaced-repetition/fsrs4anki#351 (comment). There are some cases where the initial stability of Again is larger than the initial stability of Hard, or the initial stability of Good is larger than the initial stability of Easy. These cases could have different reasons. We should deal with these problems according to the concrete cases.

Ok, forget about 1, but I would still ask you to test 2 and 3.

For 2, here is an extreme case:

[image: example collection where retention after a first rating of Easy is 100%]

The user always remembered in the next review when they pressed Easy during the first learning step. In this case, the retention is 100%. If we use this value, the additive smoothing will be useless.

I think you misunderstood my idea a little bit. I didn't mean "use a different initial guess for each grade", I meant "use the same initial guess for all grades". So just calculate the average retention over all second reviews.
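
A minimal sketch of what I mean, assuming a revlog dataframe where i == 2 marks second reviews and y is the binary recall outcome (these column names, and the smoothing formula, are assumptions rather than the optimizer's exact code):

    import pandas as pd

    def initial_guess_from_second_reviews(df: pd.DataFrame) -> float:
        # one shared p0: average retention over all second reviews,
        # regardless of the first rating
        second_reviews = df[df["i"] == 2]
        return second_reviews["y"].mean()

    def smoothed_retention(recall_sum, total, p0, strength=1.0):
        # generic additive (Laplace-style) smoothing toward the prior p0
        return (recall_sum + p0 * strength) / (total + strength)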

By the way, have you automated running statistical significance tests on all collections?

3. When using the outlier filter based on IQR, use ln(delta_t) rather than delta_t itself. Filtering based on IQR doesn't work well on data that isn't normally distributed, and delta_t certainly isn't.

I'm testing this in all 66 collections.

I think you misunderstood my idea a little bit. I didn't mean "use a different initial guess for each grade", I meant "use the same initial guess for all grades". So just calculate the average retention over all second reviews.

OK. I will test it after the above test. It will take nearly 3 hours.

I'm testing this in all 66 collections.

Before:

Weighted RMSE: 0.04149183369953192
Weighted Log loss: 0.3815897150075234
Weighted MAE: 0.02342977913950602
Weighted R-squared: 0.7697902622572932

After:

Weighted RMSE: 0.04174954832152736
Weighted Log loss: 0.38212856042129156
Weighted MAE: 0.02374078044685508
Weighted R-squared: 0.7672438581669868

p = 0.0045 (for RMSE)
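
For reference, a hedged sketch of how such a p-value can be computed from per-collection results, here with a paired Wilcoxon signed-rank test (the actual test used is not stated in this thread):

    from scipy.stats import wilcoxon

    # rmse_before and rmse_after are the per-collection weighted RMSEs for the
    # 66 collections, in the same order
    def compare_rmse(rmse_before, rmse_after):
        statistic, p_value = wilcoxon(rmse_before, rmse_after)
        return p_value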

3. When using the outlier filter based on IQR, use ln(delta_t) rather than delta_t itself. Filtering based on IQR doesn't work well on data that isn't normally distributed, and delta_t certainly isn't.

It's worse than the current version, and the difference is statistically significant.

Here is the code:

    def remove_outliers(group: pd.DataFrame) -> pd.DataFrame:
        # earlier threshold candidates, kept for reference:
        # threshold = np.mean(group['delta_t']) * 1.5
        # threshold = group['delta_t'].quantile(0.95)

        # IQR-based upper fence computed on ln(delta_t) instead of raw delta_t
        log_delta_t = group['delta_t'].map(np.log)
        Q1 = log_delta_t.quantile(0.25)
        Q3 = log_delta_t.quantile(0.75)
        IQR = Q3 - Q1
        threshold = Q3 + 1.5 * IQR
        # keep only rows whose log interval is within the fence
        return group[log_delta_t <= threshold]
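
For completeness, this filter would presumably be applied within each first-rating group before fitting S0, e.g. via a pandas groupby (the grouping column name here is an assumption):

    # apply the outlier filter separately within each first-rating group
    filtered = df.groupby("first_rating", group_keys=False).apply(remove_outliers)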

Huh, I'm surprised. Maybe the more data is removed, the easier it is for FSRS to fit the remaining data well? In other words, what if we cannot rely on RMSE when removing outliers because, between two methods that both aim at removing outliers, the one that removes more data will always result in a lower RMSE?

Removing more data does not always result in a lower RMSE. Removing too much data might lead to underfitting, where the model fails to capture the underlying trend of the data. This can also increase the RMSE.

Alright, then test the idea with p0 for additive smoothing, and that's it.
After that, I would like you to benchmark all 5 algorithms; I'll explain it in a bit more detail in the relevant issue.

additive smoothing:

Weighted RMSE: 0.04147353655819303
Weighted Log loss: 0.3815885589708383
Weighted MAE: 0.023376754517799636
Weighted R-squared: 0.7699164899424069

p=0.38

It is slightly better but not statistically significant.

Removing more data does not always result in a lower RMSE. Removing too much data might lead to underfitting, where the model fails to capture the underlying trend of the data. This can also increase the RMSE.

I agree that removing more data would not always result in a lower RMSE.

But here, we are selectively removing the data that lies on the right-hand side of the curve (and not just random data). So, the remaining data is more homogeneous, and this might explain why the RMSE is lower.

So, the remaining data is more homogeneous, and this might explain why the RMSE is lower.

Yeah, I'm just surprised that my approach is somehow worse, even though in theory IQR should work better with normally distributed data.

I think that the increase in RMSE that we saw when using log of delta_t is just an artifact.

For example, when the optimizer filtered out all the cards with first rating = Again in my collection, the RMSE got a crazy low value (0.0056). I first mentioned this here: open-spaced-repetition/fsrs4anki#348 (comment)

I think that the increase in RMSE that we saw when using log of delta_t is just an artifact.

So we should not only consider the RMSE, right? We should have other criteria to decide whether an idea should be employed in FSRS.

Maybe decide this based on the number of datapoints in each case?

I will adopt this idea, not for the sake of enhancing the model's accuracy, but to alleviate users' confusion. Therefore, I will not run evaluation tests.

I think that the increase in RMSE that we saw when using log of delta_t is just an artifact.

So we should not only consider the RMSE, right? We should have other criteria to decide whether an idea should be employed in FSRS.

Yes, but I don't know which metric would be appropriate in this case.

Also, let's discuss this further in #16.