open-spaced-repetition / fsrs-optimizer

FSRS Optimizer Package

Home Page: https://pypi.org/project/FSRS-Optimizer/

Use results from benchmark experiment as initial values of S0

user1823 opened this issue

@L-M-Sherlock, I recommend using S0 = 1.5 for Hard. Currently, 0.6 is used, which is too small.

Originally posted by @user1823 in #16 (comment)

When I was running the benchmark on 66 collections, I also wrote down S0. Here are the average values, weighted by ln(reviews):
S0(Again)=0.6
S0(Hard)=1.4
S0(Good)=3.3
S0(Easy)=10.1

I suggest running a statistical significance test to determine whether these values are better than the ones currently used.

Originally posted by @Expertium in #16 (comment)
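As context for the weighting Expertium describes, here is a minimal sketch of an ln(reviews)-weighted mean; the arrays are made-up placeholders, since the benchmark's raw data isn't shown in this thread:

```python
import numpy as np

# Hypothetical per-collection S0(Hard) estimates and total review counts.
s0_hard = np.array([0.8, 1.2, 2.0, 1.5, 0.9])
reviews = np.array([500, 12000, 3000, 800, 150])

# Weight each collection's S0 by the log of its review count, so huge
# collections don't completely dominate the average.
weighted_mean = np.average(s0_hard, weights=np.log(reviews))
print(round(weighted_mean, 2))
```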

In my opinion, we should just replace the S0 for Hard because the currently used value for Hard doesn't make much sense.

Also, the result of such a change would not be statistically significant because it would only affect the values in those collections that have a very low number of reviews with Hard as the first rating. So, we don't need to run a statistical significance test here.

Originally posted by @user1823 in #16 (comment)

I want Sherlock to replace all 4 values, though. There is a pretty big difference between the currently used values (all four of them) and the ones I obtained from the benchmark. We need to find out which ones provide a better fit to users' repetition histories.

| Rating | Current | New |
| --- | --- | --- |
| Again | 0.4 | 0.6 |
| Hard | 0.6 | 1.4 |
| Good | 2.4 | 3.3 |
| Easy | 5.8 | 10.1 |
The values obtained from benchmarking are 50-100% greater.

Originally posted by @Expertium in #16 (comment)

I prefer a conservative set of initial values of S0. It's more about new users.

> I prefer a conservative set of initial values of S0. It's more about new users.

Yes, that's why I think that we should change the value for Hard only.

When a user presses Hard, it means that they can successfully recall the card. So, it is safe to assume that they will most likely be able to recall it the next day also. So, the S0 for Hard should be greater than 1 day.

By the way, I noticed an outlier in the data collected by Expertium (revlog number 22). It has an S0 of 21.92 days for Hard (and also for Good & Easy).

After setting its count to 0, I got 1.04 as the weighted mean of S0 for Hard, using ln(count) as the weight.

So, I suggest using 1 as the initial S0 for Hard.

Also, if you are wondering, here are the S0 values for all first ratings, weighted by ln(capped count), after setting the count of revlog 22 to 0 (a sketch of this weighting follows the list).

  • Again: 0.54
  • Hard: 1.04
  • Good: 2.95
  • Easy: 9.89
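A rough sketch of the weighting just described, assuming a cap on the review counts and giving the zeroed-out revlog a weight of 0 (the cap value and all numbers are placeholders, not data from this thread):

```python
import numpy as np

# Dummy per-collection S0(Hard) estimates and review counts; the third
# entry is the revlog-22 outlier, whose count has been set to 0.
s0_hard = np.array([0.9, 1.3, 21.92, 1.1])
counts = np.array([400, 2500, 0, 900])
CAP = 10_000  # assumed cap; the actual cap is not stated in the thread

capped = np.minimum(counts, CAP)
weights = np.zeros_like(capped, dtype=float)
mask = capped > 0
weights[mask] = np.log(capped[mask])  # ln(capped count); zeroed entries get weight 0

print(round(np.average(s0_hard, weights=weights), 2))
```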
> • Easy: 9.89

It's a little too long. I think it should not be longer than 7 days.

> • Easy: 9.89
>
> It's a little too long. I think it should not be longer than 7 days.

So, let's use the following formula to calculate S0 for Easy:

```python
# Keys are first ratings: 2 = Hard, 3 = Good, 4 = Easy.
rating_stability[4] = np.power(rating_stability[2], 1 - 1 / w2) * np.power(rating_stability[3], 1 / w2)
```

Using this formula, S0 for Easy comes out to be 5.9 days.

Edit:
But this looks too small (it is just about double the Good value). So, let's take the initial S0 for Easy to be 6.8 days.
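As a quick sanity check of the formula, the quoted 5.9 days can be reproduced with the benchmark values from earlier in the thread and w2 ≈ 0.6; that w2 is inferred here so the numbers line up, not taken from the optimizer's source:

```python
import numpy as np

# Benchmark initial stabilities quoted above: 2 = Hard, 3 = Good.
rating_stability = {2: 1.04, 3: 2.95}
w2 = 0.6  # assumed; inferred from the 5.9-day result, not an official constant

rating_stability[4] = np.power(rating_stability[2], 1 - 1 / w2) * np.power(
    rating_stability[3], 1 / w2
)
print(round(rating_stability[4], 1))  # -> 5.9 days for Easy
```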

Look, we can keep arguing about which set of initial S0 is better, and we can keep coming up with clever arguments, but why not just test it? @L-M-Sherlock, run the most recent version of the optimizer with both sets (current and this one) and run the usual statistical significance test.

> Look, we can keep arguing about which set of initial S0 is better, and we can keep coming up with clever arguments, but why not just test it?

The problem is that the S0 for collections with a large number of reviews wouldn't be affected much (if at all).

The only effect of this change would be on small collections. So, if you want to test this, test it only on the smaller collections.
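The thread doesn't spell out which test "the usual" one is; as a hedged sketch, here is one plausible reading using a Wilcoxon signed-rank test on paired per-collection losses, restricted to smaller collections as suggested above (all numbers and the size threshold are made up):

```python
import numpy as np
from scipy.stats import wilcoxon

# Dummy per-collection metrics (e.g., log loss) from two optimizer runs
# that differ only in the initial S0 values.
n_reviews = np.array([800, 15000, 2500, 600, 40000, 1200])
loss_current = np.array([0.351, 0.287, 0.402, 0.335, 0.290, 0.360])
loss_new = np.array([0.340, 0.287, 0.399, 0.322, 0.290, 0.352])

# Restrict to smaller collections, where the initial S0 actually matters.
small = n_reviews < 5000  # threshold is an assumption, not from the thread
stat, p = wilcoxon(loss_current[small] - loss_new[small])
print(f"W = {stat:.1f}, p = {p:.3f}")
```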

My concern with setting a high initial stability for the "easy" rating is that users might avoid using it. They could pick "good" instead, making the "easy" situations even rarer and potentially increasing its stability value further. This can create a negative cycle.

Ideally, the user shouldn't rate the card based on the displayed interval. But we can't prevent the user from doing that.

> Ideally, the user shouldn't rate the card based on the displayed interval. But we can't prevent the user from doing that.

I have definitely done this myself, especially when I had an exam deadline, but also when the initial Easy interval was longer than I liked. Also, there was a time when I used the pass/fail two-button system.

This reminds me that Anki does, in fact, have a setting to turn the next-interval display on/off (default: on).

Perhaps it should be mentioned in the Wiki/FAQ:

Draft

Grading your answer

The grade should be chosen based only on how easy it was to answer the card, not how long you want to wait until you see it again.

For example, if you habitually avoid the Easy button because it shows long intervals, you can end up in a negative cycle: you'd be making the "easy" situations even rarer and the Easy intervals longer and longer.

This means you should ignore the intervals shown above the answer buttons and instead focus on how well you recall the information. To help with this, you can hide the intervals in Anki's preferences:

[Screenshot: the Anki preference for hiding the next-interval display above the answer buttons]

If you still want to see a deck sooner rather than later, for example because you have an exam coming up, you can use the Advance function of the Helper add-on. Advance is the preferable method because it doesn't skew the grading history of the cards.

> My concern with setting a high initial stability for the "easy" rating is that users might avoid using it.

I completely agree. But 6.8 days is not too large, in my opinion. So, I think we can consider it for the initial S0 for Easy. The initial S0 for the other ratings can be taken from the benchmark (#23 (comment)).

@Expertium, if you are willing to test the effect of this change, I can push the change in my fork of the repository and then tell you how to use it in Colab. If you want, you can also do it on your own.

But keep in mind that the most noticeable effect would be on the smaller collections. So, when calculating the statistical significance, it would be wiser to include only the smaller collections.

The new initial values of S0 have been generated and recorded here: https://github.com/open-spaced-repetition/fsrs-benchmark#median-weights

Nice. Tell that to Dae so that he can update the initial parameters in the beta.