vivekjoshy / openskill.py

Multiplayer Rating System. No Friction.

Home Page: https://openskill.me


Are `predict_win` and `predict_draw` functions accidentally using Thurstone-Mosteller specific calculations?

asyncth opened this issue

If I understand it correctly, those two functions seem to perform calculations using the equations numbered (65) in the paper. However, those equations seem to be specific to the Thurstone-Mosteller model, and as far as I can tell, the proper way to calculate probabilities for the Bradley-Terry model would be to use equations (48) and (51) (also seen as $p_{iq}$ in equation (49)). Is this intended? Or am I misunderstanding either the paper or the code of these functions?
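
For reference, equation (48) as I read it gives the Bradley-Terry pairwise win probability in logistic form:

$$ p_{iq} = \frac{e^{\mu_i / c_{iq}}}{e^{\mu_i / c_{iq}} + e^{\mu_q / c_{iq}}}, \qquad c_{iq} = \sqrt{\sigma_i^2 + \sigma_q^2 + 2\beta^2} $$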

The prediction functions are not derived solely from (65), but rather from its combination with (72). AFAIK there are no papers or articles that describe how to generalize $f_q(\pmb{z})$. Equation (37) gives the general form of $f_q$ given $C_q$ through factorization:

$$ f(\pmb{z}) = \prod_{q=1}^{m} f_q(\pmb{z}) $$
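
Concretely, the generalized pairwise term the current implementation evaluates for each ordered pair of teams is, as I understand the derivation (with $n$ the number of teams and $\Phi$ the standard normal CDF):

$$ P(i \text{ beats } q) \approx \Phi\left(\frac{\mu_i - \mu_q}{\sqrt{n\beta^2 + \sigma_i^2 + \sigma_q^2}}\right) $$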

But when I tried to apply the modified prediction function (it is easy enough to alter), it produced virtually the same results.

Example

Using the current generalized formula as implemented, BradleyTerryFull produces this result in the benchmarks:

Enter Model: BradleyTerryFull
Benchmark Processor: Win
Enter Random Seed: 1
----------------------------------------
Confident Matches:  5661
Predictions Made with OpenSkill's BradleyTerryFull Model:
Correct: 583 | Incorrect: 52
Accuracy: 91.81%
Process Duration: 0.8336913585662842
----------------------------------------
Predictions Made with TrueSkill Model:
Correct: 593 | Incorrect: 42
Accuracy: 93.39%
Process Duration: 2.950780153274536
Mean Matches: 2.3195027353377617
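
For comparison, the pairwise step of the current generalized implementation boils down to the normal-CDF form above. Here is a minimal self-contained sketch of just that step (the helper name and the default beta of 25/6 are mine, mirroring openskill's defaults):

import math
from statistics import NormalDist

def pairwise_win_probability(
    mu_a: float, sigma_a: float, mu_b: float, sigma_b: float,
    n: int, beta: float = 25 / 6,
) -> float:
    """P(a beats b) via the standard normal CDF, as the current
    predict_win computes it for each ordered pair (n = team count)."""
    c_iq = math.sqrt(n * beta**2 + sigma_a**2 + sigma_b**2)
    return NormalDist().cdf((mu_a - mu_b) / c_iq)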

Here is a benchmark with equation (48) implemented into predict_win:

import itertools
import math
from collections import deque
from typing import List, Union

# `Rating`, `team_rating`, and `beta` are openskill.py internals, assumed
# to be in scope exactly as in the library's own predict_win.


def predict_win(teams: List[List[Rating]], **options) -> List[Union[int, float]]:
    if len(teams) < 2:
        raise ValueError("Expected at least two teams.")

    n = len(teams)

    pairwise_probabilities = []
    for pairwise_subset in itertools.permutations(teams, 2):
        current_team_a_rating = team_rating([pairwise_subset[0]])
        current_team_b_rating = team_rating([pairwise_subset[1]])
        mu_a = current_team_a_rating[0][0]
        sigma_a = current_team_a_rating[0][1]
        mu_b = current_team_b_rating[0][0]
        sigma_b = current_team_b_rating[0][1]
        ciq = math.sqrt(n * beta(**options) ** 2 + sigma_a**2 + sigma_b**2)
        # Equation (48): logistic (Bradley-Terry) pairwise probability.
        # probability_iq is P(b beats a), so its complement is P(a beats b).
        probability_iq = 1 / (1 + math.exp((mu_a - mu_b) / ciq))
        pairwise_probabilities.append(1 - probability_iq)

    if n > 2:
        # Combine the (n - 1) ordered-pair probabilities belonging to each
        # team into a single per-team value.
        cache = deque(pairwise_probabilities)
        probabilities = []
        partial = len(pairwise_probabilities) / n
        while len(cache) > 0:
            aggregate = [cache.popleft() for _ in range(int(partial))]
            aggregate_sum = sum(aggregate)
            aggregate_multiple = n
            for length in range(1, n - 2):
                aggregate_multiple *= n - length
            probabilities.append(1 - (aggregate_sum / aggregate_multiple))
        return probabilities
    else:
        return pairwise_probabilities
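
For instance, calling the modified function on a two-team match might look like this (the rating values are illustrative; Rating is openskill.py's rating class):

team_a = [Rating(mu=28.0, sigma=7.0)]
team_b = [Rating(mu=25.0, sigma=8.0)]

# With two teams the raw pairwise probabilities are returned directly,
# so the two values sum to 1.
print(predict_win([team_a, team_b]))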

The code above is a bit inefficient and has worse time complexity than the current implementation, but it is generally what (48) looks like translated to code. Here are the benchmark results:

Enter Model: BradleyTerryFull
Benchmark Processor: Win
Enter Random Seed: 1
----------------------------------------
Confident Matches:  5661
Predictions Made with OpenSkill's BradleyTerryFull Model:
Correct: 583 | Incorrect: 52
Accuracy: 91.81%
Process Duration: 0.8177695274353027
----------------------------------------
Predictions Made with TrueSkill Model:
Correct: 593 | Incorrect: 42
Accuracy: 93.39%
Process Duration: 3.0598957538604736
Mean Matches: 2.3195027353377617

As you can see, in practical terms the results are virtually the same. It might lend some credence to having custom prediction functions for each model if there were data showing they are more effective. Perhaps they are, perhaps they aren't. But without n-team match data, it's not worth having such a piecewise function.

If time allows and there is some evidence, I am willing to implement or merge such code. If this answers your question, feel free to close this issue.

Thanks for replying. There is no need to change it if it's intentional; I thought it might be a mistake.