[BUG] SBMEstimator is not right; it is systematically underbiased for loopless graphs

Question

ebridge2 opened this issue 2 years ago · comments

Expected Behavior

For a loopless graph, SBMEstimator should exclude the diagonal when estimating block probabilities. It does not seem to account for loops properly.

Actual Behavior

I cannot find anything in the codebase showing that loops are removed before the block probabilities are estimated. From what I gather, SBMEstimator first obtains the block indices for each vertex (https://github.com/microsoft/graspologic/blob/ff34382d1ffa0b7ea5f0e005525b7364f977e86f/graspologic/models/sbm_estimators.py#L208) and then, on the next line, calculates the block probability matrix (https://github.com/microsoft/graspologic/blob/ff34382d1ffa0b7ea5f0e005525b7364f977e86f/graspologic/models/sbm_estimators.py#L212). That call takes no input for whether or not the network is directed, which feels like it should be critical. The probability is then computed at https://github.com/microsoft/graspologic/blob/ff34382d1ffa0b7ea5f0e005525b7364f977e86f/graspologic/models/sbm_estimators.py#L482, and the result is passed along to https://github.com/microsoft/graspologic/blob/ff34382d1ffa0b7ea5f0e005525b7364f977e86f/graspologic/models/base.py#L16. None of these steps knows whether or not the graph has loops, so I cannot see how loops could be accounted for properly. Stepping through line by line with a debugger seemed unnecessarily tedious, since I can demonstrate the mishandling with code instead.
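To make the expected handling concrete, here is a minimal sketch of a block probability estimate that excludes the diagonal when `loops=False`. The function `block_probabilities` and its parameters are hypothetical names used only for illustration; this is not graspologic's implementation, just one way the normalization could be done with plain NumPy.

```python
import numpy as np


def block_probabilities(adj, labels, loops=False):
    """Hypothetical helper (not part of graspologic).

    Estimates each block probability as the mean of the adjacency entries
    in that block, leaving out the diagonal when loops=False so that
    self-loops count neither as edges nor as possible edges.
    """
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    blocks = np.unique(labels)

    # possible[i, j] is True for every entry that could hold an edge
    possible = np.ones(adj.shape, dtype=bool)
    if not loops:
        np.fill_diagonal(possible, False)

    p_hat = np.zeros((len(blocks), len(blocks)))
    for a, block_a in enumerate(blocks):
        rows = labels == block_a
        for b, block_b in enumerate(blocks):
            cols = labels == block_b
            sub = np.ix_(rows, cols)
            # mean over the allowed entries only; a one-vertex block with
            # loops=False has no allowed entries and would yield nan
            p_hat[a, b] = adj[sub][possible[sub]].mean()
    return p_hat


# Complete loopless graph on 4 vertices, split into two blocks of 2
A_full = np.ones((4, 4)) - np.eye(4)
labels = np.array([0, 0, 1, 1])
p_loopless = block_probabilities(A_full, labels, loops=False)
p_with_diag = block_probabilities(A_full, labels, loops=True)
```

In this toy case every allowed entry is an edge, so excluding the diagonal gives 1.0 in every block, while counting the diagonal pulls the within-block estimates down to 0.5. For an undirected graph the mean over the off-diagonal entries of the symmetric matrix equals the mean over the upper triangle, so no separate `directed` flag is needed in this sketch.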

Example Code

The code below simulates an ER(10, 0.5) graph and fits it. It relies on the observation that the normalizing factor in the computation appears to be n^2 (not right for a loopless graph) instead of n*(n-1) (which is right). If the normalization is mishandled in this way, multiplying the estimate by n^2/(n*(n-1)) should recover the correct probability, which is what I do below.

import graspologic as gp
import numpy as np

n = 10
A = gp.simulations.er_np(n, 0.5, directed=False, loops=False)
# fit it using graspologic
fit_mod = gp.models.EREstimator(directed=False, loops=False).fit(A)

print("Estimate: {:.4f}".format(fit_mod.p_))
# if loops are mishandled (e.g., the loopless structure is ignored),
# the next two values should match.
# the first one assumes the estimate is off by a factor of
# n^2/(2*binom(n, 2)) = n^2/(n*(n-1)), based on how
# graspologic's sbm handles directed networks (ignores)
print("Estimate (adjusted with assumption of loops being mishandled): {:.4f}".format(fit_mod.p_ * (n**2/(n*(n-1)))))

inds = np.triu_indices(n, k=1)
print("Correct answer: {:.4f}".format(np.mean(A[inds])))

Which produces:

Estimate: 0.5400
Estimate (adjusted with assumption of loops being mishandled): 0.6000
Correct answer: 0.6000

The first estimate is biased downward in every run I tried, whereas the other two always agree. That tells me the diagonal is not being excluded from the computation: if it were, the adjusted estimate and the correct answer would not coincide.
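The size of the bias follows from a short calculation, written here under the assumption (suggested by the output above, not confirmed from the source code) that the estimator averages over all n^2 entries of the adjacency matrix A. Because a loopless graph has A_ii = 0, the sum over all entries equals the sum over off-diagonal entries:

```latex
\hat{p}_{\text{naive}}
  = \frac{1}{n^2} \sum_{i,j} A_{ij}
  = \frac{n(n-1)}{n^2} \cdot \frac{1}{n(n-1)} \sum_{i \neq j} A_{ij}
  = \frac{n-1}{n} \, \hat{p}
```

With n = 10 the factor is 0.9, and 0.9 * 0.6000 = 0.5400, which matches the first and third printed values.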

Your Environment

Python version: 3.8.4
graspologic version: whatever is on master as of 3/10/2022; looks like 1.0.1

Ben Pedigo · Answer 1 · Thu Mar 10 2022 21:20:25 GMT+0800 (China Standard Time)

you're right, i noticed this a few months ago and have yet to push my fix - thanks for the PR