Aalen-Johansen fit() input - different cif table when I pass a numpy array vs pd.Series

Question

ygivenx opened this issue a year ago

import lifelines

# df is a pandas DataFrame with "durations" and "event" columns.
def get_estimated_cif(durations, events, event_of_interest=1):
    ajf = lifelines.AalenJohansenFitter(calculate_variance=True)
    ajf.fit(durations, events, event_of_interest=event_of_interest)
    return ajf.cumulative_density_

# Events passed as a numpy array
get_estimated_cif(df["durations"].values, df["event"].values)
# Events passed as a pd.Series: the CIF table differs from the one above
get_estimated_cif(df["durations"].values, df["event"])

My dataset contains tied event times, so _jitter is called.

Paul Zivich · Answer 1 · Fri May 05 2023 02:47:36 GMT+0800 (China Standard Time)

The Aalen-Johansen estimator cannot handle ties: when events of different types occur at the same time, the computation needs a clear ordering between them. When AalenJohansenFitter detects ties, it breaks them randomly. The difference you see between a pd.Series and an np.array arises because the random number generator shifts the observations differently in each of the two calls.
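To make the effect of random tie-breaking concrete, here is a minimal standard-library sketch. The helper `jitter_event_times` and its parameters are hypothetical and only mimic the idea of adding a small uniform shift to each event time; it is not the lifelines implementation.

```python
import random


def jitter_event_times(durations, events, jitter_level=1e-4, seed=None):
    """Shift every observed event time by a small uniform random amount.

    Hypothetical helper for illustration only; not the lifelines code.
    Rows with event == 0 (censored) are returned unchanged.
    """
    rng = random.Random(seed)  # seed=None draws fresh entropy on every call
    return [
        t + rng.uniform(-1, 1) * jitter_level if e != 0 else t
        for t, e in zip(durations, events)
    ]


durations = [1.0, 1.0, 2.0, 2.0, 3.0]
events = [1, 2, 1, 2, 0]  # 0 = censored, 1 and 2 = competing event types

# Unseeded: each call draws new shifts, so the two results generally differ.
run_a = jitter_event_times(durations, events)
run_b = jitter_event_times(durations, events)

# Seeded: both calls draw the same shifts, so the results are identical.
run_c = jitter_event_times(durations, events, seed=42)
run_d = jitter_event_times(durations, events, seed=42)

print("unseeded runs equal:", run_a == run_b)
print("seeded runs equal:  ", run_c == run_d)
```

Because the shifts decide which of two tied events comes first, a different draw can reorder the events and change the resulting table. AalenJohansenFitter appears to expose a seed argument in its constructor for the same purpose, but confirm this against the documentation of your installed lifelines version.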

If you want multiple calls to AalenJohansenFitter to produce the same result, you should manually break any tied event times in the data set. That way _jitter is not called, so the same event table should be produced each time.
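One way to break ties manually is to add a fixed, increasing offset to each repeated event time, so the result is identical on every run. The sketch below is only an illustration: `break_ties` and its `eps` parameter are hypothetical names, and it assumes that only observed events (event != 0) need distinct times, so censored rows are left untouched.

```python
from collections import defaultdict


def break_ties(durations, events, eps=None):
    """Return a copy of durations in which tied event times are distinct.

    Hypothetical helper, not part of lifelines. The k-th repeat of an
    event time t becomes t + k * eps, in order of appearance.
    """
    durations = [float(t) for t in durations]
    events = list(events)
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")

    if eps is None:
        # Choose eps so that the largest possible offset stays below the
        # smallest gap between distinct times, preserving their order.
        distinct = sorted(set(durations))
        gaps = [b - a for a, b in zip(distinct, distinct[1:])]
        min_gap = min(gaps) if gaps else 1.0
        eps = min_gap / (len(durations) + 1)

    seen = defaultdict(int)  # how many events were already seen at time t
    result = []
    for t, e in zip(durations, events):
        if e != 0:
            result.append(t + seen[t] * eps)
            seen[t] += 1
        else:
            result.append(t)  # censored observation: keep as recorded
    return result


durations = [1, 1, 2, 2, 2, 3]
events = [1, 2, 0, 1, 2, 1]
untied = break_ties(durations, events)
print(untied)
```

The returned list can then be passed as the durations argument, for example get_estimated_cif(break_ties(df["durations"], df["event"]), df["event"].values). Note that the order in which tied events are separated follows the row order of the input, which is itself a modelling choice you may want to make explicitly.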