CamDavidsonPilon / lifelines

Survival analysis in Python

Home Page: lifelines.readthedocs.org

Aalen-Johansen fit() input - different cif table when I pass a numpy array vs pd.Series

ygivenx opened this issue

import lifelines

def get_estimated_cif(durations, events, event_of_interest=1):
    # Fit the Aalen-Johansen estimator and return the cumulative incidence table.
    ajf = lifelines.AalenJohansenFitter(calculate_variance=True)
    ajf.fit(durations, events, event_of_interest=event_of_interest)
    return ajf.cumulative_density_

# df is a DataFrame with "durations" and "event" columns.
get_estimated_cif(df["durations"].values, df["event"].values)
get_estimated_cif(df["durations"].values, df["event"])  # CIF table differs from the one above

There are tied event times in my dataset, so _jitter is called.

The Aalen-Johansen estimator can't handle ties: when events of different types occur at the same time, the computation needs a clear ordering between them. When AalenJohansenFitter sees ties, it breaks them randomly. The difference you see between a pd.Series and an np.array is not caused by the input type; it appears because the random number generator shifts the observations differently between the two calls.
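To illustrate why two calls on identical data can disagree, here is a toy sketch of random tie-breaking. It is not the actual _jitter code from lifelines; the function name jitter_ties, the level parameter, and the sample durations are all made up for this example, and only the standard library is used.

```python
import random

def jitter_ties(durations, rng, level=1e-4):
    """Toy tie-breaker (hypothetical, not lifelines' _jitter):
    shift every duration that occurs more than once by a small
    uniform random amount drawn from rng."""
    counts = {}
    for t in durations:
        counts[t] = counts.get(t, 0) + 1
    return [
        t + rng.uniform(-level, level) if counts[t] > 1 else float(t)
        for t in durations
    ]

durations = [2, 5, 5, 7, 7, 7]  # made-up data with ties at 5 and 7

# Two independent generators stand in for two separate fit() calls.
first = jitter_ties(durations, random.Random(1))
second = jitter_ties(durations, random.Random(2))
print(first == second)  # compares the two sets of shifted times

# Re-using the same generator state reproduces the same shifts.
again = jitter_ties(durations, random.Random(1))
print(first == again)
```

The input is identical in every call; only the state of the random number generator changes, and that alone is enough to move the tied times to different places.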

If you want multiple calls to AalenJohansenFitter to produce the same result, you should manually break any tied event times in the data set. That way _jitter is not called, so the same event table should be produced each time.
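A minimal sketch of one way to break ties by hand, assuming a fixed, arbitrary ordering (order of appearance) is acceptable for your analysis. The helper break_ties and its epsilon parameter are hypothetical and not part of lifelines. It shifts every tied event time, which may be more than strictly needed, and leaves censored rows (event == 0) untouched.

```python
from collections import defaultdict

def break_ties(durations, events, epsilon=1e-6):
    """Return a copy of durations in which tied event times are distinct.

    Hypothetical helper: the k-th event (k = 0, 1, 2, ...) seen at a
    given time is shifted by k * epsilon. Censored observations
    (event == 0) are left unchanged. epsilon should be smaller than
    the smallest gap between distinct times in the data.
    """
    seen = defaultdict(int)  # number of events already seen at each time
    result = []
    for t, e in zip(durations, events):
        if e == 0:
            result.append(float(t))
            continue
        result.append(t + seen[t] * epsilon)
        seen[t] += 1
    return result

# Made-up example: three observations at t=5 (two events, one censored)
# and two events at t=3.
durations = [5, 5, 5, 3, 3]
events = [1, 2, 0, 1, 1]
print(break_ties(durations, events, epsilon=0.1))
```

Because the shifts depend only on the data and its order, the output is the same on every run. Passing the adjusted durations to ajf.fit in place of the originals should then give the same table each time, since no tied event times remain to be broken at random.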