k-sys / covid-19

A collection of work related to COVID-19

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Feature request: use a binomial distribution to take the number of tests into account

ababino opened this issue · comments

Hi,
Awesome work! Thank you for such a valuable tool!

Since covidtracking.com has data on testing, I added it to the model. I used a binomial distribution instead of a Poisson. This change helps with variability in testing. Here is the result of the Binomial model for NY
download (3)

As you can see, the model has less variance, and I think it's more accurate. I put this together in a notebook here.

I hope this is useful.

Let me know if you want me to clarify anything, or work on how to incorporate this change into the new pymc3 notebook.

Hi @ababino! Very interesting idea---it's good to see something that handles the day-to-day fluctuations well. However, I don't think it is quite right to use the Binomial unless you are in the limit np << n (in which case it is just Poisson over again). To see why, consider what happens in an extreme case, where most of the tested population on day i-1 has COVID-19. Then the exponential formula use you predicts p_i n_i = p_{i-1} n_{i-1} exp(...), and p_i n_i > n_i (i.e. p_i > 1). That can't be right!

Here's a modification that could work, instead. Assume that the number of people being tested each day is the number that present with some set of symptoms / are in a situation that looks like they could be positive. (Since we generally don't do randomized testing in the US, except in small trial studies that are just getting started.). The set of people who can be tested is

n_test = p_covid * f_present * N + p_other * f_present * N

where p_covid is the fraction of the overall population with COVID-19 appearing on that day, f_present is the fraction of those eligible who will actually present for a test on that day, and p_other is the fraction of the population that has symptoms or otherwise would be in a position for testing. (Note that we have made the---questionable, but reasonable---assumption that the fraction of those eligible for a test because of symptoms or whatever who will actually present for a test is the same between the COVID and non-COVID groups.). p_covid * N is the thing that grows exponentially, via the usual formula involving R_t. If we are willing to assume that p_other is constant in time, then we have that the ratio

p_covid f_present N / (p_other f_presest N) = p_confirm / (1 - p_confirm)

where p_confirm is the fraction of the tests that are positive on a given day, grows exponentially, independent of the population size, the number of tests, fraction presenting, etc. This formula solves the problem where p_confirm can be above 1 if the exponential growth is fast enough.

Hi @farr! Thank you for your feedback! You are right. If the exponential growth is too fast, my model would have predicted p>1. I made the changes that you suggested.

Here are the results:
binomial

And, for reference here is the Poisson model:
poisson

As you can see, there is a disagreement between the models. For the last day with data:
Rt_Poisson = 1.17 (1.04,1.29)
Rt_Binomial = 0.83 (0.68, 0.96)

In the last week, NY doubled its testing capacity up to 40k tests daily. The rise in the number of tests was followed by an increase in the number of new positive cases. You can see it in the following chart from the NY state covid-19 tracker

Screenshot from 2020-04-27 09-47-28

The increase in the number of positive cases explains the increase in the Rt of the Poisson model.

At the same time, the number of new hospitalizations in NY due to covid-19 went down. Also, the NYS DOH started to estimate the Rt, and their estimate is 0.8. These two facts are in agreement with the Binomial model and not with the Poisson one.

Again, I hope this is useful, and if you have any comments, I am glad to hear them.

Also, I can work on how to incorporate this change into the new pymc3 notebook.

I implemented the Binomial model in Stan; notebook here, Stan file here. For states that have carefully curated data like NY, I agree that the binomial model makes more sense, and gives reasonable output:

image

but for other states, where the total number of tests is of much worse quality than the number of positive tests (I even see drops in the total number of tests day-to-day in some states; in others the total number of tests remains constant while the number of positive tests goes up---presumably because some negatives subsequently converted to positive and were recorded on a different day?). Probably in these states the Poisson model is BS, too (if you can't trust the number of negative tests why would you trust the number of positive?); but it at least has a chance of returning correct results. Here is tonight's state-by-state lineup:

image

Not very pretty huh?

Still, it's a useful teaching example; and I tried to motivate (you might even say "derive") the Binomial model in the above notebook, so that could be instructive.

@ababino Would you please explain how did you obtain pt = odds_t/(odds_t +1)

Hi! No problem!

odds = p / (1-p)
odds*(1-p) = p
odds - odds*p = p
odds = p + odds*p
odds = p(1+odds)
odds /(1+odds) = p

Is it clear? Do you think that there is something wrong?