dkobak / excess-mortality

Excess mortality during COVID-19 pandemic

Moscow and St. Petersburg mortality data reliability

rdaysky opened this issue

In the paper you state:

Moscow and St. Petersburg, two regions with arguably the most reliable reporting of Covid‐19 mortality.

However, the data for both of those cities exhibits the following peculiarity:

  • On 2020-07-28, 23 deaths were reported in St. Petersburg. The figure was the same on 2020-07-29.
  • On 2020-08-13, 11 deaths were reported in Moscow. The next day’s number was the same.
  • Since then, in both cities the figure never repeated on two consecutive days.

This is not characteristic of the other regions. SPb’s 192-day streak and Moscow’s 176-day streak are followed by a very distant third, Nizhegorodskaya oblast, where the figures haven’t repeated for 34 days. The median across regions is 2 days.
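A minimal sketch of how such a streak could be counted, assuming each region's data is a chronological list of daily reported deaths. The function name, the `regions` dictionary and its numbers are made up for illustration, and the exact streak definition used for the figures above may differ slightly:

```python
from statistics import median

def days_since_last_repeat(counts):
    """Count trailing days whose figure differs from the previous day's figure."""
    streak = 0
    # Walk backwards from the latest day until two consecutive days match.
    for i in range(len(counts) - 1, 0, -1):
        if counts[i] == counts[i - 1]:
            break
        streak += 1
    return streak

# Hypothetical toy data, not the real series.
regions = {
    "A": [23, 23, 25, 24, 26],   # last repeat is three transitions back
    "B": [4, 5, 5],              # repeated on the latest day
    "C": [1, 2, 3],              # never repeated
}
streaks = {name: days_since_last_repeat(c) for name, c in regions.items()}
median_streak = median(list(streaks.values()))
```

Sorting `streaks` by value would give the ranking across regions.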

A back-of-the-envelope calculation: assume that a particular day’s figure could have been anything within the range spanned by the data reported within ±3 days of that date, distributed uniformly. What is the probability that it never matches the previous day’s value? This probability is 5×10⁻⁴ for SPb and 5×10⁻¹² for Moscow. The Moscow figure includes 60 consecutive days, 2020-11-15 through 2021-01-13, where all the data fell into the [70, 77] interval. Despite the narrowness of that interval, which is in itself uncharacteristic of a random process, the figure never repeated and never exceeded the previous maximum, the 2020-05-30 value of 78.
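One possible reading of this calculation, as a sketch. It assumes that each day's figure is drawn independently and uniformly from the integers between the minimum and maximum reported within ±3 days, and that the window includes the day itself. The function name and the `half_window` parameter are introduced here for illustration, and this is not necessarily the exact procedure behind the numbers above:

```python
import math

def prob_no_consecutive_repeat(counts, half_window=3):
    """Probability that no day equals the previous day under the uniform-window model."""
    log_p = 0.0
    for t in range(1, len(counts)):
        # Values reported within +-half_window days of day t (clipped at the edges).
        window = counts[max(0, t - half_window): t + half_window + 1]
        n_values = max(window) - min(window) + 1
        if n_values == 1:
            # Only one possible value, so a repeat would be certain.
            return 0.0
        # The previous day's value lies inside the window, so a match has
        # probability 1 / n_values; accumulate log(1 - 1/n_values).
        log_p += math.log1p(-1.0 / n_values)
    return math.exp(log_p)
```

Under this model every day contributes a factor of 1 − 1/n, where n is the width of the local range, so a long stretch confined to a narrow interval drives the product down quickly.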

A histogram of the Moscow data is below.

Perhaps the statement needs revision?

 0  3 |||
 1  3 |||
 2  3 |||
 3  1 |
 4  0 
 5  2 ||
 6  0 
 7  3 |||
 8  2 ||
 9  3 |||
10  9 |||||||||
11 15 |||||||||||||||
12 18 ||||||||||||||||||
13 12 ||||||||||||
14 14 ||||||||||||||
15  3 |||
16  3 |||
17  2 ||
18  1 |
19  1 |
20  2 ||
21  1 |
22  1 |
23  2 ||
24  5 |||||
25  3 |||
26  1 |
27  4 ||||
28  6 ||||||
29  4 ||||
30  1 |
31  2 ||
32  3 |||
33  1 |
34  4 ||||
35  4 ||||
36  0 
37  3 |||
38  1 |
39  2 ||
40  0 
41  3 |||
42  0 
43  0 
44  2 ||
45  0 
46  0 
47  1 |
48  2 ||
49  3 |||
50  2 ||
51  3 |||
52  5 |||||
53  4 ||||
54  2 ||
55  4 ||||
56  3 |||
57  2 ||
58  4 ||||
59  2 ||
60  1 |
61  4 ||||
62  3 |||
63  4 ||||
64  2 ||
65  3 |||
66  3 |||
67  5 |||||
68  7 |||||||
69  5 |||||
70  3 |||
71 11 |||||||||||
72  9 |||||||||
73 11 |||||||||||
74 13 |||||||||||||
75 14 ||||||||||||||
76 15 |||||||||||||||
77 10 ||||||||||
78  1 |
79  1 |
80  0 
81  1 |
82  0 
83  0 
84  1 |

Thanks. That's a good point and a very nice analysis. I've never seen it done before, but I have seen something related: in many regions, the variance of the reported numbers is lower than the Poisson variance, which also suggests that the reported numbers were somehow "processed" and are not really true counts.
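A rough sketch of that variance check, assuming a short stretch of daily counts with no strong trend (a trend would inflate the variance and would have to be removed first). The function name is made up for illustration. For Poisson counts the variance equals the mean, so the ratio should be close to 1:

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by the mean; about 1 for Poisson-distributed counts."""
    return variance(counts) / mean(counts)

# Hypothetical example: eight distinct values filling the 70..77 corridor.
corridor = [70, 71, 72, 73, 74, 75, 76, 77]
index = dispersion_index(corridor)
```

Here the sample variance is 6 against a mean of 73.5, so the index is far below 1, which is the kind of underdispersion described above.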

IMHO this does not necessarily prove malicious intent: maybe the numbers were somehow "spread" over the neighboring days or even "guesstimated" by somebody, and these people avoided putting the same number twice. Not good -- but not necessarily evidence of deliberate fraud... Could be though.

Anyway, that sentence in the Significance paper ("Moscow and St. Petersburg, two regions with arguably the most reliable reporting of Covid‐19 mortality") refers to the Rosstat Covid death numbers and not to the "daily reported" Covid death numbers. And specifically it refers to the fact that the sum of Rosstat numbers across four categories matches the excess mortality, as shown in the figure.

Sorry, I don’t quite follow. The numbers were published in real time. How could the reporting entity stay in the 70–77 interval for two months without somehow straying from the true data, which they of course did not know in advance? (Your excess mortality data shows a linear increase during the period, no signs of a plateau.) Did they initially over-report the deaths with the long-term plan of staying true to the average? I can’t imagine an official doing such a thing, but I can easily imagine someone being told not to exceed the May maximum (on paper, of course).

By the way, there’s also the earlier 22-day-long interval of {10, 11, 12} without repetitions in Moscow data.

As for Rosstat Covid death numbers vs daily reported numbers, doesn’t one equal the sum of the other? How can one be reliable when the other is so suspect? Once again, it’s not like they published monthly totals first and invented daily breakdowns later; the reporting was in real time.

Just wanted to chime in.

The process is not totally random either, so I wonder if it's correct to treat it as one. For example, officials stated that the daily values peak on Wednesdays and Thursdays. Their explanation was that most people come for testing on Monday and result processing takes 48 hours. Link

@mkpankov Good point, but things like that should cause overdispersion (compared to Poisson), not underdispersion. What we see in many Russian regions is underdispersion (the so-called "Sobyanin corridors").
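To spell out why a weekday effect can only inflate the variance, here is the standard law-of-total-variance argument. It assumes each day's count N is Poisson with a rate λ that itself varies from day to day; both symbols are introduced here for illustration:

```latex
\operatorname{Var}(N)
  = \mathbb{E}\big[\operatorname{Var}(N \mid \lambda)\big]
  + \operatorname{Var}\big(\mathbb{E}[N \mid \lambda]\big)
  = \mathbb{E}[\lambda] + \operatorname{Var}(\lambda)
  \;\ge\; \mathbb{E}[\lambda] = \mathbb{E}[N]
```

So any day-to-day variation in the rate pushes the variance above the mean, never below it.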

@rdaysky Well, I agree it's suspicious and probably means some kind of data tampering by the operational headquarters ("оперативный штаб"). Maybe the formulation in the article was not ideal. However, the fact remains that the sum of the Rosstat-reported Covid deaths (sum over their four categories) is very close to the excess mortality in Moscow and St. Petersburg.

@dkobak The entire point of your research is to figure out what the real data is, so maybe one thing you could do is find regions where the data obeys statistical laws and use that to extrapolate.

There has been no reason to doubt the all-cause mortality numbers from Rosstat, which is what I am primarily using.

Anyway, thanks everybody for the comments but I am closing this issue now.