timriffe / covid_age

COVerAGE-DB: COVID-19 cases, deaths, and tests by age and sex

Time series of cumulative counts can have bumps and are not always increasing

liuyanguu opened this issue · comments

I noticed this issue when my colleagues asked for time series of cases/deaths from the database. Because the database is built cumulatively, it is possible to get time series for countries that (luckily) have frequent data updates.

This is not common, as the data for most countries are quite consistent, but it happens in several countries, for example in the deaths for Indonesia. I plotted deaths against the update dates; the series should increase monotonically, as they do in most countries. I assume this bump is caused by a change in the data source?
image

The effect is less obvious when zooming out to all age groups:
image

Just to give some other examples:
image
image
image

library(covidAgeData)
library(data.table)
library(ggplot2)
dt5_ori <- covidAgeData::download_covid(data = "Output_5", temp = TRUE,
                          verbose = FALSE, progress = FALSE, return = "data.table")
c_list <- sort(unique(dt5_ori$Country))
# plot one line per age group of a cumulative measure over time for one country
plot_by_c <- function(country0, measure0 = "Deaths"){
  # keep national totals for both sexes combined, dropping missing values
  dt5_c <- dt5_ori[Country == country0 & Sex=="b" & Region == "All"  & !is.na(get(measure0))]
  # dt5_c <- dt5_ori[Country == country0 & Sex=="b" & Region == "All" & Age <=20 & !is.na(get(measure0))]
  if(nrow(dt5_c) == 0) return(NULL)
  # dates are stored as text in day.month.year format
  dt5_c[, Date:= as.Date(Date, format = "%d.%m.%Y")]
  dt5_c[, Age:=factor(Age, levels = seq(0, 100 ,by = 5))]
  # aes() with .data[[ ]] replaces the deprecated aes_string()
  g <- ggplot(data = dt5_c, aes(x = Date, y = .data[[measure0]], color = Age, group = Age)) +
    # geom_bar(stat="identity", width=0.5, show.legend = FALSE, color = "#0058AB") +
    geom_line() +
    labs(x = "", y = "") +
    scale_x_date(date_labels = "%Y-%m") +
    scale_y_continuous(expand = c(0,0)) +
    ggtitle(paste(country0, "-", measure0)) +
    theme_classic()
  return(g)
}
plot_by_c("Indonesia", measure0 = "Deaths")
# to see all the countries
plist <- invisible(lapply(c_list, plot_by_c, measure0 = "Deaths"))
plist <- plist[!sapply(plist, is.null)]
plist <- lapply(plist, ggplotGrob)
ggsave(filename = "time series of deaths_all_age.pdf",
       plot = gridExtra::marrangeGrob(grobs = plist, ncol = 2, nrow = 2), width = 20, height = 15)
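
The affected series could also be listed without reading the plots. The following is only a minimal sketch in base R on a toy data frame with made-up values; `find_decreases` is a hypothetical helper, not part of covidAgeData, and in practice it would be applied to one Country/Region/Sex/Age slice of Output_5 at a time.

```r
# Toy cumulative series with hypothetical values.
toy <- data.frame(
  Date   = as.Date(c("2020-05-01", "2020-05-02", "2020-05-03", "2020-05-04")),
  Deaths = c(10, 14, 12, 15)
)

# Return the rows where a cumulative count is lower than on the previous date.
find_decreases <- function(df, measure = "Deaths") {
  df <- df[order(df$Date), ]                    # sort by date first
  df$change <- c(NA, diff(df[[measure]]))       # change since previous update
  df[!is.na(df$change) & df$change < 0, , drop = FALSE]
}

bumps <- find_decreases(toy)
print(bumps)
```

This only catches downward steps; an upward spike that is later corrected shows up as a decrease on the correction date, not on the date of the spike itself.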

Thanks @liuyanguu for reporting. These artifacts seem to have different causes: a mix of inconsistent sources and surely a few manual data entry errors (Jordan). To pick out the manual entry errors, a good indicator is a jump in the daily age fractions where one age moves in the opposite direction of the others. If there is a daily spike but no change in age-specific fractions, then it's potentially an error in the registered total. In that case, scaling to an external, consistent series of totals would cure it (we can do this internally once it's identified). Kazakhstan looks like it needs its own investigation. Let's leave this issue open while cases are addressed. cc-ing @jessicadonzowa @kikeacosta
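
The two checks described above could be sketched roughly as follows. This is an illustration in base R on invented numbers, not the project's internal procedure: the counts, the age labels, the 0.05 threshold, and the `external_total` series are all assumptions made for the example.

```r
# Hypothetical cumulative deaths: rows are update dates, columns are age groups.
counts <- matrix(
  c(10, 40,  50,
    12, 48,  60,
    20, 80, 100,   # total spikes, age fractions unchanged
    14, 90,  66),  # one age group moves against the others
  ncol = 3, byrow = TRUE,
  dimnames = list(c("2020-05-01", "2020-05-02", "2020-05-03", "2020-05-04"),
                  c("0-39", "40-79", "80+"))
)

# Age-specific fractions per date, and their change between consecutive dates.
fractions <- prop.table(counts, margin = 1)
frac_change <- diff(fractions)
rownames(frac_change) <- rownames(counts)[-1]

# Flag dates where any age fraction shifts by more than an assumed threshold.
threshold <- 0.05
max_shift <- apply(abs(frac_change), 1, max)
suspect_dates <- names(max_shift)[max_shift > threshold]
suspect_ages <- colnames(frac_change)[
  apply(abs(frac_change[suspect_dates, , drop = FALSE]), 1, which.max)
]

# Rescale each date to an external series of totals, keeping the age fractions.
external_total <- c(100, 120, 140, 160)  # hypothetical consistent totals
scaled <- counts * (external_total / rowSums(counts))
```

On this toy input only the last date is flagged, because the spike on 2020-05-03 leaves the fractions untouched; that date is the kind of case where rescaling to external totals is the suggested remedy.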