gadenbuie / covid19-florida

Florida COVID19 Data parsed from Florida DOH Dashboard and PDF reports

Home Page: https://covid19-florida.garrickadenbuie.com/

Line list extraction issues

gadenbuie opened this issue

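  • One page in the 2020-03-19 09:54 report is not separating the sex and travel_related columns: the travel_related value ends up in sex, and the next column's value shifts into travel_related.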

       timestamp                case county         age sex             travel_related
       <chr>                   <dbl> <chr>        <dbl> <chr>           <chr>         
     1 2020-03-19 09:54:00 EDT    63 Pinellas        54 Female Yes      FL            
     2 2020-03-19 09:54:00 EDT    64 Broward         43 Female No       FL            
     3 2020-03-19 09:54:00 EDT    65 Orange          46 Male    Unknown FL    
    
  • jurisdiction in the 2020-03-17 09:59 report is missing the final "FL" in "Not diagnosed/isolated in FL".

    # A tibble: 4 x 8
       timestamp                case county     age sex    travel_related jurisdiction              travel_detail
       <chr>                   <dbl> <chr>    <dbl> <chr>  <chr>          <chr>                     <chr>        
     1 2020-03-17 09:59:00 EDT   189 Leon        59 Female Yes            Not diagnosed/isolated in JAPAN        
     2 2020-03-17 09:59:00 EDT   190 Dade        64 Male   Yes            Not diagnosed/isolated in JAPAN        
     3 2020-03-17 09:59:00 EDT   191 Okaloosa    66 Female Yes            Not diagnosed/isolated in JAPAN        
     4 2020-03-17 09:59:00 EDT   192 Gadsden     54 Male   Yes            Not diagnosed/isolated in JAPAN
    
  • Extraction of the line list tables from the 2020-03-24 10:12:00 EDT report failed

Cells that span multiple lines, e.g. in covid-19-data---daily-report-2020-03-24-1012.pdf, are probably not being parsed correctly. I suspect I'm losing the text that appears on a line by itself.
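
One possible way to avoid dropping those stray lines is to re-attach them to the row they belong to before splitting columns. This is only a rough sketch in Python, not what the repo does: it assumes each table row in the extracted text starts with a case number and that any other non-empty line is the tail of a wrapped cell from the previous row. The function name `merge_wrapped_lines` and the sample lines are made up for illustration; the real PDF text may be laid out differently (for example, wrapped text may appear above its row).

```python
import re

# Assumption: a new table row starts with a case number followed by more text.
ROW_START = re.compile(r"^\d+\s+\S")


def merge_wrapped_lines(lines):
    """Attach lines that do not start a new row to the previous row."""
    rows = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if ROW_START.match(text) or not rows:
            rows.append(text)
        else:
            # Continuation of a wrapped cell: keep it instead of dropping it.
            rows[-1] = rows[-1] + " " + text
    return rows


# Hypothetical extracted text, where the final "FL" wrapped onto its own line.
sample = [
    "189 Leon 59 Female Yes Not diagnosed/isolated in",
    "FL",
    "190 Dade 64 Male Yes Not diagnosed/isolated in",
    "FL",
]
merged = merge_wrapped_lines(sample)
for row in merged:
    print(row)
```

A wrapped fragment that itself begins with a number would be mistaken for a new row, so a stricter row pattern would probably be needed in practice.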

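
For the fused sex / travel_related column in the first item of the list above, a post-processing repair could look something like the following. Again this is a hedged sketch in Python rather than the project's code: the allowed value sets are guessed from the values visible in this issue, `split_fused_sex` is a hypothetical helper, and moving the displaced "FL" into `jurisdiction` assumes that jurisdiction is the column that follows travel_related on that page.

```python
# Assumption: these are the only values the two columns can take,
# based on what appears in the tables in this issue.
SEX_VALUES = {"Female", "Male", "Unknown"}
TRAVEL_VALUES = {"Yes", "No", "Unknown"}


def split_fused_sex(row):
    """Return a repaired copy if `sex` holds two values, else the row unchanged."""
    parts = row["sex"].split()
    if len(parts) == 2 and parts[0] in SEX_VALUES and parts[1] in TRAVEL_VALUES:
        fixed = dict(row)
        fixed["sex"] = parts[0]
        fixed["travel_related"] = parts[1]
        # Assumption: the value pushed out of place belongs to the next column.
        fixed["jurisdiction"] = row["travel_related"]
        return fixed
    return row


# Rows as they appear in the 2020-03-19 09:54 output above.
rows = [
    {"case": 63, "county": "Pinellas", "age": 54, "sex": "Female Yes", "travel_related": "FL"},
    {"case": 65, "county": "Orange", "age": 46, "sex": "Male    Unknown", "travel_related": "FL"},
]
repaired = [split_fused_sex(r) for r in rows]
for r in repaired:
    print(r["case"], r["sex"], r["travel_related"], r["jurisdiction"])
```

Rows whose `sex` value is already a single word pass through untouched, so the repair could be applied to every page without special-casing the broken one.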