devinit / digital-platform

PostgreSQL/analyst → MongoDB → Development Data Hub

Home Page: http://data.devinit.org:8888/#!/ & http://data.devinit.org/#!/

Data warehouse check, WB data series

dw8547 opened this issue

@xriss, this is a separate issue for the data check of the automated WB data series in DDW. I'll use it to OK these:

1. country-year/gdp-usd-current.csv <- fact."gdp_usd_current"
2. country-year/gni-pc-usd-current.csv <- fact."gni_pc_usd_current"
3. country-year/gni-usd-current.csv <- fact."gni_usd_current"

4. country-year/income-share-bottom-20pc.csv <- fact."income_share_bottom_20pc"
5. country-year/income-share-by-quintile.csv <- fact."income_share_by_quintile"

6. country-year/life-expectancy-at-birth.csv <- fact."life_expectancy_at_birth"
7. country-year/maternal-mortality.csv <- fact."maternal_mortality"

8. country-year/population-total.csv <- fact."population_total"
9. country-year/population-rural.csv <- fact."population_rural"
10. country-year/population-urban.csv <- fact."population_urban"
11. country-year/population-rural-urban.csv <- fact."population_rural_urban"

12. country-year/population-by-age.csv <- fact."population_by_age"
13. country-year/population-0-14.csv <- fact."population_by_age_0_14"
14. country-year/population-15-64.csv <- fact."population_by_age_15_64"
15. country-year/population-65-.csv <- fact."population_by_age_65_and_above"

I ran a final sanity check on the automated WB World Development Indicators (WDI) data series as follows. I downloaded all of the relevant .csv files from https://github.com/devinit/digital-platform/tree/master/country-year and read them into tables. I then compared the content of the fact tables on the RHS of the list above with the content of the corresponding tables loaded from the .csv files on the LHS, i.e., I compared:

fact."gdp_usd_current" with gdp-usd-current.csv,
fact."gni_pc_usd_current" with gni-pc-usd-current.csv,
etc.
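
A minimal sketch of how the rows from one .csv file could be paired with the rows from the corresponding fact table before comparing values. The column names `id`, `year` and `value`, the helper names and the sample data are all assumptions made for illustration; the actual check was run on tables in the database, not with this script.

```python
import csv
import io


def load_series(text):
    """Parse CSV text into a dict keyed by (id, year) holding float values."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row["value"] == "":
            continue  # skip missing observations
        rows[(row["id"], int(row["year"]))] = float(row["value"])
    return rows


def matching_rows(dh, ddw):
    """Return {(id, year): (dh_value, ddw_value)} for keys present in both."""
    return {key: (dh[key], ddw[key]) for key in dh.keys() & ddw.keys()}


# Made-up sample data standing in for a DH .csv file and a DDW table export.
DH_CSV = "id,year,value\nAA,2000,100\nAA,2001,110\nBB,2000,50\n"
DDW_CSV = "id,year,value\nAA,2000,100\nAA,2001,121\nCC,2000,7\n"

pairs = matching_rows(load_series(DH_CSV), load_series(DDW_CSV))
for key in sorted(pairs):
    print(key, pairs[key])
```

Only the (id, year) pairs present on both sides are kept, which is one plausible reading of the matching row counts reported for each series.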

The files from https://github.com/devinit/digital-platform/tree/master/country-year were created/updated at the time/date recorded in the repository. The automated WB WDI data series/tables were updated on 2016/01/21. The source DB, WB WDI Mirror, was updated on 2016/01/20.

@xriss, it looks like the fact tables have the right data in them (assuming what's in the .csv files is right). You can go ahead and replace the data for these data series with that from the DDW, with two things to look out for:

  • The values in country-year/population-rural.csv are in units of number of people
  • The values in fact."population_rural" are in units of % of total population
  • The values in country-year/population-urban.csv are in units of number of people
  • The values in fact."population_urban" are in units of % of total population

country-year/population-rural.csv & country-year/population-urban.csv are used to put together country-year/population-rural-urban.csv.

fact."population_rural" & fact."population_urban" are used to put together fact."population_rural_urban".

The units in country-year/population-rural-urban.csv are number of people and the units in fact."population_rural_urban" are number of people.
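
For reference, a minimal sketch of the unit conversion involved, assuming a total population figure is available for the same country and year. The function names and the numbers are made up for illustration and are not taken from the DDW.

```python
def share_to_people(share_pct, total_population):
    """Convert a share given in % of total population into a head count."""
    return share_pct * total_population / 100


def people_to_share(people, total_population):
    """Convert a head count into a share in % of total population."""
    return people * 100 / total_population


# Hypothetical country-year: 1,000,000 people, 40% rural and 60% urban.
total = 1_000_000
rural_people = share_to_people(40.0, total)
urban_people = share_to_people(60.0, total)

print(rural_people, urban_people)
print(people_to_share(rural_people, total))
```

Either representation can be derived from the other as long as the matching population total is at hand.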

@xriss, do you need the two tables fact."population_rural" & fact."population_urban" in units of people, or can you fetch this data from fact."population_rural_urban"? @timstrawson is happy to have fact."population_rural" & fact."population_urban" available in the DDW in both units.

@robtew, @timstrawson, @bill-anderson, I have checked the above automated data series originating from the WB as follows. I downloaded all of the relevant .csv files from https://github.com/devinit/digital-platform/tree/master/country-year and read them into tables. I then compared the content of the fact tables on the RHS of the list above with the content of the corresponding tables loaded from the .csv files on the LHS, i.e., I compared:

fact."gdp_usd_current" with gdp-usd-current.csv,
fact."gni_pc_usd_current" with gni-pc-usd-current.csv,
etc.

The files from https://github.com/devinit/digital-platform/tree/master/country-year were created/updated at the time/date recorded in the repository. The automated WB WDI data series/tables were updated on 2016/01/21, while the source DB, WB WDI Mirror, was updated on 2016/01/20.

As the individual row differences look reasonable, I have let @xriss know that it is OK for him and @notshi to go ahead and replace the relevant data.

Here is a summary of the comparison in terms of the % error, calculated as ((dh - ddw) / dh) * 100, where dh is the value from the DH .csv file and ddw is the value from the DDW fact table. Please get in touch if you would like to examine and/or talk through any of these individually.
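
A minimal sketch of how summary figures of this kind could be derived from matched (dh, ddw) value pairs. Rounding to whole numbers and skipping rows where dh is zero are assumptions made for this sketch, and the sample pairs are made up. The actual summaries below came from queries against the database.

```python
def percent_error(dh, ddw):
    """% error of the DDW value relative to the DH value."""
    return (dh - ddw) / dh * 100


def summarise(pairs):
    """Summarise % errors over an iterable of (dh, ddw) value pairs."""
    # Assumption: rows with dh == 0 are skipped to avoid division by zero.
    errors = [percent_error(dh, ddw) for dh, ddw in pairs if dh != 0]
    if not errors:
        raise ValueError("no comparable rows")
    return {
        "%_error_sum": round(sum(errors)),
        "matching_no_of_rows": len(errors),
        "avg_%_error": round(sum(errors) / len(errors)),
        "min_%_error": round(min(errors)),
        "max_%_error": round(max(errors)),
    }


# Made-up matched values: (dh, ddw)
sample = [(100.0, 100.0), (110.0, 121.0), (50.0, 40.0)]
summary = summarise(sample)
print(summary)
```

A negative % error means the DDW value is larger than the DH value, and a positive one means it is smaller.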

1. country-year/gdp-usd-current.csv <- fact."gdp_usd_current"
 %_error_sum | matching_no_of_rows | avg_%_error | min_%_error | max_%_error 
-------------+---------------------+-------------+-------------+-------------
         915 |                3008 |           0 |         -46 |          81
(1 row)

2. country-year/gni-pc-usd-current.csv <- fact."gni_pc_usd_current"
 %_error_sum | matching_no_of_rows | avg_%_error | min_%_error | max_%_error 
-------------+---------------------+-------------+-------------+-------------
        4614 |                2968 |           2 |        -139 |          67
(1 row)

3. country-year/gni-usd-current.csv <- fact."gni_usd_current"
 %_error_sum | matching_no_of_rows | avg_%_error | min_%_error | max_%_error 
-------------+---------------------+-------------+-------------+-------------
       -2355 |                2968 |          -1 |        -170 |          43
(1 row)

4. country-year/income-share-bottom-20pc.csv <- fact."income_share_bottom_20pc"
 %_error_sum | matching_no_of_rows | avg_%_error | min_%_error | max_%_error 
-------------+---------------------+-------------+-------------+-------------
        -488 |                1696 |           0 |         -87 |          45
(1 row)

5. country-year/income-share-by-quintile.csv <- fact."income_share_by_quintile"
 %_error_sum_bottom_20pc | %_error_sum_second_20pc | %_error_sum_third_20pc | %_error_sum_fourth_20pc | %_error_sum_highest_20pc | matching_no_of_rows | avg_%_error_bottom_20pc | avg_%_error_second_20pc | avg_%_error_third_20pc | avg_%_error_fourth_20pc | avg_%_error_highest_20pc | min_%_error_bottom_20pc | min_%_error_second_20pc | min_%_error_third_20pc | min_%_error_fourth_20pc | min_%_error_highest_20pc | max_%_error_bottom_20pc | max_%_error_second_20pc | max_%_error_third_20pc | max_%_error_fourth_20pc | max_%_error_highest_20pc 
-------------------------+-------------------------+------------------------+-------------------------+--------------------------+---------------------+-------------------------+-------------------------+------------------------+-------------------------+--------------------------+-------------------------+-------------------------+------------------------+-------------------------+--------------------------+-------------------------+-------------------------+------------------------+-------------------------+--------------------------
                     252 |                     -11 |                    164 |                      58 |                     -141 |                2968 |                       0 |                       0 |                      0 |                       0 |                        0 |                     -59 |                     -39 |                    -20 |                      -9 |                      -36 |                      95 |                      61 |                     38 |                      20 |                       10
(1 row)

6. country-year/life-expectancy-at-birth.csv <- fact."life_expectancy_at_birth"
 %_error_sum | matching_no_of_rows | avg_%_error | min_%_error | max_%_error 
-------------+---------------------+-------------+-------------+-------------
        -674 |                2968 |           0 |         -37 |           8
(1 row)

7. country-year/maternal-mortality.csv <- fact."maternal_mortality"
 %_error_sum | matching_no_of_rows | avg_%_error | min_%_error | max_%_error 
-------------+---------------------+-------------+-------------+-------------
       -4269 |                2968 |          -1 |        -300 |          71
(1 row)

Remember that @akmiller01 interpolated this data series for the DH and we have not.

8. country-year/population-total.csv <- fact."population_total"
 %_error_sum | matching_no_of_rows | avg_%_error | min_%_error | max_%_error 
-------------+---------------------+-------------+-------------+-------------
       -2485 |                3008 |          -1 |         -43 |          25
(1 row)


9. country-year/population-rural.csv <- fact."population_rural"
The values in the DH file are in different units compared with the DDW. DH = number of people, DDW = %.

10. country-year/population-urban.csv <- fact."population_urban"
The values in the DH file are in different units compared with the DDW. DH = number of people, DDW = %.

11. country-year/population-rural-urban.csv <- fact."population_rural_urban"
 %_error_sum_rur | %_error_sum_urb | matching_no_of_rows | avg_%_error_rur | avg_%_error_urb | min_%_error_rur | min_%_error_urb | max_%_error_rur | max_%_error_urb 
-----------------+-----------------+---------------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------
            -612 |            -568 |                2968 |               0 |               0 |             -15 |             -15 |              21 |              21
(1 row)

12. country-year/population-by-age.csv <- fact."population_by_age"
 %_error_sum_0_14 | %_error_sum_15_64 | %_error_65_and_above | matching_no_of_rows | avg_%_error_0_14 | avg_%_error_15_64 | avg_%_error_65_and_above | min_%_error_0_14 | min_%_error_15_64 | min_%_error_65_and_above | max_%_error_0_14 | max_%_error_15_64 | max_%_error_65_and_above 
------------------+-------------------+----------------------+---------------------+------------------+-------------------+--------------------------+------------------+-------------------+--------------------------+------------------+-------------------+--------------------------
             -210 |              -589 |                -2850 |                2968 |                0 |                 0 |                       -1 |              -18 |               -18 |                     -131 |               22 |                21 |                       37
(1 row)

13. country-year/population-0-14.csv <- fact."population_by_age_0_14"
 %_error_sum | matching_no_of_rows | avg_%_error | min_%_error | max_%_error 
-------------+---------------------+-------------+-------------+-------------
         236 |                1096 |           0 |         -15 |          18
(1 row)

14. country-year/population-15-64.csv <- fact."population_by_age_15_64"
 %_error_sum | matching_no_of_rows | avg_%_error | min_%_error | max_%_error 
-------------+---------------------+-------------+-------------+-------------
         -59 |                1096 |           0 |         -10 |           5
(1 row)

15. country-year/population-65-.csv <- fact."population_by_age_65_and_above"
 %_error_sum | matching_no_of_rows | avg_%_error | min_%_error | max_%_error 
-------------+---------------------+-------------+-------------+-------------
       -1166 |                1096 |          -1 |         -54 |          26
(1 row)

Dead issue.