Turn College Scorecard data into parquet files, a SQLite database, CSV and RDS (R data storage) files.
License: GPL-3
This project uses renv to manage
dependencies. To install the dependencies, run renv::restore()
in the
R console on R 4.4.0.
renv::restore()
We use targets to manage the data prep pipeline. To download the source data and build all outputs, run the following in the R console:
targets::tar_make()
The raw College Scorecard data is available as a set of parquet files in
data/parquet
. The complete scorecard dataset is stored as a set of
parquet files in data/parquet/merged
. Two informational tables contain
the data dictionary (data/parquet/dd_info.parquet
) and the variable
labels (data/parquet/dd_labels.parquet
) for categorical variables in
the scorecard dataset.
scorecard <- targets::tar_read("path_data_full_merged") |> arrow::read_dataset()
dd_info <- targets::tar_read("path_data_full_info") |> arrow::read_parquet()
dd_labels <- targets::tar_read("path_data_full_labels") |> arrow::read_parquet()
The most approachable tables are the school_tidy
and scorecard_tidy
targets.
school_tidy <- targets::tar_read("school_tidy")
skimr::skim_tee(school_tidy)
── Data Summary ────────────────────────
Values
Name data
Number of rows 11300
Number of columns 25
_______________________
Column type frequency:
character 9
factor 3
logical 10
numeric 3
________________________
Group variables None
── Variable type: character ────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 name 0 1 3 93 0 10525 0
2 city 0 1 3 23 0 2915 0
3 state 0 1 2 2 0 59 0
4 zip 0 1 5 10 0 8757 0
5 url 4943 0.563 15 123 0 5435 0
6 deg_predominant 998 0.912 8 11 0 4 0
7 deg_highest 1193 0.894 8 11 0 4 0
8 locale_type 5415 0.521 4 6 0 4 0
9 locale_size 5415 0.521 5 7 0 6 0
── Variable type: factor ───────────────────────────────────────────────────────
skim_variable n_missing complete_rate ordered n_unique
1 control 1 1.00 FALSE 3
2 adm_req_test 8685 0.231 FALSE 4
3 religious_affiliation 10449 0.0753 FALSE 60
top_counts
1 For: 5908, Non: 2760, Pub: 2631
2 Con: 1205, Not: 1015, Req: 273, Rec: 122
3 Rom: 232, Uni: 85, Bap: 56, Pre: 54
── Variable type: logical ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean count
1 is_hbcu 5412 0.521 0.0168 FAL: 5789, TRU: 99
2 is_pbi 5412 0.521 0.0105 FAL: 5826, TRU: 62
3 is_annhi 5412 0.521 0.00272 FAL: 5872, TRU: 16
4 is_tribal 5412 0.521 0.00594 FAL: 5853, TRU: 35
5 is_aanapii 5412 0.521 0.0350 FAL: 5682, TRU: 206
6 is_hsi 5412 0.521 0.0909 FAL: 5353, TRU: 535
7 is_nanti 5412 0.521 0.00543 FAL: 5856, TRU: 32
8 is_only_men 5412 0.521 0.0102 FAL: 5828, TRU: 60
9 is_only_women 5412 0.521 0.00510 FAL: 5858, TRU: 30
10 is_only_distance 2832 0.749 0.00709 FAL: 8408, TRU: 60
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25
1 id 0 1 2550768. 8357052. 100654 182632.
2 latitude 5412 0.521 37.3 5.87 -14.3 33.9
3 longitude 5412 0.521 -90.4 18.2 -171. -97.5
p50 p75 p100 hist
1 367422 455666. 49664501 ▇▁▁▁▁
2 38.6 41.2 71.3 ▁▁▆▇▁
3 -86.3 -78.9 171. ▂▇▁▁▁
scorecard_tidy <- targets::tar_read("scorecard_tidy")
skimr::skim_tee(scorecard_tidy)
── Data Summary ────────────────────────
Values
Name data
Number of rows 183306
Number of columns 24
_______________________
Column type frequency:
character 1
logical 6
numeric 17
________________________
Group variables None
── Variable type: character ────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 academic_year 0 1 7 7 0 27 0
── Variable type: logical ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean count
1 cost_med_similar 183306 0 NaN ": "
2 cost_med_overall 183306 0 NaN ": "
3 amnt_earnings_med_10y_similar 183306 0 NaN ": "
4 amnt_earnings_med_10y_overall 183306 0 NaN ": "
5 rate_completion_med_similar 183306 0 NaN ": "
6 rate_completion_med_overall 183306 0 NaN ": "
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd
1 id 0 1 1163825. 5311085.
2 n_undergrads 20563 0.888 2285. 5101.
3 cost_avg 104675 0.429 15809. 8216.
4 cost_avg_income_0_30k 105507 0.424 13996. 7669.
5 cost_avg_income_30_48k 115448 0.370 14736. 7723.
6 cost_avg_income_48_75k 120294 0.344 16807. 7750.
7 cost_avg_income_75_110k 129900 0.291 19224. 7843.
8 cost_avg_income_110k_plus 138593 0.244 21352. 9036.
9 amt_earnings_med_10y 138347 0.245 35399. 14829.
10 rate_completion 134721 0.265 0.333 0.237
11 rate_admissions 181562 0.00951 0.722 0.221
12 score_act_p25 156788 0.145 20.2 3.69
13 score_act_p75 156794 0.145 25.4 3.55
14 score_sat_verbal_p25 156906 0.144 484. 72.3
15 score_sat_verbal_p75 156905 0.144 592. 69.3
16 score_sat_math_p25 156771 0.145 486. 75.7
17 score_sat_math_p75 156773 0.145 594. 72.3
p0 p25 p50 p75 p100 hist
1 100654 164562 213987 416670 49664501 ▇▁▁▁▁
2 0 117 490 2050 253594 ▇▁▁▁▁
3 -103168 9319 15281 21090. 112050 ▁▁▇▁▁
4 -117833 7952 13336 19036. 111962 ▁▁▇▂▁
5 -44508 8618. 14055 19740 113384 ▁▇▃▁▁
6 -17804 10711. 16289 21716 113427 ▂▇▁▁▁
7 -18045 13149. 18933 24285. 114298 ▁▇▁▁▁
8 -17487 14415 20448 26533 113314 ▁▇▁▁▁
9 8400 25700 33200 42400 250000 ▇▁▁▁▁
10 0 0.145 0.3 0.498 1 ▇▇▅▂▁
11 0.0106 0.612 0.773 0.889 1 ▁▁▂▆▇
12 1 18 20 22 35 ▁▁▇▃▁
13 2 23 25 27 36 ▁▁▂▇▂
14 100 440 480 520 799 ▁▁▇▃▁
15 100 540 590 630 800 ▁▁▂▇▂
16 100 440 475 520 799 ▁▁▇▃▁
17 100 550 588 630 800 ▁▁▂▇▂
For ease of use, the variables in school_tidy
and scorecard_tidy
have been renamed from the original names. I don’t yet have a mapping
back to the original names, but you can reference the data_dictionary
target to see information about the original names.
data_dictionary <- targets::tar_read("data_dictionary")
head(data_dictionary$info, 10)
# A tibble: 10 × 11
name_of_data_element dev_category developer_friendly_n…¹ api_data_type
<chr> <chr> <chr> <chr>
1 Unit ID for institution root id integer
2 8-digit OPE ID for institu… root ope8_id string
3 6-digit OPE ID for institu… root ope6_id string
4 Institution name school name autocomplete
5 City school city autocomplete
6 State postcode school state string
7 ZIP code school zip string
8 Accreditor for institution school accreditor string
9 URL for institution's home… school school_url string
10 URL for institution's net … school price_calculator_url string
# ℹ abbreviated name: ¹developer_friendly_name
# ℹ 7 more variables: data_type <chr>, index <chr>, variable_name <chr>,
# source <chr>, shown_use_on_site <chr>, notes <chr>, has_labels <lgl>