SoeHtutAung / RDMS

Generating an executive report automatically from a PostgreSQL database using R programming through ETL (Extract, Transform, Load) processes. This is the submitted assignment package during HDM module of MSc Health Data Science course.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

README for executive report generation

1. Description

This is about processing an executive report for the hospital management, by synthesizing patient data and survey data from PostgreSQL databsae, and API from government national statistics. With the help of this readme file, codes, markdown file, folder and prototype setup, our data analyst department will be able to routinely generate similar reports. This readme file will work through steps in the pipeline of analysis to enhance reproducibility and audibility.

In a nutshell, only 4 simple steps as listed under section 5 are needed to produce a routine report.

2. Installation instructions {#sec-2.-installation-instructions}

This analysis was done using following software.

  • RStudio 2023.09.1+494
  • R version 4.3.2

Following packages in R were used to extract, transform, load, analyze, visualize and publish the report.

  • dbplyr 2.4.0
  • DBI 1.1.3
  • RPostgreSQL 0.7-5
  • config 0.3.2
  • ruODK 1.4.0
  • tidyverse 2.0.0
  • dplyr 1.1.4
  • stringr 1.5.1
  • ggplot2 3.4.4
  • gridExtra 2.3
  • treemapify 2.5.6
  • shiny 1.8.0
  • yaml 2.3.8
  • palmerpenguins 0.1.1
  • knitr 1.45
  • readr 2.1.4
  • httr 1.4.7
  • tinytex 0.49
  • kableExtra 1.3.4

To generate report from Quarto and LaTex compilation following programs were installed

3. Folder structure

3.1 First level

At first level of folder, there are 2 folders and 1 file as seen in below figure.

R scripts folder:

  • A reference folder that R scripts for specific tasks are kept
  • Details in 3.3

Report folder:

  • Main folder that you can generate the report from
  • Details in 3.2

executive- report.pdf:

  • A prototype of two-page executive report

README_report generation guideline.html:

  • Pipeline of analysis and readme for executive report generation to guide audit and reproduction
  • This is the current file you are viewing

Note.pdf:

  • A one-page feedback note to reflect challenges, solutions and suggestions during the workflow

(You may find a folder named 'images', but it is just for storage of screenshots used in this readme file)

3.2 Second level - Report folder {#2ndlevel}

This is the main folder that you need to use during report generation.

executive report.qmd:

  • This is Quarto Markdown file, containing a mix of codes, text, and output, to render the report
  • This file is the main file to set up R Studio, connect to servers, extract, load, transform, manipulate, analyze and generate the report

config.yml:

  • This YAML file is part of the credential management system
  • This file must be kept within the same folder as 'executive report.qmd' file at all time
  • Being a public folder, this file is excluded in this repository (You may find a folder named 'media', but it is just for storage of pictures used in report)

3.3 Second level - R scripts folder {#sec-3.3-second-level---r-scripts-folder}

There is another second level folder 'R scripts', which is for reference purpose only. The syntax for each task of the pipeline were kept as seperate R files as in below screenshot.

setup.R

  • Install and load necessary R packages.
  • Credential management using keyring package.
  • Connect to PostgreSQL server (MIMIC III database) and ODK central server
  • Extract, transfer and load minimally required information
  • Create data frame from MIMIC III database for specific analysis task
  • Create and clean data frame from ODK central to integrate vaccine information
  • Save the data frames into environment

i.R to v.R

  • Using 'patient_demo_table' data frame created using 'setup.R', manipulate and prepare the data for anlaysis
  • Create a data fame to summarise the information for specific task, using both the ODK and MIMIC III database information
  • Save the data frames into environment

scratch.R

  • This is just a place to test run syntax and check data consistency (You may find a R file named 'README_report generation pipeline.md', but it is just a markdown file during development of readme file.)

4. Pipeline of analysis {#sec-4.-pipeline-of-analysis}

This session is to explain about pipeline of analysis. The syntax for whole pipeline is included within Quarto Markdown file 'executive report.qmd'. However, as a reference, 'R scripts' folder is also developed to see R codes for each step and task. (Detail syntax for can be observed in the Quarto Markdown file as in figure 3 or and R script files as in the figure 2.)

4.1 Setting up RStudio

  • Required packages as listed in section 2 are installed

4.2 Credentials management

  • A YAML file called 'config.yml' is created to store required credentials, and it is put in the same directory as Quarto Markdown file.

  • Using 'config' package, the stored credentials from YAML file are retrieved in Quarto Markdown file using 'get()' function of config package.

  • Note: If users are using the same pc or the a same network, 'Keyring' would be a preferable package. Because we can use password to secure the keyring, so that the credentials can't be exposed without the password for keyring.

4.3 Establishing connection to servers

  • 'DBI', 'dbplyr' and 'RPostgre' packages are used to connect PostgreSQL server from RStudio and connection is established as 'dbcon'

  • 'ruODK' package is used to connect ODK central server from RStudio to extract dataset

  • In above connections, 'config' package is used to manage credentials

  • 'httr' and 'readr' packages are used to connect API of COVID-19 dataset as available on the website to extract dataset

4.4 Extracting, transferring and loading {#sec-4.4-extracting-transfering-and-loading}

In this step, data from various sources are extracted, cleaned and loaded into RStudio as below.

Note: If you try to use individual R files from 'RScripts folder', the extracted information are kept as objects (e.g., data frames, lists, plots) inside RStudio environment. But if you use Quarto Markdown file form 'Report folder', they will be kept temporarily and you will not see them inside global environment of RStudio.

(a) MIMIC III database form PostgreSQL server

  • After connecting PostgreSQL server using 'dbConnect' function, minimally required information for each task is extracted from MIMIC III database

  • SQL syntax is used inside RStudio to extract, transfer and load the data into R environment

  • Patients' demographic information is retrieved from 'admissions' and 'patients' tables. Admission information, including ICU stays, ICD9 codes are extracted from joining between 'diagnoses_icd', 'icd9_code' tables

  • 'subject_id', 'hadm_id' and 'icustay_id' are often served as keys for data joins

  • Calculation were made from timestamp variables (e.g., admission and discharge time) to get 'age' or 'duration' information

  • The extracted data is kept as data frames inside RStudio environment

(b) Vaccination data from ODK Central server

  • Functions inside 'ruODK' package is used to get survey data collected using ODK

  • The extracted data is kept as a data frame inside RStudio environment

(c) Community COVID-19 data from website

(d) Close connection (and secure the credentials, if 'Keyring' package is used)

  • After extracting data from server, 'dbDisconnect' function is used to disconnect from PostgreSQL server

  • Note: If we are using Keyring package, keyring is supposed to be locked after extracting required information from server, if the Markdown file is to be shared with outsiders.

4.5 Data manipulation and preparation {#sec-4.5-data-manipulation-and-preparation}

  • Data cleaning for data object 'ODK' was done inside 'setup.R' file to remove '_' in front of subject ID. Because 'subject_ID' will serve as key to join between data frame in later step

  • Other data cleaning and data manipulation are done inside respective R Scripts (r chunks of Quarto Markdown file). Because grouping, renaming and sorting are to be done specific to analysis task

  • There were some cases observed with discharge date earlier than admission date, with unrealistic ages and one without out time for ICU. Those cases were handled by case by case approach within each analysis task

  • The syntax for above steps are already embedded inside Quarto Markdown file, and you can run only one file, i.e., 'executive report.qmd', for report generation. However, you can also see those steps inside individual R files inside 'R Script folder'.

4.6 Data analysis and visualization

  • The analysis was done inside Quarto Markdown file 'executive report.qmd' as shown in 2nd level folder

4.7 Publishing report

  • A Quarto Markdown file was created using RStudio

  • First of all, YAML code is modified to setup the page and parameters

  • Code chunks in R are used to analyse and visualize the data frames.

  • At the start of code chunks, parameters are set to customize the layout, figures and tables

  • 'kableExtra' package is used to organize the table, where list of 36 patients is developed as an annex

  • Click on 'Render' inside the menu bar in RStudio is used to generate PDF report

    Click on 'Render'{#fig-4 width="217"}

5. Creating a routine report {#sec-5.-creating-a-routine-report}

To preview the prototype report, 'One-click' report generation can be achieved with 4 simple steps:

  • Step 1: Make sure that 'executive report.qmd' and 'config.yml' files are inside the same folder

  • Step 2: Make sure that necessary software and packages as listed under section 2 are installed in the computer

  • Step 3: Open 'executive report.qmd' (Quarto Markdown file) with RStudio

  • Step 4: Click on 'Render'

This is the copy of my submitted assignment for Health Data Management module duirng MSc Health Data Science Course

About

Generating an executive report automatically from a PostgreSQL database using R programming through ETL (Extract, Transform, Load) processes. This is the submitted assignment package during HDM module of MSc Health Data Science course.

License:Creative Commons Zero v1.0 Universal


Languages

Language:R 100.0%