sanger / unified_warehouse

DPL-437-2 Billing report: Create stored procedure

harrietc52 opened this issue

User story
As a developer, I would like to turn the billing query into a stored procedure on the MLWH database, so that Tableau can call it to show the billing data.

Who are the primary contacts for this story
@harrietc52

Who is the nominated tester for UAT
e.g. John S (don't include surnames in public repos)

Acceptance criteria
To be considered successful the solution must allow:

  • The stored procedure is named billing_report_stored_proc
  • It accepts two parameters:
      • 1st param: ‘from’, e.g. '2022-07-26 00:00:00'
      • 2nd param: ‘to’, e.g. '2022-08-24 23:59:59'
  • It can be called with e.g. CALL billing_report_stored_proc('2022-07-26 00:00:00', '2022-08-24 23:59:59')
  • Keep Matt Francis informed; Matt F is creating a Tableau view which calls this stored proc
  • Ensure this stored procedure is accessible to Matt from Tableau (we are not tracking the Tableau report creation)
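
For the accessibility criterion, a minimal sketch of how execute permission could be granted in MySQL. The schema name `mlwh` and the account `'tableau_ro'@'%'` are hypothetical placeholders, not names taken from this issue; the real schema and the account Tableau connects with would need to be substituted.

```sql
-- Assumption: the procedure lives in a schema called `mlwh` and Tableau
-- connects as an existing account 'tableau_ro'@'%' (both hypothetical).
GRANT EXECUTE ON PROCEDURE mlwh.billing_report_stored_proc TO 'tableau_ro'@'%';

-- Check what the account is now allowed to do.
SHOW GRANTS FOR 'tableau_ro'@'%';
```

Granting EXECUTE on the single procedure, rather than on the whole schema, keeps the Tableau account limited to this one report.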

Dependencies
This story is blocked by the following dependencies:

References
This story has a non-blocking relationship with:

  • This is a spin-out story from #386

Additional context
This has already been created for testing in Training (MLWH prod_data), but is waiting for final improvements to the query. So this story's scope is to update the existing stored proc once the query is finished.

/***************************************************************************************************
Create Date:        2022-11-11
Author:             Harriet Craven
Description:        Stored Procedure for Automating the Billing report
Used By:            Richard Rance, via Tableau
Parameter(s):       from_date (DATETIME)
                    to_date (DATETIME)
Usage:              CALL billing_report_stored_proc('2022-06-25 00:00:00', '2022-07-25 23:59:59');
Additional notes:   Date parameters are used to get runs for given timeframe (usually a financial month)
****************************************************************************************************/

-- Change delimiter to //
delimiter //

-- Create Stored Procedure
CREATE PROCEDURE billing_report_stored_proc (IN from_date DATETIME, IN to_date DATETIME)

BEGIN
  -- Outer query
  -- This grouping calculates the `total` amount of lanes occupied by samples in a given "group" (see end of query)
  SELECT
    iseq_run_lane_metrics.instrument_model     AS platform
    , iseq_flowcell.cost_code                  AS project_cost_code
    , study.name                               AS study_name
    , IF(iseq_run_lane_metrics.qc_seq = 1, 'passed', IF(iseq_run_lane_metrics.qc_seq = 0, 'failed', iseq_run_lane_metrics.qc_seq))
                                               AS qc_outcome
    , IF(iseq_run.rp__sbs_consumable_version = '1', 'v1', IF(iseq_run.rp__sbs_consumable_version = '3', 'v1.5', iseq_run.rp__sbs_consumable_version))
                                               AS 'v1/1.5'
    , IF(iseq_run.rp__workflow_type = 'NovaSeqXp', 'XP', IF(iseq_run.rp__workflow_type = 'NovaSeqStandard', 'No XP', iseq_run.rp__workflow_type) )
                                               AS xp
    , iseq_run.rp__flow_cell_mode              AS sp
    , iseq_run.rp__read1_number_of_cycles      AS read1
    , iseq_run.rp__read2_number_of_cycles      AS read2
    , SUM(lanes.proportion_of_lane_per_sample) AS total
  FROM
    iseq_run
    INNER JOIN
      (
        -- Inner query 1
        -- There can be multiple QC complete run events,
        -- this query finds all "QC complete" runs within a given timeframe.
        -- Group by run ID.
        -- If there are more than 1 "QC complete" events for a given run ID,
        -- select only the first completed run (based on min `date`)
        SELECT
          id_run
          , MIN(date) AS qc_complete_date
        FROM
          iseq_run_status
          INNER JOIN
            iseq_run_status_dict
            ON iseq_run_status_dict.id_run_status_dict = iseq_run_status.id_run_status_dict
        WHERE
          iseq_run_status_dict.description = 'qc complete'
          AND iseq_run_status.date >= from_date
          AND iseq_run_status.date <= to_date
        GROUP BY
          iseq_run_status.id_run
      )
      AS qc_complete
      ON qc_complete.id_run = iseq_run.id_run
    INNER JOIN
      iseq_product_metrics
      ON iseq_run.id_run = iseq_product_metrics.id_run
    INNER JOIN
      iseq_flowcell
      ON iseq_product_metrics.id_iseq_flowcell_tmp = iseq_flowcell.id_iseq_flowcell_tmp
    INNER JOIN
      study
      ON iseq_flowcell.id_study_tmp = study.id_study_tmp
    INNER JOIN
      iseq_run_lane_metrics
      ON iseq_product_metrics.id_run = iseq_run_lane_metrics.id_run
      AND iseq_product_metrics.position = iseq_run_lane_metrics.position
    INNER JOIN
      (
        -- Inner query 2
        -- Group samples by lane ID
        -- Count the number of samples (excluding controls) in a lane
        -- Assuming equal distribution, calculate the proportion of lane occupied per sample (1/ number of samples)
        -- Append this information to the sample, joining on lane ID
        SELECT
          samples.*
          , format(1 / COUNT(*), 10) AS proportion_of_lane_per_sample
        FROM
          (
            -- Inner query 3
            -- Get the samples for the specific runs
            -- Excluding controls
            SELECT
              iseq_flowcell.entity_id_lims AS lane_id
              , iseq_flowcell.cost_code AS project_cost_code
              , study.name
            FROM
              iseq_run
              INNER JOIN
                (
                  -- Inner query 4
                  -- (Duplication of Inner query 1)
                  SELECT
                    id_run
                    , MIN(date) AS qc_complete_date
                  FROM
                    iseq_run_status
                    INNER JOIN
                      iseq_run_status_dict
                      ON iseq_run_status_dict.id_run_status_dict = iseq_run_status.id_run_status_dict
                  WHERE
                    iseq_run_status_dict.description = 'qc complete'
                    AND date >= from_date
                    AND date <= to_date
                  GROUP BY
                    id_run
                )
                AS qc_complete
                ON qc_complete.id_run = iseq_run.id_run
              INNER JOIN
                iseq_product_metrics
                ON iseq_run.id_run = iseq_product_metrics.id_run
              INNER JOIN
                iseq_run_lane_metrics
                ON iseq_product_metrics.id_run = iseq_run_lane_metrics.id_run
                AND iseq_product_metrics.position = iseq_run_lane_metrics.position
              INNER JOIN
                iseq_flowcell
                ON iseq_product_metrics.id_iseq_flowcell_tmp = iseq_flowcell.id_iseq_flowcell_tmp
              INNER JOIN
                study
                ON iseq_flowcell.id_study_tmp = study.id_study_tmp
            WHERE
              study.name NOT IN ('Heron PhiX', 'Illumina Controls')
          )
          AS samples
        GROUP BY
          samples.lane_id
      )
      AS lanes
      ON lanes.lane_id = iseq_flowcell.entity_id_lims
  WHERE
    study.name NOT IN ('Heron PhiX', 'Illumina Controls') -- Alternative: WHERE iseq_flowcell.cost_code IS NOT NULL
  GROUP BY
    study.id_study_lims
    , project_cost_code
    , platform
    , qc_outcome
    , iseq_run.rp__workflow_type
    , iseq_run.rp__flow_cell_mode
  ;
END
//

-- Change delimiter back to ;
delimiter ;


-- Call Stored Procedure

-- October (118 without controls)
-- from = '2022-10-01 00:00:00'
-- to = '2022-10-24 23:59:59'
CALL billing_report_stored_proc('2022-10-01 00:00:00', '2022-10-24 23:59:59');

-- September (195 without controls)
-- from = '2022-08-25 00:00:00'
-- to = '2022-09-30 23:59:59'
CALL billing_report_stored_proc('2022-08-25 00:00:00', '2022-09-30 23:59:59');

-- August (155 without controls)
-- from = '2022-07-26 00:00:00'
-- to = '2022-08-24 23:59:59'
CALL billing_report_stored_proc('2022-07-26 00:00:00', '2022-08-24 23:59:59');

-- July (146 without controls)
-- from = '2022-06-25 00:00:00'
-- to = '2022-07-25 23:59:59'
CALL billing_report_stored_proc('2022-06-25 00:00:00', '2022-07-25 23:59:59');


-- Drop Stored Procedure (IF EXISTS avoids an error when it has not been created)
DROP PROCEDURE IF EXISTS billing_report_stored_proc;
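
To make the lane-proportion logic of inner queries 2 and 3 easier to follow, here is a toy sketch on made-up data. It is not part of the stored procedure: the `toy_samples` rows, lane IDs and study names are all hypothetical, and the `WITH` syntax assumes MySQL 8.0 or later. It also leaves out the `format(..., 10)` call and the extra grouping columns used in the real query.

```sql
-- Hypothetical data: lane_1 holds 4 samples, lane_2 holds 2 samples.
WITH toy_samples AS (
  SELECT 'lane_1' AS lane_id, 'Study A' AS study_name
  UNION ALL SELECT 'lane_1', 'Study A'
  UNION ALL SELECT 'lane_1', 'Study A'
  UNION ALL SELECT 'lane_1', 'Study B'
  UNION ALL SELECT 'lane_2', 'Study A'
  UNION ALL SELECT 'lane_2', 'Study B'
),
-- Assuming equal distribution, each sample occupies 1 / (samples in lane).
lanes AS (
  SELECT
    lane_id
    , 1 / COUNT(*) AS proportion_of_lane_per_sample
  FROM
    toy_samples
  GROUP BY
    lane_id
)
-- Sum the per-sample proportions for each study, as the outer query does.
SELECT
  toy_samples.study_name
  , SUM(lanes.proportion_of_lane_per_sample) AS total
FROM
  toy_samples
  INNER JOIN
    lanes
    ON lanes.lane_id = toy_samples.lane_id
GROUP BY
  toy_samples.study_name
;
```

With this data, each lane_1 sample counts as a quarter of a lane and each lane_2 sample as half a lane, so Study A totals 1.25 lanes and Study B 0.75 lanes. The totals add up to the 2 lanes that were run.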

OK, looks good. I was thinking about how to do it with a view instead, but I don't see how a view could accept the input dates and apply them to the subqueries, which is something the stored procedure solves without any problems.