r-b-g-b / clean-water-tool

Reporting Tool to Support Safe Drinking Water in California’s Disadvantaged Communities

Specify logic for de-duplicating the data

ruckeralex opened this issue

One of the first discrete actions we need to perform for each reporting period is removing duplicates within the data set. Initially, let's focus on the current status only, and not the historic pattern of violations and compliance. In summary, each dataset contains records of multiple violations; each violation is a record of exceeding the maximum contaminant level for a specific analyte for a specific system for a specific start and end date. In reporting the CURRENT STATUS of the violations by water system, we simply want to report which water systems have analytes that are actively in violation. Therefore, we want to show violation records where, for the most recent VIOL_END_DATE, the associated ENF_ACTION_TYPE_ISSUED is NOT "RETURN TO COMPLIANCE."

For more information, see HR2W Data Field Descriptions and Policy Interpretations in https://docs.google.com/document/d/15gRM0AsVV7dSKgfN4blwbRMbf97jMjeWK67ZLIfaXCw/edit?usp=sharing or contact ruckeralex@gmail.com.

Thanks @ruckeralex! I took a stab at this issue just now. You can see the work in https://github.com/r-b-g-b/clean-water-tool/blob/rg/deduplicate-%237/notebooks/002-deduplicate-by-water-system-and-analyte.ipynb

Just to make sure I understood correctly, here's what I did:

  1. Load the hr2w exceedance table
  2. For each combination of water system/analyte, take the most recent ENF_ACTION_ISSUE_DATE.
  3. Remove any water system/analyte combinations for which the most recent record was RETURN TO COMPLIANCE.
  4. Group the remaining violations by water system.
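As a rough illustration of this first pass, here is a minimal sketch in plain Python. It is not the notebook's code: the rows are made up, and `WATER_SYSTEM` and `ANALYTE` are assumed column names (only `ENF_ACTION_ISSUE_DATE` and `ENF_ACTION_TYPE_ISSUED` appear in the thread).

```python
from datetime import date

RTC = "RETURN TO COMPLIANCE"
FORMAL = "FORMAL ENFORCEMENT ACTION ISSUED"

# Made-up rows for illustration only. WATER_SYSTEM and ANALYTE are assumed
# column names; the two ENF_* names are taken from the thread.
rows = [
    {"WATER_SYSTEM": "System A", "ANALYTE": "ARSENIC",
     "ENF_ACTION_ISSUE_DATE": date(2017, 1, 15), "ENF_ACTION_TYPE_ISSUED": FORMAL},
    {"WATER_SYSTEM": "System A", "ANALYTE": "ARSENIC",
     "ENF_ACTION_ISSUE_DATE": date(2018, 6, 1), "ENF_ACTION_TYPE_ISSUED": RTC},
    {"WATER_SYSTEM": "System A", "ANALYTE": "NITRATE",
     "ENF_ACTION_ISSUE_DATE": date(2018, 9, 30), "ENF_ACTION_TYPE_ISSUED": FORMAL},
    {"WATER_SYSTEM": "System B", "ANALYTE": "ARSENIC",
     "ENF_ACTION_ISSUE_DATE": date(2018, 3, 1), "ENF_ACTION_TYPE_ISSUED": RTC},
]


def latest_action_per_pair(rows):
    """Step 2: keep the row with the newest issue date per (system, analyte)."""
    latest = {}
    for row in rows:
        key = (row["WATER_SYSTEM"], row["ANALYTE"])
        current = latest.get(key)
        if current is None or row["ENF_ACTION_ISSUE_DATE"] > current["ENF_ACTION_ISSUE_DATE"]:
            latest[key] = row
    return latest


def out_of_compliance_by_system(rows):
    """Steps 3 and 4: drop pairs whose latest action is RTC, group by system."""
    by_system = {}
    for (system, analyte), row in latest_action_per_pair(rows).items():
        if row["ENF_ACTION_TYPE_ISSUED"] != RTC:
            by_system.setdefault(system, []).append(analyte)
    return by_system


result = out_of_compliance_by_system(rows)
print(result)
```

With these sample rows, only System A remains, and only for nitrate.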

The result is 324 water systems with at least one analyte out of compliance.

One question -- there are other types of enforcement actions in the table:

Enforcement action name                          Number of records
FORMAL ENFORCEMENT ACTION ISSUED                 5009
STATE INTENTIONAL NO ACTION TAKEN                356
RETURN TO COMPLIANCE                             329
OTHER INFORMAL ENFORCEMENT ACTION TAKEN          311
INFORMAL ENFORCEMENT ACTION ISSUED               204
CA STATE ACTION ISSUED                           169
FORMAL ENFORCEMENT ACTION WITH PENALTY ISSUED    62
US EPA FEDERAL ADMINSTRATIVE ORDER ON CONSENT    27

How should I treat these? A few options:

  1. Remove them.
  2. Treat them the same as FORMAL ENFORCEMENT ACTION ISSUED.
  3. Treat them the same as RETURN TO COMPLIANCE

Awesome, thanks for jumping in, @r-b-g-b (and @skyballin)! For the Feb 2019 "active" data set, there should be 330 systems out of compliance. I talked through the data logic with the SWRCB, and they advised that VIOL_END_DATE is the key variable. Can you try following this process to report on the systems that are currently out of compliance:

  1. Load the current hr2w exceedance table -- filename should contain "active", not "RTC"
  2. For each combination of water system/analyte, look at the most recent VIOL_END_DATE. There will often, but not always, be 2 records associated with the most recent VIOL_END_DATE.
  3. For the record(s) associated with the most recent VIOL_END_DATE, check the associated ENF_ACTION_TYPE_ISSUED. If the most recent VIOL_END_DATE has an ENF_ACTION_TYPE_ISSUED = RETURN TO COMPLIANCE, then the water system/analyte combination is in compliance and should not be featured in the current version of the report.
  4. Remove any water system/analyte combinations for which the most recent record contained RETURN TO COMPLIANCE.
  5. Group the remaining violations by water system, though we will also want to query these violations by other characteristics, such as County, Analyte, Regulating Agency.
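The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the rows are invented, `WATER_SYSTEM` and `ANALYTE` are assumed column names, and loading the "active" file is left out.

```python
from collections import defaultdict
from datetime import date

RTC = "RETURN TO COMPLIANCE"
FORMAL = "FORMAL ENFORCEMENT ACTION ISSUED"

# Invented rows: (WATER_SYSTEM, ANALYTE, VIOL_END_DATE, ENF_ACTION_TYPE_ISSUED).
# The first two field names are assumptions, not names confirmed in the thread.
rows = [
    ("System A", "ARSENIC", date(2016, 12, 31), FORMAL),
    ("System A", "ARSENIC", date(2017, 3, 31), FORMAL),
    ("System A", "ARSENIC", date(2017, 3, 31), RTC),   # second record, same end date
    ("System A", "NITRATE", date(2018, 12, 31), FORMAL),
    ("System B", "NITRATE", date(2018, 6, 30), RTC),
]


def active_pairs(rows):
    """Return the (system, analyte) pairs that are still out of compliance."""
    by_pair = defaultdict(list)
    for system, analyte, end_date, action in rows:
        by_pair[(system, analyte)].append((end_date, action))

    active = []
    for pair, records in by_pair.items():
        latest_end = max(end_date for end_date, _ in records)
        # There may be one or two records on the most recent VIOL_END_DATE.
        latest_actions = {action for end_date, action in records if end_date == latest_end}
        if RTC not in latest_actions:
            active.append(pair)
    return sorted(active)


def group_by_system(pairs):
    """Step 5: list the out-of-compliance analytes for each water system."""
    grouped = defaultdict(list)
    for system, analyte in pairs:
        grouped[system].append(analyte)
    return dict(grouped)


pairs = active_pairs(rows)
report = group_by_system(pairs)
print(report)
```

Keeping the pair-level list (`pairs`) around, rather than only the grouped result, would make it easier to regroup later by county, analyte, or regulating agency.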

Excel tripped up over this when a system potentially had more than 1 analyte out of compliance in the reporting period.

Here's one check: for the Feb 2019 data set, the system Triple R Mutual and analyte Arsenic should be IN compliance, because the most recent VIOL_END_DATE of 3/31/2017 has the ENF_ACTION_TYPE_ISSUED = RETURN TO COMPLIANCE. However, the Triple R Mutual system for Nitrate is OUT of compliance since the most recent VIOL_END_DATE of 12/31/2018 has only one record, and the ENF_ACTION_TYPE_ISSUED with it is not RETURN TO COMPLIANCE. Note that the SWRCB confirmed that sometimes the enforcement action date associated with the most recent Violation Date may have an older "RETURN TO COMPLIANCE". If you find something like that, flag it, but their guidance is that if you have 2 most recent VIOL_END_DATE records and one has RETURN TO COMPLIANCE associated with it, then the system should be back in compliance.
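A small sketch of how this check and the "flag it" case might be expressed. The two VIOL_END_DATE values and the compliance outcomes come from the description above; every ENF_ACTION_ISSUE_DATE, the non-RTC action types, and the second Arsenic record are invented for illustration. The flag reflects one possible reading of the SWRCB note: a RETURN TO COMPLIANCE on the latest end date whose enforcement date is older than another action on that same end date.

```python
from datetime import date

RTC = "RETURN TO COMPLIANCE"
FORMAL = "FORMAL ENFORCEMENT ACTION ISSUED"

# VIOL_END_DATE values follow the thread; issue dates and FORMAL rows are invented.
records = {
    ("TRIPLE R MUTUAL", "ARSENIC"): [
        {"VIOL_END_DATE": date(2017, 3, 31), "ENF_ACTION_TYPE_ISSUED": FORMAL,
         "ENF_ACTION_ISSUE_DATE": date(2017, 2, 1)},
        {"VIOL_END_DATE": date(2017, 3, 31), "ENF_ACTION_TYPE_ISSUED": RTC,
         "ENF_ACTION_ISSUE_DATE": date(2017, 4, 15)},
    ],
    ("TRIPLE R MUTUAL", "NITRATE"): [
        {"VIOL_END_DATE": date(2018, 12, 31), "ENF_ACTION_TYPE_ISSUED": FORMAL,
         "ENF_ACTION_ISSUE_DATE": date(2018, 11, 1)},
    ],
}


def evaluate(pair_records):
    """Return (in_compliance, needs_review) for one system/analyte pair."""
    latest_end = max(r["VIOL_END_DATE"] for r in pair_records)
    latest = [r for r in pair_records if r["VIOL_END_DATE"] == latest_end]
    rtc_dates = [r["ENF_ACTION_ISSUE_DATE"] for r in latest
                 if r["ENF_ACTION_TYPE_ISSUED"] == RTC]
    other_dates = [r["ENF_ACTION_ISSUE_DATE"] for r in latest
                   if r["ENF_ACTION_TYPE_ISSUED"] != RTC]
    # Any RTC on the most recent end date means the pair is back in compliance.
    in_compliance = bool(rtc_dates)
    # Flag for manual review when the RTC predates another action on that date.
    needs_review = bool(rtc_dates and other_dates and max(rtc_dates) < max(other_dates))
    return in_compliance, needs_review


status = {pair: evaluate(recs) for pair, recs in records.items()}
for pair, (in_compliance, needs_review) in status.items():
    print(pair, "IN" if in_compliance else "OUT", "(flag)" if needs_review else "")
```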

As to your other question, our main interest is identifying the systems that are currently out of compliance and that have a history of being out of compliance. Therefore, ENF_ACTION_TYPE_ISSUED can be treated as binary: given the context provided above, it's either RETURN TO COMPLIANCE or it's any other enforcement action (we don't foresee needing to know which enforcement action is keeping a system out of compliance).
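Reading the binary as RETURN TO COMPLIANCE versus everything else, the record counts posted earlier in the thread collapse as in the sketch below. The `compliance_flag` helper and the "OTHER ENFORCEMENT ACTION" label are hypothetical names used only for this example.

```python
RTC = "RETURN TO COMPLIANCE"
OTHER = "OTHER ENFORCEMENT ACTION"  # hypothetical label for the non-RTC bucket

# Record counts as posted earlier in the thread (names copied verbatim).
counts = {
    "FORMAL ENFORCEMENT ACTION ISSUED": 5009,
    "STATE INTENTIONAL NO ACTION TAKEN": 356,
    "RETURN TO COMPLIANCE": 329,
    "OTHER INFORMAL ENFORCEMENT ACTION TAKEN": 311,
    "INFORMAL ENFORCEMENT ACTION ISSUED": 204,
    "CA STATE ACTION ISSUED": 169,
    "FORMAL ENFORCEMENT ACTION WITH PENALTY ISSUED": 62,
    "US EPA FEDERAL ADMINSTRATIVE ORDER ON CONSENT": 27,
}


def compliance_flag(action_type):
    """Map any ENF_ACTION_TYPE_ISSUED value onto the two-way split."""
    return RTC if action_type.strip().upper() == RTC else OTHER


binary_counts = {}
for action_type, n in counts.items():
    flag = compliance_flag(action_type)
    binary_counts[flag] = binary_counts.get(flag, 0) + n

print(binary_counts)
```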

Note that after we nail down the reporting for the current status of systems we will want to factor in the historical compliance data. That will require pulling from this data set and from the "RTC" (return to compliance) data set, too.

Can you confirm whether these updated instructions produce a report with 330 out-of-compliance systems?

From the HRTW Data Analysis Tool Scope document:

"remove violation record duplications"

This is very broad...should we reduce the dataset down to just 330 datapoints (one for each system)? Or should we be keeping historical information on the water system?

It sounds like from the words used here, the dedupe method you are using will reduce the dataset to 1 violation record per water system and remove the system if the last enforcement action is return to compliance -- is that the intent?

Also, when a 'RETURN TO COMPLIANCE' record is removed, should we replace it with the most recent record before it, if one exists for the system?

@ruckeralex Ajay and I had a chance to check in this evening. One point we want to clear up: are you saying the result of deduplication should result in 330 water systems? We noticed that there are a total of 330 unique water systems in the table. Does this mean that after our deduplication process, we expect all of the water systems to have at least one analyte that is out of compliance?

Or were you simply saying that there should be 330 unique water systems in the table and the deduplication could end up with fewer than that?

I also wanted to get a bit more clarification on the ENF_ACTION_TYPE_ISSUED question, but I don't want to clutter this issue too much, so I'm opening #11 . Thanks!

If the point of this is to only show current, active violations, then would we only want to keep violations that currently do not have a 'VIOL_END_DATE'? If that is the case, then 2) below would be the valid approach. Or, can a water system be in current violation even if there is a VIOL_END_DATE?

If the point of this is to show the most recent violation, then my belief is that method 4) below should be the approach.

To summarize:

  1. 330 water systems total in the dataset

  2. 8 water systems with a violation that does not have an end date in the dataset
    EXAMPLE: violation number 1200050

  3. 309 water systems when sorting by water system and ENF_ACTION_ISSUE_DATE and removing the system if the last action was RETURN TO COMPLIANCE

  4. 330 water systems when sorting by water system and VIOL_END_DATE and removing the system if the last action was RETURN TO COMPLIANCE (0 cases with this sorting method)
    This method has the 8 systems where the violation does not have an end date in the dataset as the 'most current' violation on record for the system.

Hi! Ultimately the reporting tool should show both the active violations for this reporting period as well as previous (historic) violations.

This is the first time I'm seeing violation records in these data sets with empty VIOL_END_DATE dates. The April set, which perhaps has not yet been fully vetted, has 143 empty records. The Feb set has no empty VIOL_END_DATE records. I've asked the SWRCB staff to comment and will let you know. Thanks!

Also, your method 4 sounds right to me. Just remember that if a water system has ANY analyte out of compliance for the reporting period, the data set pulls in ALL historic data related to that and other analytes for that system. Therefore it's best to look at the water system-analyte combinations when determining compliance. Does that make sense?
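To see why the grouping key matters, here is a small sketch with two invented rows for one system. `Row` and its field names `system` and `analyte` are assumptions for illustration; the logic applies the same "latest VIOL_END_DATE, no RETURN TO COMPLIANCE" rule at two different levels.

```python
from collections import namedtuple
from datetime import date

RTC = "RETURN TO COMPLIANCE"
FORMAL = "FORMAL ENFORCEMENT ACTION ISSUED"

# Hypothetical record shape; only viol_end_date/action mirror fields in the thread.
Row = namedtuple("Row", "system analyte viol_end_date action")

rows = [
    Row("System X", "ARSENIC", date(2018, 12, 31), RTC),     # arsenic resolved
    Row("System X", "NITRATE", date(2018, 9, 30), FORMAL),   # nitrate still open
]


def active_keys(rows, key_fn):
    """Apply the latest-end-date rule to groups defined by key_fn."""
    groups = {}
    for row in rows:
        groups.setdefault(key_fn(row), []).append(row)
    active = set()
    for key, members in groups.items():
        latest_end = max(r.viol_end_date for r in members)
        actions = {r.action for r in members if r.viol_end_date == latest_end}
        if RTC not in actions:
            active.add(key)
    return active


# Grouping by system alone hides the open nitrate violation.
by_system = active_keys(rows, lambda r: r.system)
# Grouping by (system, analyte) keeps it.
by_pair = active_keys(rows, lambda r: (r.system, r.analyte))
systems_from_pairs = {system for system, _ in by_pair}

print(by_system, by_pair, systems_from_pairs)
```

In this example the system-level grouping would wrongly drop System X, because its newest record overall is the arsenic RETURN TO COMPLIANCE.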

Yes! I think that makes sense. Let me reword and make sure that it is clear:

If a water system is out of compliance for a reporting period, the data set returns the full violation history for that water system.

We want this view to show only each water system's out-of-compliance record for the current period, whether the violation is closed or not.

If the system was found to be in compliance after the violation report ("RETURN TO COMPLIANCE"), then we do not want to show that system's information.

In the event that there is more than one record for a water system that is in the period, we want to only show the most recent one.

Hi Ajay, that is correct for the report on current reporting period only. Keep in mind that a water system is out of compliance for at least 1 analyte (such as arsenic or nitrate). We need to be able to report which specific analyte(s) is out of compliance for each water system for the current reporting period. Then, we ALSO want to be able to show the history of violations. See "Desired Report Template" section (pages 2 to 4): https://docs.google.com/document/d/15gRM0AsVV7dSKgfN4blwbRMbf97jMjeWK67ZLIfaXCw/edit#

I know this is confusing, so please let me know if you have other questions. Thank you!

Also, I received a reply from the SWRCB to your question about the empty VIOL_END_DATE fields.

In essence, I think we should interpret this to mean that if a system has an empty VIOL_END_DATE field we should treat it as the most recent date record for the purpose of evaluating whether a system is currently in compliance.

SWRCB RESPONSE: "the violation end date for Surface Water Treatment Rule (SWTR) is up to the discretion of the District Engineer who issues the citation. Sometimes they choose not to put an end date. When a water system does not comply by the date written into their citation (“viol_end_date” in the HR2W spreadsheets), then a new citation is issued, and is continuously issued until compliance. Since the SWTR is based on treatment technique rather than a maximum contaminant level, I hear that water systems can take years before attempting to comply. Because of this, it can be very time consuming for districts to continuously write a new citation until compliance. Some engineers will leave the end date open due to this, and personally follow up with the system without continuing to issue citations. They will then enter an end date once the system has complied."
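One way to apply that interpretation is to give a missing VIOL_END_DATE a sort key later than any real date. The sketch below is an assumption about how this could be done, not the code pushed to the repository; the sample records are invented and a missing date is represented as `None`.

```python
from datetime import date

RTC = "RETURN TO COMPLIANCE"
FORMAL = "FORMAL ENFORCEMENT ACTION ISSUED"


def end_date_key(viol_end_date):
    """Sort key that ranks a missing VIOL_END_DATE after every real date."""
    return date.max if viol_end_date is None else viol_end_date


# Invented records for a single system/analyte pair.
records = [
    {"VIOL_END_DATE": date(2018, 6, 30), "ENF_ACTION_TYPE_ISSUED": RTC},
    {"VIOL_END_DATE": None, "ENF_ACTION_TYPE_ISSUED": FORMAL},  # open-ended citation
]


def latest_records(records):
    """Return the record(s) on the most recent end date, open-ended ones first."""
    newest = max(end_date_key(r["VIOL_END_DATE"]) for r in records)
    return [r for r in records if end_date_key(r["VIOL_END_DATE"]) == newest]


latest = latest_records(records)
still_in_violation = all(r["ENF_ACTION_TYPE_ISSUED"] != RTC for r in latest)
print(latest, still_in_violation)
```

Here the open-ended record wins over the dated RETURN TO COMPLIANCE, so the pair is still treated as in violation.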

@ruckeralex @skyballin Thanks! I think the interpretation of the missing VIOL_END_DATE was the last thing I needed to make this work. Pushed new changes to #9 that correctly identify 330 water systems in violation. Are there any other checks I can perform? Otherwise, I think we can start generating the report described in the "Desired Report Template."

Wow, excellent! Yes, you can check for the February 2019 data set that for the Triple R Mutual agency, the analyte Nitrate is out of compliance (active violation), but the analyte Arsenic is in compliance (no active violation). Thanks!