mjec / rc-niceties

End of batch niceties for the Recurse Center

Enforce nicety uniqueness in database

jasonaowen opened this issue · comments

For each batch a person does, an author should be able to write only a single nicety for them. An author could legitimately write two niceties for the same person if that person did multiple batches, but for a given author-person-batch combination, we want to read or update the existing nicety rather than create a second one.

In database parlance, that means the natural key on our nicety table is (author_id, target_id, end_date)[1]. We're probably going to keep our surrogate key of id, but we should add a UNIQUE constraint so that the database enforces this data integrity rule for us.

[1] Note that we want to change this to (author_id, person_id, start_date) to address the bug described in #10 where extending a batch causes previously written niceties to go missing.

Discovered while reviewing #47.

Note that in production, we do have some rows that violate this proposed constraint:

=> select author_id, target_id, end_date, count(*) from nicety group by author_id, target_id, end_date having count(*) > 1;
 author_id | target_id |  end_date  | count 
-----------+-----------+------------+-------
(omitted for privacy)
(8 rows)

Fortunately, all of those rows appear to be duplicates:

-- For each pair of rows sharing (author_id, target_id, end_date),
-- check whether every other column also matches, i.e. whether the
-- two rows are exact duplicates.
SELECT a.id,
  b.id,
  a.text = b.text AS text_identical,
  a.anonymous = b.anonymous AS anonymous_identical,
  a.no_read = b.no_read AS no_read_identical,
  a.faculty_reviewed = b.faculty_reviewed AS faculty_reviewed_identical,
  a.starred = b.starred AS starred_identical,
  a.date_updated = b.date_updated AS date_updated_identical
FROM nicety a
  INNER JOIN nicety b
    ON a.author_id = b.author_id
    AND a.target_id = b.target_id
    AND a.end_date = b.end_date
    AND a.id < b.id
ORDER BY a.id ASC;

The results look like they're all duplicate rows which can be safely deleted without any meaningful loss of data. I'd appreciate someone with read access to the production database double-checking my work here, though!

So, as part of this, I think we'll need a data migration to remove the duplicate rows, in addition to the schema migration that adds the unique constraint.

I ran a similar (but simpler) query and retrieved what are probably the same 8 rows. After deleting them from my local copy of the prod DB, the remaining 15 rows with duplicate author-target values all show differing end dates, so I'm happy to accept those as legitimate cases where someone wrote niceties for a person who did two different batches.

Next I should figure out how to make a data migration happen.

I created an empty migration file using flask db revision and wrote some code to get the IDs of the duplicate rows and print them out: gist

I also wrote up, in comments, some ideas for deleting those rows. I haven't tested any of them yet, because I wanted to check that this is roughly what you had in mind, @jasonaowen, or whether you'd prefer a different tactic.

Yes, that approach looks good to me! Either of the options outlined in the comments is fine; both are classic N+1 queries, but for a one-off and with only 8 rows, that's not a problem.

We should create the unique constraint as part of the same migration; it's not at all likely for the bug to turn up in general, much less in the span of time between running one migration and the next, but the two changes are logically connected, and doing them all at once makes that clearer.

I'm not sure how you're supposed to get the engine; the gist is of course prototype code for discussion, and as you say, hopefully running through the normal migration creation flow will give you a more fully formed template to work with!

Looking at models.py, I see there is already a constraint:
batch_author_target_unique = db.UniqueConstraint('batch', 'author', 'target')
Do we want to keep that and add the proposed author_target_end_date_unique or replace it? Or are they functionally equivalent?

It looks like I can get the current DB engine with op.get_bind().

That line appears not to be functional! We no longer seem to have a batch column or attribute on the model. As you just noted on this call, @christalee, we also don't have author or target; we have author_id and target_id. So this seems like a relic from an earlier time that was not properly updated, and we should delete it!