webcompat / webcompat-metrics-server

Server in charge of delivering different data to the webcompat-metrics-client

[db] Create a table for collecting milestones data

karlcow opened this issue · comments

This is a followup on #4

Summary:
Currently needsdiagnosis data are kept in a JSON file which is growing indefinitely.
We need to move the data and the collection to a DB.

Things to do:

  • Create table for milestones data
  • Move the old data (JSON file) to the new DB
  • How do we keep collecting when the DB is down?

@laghee As you have been the "DB Grandmaster Flash", do you have preferences on where I should put this? In models.py?

Thanks.

So if we add a table to PostgreSQL, we also need to run a migration script to add the new model, I guess. I think we need a dev environment so that we can make all our mistakes locally instead of testing in staging or prod.

So during Orlando @laghee told me that the DB is currently empty and not used, and that we can basically start from scratch. I was afraid of destroying things here, but there is nothing to destroy.

This also kind of restarts my discussion about which DB we should use. :p Let's look at the shape of the data we are actually using.

needsdiagnosis data (6593 rows)

  • timestamp (2019-01-24T01:00:03Z)
  • count (integer)

weekly diagnosis data (260 rows)

  • timestamp 2018-10-22T00:00:00
  • count (integer)

Question:
@laghee which TZ is the timestamp in for your data? :)

In mozilla/webcompat-team-okrs#48
we have 3 new data sets.

  • needstriage
  • needscontact
  • sitewait

They have the same shape, I believe, as the previous ones.

aka

  • a timestamp
  • a count

Question for @adamopenweb, who worked on a large set of events recently.
Do you know if it's hard to reconstruct history here based on the events data?
If it's possible, that would remove a burden with regard to the fear of losing data.

I have renamed the issue to better reflect what we will do. We can start collecting data for all milestones by storing milestone_name, timestamp, and count in a single table.
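As a rough illustration of the single-table idea, here is a minimal sketch. It uses the standard-library `sqlite3` module purely as a stand-in for PostgreSQL, and the table name `issues_count`, the helper `import_counts`, and the sample counts are all hypothetical. It also assumes the existing JSON data is a list of objects with `timestamp` and `count` keys.

```python
import sqlite3

# sqlite3 is only a stand-in here; the thread targets PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE issues_count (
        id INTEGER PRIMARY KEY,
        milestone TEXT NOT NULL,
        timestamp TEXT NOT NULL,
        count INTEGER NOT NULL
    )
    """
)


def import_counts(conn, milestone, records):
    """Insert timestamp/count records for one milestone; return rows added."""
    rows = [(milestone, r["timestamp"], r["count"]) for r in records]
    conn.executemany(
        "INSERT INTO issues_count (milestone, timestamp, count) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


# Hypothetical sample shaped like the JSON data described above.
sample = [
    {"timestamp": "2019-01-24T01:00:03Z", "count": 10},
    {"timestamp": "2019-01-24T02:00:03Z", "count": 12},
]
added = import_counts(conn, "needsdiagnosis", sample)
total = conn.execute("SELECT COUNT(*) FROM issues_count").fetchone()[0]
```

Because the milestone name is a column, new data sets such as needstriage, needscontact, and sitewait would need no schema change, only a different value for `milestone`.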

> Do you know if it's hard to reconstruct history here based on the events data.

I don't think it would be difficult to reconstruct, as we don't need comments to track a report's progression.

Example event data (looks nice in FFJSON viewer):
https://api.github.com/repos/webcompat/web-bugs/issues/18143/events

Each item in the array contains the event property. I think the important values are milestoned and labeled, but there are also demilestoned and unlabeled. When the milestone changes, the event carries the associated milestone title ("needstriage", for example), and the same goes for labels. There is also created_at for the event timestamp.
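To make the shape of that data concrete, here is a small sketch of pulling milestone changes out of an events array. The function name `milestone_changes` and the sample payload are hypothetical; the sample is trimmed down to the `event`, `created_at`, `milestone.title`, and `label.name` fields mentioned above, while real responses carry many more.

```python
def milestone_changes(events):
    """Return (created_at, event, milestone title) for milestone events only."""
    wanted = {"milestoned", "demilestoned"}
    return [
        (e["created_at"], e["event"], e["milestone"]["title"])
        for e in events
        if e.get("event") in wanted and e.get("milestone")
    ]


# Hypothetical, trimmed-down payload for a single issue.
sample_events = [
    {"event": "milestoned", "created_at": "2018-07-31T05:00:00Z",
     "milestone": {"title": "needstriage"}},
    {"event": "labeled", "created_at": "2018-07-31T05:00:01Z",
     "label": {"name": "browser-firefox"}},
    {"event": "demilestoned", "created_at": "2018-08-01T09:30:00Z",
     "milestone": {"title": "needstriage"}},
    {"event": "milestoned", "created_at": "2018-08-01T09:30:01Z",
     "milestone": {"title": "needsdiagnosis"}},
]
changes = milestone_changes(sample_events)
```

Replaying such tuples in `created_at` order for every issue is one way the history of a milestone could be rebuilt, at the cost of one request per issue.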

Collecting event data is heavy, one request per issue. @karlcow Are you looking to use a webhook?

> weekly diagnosis data (260 rows)
>
> * timestamp `2018-10-22T00:00:00`
> * count (integer)
>
> Question:
> @laghee which TZ the timestamp is for your data? :)

It should actually be 2018-10-22T00:00:00Z, so UTC (based on the timestamps of the issues themselves) -- and always midnight into a Monday.
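For the import, one option is to normalize these strings into timezone-aware UTC datetimes before storing them. This is a minimal sketch; the helper names `parse_weekly_timestamp` and `is_monday_midnight` are hypothetical, and it assumes every weekly timestamp is UTC whether or not the trailing `Z` is present.

```python
from datetime import datetime, timezone


def parse_weekly_timestamp(value):
    """Parse 'YYYY-MM-DDTHH:MM:SS' (optional trailing Z) as an aware UTC datetime."""
    naive = datetime.strptime(value.rstrip("Z"), "%Y-%m-%dT%H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)


def is_monday_midnight(moment):
    """Check the 'always midnight into a Monday' property of weekly timestamps."""
    at_midnight = (moment.hour, moment.minute, moment.second) == (0, 0, 0)
    return moment.weekday() == 0 and at_midnight


stamp = parse_weekly_timestamp("2018-10-22T00:00:00")
same_stamp = parse_weekly_timestamp("2018-10-22T00:00:00Z")
```

A check like `is_monday_midnight` could flag unexpected rows while the old JSON data is being moved over.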

> Do you know if it's hard to reconstruct history here based on the events data.

As @adamopenweb says, there's a fairly high data/parsing burden (lots of API calls, lots of data in the JSON that isn't always really necessary for our purposes).

The next step for this one will be #79

> Collecting event data is heavy, one request per issue. @karlcow Are you looking to use a webhook?

Nope, I was thinking more of rebuilding the history of certain milestones (from when we started to use them) to create the historical data we do not have (but more as a pet project).

@laghee I also realize that my script doesn't cover the incoming weekly data. We probably need a separate script. Let's open an issue for this.

I created #80 for this previous comment.

@laghee Would that make sense?

class IssuesCount(db.Model):
    """Define an IssuesCount for a milestone at a precise time.
    
    An issues count has:

    * a unique table id
    * a timestamp representing the time the count has been done
    * a count of the issues at this timestamp
    * a milestone representing the category it belongs to
    """
    id = db.Column(db.Integer, primary_key=True)
    timestamp = db.Column(db.DateTime, nullable=False)
    count = db.Column(db.Integer, nullable=False)
    milestone = db.Column(db.String(15), nullable=False)

    def __repr__(self):
        """Representation of IssuesCount."""
        return '<IssuesCount {timestamp} {count} {milestone}>'.format(
            timestamp=self.timestamp,
            count=self.count,
            milestone=self.milestone
        )

@karlcow That seems pretty logical to me, and the datetime object should preserve the time zone info, which is good.

What do you think we should do about weekly new issue counts? Do we want to make a different model for those? I suppose we could make the milestone field nullable... but that could be confusing down the road.

Oh, wait, I just had a thought. One of the cool things about @adamopenweb's original poc was the way you could see all the milestones within the total reports simultaneously. If we want to preserve the ability to do anything like that, it's probably better to keep overall weekly counts as a separate model (that could be tweaked or added to in the future).

So, if indeed we can simplify the timestamp for daily reports to just the date and query the database to get the weekly totals, the model for daily issues filed could then be something like:

class DailyTotal(db.Model):
    """Define a DailyTotal for new issues filed.

    A daily total has:

    * a unique table id
    * a day representing the date that corresponds to the total
    * a count of the issues filed on this date
    """
    id = db.Column(db.Integer, primary_key=True)
    day = db.Column(db.DateTime, nullable=False)
    count = db.Column(db.Integer, nullable=False)

    def __repr__(self):
        """Representation of DailyTotal."""
        return '<DailyTotal for {day}: {count}>'.format(
            day=self.day,
            count=self.count
        )

Edit: Just reread title and realized this issue is specific to milestones data, so opened #82 for the daily count.
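For reference, deriving weekly totals from daily rows, as suggested above, could look roughly like this in plain Python. The function name `weekly_totals` and the sample counts are hypothetical, and the sketch assumes each week is keyed by the Monday that starts it; in practice the grouping would more likely be done in the database query.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_totals(daily_counts):
    """Sum (day, count) pairs into totals keyed by the Monday starting each week."""
    totals = defaultdict(int)
    for day, count in daily_counts:
        # weekday() is 0 for Monday, so this steps back to the week's Monday.
        monday = day - timedelta(days=day.weekday())
        totals[monday] += count
    return dict(sorted(totals.items()))


# Hypothetical daily totals spanning two weeks.
daily = [
    (date(2018, 10, 22), 5),
    (date(2018, 10, 24), 3),
    (date(2018, 10, 29), 7),
]
weekly = weekly_totals(daily)
```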