ankane / groupdate

The simplest way to group temporal data

group_by_duration doesn't work for an ActiveRecord::Relation

jeffblake opened this issue

Hi @ankane,

Thanks for your work on group_by_duration. I was the first commenter 6 years ago on the feature request :) #23

I spun up the branch, and it appears the method signatures don't match up when going through an ActiveRecord::Relation (i.e., enumerable.rb vs. query_methods.rb):

    User.limit(3).group_by_duration(10.minutes)
    # => Arel::Visitors::UnsupportedVisitError (Unsupported argument type: Hash. Construct an Arel node instead.)

    User.limit(3).group_by_duration(10.minutes, :created_at)
    # => ArgumentError (wrong number of arguments (given 2, expected 1))

Thanks!

I originally took a stab at this in SQL, and came to very much appreciate your work on building complete series, integrating with Active Record relations, and handling time zones!

    # Bucket scanned_at into 10-minute (600-second) intervals by flooring the
    # epoch; bind_param holds the placeholders for the bound action values.
    query = <<-SQL.squish
      SELECT COUNT(*) AS count,
        to_timestamp(floor(extract('epoch' from scanned_at) / 600) * 600)
          AT TIME ZONE 'UTC' AS interval_alias
      FROM scans
      WHERE event_id = $1 AND action IN (#{bind_param.join(",")})
      GROUP BY interval_alias
    SQL

    binds = [
      ActiveRecord::Relation::QueryAttribute.new("event_id", event_id, ActiveRecord::Type::Integer.new)
    ]

    ActiveRecord::Base.connection.exec_query(query, "SQL", binds, prepare: true).to_a
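For comparison, the branch's group_by_duration collapses all of that into something like the following (a sketch; the Scan model and actions array stand in for the bound values above):

    Scan.where(event_id: event_id, action: actions)
        .group_by_duration(10.minutes, :scanned_at)
        .count

with the complete series and time zone handling included, which the raw SQL doesn't do.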

Hey @jeffblake, thanks for the report. Just pushed a fix to that branch.

It's fixed, thank you. Would love to see this in v5.

Some minor things I noticed:

  • If passing in a format, e.g., "%l:%M%P", with data that spans multiple days (i.e., key_format is not unique), the count for the last occurrence of that time, say 4:10pm, overwrites the previous occurrences (see the sketch after this list). This could be expected behavior, but it wasn't clear initially
  • Is it possible to strip outliers out of the series? I think that would be a useful option to pass in
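To illustrate the first point, here's a minimal standalone sketch (not groupdate's actual code) of how two timestamps from different days collapse to the same "%l:%M%P" key, with the later value silently replacing the earlier one:

    counts = {
      Time.utc(2021, 3, 1, 16, 10) => 3,  # day 1, 4:10pm
      Time.utc(2021, 3, 2, 16, 10) => 5   # day 2, 4:10pm
    }
    counts.transform_keys { |t| t.strftime("%l:%M%P") }
    # => {" 4:10pm" => 5}  (day 1's count of 3 is gone)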

I have some other performance ideas that I may take a stab at:

  • add # frozen_string_literal: true comments to save a few allocations
  • prefer Time.iso8601 instead of Time.parse for performance
  • key_format in series_builder is significantly slower when passing in a format: line 215's time_zone.parse("2014-03-02 00:00:00") runs every time (a memoized version is sketched below)
    I only took a quick peek, but I think there are more opportunities to shave down allocations.
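A rough sketch of that memoization (the method and instance variable names are hypothetical, not groupdate's actual internals):

    # Parse the reference timestamp once per time zone instead of once per key.
    def reference_day_start(time_zone)
      @reference_day_start ||= {}
      @reference_day_start[time_zone.name] ||= time_zone.parse("2014-03-02 00:00:00")
    end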

Hey @jeffblake, thanks for the feedback/ideas.

  • Since aggregations return a hash, non-unique keys will be overwritten. I don't think there's a way around that.
  • For outliers, check out Anomaly or Trend.
  • For performance, we can memoize key_format to speed things up (nice find!). I'm not sure the other optimizations will make a big difference.

Sounds good, the memoization will be the best bang for the buck.

Thanks for the tips on the outliers, and yes, that makes sense about the aggregation.

I'll go ahead and close this; I appreciate the quick fix.

Just FYI, I decided to go a different direction to group by 10-minute intervals. See #23 (comment).