Support for grouping explorer results by custom tags

subintp opened this issue 3 years ago · comments

datafloyd commented 3 years ago

Use Case

Visualize traffic patterns such as call counts, P99 latency, and errors across IP, user_id, api_key, etc. via the explorer tab.

This will help us debug issues like

Latency/Errors spike specific to a user or customer
Request spikes from particular IPs or API keys

Proposal

The above use cases can be solved in a generic fashion by grouping explorer query results based on custom tags. Currently, there is no way to solve this via prometheus+grafana due to the high cardinality of these tags. Adding this feature will make the explorer tab more powerful.

Tasks

Based on the below conversation, converting the high-level items discussed here to sub-tasks for this ticket.

Query Service:

#1268
hypertrace/query-service#99
hypertrace/query-service#100
Deprecations ( CONTAINS_KEY, CONTAINS_KEY_VALUE operators and ColumnIdentifier expressions)

Gateway Service

API support for receiving attribute expressions in all places a ColumnIdentifier is currently received
Impl support for translating attribute expressions into QS attribute expressions
Deprecations of ColumnIdentifier expression

GraphQL service:

API support for receiving attribute expressions in all places string keys are currently received
Impl support for translating into GW attribute expressions
Deprecations of string keys input

UI

Implementing new UX definition for groupby
Update GQL queries to pass attribute expressions

kotharironak · Answer 1 · Wed Nov 10 2021 13:22:53 GMT+0800 (China Standard Time)

Currently, we store span tags in a MAP attribute, which is ultimately stored in Pinot as a MAP column (two multi-valued arrays). On the MAP attribute, we support lookup (Contains_Key) and key-value comparison (Contains_key_value), and both are translated to Pinot's mapValue queries in the query service layer. Similarly, we will need support for translating group by query expressions involving mapValue.
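To make the storage model concrete, here is a minimal Python sketch of lookups over two parallel arrays. The function names and the empty-string default for a missing key are assumptions for illustration only. They imitate the behaviour described above and are not Pinot's or the query service's actual implementation.

```python
from typing import List


def map_value(keys: List[str], key: str, values: List[str], default: str = "") -> str:
    """Return the value paired with `key` in two parallel arrays, or `default`."""
    for k, v in zip(keys, values):
        if k == key:
            return v
    return default


def contains_key(keys: List[str], key: str) -> bool:
    """Lookup: is `key` present among the map keys?"""
    return key in keys


def contains_key_value(keys: List[str], values: List[str], key: str, value: str) -> bool:
    """Key-value comparison: is the pair (`key`, `value`) present?"""
    return any(k == key and v == value for k, v in zip(keys, values))


# Hypothetical span tags, stored as a keys array and a values array.
tags_keys = ["span.kind", "http.method"]
tags_values = ["server", "GET"]
```

Grouping by a tag then amounts to using `map_value(tags_keys, "span.kind", tags_values)` as the group key for each row.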

@sarthak77 Can you look into the query service side of changes as part of this ticket?

kotharironak · Answer 2 · Mon Nov 15 2021 15:31:51 GMT+0800 (China Standard Time)

To handle a container object such as a map for a corresponding attribute (e.g., spanTags - https://github.com/hypertrace/query-service/blob/main/query-service/src/main/resources/configs/common/application.conf#L75), I was thinking we could extend the query service expression definition (https://github.com/hypertrace/query-service/blob/main/query-service-api/src/main/proto/request.proto#L11) as below to carry that information.

message Expression {
  oneof value {
    ColumnIdentifier columnIdentifier = 1;
    LiteralConstant literal = 2;
    Function function = 3;
    OrderByExpression orderBy = 4;
    ObjectIdentifier objectIdentifier = 5;
  }
}

message ObjectIdentifier {
  string columnName = 1; // 1 & 2 can be replaced with ColumnIdentifier
  string alias = 2;
  string path_key = 3; // or path_expression
}

With the above, we will be able to handle contains_key_value as below.
E.g., the existing contains_key_value expression:

childFilter {
        lhs {
          columnIdentifier {
            columnName: "API_TRACE.tags"
          }
        }
        operator: CONTAINS_KEYVALUE
        rhs {
          literal {
            value {
              valueType: STRING_ARRAY
              string_array: "span.kind"
              string_array: "server"
            }
          }
        }
}

Using ObjectIdentifier:

childFilter {
        lhs {
          objectIdentifier {
            columnName: "API_TRACE.tags"
            path_key: "span.kind"
          }
        }
        operator: CONTAINS_KEYVALUE
        rhs {
          literal {
            value {
              valueType: STRING
              string: "server"
            }
          }
        }
}
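The difference between the two filter shapes can be sketched as a small conversion. This is a hedged Python illustration that uses plain dictionaries in place of the protobuf messages. The function name `upgrade_contains_key_value` is hypothetical, and the real query service may handle this quite differently.

```python
from typing import Any, Dict


def upgrade_contains_key_value(legacy: Dict[str, Any]) -> Dict[str, Any]:
    """Move the map key from the rhs string array into the lhs identifier."""
    if legacy["operator"] != "CONTAINS_KEYVALUE":
        raise ValueError("only CONTAINS_KEYVALUE filters are handled in this sketch")
    column = legacy["lhs"]["columnIdentifier"]["columnName"]
    # The legacy form packs [key, value] into a single string array on the rhs.
    key, value = legacy["rhs"]["literal"]["value"]["string_array"]
    return {
        "lhs": {"objectIdentifier": {"columnName": column, "path_key": key}},
        "operator": legacy["operator"],
        "rhs": {"literal": {"value": {"valueType": "STRING", "string": value}}},
    }


legacy_filter = {
    "lhs": {"columnIdentifier": {"columnName": "API_TRACE.tags"}},
    "operator": "CONTAINS_KEYVALUE",
    "rhs": {
        "literal": {
            "value": {
                "valueType": "STRING_ARRAY",
                "string_array": ["span.kind", "server"],
            }
        }
    },
}
new_filter = upgrade_contains_key_value(legacy_filter)
```

The point of the new shape is that the lhs fully identifies what is being compared, so the rhs can be an ordinary string literal.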

We will be able to add Group By support as below (only selection, omitted filter in example):

selection {
  function {
    functionName: "AVG"
    arguments {
      columnIdentifier {
        columnName: "EVENT.duration"
      }
    }
    alias: "AVG_EVENT.duration_[]"
  }
}
selection {
  objectIdentifier {
    columnName: "EVENT.spanTags"
    path_expression: "span.kind"
    alias: "EVENT.spanTags.span.kind"
  }
}
groupBy {
  objectIdentifier {
    columnName: "EVENT.spanTags"
    path_expression: "span.kind"
    alias: "EVENT.spanTags.span.kind"
  }
}

The above will be translated to the below example Pinot query (other time filters are added for reference):

select mapValue(tags__KEYS,'span.kind',tags__VALUES),  AVG(duration_millis) FROM spanEventView 
WHERE tenant_id = '__default' 
AND start_time_millis >= 1636519514905 AND start_time_millis < 1636523114905 // example filter
AND mapValue(tags__KEYS,'span.kind',tags__VALUES) != ''
group by mapValue(tags__KEYS,'span.kind',tags__VALUES)
limit 10000
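As a rough illustration of that translation step, here is a Python sketch that assembles the same query text. The helper names and parameters are hypothetical, and the `__KEYS` / `__VALUES` suffixes are copied from the example above. The query service's actual converter is not shown in this thread, and production code should not build SQL by string interpolation of untrusted input.

```python
def map_value_sql(column: str, path: str) -> str:
    """Render a mapValue(...) call for a map column split into KEYS/VALUES arrays."""
    escaped = path.replace("'", "''")  # minimal escaping, for the sketch only
    return f"mapValue({column}__KEYS,'{escaped}',{column}__VALUES)"


def build_group_by_query(
    table: str,
    map_column: str,
    path: str,
    aggregation: str,
    tenant_id: str,
    start_ms: int,
    end_ms: int,
    limit: int = 10000,
) -> str:
    """Assemble a group-by query keyed on one entry of a map column."""
    mv = map_value_sql(map_column, path)
    parts = [
        f"select {mv}, {aggregation} FROM {table}",
        f"WHERE tenant_id = '{tenant_id}'",
        f"AND start_time_millis >= {start_ms} AND start_time_millis < {end_ms}",
        f"AND {mv} != ''",  # skip rows where the tag is absent
        f"group by {mv}",
        f"limit {limit}",
    ]
    return " ".join(parts)


query = build_group_by_query(
    table="spanEventView",
    map_column="tags",
    path="span.kind",
    aggregation="AVG(duration_millis)",
    tenant_id="__default",
    start_ms=1636519514905,
    end_ms=1636523114905,
)
```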

@aaron-steinfeld what do you think?

Aaron Steinfeld · Answer 3 · Mon Nov 15 2021 22:56:46 GMT+0800 (China Standard Time)

I think this solution makes sense in isolation for QS - it's along the lines of what I was thinking. Now, we need to make sure it will work at other layers, or that whatever we do there is consistent with this change, because we need to make sure the solutions make sense together too. For example, if the attribute identifier string has to contain this information higher up the stack, we'd want to do the same here otherwise we have two sources of paths.

Will spend time today looking at gateway and graphql and see if we can get consistent changes there.

Aaron Steinfeld · Answer 4 · Tue Nov 16 2021 06:20:46 GMT+0800 (China Standard Time)

Looked into this more today. So we need support in all of:

Group By
Filter
Selection
Order by (this one wasn't mentioned, but looks like it comes in mostly the same; I'm adding it for consistency as a column on the table requires sorting, and we want to support adding these on tables same as any other attribute)

The analysis above makes sense for QS and works for all four locations we'd need to specify this info. Looking at gateway service, it's more or less a duplication of QS for our purposes, and the same solution works there. Now the more painful bit comes when we get to GraphQL.

Just taking the explorer API for now (which isn't quite accurate, since the explorer page results table is powered by the traces API, not the explorer API), we've got the following pseudo-schema:

GroupBy:
  groupLimit Int
  includeRest Boolean
  keys [String]!
Filter:
  idScope String
  idType AttributeScope
  key String
  operator FilterOperatorType!
  type FilterType!
  value Unknown!
Sort:
  aggregation MetricAggregationType
  direction OrderDirection
  key String!
  size Int
Selection:
  key String!
  aggregation MetricAggregationType
  size Int
  units TimeUnit

All of these are inputs and can take new fields, but it's going to get ugly quick in order to support backwards compatibility. Ignoring that for a second, I think the general idea of replacing all key/String references with an object like

  AttributeExpression {
    attributeId: String!
    subpath: String
  }

would accomplish what we want.

To wire it through compatibly, I'd propose making all the key/String fields optional, and add optional fields named attribute(s) of type AttributeExpression. GQL can resolve the two fields down to one AttributeExpression, and wire that through as the new expression.
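A minimal Python sketch of that resolution step, assuming a hypothetical `resolve_attribute` helper. Preferring the new field when both are supplied is my assumption here. It has not been decided in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttributeExpression:
    attribute_id: str
    subpath: Optional[str] = None


def resolve_attribute(
    key: Optional[str] = None,
    attribute: Optional[AttributeExpression] = None,
) -> AttributeExpression:
    """Collapse the legacy `key` and the new `attribute` into one expression."""
    if attribute is not None:
        # Assumption: the new field wins if a client sends both.
        return attribute
    if key is not None:
        # A legacy string key is an attribute expression without a subpath.
        return AttributeExpression(attribute_id=key)
    raise ValueError("either 'key' or 'attribute' must be provided")


legacy = resolve_attribute(key="duration")
tagged = resolve_attribute(attribute=AttributeExpression("spanTags", "span.kind"))
```

Everything below the GQL input layer would then only ever see an `AttributeExpression`, so old and new clients share one code path.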

Now back to the changes in QS (which will be mirrored in Gateway) - a couple small changes I'd suggest to keep the whole stack consistent:

ObjectIdentifier -> AttributeExpression: This represents an attribute, so we can be more precise with naming. The concept of column should not be exposed outside the QS implementation (I know, I know - that ship sailed, but let's try to be precise on the new stuff).
columnName -> attributeId: Same reasoning as above.
path_expression -> subpath: This string represents a path traversal into the data stored for that attributeId, but can't be any old expression, those are represented by full messages elsewhere in the api. It also reads like it's the path to that attribute, while it's actually the path relative to that attribute - so I thought subpath captured that better.
path_expression / subpath -> should be marked optional

How's that sound @kotharironak @sarthak77 ?

Prashant Pandey · Answer 5 · Tue Nov 16 2021 19:02:13 GMT+0800 (China Standard Time)

@aaron-steinfeld @kotharironak Is there a design doc on this in progress?

kotharironak · Answer 6 · Tue Nov 16 2021 19:53:53 GMT+0800 (China Standard Time)

@suddendust we already have support for lookup (contains_key) and eq (contains_key_value) operations on map fields. So the design here is to extend that support to group by and other operators for map fields. The current implementation makes some implicit assumptions, so part of the lhs expression is passed as a string array in the rhs expression (see the example here: #1099 (comment)). The discussion here is about how we should transport the required information from GQL to the Query Service layer.

So, once we have the right expression as described above, it will help us fix the existing issue and extend support to group by. With the new expression described above, QS will translate the request into the Pinot group by query shown in the example above. (E.g., draft PR handling contains_key_value with the newer expression: hypertrace/query-service#97)

kotharironak · Answer 7 · Tue Nov 16 2021 20:12:02 GMT+0800 (China Standard Time)

Sounds good to me. I was thinking more of the distinction between simple attributes vs complex attributes.

ColumnIdentifier -> Simple attribute
ObjectIdentifier -> Complex attribute

With AttributeExpression, it is the QueryService that decides internally whether it can handle a request, based on whether the attribute is mapped to a simple column or a complex column.

E.g., a correct request: the attribute is mapped to a map column, and QS can understand the subpath.

AttributeExpression {
  attributeId: "API_TRACE.tags"
  subpath: "span.kind"
}

E.g., a bad request: the attribute is mapped to a long column, and QS can't understand a subpath for that column, so it throws an exception.

AttributeExpression {
  attributeId: "API_TRACE.startTime"
  subpath: "1234"
}

E.g., a correct request: no subpath is present and the attribute is mapped to a simple long column, so QS can serve it.

AttributeExpression {
  attributeId: "API_TRACE.startTime"
}
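The three cases can be sketched as a small validation routine. This is a hedged Python illustration. The `COLUMN_KINDS` table and the `validate_expression` name are hypothetical stand-ins for whatever attribute-to-column mapping QS holds internally.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from attribute id to the kind of column backing it.
COLUMN_KINDS = {
    "API_TRACE.tags": "map",
    "API_TRACE.startTime": "long",
}


@dataclass(frozen=True)
class AttributeExpression:
    attribute_id: str
    subpath: Optional[str] = None


def validate_expression(expr: AttributeExpression) -> str:
    """Return the column kind if the request can be served, else raise."""
    kind = COLUMN_KINDS.get(expr.attribute_id)
    if kind is None:
        raise KeyError(f"unknown attribute: {expr.attribute_id}")
    if expr.subpath is not None and kind != "map":
        # A subpath only makes sense for attributes backed by a complex column.
        raise ValueError(f"{expr.attribute_id} is a {kind} column and has no subpaths")
    return kind
```

With this, `validate_expression(AttributeExpression("API_TRACE.tags", "span.kind"))` and `validate_expression(AttributeExpression("API_TRACE.startTime"))` succeed, while adding a subpath to `API_TRACE.startTime` raises a `ValueError`.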

Secondly, with this approach, we will eventually not need ColumnIdentifier, so I guess we will deprecate it, right?

We will continue to support alias, right? So, the expression at QS (or Gateway) will look like below?

message Expression {
  oneof value {
    ColumnIdentifier columnIdentifier = 1;
    LiteralConstant literal = 2;
    Function function = 3;
    OrderByExpression orderBy = 4;
    AttributeExpression attributeExpression = 5;
  }
}

message AttributeExpression {
  string attributeId = 1;
  string subpath = 2;
  string alias = 3;
}

Prashant Pandey · Answer 8 · Tue Nov 16 2021 20:35:17 GMT+0800 (China Standard Time)

@kotharironak We would like to contribute to this feature to speed it up, so can you let us know if there are some sub-tasks that we can pick-up and work on in parallel? Also, do you have an ETA for this in mind?

Aaron Steinfeld · Answer 9 · Tue Nov 16 2021 23:50:23 GMT+0800 (China Standard Time)

Secondly, with this approach, we will eventually not need ColumnIdentifier, so I guess we will deprecate it, right?

Yeah, I thought it'd be more readable and simpler to handle a single message representing a selection, rather than two alternatives that would need to be checked up and down the stack. So ColumnIdentifier would be deprecated eventually, yes.

We will continue to support alias, right? So, the expression at QS (or Gateway) will look like below?

Yep, sorry, I ignored alias. It'd remain with equivalent functionality.

@kotharironak We would like to contribute to this feature to speed it up, so can you let us know if there are some sub-tasks that we can pick-up and work on in parallel? Also, do you have an ETA for this in mind?

Will let Ronak speak to this generally in terms of allocation and timeline, but the main work breakdown I see:

QS
- API support for receiving attribute expressions in all places a ColumnIdentifier is currently received
- Impl support for translating attribute expressions into pinot queries
- Deprecations ( CONTAINS_KEY, CONTAINS_KEY_VALUE operators and ColumnIdentifier expressions)
GW
- API support for receiving attribute expressions in all places a ColumnIdentifier is currently received
- Impl support for translating attribute expressions into QS attribute expressions
- Deprecations
GQL
- API support for receiving attribute expressions in all places string keys are currently received
- Impl support for translating into GW attribute expressions
- Deprecations
UI
- Implementing new UX defined below
- Update GQL queries to pass attribute expressions
UX
- Input mechanism for user defined keys in group by
- Input mechanism for user defined keys in filters (may be a NOOP if it builds on current contains_key)

A good amount of them are coupled or repetitive with one another, but I was planning on starting with the GQL layer and working my way down. I believe @sarthak77 is coming from QS up.

kotharironak · Answer 10 · Wed Nov 17 2021 00:14:19 GMT+0800 (China Standard Time)

@suddendust It would be better if you, along with @sarthak77, could focus first on QS (including tests), and then on anything not yet completed in the layer above, Gateway (going from the bottom up). I will take a look in the morning and create sub-tasks for the QS layer if possible.