Support primary-key "in" queries with a filter condition
darrenklein opened this issue
In a query where a `==` operator is used on a primary key, a filter will be applied as expected. For example, the query

```elixir
from(p in Person, where: p.id == ^person1.id and is_nil(p.first_name))
|> TestRepo.all()
```

will correctly return an empty list, since the target record does have a value for `first_name`.
However, the same query will fail if an `in` operator is used:

```elixir
from(p in Person, where: p.id in ^[person1.id] and is_nil(p.first_name))
|> TestRepo.all()
```

will return the record, even though no record should be returned due to the filter condition.
Initially, I implemented this by simply using a query in place of `get-item`/`batch-get-item`, but @alhambra1 noted that it would likely be more performant to continue to use those methods and post-filter the data instead.
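The post-filter idea can be sketched in plain Elixir. The item maps below are hypothetical stand-ins for decoded `batch-get-item` results; the real adapter would apply whatever non-key conditions remain from the query:

```elixir
# Stand-ins for items returned by an unfiltered batch-get-item call,
# already decoded into maps (hypothetical data, not the adapter's internals).
items = [
  %{id: 1, first_name: "Ada"},
  %{id: 2, first_name: nil},
  %{id: 3, first_name: "Grace"}
]

# Apply the non-key condition (here, `is_nil(p.first_name)`) in Elixir
# after the fetch, instead of sending it to DynamoDB as a query filter.
filtered = Enum.filter(items, fn item -> is_nil(item.first_name) end)

IO.inspect(filtered)  # only the record with no first_name remains
```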
I've been experimenting with this idea on the `batch` branch, and some initial benchmarking (with Benchee) suggests that this idea is correct:
```
Name                    ips        average  deviation         median         99th %
post-get filter        7.84      127.58 ms    ±10.20%      123.88 ms      175.53 ms
get                    7.46      134.03 ms    ±13.79%      124.57 ms      179.83 ms
query                 0.182         5.50 s     ±0.00%         5.50 s         5.50 s
```
In this initial experiment, I tested three different requests, all of which were dealing with a production-instance table that had 134 records.
`get` was an unfiltered `batch-get-item` request for a list of ids:

```elixir
from(t in Test.Table, where: t.id in ^ids) |> Test.Repo.all()
```
`post-get filter` was a `batch-get-item` call that was filtered after the fact, while `query` was a query that passed a filter condition to Dynamo. Both use the same syntax:

```elixir
from(t in PerfTest.Table, where: t.id in ^ids and is_nil(t.deleted_at)) |> Test.Repo.all()
```

but they were handled differently under the hood.
In the two filtered cases, 106 records were returned and 28 were filtered out. From the results above, it is evident that post-filtering was significantly more performant than a query with a filter condition, and on par with the unfiltered request (it seems it can sometimes even be faster, I would guess because fewer records needed to be fully decoded).
Documenting an issue I noticed while working on this: when the primary key query is an `in` operation rather than `==`, if there are `between` or `begins_with` filters present, the values interpolated in those filters are ending up in the `hash_batch`, where they should not be. This value is derived from the `params` that are passed to `execute`, so we're actually receiving that from Ecto. Why is Ecto sending malformed params? Is it receiving bad data, or is this just a kink that needs to be ironed out by the adapter?
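To illustrate the symptom with made-up values (the names, shapes, and split are all assumptions, not the adapter's actual internals): if the flattened `params` list mixes the `in` key values with the `between` bounds, the adapter has to separate them itself or the bounds leak into the key batch:

```elixir
# Hypothetical flattened params for something like
# `t.id in ^ids` combined with a between filter on a date attribute.
params = ["id-1", "id-2", "id-3", "2023-01-01", "2023-12-31"]

# If the adapter can determine how many values belong to the key
# condition, it can keep the filter values out of the hash batch.
key_count = 3
{hash_batch, filter_values} = Enum.split(params, key_count)

IO.inspect(hash_batch)     # the ids to batch-get
IO.inspect(filter_values)  # the bounds to apply as a post-filter
```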