Support primary-key "in" queries with a filter condition
darrenklein opened this issue
In a query where a `==` operator is used on a primary key, a filter will be applied as expected. For example, the query

```elixir
from(p in Person, where: p.id == ^person1.id and is_nil(p.first_name))
|> TestRepo.all()
```

will correctly return an empty list, since the target record does have a value for `first_name`.
However, the same query will fail if an `in` operator is used:

```elixir
from(p in Person, where: p.id in ^[person1.id] and is_nil(p.first_name))
|> TestRepo.all()
```

will return the record, even though no record should be returned due to the filter condition.
Initially, I implemented this by simply using a query in place of `get-item`/`batch-get-item`, but @alhambra1 noted that it would likely be more performant to continue to use those methods and post-filter the data instead.
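The post-filter idea can be sketched in plain Elixir. The item maps below are hypothetical stand-ins for decoded `batch-get-item` results; the real adapter would apply whatever non-key conditions remain from the query:

```elixir
# Stand-ins for items returned by an unfiltered batch-get-item call,
# already decoded into maps (hypothetical data, not the adapter's internals).
items = [
  %{id: 1, first_name: "Ada"},
  %{id: 2, first_name: nil},
  %{id: 3, first_name: "Grace"}
]

# Apply the non-key condition (here, `is_nil(p.first_name)`) in Elixir
# after the fetch, instead of sending it to DynamoDB as a query filter.
filtered = Enum.filter(items, fn item -> is_nil(item.first_name) end)

IO.inspect(filtered)  # only the record with no first_name remains
```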
I've been experimenting with this idea on the `batch` branch, and some initial benchmarking (with Benchee) suggests that this idea is correct:
```
Name                    ips        average  deviation         median         99th %
post-get filter        7.84      127.58 ms    ±10.20%      123.88 ms      175.53 ms
get                    7.46      134.03 ms    ±13.79%      124.57 ms      179.83 ms
query                 0.182         5.50 s     ±0.00%         5.50 s         5.50 s
```
In this initial experiment, I tested three different requests, all of which were dealing with a production-instance table that had 134 records.
`get` was an unfiltered `batch-get-item` request for a list of ids:

```elixir
from(t in Test.Table, where: t.id in ^ids) |> Test.Repo.all()
```
`post-get filter` was a `batch-get-item` call that was filtered after the fact, while `query` was a query that passed a filter condition to Dynamo. Both use the same syntax:

```elixir
from(t in PerfTest.Table, where: t.id in ^ids and is_nil(t.deleted_at)) |> Test.Repo.all()
```

but they were handled differently under the hood.
In the two filtered cases, 106 records were returned and 28 were filtered out. From the results above, it is evident that post-filtering was significantly more performant than a query with a filter condition, and on par with the unfiltered request (it seems it can sometimes even be faster, I would guess because fewer records needed to be fully decoded).
Documenting an issue I noticed while working on this: when the primary key query is an `in` operation rather than `==`, if there are `between` or `begins_with` filters present, the values interpolated in those filters are ending up in the `hash_batch`, where they should not be. This value is derived from the `params` that are passed to `execute`, so we're actually receiving that from Ecto. Why is Ecto sending malformed params? Is it receiving bad data, or is this just a kink that needs to be ironed out by the adapter?
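To illustrate the symptom with made-up values (the names, shapes, and split are all assumptions, not the adapter's actual internals): if the flattened `params` list mixes the `in` key values with the `between` bounds, the adapter has to separate them itself or the bounds leak into the key batch:

```elixir
# Hypothetical flattened params for something like
# `t.id in ^ids` combined with a between filter on a date attribute.
params = ["id-1", "id-2", "id-3", "2023-01-01", "2023-12-31"]

# If the adapter can determine how many values belong to the key
# condition, it can keep the filter values out of the hash batch.
key_count = 3
{hash_batch, filter_values} = Enum.split(params, key_count)

IO.inspect(hash_batch)     # the ids to batch-get
IO.inspect(filter_values)  # the bounds to apply as a post-filter
```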