Add filter for join

Question


lenovin opened this issue 8 months ago · comments

lenovin commented 8 months ago

Hi,

Could you add a feature for filtering on joined tables?

Regards

Maurits van der Schee · Answer 1 · Sat Dec 23 2023 20:36:39 GMT+0800 (China Standard Time)

Filtering based on values in joined tables is not supported. You can work around this by doing two queries or reversing the query (join in the other direction).

Lucas Kinne · Answer 2 · Thu Feb 01 2024 20:50:07 GMT+0800 (China Standard Time)

@mevdschee
Is there a specific reason why this is not supported?
Would the API run into some kind of ambiguity issue when generating the raw SQL queries, or is it just not implemented yet?

I think both of your solutions do not suffice for my use case, because I need to access and filter multiple tables at the same time:

As a simple example, we have the following tables:

personality: ~75k entries with data about persons
ou: ~3k entries with data about organizational units, including boolean columns that indicate whether the OU should be visible (real OUs) or not (technical OUs, or OUs hidden for privacy reasons)
personality2ou: contains mappings of persons to organizational units (many-to-many), also with visibility boolean columns that indicate whether the mapping should be hidden completely, only be shown in the intranet or shown publicly on the internet
So for contextual reasons the visibility columns are distributed among multiple tables.

I ran into this problem and I solved it by implementing a custom middleware:

[Request-Manipulation] Mandatory joins are automatically added to the request, so a request to /records/personality would automatically be transformed to /records/personality?join=personality2ou,ou before being processed, so that the personality table cannot be accessed on its own.
[Response-Manipulation] After the records have been loaded, they are postprocessed by recursively enforcing the visibility column values and thereby filtering the response.
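
The response-manipulation step could look roughly like the sketch below. It is written in Python purely for illustration (the real middleware would live in the PHP API); the function name `enforce_visibility` and the column name `visible` are hypothetical stand-ins for the actual visibility columns.

```python
from typing import Any

VISIBILITY_COLUMN = "visible"  # hypothetical column name


def enforce_visibility(node: Any) -> Any:
    """Recursively drop records whose visibility flag is falsy.

    A hidden record is returned as None so that the caller can remove it.
    """
    if isinstance(node, list):
        kept = (enforce_visibility(item) for item in node)
        return [item for item in kept if item is not None]
    if isinstance(node, dict):
        # Records without the column are treated as visible (an assumption).
        if not node.get(VISIBILITY_COLUMN, True):
            return None
        return {key: enforce_visibility(value) for key, value in node.items()}
    return node  # scalar values pass through unchanged
```

Because this runs after the records have been loaded, the number of records that survive is only known once the SQL-level limit has already been applied.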

This has worked so far, but pagination now breaks: it limits the record list at the SQL level (e.g. 50 records per page), and my middleware filters the records only afterwards. Each page therefore contains fewer records than requested (e.g. 38), and the count varies from page to page.
We thought about loading everything and paginating on the client side, but our tables have grown considerably since we integrated our production data into the development setup, so loading a whole table without SQL-level pagination is really slow.
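
The short-page effect can be reproduced with a small Python sketch. The data and the `visible` flag are made up, and slicing the list stands in for the SQL-level page limit.

```python
# Ten hypothetical records; every third one is hidden.
records = [{"id": i, "visible": i % 3 != 0} for i in range(1, 11)]

PAGE_SIZE = 5


def page_then_filter(page: int) -> list:
    """Mimic the current setup: limit first (SQL), filter afterwards (middleware)."""
    start = (page - 1) * PAGE_SIZE
    limited = records[start:start + PAGE_SIZE]
    return [r for r in limited if r["visible"]]


def filter_then_page(page: int) -> list:
    """What stable pagination needs: filter first, then limit."""
    visible = [r for r in records if r["visible"]]
    start = (page - 1) * PAGE_SIZE
    return visible[start:start + PAGE_SIZE]
```

With this data, `page_then_filter` yields 4 records on page 1 and 3 on page 2, while `filter_then_page` fills page 1 completely.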

All of my problems would be solved if I could simply modify the requests to add filters on joined tables, so that e.g.
/records/personality would be preprocessed in my middleware to /records/personality?join=personality2ou,ou&filter=personality2ou.visible_internet=1&filter=... or a similar syntax.

Maurits van der Schee · Answer 3 · Thu Feb 01 2024 21:35:28 GMT+0800 (China Standard Time)

> Is there a specific reason why this is not supported?

It is not supported, because the code does not do joins, but applies the security model on each table read and stitches the results together (like a left join does).
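
A rough Python sketch of that stitching idea follows. It is not the library's actual code: `read_table`, `may_read`, `stitch`, and the in-memory tables are hypothetical, and only a single to-many relation is handled.

```python
from collections import defaultdict

# Hypothetical in-memory tables standing in for separate SQL reads.
TABLES = {
    "personality": [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}],
    "personality2ou": [
        {"id": 10, "personality_id": 1, "ou_id": 7, "visible": 1},
        {"id": 11, "personality_id": 1, "ou_id": 8, "visible": 0},
    ],
}


def may_read(table: str, row: dict) -> bool:
    """Hypothetical per-table security rule, applied on every table read."""
    return bool(row.get("visible", 1))


def read_table(table: str) -> list:
    """One read per table, each with the security rule applied."""
    return [row for row in TABLES[table] if may_read(table, row)]


def stitch(parent: str, child: str, fk: str) -> list:
    """Attach child rows to their parents; parents without children are kept."""
    children = defaultdict(list)
    for row in read_table(child):
        children[row[fk]].append(row)
    return [dict(p, **{child: children.get(p["id"], [])}) for p in read_table(parent)]
```

In this sketch person B stays in the result with an empty child list, which is the left-join-like behaviour described above. A filter on the child table would have to remove B from the parent list, and nothing in this structure does that.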

Without the security model you end up with something like https://www.pathql.org

PathQL is much better (faster, simpler) and more versatile than TreeQL, but does not apply a security model on the data (you need to define that in the database).

Maybe you can combine the two solutions?

Lucas Kinne · Answer 4 · Thu Feb 01 2024 23:15:45 GMT+0800 (China Standard Time)

Thanks for the explanation. I see why this is not possible now.

A combination would not work, because:

solution 1: the two queries would have to be done on the API side for security reasons, which means I would have to 'hack' the whole 1 SQL-Query per 1 HTTP-Request architecture of this API
solution 2: multiple tables have to be filtered at the same time, so it does not matter which table I start from

I did a bit of brainstorming and came up with possible mitigations/solutions:

narrow down the list by filters on the main table: in our case it could be sufficient not to display the whole list of entries, but only entries selected by filters on the main table, e.g. "only show child OUs whose parent OU is XYZ", which shortens the list so that pagination is no longer needed
use redundant columns on the main table (basically like the solution above): redundant columns could be added to the main table that replicate the joined-table data you want to filter on, but these columns would have to be kept in sync with the 'original' data
write a custom middleware for pagination: a middleware could first execute a raw SQL query against the database, which e.g. returns the IDs of all persons the current user is allowed to see under our policy, and then manipulate the request to load only the allowed records via batch read
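
The third idea could be sketched as follows, in Python with an in-memory SQLite database for illustration. The schema, the `visible` columns, and the helper names are assumptions; the comma-separated ID path is meant to represent the API's batch read and should be checked against the deployed version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE personality (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ou (id INTEGER PRIMARY KEY, visible INTEGER);
    CREATE TABLE personality2ou (personality_id INTEGER, ou_id INTEGER, visible INTEGER);
    INSERT INTO personality VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO ou VALUES (10, 1), (11, 0);
    INSERT INTO personality2ou VALUES
        (1, 10, 1), (1, 11, 1), (2, 11, 1), (3, 10, 0), (4, 10, 1);
""")


def allowed_ids(page: int, size: int) -> list:
    """One page of person IDs with at least one visible mapping to a visible OU."""
    rows = conn.execute(
        """
        SELECT DISTINCT p.id
        FROM personality p
        JOIN personality2ou m ON m.personality_id = p.id AND m.visible = 1
        JOIN ou o ON o.id = m.ou_id AND o.visible = 1
        ORDER BY p.id
        LIMIT ? OFFSET ?
        """,
        (size, (page - 1) * size),
    ).fetchall()
    return [row[0] for row in rows]


def batch_read_path(ids: list) -> str:
    """Rewrite the request so that only the allowed records are loaded."""
    return "/records/personality/" + ",".join(str(i) for i in ids)
```

Pagination happens in the raw query here, so every page holds exactly `size` allowed persons until the data runs out. The nested records would still need the visibility post-processing, since the batch read itself knows nothing about the policy.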

Additionally, it will (as always) be useful to utilize caches so that even if you have to bite the bullet to load the complete list of records to modify it later on the client-side, only the first query will take a while and subsequent (re)loads will be served from the client's cache without needing to perform another heavy query.

jaleonardo · Answer 5 · Fri Feb 09 2024 19:59:38 GMT+0800 (China Standard Time)

Just an opinion. If your use case requires a join that is consistently used or needed, maybe it's better to join the tables on the DBMS side and expose the result as a view, instead of dynamically building the joined data through the API. This way, the view can be accessed through the api.php/records/ endpoint and the filter will work as-is.
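
A minimal sketch of the view approach, using Python with an in-memory SQLite database so that it is runnable. The schema, the view name `personality_ou`, and the column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE personality (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ou (id INTEGER PRIMARY KEY, name TEXT, visible INTEGER);
    CREATE TABLE personality2ou (
        personality_id INTEGER, ou_id INTEGER, visible_internet INTEGER);
    INSERT INTO personality VALUES (1, 'A'), (2, 'B');
    INSERT INTO ou VALUES (10, 'X', 1), (11, 'Y', 0);
    INSERT INTO personality2ou VALUES (1, 10, 1), (1, 11, 1), (2, 10, 0);

    -- Hypothetical view that performs the join inside the database.
    CREATE VIEW personality_ou AS
    SELECT p.id AS personality_id, p.name AS personality_name,
           o.id AS ou_id, o.name AS ou_name,
           o.visible AS ou_visible, m.visible_internet
    FROM personality p
    JOIN personality2ou m ON m.personality_id = p.id
    JOIN ou o ON o.id = m.ou_id;
""")

# The view is flat, so ordinary column filters reach the formerly joined columns.
rows = conn.execute(
    "SELECT personality_name, ou_name FROM personality_ou "
    "WHERE ou_visible = 1 AND visible_internet = 1"
).fetchall()
```

Through the API this would presumably correspond to a request such as `api.php/records/personality_ou?filter=ou_visible,eq,1&filter=visible_internet,eq,1`, assuming the `column,match,value` filter syntax and that repeated `filter` parameters are combined with AND.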

Lucas Kinne · Answer 6 · Sat Feb 10 2024 18:30:20 GMT+0800 (China Standard Time)

Thanks for the tip. I'll keep it in mind in case we have related performance issues at some point.

It would require us to restructure the client parsing though, because the tables would not be returned nested, but flat.
Additionally, I don't think it would solve our pagination issues. If the view returns the data like this

personality_pk	personality_attr	ou_pk	ou_attr
A	F	X	M
A	F	Y	O
B	G	Z	N
C	H	X	M

and I would like to paginate over the personalities (after they are filtered by the visibility attributes) with a batch size of 2, it would only return the rows of personality A, because that person is in a 1-to-2 relationship.
What you would normally do is use a DISTINCT, but as far as I know TreeQL does not support this either.
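
The effect can be checked with the four example rows above. The sketch below uses Python with an in-memory SQLite table named `flat_view` (a hypothetical name) in place of the real view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flat_view (
        personality_pk TEXT, personality_attr TEXT, ou_pk TEXT, ou_attr TEXT);
    INSERT INTO flat_view VALUES
        ('A', 'F', 'X', 'M'), ('A', 'F', 'Y', 'O'),
        ('B', 'G', 'Z', 'N'), ('C', 'H', 'X', 'M');
""")

# Paging the flat rows: a batch of 2 is used up by person A alone.
row_page = conn.execute(
    "SELECT personality_pk, ou_pk FROM flat_view "
    "ORDER BY personality_pk, ou_pk LIMIT 2"
).fetchall()

# Paging distinct persons first, then fetching all rows for that page of persons.
person_page = [r[0] for r in conn.execute(
    "SELECT DISTINCT personality_pk FROM flat_view ORDER BY personality_pk LIMIT 2"
)]
placeholders = ",".join("?" * len(person_page))
rows_for_page = conn.execute(
    f"SELECT personality_pk, ou_pk FROM flat_view "
    f"WHERE personality_pk IN ({placeholders}) ORDER BY personality_pk, ou_pk",
    person_page,
).fetchall()
```

The first query returns only rows of A, while the DISTINCT-based variant returns persons A and B with all three of their rows.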

Maurits van der Schee · Answer 7 · Sat Feb 10 2024 19:57:51 GMT+0800 (China Standard Time)

@Dherlou Thank you for your explanation. I understand that there are features that one may miss in this software. I think I also understand the cases in which it would be good to have real joins. Also I think most of these cases can be covered by doing multiple http requests (using 'in' filter).
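
The multiple-request approach might be sketched like this in Python. The `/records` prefix, the helper names, and the `visible_internet` column are assumptions; the `filter`, `include`, and `page` parameters are used in their documented comma-separated form, which should be verified against the API version in use.

```python
from urllib.parse import urlencode

BASE = "/records"  # hypothetical API prefix


def visible_mappings_url() -> str:
    """Request 1: mappings that pass the (hypothetical) visibility filter."""
    query = urlencode(
        {"filter": "visible_internet,eq,1", "include": "personality_id"},
        safe=",",
    )
    return f"{BASE}/personality2ou?{query}"


def persons_url(mappings: list, page: int, size: int) -> str:
    """Request 2: persons restricted with an 'in' filter, paginated by the API."""
    ids = sorted({m["personality_id"] for m in mappings})
    query = urlencode(
        {"filter": "id,in," + ",".join(map(str, ids)), "page": f"{page},{size}"},
        safe=",",
    )
    return f"{BASE}/personality?{query}"
```

Since the second request filters on the main table, pagination applies to the already restricted set. With many matching IDs the URL can become very long, so this fits best when the first request returns a manageable number of rows.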

> I would have to 'hack' the whole 1 SQL-Query per 1 HTTP-Request architecture of this API

There is no such architecture. There are multiple queries per http request (when you use joins) and all queries respect the security model. There are no filters, nor pagination on nested objects. Filters on nested objects should do a natural join (I guess that is what you are after) while applying the security model requires a left join. This difference in join types is one of the problems you run into when implementing filters on nested objects.
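
The difference between the two join types can be illustrated with a small Python and SQLite sketch. It assumes that "natural join" here means an inner join; the tables and the `visible` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE personality (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE personality2ou (personality_id INTEGER, ou_id INTEGER, visible INTEGER);
    INSERT INTO personality VALUES (1, 'A'), (2, 'B');
    INSERT INTO personality2ou VALUES (1, 10, 1), (2, 11, 0);
""")

# Left join with the rule in the ON clause: B stays, with no mapping attached.
left_rows = conn.execute("""
    SELECT p.name, m.ou_id FROM personality p
    LEFT JOIN personality2ou m ON m.personality_id = p.id AND m.visible = 1
    ORDER BY p.id
""").fetchall()

# Inner join with the same condition: B disappears from the result.
inner_rows = conn.execute("""
    SELECT p.name, m.ou_id FROM personality p
    JOIN personality2ou m ON m.personality_id = p.id AND m.visible = 1
    ORDER BY p.id
""").fetchall()
```

The left join keeps person B with an empty mapping, which matches hiding a nested object for security reasons. The inner join removes B entirely, which is what a filter on a nested object would be expected to do.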

My opinion is that once you need complex queries and need more or less full SQL freedom you might as well apply your security model in the database and send SQL to the endpoint. This is where the idea of PathQL was born, see https://www.pathql.org

NB: PathQL returns nested json and is very much comparable with solutions like GraphQL.