cube2222 / octosql

OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.

result of StreamJoin or OuterJoin is not equal with database

Lvnszn opened this issue

I looked at the source code carefully.
In the Join-related code, rows are read from the left table and the right table asynchronously and then sent to a channel (chan) for consumption. This causes a problem: a row from the left table can arrive before the right table has delivered the row that matches it. The left row is then emitted as a record that is not associated with any right row, even though a matching right row arrives a few rows later. I think this asynchronous design makes the final result smaller than the set of rows that can actually be matched.

Hey! Do you have a reproduction?

Outer join will use retractions to retract the early "not matched" record if the left table receives a record before the right one.

StreamJoin only sends matches so should work regardless of retractions.
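
The retraction behaviour described above for outer joins can be illustrated with a small Go sketch. This is not OctoSQL's actual code: the `Record` and `Event` types and the `LeftOuterJoin` function are hypothetical names made up for illustration, and an empty `Right` string stands in for NULL. It only shows the idea that an early "not matched" row is followed by a retraction once the matching right row arrives.

```go
package main

import "fmt"

// Record is a hypothetical output row. Retraction marks an "undo" of an
// earlier identical row. An empty Right stands in for NULL (an assumption).
type Record struct {
	Key        int
	Left       string
	Right      string
	Retraction bool
}

// Event is one input row arriving from either side of the join.
type Event struct {
	FromLeft bool
	Key      int
	Value    string
}

// LeftOuterJoin consumes events in arrival order and emits output records,
// retracting early "not matched" rows once a match shows up.
func LeftOuterJoin(events []Event) []Record {
	left := map[int][]string{}
	right := map[int][]string{}
	var out []Record
	for _, e := range events {
		if e.FromLeft {
			left[e.Key] = append(left[e.Key], e.Value)
			if len(right[e.Key]) == 0 {
				// No match seen yet: emit the row with a NULL right side.
				out = append(out, Record{Key: e.Key, Left: e.Value})
				continue
			}
			for _, r := range right[e.Key] {
				out = append(out, Record{Key: e.Key, Left: e.Value, Right: r})
			}
			continue
		}
		if len(right[e.Key]) == 0 {
			// First right row for this key: undo the unmatched left rows.
			for _, l := range left[e.Key] {
				out = append(out, Record{Key: e.Key, Left: l, Retraction: true})
			}
		}
		right[e.Key] = append(right[e.Key], e.Value)
		for _, l := range left[e.Key] {
			out = append(out, Record{Key: e.Key, Left: l, Right: e.Value})
		}
	}
	return out
}

func main() {
	// The left row arrives before its matching right row.
	events := []Event{
		{FromLeft: true, Key: 1, Value: "a"},
		{FromLeft: false, Key: 1, Value: "x"},
	}
	for _, r := range LeftOuterJoin(events) {
		fmt.Printf("%+v\n", r)
	}
}
```

In this sketch the stream contains three records for the late match: the unmatched row, its retraction, and the matched row. A consumer that applies retractions ends up with only the matched row.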

Thanks for your reply.
How are these retractions triggered? Looking at how the output is printed, I see that whether the produce function is called is decided according to a quarter of the time.

Try using the batch_table or stream_native output formats. With JSON it will indeed print both the send and the retraction as normal records (which is not good). This could be improved by using a batch JSON printer as the output when retractions are possible, or by adding an undo field that is true on JSON outputs that are retractions.
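
The undo field suggested above might look roughly like the following Go sketch. It is only an illustration of the proposal, not an existing OctoSQL output format: the `EncodeRow` function and the row layout are assumptions, and only the field name `undo` comes from the comment.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EncodeRow is a hypothetical helper: it serializes one output row as a
// JSON object and adds an "undo" field that is true for retractions.
func EncodeRow(fields map[string]interface{}, retraction bool) (string, error) {
	row := make(map[string]interface{}, len(fields)+1)
	for k, v := range fields {
		row[k] = v
	}
	row["undo"] = retraction
	// json.Marshal writes map keys in sorted order, so output is stable.
	b, err := json.Marshal(row)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	fields := map[string]interface{}{"id": 1, "name": "a"}
	for _, retraction := range []bool{false, true} {
		line, err := EncodeRow(fields, retraction)
		if err != nil {
			panic(err)
		}
		fmt.Println(line)
	}
}
```

With such a field, a JSON consumer could tell a retraction apart from a normal record instead of seeing two seemingly identical rows.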

You can actually work around this by using an ORDER BY, which forces buffering and processes all retractions before outputting anything. Basically:

SELECT .... ORDER BY true
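
Why buffering helps can be sketched in Go as follows. This is a simplified model under stated assumptions, not how OctoSQL implements ORDER BY: the `Row` type and `Settle` function are hypothetical, and each row is reduced to a single string value. The whole stream is consumed first, every retraction cancels one earlier copy of the same row, and only the surviving rows are returned.

```go
package main

import "fmt"

// Row is a hypothetical stream element: a row value plus a flag telling
// whether it retracts an earlier identical row.
type Row struct {
	Value      string
	Retraction bool
}

// Settle buffers the whole stream, applies every retraction, and returns
// only the rows that survive, in order of first appearance.
func Settle(stream []Row) []string {
	counts := map[string]int{}
	var order []string
	for _, r := range stream {
		if _, seen := counts[r.Value]; !seen {
			order = append(order, r.Value)
		}
		if r.Retraction {
			counts[r.Value]--
		} else {
			counts[r.Value]++
		}
	}
	var out []string
	for _, v := range order {
		for i := 0; i < counts[v]; i++ {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	stream := []Row{
		{Value: "1,a,NULL"},
		{Value: "1,a,NULL", Retraction: true},
		{Value: "1,a,x"},
	}
	fmt.Println(Settle(stream))
}
```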

thanks for your reply