yorek / non-scalar-uda-transitive-closure

Transitive Closure Clustering with T-SQL, SQLCLR and JSON

Home Page: https://medium.com/p/dade18953fd2

parallelism causes problems with very large datasets

scrollsaw opened this issue

One issue I've noticed with large data sets: once the input gets very large (more than 1.5 million rows in my case), SQL Server builds a parallel plan for the query. The cluster connections are then calculated incorrectly, and you only get a fraction of the clusters back. I'm guessing that's because, when running in parallel, each chunk of the query doesn't know about the others. You can test for this by running the query against larger and larger data sets until SQL Server produces a parallel plan.

A solution is to just add OPTION (MAXDOP 1) to the query like so:

SELECT dbo.TCC(id1, id2) FROM dbo.TestData OPTION (MAXDOP 1);

This forces a serial plan, and the clusters are then returned correctly.

Thanks a lot for reporting this. A parallel plan uses the aggregate's Merge method to combine two partial results into one. I'll try to run some tests as soon as possible to figure out what's not working. In the meantime, thanks for the workaround!
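
For anyone following along, here is a rough Python sketch of what a correct merge step has to do. This is not the repository's SQLCLR code: the class `PartialClusters`, its methods, and the helper `build` are hypothetical names, loosely modeled on the Accumulate/Merge contract of SQLCLR user-defined aggregates. It assumes each parallel worker sees only a subset of the (id1, id2) pairs and keeps its own union-find state, so the merge must union clusters that share a member instead of simply keeping both lists side by side.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, Set, Tuple

Pair = Tuple[Hashable, Hashable]


class PartialClusters:
    """Hypothetical stand-in for one worker's partial aggregate state."""

    def __init__(self) -> None:
        # Union-find forest: each id points at its parent; roots point at themselves.
        self.parent: Dict[Hashable, Hashable] = {}

    def _find(self, x: Hashable) -> Hashable:
        # Register unseen ids as their own single-member cluster.
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walked path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def accumulate(self, a: Hashable, b: Hashable) -> None:
        # One input row (id1, id2): join the clusters containing a and b.
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self.parent[rb] = ra

    def merge(self, other: "PartialClusters") -> None:
        # Replay every (member, parent) link of the other state, so clusters
        # that overlap across workers end up unioned.
        for member, parent in other.parent.items():
            self.accumulate(member, parent)

    def clusters(self) -> Set[FrozenSet[Hashable]]:
        groups: Dict[Hashable, Set[Hashable]] = {}
        for member in list(self.parent):
            groups.setdefault(self._find(member), set()).add(member)
        return {frozenset(g) for g in groups.values()}


def build(pairs: Iterable[Pair]) -> PartialClusters:
    state = PartialClusters()
    for a, b in pairs:
        state.accumulate(a, b)
    return state


# Serial run over all rows (what MAXDOP 1 corresponds to in this sketch).
serial = build([(1, 2), (2, 3), (4, 5), (3, 4), (6, 7)])

# Same rows split across two hypothetical workers.
worker_a = build([(1, 2), (4, 5), (6, 7)])
worker_b = build([(2, 3), (3, 4)])

# Worker A alone still sees {1, 2} and {4, 5} as separate clusters.
clusters_a_before_merge = worker_a.clusters()

# A merge that unions overlapping clusters recovers the serial answer.
worker_a.merge(worker_b)
merged = worker_a.clusters()
```

In this sketch, worker A alone reports {1, 2} and {4, 5} as unrelated, which would match the guess that each chunk doesn't know about the others. Only after a merge that unions overlapping clusters do both runs agree, so a merge step that skips that union would be one possible explanation for the missing clusters.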