parallelism causes problems with very large datasets

Question

scrollsaw opened this issue 5 years ago

An issue I've noticed on large datasets: once the input gets very large (more than 1.5 million rows in my case), SQL Server chooses a parallel plan for the query. The cluster connections are then not calculated correctly, and you only get a fraction of the clusters back. I'm guessing it's because, when running in parallel, each chunk of the query doesn't know about the others. You can reproduce this by running the query against larger and larger datasets until SQL Server produces a parallel plan.

A workaround is to add OPTION (MAXDOP 1) to the query, like so:

SELECT dbo.TCC(id1, id2) FROM dbo.TestData OPTION (MAXDOP 1);

This forces a serial plan for the query, and the clusters are then returned correctly.

Davide Mauri · Answer 1 · Fri Jun 07 2019 04:34:49 GMT+0800 (China Standard Time)

Thanks a lot for reporting this. With a parallel plan, SQL Server uses the aggregate's Merge method to combine the partial results from different threads into one. I'll try to run some tests as soon as possible to figure out what's not working. In the meantime, thanks for the workaround!
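
For anyone following along, here is a minimal Python sketch of why the merge step matters. It is not the TCC code: `ClusterAggregate`, `run_serial`, and `run_parallel` are hypothetical names, and the union-find approach is only an assumption about how a clustering aggregate could be written. Each chunk builds its own partial state, and only a merge that replays the links from the other chunk recovers the clusters a serial run would produce.

```python
class ClusterAggregate:
    """Toy stand-in (hypothetical) for an aggregate that groups connected ids."""

    def __init__(self):
        self.parent = {}  # union-find forest: node -> parent node

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def accumulate(self, id1, id2):
        # Called once per input row: link the two ids into one cluster.
        a, b = self._find(id1), self._find(id2)
        if a != b:
            self.parent[b] = a

    def merge(self, other):
        # Called when two partial states meet: replay every link that the
        # other chunk recorded, so clusters spanning both chunks get joined.
        for node, parent in other.parent.items():
            self.accumulate(node, parent)

    def terminate(self):
        # Produce the final result: a sorted list of sorted clusters.
        clusters = {}
        for node in list(self.parent):
            clusters.setdefault(self._find(node), set()).add(node)
        return sorted(sorted(c) for c in clusters.values())


def run_serial(rows):
    agg = ClusterAggregate()
    for id1, id2 in rows:
        agg.accumulate(id1, id2)
    return agg.terminate()


def run_parallel(rows, split):
    # Simulate two workers, each seeing only its own chunk of rows.
    left, right = ClusterAggregate(), ClusterAggregate()
    for id1, id2 in rows[:split]:
        left.accumulate(id1, id2)
    for id1, id2 in rows[split:]:
        right.accumulate(id1, id2)
    left_only = left.terminate()  # what you get if the merge is lost
    left.merge(right)
    return left_only, left.terminate()


rows = [(1, 2), (3, 4), (2, 3), (5, 6)]
serial = run_serial(rows)
left_only, merged = run_parallel(rows, split=2)
print(serial)
print(left_only)
print(merged)
```

In this sketch the merged result matches the serial one, while the state of a single chunk on its own contains only part of the clusters, which resembles the symptom described in the report. Whether the real Merge method fails in this way is not established here.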