tugrul512bit / gpgpu-loadbalancerx

Simple load-balancing library for balancing GPGPU workloads between a GPU and a CPU, or across any number of devices in one computer or multiple computers.

gpgpu-loadbalancerx

Simple load-balancing library for balancing (GPGPU-type or other) workloads between GPUs (or any devices) in a computer, or across multiple computers in a cluster. A "hello-world" version with CUDA: vectorAdd.cu

On each run() call of a LoadBalancerX instance, the work distribution becomes fairer (the faster GPU/CPU receives more work). The API overhead per run() call is less than 50 microseconds (measured with an FX8150 CPU and 6 devices). To benefit from the run-time minimization, each grain sent to a device should take at least an amount of time comparable to this overhead, and the number of grains should be high enough that the load balancer can trade enough grains between devices to minimize total run time. A quick way to sanity-check this is sketched below.
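The following is only a minimal sketch of such a check, using the LoadBalanceLib::Bench helper exactly as it appears in the full example further down; the 20x margin is an arbitrary illustrative choice, and "lb" is assumed to be an already-populated LoadBalancerX instance.

// minimal sketch: verify that one load-balanced pass dominates the ~50 microsecond API overhead
// (assumes "lb" is the populated LoadBalancerX instance from the full example below)
size_t nanoseconds = 0;
{
	LoadBalanceLib::Bench bench(&nanoseconds); // writes elapsed nanoseconds for this scope, as in the example below
	lb.run();                                  // one load-balanced pass over all grains
}
if(nanoseconds < (size_t)50000 * 20) // 20x overhead is an arbitrary illustrative margin
{
	std::cout << "warning: grains may be too fine to hide the per-run API overhead" << std::endl;
}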

What can grain state be and what should a grain do?

  • Computation kernel for a single 16x16 tile of an image, processed by 256 CUDA threads, plus its data transfers over PCI-e
  • Sending work to another computer and waiting for the response by any means
  • Anything that can run serially or in parallel, as long as it completes its own task within its scope
  • Should use asynchronous methods in inputWork, computeWork and outputWork for optimum performance (see the sketch after this list)
  • Must synchronize in the outputWork or syncWork function
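Below is a hypothetical CUDA-flavored sketch of what these callbacks can look like. The struct fields, function names and buffers are illustrative only; the library itself only requires callables taking the device state and a grain-state reference, as in the full example further down. All copies and the kernel launch are asynchronous on a per-device stream, and only syncWork blocks.

// Hypothetical sketch (not part of the library): asynchronous CUDA work inside
// the callbacks, with blocking only in syncWork. Struct/buffer names are illustrative.
#include <cuda_runtime.h>
#include <map>

struct DeviceState                  // per-device state registered with addDevice()
{
	int gpuId;
	cudaStream_t stream;            // one stream per device for async copies/kernels
};

struct GrainState                   // per-grain state: one device buffer per GPU id
{
	std::map<int, float*> dIn;
	std::map<int, float*> dOut;
};

__global__ void scaleKernel(const float* in, float* out, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if(i < n) out[i] = 0.5f * in[i];
}

const int pixelsPerGrain = 5;
float hostIn[pixelsPerGrain];       // stands in for this grain's slice of the host buffers
float hostOut[pixelsPerGrain];

// initWork: called once per grain per device; allocate this grain's buffers on that GPU
void initWork(DeviceState gpu, GrainState& g)
{
	cudaSetDevice(gpu.gpuId);
	if(g.dIn.count(gpu.gpuId) == 0)
	{
		cudaMalloc((void**)&g.dIn[gpu.gpuId],  pixelsPerGrain * sizeof(float));
		cudaMalloc((void**)&g.dOut[gpu.gpuId], pixelsPerGrain * sizeof(float));
	}
}

// inputWork: asynchronous host-to-device copy on the device's stream
void inputWork(DeviceState gpu, GrainState& g)
{
	cudaSetDevice(gpu.gpuId);
	cudaMemcpyAsync(g.dIn[gpu.gpuId], hostIn, pixelsPerGrain * sizeof(float),
			cudaMemcpyHostToDevice, gpu.stream);
}

// computeWork: asynchronous kernel launch on the same stream
void computeWork(DeviceState gpu, GrainState& g)
{
	cudaSetDevice(gpu.gpuId);
	scaleKernel<<<1, 256, 0, gpu.stream>>>(g.dIn[gpu.gpuId], g.dOut[gpu.gpuId], pixelsPerGrain);
}

// outputWork: asynchronous device-to-host copy
void outputWork(DeviceState gpu, GrainState& g)
{
	cudaSetDevice(gpu.gpuId);
	cudaMemcpyAsync(hostOut, g.dOut[gpu.gpuId], pixelsPerGrain * sizeof(float),
			cudaMemcpyDeviceToHost, gpu.stream);
}

// syncWork: the only blocking call; wait for this grain's stream work to finish
void syncWork(DeviceState gpu, GrainState&)
{
	cudaSetDevice(gpu.gpuId);
	cudaStreamSynchronize(gpu.stream);
}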

What can a device state hold?

  • Device settings for launching a kernel, such as an OpenCL context handle for a GPU id
  • A CUDA GPU id
  • An object instance that holds I/O arrays for a GPU/FPGA/another CPU, or even a network connection that offloads work to a server
  • Anything that needs some temporary state (to be used during grain computation); a device-setup sketch follows this list
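As an illustration only, reusing the hypothetical DeviceState/GrainState structs from the sketch above and the ComputeDevice/addDevice calls that appear in the full example below, a device list could be populated like this (the stream creation and header name are assumptions, not part of the documented API):

// Hypothetical sketch: register one DeviceState per visible CUDA GPU.
// Reuses the DeviceState/GrainState structs from the previous sketch.
#include <cuda_runtime.h>
#include "LoadBalancerX.h" // assumed header name of this single-header library

void addAllCudaGpus(LoadBalanceLib::LoadBalancerX<DeviceState, GrainState>& lb)
{
	int numGpus = 0;
	cudaGetDeviceCount(&numGpus);
	for(int id = 0; id < numGpus; id++)
	{
		DeviceState state;
		state.gpuId = id;
		cudaSetDevice(id);
		cudaStreamCreate(&state.stream);  // per-device stream used by the callbacks above
		lb.addDevice(LoadBalanceLib::ComputeDevice<DeviceState>(state));
	}
}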

How does it work? (wiki: https://github.com/tugrul512bit/gpgpu-loadbalancerx/wiki)

  • User adds devices with their state objects or values
  • User adds work grains to be repeated on each run() call
  • Load balancer creates one dedicated CPU thread for each device
  • Load balancer selects a grain and a device
    • If the selected grain has not yet been initialized on the selected device, it runs the initWork function
    • Runs the inputWork function to copy data from host to device; the user should use asynchronous functions inside for performance
    • Runs the computeWork function to do the computation; the user should use asynchronous functions inside for performance
    • Runs the outputWork function to copy results from device to host; the user should use asynchronous functions inside for performance
    • Runs the syncWork function to synchronize all of the previous asynchronous work with the host
  • Load balancer synchronizes all dedicated device threads and returns to the user, carrying over the run-time minimization for the next run() call
  • After several repeats, the distribution converges to a fair share of work that reflects each device's performance, and the run time approaches the optimum

(diagram: how it works; image created in https://app.diagrams.net/)

// assumed header name of this single-header library
#include "LoadBalancerX.h"

#include <vector>
#include <map>
#include <iostream>
#include <thread>
#include <chrono>
#include <algorithm>

int main()
{

// number of chunks in a divide&conquer algorithm
const int grains = 1000;
const int pixelsPerGrain = 5;

// simulating pixel buffer in host for a GPGPU task
std::vector<float> input(grains*pixelsPerGrain);
std::vector<float> output(grains*pixelsPerGrain);

// simulate an image data
for(int i=0;i<grains*pixelsPerGrain;i++)
{
	input[i]=i&255;
}

// necessary device state information for all types of devices
class DeviceState
{
public:
	int gpuId;
};

// necessary grain state information
class GrainState
{
public:
	GrainState():whichGpuComputedMeLastTime(-1){}
	int whichGpuComputedMeLastTime;

	// just simulating a GPU's video-memory buffer
	std::map<int,std::vector<float>> cudaInputDevice;
	std::map<int,std::vector<float>> cudaOutputDevice;
};

// load balancer to distribute grains between devices fairly depending on their performance
LoadBalanceLib::LoadBalancerX<DeviceState, GrainState> lb;


for(int i=0;i<grains;i++)
{
	lb.addWork(LoadBalanceLib::GrainOfWork<DeviceState, GrainState>(
			[&,i](DeviceState gpu, GrainState& thisGrain){
				/* (async/sync) initialize grain's host/device environment (if necessary),
				 * called only once for lifetime of LoadBalancerX instance per device
				 */
				if(thisGrain.whichGpuComputedMeLastTime != gpu.gpuId)
				{
					thisGrain.cudaInputDevice[gpu.gpuId]=std::vector<float>(pixelsPerGrain); // simulating a cuda gpu buffer allocation
					thisGrain.cudaOutputDevice[gpu.gpuId]=std::vector<float>(pixelsPerGrain); // simulating a cuda gpu buffer allocation
					thisGrain.whichGpuComputedMeLastTime = gpu.gpuId;
				}
			},
			[&,i](DeviceState gpu, GrainState& thisGrain){
				/* (async) send input data to device (called for every run) */
				for(int j=0;j<pixelsPerGrain;j++)
					thisGrain.cudaInputDevice[gpu.gpuId][j]=input[i*pixelsPerGrain + j];
			},
			[&,i](DeviceState gpu, GrainState& thisGrain){
				/* (async) compute GPGPU task in device using input (called for every run) */

				// some simple color computation
				// (just simulating a cuda kernel)
				for(int j=0;j<pixelsPerGrain;j++)
					thisGrain.cudaOutputDevice[gpu.gpuId][j]=0.5f*thisGrain.cudaInputDevice[gpu.gpuId][j];
			},
			[&,i](DeviceState gpu, GrainState& thisGrain){
				/* (async) get results from device to host (called for every run) */

				for(int j=0;j<pixelsPerGrain;j++)
					output[i*pixelsPerGrain + j] = thisGrain.cudaOutputDevice[gpu.gpuId][j];
			},
			[&,i](DeviceState gpu, GrainState& thisGrain){
				/* (synchronous) synchronize this grain's work (called for every run) */

				// simulating cuda kernel synchronization
				// simulating different GPUs (bigger gpuId = low-end GPU)
				std::this_thread::sleep_for(std::chrono::milliseconds(2+gpu.gpuId));
			}
	));

}

lb.addDevice(LoadBalanceLib::ComputeDevice<DeviceState>({0})); // RTX3090
lb.addDevice(LoadBalanceLib::ComputeDevice<DeviceState>({1})); // RTX3070
lb.addDevice(LoadBalanceLib::ComputeDevice<DeviceState>({2})); // RTX3060 with overclock
lb.addDevice(LoadBalanceLib::ComputeDevice<DeviceState>({3})); // RTX3060
lb.addDevice(LoadBalanceLib::ComputeDevice<DeviceState>({4})); // for offloading to a server
lb.addDevice(LoadBalanceLib::ComputeDevice<DeviceState>({5})); // maybe some CPU cores



size_t nano;
{

	for(int i=0;i<20;i++)
	{
		{
			LoadBalanceLib::Bench bench(&nano); // writes elapsed nanoseconds for this scope
			lb.run();
		}
		std::cout<<nano<<"ns"<<std::endl;
		std::cout<<"performance ratios:"<<std::endl;
		auto performances = lb.getRelativePerformancesOfDevices();
		for(size_t j=0;j<performances.size();j++)
		{
			std::cout<<performances[j]<<"% ";
		}
		std::cout<<std::endl;
	}
}





for(int i=0;i<std::min(25,grains);i++)
{
	std::cout<<output[i]<<std::endl;
}

return 0;
}

output:

1181739684ns <------ bad start with 1.18 seconds
performance ratios:
16.6667% 16.6667% 16.6667% 16.6667% 16.6667% 16.6667% 

1074422792ns <------ faster
performance ratios:
19.4891% 17.5123% 16.5013% 15.8729% 15.4616% 15.1628% 

970820248ns  <------ faster
performance ratios:
22.2881% 18.3442% 16.3681% 15.0816% 14.2571% 13.661% 

861089248ns  <------ faster
performance ratios:
25.0974% 19.1938% 16.2083% 14.3014% 13.043% 12.156% 

756021596ns  <----- very close to optimum performance
performance ratios:
27.9079% 20.042% 16.0664% 13.4933% 11.8386% 10.652% 

659877380ns  <----- 0.65 seconds (2x performance of equal distribution)
performance ratios:
30.6945% 20.8847% 15.9153% 12.7279% 10.6311% 9.14648% 

659023379ns 
performance ratios:
30.69% 20.8868% 15.9086% 12.7347% 10.6316% 9.14839% 

655717420ns
performance ratios:
30.6905% 20.9072% 15.8724% 12.7362% 10.6319% 9.16177% 

655666552ns
performance ratios:
30.6857% 20.9003% 15.8632% 12.7275% 10.643% 9.18019% 

656228846ns
performance ratios:
30.6829% 20.8943% 15.8336% 12.7437% 10.6643% 9.18123% 

656478081ns
performance ratios:
30.6876% 20.9014% 15.8151% 12.7215% 10.6902% 9.18421% 

653987494ns
performance ratios:
30.6606% 20.898% 15.8112% 12.7402% 10.7102% 9.1799% 

654755954ns
performance ratios:
30.6738% 20.8798% 15.8082% 12.7467% 10.7251% 9.16641% 

655304544ns
performance ratios:
30.6821% 20.8701% 15.8057% 12.7455% 10.74% 9.15654% 

653814090ns
performance ratios:
30.7018% 20.8527% 15.7976% 12.7627% 10.7323% 9.15283% 

650336759ns
performance ratios:
30.7407% 20.8431% 15.7905% 12.7677% 10.7095% 9.14844% 

653457175ns
performance ratios:
30.7678% 20.8433% 15.7934% 12.7436% 10.7029% 9.14894% 

655730340ns
performance ratios:
30.7917% 20.8576% 15.7908% 12.736% 10.6823% 9.14162% 

655918707ns
performance ratios:
30.7905% 20.8661% 15.7948% 12.735% 10.6658% 9.14793% 

654000862ns
performance ratios:
30.7691% 20.8763% 15.8158% 12.7174% 10.6576% 9.16378% 

0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
5.5
6
6.5
7
7.5
8
8.5
9
9.5
10
10.5
11
11.5
12


License: GNU General Public License v3.0

