Load Data from Multiple Files with Balanced Load Distribution

Running the Bash Script (Ubuntu)

To load data from multiple files that contain different numbers of rows but the same number of columns, while keeping the load balanced, use the provided Bash script. The script is designed to run on Ubuntu. In this program, a total of 1M rows is split across 4 files containing 100000, 200000, 300000, and 400000 rows respectively, as shown in the figure below. Each process loads 250000 rows of data from the four files. The method used to load the data is in the kmeans.py file.
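The arithmetic behind the 250000-row share can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in kmeans.py: the function name `plan_balanced_load` and the tuple layout `(file_index, start_row, row_count)` are hypothetical, and four processes are assumed because 1M rows divided by 250000 rows per process gives four.

```python
def plan_balanced_load(rows_per_file, num_procs):
    """Return, for each process, a list of (file_index, start_row, row_count) slices.

    Hypothetical helper: files are consumed in order, and each process takes
    an equal share of the total rows, spilling into the next file if needed.
    """
    total = sum(rows_per_file)
    if total % num_procs != 0:
        raise ValueError("total rows must divide evenly among processes")
    share = total // num_procs

    plan = [[] for _ in range(num_procs)]
    file_idx, cursor = 0, 0  # current file and the next unread row in it
    for rank in range(num_procs):
        needed = share
        while needed > 0:
            available = rows_per_file[file_idx] - cursor
            if available == 0:
                # Current file is exhausted, move to the start of the next one.
                file_idx += 1
                cursor = 0
                continue
            take = min(needed, available)
            plan[rank].append((file_idx, cursor, take))
            cursor += take
            needed -= take
    return plan


plan = plan_balanced_load([100_000, 200_000, 300_000, 400_000], num_procs=4)
for rank, slices in enumerate(plan):
    print(rank, slices)
# 0 [(0, 0, 100000), (1, 0, 150000)]
# 1 [(1, 150000, 50000), (2, 0, 200000)]
# 2 [(2, 200000, 100000), (3, 0, 150000)]
# 3 [(3, 150000, 250000)]
```

Under these assumptions, processes 0 to 2 each read from two files, while process 3 reads its whole share from the last file.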

To execute the script:

Open your terminal in the directory where your program files are located.
Make sure the script has execute permissions. If not, you can grant execute permissions by running:
```
chmod +x part_2.sh
```
Run the script:
```
./part_2.sh
```

Program Description

Problem Statement

When dealing with multiple data files, it's common to have files with varying numbers of rows. This can be challenging when you want to distribute the data processing load evenly across available resources. The goal of this program is to load data from multiple files with different row counts while keeping the number of columns constant. By doing so, we aim to achieve a balanced load distribution for data processing tasks. Whenever the file size or file structure is changed, the starting cursor points must be set manually in the program (in kmeans.py, line 260 and line 266).
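To make the idea of a starting cursor point concrete, here is a minimal sketch of reading a fixed number of rows from a CSV file beginning at a given row. It is not the code at the lines mentioned above: the function `read_rows` and its parameters are hypothetical, and it assumes purely numeric CSV data.

```python
import csv
from itertools import islice


def read_rows(path, start_row, row_count, has_header=False):
    """Read row_count data rows from a CSV file, starting at start_row (0-based).

    Hypothetical helper; assumes every cell can be parsed as a float.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header line if the file has one
        # islice skips the first start_row rows, then yields up to row_count rows.
        rows = islice(reader, start_row, start_row + row_count)
        return [[float(value) for value in row] for row in rows]
```

For example, `read_rows("data_2.csv", 150000, 50000)` would return rows 150000 through 199999 of a file with the hypothetical name data_2.csv. If the file has fewer rows than requested, the function simply returns what is available.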

Program Workflow

The Bash script (part_2.sh) first identifies all data files in the current directory.
It calculates the total number of rows in each file and stores this information.
The script determines the file with the maximum number of rows and uses this as the reference for load balancing.
It then calculates the number of rows each file should contribute to maintain a balanced load. This is done by dividing the maximum row count by the number of files.
The script then uses relevant commands to extract the required number of rows from each file while preserving the constant number of columns.
The extracted data is processed as needed, and the balanced data is available for K-means analysis.
Finally, the script may perform additional post-processing tasks or display the K-means results.
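The first steps of the workflow above (finding the files, counting rows, taking the maximum, and dividing by the number of files) can be sketched as follows. The sketch is written in Python rather than Bash and is not taken from part_2.sh. The function names, the `*.csv` pattern, and the assumption that every line is a data row (no header) are all illustrative.

```python
from pathlib import Path


def count_rows(path):
    """Count the lines in a file (assumes one data row per line, no header)."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def summarize_files(directory, pattern="*.csv"):
    """Return per-file row counts, the maximum row count, and the per-file quota.

    The quota follows the description above: the maximum row count divided by
    the number of files (integer division).
    """
    files = sorted(Path(directory).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern} in {directory}")
    counts = {f.name: count_rows(f) for f in files}
    max_rows = max(counts.values())
    quota = max_rows // len(files)
    return counts, max_rows, quota
```

Calling `summarize_files(".")` in the program directory would return the three values for the CSV files found there.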

Results Displayed in the Console

The script provides various information in the console during its execution. Here's a list of results and information that can be shown:

Process Information: Each process may display its rank and step information. For example, "at process 0 step 1" and "at process 0 step 2" indicate loading data in two steps for process 0.
Data Loading Time: The script measures and displays the time taken for loading data from CSV files, e.g., "Process 0: Data loader took 2.3456 seconds."
Calculation Time: It displays the time taken for the actual K-means clustering calculations, e.g., "Process 0: Calculation took 1.2345 seconds."
Communication Time: The time spent on communication between processes using MPI is also displayed, e.g., "Process 0: Communication took 0.5678 seconds."
Total Time Breakdown: At the end of the script, the total time spent on data loading, communication, and calculation is summarized, e.g., "Data loader took 3.1234 seconds," "Communication took 0.7890 seconds," and "Calculation took 2.3456 seconds."
Finally, plots of the K-means iterations are saved in the images folder, as shown in the figure below.
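The per-process timing lines described above could be produced with a small helper like the one below. This is a sketch under stated assumptions, not the timing code in kmeans.py: the `timed` context manager and the `totals` dictionary are hypothetical, and `rank` is hard-coded here, whereas in an MPI run it would be the rank of the current process.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, rank, totals):
    """Time the enclosed block, print it per process, and add it to totals."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        totals[label] = totals.get(label, 0.0) + elapsed
        print(f"Process {rank}: {label} took {elapsed:.4f} seconds.")


totals = {}
rank = 0  # hypothetical; with MPI this would be the rank of the process

with timed("Data loader", rank, totals):
    data = [float(i) for i in range(1000)]  # stand-in for loading CSV rows

with timed("Calculation", rank, totals):
    mean = sum(data) / len(data)  # stand-in for the K-means update

# Summary at the end, one line per phase.
for label, seconds in totals.items():
    print(f"{label} took {seconds:.4f} seconds")
```

Accumulating into `totals` lets the same phase be timed several times (for example, once per loading step) and still be reported as a single total at the end.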

Customization

You can customize the script to meet your specific requirements, such as defining the number of columns or specifying the post-processing steps.

Note: Make sure you have the necessary data processing tools and dependencies installed to execute this script successfully.
