SpasMilenkov / Document-Classification-Algorithm

The repository contains both single-process and Open MPI-based parallel implementations of a document classification algorithm. Developed as part of the TU Sofia Parallel Processing of Information course, this project showcases the use of C++ and Open MPI to efficiently classify documents based on a predefined catalog of topics and keywords.

Document-Classification-Algorithm

This repository contains implementations of a document classification algorithm using both single-process and MPI-based parallel processing techniques. This project is part of the TU Sofia Parallel Processing of Information course.

Description

The Document-Classification-Algorithm project demonstrates two different approaches to document classification:

  1. Single-Process Implementation: A straightforward implementation that processes documents one by one.
  2. Open MPI-Based Parallel Implementation: A parallel version that uses MPI to distribute the workload across multiple processes, so documents are classified concurrently rather than one at a time.

Both versions classify documents based on a catalog of topics and identifiers, reading from text files and determining the relevance of each document to different topics.

Single-Process Implementation

  • Reads a catalog of topics and identifiers.
  • Tokenizes and processes each document sequentially.
  • Outputs the classification results to a CSV file.
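
The three steps above can be sketched as follows. This is a minimal illustration and not the repository's actual code: the `Catalog` type, the function names `tokenize`, `score_document`, and `csv_row`, and the scoring rule (one point per keyword occurrence) are all assumptions made for this example.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical catalog type: topic name -> set of lowercase keywords.
using Catalog = std::map<std::string, std::set<std::string>>;

// Split text into lowercase alphanumeric tokens; everything else is a separator.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current.push_back(static_cast<char>(std::tolower(ch)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Count, per topic, how many tokens of the document are keywords of that topic.
// Assumed scoring rule: every occurrence of a keyword adds one point.
std::map<std::string, int> score_document(const Catalog& catalog,
                                          const std::string& text) {
    assert(!catalog.empty());
    std::map<std::string, int> scores;
    for (const auto& entry : catalog) scores[entry.first] = 0;
    for (const auto& token : tokenize(text)) {
        for (const auto& entry : catalog) {
            if (entry.second.count(token) > 0) ++scores[entry.first];
        }
    }
    return scores;
}

// Build one CSV row: document name, then the scores in alphabetical topic order.
std::string csv_row(const std::string& name,
                    const std::map<std::string, int>& scores) {
    std::ostringstream row;
    row << name;
    for (const auto& entry : scores) row << ',' << entry.second;
    return row.str();
}
```

A sequential driver would call `score_document` once per document and write each `csv_row` result to the output file. Because `std::map` iterates in key order, the score columns line up across rows.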

MPI-Based Parallel Implementation

  • Utilizes MPI for parallel processing.
  • Distributes document classification tasks across multiple processes.
  • Aggregates and outputs the results.
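
How the documents are split among processes is not stated here, so the sketch below assumes a simple round-robin scheme. The function `documents_for_rank` is hypothetical. In an MPI program, `rank` and `size` would come from `MPI_Comm_rank` and `MPI_Comm_size`, and the per-process results could then be collected on one process, for example with `MPI_Gatherv`. The function itself is plain C++, so the partitioning logic can be checked without an MPI runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices of the documents that process `rank` (out of `size` processes)
// would handle under an assumed round-robin split:
// rank r takes documents r, r + size, r + 2 * size, ...
std::vector<std::size_t> documents_for_rank(int rank, int size,
                                            std::size_t document_count) {
    assert(size > 0 && rank >= 0 && rank < size);
    std::vector<std::size_t> indices;
    const std::size_t step = static_cast<std::size_t>(size);
    for (std::size_t i = static_cast<std::size_t>(rank); i < document_count; i += step) {
        indices.push_back(i);
    }
    return indices;
}
```

With 10 documents and 4 processes, rank 1 would receive documents 1, 5, and 9. Every document is assigned to exactly one rank, and the per-rank loads differ by at most one document.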

Technologies Used

  • C++: The core programming language used for both implementations.
  • Open MPI (Message Passing Interface): Used for parallel processing in the MPI-based implementation.
  • Standard Template Library (STL): Utilized for data structures and algorithms.
  • Filesystem Library: Used for directory and file handling.
  • Chrono Library: Used for measuring execution time.
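
As an illustration of the last two items, the sketch below lists the documents in a directory with `std::filesystem` and times a piece of work with `std::chrono`. It assumes a C++17 compiler and that documents are `.txt` files. The helper names `list_documents` and `elapsed_ms` are hypothetical and not taken from the repository.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Collect the regular files in `directory` that have the given extension,
// sorted by path so the processing order is deterministic.
std::vector<fs::path> list_documents(const fs::path& directory,
                                     const std::string& extension = ".txt") {
    assert(fs::is_directory(directory));
    std::vector<fs::path> files;
    for (const auto& entry : fs::directory_iterator(directory)) {
        if (entry.is_regular_file() &&
            entry.path().extension().string() == extension) {
            files.push_back(entry.path());
        }
    }
    std::sort(files.begin(), files.end());
    return files;
}

// Run a callable once and return the elapsed wall-clock time in milliseconds.
// steady_clock is used because it never jumps backwards.
template <typename Work>
double elapsed_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the whole classification loop in `elapsed_ms` gives a single figure that can be compared between the single-process and MPI runs.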

Installation

  1. Clone the repository:
    git clone https://github.com/SpasMilenkov/Document-Classification-Algorithm.git
  2. Navigate to the project directory:
    cd Document-Classification-Algorithm

For Single-Process Implementation

  1. Compile the single-process code:
    g++ ./main.cpp -o single_classification

For MPI-Based Parallel Implementation

  1. Compile the MPI-based code:
    mpicxx -I/usr/lib/x86_64-linux-gnu/openmpi/include main.cpp -o mpi_classification

Usage

Single-Process Implementation

  1. Run the single-process classification:
    ./single_classification

MPI-Based Parallel Implementation

  1. Run the MPI-based classification (example with 4 processes):
    mpirun -np 4 ./mpi_classification
    Here, 4 is the number of processes that Open MPI should launch.

License

This project is licensed under the MIT License - see the LICENSE file for details.


Languages

CSS 27.7%, TeX 24.1%, Shell 14.3%, HTML 11.5%, JavaScript 9.2%, C++ 7.0%, C 4.3%, CMake 1.7%, Makefile 0.1%