vittoriopolverino/mapreduce-wordcount

MapReduce Word Count

A naive Python implementation (no distributed computing) that mimics the MapReduce paradigm to help understand how it works.

📜 Table of Contents

About
Getting Started
Usage
Test
Built Using
Authors

🧐 About

MapReduce is a programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster. The "MapReduce System" is usually composed of three functions (or steps):

Map: The map function, also referred to as the map task, processes a single key/value input pair and produces a set of intermediate key/value pairs.
Shuffle: The shuffle function transfers data from the mappers to the reducers, grouping intermediate values by key. It is mandatory, because the reduce tasks take the shuffled output as their input and cannot start without it.
Reduce: The reduce function, also referred to as the reduce task, consists of taking all key/value pairs produced in the map phase that share the same intermediate key and producing zero, one, or more data items.
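
The three steps above can be sketched in a single process as follows. This is a minimal illustration, not the code in this repository: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and splitting on whitespace after lowercasing is an assumed, simplified form of text cleaning.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    # Assumption: lowercase + whitespace split stands in for real text cleaning.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all intermediate values that share the same key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce sequentially (no parallelism)."""
    intermediate = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(intermediate)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
    print(counts["the"])  # prints 3
```

In a real MapReduce system each map and reduce call could run on a different machine; here they run one after another, which is enough to see how the data flows between the steps.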

🏁 Getting Started

Use the Pipfile to install packages in the virtualenv:

pipenv install
pipenv install --dev

💻 Usage

Run the MapReduce example:

pipenv run wordcount

🐛 Test

Run the unit and integration tests:

pipenv run test

⛏️ Built Using

Python | Programming language
Pipenv | Dependency management
Pytest | Testing
Pre-Commit | Managing and maintaining hooks
GitHub Actions | CI/CD
clean-text | Data cleaning

✏️ Authors

Made with ❤️ by @vittoriopolverino
