ebtelmarz / big_data_lsh_ensemble

MapReduce implementation of LSH Ensemble

LSH Ensemble

This is an assignment for the Big Data course at Roma Tre University.

This repo is based on the paper LSH Ensemble: Internet-Scale Domain Search.

Requirements

To run this project you need:

  • Python 3.6.9
  • Hadoop 3.2.1
  • Spark 3.0.0
  • pip3 installed on your machine. To install pip3, run the following commands in a shell:
sudo apt update
sudo apt install python3-pip
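
Before going further, it can help to confirm that the tools listed above are actually on your PATH. The following is a minimal sketch, not part of this repo: `have_tool` is a hypothetical helper, and the script assumes the Hadoop and Spark `bin` directories have already been added to PATH. It only reports what it finds; compare the printed versions with the ones listed above by eye.

```shell
#!/bin/sh
# have_tool is a hypothetical helper (not part of this repo):
# it succeeds if the given command can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Count how many of the required tools are missing.
missing=0
for tool in python3 pip3 hadoop spark-submit; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
        missing=$((missing + 1))
    fi
done
echo "$missing required tool(s) missing"

# Print the installed versions for the tools that were found.
if have_tool python3; then python3 --version; fi
if have_tool pip3; then pip3 --version; fi
if have_tool hadoop; then hadoop version; fi
if have_tool spark-submit; then spark-submit --version; fi
```

If a tool is reported as missing even though it is installed, the most likely cause is that its `bin` directory is not on PATH.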

Usage

To run the project locally

Start Hadoop: open a shell and run

$HADOOP_HOME/sbin/start-dfs.sh 
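
To check that HDFS actually came up, you can look for its daemons in the output of `jps`, the JVM process listing tool shipped with the JDK. This is a rough sketch under the assumption of a single-machine (pseudo-distributed) setup, where the NameNode and a DataNode run on the same host; `daemon_running` is a hypothetical helper and not part of this repo.

```shell
#!/bin/sh
# daemon_running is a hypothetical helper (not part of this repo):
# it succeeds if the daemon name ($2) appears as a whole word
# in a jps-style process listing ($1).
daemon_running() {
    printf '%s\n' "$1" | grep -qw "$2"
}

# Collect the list of running JVMs, if jps is available.
if command -v jps >/dev/null 2>&1; then
    listing=$(jps)
else
    listing=""
    echo "jps not found; is a JDK on PATH?"
fi

# Assumption: on a single machine both daemons should be listed.
for daemon in NameNode DataNode; do
    if daemon_running "$listing" "$daemon"; then
        echo "$daemon is running"
    else
        echo "$daemon is NOT running"
    fi
done
```

The whole-word match (`grep -w`) keeps `SecondaryNameNode` from being mistaken for `NameNode`.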

Download this repo or clone it by running

git clone https://github.com/ebtelmarz/big_data_lsh_ensemble.git

Move inside the downloaded directory

cd big_data_lsh_ensemble/

Execute the run.sh script by running in a shell

sh run.sh

To run the project on a cluster

Create a virtual environment and activate it

python3 -m venv my_env
source my_env/bin/activate

Execute the run.sh script by running

sh run.sh
