pramodh941 / AMD

Finding Similar Pairs using PySpark

AMD

#This project was done for the course Algorithms for Massive Datasets, taught by Prof. Dario Malchiodi at the University of Milan

#The code was executed in Google Colab
#You need to upload your Kaggle JSON file (kaggle.json) so that the dataset can be pulled from Kaggle
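
The Kaggle setup step can be sketched roughly as follows. This is a minimal sketch, not the project's own code: it assumes the uploaded file is named kaggle.json and follows the usual Kaggle API convention of keeping the token at ~/.kaggle/kaggle.json with owner-only permissions. The helper name `install_kaggle_token` and its `home` parameter are hypothetical and exist only for illustration.

```python
import shutil
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy an uploaded kaggle.json into <home>/.kaggle and restrict access.

    `home` is a hypothetical parameter for illustration and testing;
    by default the current user's home directory is used.
    """
    home_dir = Path(home) if home is not None else Path.home()
    target_dir = home_dir / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)

    # Owner read/write only, so the API key is not readable by other users.
    target.chmod(0o600)
    return target
```

In Colab this would typically be called once after the upload, for example `install_kaggle_token("kaggle.json")`, before any dataset download is attempted.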

#Description:
#Cleaned the data (removed unnecessary strings, code patterns, and stopwords)
#Tokenized the questions
#Implemented TF-IDF to extract feature vectors (8 features each)
#Passed the feature vectors to MinHashLSH (5 buckets)
#Used "approxNearestNeighbors" to group similar questions
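
To make the steps in the description concrete, here is a small pure-Python sketch of the same idea. It is not the project's PySpark code and does not call Spark at all. The stopword list, the cleaning regex, the hash functions, and all function names are assumptions made for illustration. The sketch reads the "5" in the description as the number of MinHash functions, and it skips the TF-IDF weighting, keeping only which of the 8 feature indices are present, since that set is what a MinHash signature is computed from.

```python
import random
import re
import zlib

# Tiny illustrative stopword list (an assumption; the project's list is not shown).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "how", "do", "i", "what"}

NUM_FEATURES = 8        # size of the hashed feature space, as in the description
NUM_HASHES = 5          # number of MinHash functions (assumed meaning of "5")
PRIME = 2_147_483_647   # 2^31 - 1, a prime much larger than NUM_FEATURES


def clean_and_tokenize(text):
    """Lowercase, replace non-letters with spaces, split, and drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]


def feature_indices(tokens):
    """Hash each token into one of NUM_FEATURES slots (the hashing trick)."""
    return {zlib.crc32(tok.encode("utf-8")) % NUM_FEATURES for tok in tokens}


def make_hash_functions(seed=42):
    """Build NUM_HASHES random linear hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(NUM_HASHES)]


def minhash_signature(indices, hash_functions):
    """Keep the minimum hash value per function over a non-empty index set."""
    return [min((a * (i + 1) + b) % PRIME for i in indices) for a, b in hash_functions]


def estimated_distance(sig_a, sig_b):
    """Fraction of signature positions that differ (approximate Jaccard distance)."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)


def nearest_questions(query, questions, k=2):
    """Return the k questions with the smallest estimated distance to the query."""
    hash_functions = make_hash_functions()
    query_idx = feature_indices(clean_and_tokenize(query))
    if not query_idx:  # MinHash is undefined for an empty set
        return []
    query_sig = minhash_signature(query_idx, hash_functions)

    scored = []
    for question in questions:
        idx = feature_indices(clean_and_tokenize(question))
        if not idx:  # skip questions that are empty after cleaning
            continue
        sig = minhash_signature(idx, hash_functions)
        scored.append((estimated_distance(query_sig, sig), question))
    return sorted(scored)[:k]


if __name__ == "__main__":
    sample = [
        "How do I sort a list in Python?",
        "Sorting a Python list",
        "What is the capital of France?",
    ]
    for distance, question in nearest_questions("How to sort a Python list", sample):
        print(f"{distance:.2f}  {question}")
```

With only 8 feature slots, unrelated questions often share indices, so the estimated distances are coarse. A larger feature space and more hash functions would likely separate questions better, at the cost of more computation.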