sinanazem / foursquare-location-matching

Deduplication on Foursquare Location Matching

This repository implements a machine learning project for deduplicating Foursquare location data.

The goal is to match records that refer to the same physical location, which improves the quality of location-based services.





Introduction

Deduplication of location data is essential for maintaining accurate and reliable datasets, especially for applications like location-based services, recommendations, and navigation. This project aims to address the problem of identifying and merging duplicate location records in Foursquare's dataset.

Dataset

The dataset used in this project consists of location records from Foursquare. Each record includes information such as the name, address, latitude, longitude, and other attributes of a location. The dataset has been preprocessed to remove obvious errors and standardize formats.
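As a rough illustration of the record shape described above, the sketch below parses CSV text into typed records. The column names (`id`, `name`, `address`, `latitude`, `longitude`), the `LocationRecord` class, and the sample rows are assumptions made up for this example; the actual files under `data/` may use different fields.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LocationRecord:
    """One location record. Field names are assumed, not taken from the repository."""
    id: str
    name: str
    address: Optional[str]
    latitude: float
    longitude: float


def load_records(csv_text: str) -> List[LocationRecord]:
    """Parse CSV text into records, mapping empty address cells to None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        records.append(
            LocationRecord(
                id=row["id"],
                name=row["name"],
                address=row["address"] or None,
                latitude=float(row["latitude"]),
                longitude=float(row["longitude"]),
            )
        )
    return records


# Made-up sample rows, for illustration only.
SAMPLE_CSV = """id,name,address,latitude,longitude
a1,Example Cafe,1 Example St,37.7825,-122.4078
a2,Example Caffe,,37.7826,-122.4077
"""

records = load_records(SAMPLE_CSV)
```

Keeping a missing address as `None` rather than an empty string makes it easier for later steps to tell "no address recorded" apart from "address present but blank after cleaning".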

Project Structure

.
├── data
│   ├── raw
│   └── processed
├── notebooks
├── src
│   ├── data_preprocessing.py
│   ├── feature_engineering.py
│   ├── model.py
│   └── evaluate.py
├── results
├── tests
├── README.md
└── requirements.txt
  • data: Contains raw and processed datasets.
  • notebooks: Jupyter notebooks for exploratory data analysis and model experimentation.
  • src: Source code for data preprocessing, feature engineering, modeling, and evaluation.
  • results: Directory for storing model outputs and evaluation results.
  • tests: Unit tests for the project.
  • README.md: Project documentation.
  • requirements.txt: List of dependencies required to run the project.

Installation

  1. Clone this repository:

    git clone https://github.com/sinanazem/foursquare-deduplication.git
    cd foursquare-deduplication
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate
  3. Install the required dependencies:

    pip install -r requirements.txt

Usage

Data Preprocessing:

    python src/data_preprocessing.py

Models and Methods

Preprocessing

  • Data Cleaning: Removal of duplicate records, handling missing values, and standardizing formats.
  • Geocoding: Ensuring consistency in geographic coordinates.
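The cleaning steps listed above could look roughly like the following. This is a minimal sketch using only the standard library, not the code in `src/data_preprocessing.py`; the function names and the dictionary keys (`name`, `address`, `latitude`, `longitude`) are hypothetical.

```python
import re
import unicodedata
from typing import List, Optional


def normalize_text(value: Optional[str]) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace. None becomes ''."""
    if value is None:
        return ""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a word character or whitespace with a space.
    cleaned = re.sub(r"[^\w\s]", " ", without_accents.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def has_valid_coordinates(latitude: float, longitude: float) -> bool:
    """Check that a coordinate pair lies within the valid latitude/longitude ranges."""
    return -90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0


def drop_exact_duplicates(rows: List[dict]) -> List[dict]:
    """Keep the first row for each normalized (name, address, lat, lon) key."""
    seen = set()
    unique = []
    for row in rows:
        key = (
            normalize_text(row.get("name")),
            normalize_text(row.get("address")),
            round(row["latitude"], 6),
            round(row["longitude"], 6),
        )
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Note that this only removes rows that become identical after normalization. Near-duplicates, such as the same venue with a slightly different name or coordinates, are left for the matching model.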

Feature Engineering

  • Text Features: Tokenization, TF-IDF, and other text processing techniques for location names and addresses.
  • Geographical Features: Distance calculations between coordinates.
  • Categorical Features: Encoding of categorical variables.
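To make two of the feature types above concrete, here is a small standard-library sketch of a great-circle distance and a TF-IDF weighting. It is illustrative only and not taken from `src/feature_engineering.py`: the function names are hypothetical, the tokenizer is a plain whitespace split, and the IDF uses one common smoothed variant, `ln((1 + N) / (1 + df)) + 1`.

```python
import math
from collections import Counter
from typing import Dict, List


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    radius_km = 6371.0088  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def tfidf_vectors(documents: List[str]) -> List[Dict[str, float]]:
    """Return one sparse {token: weight} TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {
                token: (count / len(tokens))
                * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1.0)
                for token, count in counts.items()
            }
        )
    return vectors
```

Tokens that appear in every name, such as a common word like "cafe" in a list of cafes, receive a lower weight than rare tokens, so rare tokens dominate any similarity computed from these vectors.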

Model

  • Similarity-Based Methods: Cosine similarity, Jaccard similarity, and others for text comparison.
  • Machine Learning Models: Random Forest, Gradient Boosting, and other classifiers.
  • Deep Learning Models: Siamese networks for learning similarity between pairs of records.
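The similarity-based approach in the first bullet can be sketched as follows. This example covers only that bullet, not the classifier or Siamese-network variants, and it is not the code in `src/model.py`. The function names, the character-trigram representation, and the thresholds (`name_threshold`, `max_distance_km`) are assumptions chosen for illustration. The last function groups matched pairs into clusters with a union-find structure, which is one common way to turn pairwise matches into merged records.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of the lowercase text, padded with one space per side."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity between character n-gram count vectors."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * grams_b[gram] for gram, count in grams_a.items())
    norm_a = math.sqrt(sum(c * c for c in grams_a.values()))
    norm_b = math.sqrt(sum(c * c for c in grams_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_match(name_a: str, name_b: str, distance_km: float,
             name_threshold: float = 0.6, max_distance_km: float = 0.2) -> bool:
    """Rule-based match: names are similar enough and the points are close enough.

    The default thresholds are hypothetical and would need tuning on labeled pairs.
    """
    name_score = max(jaccard_similarity(name_a, name_b), cosine_similarity(name_a, name_b))
    return distance_km <= max_distance_km and name_score >= name_threshold


def group_matches(n_records: int, matched_pairs: List[Tuple[int, int]]) -> List[List[int]]:
    """Group record indices into clusters of transitively matched records (union-find)."""
    parent = list(range(n_records))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[int, List[int]] = {}
    for i in range(n_records):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Because grouping is transitive, a single false-positive pair can merge two otherwise separate clusters, so a stricter threshold is often preferable when matches are merged this way.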

About

License: MIT License

