abhishek-kathuria / Email-Clustering

Detection of Corporate Fraud using k-means and hierarchical clustering techniques on Enron Email dataset.

Corporate Fraud Detection using Email Clustering

This Data Mining and Natural Language Processing project clusters emails, measures intra-cluster similarity through a cohesion score, and labels each cluster using WordNet's hypernyms and synsets. Clustering is implemented with both K-means and hierarchical clustering. POS tagging is used to identify the names of the people involved in corporate fraud.

My research publication (as first author): https://link.springer.com/chapter/10.1007/978-981-15-3369-3_9

I presented this paper as first author at the IEEE International Conference (IC4S-2019) and was honoured with the Best Paper Award for the presentation.
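To make the idea of a cohesion score concrete, here is a minimal sketch that scores a cluster by the mean pairwise cosine similarity of its emails. This is an assumption chosen for illustration: the exact cohesion formula used in the publication is not reproduced here, and the helper names (`term_vector`, `cosine`, `cohesion`) are hypothetical rather than taken from the notebook.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Lowercased bag-of-words counts for one email body."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cohesion(cluster_texts):
    """Mean pairwise cosine similarity (illustrative definition, not the paper's).

    Clusters with fewer than two emails are treated as fully cohesive.
    """
    vectors = [term_vector(t) for t in cluster_texts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Made-up example clusters: one topically tight, one unrelated.
tight = ["gas trading desk report", "gas trading report today"]
loose = ["gas trading desk report", "lunch on friday"]
print(cohesion(tight), cohesion(loose))
```

A higher score suggests that the emails in a cluster share more vocabulary, which is the intuition behind comparing clusters by cohesion.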

Dataset ☁️

The Enron email dataset was collected and prepared by the Cognitive Assistant that Learns and Organizes (CALO) project. It contains data from about 150 users, mostly senior management of Enron, organized into folders [18]. The corpus contains approximately 500,000 messages (about 0.5 M) generated by employees of the Enron Corporation. For this project, the emails were read from a .csv file with three columns: index, message ID, and raw message.
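As a rough illustration of working with the three-column layout described above, the sketch below parses the raw message column into sender, subject, and body using only the Python standard library. The column names (`index`, `message_id`, `raw_message`) and the sample row are assumptions made up for this example; the actual file may name its columns differently.

```python
import csv
import io
from email import message_from_string

# Tiny in-memory stand-in for the real CSV file.
# Column names and the sample message are hypothetical.
sample_csv = io.StringIO()
writer = csv.writer(sample_csv)
writer.writerow(["index", "message_id", "raw_message"])
writer.writerow([
    0,
    "<1.JavaMail@example>",
    "From: a@enron.com\nTo: b@enron.com\nSubject: Q3 numbers\n\n"
    "Please review the attached figures.\n",
])
sample_csv.seek(0)


def parse_rows(handle):
    """Split each raw message into headers and body text."""
    rows = []
    for row in csv.DictReader(handle):
        msg = message_from_string(row["raw_message"])
        rows.append({
            "index": int(row["index"]),
            "sender": msg["From"],
            "subject": msg["Subject"],
            "body": msg.get_payload().strip(),
        })
    return rows


emails = parse_rows(sample_csv)
print(emails[0]["subject"])
```

For the real dataset, `sample_csv` would be replaced by an open file handle; only the body text would then typically be passed on to clustering.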

Requirements and usage 💻

  1. Anaconda
  2. Python 3.6
  3. nltk
  4. matplotlib

Clone this repository and run the "EmailClustering.ipynb" file directly.

Insights 📝

Architecture

Elbow Method for finding clusters for K-Means Clustering
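The elbow method can be sketched as follows: run K-means for several values of k, record the within-cluster sum of squares (WCSS), and look for the k after which the drop flattens out. This is a minimal pure-Python sketch under stated assumptions, not the notebook's code: the 2-D points stand in for vectorized emails, the centroids are initialised deterministically from evenly spaced points, and the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm with deterministic, evenly spaced initial centroids."""
    n = len(points)
    centroids = [points[i * n // k] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


def wcss(centroids, clusters):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum(
        squared_distance(p, c)
        for c, members in zip(centroids, clusters)
        for p in members
    )


# Three well-separated groups of made-up 2-D points.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]

inertias = [wcss(*kmeans(points, k)) for k in range(1, 5)]
for k, value in enumerate(inertias, start=1):
    print(k, round(value, 2))
```

On this toy data the WCSS falls sharply up to k = 3 and barely changes afterwards, which is the "elbow" one would read off the plot.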

K-Means Clustering

Hierarchical clustering
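For readers unfamiliar with hierarchical clustering, the following minimal sketch shows the agglomerative (bottom-up) variant with average linkage: every item starts as its own cluster and the two closest clusters are merged repeatedly. The choice of average linkage, the 2-D sample points, and the function names are assumptions for illustration; the notebook may use a different linkage or a library implementation.

```python
import math
from itertools import combinations


def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(a, b):
    """Mean distance over all cross-cluster pairs of points."""
    return sum(distance(p, q) for p in a for q in b) / (len(a) * len(b))


def agglomerate(points, target_clusters):
    """Merge the closest pair of clusters until target_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Made-up points standing in for vectorized emails.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
result = agglomerate(points, target_clusters=3)
print(sorted(sorted(c) for c in result))
```

Recording the order of merges, rather than stopping at a fixed number of clusters, is what yields the dendrogram usually shown for this technique.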

Citation for my publication

[1] Kathuria, A., Mukhopadhyay, D., & Thakur, N. (2020). Evaluating cohesion score with email clustering. In Proceedings of First International Conference on Computing, Communications, and Cyber-Security (IC4S 2019) (pp. 107-119). Springer, Singapore.


Languages

Language: Jupyter Notebook 100.0%