kedir / GLG_Group2

GLG Capstone Project for NER, Clustering and Topic Modeling

Fourthbrain - GLG Capstone

Project:

Automated Meta-data Tagging and Topic Modeling

Background:

Gerson Lehrman Group (GLG) powers great decisions through our network of experts. Our company receives hundreds of requests a day from clients seeking insights on topics ranging from the airline industry’s ability to cope with COVID-19 to zebra mussel infestations in North America. The goal is to match each request to a topic specialist in our database. This Natural Language Processing (NLP) project aims to improve topic and keyword detection in client-submitted reports and to identify underlying patterns in submitted requests over time. The primary challenges are Named Entity Recognition (NER) and pattern recognition for hierarchical clustering of topics.

Typically, a client request includes a form containing unstructured free text and screening questions. We therefore need to group these requests into common topics to better understand and service demand. The project also aims to make the current data pipelines more resourceful, so that data can be stored and retrieved efficiently.
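As an illustration of grouping free-text requests into common topics, here is a minimal sketch of agglomerative (hierarchical) clustering using only the Python standard library. It is not the project's actual pipeline: the sample requests, the stopword list, the Jaccard distance on token sets, the single-linkage rule, and the `threshold` value are all assumptions chosen to keep the example small.

```python
import re
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not a real NLP resource).
STOPWORDS = {"the", "of", "on", "in", "to", "a", "and", "for", "is"}


def tokens(text):
    """Return the set of lowercase word tokens in `text`, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def jaccard_distance(a, b):
    """1 - (size of intersection / size of union); 0.0 means identical sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def cluster_requests(texts, threshold=0.8):
    """Single-linkage agglomerative clustering over request texts.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold`. Returns lists of text indices.
    """
    docs = [tokens(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]

    def linkage(c1, c2):
        # Single linkage: distance between the closest members of each cluster.
        return min(jaccard_distance(docs[i], docs[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[x], clusters[y]) > threshold:
            break
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]  # safe because x < y
    return clusters


# Hypothetical requests, invented for this example.
sample_requests = [
    "Impact of COVID-19 on airline industry capacity",
    "Airline industry response to COVID-19 travel restrictions",
    "Zebra mussel infestations in North American lakes",
    "Controlling zebra mussel spread in freshwater lakes",
]

print(cluster_requests(sample_requests))
```

With these four requests the two airline questions end up in one cluster and the two zebra mussel questions in another. On real data, a weighted representation such as TF-IDF or embeddings would likely separate topics better than raw token overlap.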

Dataset

Two datasets are used for this project:

  • All the News 2.0: contains 2,688,878 news articles and essays from 27 American publications, spanning January 1, 2016 to April 2, 2020.

  • Annotated Corpus for NER: a Named Entity Recognition corpus built on the Groningen Meaning Bank (GMB), intended for entity classification and enriched with commonly used NLP features.
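To show how token-level NER annotations turn into usable entities, here is a small sketch that collects entity spans from parallel lists of words and tags. It assumes the tags follow the BIO scheme (`B-<type>` starts an entity, `I-<type>` continues it, `O` is outside any entity) with type labels such as `org`, `geo`, and `tim`. The sentence and its tags are hand-written for illustration and are not taken from the corpus, and `extract_entities` is a hypothetical helper name.

```python
def extract_entities(words, tags):
    """Collect (entity_text, entity_type) pairs from parallel word/BIO-tag lists."""
    entities = []
    current_words, current_type = [], None

    def flush():
        # Store the entity being built, if any.
        if current_words:
            entities.append((" ".join(current_words), current_type))

    for word, tag in zip(words, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A new entity starts here; a stray I- tag is treated as a start.
            flush()
            current_words, current_type = [word], label
        elif prefix == "I":
            # Continuation of the current entity.
            current_words.append(word)
        else:
            # "O" tag: close any open entity.
            flush()
            current_words, current_type = [], None

    flush()  # entity that runs to the end of the sentence
    return entities


# Hand-made example sentence with assumed BIO tags.
words = ["Delta", "Air", "Lines", "cut", "flights", "to", "New", "York", "in", "March"]
tags = ["B-org", "I-org", "I-org", "O", "O", "O", "B-geo", "I-geo", "O", "B-tim"]

print(extract_entities(words, tags))
```

For the example sentence this yields three entities: an organization ("Delta Air Lines"), a location ("New York"), and a time expression ("March"). Entities extracted this way could then serve as metadata tags for a request.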

License

Distributed under the MIT License. See LICENSE for more information.

Languages

Language:Jupyter Notebook 100.0%