natural-language-processing nltk-python instagram-scraper

NLP-Influencers-Analysis

Introduction

Affiliate and influencer marketing has seen tremendous growth over the years with the prominence of social media sites like YouTube, Twitter, Instagram and TikTok. According to the Digital Marketing Institute, influencer marketing was worth just $1.7 billion in 2016 and is set to reach $13.8 billion by 2022 as the industry grows and becomes a more effective marketplace.

This repository focuses on text analysis and text mining of Instagram influencers' posts, predominantly in the micro-influencer domain (e.g. 5k-20k followers). The goal is to determine whether a post is sponsored based on its caption alone, and to identify the sponsor.

It contains 2 scripts:

Notebook for scraping influencers' posts on Instagram.
Main Program - Data Preprocessing, Feature Extraction, Sponsor Tagging, Natural Language Processing techniques and Data Visualization on these posts.

Disclaimer

All content was scraped in a responsible and considerate fashion, with adequate pauses between requests that mimic human behavior and prevent any request overload on the server. Additionally, this project was done purely for educational purposes. No profit or monetization was involved; any benefit would go directly to Meta, the organization or the user.

Overview

The overall workflow of this project:

Data Collection

We’ll start by configuring the Chromedriver and setting up the login credentials. Following that, we will log in to Instagram and go to the user's Instagram page. Next, we will get the JSON information on that page. Once we retrieve the JSON, we’ll store all of the relevant information for each post, such as name, follower count, post date, likes, comments and caption, in a list and append it to our data frame. The time.sleep pauses in the code space out the requests so that Instagram is less likely to identify the scraper as a bot. Lastly, we will export the data to a CSV file.
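The parsing and export part of these steps can be sketched as follows. This is a minimal sketch and not the notebook's actual code: the browser automation is left out, the JSON layout in SAMPLE_PROFILE is a hypothetical stand-in (the real payload is laid out differently), and the helpers polite_pause, rows_from_profile and to_csv are names invented for illustration. A list of dictionaries plays the role of the data frame.

```python
import csv
import io
import random
import time

# Hypothetical payload, simplified for illustration only.
SAMPLE_PROFILE = {
    "name": "example_user",
    "followers": 12000,
    "posts": [
        {"postdate": "2022-01-05", "likes": 540, "comments": 32,
         "caption": "Loving my new shoes! #ad"},
        {"postdate": "2022-01-09", "likes": 410, "comments": 12,
         "caption": ""},
    ],
}

FIELDS = ["name", "follower", "postdate", "likes", "comments", "caption"]


def polite_pause(min_s=2.0, max_s=5.0, sleep=time.sleep):
    """Wait a random interval between requests and return the delay used."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


def rows_from_profile(profile):
    """Flatten one profile payload into one row per post."""
    return [
        {
            "name": profile["name"],
            "follower": profile["followers"],
            "postdate": post["postdate"],
            "likes": post["likes"],
            "comments": post["comments"],
            "caption": post["caption"],
        }
        for post in profile["posts"]
    ]


def to_csv(rows):
    """Serialize rows to CSV text; write this string to a file to export."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = rows_from_profile(SAMPLE_PROFILE)
csv_text = to_csv(rows)
```

In a real run, polite_pause() would be called between page loads; the randomized interval is one possible way to implement the pauses described above.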

Exploratory Data Analysis

We will drop duplicates and drop all posts that have an empty caption, since a post with no caption has nothing to analyze. We will also replace missing values in the industry column with “General”.
Next, we will quantify the engagement rate of a post by adding up its number of comments and likes and dividing the sum by the number of followers. We’ll then create a new column to store hyperlinks from the post. This feature will be used in conjunction with NLP to identify sponsored posts.
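The cleaning and feature steps above can be sketched in plain Python. This is an illustrative sketch under assumptions: the row keys mirror the columns named earlier, the URL pattern is a simple heuristic rather than the project's own rule, and the function names are hypothetical.

```python
import re

# Simple heuristic: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def engagement_rate(likes, comments, followers):
    """Return (likes + comments) / followers, or None without followers."""
    if followers <= 0:
        return None
    return (likes + comments) / followers


def extract_links(caption):
    """Return every URL-like substring found in the caption."""
    return URL_PATTERN.findall(caption)


def prepare(rows):
    """Drop empty captions and duplicates, fill industry, add features."""
    seen = set()
    cleaned = []
    for row in rows:
        caption = (row.get("caption") or "").strip()
        key = (row["name"], row["postdate"], caption)
        if not caption or key in seen:
            continue
        seen.add(key)
        cleaned.append({
            **row,
            "caption": caption,
            "industry": row.get("industry") or "General",
            "engagement_rate": engagement_rate(
                row["likes"], row["comments"], row["follower"]),
            "links": extract_links(caption),
        })
    return cleaned


posts = [
    {"name": "a", "follower": 12000, "postdate": "2022-01-05",
     "likes": 540, "comments": 32, "industry": None,
     "caption": "New drop at https://example.com/deal"},
    {"name": "a", "follower": 12000, "postdate": "2022-01-09",
     "likes": 410, "comments": 12, "industry": "Fitness", "caption": ""},
]
prepared = prepare(posts)
```

Note that the pattern keeps any punctuation glued to the end of a URL, so real captions may need an extra trimming step.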

Tagging

Load the NLP model and retrieve a list of stop_words.
Preprocess the text: clean each caption by changing all words to lowercase and removing any character that is not alphanumeric, tokenize each word for entity tagging, and remove emoji and duplicates.
Tag the possible sponsor based on spaCy Named Entity Recognition, links and hashtags.
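The preprocessing part of the tagging step can be sketched with the standard library alone. This is an assumption-laden sketch: the stop-word set is a tiny illustrative list rather than the one the project retrieves, the function names are hypothetical, the cleaning rule keeps ASCII letters and digits only, and the spaCy NER pass itself is not shown. Hashtags are collected before cleaning because the cleaning rule strips the "#" character.

```python
import re

HASHTAG = re.compile(r"#(\w+)")
NON_ALNUM = re.compile(r"[^a-z0-9\s]")

# Tiny illustrative stop-word list, not the project's actual list.
STOP_WORDS = {"a", "an", "and", "the", "my", "to", "of", "with", "for"}


def hashtags(caption):
    """Collect lowercased hashtags from the raw caption."""
    return [tag.lower() for tag in HASHTAG.findall(caption)]


def clean_tokens(caption):
    """Lowercase, drop non-alphanumerics (emoji included), remove
    stop words, and keep only the first occurrence of each token."""
    text = NON_ALNUM.sub(" ", caption.lower())
    tokens = []
    for token in text.split():
        if token not in STOP_WORDS and token not in tokens:
            tokens.append(token)
    return tokens


caption = "Loving my NEW shoes 😍 with the new #Nike drop! #ad"
tags = hashtags(caption)
tokens = clean_tokens(caption)
```

The hashtags gathered this way, together with the links extracted earlier, can then be compared against the entities returned by the NER model to pick a likely sponsor.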

Natural Language Processing

Custom NER using spaCy - Train a custom NER model on top of the base model to include new entity labels: sponsors, promo codes and products.
Topic Modeling - Preprocess the captions and perform LDA to identify and classify their key topics. This achieved a coherence score of 0.515 with 8 topics ranging from giveaways, fitness and love to promotions.
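For the custom NER step, training data is usually expressed as character-offset annotations. The sketch below builds such annotations in the (text, {"entities": [(start, end, label)]}) layout that spaCy training examples can be created from (for instance through Example.from_dict in spaCy v3). It is not the repository's code: the annotate helper, the sample caption and the label names SPONSOR, PROMO_CODE and PRODUCT are all hypothetical, and the training loop is omitted.

```python
def annotate(text, labeled_phrases):
    """Build (text, {"entities": [(start, end, label), ...]}).

    labeled_phrases maps a phrase to its label. Only the first
    occurrence of each phrase is annotated, and a phrase that
    overlaps an already annotated span is skipped.
    """
    entities = []
    for phrase, label in labeled_phrases.items():
        start = text.find(phrase)
        if start == -1:
            continue  # phrase not present in this caption
        end = start + len(phrase)
        overlaps = any(start < e and s < end for s, e, _ in entities)
        if not overlaps:
            entities.append((start, end, label))
    return text, {"entities": sorted(entities)}


caption = "Use code ANNA20 for 20% off the GlowSerum from Lumina"
example = annotate(caption, {
    "Lumina": "SPONSOR",        # hypothetical label names
    "ANNA20": "PROMO_CODE",
    "GlowSerum": "PRODUCT",
})
```

Each span can be checked by slicing the caption with its offsets, which is a quick way to catch misaligned annotations before training.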

Data Visualization

Most frequent words in Instagram post captions
Factors that contribute to a post having a high engagement rate
Frequency of sponsorship by brand
Countries and industries with the most sponsorships

About

This repository contains a web-scraping script and an analysis of influencer/affiliate marketing on Instagram. The analysis involves topic modeling and building a custom NER model.

