data-scraping hierarchical-attention-networks lstm youtube-comment deep-learning machine-learning twitter-api youtube-api

Hierarchical Attention Network for Political Affiliation of YouTube Commenters

This research project was part of "[INF-DS-RMB] Research Module B: Projekt: Social Media and Business Analytics Project", Summer Semester 2022, for my Master of Science in Data Science at the University of Potsdam, Germany. The associated research paper can be found here.

To train, run inference with, or visualize the Hierarchical Attention Network (HAN) or LSTM model, follow the steps in the readme here.

Installing Dependencies for Data Scraping

Install scrapetube, which is used to scrape YouTube video IDs, together with the YouTube search package, using pip:

pip install scrapetube
pip install youtube-search-python

Also, create a Twitter API key and a YouTube API key for scraping data.

Add your Twitter credentials to utility/config_KEYS.yml.

To scrape the data:

Replace the API key and tokens in config_KEYS.yml with your own Twitter API key and tokens.
To create a list of news channels with their website link and country, run

python 'Scrap MBFC website.py'

To scrape each news channel's website for its YouTube channel name and Twitter handle, run

python '2. scrap_youtube_twitter.py'

Review Twitter handles manually.

python '2_1. review_twitter_handle.py'

To find the Twitter handles of the remaining news channels, run an exhaustive search:

python '3. Get_twitter_handle.py'

Search the YouTube search page by channel title to find each news channel's YouTube channel:

python '4. Scrap_youtube_channel.py'

Validate the given YouTube channel ID. If a channel's YouTube username is available, use it to get the channel ID.

python '5. scrap_youtube_id.py'

Another method to get the channel ID from the username:

python '5.1 get_yt_id.py'

Manually correct the YouTube channel IDs.

python '5.2 get_yt_id_manual.py'

From each YouTube channel's playlists, get all videos published from 2021-01-01 to 2021-08-31.

python '6. get_yt_channel_playlists_videos.py'
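The date filter in this step can be sketched as follows. This is a minimal illustration, not the code in the script: it assumes each video record is a dict with a `publishedAt` field in the ISO 8601 form returned by the YouTube Data API, that the window is interpreted in UTC, and `filter_videos_by_date` is a hypothetical helper name.

```python
from datetime import datetime, timezone

# Window from the README: 2021-01-01 to 2021-08-31 inclusive (assumed UTC).
# The end bound is exclusive (2021-09-01) so the whole of 31 August is kept.
WINDOW_START = datetime(2021, 1, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2021, 9, 1, tzinfo=timezone.utc)


def parse_published_at(value):
    """Parse an ISO 8601 timestamp such as '2021-03-05T12:00:00Z'."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def filter_videos_by_date(videos):
    """Keep videos published inside the window (assumed 'publishedAt' key)."""
    return [
        v for v in videos
        if WINDOW_START <= parse_published_at(v["publishedAt"]) < WINDOW_END
    ]


# Hypothetical sample records, two outside and two inside the window.
videos = [
    {"videoId": "a", "publishedAt": "2020-12-31T23:59:59Z"},
    {"videoId": "b", "publishedAt": "2021-01-01T00:00:00Z"},
    {"videoId": "c", "publishedAt": "2021-08-31T23:59:59Z"},
    {"videoId": "d", "publishedAt": "2021-09-01T00:00:00Z"},
]
kept = [v["videoId"] for v in filter_videos_by_date(videos)]
print(kept)
```

Using an exclusive upper bound avoids having to decide what the last representable instant of 31 August is.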

Use the video IDs scraped in the previous step to scrape their comments.

python '7. get_yt_comments.py'

A utility file to combine the right and right-center files.

python '7.1 combine_right_and_center_right_data.py'

Convert the data from JSON format to CSV.

python '8. json_to_csv.py'
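A JSON-to-CSV conversion of this kind can be sketched with the standard library alone. The field names (`video_id`, `author`, `text`) and the helper `comments_json_to_csv` are assumptions for illustration; the actual script and its columns may differ.

```python
import csv
import io
import json


def comments_json_to_csv(json_text, fieldnames):
    """Convert a JSON array of flat comment objects into CSV text."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keep only the requested columns; missing keys become empty cells.
        writer.writerow({name: record.get(name, "") for name in fieldnames})
    return buffer.getvalue()


# Hypothetical sample data; the second comment contains a line break
# and an extra key that is not exported.
sample = json.dumps([
    {"video_id": "v1", "author": "alice", "text": "Great, thanks!"},
    {"video_id": "v2", "author": "bob", "text": "line one\nline two", "likes": 3},
])
csv_text = comments_json_to_csv(sample, ["video_id", "author", "text"])
print(csv_text)
```

The `csv` module quotes fields that contain commas or line breaks, which matters for free-text comments.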

Get the subscription list of all authors who have commented, using the YouTube API.

python '9. get_authors_subscription.py'

First step of annotation: annotate users as liberals or conservatives using their subscription data and a homogeneity score.

python '10. user_subscription_homogeneity_score.py'
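The README does not define the homogeneity score, so the sketch below rests on an assumption: the score is taken to be the share of a user's labelled subscriptions that fall on the majority side, and a user is annotated only when that share reaches a threshold. The function names and the 0.8 threshold are hypothetical, not values from the project.

```python
from collections import Counter


def homogeneity_score(leanings):
    """Return (majority_label, share) over a user's labelled subscriptions.

    `leanings` is a list such as ["liberal", "liberal", "conservative"].
    An empty list gives (None, 0.0).
    """
    if not leanings:
        return None, 0.0
    label, count = Counter(leanings).most_common(1)[0]
    return label, count / len(leanings)


def annotate_user(leanings, threshold=0.8):
    """Label a user only when the subscriptions are homogeneous enough.

    The threshold is an assumed value; with any threshold above 0.5
    an evenly split user is never labelled.
    """
    label, score = homogeneity_score(leanings)
    return label if score >= threshold else None


subscriptions = ["liberal"] * 4 + ["conservative"]
print(homogeneity_score(subscriptions))
print(annotate_user(subscriptions))
```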

Create a separate dataframe to make it easy to annotate hashtags as being used by liberals or conservatives.

python '11. create_df_hashtag_annotations.py'
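Collecting hashtags for manual annotation could look roughly like this. It is a sketch under assumptions: hashtags are matched with a simple `#\w+` pattern and lower-cased, and the result is a plain list of rows with an empty `leaning` column to fill in by hand, rather than the dataframe the script builds. All names here are hypothetical.

```python
import re
from collections import Counter

# Assumed pattern: '#' followed by letters, digits or underscores.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_hashtags(comment):
    """Return the lower-cased hashtags in one comment, without the '#'."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(comment)]


def hashtag_counts(comments):
    """Count how often each hashtag appears across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(extract_hashtags(comment))
    return counts


comments = ["Vote! #Election2020 #vote", "no tags here", "#VOTE early"]
counts = hashtag_counts(comments)

# One row per hashtag, most frequent first; 'leaning' is left blank
# for the annotator.
rows = [
    {"hashtag": tag, "count": n, "leaning": ""}
    for tag, n in counts.most_common()
]
print(rows)
```

Sorting by frequency lets the annotator label the most common hashtags first.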

Create the first layer of annotated training data (done using users' subscription data). This is just a sample file for building our models.

python '12. create_data_subscription_training.py'

Second step of annotation: annotate users as liberals or conservatives using the hashtags they use in their comments and a homogeneity score.

python '13. create_data_from_hashtags.py'

Find conflicted users (users who appear in both left and right channels with different leanings) and remove them. Then combine both datasets and save the result. Next, take the samples where the user's leaning is known and generate annotated training data.

python '14. generate_training_data revisit.py'
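The conflict-removal idea can be sketched as follows, assuming the two datasets are available as mappings from user ID to leaning. The function name `remove_conflicts` and the mapping layout are hypothetical; the script itself may work on dataframes and define conflicts differently.

```python
def remove_conflicts(first_labels, second_labels):
    """Merge two user -> leaning mappings, dropping users whose labels disagree.

    Returns (merged, conflicted), where `conflicted` is the set of removed users.
    """
    conflicted = {
        user
        for user in first_labels.keys() & second_labels.keys()
        if first_labels[user] != second_labels[user]
    }
    # Users present in only one mapping, or with matching labels, are kept.
    merged = {**first_labels, **second_labels}
    for user in conflicted:
        del merged[user]
    return merged, conflicted


# Hypothetical labels: u2 is labelled differently in the two sets.
left_set = {"u1": "liberal", "u2": "liberal", "u3": "liberal"}
right_set = {"u2": "conservative", "u3": "liberal", "u4": "conservative"}
merged, conflicted = remove_conflicts(left_set, right_set)
print(sorted(merged.items()))
print(conflicted)
```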

Preprocess the training dataset (created in steps 14 and 17) so that it is suitable for training.

python '15. data_for_training_revisit.py'

Preprocess the un-annotated dataset for inference.

python '16. null_comments.py'

Create Plots.

python '17. plots.py'

Remove conflicts from the inference results. This can only be run once you have the inference files.

python '18. remove_conflicts_inference.py'

utils.py contains all utility functions and variables.
