

cs132-main

scraper

twint user attributes | twint tweet attributes | twint configuration options | minamotorin's fork

Table columns (optional ones are bracketed):

  • ID
    • 09-[row number]
  • Timestamp of inclusion
  • Tweet URL
  • Group
    • 09
  • Collector
    • Daryll | Westin | Zandrew
  • Category
    • RBRD
  • Topic
    • Leni's incompetence as VP
  • Keywords
  • Account handle
  • Account name
  • Account bio
  • Account type
  • Joined
  • Following
  • Followers
  • Location
  • Tweet
  • [Translated tweet]
  • Tweet type
  • Date posted
  • [Screenshot]
  • Content type
  • Likes
  • Replies
  • Retweets
  • [Quote tweets]
  • [Views]
  • [Rating]
  • Reasoning
  • [Remarks]
  • Reviewer
    • leave blank
  • Review
    • leave blank
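
For reference, the non-bracketed columns above form the header that processed.csv and both spreadsheets are expected to share. As a Python list (the exact header strings written by process.py may differ):

REQUIRED_COLUMNS = [
    "ID", "Timestamp of inclusion", "Tweet URL", "Group", "Collector",
    "Category", "Topic", "Keywords", "Account handle", "Account name",
    "Account bio", "Account type", "Joined", "Following", "Followers",
    "Location", "Tweet", "Tweet type", "Date posted", "Content type",
    "Likes", "Replies", "Retweets", "Reasoning", "Reviewer", "Review",
]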

Instructions

Group 9 fodder spreadsheet | Group 9 final spreadsheet

Workflow:

  1. Scrape using scrape.py
  2. Extract a CSV file using process.py
  3. Append contents of CSV file to fodder spreadsheet
  4. Select rows in fodder spreadsheet to copy-paste to final spreadsheet
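
A typical pass through steps 1 and 2 reuses the sample commands from the sections below (the query and -f value are just those samples); steps 3 and 4 are then done by hand in the spreadsheets:

python3 scrape.py -s leni\ walang\ ambag -l 60
python3 process.py -u d -f 61 -s leni\ walang\ ambag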

Steps:

  1. Download the repository as a ZIP file:

[screenshot: Download]


  2. Extract the ZIP file, and in your terminal, cd (change directory) to the scraper folder:

[screenshot: cd to scraper]


  3. Install the dependencies:

a) Install twint:

First, clone minamotorin's fork:

git clone git@github.com:minamotorin/twint.git

Next, cd into the twint repository:

cd twint

Install twint like so:

pip3 install . -r requirements.txt

b) Install snscrape and pandas:

pip3 install snscrape pandas
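
To sanity-check the installation before moving on, you can try importing all three packages in one go (a quick check only, not part of the workflow):

python3 -c "import twint, snscrape, pandas; print('all good')"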

  4. Scrape tweets using the scrape.py program:
python3 scrape.py [-u <username>] -s <search query> [-l <limit>]

There are three command line arguments you can pass in here:

  • -u or --user: optional; indicates the username of the Twitter user you want to scrape tweets from (e.g., Official_UPD)

  • -s or --search: required; indicates the search terms or keywords you want to use for scraping tweets (e.g., leni)

    • important: if your search terms have spaces, add backslashes before those spaces:
python3 scrape.py -s leni\ walang\ ginawa
  • -l or --limit: optional; indicates the maximum number of tweets scraped; default is 100

    • important: if indicated, the limit must be a multiple of 20:
python3 scrape.py -s leni\ walang\ ambag -l 60

Once you execute this step, an output.csv file should appear in the folder you are in. There's no need to touch this file; it will be processed by process.py.

This step is relatively short (roughly 1 minute for 200 tweets).
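
If you ever need to tweak scrape.py, its core logic is roughly the sketch below. This is only an assumption about how the script is put together (it supposes snscrape's TwitterSearchScraper plus pandas; the actual scrape.py may rely on twint instead and saves more attributes):

import argparse

import pandas as pd
import snscrape.modules.twitter as sntwitter

parser = argparse.ArgumentParser()
parser.add_argument("-u", "--user", default=None)             # optional account to scrape from
parser.add_argument("-s", "--search", required=True)          # search terms/keywords
parser.add_argument("-l", "--limit", type=int, default=100)   # maximum number of tweets
args = parser.parse_args()

# Narrow the search to a single account if -u was given
query = args.search if args.user is None else f"{args.search} from:{args.user}"

rows = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= args.limit:
        break
    rows.append({
        "Tweet URL": tweet.url,
        "Date posted": tweet.date,
        "Account handle": tweet.user.username,
        "Tweet": tweet.content,
        "Likes": tweet.likeCount,
        "Replies": tweet.replyCount,
        "Retweets": tweet.retweetCount,
    })

# output.csv is what process.py picks up in the next step
pd.DataFrame(rows).to_csv("output.csv", index=False)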


  5. Process tweets using the process.py program:
python3 process.py -u <program user> -f <first index> -s <search query>

There are three command line arguments you can pass in here:

  • -u or --user: required; one of three characters: d (for Daryll), w (for Westin), or z (for Zandrew); indicates who's running the process.py program right now

  • -f or --first: required; to find out what number to pass to this flag, look at the last row of the fodder spreadsheet: if the last row's ID is 09-X, pass in X+1

  • -s or --search: required; this must be the same value you passed to the -s flag of the scrape.py program

Example:

python3 process.py -u d -f 61 -s leni\ walang\ ambag

Once you execute this step, a processed.csv file should appear in the folder you are in. This is the file we'll import into the fodder spreadsheet!

This step is the bottleneck of our workflow (roughly 8 minutes for 200 tweets).
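
Conceptually, the three flags feed the spreadsheet columns as in the sketch below. This is a hypothetical reconstruction rather than the actual process.py, which also fills in the account and tweet attributes (the part that makes this step slow):

import argparse
from datetime import datetime

import pandas as pd

COLLECTORS = {"d": "Daryll", "w": "Westin", "z": "Zandrew"}

parser = argparse.ArgumentParser()
parser.add_argument("-u", "--user", required=True, choices=COLLECTORS)  # who's running this
parser.add_argument("-f", "--first", required=True, type=int)           # next free row number
parser.add_argument("-s", "--search", required=True)                    # same query as scrape.py
args = parser.parse_args()

df = pd.read_csv("output.csv")

# IDs continue from the last row of the fodder spreadsheet: 09-f, 09-(f+1), ...
df.insert(0, "ID", [f"09-{args.first + i}" for i in range(len(df))])
df.insert(1, "Timestamp of inclusion", datetime.now().isoformat(timespec="seconds"))
df["Group"] = "09"
df["Collector"] = COLLECTORS[args.user]
df["Keywords"] = args.search

# processed.csv is the file that gets appended to the fodder spreadsheet
df.to_csv("processed.csv", index=False)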


  6. Import the generated processed.csv file into the fodder spreadsheet. Make sure to append to the current sheet, and turn off the setting that recognizes values, dates, or equations.

The header row will still be there after you import the CSV file; just delete that row.


  7. Select the rows we want from the fodder spreadsheet, copy-paste them into the final spreadsheet, then fill in the remaining required columns:

    • Account Type
    • Content Type
    • Reasoning
