jayavanth / transcripts-scraping

Scraping 

git clone 
python3 -m venv transcripts-scraping
source ./transcripts-scraping/bin/activate
cd transcripts-scraping
pip install -r requirements.txt

STEP 1 : Get the list of TV serials from https://transcripts.foreverdreaming.org/index.php

Code : listoftvserials.py
Output : listoftvserials.json (to pretty-print the JSON in a shell: `cat listoftvserials.json | jq`)
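
The idea behind STEP 1 can be sketched with the standard library alone. This is a minimal sketch and not the contents of listoftvserials.py. It assumes that each TV serial appears on the index page as a link whose href contains `viewforum.php` (a phpBB-style convention that should be checked against the live page). The names `LinkCollector` and `parse_serials` are hypothetical.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://transcripts.foreverdreaming.org/index.php"


class LinkCollector(HTMLParser):
    """Collect title/url pairs for anchors whose href contains a marker."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.links = []
        self._href = None   # href of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.marker in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            if title:
                # Resolve relative links against the index page URL.
                self.links.append({"title": title, "url": urljoin(BASE_URL, self._href)})
            self._href = None


def parse_serials(html):
    # Assumption: serial links point at viewforum.php.
    parser = LinkCollector("viewforum.php")
    parser.feed(html)
    return parser.links


# Made-up HTML standing in for the downloaded index page.
sample = '<a href="./viewforum.php?f=1">Show A</a><a href="/faq.php">FAQ</a>'
print(json.dumps(parse_serials(sample), indent=2))
```

In a real run the HTML would come from an HTTP request to the index page, and the resulting list would be written to listoftvserials.json with `json.dump`.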

STEP 2 : Get the list of episodes for each TV serial from STEP 1


Code : listofepisodesforeachtvserial.py
Output: listofepisodesforeachtvserial.json
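
One possible shape for STEP 2 is a loop over the serials from STEP 1 that builds a mapping from serial title to its episodes. This is a sketch under assumptions and not the contents of listofepisodesforeachtvserial.py. It assumes that episode links contain `viewtopic.php` and that each serial is a dict with `title` and `url` keys. The names `parse_episodes`, `collect_episodes` and `fake_fetch` are hypothetical, and the download function is passed in so the example runs without network access.

```python
import re
import time
from html import unescape

# Assumption: episode links point at viewtopic.php, as on phpBB-style forums.
TOPIC_LINK = re.compile(
    r'<a\s[^>]*href="([^"]*viewtopic\.php[^"]*)"[^>]*>(.*?)</a>', re.S | re.I
)
TAG = re.compile(r"<[^>]+>")


def parse_episodes(html):
    """Return title/url dicts for every episode link found in one serial page."""
    episodes = []
    for href, inner in TOPIC_LINK.findall(html):
        title = unescape(TAG.sub("", inner)).strip()
        if title:
            episodes.append({"title": title, "url": unescape(href)})
    return episodes


def collect_episodes(serials, fetch, delay=1.0):
    """Map each serial title to its episode list, pausing between requests."""
    result = {}
    for serial in serials:
        result[serial["title"]] = parse_episodes(fetch(serial["url"]))
        time.sleep(delay)  # be polite to the server
    return result


def fake_fetch(url):
    # Stand-in for a real HTTP download; returns made-up HTML.
    return '<a class="topictitle" href="./viewtopic.php?t=10&amp;sid=1">01x01 - Pilot</a>'


serials = [{"title": "Show A", "url": "https://example.invalid/viewforum.php?f=1"}]
episodes = collect_episodes(serials, fake_fetch, delay=0)
print(episodes)
```

Passing `fetch` as a parameter keeps parsing separate from downloading, so the parser can be checked against saved pages before any real requests are made.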

STEP 3 : For each episode in STEP 2, fetch the transcript

Prototype : saveeachepisodetotextfile.py
Output: episodecompletetext.txt

Code : module-eachepisodetottextfile.py | WORK IN PROGRESS
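
A rough sketch of the STEP 3 idea, pulling the transcript text out of one episode page and saving it to a text file, is shown here. It is not the contents of saveeachepisodetotextfile.py. It assumes the transcript sits inside a `<div class="content">` element, which is a guess based on phpBB-style markup and needs checking against a real page. `TranscriptExtractor`, `extract_transcript` and `save_transcript` are hypothetical names.

```python
from html.parser import HTMLParser
from pathlib import Path


class TranscriptExtractor(HTMLParser):
    """Collect the text found inside the first-level div with a given class."""

    def __init__(self, css_class="content"):  # assumption: class name "content"
        super().__init__()
        self.css_class = css_class
        self.depth = 0    # greater than 0 while inside the target div
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1          # track nested divs
            elif tag in ("br", "p"):
                self.parts.append("\n")  # keep line breaks between speakers
        elif tag == "div" and self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_transcript(html):
    parser = TranscriptExtractor()
    parser.feed(html)
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def save_transcript(html, path):
    """Write the extracted transcript to a UTF-8 text file and return the text."""
    text = extract_transcript(html)
    Path(path).write_text(text, encoding="utf-8")
    return text


# Made-up HTML standing in for a downloaded episode page.
sample = ('<div class="nav">Menu</div>'
          '<div class="content"><p>JOHN: Hello.</p><br>MARY: Hi.</div>'
          '<div>Footer</div>')
print(extract_transcript(sample))
```

For example, `save_transcript(html, "episodecompletetext.txt")` would write the extracted text to the output file named above.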

STEP 4 : Get the list of movies separately

Code: listofmovies.py
Output : list-of-all-movies.json in the Movies folder

STEP 5 : For each movie in STEP 4, get the movie transcript


Code: listoftranscriptsforeachmovie.py | WORK IN PROGRESS 
Output : in the Movies folder

***

POST SCRAPING :

	Clean the data: remove the Online store, COVID, and Transcript Index links that appear in each file.
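
The cleaning step could start as a simple line filter. This is a sketch only. It assumes the unwanted links can be recognised by the phrases "online store", "covid" and "transcript index" appearing on their own lines, and the function name `clean_transcript` is hypothetical. A keyword filter like this would also drop genuine dialogue lines that mention those phrases, so the patterns would likely need tightening against real files.

```python
import re

# Assumption: boilerplate lines can be spotted by these phrases.
BOILERPLATE = re.compile(r"online store|covid|transcript index", re.IGNORECASE)


def clean_transcript(text):
    """Drop every line that matches a boilerplate phrase."""
    kept = [line for line in text.splitlines() if not BOILERPLATE.search(line)]
    return "\n".join(kept).strip()


# Made-up transcript text with boilerplate at the top and bottom.
raw = (
    "Visit our Online Store!\n"
    "JOHN: Hello.\n"
    "MARY: Hi.\n"
    "Back to Transcript Index\n"
)
print(clean_transcript(raw))
```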
