This repository contains a collection of scripts for scraping content from sources such as YouTube, Wikipedia, Britannica, and Unsplash. It can download video captions from YouTube, scrape Wikipedia articles, fetch content from Britannica, and download images from Unsplash.
- Clone the repository:
    git clone https://github.com/shivendrra/web-graze.git
    cd web-graze
- Create and activate a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install the required packages:
    pip install -r requirements.txt
The YouTube scraper fetches video captions from a list of channels.
- Add your YouTube API key to a .env file:
    yt_key=YOUR_API_KEY
- Create a channelIds.json file with the list of channel IDs:
    ["UC_x5XG1OV2P6uZZ5FSM9Ttw", "UCJ0-OtVpF0wOKEqT2Z1HEtA"]
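import os

from dotenv import load_dotenv
from graze import youtube

# Read yt_key from the .env file into the environment.
load_dotenv()
api_key = os.getenv('yt_key')

# Run the scraper; filepath is the file passed for the captions output.
scraper = youtube(api_key=api_key, filepath='./output.txt')
scraper()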
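`os.getenv` returns `None` when the variable is missing, so a typo in the .env file may only surface later as an API error. A small guard, sketched here under the assumption that you want to fail early, makes the problem obvious. `require_env` is a hypothetical helper and not part of graze.

```python
import os


def require_env(name):
    """Hypothetical helper: return a required environment variable or fail early."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable and an empty string.
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

After `load_dotenv()`, `api_key = require_env("yt_key")` could then replace the plain `os.getenv` call.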
The Wikipedia scraper generates target URLs from provided queries, fetches the complete web page, and writes it to a file.
- Define your search queries in queries.py:
    class WikiQueries:
        def __init__(self):
            self.search_queries = ["topic1", "topic2", "topic3"]

        def __call__(self):
            return self.search_queries
from graze import wikipedia
wiki = wikipedia()
wiki(out_file='./output.txt')
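To illustrate what "generating a target URL from a query" can look like, here is a minimal sketch. It is not taken from graze, whose actual URL logic is not shown in this README: `wikipedia_url` and its `lang` parameter are hypothetical, and the sketch simply assumes the standard Wikipedia article path where spaces become underscores.

```python
from urllib.parse import quote


def wikipedia_url(query, lang="en"):
    """Illustrative only: map a search query to a Wikipedia article URL."""
    # Wikipedia article titles use underscores in place of spaces.
    title = query.strip().replace(" ", "_")
    # Percent-encode any remaining characters that are unsafe in a URL path.
    return f"https://{lang}.wikipedia.org/wiki/{quote(title)}"
```

For example, `wikipedia_url("Alan Turing")` yields `https://en.wikipedia.org/wiki/Alan_Turing`.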
The Britannica scraper fetches content based on search queries and writes it to a file.
- Define your search queries in queries.py:
    class BritannicaQueries:
        def __init__(self):
            self.search_queries = ["topic1", "topic2", "topic3"]

        def __call__(self):
            return self.search_queries
from graze import britannica
scraper = britannica(max_limit=20)
scraper(out_file='./output.txt')
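Since the query lists in queries.py are edited by hand, duplicates and stray whitespace can creep in. The sketch below shows one way to tidy a list before using it. `normalize_queries` is a hypothetical helper, not part of graze, and treating queries that differ only by case as duplicates is an assumption.

```python
def normalize_queries(queries):
    """Hypothetical helper: strip whitespace, drop blanks and duplicates, keep order."""
    seen = set()
    cleaned = []
    for query in queries:
        text = query.strip()
        key = text.lower()  # compare case-insensitively (an assumption)
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned
```

It could be applied inside a query class, for instance `self.search_queries = normalize_queries([...])`.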
The Unsplash image scraper fetches images for the given topics and saves them in a separate folder per topic.
- Define your search queries in queries.py:
    search_queries = ["topic1", "topic2", "topic3"]
import graze
from queries import search_queries  # the list defined in queries.py

scraper = graze.unsplash(topics=search_queries)
Downloading 'american football' images:
Downloading : 100%|██████████████████████████| 176/176 [00:30<00:00, 5.72it/s]
Downloading 'indian festivals' images:
Downloading : 100%|██████████████████████████| 121/121 [00:30<00:00, 7.29it/s]
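The per-topic folder layout can be pictured with a short sketch. This is an illustration only: `topic_dir`, the `images` root, and the underscore-based folder names are assumptions, and the folder names graze actually produces may differ.

```python
import re
from pathlib import Path


def topic_dir(topic, root="images"):
    """Hypothetical helper: create and return a per-topic folder."""
    # Lowercase the topic and collapse anything non-alphanumeric into "_".
    slug = re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")
    path = Path(root) / slug
    path.mkdir(parents=True, exist_ok=True)
    return path
```

With this scheme, the topic 'american football' would map to `images/american_football`.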
- API keys and other secrets: Store API keys and other sensitive data securely (for example, in the .env file) and do not hard-code them in your scripts.
- Search queries: The search queries for the Wikipedia, Britannica, and Unsplash scrapers are defined in queries.py.
The YouTube scraper logs errors to youtube_fetch.log. Check this file for detailed error messages and troubleshooting information.
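For a quick look at recent failures, the end of the log can be read with a few lines of Python. This is a minimal sketch: `tail_log` is a hypothetical helper, and it makes no assumption about the log's line format beyond it being text.

```python
from collections import deque


def tail_log(path="youtube_fetch.log", lines=20):
    """Return the last `lines` lines of a log file as a list of strings."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the most recent lines while streaming the file.
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]
```

Running `print("\n".join(tail_log()))` from the project directory prints the most recent entries.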
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.
See CONTRIBUTING.md for more details.
This project is licensed under the MIT License.