web-graze

Scrape raw data from various sources on the internet, like Wikipedia, Internet Archive, Britannica, YouTube, Unsplash, etc.

Introduction

This repository contains a collection of scripts to scrape content from various sources such as YouTube, Wikipedia, Britannica, and Unsplash. It can download video captions from YouTube, scrape Wikipedia and Britannica articles, and fetch images from Unsplash.

Table of Contents

  • Installation
  • Usage
  • Configuration
  • Logging
  • Contribution
  • License

Installation

  1. Clone the repository:

    git clone https://github.com/shivendrra/web-graze.git
    cd web-graze
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate   # On Windows: venv\Scripts\activate
  3. Install the required packages:

    pip install -r requirements.txt

Usage

YouTube Scraper

The YouTube scraper fetches video captions from a list of channels.

Configuration

  • Add your YouTube API key to a .env file:

    yt_key=YOUR_API_KEY
  • Create a channelIds.json file with the list of channel IDs:

    [
      "UC_x5XG1OV2P6uZZ5FSM9Ttw",
      "UCJ0-OtVpF0wOKEqT2Z1HEtA"
    ]

Running the Scraper

import os
from dotenv import load_dotenv
from graze import youtube

load_dotenv()
api_key = os.getenv('yt_key')

scraper = youtube(api_key=api_key, filepath='./output.txt')
scraper()
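
For reference, here is a minimal sketch of what such a caption fetcher can look like. It assumes the YouTube Data API (via google-api-python-client) and the youtube-transcript-api package; the actual graze.youtube implementation may differ:

import json
from googleapiclient.discovery import build
from youtube_transcript_api import YouTubeTranscriptApi

def fetch_captions(api_key, channel_file='channelIds.json', out_file='./output.txt'):
    # illustrative sketch, not the actual graze.youtube implementation
    with open(channel_file) as f:
        channel_ids = json.load(f)  # the list created in the configuration step

    yt = build('youtube', 'v3', developerKey=api_key)
    with open(out_file, 'w', encoding='utf-8') as out:
        for channel_id in channel_ids:
            # fetch one page of recent videos for the channel
            search = yt.search().list(
                part='id', channelId=channel_id, maxResults=50, type='video'
            ).execute()
            for item in search.get('items', []):
                video_id = item['id']['videoId']
                try:
                    transcript = YouTubeTranscriptApi.get_transcript(video_id)
                    out.write(' '.join(seg['text'] for seg in transcript) + '\n')
                except Exception:
                    continue  # captions disabled or unavailable for this video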

Wikipedia Scraper

The Wikipedia scraper generates target URLs from the provided queries, fetches each complete web page, and writes the text to a file.

Configuration

  • Define your search queries in queries.py:
    class WikiQueries:
        def __init__(self):
            self.search_queries = ["topic1", "topic2", "topic3"]
        
        def __call__(self):
            return self.search_queries

Running the Scraper

from graze import wikipedia

wiki = wikipedia()
wiki(out_file='./output.txt')
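
Under the hood, the URL-generation step presumably maps each query to a Wikipedia article URL and downloads the page. A rough sketch of that idea using requests and BeautifulSoup (an assumption, not the actual implementation):

import requests
from bs4 import BeautifulSoup
from queries import WikiQueries  # the class defined above

def scrape_wikipedia(out_file='./output.txt'):
    headers = {'User-Agent': 'web-graze-example/0.1'}
    with open(out_file, 'w', encoding='utf-8') as out:
        for query in WikiQueries()():
            # e.g. "deep learning" -> https://en.wikipedia.org/wiki/deep_learning
            # (Wikipedia normalizes the title and redirects as needed)
            url = f"https://en.wikipedia.org/wiki/{query.strip().replace(' ', '_')}"
            resp = requests.get(url, headers=headers)
            if resp.status_code != 200:
                continue  # query did not resolve to an article
            soup = BeautifulSoup(resp.text, 'html.parser')
            # keep paragraph text; the real scraper writes the complete page
            out.write('\n'.join(p.get_text() for p in soup.find_all('p')) + '\n')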

Britannica Scraper

The Britannica scraper fetches content based on search queries and writes it to a file.

Configuration

  • Define your search queries in queries.py:
    class BritannicaQueries:
        def __init__(self):
            self.search_queries = ["topic1", "topic2", "topic3"]
        
        def __call__(self):
            return self.search_queries

Running the Scraper

from graze import britannica

scraper = britannica(max_limit=20)
scraper(out_file='./output.txt')
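
Here, max_limit caps how many results are collected per query. As a rough sketch of the idea (the search-URL pattern and selectors below are assumptions, not the actual implementation):

import requests
from bs4 import BeautifulSoup
from queries import BritannicaQueries  # the class defined above

def scrape_britannica(max_limit=20, out_file='./output.txt'):
    headers = {'User-Agent': 'web-graze-example/0.1'}
    with open(out_file, 'w', encoding='utf-8') as out:
        for query in BritannicaQueries()():
            # hypothetical search-URL pattern; the real scraper may differ
            search_url = f"https://www.britannica.com/search?query={query}"
            soup = BeautifulSoup(requests.get(search_url, headers=headers).text, 'html.parser')
            # follow at most max_limit article links from the results page
            links = [a['href'] for a in soup.select("a[href^='/topic/']")][:max_limit]
            for href in links:
                page = requests.get(f"https://www.britannica.com{href}", headers=headers)
                article = BeautifulSoup(page.text, 'html.parser')
                out.write('\n'.join(p.get_text() for p in article.find_all('p')) + '\n')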

Unsplash Scraper

The Unsplash image scraper fetches images for each given topic and saves them in per-topic folders.

Configuration

  • Define your search queries in queries.py:
    search_queries = ["topic1", "topic2", "topic3"]

Running the Scraper

import graze
from queries import search_queries  # the list defined above

scraper = graze.unsplash(topics=search_queries)

Output:

Downloading 'american football' images:
Downloading : 100%|██████████████████████████| 176/176 [00:30<00:00,  5.72it/s]

Downloading 'indian festivals' images:
Downloading : 100%|██████████████████████████| 121/121 [00:30<00:00,  7.29it/s]
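
As a rough sketch of what such an image scraper can do, here is one way to pull image URLs from an Unsplash search page and save each topic into its own folder with a tqdm progress bar. The URL pattern and page structure are assumptions, and the actual graze.unsplash implementation may differ:

import os
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
from queries import search_queries  # the list defined above

def download_topic(topic, out_dir='./images'):
    # assumed search-page URL pattern; the real scraper may use another source
    url = f"https://unsplash.com/s/photos/{topic.replace(' ', '-')}"
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    srcs = [img['src'] for img in soup.find_all('img')
            if img.get('src', '').startswith('https://images.unsplash.com')]

    folder = os.path.join(out_dir, topic)
    os.makedirs(folder, exist_ok=True)  # one folder per topic, as described above
    print(f"Downloading '{topic}' images:")
    for i, src in enumerate(tqdm(srcs, desc='Downloading ')):
        with open(os.path.join(folder, f'{i}.jpg'), 'wb') as f:
            f.write(requests.get(src).content)

for topic in search_queries:
    download_topic(topic)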

Configuration

  • API Keys and other secrets: Ensure that your API keys and other sensitive data are stored securely and not hard-coded into your scripts.

  • Search Queries: The search queries for the Wikipedia, Britannica, and Unsplash scrapers are defined in queries.py.

Logging

The YouTube scraper logs errors to youtube_fetch.log. Make sure to check this file for detailed error messages and troubleshooting information.
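
If you want the other scrapers to log the same way, the standard-library logging module is enough. A minimal sketch (only the youtube_fetch.log file name comes from this repo; the rest is illustrative):

import logging

# write errors to youtube_fetch.log, matching the file named above
logging.basicConfig(
    filename='youtube_fetch.log',
    level=logging.ERROR,
    format='%(asctime)s %(levelname)s %(message)s',
)

try:
    raise RuntimeError('caption fetch failed')  # stand-in for a real failure
except RuntimeError:
    logging.exception('error while fetching captions')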

Contribution

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.

Check out CONTRIBUTING.md for more details.

License

This project is licensed under the MIT License.
