There are 10 repositories under the warc topic.
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Collect and revisit web pages.
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Serverless replay of web archives directly in the browser
Run a high-fidelity browser-based crawler in a single Docker container
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
WarcDB: Web crawl data as SQLite databases.
Streaming WARC/ARC library for fast web archive IO
News crawling with StormCrawler - stores content as WARC
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.
Drill into WARC web archives
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
🗄️ A simple CLI for converting WARC to Parquet.
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Scrape posts and threads from forums, news aggregators, and mail archives; export to JSONL, mailbox, or WARC
A small tool which uses the CommonCrawl URL Index to download documents with certain file types or MIME types. This is used for mass-testing of frameworks like Apache POI and Apache Tika.
A Rails engine supporting the discovery of web archives.
🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.
Summarize web archive capture index (CDX) files.
A robust web archive analytics toolkit
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.