warc

There are 10 repositories under the warc topic.

ArchiveBox / ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
archivebox backups bookmark-archiver browser-bookmarks chromium digipres firefox headless-browser internet-archiving pinboard pocket python rss self-hosted singlefile warc wayback-machine web-archiving wget youtube-dl
Language:Python 19720
internetarchive / heritrix3
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
heritrix java warc webcrawling
Language:Java 2675
Rhizome-Conifer / conifer
Collect and revisit web pages.
webrecorder web-archiving archives pywb python docker wayback warc
Language:Python 1459
ArchiveTeam / grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
archiving crawl crawler spider warc
Language:Python 1257
webrecorder / replayweb.page
Serverless replay of web archives directly in the browser
web-archiving web-archive replay-web-page web-replay wayback-machine warc service-worker wacz
Language:TypeScript 605
oduwsdl / ipwb
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
docker ipfs memento memento-rfc python service-worker warc wayback web-archiving
Language:Python 589
webrecorder / browsertrix-crawler
Run a high-fidelity browser-based crawler in a single Docker container
crawler crawling wacz warc web-archiving web-crawler webrecorder
Language:TypeScript 531
webrecorder / webrecorder-player
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
webrecorder warc pywb electron web-archiving
Language:JavaScript 421
Florents-Tselai / WarcDB
WarcDB: Web crawl data as SQLite databases.
cli crawling database sqlite warc web-archiving web-data
Language:Python 383
machawk1 / wail
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
web-archiving wayback python heritrix gui warc openwayback pyinstaller
Language:Roff 341
webrecorder / warcio
Streaming WARC/ARC library for fast web archive IO
web-archives web-archiving warc pywb python
Language:Python 340
bitextor / bitextor
Bitextor generates translation memories from multilingual websites
document-aligner apertium dictionaries crawler wget hunalign sentence-segmentation tokenizer bicleaner tmx warc corpus-tools corpus-processing corpus-generator parallel-corpora machine-translation neural-machine-translation statistical-machine-translation bitextor bleualign
Language:Python 278
commoncrawl / news-crawl
News crawling with StormCrawler - stores content as WARC
crawler news warc web-crawler apache-storm common-crawl commoncrawl storm-crawler
Language:Java 248
machawk1 / warcreate
Chrome extension to "Create WARC files from any webpage"
chrome-extension warc web-archiving
Language:JavaScript 192
cocrawler / cocrawler
CoCrawler is a versatile web crawler built using modern tools and concurrency.
crawler python3 async-python warc pluggable-modules screenshot concurrency aiohttp aiohttp-client
Language:Python 176
cocrawler / cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
web-archiving web-archives warc cdx cdx-api commoncrawl python
Language:Python 150
helgeho / ArchiveSpark
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
archivespark spark-framework spark web-archiving webarchive internet-archive warc
Language:Scala 140
crissyfield / troll-a
Drill into WARC web archives
command-line-tool common-crawl internet-archive security security-tools warc
Language:Go 129
N0taN3rd / wail
🐋 One-Click User Instigated Preservation
electron web-archiving warc browser-based-presrevation high-fidelity-preservation
Language:JavaScript 119
webrecorder / browsertrix-cloud
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
archiving cloud warc web-archive web-archiving webrecorder wacz kubernetes
Language:TypeScript 118
maxcountryman / warc-parquet
🗄️ A simple CLI for converting WARC to Parquet.
crawling duckdb parquet warc web-archiving
Language:Rust 99
N0taN3rd / node-warc
Parse And Create Web ARChive (WARC) files with node.js
webarchive webarchiving web-archives warc-files warc web-archiving pupeteer chrome-remote-interface
Language:JavaScript 90
CGamesPlay / chronicler
Offline-first web browser
electron browser warc
Language:JavaScript 82
ArchiveTeam / wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
webarchiving warc wget lua archiving crawler crawl crawling spider archiveteam wget-lua zstd ftp scraper scraping crawlers downloader
Language:C 81
mikwielgus / forum-dl
Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC
python scraper forum discourse phpbb simplemachines data-fetching internet-archiving warc
Language:Python 59
centic9 / CommonCrawlDocumentDownload
A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika
cdx-files commoncrawl mime-types warc java
Language:Java 58
archivesunleashed / warclight
A Rails engine supporting the discovery of web archives.
blacklight ruby discovery webarchive-discovery solr rails webarchives warc rails-engine
Language:Ruby 48
pirate / internet-archiving-talk
🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.
internet-archiving talks slideshow web-archiving wget warc archivebox censorship ethics
Language:JavaScript 47
internetarchive / cdx-summary
Summarize web archive capture index (CDX) files.
archive cdx collection nodejs python report statistics summary warc web-archive webcomponents
Language:Python 43
PromyLOPh / crocoite
Web archiving using Google Chrome
warc chrome-browser archiving devtools
Language:Python 42
chatnoir-eu / chatnoir-resiliparse
A robust web archive analytics toolkit
python web warc bigdata cython cpp extraction webarchive htmlparser
Language:Cython 40
jedireza / warc
⚙️ A Rust library for reading and writing WARC files
warc rust-library rust
Language:Rust 39
harvard-lil / warc-gpt
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
ai rag warc webarchiving
Language:Python 34
openzim / warc2zim
Command line tool to convert a file in the WARC format to a file in the ZIM format
warc zim scraper
Language:Python 33
datatogether / warc
Golang WARC (Web ARChive) Library
golang archiving iipc warc package
Language:Go 29
datacoon / metawarc
metawarc: a command-line tool for metadata extraction from files from WARC (Web ARChive)
warc warc-files webarchiving metadata osint osint-python
Language:Python 23
