rien333 / wacz2warc

Convert .wacz to .warc files

wacz2warc

wacz2warc is a simple shell script to convert and combine one or more .wacz files into a single .warc file.

Installation

  1. Ensure that sed, unzip, and gzip are installed:
  • Fedora
    sudo dnf install unzip sed gzip
  • Arch Linux
    sudo pacman -Sy unzip sed gzip
  • Debian/Ubuntu
    sudo apt-get install unzip sed gzip
  2. Clone this repository (or simply download wacz2warc), and make it your working directory:
git clone https://github.com/rien333/wacz2warc.git && cd wacz2warc
  3. Copy or symlink wacz2warc to a location in your $PATH, so that you can call it from anywhere. For example:
sudo cp wacz2warc /usr/bin/
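
To confirm that the tools from step 1 are actually available before running the script, something like the following sketch may help. The `missing_tools` function is a hypothetical helper written for this example and is not part of wacz2warc; it only relies on the standard `command -v` shell builtin.

```shell
# Hypothetical helper, not part of wacz2warc: print each named tool
# that cannot be found on $PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the three tools listed in step 1 and report any that are absent.
missing=$(missing_tools sed unzip gzip)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
fi
```

If nothing is printed, all three tools were found.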

Usage

wacz2warc expects one or more .wacz files as input. For example, say your directory looks like this:

$ ls 
archive1.wacz archive2.wacz archive3.wacz 

In that case, run this to convert archive1 to a .warc file:

wacz2warc archive1.wacz

Or run this to combine and convert all .wacz files in the current directory into a single .warc:

wacz2warc *.wacz

The latter command is especially helpful if you encounter a webpage archive that has been split across multiple .wacz files.

If everything goes well, there should be a new .warc file in the directory of the original .wacz file(s). The filename of this new .warc will match the first .wacz file provided to wacz2warc.
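
As a rough illustration of the naming rule just described, the sketch below derives the expected output name from the first argument. It assumes the output is simply the first input's name with the .wacz extension replaced by .warc; that is one reading of the description here, and the script's exact behavior may differ. `expected_warc_name` is a hypothetical helper, not something wacz2warc provides.

```shell
# Hypothetical helper, not part of wacz2warc: guess the output name by
# swapping the .wacz suffix of the FIRST argument for .warc.
# Any further arguments are ignored, mirroring "the first .wacz file provided".
expected_warc_name() {
    printf '%s\n' "${1%.wacz}.warc"
}

expected_warc_name archive1.wacz archive2.wacz archive3.wacz
# prints "archive1.warc"
```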

Run wacz2warc -h or wacz2warc --help for a summary of these instructions.

Known issues

  • The output file will probably contain duplicate pages and files
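
To get a rough idea of how many duplicates ended up in an output file, one option is to look for target URIs that occur in more than one response record. The sketch below assumes an uncompressed .warc whose record headers follow the usual layout (a `WARC/…` version line, `WARC-Type` and `WARC-Target-URI` header fields, then a blank line). `list_duplicate_uris` is a hypothetical helper, not part of wacz2warc, and it is only a heuristic: a payload that happens to contain a line starting with `WARC/` could confuse it.

```shell
# Hypothetical helper, not part of wacz2warc: list target URIs that occur
# in more than one "response" record of an uncompressed WARC file.
list_duplicate_uris() {
    # WARC headers end in CRLF; drop the CRs so awk sees clean lines.
    tr -d '\r' < "$1" | LC_ALL=C awk '
        # A version line marks the start of a new record header.
        /^WARC\/[0-9]/ { in_header = 1; type = ""; uri = ""; next }
        in_header && /^WARC-Type:/       { type = $2 }
        in_header && /^WARC-Target-URI:/ { uri = $2 }
        # The first blank line ends the header; count response records only.
        in_header && /^$/ {
            if (type == "response" && uri != "") count[uri]++
            in_header = 0
        }
        END { for (u in count) if (count[u] > 1) print u }
    ' | sort
}
```

For example, `list_duplicate_uris archive1.warc` would print one URI per line for every page or file that was captured more than once.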
