wacz2warc

wacz2warc is a simple shell script to convert and combine one or more .wacz files into a single .warc file.

Installation

Ensure that sed, unzip, and gzip are installed:

Fedora
```
sudo dnf install unzip sed gzip
```

Arch Linux
```
sudo pacman -S unzip sed gzip
```

Debian/Ubuntu
```
sudo apt-get install unzip sed gzip
```

Clone this repository (or simply download wacz2warc), and make it your working directory:

git clone https://github.com/rien333/wacz2warc.git && cd wacz2warc

Copy or symlink wacz2warc to a location in your $PATH, so that you can call it from anywhere. For example:

sudo cp wacz2warc /usr/bin/
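
If you would rather symlink than copy, the following is a minimal sketch of a per-user install. It assumes you are still in the cloned repository and that `~/.local/bin` is in your $PATH, which is common but not guaranteed. `BIN_DIR` is a hypothetical override variable used only in this example.

```shell
# Assumption: ~/.local/bin is in your $PATH; set BIN_DIR to use another directory.
bin_dir="${BIN_DIR:-$HOME/.local/bin}"
mkdir -p "$bin_dir"

# -s creates a symbolic link; -f replaces a link left over from an earlier install.
ln -sf "$PWD/wacz2warc" "$bin_dir/wacz2warc"
```

Because the link points back into the repository, a later `git pull` also updates the installed command, and no `sudo` is needed.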

Usage

wacz2warc expects one or more .wacz files as input. For example, say your directory looks like this:

$ ls 
archive1.wacz archive2.wacz archive3.wacz

In that case, run this to convert archive1 to a .warc file:

wacz2warc archive1.wacz

Or this to combine and convert all .wacz in the current directory into a single .warc:

wacz2warc *.wacz

The latter command is especially helpful if you encounter a webpage archive that has been split across multiple .wacz files.

If everything goes well, there should be a new .warc file in the directory of the original .wacz file(s). The filename of this new .warc will match the first .wacz file provided to wacz2warc.
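
To check the result from a script, the naming rule above can be sketched as two small helper functions. Both names are hypothetical and not part of wacz2warc. The sketch assumes that "match" means the same path with the `.wacz` extension swapped for `.warc`.

```shell
# Assumption: the output name is the first input with .wacz replaced by .warc.
expected_warc_name() {
    # Only the first argument matters; later .wacz files are merged into it.
    printf '%s\n' "${1%.wacz}.warc"
}

# Report whether the expected output file exists and is non-empty.
check_warc_output() {
    out=$(expected_warc_name "$@")
    if [ -s "$out" ]; then
        echo "found $out"
    else
        echo "missing or empty: $out" >&2
        return 1
    fi
}
```

For example, `wacz2warc *.wacz && check_warc_output *.wacz` would look for `archive1.warc` in the directory listed earlier.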

Run wacz2warc -h or wacz2warc --help for a summary of these instructions.

Known issues

The output file will probably contain duplicate pages and files.
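
To get a rough idea of how many duplicates ended up in an output file, you can list the URLs that appear in more than one `response` record. The function below is a sketch, not part of wacz2warc. It assumes an uncompressed .warc with standard `WARC-Type` and `WARC-Target-URI` header fields, and it can be confused by payloads that themselves contain lines starting with `WARC/1.`.

```shell
# List each URL that is the target of more than one "response" record.
list_duplicate_responses() {
    # Strip the CR of CRLF line endings, then read only the WARC record headers.
    LC_ALL=C tr -d '\r' < "$1" | LC_ALL=C awk '
        /^WARC\/1\.[0-9]/ { in_header = 1; type = ""; uri = ""; next }
        in_header && /^WARC-Type:/       { type = $2 }
        in_header && /^WARC-Target-URI:/ { uri = $2 }
        in_header && /^$/ {
            # A blank line ends the record header.
            if (type == "response" && uri != "") print uri
            in_header = 0
        }
    ' | sort | uniq -d
}
```

Running `list_duplicate_responses archive1.warc` prints one line per duplicated URL, or nothing if every response URL is unique.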
