wacz2warc is a simple shell script to convert and combine one or more .wacz files into a single .warc file.
- Ensure that sed, unzip, and gzip are installed:
  - Fedora: sudo dnf install unzip sed gzip
  - Arch Linux: sudo pacman -Sy unzip sed gzip
  - Debian/Ubuntu: sudo apt-get install unzip sed gzip
- Clone this repository (or simply download wacz2warc), and make it your working directory:
git clone https://github.com/rien333/wacz2warc.git && cd wacz2warc
- Copy or symlink wacz2warc to a location in your $PATH, so that you can call it from anywhere. For example:
sudo cp wacz2warc /usr/bin/
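To confirm the installation steps above, a quick check along the following lines may help. This is a minimal sketch, not part of wacz2warc: the function name check_deps is hypothetical, and it only tests whether each command can be found on your $PATH.

```shell
# Hypothetical helper (not part of wacz2warc): report any of the given
# commands that cannot be found on $PATH.
# Prints "missing: NAME" for each one and returns non-zero if any is missing.
check_deps() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf 'missing: %s\n' "$cmd"
            missing=1
        fi
    done
    return "$missing"
}

# The dependencies named in this README, plus the script itself.
check_deps sed unzip gzip wacz2warc || echo "Install the missing tools listed above."
```

If nothing is printed, all four commands were found.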
wacz2warc expects one or more .wacz files as input. For example, say your directory looks like this:
$ ls
archive1.wacz archive2.wacz archive3.wacz
In that case, run this to convert archive1.wacz to a .warc file:
wacz2warc archive1.wacz
Or this to combine and convert all .wacz files in the current directory into a single .warc file:
wacz2warc *.wacz
The latter command is especially helpful if you encounter a webpage archive that has been split across multiple .wacz files.
If everything goes well, there should be a new .warc file in the directory of the original .wacz file(s). The filename of this new .warc file will match the first .wacz file provided to wacz2warc.
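As a rough way to check for the result, the sketch below derives the expected output name from the first input file. It assumes that "matching" means the .wacz extension is replaced by .warc; that is an interpretation of the description above, not something verified against the script. The helper name expected_warc_name is hypothetical.

```shell
# Hypothetical helper: guess the output name from the first input file,
# ASSUMING wacz2warc swaps the .wacz extension for .warc.
expected_warc_name() {
    printf '%s\n' "${1%.wacz}.warc"
}

# Example: after running `wacz2warc archive1.wacz archive2.wacz`,
# look for the file named after the first argument.
out=$(expected_warc_name archive1.wacz)
if [ -f "$out" ]; then
    echo "found $out"
else
    echo "$out not found; check the output of wacz2warc"
fi
```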
Run wacz2warc -h or wacz2warc --help for a summary of these instructions.
- The output file will probably contain duplicate pages and files
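
To get a rough idea of how much duplication a combined file contains, one option is to count how often each target URI occurs in the record headers. The sketch below is not part of wacz2warc; the function name warc_uri_counts is hypothetical, and it assumes an uncompressed .warc file whose records carry WARC-Target-URI header lines and a grep that supports -a (treat binary input as text). A URI normally appears in more than one record (for example a request and its response), so only unusually high counts hint at duplicated captures.

```shell
# Hypothetical helper (not part of wacz2warc): count how often each
# WARC-Target-URI header value occurs in an uncompressed WARC file,
# most frequent first. Output lines have the form "COUNT URI".
warc_uri_counts() {
    grep -a '^WARC-Target-URI:' "$1" \
        | tr -d '\r' \
        | sed 's/^WARC-Target-URI:[[:space:]]*//' \
        | sort | uniq -c | sort -rn
}

# Usage (assuming archive1.warc exists in the current directory):
#   warc_uri_counts archive1.warc | head
```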