rcanovas / Low-LZ78

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Low-LZ78

The Low-LZ78 project contains the C++11 and C codes associated with the following reference: D. Arroyuelo, R. Canovas, G. Navarro, and R. Raman. "LZ78 Compression in Low Main Memory Space". To appear in Proc. SPIRE'17

This work includes the following data structures: C++11:

  • hlz78: A LZ78 data structure that uses a fixed hash table.
  • ghlz78: A LZ78 data structure that uses a growing hash table.
  • mhlz78: A LZ78 data structure that uses multi-hash tables.

C (32 bits):

  • lz78_min: The LZ78 compact representation of Arroyuelo and Navarro ("Space-efficient construction of Lempel-Ziv compressed text indexes". Information and Computation, 209(7):1070–1102, 2011).
  • lz78_uc: The LZ78 uncompressd baseline representation used by Arroyuelo and Navarro.

Compile

To be able to compile the codes:

  • Install sdsl-lite. Follow the installation guide here: (https://github.com/simongog/sdsl-lite)
  • Modify the location of the sdsl library in the CMakeLists.txt if necessary.
  • Require cmake version 3.5 or higher.
  • For the C++11 code, go to the build folder (create the folder if it does not exists) and run:
    • cmake ..
    • make
  • For the C codes, go to the lz78_c32 folder and run:
    • make

Methods

hlz78/ghlz78/mhlz78

-[CompressText] :

Use: ./CompressText file_name <opt> 
  		<file_name>: Name of the file to be compressed 
	<opt> : 
	  	-o output_name:  String containing the name of the output file. Default file_name
	  	-s sigma: Size of the alphabet of the input text. Default s = 256
	  	-f factor: Integer value indicating the overload factor used  for the Hash table. Default factor = 5%
	  	-d D_bits: Number of bits used to store each collision value. Default d=0 (meaning that it is compute internally)
		-w Index_type. Default = 0
    		    ---+--------------------
		     0 | HLZ78 using a map for the displacements 
		     1 | HLZ78 using a hash table and a sublayer for displacements
		     2 | MHLZ78 using maps for the displacements
		     3 | MHLZ78 using a hash table and a sublayer for displacements
		     4 | GHLZ78 using maps for the displacements
		     5 | GHLZ78 using a hash table and a sublayer for displacements
    		     

      	output:  <output_name>.index_type

	Example: ./CompressText file -w 1 -s 97 -f 10 -d 3
	output:  file.hlz78_hash

-[DecompressHLZ] :

Use: ./DecompressHLZ index_type <opt>
	<index_type>: Name of the compressed index file
	<opt> (index details are requiered): 
		-o output_name:  String containing the name of the output file. Default <index_name>_decom
		-w Index_type. Default = 0
		 # | Index_type
		 ---+--------------------
		  0 | HLZ78 using a map for the displacements
		  1 | HLZ78 using a hash table and a sublayer for displacements
		  2 | MHLZ78 using maps for the displacements
		  3 | MHLZ78 using a hash table and a sublayer for displacements
		  4 | GHLZ78 using maps for the displacements
		  5 | GHLZ78 using a hash table and a sublayer for displacements
 
	output:  output_name file

	Example: ./DecompressHLZ file.hlz78_hash -w 1 -o file   
	output: file (uncompress version of file.hlz78_hash)

lz78_min/lz78_uc

-[Clz78_(min/uc)] :

Use: ./Clz78_(min/uc) file_name
	file_name: Name of the file to be compressed 

	output:  file_name.lzt file

	Example: ./Clz78_min file 
	output: file.lzt (Compressed version of file)

-[Dlz78_(min/uc)] :

Use: ./Dlz78_(min/uc) file_name
	file_name: Name of the file to be decompressed (file_name.lzt must exist) 

	output:  file_name file

	Example: ./Dlz78_min file 
	output: file (Decompressed version of file.lzt. Note: the original file is overwrite)

The results presented in the paper "LZ78 Compression in Low Main Memory Space" were generated using the command line:

	./CompressText in_file -w type_index -s sigma -f factor
	using w = {1,3,5} and f = {5,10,20,40,60}, and files (with sigma):
		dblp.xml (s=97)
		proteins (s=27)
		english  (s=237) --> Note: we needed to erasa /0 characters from the original text before compressing
		DNA      (s=51)

For more information about the data structures and the test sets used please refer to the paper.

Note: These codes assume that the computer used has enough RAM to read and store the complete input.

About

License:GNU General Public License v3.0


Languages

Language:C++ 53.7%Language:C 45.1%Language:Makefile 0.7%Language:CMake 0.5%