NNLCB is a benchmark for general-purpose (universal) lossless compression algorithms on multi-source data, with a focus on deep neural networks. Our benchmark currently covers 17 general-purpose lossless compressors, 8 NN-based and 9 traditional, evaluated on 28 datasets of different types. Each lossless compressor was evaluated on 19 performance measures, including compression robustness, compression strength, and the time and peak memory required for compression and decompression.
Note: NNLCB has been published in the FCS journal. If you would like new compressors to be tested, please contact us.
Algorithm | WavgCR (bits/base) | AvgCR (bits/base) | WavgSSP (%) | AvgSSP (%) | CRP (%) | TotalCT (Hour) | TotalDT (Hour) | AvgCPM (GB) | AvgDPM (GB) |
---|---|---|---|---|---|---|---|---|---|
NNCP | 4.183 | 2.521 | 47.713 | 68.476 | 13.084 | 942.928 | 926.049 | 0.111 | 0.111 |
PAC | 4.327 | 2.638 | 45.912 | 67.019 | 12.720 | 74.398 | 116.868 | 6.102 | 6.295 |
TRACE | 4.411 | 2.718 | 44.867 | 66.032 | 12.486 | 69.128 | 131.110 | 6.106 | 6.449 |
DZip | 4.494 | 2.516 | 43.819 | 68.545 | 14.272 | 332.787 | 148.374 | 10.113 | 4.790 |
DZip* | 4.562 | 3.802 | 42.971 | 52.476 | 11.158 | 332.787 | 148.374 | 10.113 | 4.790 |
Lstm-compress | 5.395 | 2.786 | 32.563 | 65.168 | 14.543 | 492.869 | 474.498 | 0.009 | 0.009 |
DeepZip* | 16.835 | 7.045 | -110.434 | 11.933 | 18.504 | 250.714 | 52.449 | 13.708 | 4.292 |
DeepZip | 16.865 | 5.760 | -110.811 | 28.003 | 24.092 | 250.714 | 52.449 | 13.708 | 4.292 |
BSC | 4.826 | 2.928 | 39.677 | 63.394 | 13.045 | 0.353 | 0.300 | 0.121 | 0.116 |
Lzma2 | 4.912 | 3.122 | 38.590 | 60.967 | 12.289 | 0.584 | 0.030 | 1.264 | 0.427 |
XZ | 4.923 | 3.118 | 38.463 | 61.021 | 12.365 | 0.879 | 0.040 | 1.612 | 0.504 |
PPMD | 4.960 | 3.025 | 38.001 | 62.181 | 12.934 | 0.893 | 0.953 | 0.226 | 0.225 |
PBzip2 | 5.052 | 3.275 | 36.845 | 59.062 | 11.798 | 0.024 | 0.016 | 0.115 | 0.084 |
Gzip | 5.351 | 3.862 | 33.113 | 51.728 | 10.342 | 0.451 | 0.026 | 0.002 | 0.002 |
LZ4-multi | 5.618 | 4.280 | 29.770 | 46.501 | 9.656 | 0.064 | 0.009 | 0.116 | 0.025 |
SnZip | 5.981 | 5.100 | 25.235 | 36.244 | 7.473 | 0.031 | 0.021 | 0.003 | 0.003 |
Notes. “*”: the NN model size is taken into account; “Avg/WavgCR (bits/base)”: Average or Weighted Average Compression Ratio; “TotalCT/DT (Hour)”: Total Compression or Decompression Time; “AvgCPM/DPM (GB)”: Average Compression or Decompression Peak Memory; “Avg/WavgSSP (%)”: Average or Weighted Average Storage Saving Percentage; “CRP (%)”: Compression Robustness Performance.
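For reference, the ratio and saving columns can be derived from file sizes alone: CR (bits/base) = 8 * compressed size / original size, and SSP (%) = (1 - compressed/original) * 100. The snippet below is a minimal sketch of these two formulas; the file names are placeholders, and weighting each dataset by its size for the Wavg* columns is our reading of the metric names rather than code taken from the benchmark.
# Minimal sketch: compute CR (bits/base) and SSP (%) for one dataset.
# "file" and "file.cmix" are placeholders for an original file and its compressed output.
orig=$(stat -c %s file)
comp=$(stat -c %s file.cmix)
awk -v o="$orig" -v c="$comp" 'BEGIN {
    printf "CR  = %.3f bits/base\n", 8 * c / o;
    printf "SSP = %.3f %%\n", (1 - c / o) * 100;
}'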
We benchmark on 28 widely studied datasets covering various data types, including text, images, audio, and genomic data. Please refer to our paper for detailed information about the data; the details of how to obtain each dataset are given below. The benchmark datasets are as follows:
ID | Name | Data Type | Size (Bytes) | Description |
---|---|---|---|---|
D1 | xml | text | 5345280 | Files in xml format |
D2 | ooffice | heterogeneous | 6152192 | Files consisting of Office programs |
D3 | reymont | text | 6627202 | A pdf file with the contents of Reymont’s book |
D4 | sao | homogeneous | 7251944 | Files containing information of 258,996 stars |
D5 | x-ray | image | 8474240 | 12-bit grayscale scaled x-ray medical image of a child’s hand |
D6 | mr | image | 9970564 | A magnetic resonance medical image of the head |
D7 | osdb | heterogeneous | 10085684 | Open source database files for testing |
D8 | dickens | text | 10192446 | Text file consisting of multiple novels by Dickens |
D9 | samba | heterogeneous | 21606400 | A collection of files from the Samba open source project |
D10 | nci | homogeneous | 33553445 | Files in SDF format |
D11 | webster | heterogeneous | 41458703 | An English dictionary stored in HTML format |
D12 | mozilla | homogeneous | 51220480 | The executable file Mozilla |
D13 | enwik8 | text | 100000000 | First $10^8$ bytes of the English Wikipedia dump of 2006 |
D14 | text8 | text | 100000000 | First $10^8$ bytes of the English Wikipedia dump of 2006 (text only) |
D15 | MNIST | image | 54880032 | A widely studied dataset containing handwritten digital images |
D16 | CIFAR-10 | image | 186213868 | A standard dataset of images with multiple categories |
D17 | ImageNet | image | 745823247 | Training data of task 3 from ILSVRC 2012 |
D18 | ImageTest | image | 470611702 | A new 8-bit benchmark dataset for image compression evaluation |
D19 | Silesia | heterogeneous | 211938580 | A heterogeneous corpus of 12 documents with various data types |
D20 | Backup | heterogeneous | 1000000000 | $10^9$ bytes randomly extracted from the disk backup dataset of TRACE |
D21 | enwik9 | text | 1000000000 | First $10^9$ bytes of the English Wikipedia dump of 2006 |
D22 | Book | text | 1000000000 | First $10^9$ bytes of BookCorpus |
D23 | ESC | audio | 220522000 | First 500 audio files of the ESC |
D24 | Command | audio | 327759206 | First 10,000 audio files of the Google Speech Commands Dataset |
D25 | LibriSpeech | audio | 359034309 | Development set (“clean” speech) of LibriSpeech ASR corpus |
D26 | LJSpeech | audio | 293847664 | First 10,000 audio files of the LJ Speech Dataset |
D27 | DNACorpus | genome | 685597124 | A corpus of DNA sequences from 15 different species |
D28 | ERR7091247 | genome | 1926041160 | A genomic sequencing dataset in FastQ format |
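After downloading, it can be helpful to check that each file matches the size listed above. The lines below are a hypothetical check: the local path ./datasets/enwik8 is an assumption and should be adapted per dataset.
# Hypothetical size check: compare a downloaded dataset (here D13, enwik8) against the table above.
expected=100000000
actual=$(stat -c %s ./datasets/enwik8)
if [ "$actual" -eq "$expected" ]; then
    echo "enwik8: size OK ($actual bytes)"
else
    echo "enwik8: size mismatch (expected $expected, got $actual)" >&2
fi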
In our comparison experiments, we benchmarked 8 advanced general-purpose NN-based compressors (Cmix, NNCP, Lstm-compress, DeepZip, DZip, TRACE, PAC, and LLMZip) and 9 traditional methods (Gzip, PBzip2, XZ, BSC, SnZip, Lzma2, PPMD, LZ4, and X3).
All experiments were conducted on a GPU server equipped with 4 Intel Xeon Silver 4310 CPUs (2.10 GHz, 48 cores in total), 4 NVIDIA GeForce RTX 4090 GPUs (16,384 CUDA cores and 24 GB of GPU memory each), and 128 GB of DDR4 RAM. The server runs Ubuntu 20.04.6 LTS.
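Compression/decompression time and peak memory can be collected with a wrapper such as GNU time; the lines below are a sketch of one way to do this and are not necessarily the exact harness used in the paper (GPU memory would additionally require a tool such as nvidia-smi). The cmix command is only an example.
# Sketch: measure wall-clock time and peak resident memory of one compression run with GNU time.
/usr/bin/time -v cmix -c file file.cmix 2> compress.time.log
grep "Elapsed (wall clock) time" compress.time.log
grep "Maximum resident set size" compress.time.log   # peak RSS, reported in kilobytes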
Cmix is a neural-network-based lossless compression algorithm that aims to optimize the compression ratio at the cost of high CPU and memory usage; it uses thousands of context models followed by an NN-based mixer. We used Cmix V19 in our experiments.
# compression
cmix -c file file.cmix
# decompression
cmix -d file.cmix file.cmix.out
Lstm-compress is an LSTM-based lossless compression algorithm that uses the same LSTM module and preprocessing code as Cmix. Lstm-compress currently only supports compressing a single file. In our experiments, we used Lstm-compress V3. The detailed commands are as follows.
# compression
lstm-compress -c file file.lstm
# decompression
lstm-compress -d file.lstm file.lstm.out
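Since Lstm-compress accepts only a single input file, one simple workaround (a sketch, not part of the official documentation) is to pack a multi-file dataset into a tar archive first:
# Sketch of a workaround: pack a multi-file dataset into one tar archive before compression.
# "dataset_dir" is a placeholder for a directory of files.
tar -cf dataset.tar dataset_dir/
lstm-compress -c dataset.tar dataset.tar.lstm
# decompression, then unpack
lstm-compress -d dataset.tar.lstm dataset.tar.out
tar -xf dataset.tar.out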
NNCP is a lossless compression algorithm based on LSTM that supports multi-GPU parallel processing; it is an experiment to build a practical lossless data compressor with neural networks. The latest version uses a Transformer model. In our experiments, we used NNCP V2021-06-01. The detailed commands are as follows.
# compression
nncp c file file.nncp -T 16 --cuda
# decompression
nncp d file.nncp file.nncp.out -T 16 --cuda
DeepZip is a general-purpose compression algorithm based on recurrent neural networks. It is a static, pre-training-based method. The detailed commands for using DeepZip are shown below.
# compression
sh ./compress.sh file file.deepzip bs model
# decompression
sh ./decompress.sh file.deepzip file.deepzip.out bs model
DZip is an upgraded version of DeepZip that adds a deeper network to improve compression. DZip provides two compression modes: combined mode and bootstrap mode. The detailed commands of DZip are as follows.
# compression
sh ./compress.sh file file.dzip com model
# decompression
sh ./decompress.sh file.dzip file.dzip.out com model
TRACE is a lossless compression algorithm based on Performer (a Transformer variant). TRACE uses byte grouping and shared FFNs, and therefore has better execution efficiency. Since the original TRACE runs compression and decompression simultaneously, we modified the source files so that the performance of compression and decompression can be measured separately.
# compression
python compressor.py --source file --comp file.trace
# decompression
python compressor.py --comp file.trace --decomp file.trace.out
PAC is a deep-learning-based compression algorithm that fuses an MLP with an Ordered Mask. Owing to the use of an MLP, PAC has a lower computational cost. As with TRACE, we separated PAC's compression and decompression processes, as shown in the commands below.
# compression
python compressor.py --source file --comp file.pac
# decompression
python compressor.py --comp file.pac --decomp file.pac.out
LLMZip uses LLaMA as a probabilistic predictor in combination with entropy coding (zlib, token-by-token, or arithmetic coding) to achieve general-purpose lossless compression. The compression and decompression commands of LLMZip are as follows.
# compression
torchrun --nproc_per_node 1 LLMzip_run.py --ckpt_dir llama2/llama-2-7b --tokenizer_path llama2/tokenizer.model --win_len 511 --text_file file --compression_folder file --encode_decode 0
# decompression
torchrun --nproc_per_node 1 LLMzip_run.py --ckpt_dir llama2/llama-2-7b --tokenizer_path llama2/tokenizer.model --win_len 511 --text_file file --compression_folder file --encode_decode 1
Gzip is a popular early general-purpose lossless compression program originally written by Jean-loup Gailly for the GNU project. The commands for Gzip are shown below.
# compression
gzip -9 -c file > file.gz
# decompression
gzip -d file.gz
PBzip2 is a parallel implementation of the Bzip2 block-sorting file compression algorithm that uses pthreads and achieves near-linear speedup on SMP devices. PBzip2 utilizes the Burrows-Wheeler block sorting algorithm for compressing files, along with Huffman coding for efficient text compression. This manuscript uses parallel Bzip2 V1.1.13 to compress data.
# compression
pbzip2 -9 -m2000 -p16 -c file > file.bz2
# decompression
pbzip2 -dc -p16 -m2000 file.bz2 > file.out
XZ Utils is free general-purpose data compression software with a high compression ratio. XZ Utils were written for POSIX-like systems, but also work on some not-so-POSIX systems. XZ Utils are the successor to LZMA Utils. In our experiments, we used XZ V5.5.0. The compression and decompression commands are as follows.
# compression
xz -9 -k -T16 -c file > file.xz
# decompression
xz -d -k -T16 file.xz
BSC is a high-performance file compressor based on lossless, block-sorting data compression algorithms. This manuscript uses BSC V3.3.2 to compress and decompress data.
# compression
bsc e file file.bsc -e2
# decompression
bsc d file.bsc file.bsc.out
SnZip is a traditional general-purpose lossless compression tool based on Snappy. It supports several file formats, including the framing format and the old framing format; the default is the framing format. The command line to run SnZip is as follows.
# compression
snzip -k -t snzip file
# decompression
snzip -kd -t snzip file.snz
LZMA2 improves the multi-threading capability and performance of the LZMA algorithm and better handles incompressible data, so the compression performance is slightly improved. We also used the built-in LZMA2 algorithm in the 7-Zip application.
# compression
7zz a -m0=lzma2 -mx9 -mmt16 file.7z file
# decompression
7zz x -y -mx9 -mmt16 file.7z
PPMD is a context-based compressor whose core idea is the Prediction by Partial Matching (PPM) algorithm proposed by Cleary and Witten. PPM is a statistical modeling technique that uses a set of previous symbols in the input to predict the next symbol and thus reduce the output data's entropy. PPM differs from a dictionary method in that it predicts the next symbol instead of looking the next symbol up in a dictionary to encode it. We used the PPMd implementation in 7-Zip to compress data.
# compression
7zz a -m0=ppmd -mx9 -mmt16 file.7z file
# decompression
7zz x -y -mx9 -mmt16 file.7z
LZ4 is a lossless compression algorithm with compression speeds greater than 500 MB/s per core (greater than 0.15 bytes/cycle). Its decoder is extremely fast, reaching several GB/s per core (about 1 byte/cycle). The latest LZ4 releases also support multi-threaded compression.
# compression
lz4 -12 -T16 file file.lz4
# decompression
lz4 -T16 -d file.lz4 file.lz4.out
Source-Version-Date: 2024.03.08, 2024.03.10.
Latest-Version-Date: 2024.07.28.
Authors: NBJL-AIGroup.
Contact us: https://nbjl.nankai.edu.cn, sunh@nbjl.nankai.edu.cn, and mahd@nbjl.nankai.edu.cn
2024.07.29: Modified the README file to include the X3 algorithm.
2024.12.15: Upgraded the Linux server from 128 GB of DDR4 RAM to 512 GB.