Lossless text compression using GPT-2 language model and Huffman coding

Modern daily-life activities produce an enormous amount of information, and storing it on digital devices or transmitting it over the Internet is challenging, which makes data compression a necessity. Research on data compression has therefore become a topic of great interest. Since compressed data is generally smaller than the original, compression saves storage and increases transmission speed. In this article, we propose a text compression technique using the GPT-2 language model and Huffman coding. In the proposed method, the Burrows-Wheeler transform and a list of keys are used to reduce the length of the original text file; the GPT-2 language model and then Huffman coding are applied for encoding. The proposed method is compared with state-of-the-art text compression techniques, and we show that it achieves a gain in compression ratio over those methods.


Introduction
It is not easy to manage the increasing amount of data produced every day, especially in medical centers and on social media. Saving these data on digital devices requires a great deal of storage space. According to the 6th edition of DOMO's report [1], more than 2.5 quintillion bytes of data are produced daily, and the amount keeps growing. The report further estimates that approximately 90% of the world's data was produced between 2018 and 2019, and that each person on earth would create 1.7 MB of data per second by 2020. There are three possible approaches to this problem: better hardware, better software, or a combination of the two. Nowadays, so much information is created that designing new hardware able to keep pace with data creation is impossible because of the hardware construction limits reported in [2]. Therefore, developing better software is the practical solution to the problem.
From the software viewpoint, the solution is compression. Compression is a way of representing data using fewer bits than the original. It reduces storage and bandwidth consumption and increases transmission speed over networks [3][4]. Compression is applied in many areas, such as audio, video, text, and images. There are two types of compression: lossless and lossy. Lossy compression permanently removes less important and irrelevant data, whereas lossless compression preserves every detail and eliminates only statistical redundancy. In short, lossy compression allows some degradation of the data, while lossless compression reconstructs the original perfectly from its compressed form [5][6]. Lossless compression has many applications, such as electronic document compression, medical imagery, the zip file format, and facsimile transmission of bitonal images [7][8][9].
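A minimal Python sketch of these two properties: lossless compression reconstructs the input exactly, and it shrinks only data that contains statistical redundancy, so random bytes barely compress.

```python
import os
import zlib

# Lossless compression removes statistical redundancy, so highly
# redundant text shrinks a lot while random bytes barely shrink.
redundant = b"to be or not to be " * 512
random_bytes = os.urandom(len(redundant))

for label, data in [("redundant", redundant), ("random", random_bytes)]:
    packed = zlib.compress(data, level=9)
    assert zlib.decompress(packed) == data   # perfect reconstruction
    print(label, len(data), "->", len(packed))
```

The roundtrip assertion holds for both inputs, but only the redundant text yields a meaningful size reduction.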
Our primary focus in this article is compressing a text file. The Burrows-Wheeler transform (BWT), Huffman coding, LZW (Lempel-Ziv-Welch), LZMA, Gzip, Bzip2, and Deflate are among the most popular text compression algorithms [10][11]. Statistical compression techniques such as Huffman coding and arithmetic coding assign shorter codes to the characters that repeat more frequently. On the other hand, LZ77 and LZW compress a text file using a dictionary: a set of substrings is built and each is assigned a pointer, as reported in [12][13]. Storer et al. [14] show that although LZW is a good compression technique, its search complexity is very high. Salomon [15] shows that Deflate, a Huffman- and LZSS-based text compression algorithm, provides better compression but is slow because it searches for long duplicate substrings. Gzip is an LZ77- and Huffman-coding-based text compression algorithm and compresses faster than Deflate, as reported in [16][17].

Rahman et al. present a Burrows-Wheeler transform (BWT), pattern matching, and Huffman coding-based text compression technique in [18] and claim better compression than Deflate, Bzip2, Gzip, LZMA, and LZW, although the method is somewhat slow.

The Burrows-Wheeler transform (BWT) is a reversible technique that converts a sequence of characters into runs of similar characters [19]; it is used by Bzip2, a lossless text compression algorithm [20]. Although many text compression techniques have already been developed, current technology needs a more effective text compression strategy.
From this point of view, we propose a straightforward but efficient lossless text compression procedure using the GPT-2 language model and Huffman coding. Our primary focus is on compression ratio, not compression speed. The proposal has three steps. First, we split a large text file into a set of small files and apply the Burrows-Wheeler transform to each file individually to speed up the transformation. Second, we use a list of uniquely defined keys to reduce the length of each file. Finally, the GPT-2 language model and then Huffman coding are applied for compression. We compare the proposed approach against several standard and advanced alternatives. Background studies are discussed in Section 2, the proposed technique is explained in Section 3, the experimental outcomes and analysis are presented in Section 4, and we conclude the paper in Section 5.

Background studies
Huffman coding is a lossless entropy coding technique that always generates an optimal code tree [21] and works in the opposite direction of Shannon-Fano coding: it sorts the probabilities in descending order, repeatedly joins the two lowest probabilities into a tree node, and finally uses the resulting tree for encoding. Huffman coding has two drawbacks. First, it is extremely sensitive to errors: modifying only one or two bits in transmission may destroy nearly the entire message. Second, it provides a relatively low compression rate, as it assigns a code-word to every symbol in the encoded matrix [22][23]. Nevertheless, Huffman coding compresses better than Shannon-Fano coding; a detailed analysis of both, with a numeric example, is given in [3]. For the dataset A = [1 2 2 1 2 2 2 6 6 6 4 4 4 3 3 4 4 4 4 5 5 5], the Huffman tree is shown in Figure 1. The figure shows that Huffman coding requires 2.4545 bits per symbol on average for A, which is 0.0122% less storage than Shannon-Fano coding.
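Although tie-breaking can produce differently shaped Huffman trees, every optimal tree yields the same average code length. The following illustrative Python sketch (not tied to any particular figure layout) reproduces the 2.4545 bits-per-symbol average for the dataset A:

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Build a Huffman tree and return {symbol: code length in bits}."""
    freq = Counter(symbols)
    # Heap entries: (frequency, unique tie-breaker, {symbol: depth so far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)   # join the two lowest probabilities
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

A = [1, 2, 2, 1, 2, 2, 2, 6, 6, 6, 4, 4, 4, 3, 3, 4, 4, 4, 4, 5, 5, 5]
lengths = huffman_code_lengths(A)
freq = Counter(A)
avg = sum(freq[s] * lengths[s] for s in freq) / len(A)
print(round(avg, 4))  # 2.4545
```

The weighted total is 54 bits over 22 symbols, i.e. 54/22 ≈ 2.4545 bits per symbol, matching the figure.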
Lempel-Ziv-Storer-Szymanski (LZSS), developed by James A. Storer and Thomas Szymanski, is a dictionary-based text compression technique derived from LZ77 [24]. Deflate, created by Phil Katz, compresses data using LZSS and Huffman coding together [25]. Because Deflate was covered by patents that restricted its widespread use, Jean-loup Gailly and Mark Adler developed Gzip, a Deflate-based data compression algorithm that is free to use. The pseudocode of LZ77 is shown in Algorithm 1.
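As a quick usage illustration (not the paper's pipeline), Python's standard library exposes Deflate through `zlib` and the Gzip container through `gzip`:

```python
import gzip
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 100

# Deflate combines LZSS-style match finding with Huffman coding;
# Gzip wraps the same Deflate stream in a header plus CRC trailer.
deflated = zlib.compress(data, level=9)
gzipped = gzip.compress(data, compresslevel=9)

assert zlib.decompress(deflated) == data
assert gzip.decompress(gzipped) == data
print(len(data), len(deflated), len(gzipped))
```

The gzip output is slightly larger than the raw Deflate stream because of the container overhead.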
Generative Pre-trained Transformer 2 (GPT-2) is a language model whose tokenizer uses the Byte Pair Encoding (BPE) technique. BPE operates on subwords instead of whole words or single characters in a neural network, as reported in [26][27][28]. The pseudocode of Byte Pair Encoding (BPE) is given in Algorithm 2.
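A minimal sketch of BPE merge learning, assuming a toy word list (`low`, `lower`, `lowest` are illustrative inputs, not from the paper):

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges: repeatedly join the most frequent adjacent pair."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # highest-frequency pattern
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])  # join the pair
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe_merges(["low", "lower", "lowest", "low"], 2))
```

After two merges the learned subword is `low`, showing how BPE builds subword units between the character and word levels.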

Proposed method
The most popular text compression algorithms include Gzip, Bzip2, the Lempel-Ziv-Markov chain algorithm (LZMA), and Brotli. Many of them concentrate on compression ratio, while others focus on speed; in this article, we mainly focus on compression ratio. The Burrows-Wheeler transform (BWT) takes a long time to transform a large file, as reported in [18]. To address this, our technique splits a large text file into a set of small files and applies the BWT to each file to speed up the conversion. Second, each transformed text is reduced using keys. We have observed that the GPT-2 model produces a list of Hangul syllables as output that contains far fewer Hangul characters than its input text; we save the number of Hangul characters produced for each section for later reconstruction. Lastly, we combine the outputs of the GPT-2 segments and apply Huffman coding for encoding. Figure 2 shows the overall encoding procedure of the proposed technique.
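The first two steps, splitting and transforming, can be sketched as follows. This is an illustrative rotation-sorting BWT (O(n² log n), not the optimized transform a production compressor would use), and the chunk size is an arbitrary example value:

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (illustrative only)."""
    assert sentinel not in text
    s = text + sentinel
    # Sort all rotations; the transform is the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def split_chunks(text, size):
    """Split a large text into small blocks so each BWT stays fast."""
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = split_chunks("banana" * 3, 6)
print([bwt(c) for c in chunks])  # ['annb$aa', 'annb$aa', 'annb$aa']
```

Each block is transformed independently, so the per-block cost stays small and blocks could even be processed in parallel.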

Experimental results and analysis
This section reports the experimental results of some of the most commonly used text compression methods and of the proposed technique, and discusses their overall performance. The first task is to choose the evaluation parameter: we evaluate all methods by the compression ratio defined in equation 1. Many text compression methods are used in real applications; we select Brotli, Bzip2, LZMA, and Gzip for comparison. We choose ten different text samples and apply the state-of-the-art and proposed techniques to each. The samples' compression ratios are listed in Table 1, and Figure 3 shows a graphical comparison.
Table 1 shows that, on average, Bzip2, Gzip, LZMA, Brotli, and the proposed technique give compression ratios of 2.91, 2.5954, 2.8924, 2.7791, and 2.9066, respectively, over the ten samples; Bzip2 and Gzip thus provide the highest and the lowest ratios. On average, the proposed technique compresses 10.707%, 0.489%, and 4.387% better than Gzip, LZMA, and Brotli, respectively, while Bzip2 performs 0.117% better than the proposed technique. Although Bzip2 compresses better on average, the proposed technique sometimes beats it: for text samples 2, 3, and 9, it provides 0.741%, 4.159%, and 0.133% more compression than Bzip2, respectively.

CR = (number of bits of the original text) / (number of bits of the compressed text)    (1)
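Equation 1 can be computed directly. The sketch below, using Python's standard-library compressors as stand-ins, assumes a made-up sample string rather than the paper's ten test files:

```python
import bz2
import gzip
import lzma

def compression_ratio(original, compressed):
    """CR = bits of original text / bits of compressed text (equation 1)."""
    return (len(original) * 8) / (len(compressed) * 8)

# Illustrative sample only; the paper's experiments use ten real text files.
sample = ("The quick brown fox jumps over the lazy dog. " * 200).encode()

for name, compress in [("bzip2", bz2.compress),
                       ("gzip", gzip.compress),
                       ("lzma", lzma.compress)]:
    print(name, round(compression_ratio(sample, compress(sample)), 4))
```

A ratio above 1 means the compressed file is smaller than the original; higher is better.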

Conclusions
This work proposes a simple yet effective text compression procedure using the Burrows-Wheeler transform, the GPT-2 language model, and Huffman coding. It is motivated by the observations that GPT-2 works with Byte Pair Encoding and emits far fewer Hangul characters than the original text contains, and that Huffman coding performs well on a small number of symbols. The experimental results show that the proposed technique provides, on average, better compression than the state-of-the-art techniques except Bzip2; even against Bzip2, it achieves a better reduction on at least 30% of the text samples. We therefore conclude that the proposed technique sometimes outperforms the state-of-the-art methods.

Algorithm 1: The pseudocode of the LZ77 algorithm
1  Take a text (input);
2  while input do
3      Find the longest prefix of the input in a window and assign it to PF;
4      if prefix then
5          Calculate the distance from where the prefix has started and assign it to X;
6          Calculate the length of the prefix and assign it to Y;
7          Assign to Z the character that follows the prefix in the input;
8      else
9          Assign 0 to X and Y;
10         Assign the first character of the input to Z;
11     end
12     Output a triple (X, Y, Z);
13     Move the cursor Y+1 positions to the right;
14 end

Algorithm 2: The pseudocode of the Byte Pair Encoding (BPE) algorithm
1  Split each word into a sequence of characters;
2  Find the highest-frequency pattern and perform the joining operation on it;
3  Repeat step 2 until the predefined maximum number of subwords or iterations is reached;
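A runnable Python rendering of Algorithm 1, with a matching decoder to check that the emitted triples reconstruct the input (the window size of 255 is an arbitrary illustrative choice):

```python
def lz77_encode(text, window=255):
    """LZ77: emit (distance, length, next_char) triples per Algorithm 1."""
    triples, cursor = [], 0
    while cursor < len(text):
        start = max(0, cursor - window)
        best_dist, best_len = 0, 0
        # Find the longest prefix of the remaining input inside the window.
        for dist in range(1, cursor - start + 1):
            length = 0
            while (cursor + length < len(text)
                   and text[cursor + length] == text[cursor - dist + (length % dist)]):
                length += 1
            if length > best_len:
                best_dist, best_len = dist, length
        # A next character must always follow the match, so cap the length.
        if cursor + best_len >= len(text):
            best_len = max(0, len(text) - cursor - 1)
        nxt = text[cursor + best_len]
        triples.append((best_dist if best_len else 0, best_len, nxt))
        cursor += best_len + 1   # move the cursor Y+1 positions to the right
    return triples

def lz77_decode(triples):
    """Rebuild the text by copying `length` chars from `distance` back."""
    out = []
    for dist, length, ch in triples:
        for _ in range(length):
            out.append(out[-dist])
        out.append(ch)
    return "".join(out)

msg = "abracadabra abracadabra"
assert lz77_decode(lz77_encode(msg)) == msg
```

The modulo in the match loop allows overlapping copies (e.g. runs like `aaaa`), which the decoder reproduces by copying characters it has just emitted.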

Figure 3: Comparison of compression ratios

Table 1: Experimental compression ratios