Compression by Substring Enumeration with a Finite Alphabet Using Sorting

Takahiro Ota; Hiroyoshi Morita; Akiko Manada

Summary

The 2018 International Symposium on Information Theory and Its Applications　(ISITA2018)

2018

Session Number:We-AM-1-3

Session:

Number:We-AM-1-3.4

Compression by Substring Enumeration with a Finite Alphabet Using Sorting

Takahiro Ota, Hiroyoshi Morita, Akiko Manada,

pp.587-591

Publication Date:2018/10/18

Online ISSN:2188-5079

DOI:10.34385/proc.55.We-AM-1-3.4

PDF download

Summary:

This paper proposes two variants of improved Compression by Substring Enumeration (CSE) with a finite alphabet. In previous studies on CSE, an encoder utilizes inequalities which evaluate the number of occurrences of a substring and a minimal forbidden word (MFW) to be encoded. Also, its codeword length is proportional to the difference between the upper and lower bounds deduced from the inequalities, but the lower bound is not tight. Therefore, in this paper, we derive a new tight lower bound and consequently propose a new CSE algorithm using the new inequality. We also propose a new encoding order of substrings and MFWs which are sorted by row and column marginal totals of the proper substrings and MFWs, instead of lexicographical order used in previous studies. We then propose a new CSE algorithm which is the first proposed CSE algorithm using the new encoding order. Experimental results show that compression ratios of all files of the Calgary corpus in the proposed algorithms are better than those of a previous study on CSE with a finite alphabet. Moreover, compression ratios under the second proposed CSE get better than or equal to that under a well-known compressor for 11 files amongst 14 files in the corpus.