The 2018 International Symposium on Information Theory and Its Applications (ISITA2018)
A Note on Lempel-Ziv Parser Tails and Substring Lengths
T. Aaron Gulliver, Ulrich Speidel, Niko Rebenich,
LZ77, LZ78 and other compression algorithms use derivates of the original Lempel-Ziv (LZ) parsing algorithm, which computes the LZ production complexity. The focus of this paper is on the original algorithm. Like its derivates, it progressively factors an input string s into a series of steps, each of which consists of a reference to a substring starting in the already parsed part of the |s (reproduction) and a single symbol (innovation). In all LZ parsers, the input string generally ends on an incomplete step part-way through a reproduction rather than on an innovation symbol. This LZ parsing tail is usually given scant treatment as it tends to be short compared to the string length |s|, even though its length is bound only by |s| - 1. This paper examines the expected length of the tail for |s| produced by an i.i.d. Bernoulli source, and uses this to derive a new result on the statistics of subsequences of large strings.