A Note on Lempel-Ziv Parser Tails and Substring Lengths

T. Aaron Gulliver; Ulrich Speidel; Niko Rebenich

Summary

The 2018 International Symposium on Information Theory and Its Applications　(ISITA2018)

2018

Session Number:We-AM-1-3

Session:

Number:We-AM-1-3.3

A Note on Lempel-Ziv Parser Tails and Substring Lengths

T. Aaron Gulliver, Ulrich Speidel, Niko Rebenich,

pp.582-586

Publication Date:2018/10/18

Online ISSN:2188-5079

DOI:10.34385/proc.55.We-AM-1-3.3

PDF download

Summary:

LZ77, LZ78 and other compression algorithms use derivates of the original Lempel-Ziv (LZ) parsing algorithm, which computes the LZ production complexity. The focus of this paper is on the original algorithm. Like its derivates, it progressively factors an input string s into a series of steps, each of which consists of a reference to a substring starting in the already parsed part of the |s (reproduction) and a single symbol (innovation). In all LZ parsers, the input string generally ends on an incomplete step part-way through a reproduction rather than on an innovation symbol. This LZ parsing tail is usually given scant treatment as it tends to be short compared to the string length |s|, even though its length is bound only by |s| - 1. This paper examines the expected length of the tail for |s| produced by an i.i.d. Bernoulli source, and uses this to derive a new result on the statistics of subsequences of large strings.