acm-header
Sign In

Communications of the ACM

Research highlights

Resolution of the Burrows-Wheeler Transform Conjecture


geometric cubes, illustration

Credit: Getty Images

The Burrows-Wheeler Transform (BWT) is an invertible text transformation that permutes symbols of a text according to the lexicographical order of its suffixes. BWT is the main component of popular lossless compression programs (such as bzip2) as well as recent powerful compressed indexes (such as the r-index7), central in modern bioinformatics. The compressibility of BWT is quantified by the number r of equal-letter runs in the output. Despite the practical significance of BWT, no nontrivial upper bound on r is known. By contrast, the sizes of nearly all other known compression methods have been shown to be either always within a poly-log n factor (where n is the length of the text) from z, the size of Lempel–Ziv (LZ77) parsing of the text, or much larger in the worst case (by an nε factor for ε > 0).

In this paper, we show that r = cacm6506_bs.gif(z log2 n) holds for every text. This result has numerous implications for text indexing and data compression; in particular: (1) it proves that many results related to BWT automatically apply to methods based on LZ77, for example, it is possible to obtain functionality of the suffix tree in cacm6506_bs.gif(z polylog n) space; (2) it shows that many text processing tasks can be solved in the optimal time assuming the text is compressible using LZ77 by a sufficiently large polylog n factor; and (3) it implies the first nontrivial relation between the number of runs in the BWT of the text and of its reverse.

In addition, we provide an cacm6506_bs.gif(z polylog n)-time algorithm converting the LZ77 parsing into the run-length compressed BWT. To achieve this, we develop several new data structures and techniques of independent interest. In particular, we define compressed string synchronizing sets (generalizing the recently introduced powerful technique of string synchronizing sets11) and show how to efficiently construct them. Next, we propose a new variant of wavelet trees for sequences of long strings, establish a nontrivial bound on their size, and describe efficient construction algorithms. Finally, we develop new indexes that can be constructed directly from the LZ77 parsing and efficiently support pattern matching queries on text substrings.

Back to Top

1. Introduction

Lossless data compression aims to exploit redundancy in the input data to represent it in a small space. Despite the abundance of compression methods, nearly every existing tool falls into one of the few general frameworks, among which the three most popular are the following: Lempel–Ziv compression (where the nominal and most commonly used is the LZ77 variant25), statistical compression (this includes, e.g., context mixing, prediction by partial matching (PPM), and dynamic Markov coding), and Burrows-Wheeler transform (BWT).4

One of the features that best differentiates these algorithms is whether they better remove the redundancy caused by skewed symbol frequencies or by repeated fragments. The idea in LZ77 (which underlies, e.g., 7-zip and gzip compressors) is to partition the input text into long substrings, each having an earlier occurrence in the text. Every substring is then encoded as a pointer to the previous occurrence using a pair of integers. This method natively handles long repeated substrings and can achieve an exponential compression ratio given sufficiently repetitive input. Statistical compressors, on the other hand, are based on representing (predicting) symbols in the input based on their frequencies. This is formally captured by the notion of the kth order empirical entropy Hk(T). For any sufficiently long text T, symbol frequencies (taking length-k contexts into account) in any power of T (the concatenation of several copies of T) do not change significantly (see Kreft and Navarro,14 Lemma 2.6). Therefore, |Tt| Hk(Tt) ≈ t · |T| Hk(T) holds for any t ≥ 1, which means that entropy is not sensitive to long repetitions, and hence it is not able to capture the same type of redundancy as the LZ77 compression.7,12,14,24

The above analysis raises the question about the nature of compressibility of the Burrows-Wheeler transform. The compression of BWT-based compressors, such as bzip2, is quantified by the number r of equal-letter runs in the BWT. The clear picture described above no longer applies to the measure r. On the one hand, Manzini16 proved that r can be upper-bounded in terms of the kth order empirical entropy of the input string. On the other hand, already in 2008, Sirén et al.24 observed that BWT achieves excellent compression (superior to statistical methods) on highly repetitive collections and provided probabilistic analysis exhibiting cases when r is small. Yet, after more than a decade, no upper bound on r in terms of z (the size of the LZ77 parsing) was discovered.

This lack of understanding is particularly frustrating due to numerous applications of BWT in the field of bioinformatics and compressed computation. One of the most successful applications of BWT is in compressed indexing, which aims to store a compressed string simultaneously supporting various queries (such as random access, pattern matching, or even suffix array queries) concerning the uncompressed version. Although classical (uncompressed) indexes, such as suffix trees and suffix arrays, have been successful in many applications, they are not suitable for storing and searching big highly repetitive collections, which are virtually impossible to search without preprocessing. A recent survey17 provides several real-life examples of such datasets. In particular, Github stores more than 20 terabytes of data, with an average of over 20 versions per project, whereas the 100000 Human Genome Project produced over 70 terabytes of DNA, which is highly compressible due to 99.9% similarity between individual human genomes. Motivated by such applications, compressed indexing witnessed a remarkable surge of interest in recent years. BWT-based indexes, such as the r-index,7 are among the most powerful, and their space usage is up to cacm6506_bs.gif(polylog n) factors away from the value r. For a comprehensive overview of this field, we refer the reader to a survey by Navarro.18

In addition to text indexing, the Burrows-Wheeler transform has many applications in computational biology. In particular, BWT is the main component of the popular genome read aligners such as Bowtie, BWA, and Soap2, which paved its way to general bioinformatics textbooks.21 Specialized literature on algorithmic analysis of biological sequences15,19 devotes dozens of pages to BWT applications.

Given the importance and practical significance of BWT, one of the biggest open problems that emerged in the field of lossless data compression and compressed computation asks:

What is the upper bound on the output size of the Burrows-Wheeler transform?

Except for BWT, essentially every other known compression method has been proven to produce output whose size is always within an cacm6506_bs.gif(polylog n) factor from z, the output size of the LZ77 algorithm (e.g., grammar compression, collage systems, and macro schemes)a or larger by a polynomial factor (nε for some ε > 0) in the worst case (e.g., LZ78, compressed word graphs (CDAWGs)). We refer the reader to Navarro17 for a survey of repetitiveness measures. Notably, BWT is known to never compress much better than LZ77, that is, z = cacm6506_bs.gif(r log n),6 and the opposite relation r = cacm6506_bs.gif(z poly-log n) was often conjectured to be false.5

* 1.1. Our contribution

We prove that r = cacm6506_bs.gif(z log2 n) holds for all strings, resolving the BWT conjecture in the more surprising way and answering an open question of Gagie et al.5,6 This result alone has multiple implications for indexing and compression:

  1. It is possible to support suffix array and suffix tree functionality in cacm6506_bs.gif(z polylog n) space.7
  2. It was shown by Kempa10 that many string processing tasks (such as BWT and LZ77 construction) can be solved in cacm6506_bs.gif(n/logσ n+r polylog n) time (where σ is the alphabet size). Thus, if the text is sufficiently compressible by BWT (formally, n/r = Ω (polylog n)), these tasks can be solved in optimal time (which is unlikely to be possible for general texts11). Our result loosens this assumption to n/z = Ω (polylog n).
  3. Until now, methods based on the Burrows-Wheeler transform were thought to be neither statistical nor dictionary (LZ-like) compression algorithms.5,24 Our result challenges the notion that the BWT forms a separate compression type: Because of our bound, BWT is much closer to LZ compressors than anticipated.
    Our slightly stronger bound r = cacm6506_bs.gif(δ log2 n), where δ ≤ z is a symmetric (insensitive to string reversal) repetitiveness measure recently studied by Kociumaka et al.,13 further shows that:
  4. The number cacm6506_aw.gif of BWT runs in the reverse of the text satisfies cacm6506_b.gif, which is the first nontrivial bound in terms of r. This result is of practical importance due to many algorithms whose efficiency depends on cacm6506_aw.gif.2,20,22,23

After proving r = cacm6506_bs.gif(z log2 n), we refine our approach to obtain cacm6506_c.gif and subsequently cacm6506_d.gif. We then show that the latter bound, combined with the trivial one rn, is asymptotically tight for the full spectrum of values of n and δ. As a side result, we obtain a tight upper bound cacm6506_bs.gif(n log δ) on the sum of irreducible LCP values, improving upon the previously known bound cacm6506_bs.gif(n log r).9

Next, we develop an cacm6506_bs.gif(z log8 n)-time algorithm converting the LZ77 parsing into the run-length compressed BWT (the polylog n factor has not been optimized). This offers up to exponential speedup over the previously fastest space-efficient algorithms, which need Ω(n log z) time.20,22 To achieve this, we develop new data structures and techniques of independent interest. In particular, we introduce a notion of compressed string synchronizing sets, generalizing the technique by Kempa and Kociumaka.11 We then describe a new variant of wavelet trees,8 designed to work for sequences of long strings. In the full version of this paper, we also propose new indexes that can be built directly from the LZ77 parsing and support fast pattern matching queries on text substrings.

Back to Top

2. Preliminaries

A string is a finite sequence of characters from a given alphabet. The length of a string S is denoted by |S| and, for i ∈ [1 . . |S|],b the ith character of S is denoted by S[i]. A string U is a substring of S if U = S[i] S[i + 1] … S[j – 1] for some 1 ≤ ij ≤ |S| + 1. We then say that U occurs in S at position i. The occurrence of U at position i in S is denoted by S[i . . j) or S[i . . j – 1]. Such an occurrence is a fragment of S and can be represented by (a pointer to) S and the two positions i, j. Two fragments (perhaps of different strings) match if they are occurrences of the same substring. A fragment S[i . . j] is a prefix if i = 1 and a suffix if j = |S|.

We use cacm6506_a.gif to denote the reverse of S, that is, S[|S|] … S[2] S[1]. We denote the concatenation of two strings U and V, that is, U[1]U[2] … U[|U|]V[1]V[2] … V[|V|], by UV or U · V. Furthermore, cacm6506_e.gif denotes the concatenation of k ≥ 0 copies of S; note that S0 = ε is the empty string.

An integer p ∈ [1 . . |S|] is a period of a string S if S[i] = S[i + p] holds for every i ∈ [1 . . |S| – p]. The shortest period of S is denoted as per(S). A string S is called periodic if cacm6506_f.gif. The following fact is a consequence of the classic periodicity lemma (which we do not use directly).

FACT 2.1 (see Amir et al.,1 FACT 1). Any two distinct periodic strings of the same length differ on at least two positions.

Throughout the paper, we consider a string (called the text) T of length n ≥ 1 over an ordered alphabet Σ of size σ. We assume that T[n] = $, where $ is the smallest symbol in Σ, and that $ does not occur anywhere else in T. We use ⪯ to denote the order on Σ, extended to the lexicographic order on Σ* (the set of strings over Σ) so that U, V ∈ Σ* satisfy UV if and only if either U is a prefix of V, or U[1 . . i) = V[1 . . i) and U[i] ≺ V [i] holds for some i ∈ [1 . . min(|U|, |V|)].

The suffix array SA[1 . . n] of T is a permutation of [1 . . n] such that T[SA[1] . . n] ≺ T[SA[2] . . n] ≺ … ≺ T[SA[n] . . n]. The closely related Burrows-Wheeler transform BWT[1 . . n] of T is defined by BWT[i] = T[SA[i] – 1] if SA[i] > 1 and BWT[i] = T[n] otherwise. In the context of BWT, it is convenient to consider an infinite string T defined so that T[i] = T[1 + (i – 1) mod n] for i ∈ ℤ in particular, T [1 . . n] = T [1 . . n]. We then have BWT[i] = T [SA[i] – 1] for i ∈ [1 . . n]. Moreover, the assumption T[n] = $ implies T [SA[1] . .] ≺ … ≺ T[SA[n] . .].

For any string cacm6506_g.gif, where ci ∈ Σ and i ∈ ℤ>0 for i ∈ [1 . . h], and cici+1 for i ∈ [1 . . h), we define the run-length encoding of S as a sequence RL(S) = ((c1, λ1), …, (ch, λh)), where λi = 1 + … + i for i ∈ [1 . . h]. Throughout, we let r = |RL(BWT)| denote the number of runs in the BWT of T. For example, for the text T = bbabaababababaababa$ in Table 1, we have BWT = a1b6a1b2a6b1a2$1, and hence r = 8.

t1.jpg
Table 1. The lexicographically sorted suffixes of a string T = bbabaa-babababaababa$ along with the BWT, SA, and LCP tables.

By lcp(U, V) we denote the length of the longest common prefix of strings U and V. For j1, j2 ∈ [1 . . n], we let LCE(j1, j2) = lcp(T[j1 . .], T[j2 . .]). The LCP array LCP[1 . . n] is defined so that LCP[i] = LCE(SA[i], SA[i – 1]) for i ∈ [2 . . n] and LCP[1] = 0. We say that the value LCP[i] is reducible if BWT[i] = BWT[i – 1] and irreducible otherwise (including when i = 1). There are exactly r irreducible LCP values.

We say that a nonempty fragment T[i . . i + ) is a previous factor if it has an earlier occurrence in T, that is, LCE (i, i') ≥ holds for some i' ∈ [1 . . i). The LZ77 parsing of T is a factorization T = F1 ··· Ff into nonempty phrases such that the jth phrase Fj is the longest previous factor starting at position 1 + |F1 ··· Fj-1|; if no previous factor starts there, then Fj consists of a single character. In the underlying LZ77 representation, every phrase Fj = T [i . . i+) that is a previous factor is encoded as (i', ), where i' ∈ [1 . . i) satisfies LCE(i, i') ≥ (and is chosen arbitrarily in case of multiple possibilities); if Fj = T[i] is not a previous factor, then it is encoded as (T[i], 0). We denote the number of phrases in the LZ77 parsing by z. For example, the text bbabaababababaababa$ of Table 1 has LZ77 factorization b · b · a · ba · aba · bababa · ababa · $ with z = 8 phrases, and its LZ77 representation is (b, 0), (1, 1), (a, 0), (2, 2), (3, 3), (7, 6), (10, 5), ($, 0).

The trie of a finite set S ⊆ Σ* is an edge-labeled rooted tree with a node vX for every string X that is a prefix of a string SS. The trie is rooted at vε and the parent of each node vX with X ≠ ε is vX[1..|X|). In this case, the edge from vX[1..|X|) to vX is labeled with X[|X|]. See Figure 1 for the trie of {$aba, aaba, abaa, abab, abb$, b$ab, baab, baba, babb, bb$a}.

f1.jpg
Figure 1. The trie cacm6506_bt.gif of the reversed length-4 substrings of T for T = bbabaababababaababa$ of Table 1. Light edges are thin and dotted.

Back to Top

3. Combinatorial Bounds

* 3.1. Basic upper bound

To illustrate the main idea of our proof technique, we first prove the upper bound in its simplest form r = cacm6506_bs.gif(z log2 n). The following lemma stands at the heart of our proof.

LEMMA 3.1. For every ℓ ∈ [1 . . n], the number of irreducible LCP values in [ . . 2) is cacm6506_bs.gif(z log n).

PROOF. Denote Sm = {S ∈ Σm: S is a substring of T} for m ≥ 1. Observe that |Sm| ≤ mz because every length-m substring of T has an occurrence crossing or beginning at a phrase boundary of the LZ77 parsing of T. This includes substrings overlapping two copies of T, which cross the boundary between the last and the first phrase.

The idea of the proof is as follows: With each irreducible value LCP[i] ∈ [ . . 2), we associate a cost of units, which are charged to individual characters of strings in S3. We then show that each of the strings in S3 is charged at most 2 log n times. The number of irreducible LCP values in [ . . 2) equals cacm6506_h.gif times the total cost, which is at most

ueq01.gif

To devise the announced assignment of cost to characters of strings in S3, consider the trie cacm6506_bt.gif of all reversed strings in S (see Figure 1 for example).

Let LCP[i] ∈ [ . . 2) be an irreducible LCP value; note that i > 1 due to LCP[i] ≥ > 0. Let j0 = SA[i – 1] and j1 = SA[i] so that LCP[i] = LCE(j0, j1). Because LCP[i] is irreducible, we have T [j0 – 1] = BWT[i – 1] ≠ BWT[i] = T [j1 – 1]. For k ∈[1 .. ], the kth unit of the cost associated with LCP[i] is charged to the kth character (T[jt – 1]) of the string (T[jtk .. jtk + 3)∈S3, where t ∈ {0, 1} is such that the subtree of cacm6506_bt.gif rooted at cacm6506_i.gif contains fewer leaves than the subtree rooted at cacm6506_j.gif (we choose t = 0 in case of ties); see Figure 2 for an illustration.

f2.jpg
Figure 2. Proof of Lemma 3.1 on T from Table 1 for = 4, i = 11, and k = 2. Strings T [jtk . . jtk + 3) are highlighted. The subtree of cacm6506_bt.gif rooted in vbaa is smaller than in vbab (see Figure 1), and hence we charge the second symbol of T [j1k . . j1k + 3), that is, t = 1.

Note that at most log n characters of each SS3 can be charged during the above procedure: Whenever S[k], with k ∈ [1 .. ], is charged, the subtree of cacm6506_bt.gif rooted at cacm6506_k.gif has at least twice as many leaves as the subtree rooted at cacm6506_l.gif, and this can happen for at most log|S| ≤ log n nodes cacm6506_l.gif on the path from cacm6506_m.gif to the root of cacm6506_bt.gif.

It remains to show that, for every SS3, a single position S[k], with k ∈[1 .. ], can be charged at most twice. For this, observe that the characters charged for a single irreducible value LCP[i] are at different positions (of strings in S3). Hence, to analyze the total charge assigned to S[k], we only need to bound the number of possible candidate positions i. Let [b . . e] be the set of indices i' such that T [SA[i']..] starts with S[k + 1 .. 3]. In the above procedure, if a character S[k] is charged a unit of cost corresponding to LCP[i], then S[k + 1 .. 3] is a prefix of either T[SA[i−1]..]=T[j0 ..] or T[SA[i]..]=T[j1 ..]. Hence, {i−1,i}∩[b..e]≠∅. At the same time, LCE(SA[i –1], SA[i]) < 2, and all strings T[SA[i']..] with i' ∈[b .. e] share a common prefix S[k + 1 .. 3] of length 3k ≥ 2. Thus, i = b or i = e + 1. ⅈ

THEOREM 3.2. All length-n strings satisfy r = cacm6506_bs.gif(z log2 n).

PROOF. Recall that r is the total number of irreducible LCP values. Thus, the claim follows by applying Lemma 3.1 for i = 2i, with i ∈[0 .. ⌊log n⌋] , and observing that the number of LCP values 0 is exactly σ≤z.

* 3.2. Tighter upper bound

To obtain a tighter bound, we refine the ideas from Section 3.1, starting with a counterpart of Lemma 3.1.

LEMMA 3.3. For every ℓ ∈[1 .. n], the number of irreducible LCP values in [ .. 2) is cacm6506_bs.gif(z log z).

PROOF. The proof follows closely that of Lemma 3.1. However, with each irreducible value LCP[i] ∈ [ .. 2), we associate cost cacm6506_n.gif instead of ℓ. We then show that each of the strings in S3 is charged at most 2 · (3 + log z) times (rather than 2 log n times). Then, the number of irreducible LCP values in the range [ .. 2) does not exceed cacm6506_o.gif times the total cost, which is bounded by

ueq02.gif

Recall the trie cacm6506_bt.gif of all reversed strings in S. For a node v of cacm6506_bt.gif, by size(v), we denote the number of leaves in the subtree of cacm6506_bt.gif rooted in v. An edge connecting v ≠root(cacm6506_bt.gif) to its parent in cacm6506_bt.gif is called light if v has a sibling v' satisfying size(v') ≥ size(v) (see Figure 1). In the proof of Lemma 3.1, we observed that the characters S[k] of SS3 that can be charged correspond to light edges on the path from the root of cacm6506_bt.gif to the leaf cacm6506_m.gif: Whenever S[k], with k ∈[1 .. ], is charged, the edge connecting cacm6506_l.gif to its parent cacm6506_p.gif is light. We then noted that there are at most log | S | ≤ log n light edges on each root-to-leaf path in cacm6506_bt.gif. Here, we charge the characters of strings in S3 in the same way as in Lemma 3.1, but only for units cacm6506_q.gif. This implies that only characters S[k] of SS3 with cacm6506_r.gif are charged. It remains to show that any root-to-leaf path in cacm6506_bt.gif contains at most 3 + log z light edges between a node at depth at least cacm6506_s.gif and its child.

Consider a light edge from a node v to its parent u at depth at least cacm6506_s.gif. Let v' be a sibling of v satisfying size(v') ≥ size(v), and let Sv, Sv' be the labels of the paths from the root to v and v', respectively. These labels differ on the last position only so, by Fact 2.1, they cannot be both periodic. Let ∈ {v, v'} be such that S is not periodic, and let cacm6506_t.gif.

Consider the set S of length- strings corresponding to the leaves in the subtree of cacm6506_bt.gif rooted at (i.e., the labels of the root-to-leaf paths passing through ). Define cacm6506_u.gif and note that cacm6506_v.gif, because cacm6506_bt.gif is the trie of reversed strings from S. Let cacm6506_br.gif denote the ending positions of the leftmost occurrences in T[1 . .) of strings in cacm6506_a.gif. By definition, we have an occurrence of cacm6506_w.gif ending in T at every position ei with cacm6506_x.gif. Now, cacm6506_y.gif implies that cacm6506_z.gif for every cacm6506_aa.gif (otherwise, the two close occurrences of cacm6506_w.gif would yield cacm6506_ab.gif. Consequently, at least cacm6506_ac.gif length- substrings of T[1 . .) have disjoint leftmost occurrences. Because each leftmost occurrence crosses or begins at a phrase boundary of the LZ77 parsing of T, we conclude that cacm6506_ad.gif, and therefore cacm6506_ae.gif.

The reasoning above shows that once a root-to-leaf path encounters a light edge connecting a node u at depth at least cacm6506_s.gif to its child v, we have size(v) ≤ 4z. The number of the remaining light edges on the path is at most log(size(v)) ≤ 2 + log z by the standard bound applied to the subtree of cacm6506_bt.gif rooted at v.

THEOREM 3.4. Every string T of length n satisfies cacm6506_af.gif.

PROOF. To obtain tighter bounds on the number of irreducible LCP values in [ . . 2), we consider three cases:

log z. We repeat the proof of Lemma 3.1, except that we observe that the number of light edges on each root-to-leaf path in cacm6506_bt.gif is bounded by ℓ. Thus, the number of irreducible LCP values in [ . . 2) is cacm6506_bs.gif(z ℓ).

cacm6506_ag.gif. We use the bound cacm6506_bs.gif(z log z) of Lemma 3.3. cacm6506_ah.gif. We repeat the proof of Lemma 3.3, except that we observe that |S3| ≤ n. Thus, the number of irreducible LCP values in cacm6506_ai.gif.

The above upper bounds, applied for every = 2i with i ∈ [0 . . ⌊log n⌋], yield

ueq03.gif

* 3.3. Tight bounds in terms of δ

Let cacm6506_aj.gif denote the (cyclic) substring complexity of T.13 Note that letting cacm6506_ak.gif is equivalent because |Sm| ≤ n holds for m ≥ 1, which implies cacm6506_al.gif for mn. We start by noting that δ ≤ z because |Sm| ≤ mz holds for every m ≥ 1, as observed in the proof of Lemma 3.1. Furthermore, |Sm| ≤ mδ holds by definition of δ, so δ can replace z in the proof of Lemma 3.1.

To adapt the proof Lemma 3.3, we need to generalize the observation that at most z substrings from S may have disjoint leftmost occurrences in T[1 . .). This observation is easy because the LZ77 parsing naturally yields a set of z positions (phrase boundaries) in T. The substring complexity δ does not provide such structure, but as the lemma here implies, we can replace z by 3δ in the aforementioned observation. The proof of Lemma 3.5 is a straightforward modification of the argument used in Kociumaka et al.13 (Lemma 6). For completeness, we write down the full reasoning, with technical details tailored to our notation (e.g., S defined in terms of T rather than T).

LEMMA 3.5 (based on Kociumaka et al.,13 LEMMA 6). For any integer ℓ ≥ 1, the total number of positions in T[1 . .) covered by the leftmost occurrences of strings from S is at most 3δℓ.

PROOF. Let C denote the set of positions in T[1 . .) covered by the leftmost occurrences of strings from S, and let C' = C\[1 .. ). For any iC' denote S =T[i+1..i+], and let S ={Si:iC'}⊆S2. We will show that |S| = |C'|. Let iC'. First, observe that, due to i, the fragment Si is entirely contained in T[1 . .). Furthermore, by definition, Si. contains the leftmost occurrence of some SS. Thus, this occurrence of Si in T[1 . .) must also be the leftmost one in T[1 . .). Consequently, the substrings Si for iC' are distinct.

We have thus shown that |C'| = |S|≤|S2. Because |S2ℓ| ≤ 2δℓ holds by definition of δ, we obtain |C|<|C'|+ ℓ ≤ |S2|+≤(2δ+1)≤3δℓ. ⅈ

LEMMA 3.6. For every ℓ ∈[1 .. n], the number of irreducible LCP values in [ .. 2) is cacm6506_bs.gif(δ log δ).

PROOF. Compared to the proof of Lemma 3.3, we use the bound |S3| ≤ 3δ instead of |S3| ≤ 3ℓz. The only other modification required is that, for every light edge connecting a node u of cacm6506_bt.gif at depth at least cacm6506_s.gif to its child v, we need to prove size(v) = cacm6506_bs.gif(δ).

Let cacm6506_am.gif be defined as in the proof of Lemma 3.3. Recall that there are at least cacm6506_an.gif strings in S with disjoint leftmost occurrences in T[1 . .). By Lemma 3.5, there are at most 3δ such substrings. Thus, cacm6506_ao.gif. ⅈ

By replacing the thresholds log z and cacm6506_ap.gif with log δ and cacm6506_aq.gif, respectively, in the proof of Theorem 3.4, we immediately obtain a bound in terms of δ.

THEOREM 3.7. Every string T of length n satisfies cacm6506_ar.gif.

The following construction, analyzed in the full version of this paper, provides tight examples with substring complexity δ covering the whole spectrum between cacm6506_bs.gif(1) and cacm6506_as.gif. For ≥ 1 and x ∈[0 .. 2), we set bin (x) ∈ {0, 1} to be the binary representation of x.

LEMMA 3.8. For all integers ℓ ≥ 2 and K1, the length n, the substring complexity δ, and the number of runs r in the BWT of a string Tl,K ∈ {$, 0, 1, 2}+, defined with

ueq04.gif

satisfy n = Θ(2K+, δ=Θ(2), and r = Ω(2 ℓK).

If cacm6506_at.gif, on the other hand, then a trivial upper bound r = cacm6506_bs.gif(n) is tighter than that of Theorem 3.7. In this case, the following construction, also analyzed in the full version of this paper, provides tight examples.

LEMMA 3.9. For all integers ℓ ≥ 2 and Δ∈Ω(2)∩cacm6506_bs.gif(2), the length n, the substring complexity δ, and the number of runs r in the BWT of a string cacm6506_au.gif, defined with

ueq05.gif

satisfy n =Θ(2), δ=Θ(Δ), and r = Ω(2).

Overall, combining Lemmas 3.8 and 3.9, we obtain the following tight lower bound for δ ranging from cacm6506_bs.gif(1) to Ω(n).

THEOREM 3.10. For every N1 and Δ ∈ [1 .. N], there exists a string T whose length n, substring complexity δ, and the number of runs r in the BWT satisfy n=Θ(N), δ=Θ(Δ), and cacm6506_av.gif.

* 3.4. Further combinatorial bounds

By combining Theorem 3.7 with known properties of the substring complexity δ, we obtain the first bound relating the number of BWT runs in the string and its reverse. No such bounds (even polynomial in r log n) were known before.

COROLLARY 3.11. If r and cacm6506_aw.gif denote the number of runs in the BWT of a length-n text and its reverse, respectively, then cacm6506_ax.gif.

PROOF. Because the value of δ is the same for the text and its reverse, Theorem 3.7 yields cacm6506_ay.gif. Combining the results of Kempa and Prezza12 (Theorem 3.9) and Kociumaka et al.13 (Lemma 2) gives δ ≤ r. Consequently, we obtain cacm6506_az.gif.

In the full version of this paper, we also characterize the sum of all irreducible LCP values:

THEOREM 3.12. For every string of length n, the sum rΣ of all irreducible LCP values satisfies rΣ = cacm6506_bs.gif(n log δ).

Due to δ ≤ r, the presented upper bound is always (asymptotically) at least as strong as the previous bound rσn log r by Kärkkäinen et al.9 Furthermore, it can be strictly stronger because log δ = o(log r) is possible when δ = logo(1) n. In the full version of this paper, we show that the strings of Lemmas 3.8 and 3.9 yield a matching lower bound:

THEOREM 3.13. For every N ≥ 1 and Δ ∈ [1 .. N], there exists a string T whose length n, substring complexity δ, and sum rΣ of irreducible LCP values satisfy n=Θ(N), δ=Θ(Δ), and rΣ = Θ(n log δ).

Back to Top

4. From LZ77 to Run-Length BWT

In this section, we outline our algorithm that, given the LZ77 parsing of a text T ∈ Σn, computes its run-length compressed BWT in cacm6506_bs.gif(z polylog n) time. We start with an overview that explains the key concepts. Next, we present two new data structures utilized in our algorithm: the compressed string synchronizing set (Section 4.1) and the compressed wavelet tree (Section 4.2). The sketch of the final algorithm is then presented in Section 4.3.

For any substring Y of T, we define

ueq06.gif

We say that a substring Y of T is left-maximal if there exist distinct symbols a, b ∈ Σ such that the strings aY and bY are also substrings of T. The following definition, assuming Σ∩ℕ=∅, plays a key role in our construction.

DEFINITION 4.1 (BWT MODULO ). Let T ∈ Σn, ≥ 1 be an integer, and Yi= T[SA[i]..SA[i]+) for i ∈[1..n]. We define the string BWT ∈(Σ∪ℕ)n, called the BWT modulo (of T), as follows. For i ∈ [1 .. n],

ueq07.gif

The algorithm runs in k = ⌈log n⌉ rounds. For q ∈ [0 .. k), the input to the qth round is RL(BWT2q) and the output is RL(BWT2q+1). At the end of the algorithm, we have RL(BWT2k) = RL(BWT), because XS2k is never left-maximal for 2kn.

Informally, in round q, we are given a (run-length compressed) subsequence of BWT that can be determined based on sorting the suffixes only up to their prefixes of length 2q. Notice that BWT[b .. e] ∈ Σ* implies BWT+1 [b .. e] ∈ Σ* (because a prefix of a left-maximal substring is left-maximal). Hence, these subsequences need not be modified until the end of the algorithm (except possibly merging their runs with adjacent runs). For the remaining positions, BWT identifies the (leftmost occurrences of) substrings to be inspected in the qth round with the aim of replacing their corresponding runs in BWT with previously unknown BWT symbols (as defined in BWT2).

We call a block BWT[b . . e] uniform if all symbols in BWT[b . . e] are equal and nonuniform otherwise. The following lemma ensures the feasibility of the above construction.

LEMMA 4.2. For any ℓ ≥ 1, it holds |RL(BWT)| < 2r.

PROOF. Denote RL(BWT) = ((c1, λ1, …,(chh)), letting λ0 = 0. By definition of BWT, if ci ∈ ℕ, then the block BWT(λi-1 .. λi] is nonuniform. Thus, there are at most r – 1 runs of symbols from ℕ in BWT.

On the other hand, ci ∈ Σ and cj ∈ Σ, with i < j, cannot both belong to the same run in BWT. If this was true, then either ci+1 ∈ Σ (which implies ci+1 = ci, contradicting the definition of RL(BWT) ), or ci+1 ∈ ℕ, which is impossible because BWT(λi .. λi+1] would then be nonuniform. Thus, there are at most r runs of symbols from Σ in BWT. ⅈ

* 4.1. Compressed string synchronizing sets

Our algorithm builds on the notion of string synchronizing sets, recently introduced by Kempa and Kociumaka.11 Synchronizing sets are one of the most powerful techniques for sampling suffixes. In the uncompressed setting, they are the key in obtaining time-optimal solutions to multiple problems, such as a state-of-the-art BWT construction algorithm for texts over small alphabets.11 In this section, we introduce a notion of compressed string synchronizing sets. Our construction is the first implementation of synchronizing sets in the compressed setting and thus of independent interest.

We start with the definition of basic synchronizing sets.

DEFINITION 4.3 (τ-SYNCHRONIZING SET11). Let T ∈ Σn be a string and let cacm6506_ba.gif be a parameter. A set S⊆[1..n−2τ+1] is called a τ-synchronizing set of T if it satisfies the following consistency and density conditions:

  1. If T[i..i+2τ)=T[j..j+2τ), then i ∈ S holds if and only if j ∈ S (for i, j ∈[1 .. n – 2τ +1]),
  2. S∩[i..i+τ)=∅ if and only if cacm6506_bb.gif (for i ∈[1 .. n – 3τ + 2]).

In most applications, we want to minimize |S|. However, the density condition imposes a lower bound cacm6506_bc.gif for strings of length n ≥ 3τ – 1 that do not contain substrings of length 3τ – 1 that are periodic with period at most cacm6506_bd.gif. Therefore, we cannot hope to achieve an upper bound improving in the worst case upon the following one.

THEOREM 4.4 (Kempa and Kociumaka11). For any string T of length n and parameter cacm6506_be.gif, there exists a τ-synchronizing set S of size cacm6506_bf.gif. Moreover, such S can be (deterministically) constructed in cacm6506_bs.gif(n) time.

Storing S for compressible strings presents the following challenge: Although cacm6506_bc.gif is sometimes possible, it is not implied by zn. For example, as noted by Bille et al.,3 Thue–Morse strings satisfy z = cacm6506_bs.gif(log n) yet they contain no periodic substring X with cacm6506_bg.gif, and thus their synchronizing sets satisfy cacm6506_bc.gif. This prevents us from keeping the plain representation of S when cacm6506_bh.gif.

We therefore exploit a different property of compressible strings: That all their substrings Y satisfy lpos(Y) cacm6506_bi.gif, where ej is the last position of the jth phrase in the LZ77 parsing of T. By consistency of S, it suffices to store cacm6506_bj.gif. To decide whether i ∈ S, we then locate i' = lpos(T[i .. i + 2τ) and check if cacm6506_bk.gif. This motivates the following (more general) definition.

DEFINITION 4.5 (COMPRESSED τ-SYNCHRONIZING SET). Let S be a τ-synchronizing set of string T[1 . . n] for some cacm6506_bl.gif, and, for every j ∈[1 .. z], let ej denote the last position of the jth phrase in the LZ77 parsing of T. For k ∈ ℤ≥2, we define the compressed representation of S as

ueq08.gif

In the full version of this paper, we prove that every text T has a synchronizing set S with a small compressed representation, and we show how to efficiently compute such S from the LZ77 parsing of T.

THEOREM 4.6. There exists a Las-Vegas randomized algorithm that, given the LZ77 parsing of a string T ∈ Σn, a parameter cacm6506_bl.gif, and a constant k ∈ ℤ≥2, constructs in cacm6506_bs.gif(z log5 n) time a compressed representation compk(S) of a τ-synchronizing set S of T satisfying |compk(S)|≤ 72kz.

* 4.2. Compressed wavelet trees

Along with string synchronizing sets, wavelet trees,8 originally invented for text indexing, play a central role in our algorithm. Unlike virtually all prior applications of wavelet trees, ours uses a sequence of very long strings (up to Θ(n) symbols). This approach is feasible because all strings are substrings of the text, which is stored in the LZ77 representation. In this section, we describe this novel variant of wavelet trees, dubbed here compressed wavelet trees. In particular, we prove an upper bound on their size, describe an efficient construction from the LZ77-compressed text, and show how to augment them to support some fundamental queries.

Let Σ be an alphabet of size σ ≥ 1. Consider a string W[1 . . m] over the alphabet Σ so that W is a sequence of m ≥ 0 strings of length ≥ 0 over the alphabet Σ. The wavelet tree of W is the trie cacm6506_bt.gif of Σ with each node vX associated to a string BX ∈ Σ* defined here based on W. We let cacm6506_bm.gif denote the node-set of cacm6506_bt.gif.

With each node vXV(cacm6506_bt.gif), we associate an increasing sequence IX [1 . . h] of primary indices such that

ueq09.gif

Based on IX, we define BX ∈ Σ* so that, for i ∈ [1 . . h],

ueq10.gif

if |X| < , and BX = ε if |X| = ℓ. In other words, BX is a string containing the symbol at position |X| + 1 for each string of W that is prefixed by X. Importantly, the symbols in BX occur in the same order as these strings occur in W.

As typically done in the applications of wavelet trees, we only explicitly store the strings BX. The values of primary indices IX are retrieved using additional data structures, based on the following observation.

LEMMA 4.7 (Grossi et al.8). Let X ∈ Σd, where d ∈ [0 . . ). For every c ∈ Σ and j ∈ [1 . . |IXc|], we have IXc [j] = IX [i], where BX[i] is the jth occurrence of c in BX.

We define the compressed wavelet tree cacm6506_bt.gifc of W as the wavelet tree of W in which all strings BX have been run-length compressed and, except cacm6506_bn.gif, all nodes vX satisfying |RL(BX)| ≤ 1 have been removed (the unary paths are collapsed into single edges with their labels concatenated). The shape and edge labels of the resulting tree are identical to the compacted trie of {W[1], …, W[m]}.

We store the edge labels of cacm6506_bt.gifc as pointers to substrings in W. We assume that each edge of cacm6506_bt.gifc and each element of RL(BX) can be encoded in cacm6506_bs.gif(1) space. Because |RL(BY)| ≥ 1 holds for every internal node vYV(cacm6506_bt.gifc) and, unless |V(cacm6506_bt.gifc)| = 1, each leaf vZ in cacm6506_bt.gifc can be injectively mapped to an element of RL(BZ') for the parent vZ' of vZ, the tree cacm6506_bt.gifc uses cacm6506_bo.gif space.

THEOREM 4.8. If cacm6506_bt.gifc is the compressed wavelet tree of a sequence W of length-l strings, then cacm6506_bp.gif.

PROOF. Let m = |W|, k = |RL(W)| ≤ m, and k' = |{W[i] : i ∈ [1 . . m]}|. Due to |V(Tc)|≤ 1+2k'=cacm6506_bs.gif(1 + k), we can focus on nodes vxV(Tc) such that |RL(BX)| ≥ 2.

The proof resembles that of Lemma 3.1. With each X ∈ Σ* such that |RL(BX)| ≥ 2, we associate |RL(BX)| – 1 units of cost and charge them to individual elements of W. We then show that each run in RL(W) is in total charged at most 2 log k' units of cost. Consequently,

ueq11.gif

Consider X ∈ Σd with |RL(BX)| ≥ 2; note that d < ℓ. Let RL(BX) = ((c1, …, (ch, λh)). Observe that if we let p0 = IXi] and p1 = IXi + 1] for some i ∈ [1 . . h), then W[p0][d + 1] = cici + 1 = W[p1][d + 1]. Moreover, BXi] ≠ BXi + 1] implies W[p0 + 1] ≠ W[p0] and W[p1 – 1] ≠ W[p1]. The ith unit of cost is charged to W[pt], where t ∈ {0, 1} is chosen depending on the sizes of subtrees of cacm6506_bt.gifc rooted at the children of vX, so that the subtree containing vw[pt] has at most as many leaves as the subtree containing vw[p1−t].

Now, consider a run W[b . . b'] = Yδ in RL(W). For a single depth d, the run could be charged at most twice, with at most one unit assigned to W[b] due to p1 = b and at most one unit assigned to W[b'] due to p0 = b', both for X = Y[1 . . d]. Moreover, note that the subtree size on the path from vY to the root vε of cacm6506_bt.gifc doubles for every depth d for which the run was charged. Thus, the total charge of the run is at most 2 log k' units. ⅈ

The key operation that we want to support on cacm6506_bt.gifc is to compute the value IX [q] given q ∈ [1 . . |IX|] and a pointer to vxV(Tc). Let W[1 . . m] be a sequence of substrings of T of common length . Notice that if we have access to T, then the sequence W can be encoded in cacm6506_bs.gif(1 + |RL(W)|) space. Namely, it suffices to store the length and the sequence RL((lpos(W[i])) i ∈ [1 .. m]). In the full version of this paper, we show that, given such encoding of W and the LZ77 parsing of T, the compressed wavelet tree of W supporting fast primary index queries can be constructed efficiently.

THEOREM 4.9. Given the LZ77 representation of T[1 . . n] and a sequence W[1 .. m] of mn same-length substrings of T, represented as RL((lpos(W[i]))i ∈[1 .. m]), the compressed wavelet tree of W, supporting primary index queries in time cacm6506_bs.gif(log4 n), can be constructed in time cacm6506_bs.gif((z+|RL(W)|) log2 n).

* 4.3. The algorithm

We are now ready to sketch the procedure constructing the sequences RL(BWT2q), where q ∈ [0 . . ⌈log n⌉].

Let q ∈ [4 . . ⌈log n⌉) (values q ∈ [0 . . 3] are handled separately) and let = 2q. We show how to compute RL(BWT2) given RL(BWT) and the LZ77 parsing of T. The main idea of the algorithm is as follows.

Let S be a τ-synchronizing set of T, where cacm6506_bq.gif. As noted earlier, BWT [j] ∈ Σ implies BWT2 [j] ∈ Σ. Let BWT [y . . y'] ∈ ℕ+ be a run in BWT. By definition of BWT, the suffixes of T starting at positions i ∈ SA[y . . y'] share a common prefix of length ≥ 3τ. Thus, assuming that S∩[i..i+τ) = ⌽ holds for all i ∈ SA[y . . y'] (the periodic case is handled separately), by the consistency of S, all text positions i ∈ SA[y . . y'] share a common offset Δ with i + Δ = min (S∩[i..i+τ)). This lets us deduce the order of length-21 prefixes T[i . . i+2) based on the order of strings T[i + Δ . . i + 2) starting at synchronizing positions. For this, from the sorted list of fragments T[s . . s + 2) across s ∈ S, we extract, using a wavelet tree, those preceded by T[i . . i + Δ) (a prefix common to T[i . .) for i ∈ SA[y . . y']). Importantly, the synchronizing positions s sharing T[s – τ . . s + 2) can be processed together; hence, by Theorem 4.6, applied for k = 7 so that 2kτ, it suffices to use cacm6506_bs.gif(z) distinct substrings.

In the full version of the paper, we provide the details of the above construction and obtain the following result.

THEOREM 4.10. There exists a Las-Vegas randomized algorithm that, given the LZ77 parsing of a text T of length n, computes its run-length compressed Burrows-Wheeler transform in time cacm6506_bs.gif((r + z) log6 n) = cacm6506_bs.gif(z log8 n).

Back to Top

Acknowledgments

This work was performed when the first author was a post-doctoral scholar at the University of California, Berkeley (supported by NSF grants no. 1652303 and 1934846, and an Alfred P. Sloan Fellowship grant), whereas the second author was a postdoctoral scholar at Bar-Ilan University, Israel (supported by ISF grants no. 1278/16 and 1926/19, a BSF grant no. 2018364, and an ERC grant MPM under the EU's Horizon 2020 Research and Innovation Programme, agreement no. 683064).

Back to Top

References

1. Amir, A., Iliopoulos, C.S., Radoszewski, J. Two strings at Hamming distance 1 cannot be both quasiperiodic. Inf. Process. Lett. 128, (2017), 54–57.

2. Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M. Composite repetition-aware data structures. In CPM (2015), Springer, Cham, Switzerland, 26–39.

3. Bille, P., Gagie, T., Gørtz, I.L., Prezza, N. A separation between RLSLPs and LZ77. J. Discrete Algorithms 50, (2018), 36–39.

4. Burrows, M., Wheeler, D.J. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation. Palo Alto, CA, 1994.

5. Gagie, T., Navarro, G., Prezza, N. Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. arXiv 1809.02792 (2018).

6. Gagie, T., Navarro, G., Prezza, N. On the approximation ratio of Lempel–Ziv parsing. In LATIN (2018), Springer, Cham, Switzerland, 490–503.

7. Gagie, T., Navarro, G., Prezza, N. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67, 1 (2020), 1–54.

8. Grossi, R., Gupta, A., Vitter, J.S. High-order entropy-compressed text indexes. In SODA (2003), ACM/SIAM, Philadelphia, PA, USA, 841–850.

9. Kärkkäinen, J., Kempa, D., Piătkowski, M. Tighter bounds for the sum of irreducible LCP values. Theor. Comput. Sci. 656, (2016), 265–278.

10. Kempa, D. Optimal construction of compressed indexes for highly repetitive texts. In SODA (2019), SIAM, Philadelphia, PA, USA, 1344–1357.

11. Kempa, D., Kociumaka, T. String synchronizing sets: Sublinear-time BWT construction and optimal LCE data structure. In STOC (2019), ACM, New York, NY, USA, 756–767.

12. Kempa, D., Prezza, N. At the roots of dictionary compression: String attractors. In STOC (2018), ACM, New York, NY, USA, 827–840.

13. Kociumaka, T., Navarro, G., Prezza, N. Towards a definitive measure of repetitiveness. In LATIN (2020), Springer, Cham, Switzerland, 207–219.

14. Kreft, S., Navarro, G. On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483 (2013), 115–133.

15. Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, Cambridge, UK, 2015.

16. Manzini, G. An analysis of the Burrows-Wheeler transform. J. ACM 48, 3 (2001), 407–430.

17. Navarro, G. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Comput. Surv. 54, 2 (2021), 1–31.

18. Navarro, G. Indexing highly repetitive string collections, part II: Compressed indexes. ACM Comput. Surv. 54, 2 (2021), 1–32.

19. Ohlebusch, E. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, Bremen, Germany, 2013.

20. Ohno, T., Sakai, K., Takabatake, Y., Tomohiro, I, Sakamoto, H. A faster implementation of online RLBWT and its application to LZ77 parsing. J. Discrete Algorithms, (2018), 52-53:18–28.

21. Pevsner, J. Bioinformatics and Functional Genomics, 3rd edn. Wiley-Blackwell, Chichester, UK, 2015.

22. Policriti, A., Prezza, N. From LZ77 to the run-length encoded Burrows-Wheeler transform, and back. In CPM (2017), Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 17:1–17:10.

23. Policriti, A., Prezza, N. LZ77 computation based on the run-length encoded BWT. Algorithmica 80, 7 (2018), 1986–2011.

24. Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G. Run-length compressed indexes are superior for highly repetitive sequence collections. In SPIRE (2008), Springer, Berlin, Heidelberg, Germany, 164–175.

25. Ziv, J., Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3 (1977), 337–343.

Back to Top

Authors

Dominik Kempa ([email protected]), Johns Hopkins University, Baltimore, MD, USA.

Tomasz Kociumaka ([email protected]), University of California, Berkeley, CA, USA.

Back to Top

Footnotes

a. The choice for LZ77 as a representative in this class follows from the fact that most other methods are NP-hard to optimize, whereas LZ77 admits a simple linear-time compression.

b. For integers i, j ∈ ℤ, we denote [i . . j] = {k ∈ ℤ : ikj}, [i . . j) = {k ∈ ℤ : ik < j}, and (i . . j] = {k ∈ ℤ : i < k < j}.

The original version of this paper was presented at FOCS 2020 and is available at https://arxiv.org/abs/1910.10631

To view the accompanying Technical Perspective, visit doi.acm.org/10.1145/3531443


cacm_ccby.gif This work is licensed under a https://creativecommons.org/licenses/by/4.0/

The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.


 

No entries found