The Burrows-Wheeler Transform (BWT) is an invertible text transformation that permutes symbols of a text according to the lexicographical order of its suffixes. BWT is the main component of popular lossless compression programs (such as bzip2
) as well as recent powerful compressed indexes (such as the r-index7), central in modern bioinformatics. The compressibility of BWT is quantified by the number r of equal-letter runs in the output. Despite the practical significance of BWT, no nontrivial upper bound on r is known. By contrast, the sizes of nearly all other known compression methods have been shown to be either always within a poly-log n factor (where n is the length of the text) from z, the size of Lempel–Ziv (LZ77) parsing of the text, or much larger in the worst case (by an nε factor for ε > 0).
In this paper, we show that r = (z log2 n) holds for every text. This result has numerous implications for text indexing and data compression; in particular: (1) it proves that many results related to BWT automatically apply to methods based on LZ77, for example, it is possible to obtain functionality of the suffix tree in (z polylog n) space; (2) it shows that many text processing tasks can be solved in the optimal time assuming the text is compressible using LZ77 by a sufficiently large polylog n factor; and (3) it implies the first nontrivial relation between the number of runs in the BWT of the text and of its reverse.
In addition, we provide an (z polylog n)-time algorithm converting the LZ77 parsing into the run-length compressed BWT. To achieve this, we develop several new data structures and techniques of independent interest. In particular, we define compressed string synchronizing sets (generalizing the recently introduced powerful technique of string synchronizing sets11) and show how to efficiently construct them. Next, we propose a new variant of wavelet trees for sequences of long strings, establish a nontrivial bound on their size, and describe efficient construction algorithms. Finally, we develop new indexes that can be constructed directly from the LZ77 parsing and efficiently support pattern matching queries on text substrings.
Lossless data compression aims to exploit redundancy in the input data to represent it in a small space. Despite the abundance of compression methods, nearly every existing tool falls into one of the few general frameworks, among which the three most popular are the following: Lempel–Ziv compression (where the nominal and most commonly used is the LZ77 variant25), statistical compression (this includes, e.g., context mixing, prediction by partial matching (PPM), and dynamic Markov coding), and Burrows-Wheeler transform (BWT).4
One of the features that best differentiates these algorithms is whether they better remove the redundancy caused by skewed symbol frequencies or by repeated fragments. The idea in LZ77 (which underlies, e.g., 7-zip
and gzip
compressors) is to partition the input text into long substrings, each having an earlier occurrence in the text. Every substring is then encoded as a pointer to the previous occurrence using a pair of integers. This method natively handles long repeated substrings and can achieve an exponential compression ratio given sufficiently repetitive input. Statistical compressors, on the other hand, are based on representing (predicting) symbols in the input based on their frequencies. This is formally captured by the notion of the kth order empirical entropy Hk(T). For any sufficiently long text T, symbol frequencies (taking length-k contexts into account) in any power of T (the concatenation of several copies of T) do not change significantly (see Kreft and Navarro,14 Lemma 2.6). Therefore, |Tt| Hk(Tt) ≈ t · |T| Hk(T) holds for any t ≥ 1, which means that entropy is not sensitive to long repetitions, and hence it is not able to capture the same type of redundancy as the LZ77 compression.7,12,14,24
The above analysis raises the question about the nature of compressibility of the Burrows-Wheeler transform. The compression of BWT-based compressors, such as bzip2
, is quantified by the number r of equal-letter runs in the BWT. The clear picture described above no longer applies to the measure r. On the one hand, Manzini16 proved that r can be upper-bounded in terms of the kth order empirical entropy of the input string. On the other hand, already in 2008, Sirén et al.24 observed that BWT achieves excellent compression (superior to statistical methods) on highly repetitive collections and provided probabilistic analysis exhibiting cases when r is small. Yet, after more than a decade, no upper bound on r in terms of z (the size of the LZ77 parsing) was discovered.
This lack of understanding is particularly frustrating due to numerous applications of BWT in the field of bioinformatics and compressed computation. One of the most successful applications of BWT is in compressed indexing, which aims to store a compressed string simultaneously supporting various queries (such as random access, pattern matching, or even suffix array queries) concerning the uncompressed version. Although classical (uncompressed) indexes, such as suffix trees and suffix arrays, have been successful in many applications, they are not suitable for storing and searching big highly repetitive collections, which are virtually impossible to search without preprocessing. A recent survey17 provides several real-life examples of such datasets. In particular, Github stores more than 20 terabytes of data, with an average of over 20 versions per project, whereas the 100000 Human Genome Project produced over 70 terabytes of DNA, which is highly compressible due to 99.9% similarity between individual human genomes. Motivated by such applications, compressed indexing witnessed a remarkable surge of interest in recent years. BWT-based indexes, such as the r-index,7 are among the most powerful, and their space usage is up to (polylog n) factors away from the value r. For a comprehensive overview of this field, we refer the reader to a survey by Navarro.18
In addition to text indexing, the Burrows-Wheeler transform has many applications in computational biology. In particular, BWT is the main component of the popular genome read aligners such as Bowtie
, BWA
, and Soap2
, which paved its way to general bioinformatics textbooks.21 Specialized literature on algorithmic analysis of biological sequences15,19 devotes dozens of pages to BWT applications.
Given the importance and practical significance of BWT, one of the biggest open problems that emerged in the field of lossless data compression and compressed computation asks:
What is the upper bound on the output size of the Burrows-Wheeler transform?
Except for BWT, essentially every other known compression method has been proven to produce output whose size is always within an (polylog n) factor from z, the output size of the LZ77 algorithm (e.g., grammar compression, collage systems, and macro schemes)a or larger by a polynomial factor (nε for some ε > 0) in the worst case (e.g., LZ78, compressed word graphs (CDAWGs)). We refer the reader to Navarro17 for a survey of repetitiveness measures. Notably, BWT is known to never compress much better than LZ77, that is, z = (r log n),6 and the opposite relation r = (z poly-log n) was often conjectured to be false.5
1.1. Our contribution
We prove that r = (z log2 n) holds for all strings, resolving the BWT conjecture in the more surprising way and answering an open question of Gagie et al.5,6 This result alone has multiple implications for indexing and compression:
After proving r = (z log2 n), we refine our approach to obtain and subsequently . We then show that the latter bound, combined with the trivial one r ≤ n, is asymptotically tight for the full spectrum of values of n and δ. As a side result, we obtain a tight upper bound (n log δ) on the sum of irreducible LCP values, improving upon the previously known bound (n log r).9
Next, we develop an (z log8 n)-time algorithm converting the LZ77 parsing into the run-length compressed BWT (the polylog n factor has not been optimized). This offers up to exponential speedup over the previously fastest space-efficient algorithms, which need Ω(n log z) time.20,22 To achieve this, we develop new data structures and techniques of independent interest. In particular, we introduce a notion of compressed string synchronizing sets, generalizing the technique by Kempa and Kociumaka.11 We then describe a new variant of wavelet trees,8 designed to work for sequences of long strings. In the full version of this paper, we also propose new indexes that can be built directly from the LZ77 parsing and support fast pattern matching queries on text substrings.
A string is a finite sequence of characters from a given alphabet. The length of a string S is denoted by |S| and, for i ∈ [1 . . |S|],b the ith character of S is denoted by S[i]. A string U is a substring of S if U = S[i] S[i + 1] … S[j – 1] for some 1 ≤ i ≤ j ≤ |S| + 1. We then say that U occurs in S at position i. The occurrence of U at position i in S is denoted by S[i . . j) or S[i . . j – 1]. Such an occurrence is a fragment of S and can be represented by (a pointer to) S and the two positions i, j. Two fragments (perhaps of different strings) match if they are occurrences of the same substring. A fragment S[i . . j] is a prefix if i = 1 and a suffix if j = |S|.
We use to denote the reverse of S, that is, S[|S|] … S[2] S[1]. We denote the concatenation of two strings U and V, that is, U[1]U[2] … U[|U|]V[1]V[2] … V[|V|], by UV or U · V. Furthermore, denotes the concatenation of k ≥ 0 copies of S; note that S0 = ε is the empty string.
An integer p ∈ [1 . . |S|] is a period of a string S if S[i] = S[i + p] holds for every i ∈ [1 . . |S| – p]. The shortest period of S is denoted as per(S). A string S is called periodic if . The following fact is a consequence of the classic periodicity lemma (which we do not use directly).
FACT 2.1 (see Amir et al.,1 FACT 1). Any two distinct periodic strings of the same length differ on at least two positions.
Throughout the paper, we consider a string (called the text) T of length n ≥ 1 over an ordered alphabet Σ of size σ. We assume that T[n] = $, where $ is the smallest symbol in Σ, and that $ does not occur anywhere else in T. We use ⪯ to denote the order on Σ, extended to the lexicographic order on Σ* (the set of strings over Σ) so that U, V ∈ Σ* satisfy U ⪯ V if and only if either U is a prefix of V, or U[1 . . i) = V[1 . . i) and U[i] ≺ V [i] holds for some i ∈ [1 . . min(|U|, |V|)].
The suffix array SA[1 . . n] of T is a permutation of [1 . . n] such that T[SA[1] . . n] ≺ T[SA[2] . . n] ≺ … ≺ T[SA[n] . . n]. The closely related Burrows-Wheeler transform BWT[1 . . n] of T is defined by BWT[i] = T[SA[i] – 1] if SA[i] > 1 and BWT[i] = T[n] otherwise. In the context of BWT, it is convenient to consider an infinite string T∞ defined so that T∞[i] = T[1 + (i – 1) mod n] for i ∈ ℤ in particular, T∞ [1 . . n] = T [1 . . n]. We then have BWT[i] = T∞ [SA[i] – 1] for i ∈ [1 . . n]. Moreover, the assumption T[n] = $ implies T∞ [SA[1] . .] ≺ … ≺ T∞[SA[n] . .].
For any string , where ci ∈ Σ and ℓi ∈ ℤ>0 for i ∈ [1 . . h], and ci ≠ ci+1 for i ∈ [1 . . h), we define the run-length encoding of S as a sequence RL(S) = ((c1, λ1), …, (ch, λh)), where λi = ℓ1 + … + ℓi for i ∈ [1 . . h]. Throughout, we let r = |RL(BWT)| denote the number of runs in the BWT of T. For example, for the text T = bbabaababababaababa$
in Table 1, we have BWT = a1b6a1b2a6b1a2$1
, and hence r = 8.
Table 1. The lexicographically sorted suffixes of a string T = bbabaa-babababaababa$
along with the BWT, SA, and LCP tables.
By lcp(U, V) we denote the length of the longest common prefix of strings U and V. For j1, j2 ∈ [1 . . n], we let LCE(j1, j2) = lcp(T[j1 . .], T[j2 . .]). The LCP array LCP[1 . . n] is defined so that LCP[i] = LCE(SA[i], SA[i – 1]) for i ∈ [2 . . n] and LCP[1] = 0. We say that the value LCP[i] is reducible if BWT[i] = BWT[i – 1] and irreducible otherwise (including when i = 1). There are exactly r irreducible LCP values.
We say that a nonempty fragment T[i . . i + ℓ) is a previous factor if it has an earlier occurrence in T, that is, LCE (i, i') ≥ ℓ holds for some i' ∈ [1 . . i). The LZ77 parsing of T is a factorization T = F1 ··· Ff into nonempty phrases such that the jth phrase Fj is the longest previous factor starting at position 1 + |F1 ··· Fj-1|; if no previous factor starts there, then Fj consists of a single character. In the underlying LZ77 representation, every phrase Fj = T [i . . i+ℓ) that is a previous factor is encoded as (i', ℓ), where i' ∈ [1 . . i) satisfies LCE(i, i') ≥ ℓ (and is chosen arbitrarily in case of multiple possibilities); if Fj = T[i] is not a previous factor, then it is encoded as (T[i], 0). We denote the number of phrases in the LZ77 parsing by z. For example, the text bbabaababababaababa$
of Table 1 has LZ77 factorization b · b · a · ba · aba · bababa · ababa · $
with z = 8 phrases, and its LZ77 representation is (b, 0), (1, 1), (a, 0), (2, 2), (3, 3), (7, 6), (10, 5), ($, 0).
The trie of a finite set S ⊆ Σ* is an edge-labeled rooted tree with a node vX for every string X that is a prefix of a string S ∈ S. The trie is rooted at vε and the parent of each node vX with X ≠ ε is vX[1..|X|). In this case, the edge from vX[1..|X|) to vX is labeled with X[|X|]. See Figure 1 for the trie of {$aba, aaba, abaa, abab, abb$, b$ab, baab, baba, babb, bb$a}
.
Figure 1. The trie of the reversed length-4 substrings of T∞ for T = bbabaababababaababa$
of Table 1. Light edges are thin and dotted.
3.1. Basic upper bound
To illustrate the main idea of our proof technique, we first prove the upper bound in its simplest form r = (z log2 n). The following lemma stands at the heart of our proof.
LEMMA 3.1. For every ℓ ∈ [1 . . n], the number of irreducible LCP values in [ℓ . . 2ℓ) is (z log n).
PROOF. Denote Sm = {S ∈ Σm: S is a substring of T∞} for m ≥ 1. Observe that |Sm| ≤ mz because every length-m substring of T∞ has an occurrence crossing or beginning at a phrase boundary of the LZ77 parsing of T. This includes substrings overlapping two copies of T, which cross the boundary between the last and the first phrase.
The idea of the proof is as follows: With each irreducible value LCP[i] ∈ [ℓ . . 2ℓ), we associate a cost of ℓ units, which are charged to individual characters of strings in S3ℓ. We then show that each of the strings in S3ℓ is charged at most 2 log n times. The number of irreducible LCP values in [ℓ . . 2ℓ) equals times the total cost, which is at most
To devise the announced assignment of cost to characters of strings in S3ℓ, consider the trie of all reversed strings in Sℓ (see Figure 1 for example).
Let LCP[i] ∈ [ℓ . . 2ℓ) be an irreducible LCP value; note that i > 1 due to LCP[i] ≥ ℓ > 0. Let j0 = SA[i – 1] and j1 = SA[i] so that LCP[i] = LCE(j0, j1). Because LCP[i] is irreducible, we have T∞ [j0 – 1] = BWT[i – 1] ≠ BWT[i] = T∞ [j1 – 1]. For k ∈[1 .. ℓ], the kth unit of the cost associated with LCP[i] is charged to the kth character (T∞[jt – 1]) of the string (T∞[jt – k .. jt – k + 3ℓ)∈S3ℓ, where t ∈ {0, 1} is such that the subtree of rooted at contains fewer leaves than the subtree rooted at (we choose t = 0 in case of ties); see Figure 2 for an illustration.
Figure 2. Proof of Lemma 3.1 on T from Table 1 for ℓ = 4, i = 11, and k = 2. Strings T∞ [jt – k . . jt – k + 3ℓ) are highlighted. The subtree of rooted in vbaa is smaller than in vbab (see Figure 1), and hence we charge the second symbol of T∞ [j1 – k . . j1 – k + 3ℓ), that is, t = 1.
Note that at most log n characters of each S ∈ S3ℓ can be charged during the above procedure: Whenever S[k], with k ∈ [1 .. ℓ], is charged, the subtree of rooted at has at least twice as many leaves as the subtree rooted at , and this can happen for at most log|Sℓ| ≤ log n nodes on the path from to the root of .
It remains to show that, for every S ∈ S3ℓ, a single position S[k], with k ∈[1 .. ℓ], can be charged at most twice. For this, observe that the characters charged for a single irreducible value LCP[i] are at different positions (of strings in S3ℓ). Hence, to analyze the total charge assigned to S[k], we only need to bound the number of possible candidate positions i. Let [b . . e] be the set of indices i' such that T∞ [SA[i']..] starts with S[k + 1 .. 3ℓ]. In the above procedure, if a character S[k] is charged a unit of cost corresponding to LCP[i], then S[k + 1 .. 3ℓ] is a prefix of either T∞[SA[i−1]..]=T∞[j0 ..] or T∞[SA[i]..]=T∞[j1 ..]. Hence, {i−1,i}∩[b..e]≠∅. At the same time, LCE(SA[i –1], SA[i]) < 2ℓ, and all strings T∞[SA[i']..] with i' ∈[b .. e] share a common prefix S[k + 1 .. 3ℓ] of length 3ℓ – k ≥ 2ℓ. Thus, i = b or i = e + 1. ⅈ
THEOREM 3.2. All length-n strings satisfy r = (z log2 n).
PROOF. Recall that r is the total number of irreducible LCP values. Thus, the claim follows by applying Lemma 3.1 for ℓi = 2i, with i ∈[0 .. ⌊log n⌋] , and observing that the number of LCP values 0 is exactly σ≤z.
3.2. Tighter upper bound
To obtain a tighter bound, we refine the ideas from Section 3.1, starting with a counterpart of Lemma 3.1.
LEMMA 3.3. For every ℓ ∈[1 .. n], the number of irreducible LCP values in [ℓ .. 2ℓ) is (z log z).
PROOF. The proof follows closely that of Lemma 3.1. However, with each irreducible value LCP[i] ∈ [ℓ .. 2ℓ), we associate cost instead of ℓ. We then show that each of the strings in S3ℓ is charged at most 2 · (3 + log z) times (rather than 2 log n times). Then, the number of irreducible LCP values in the range [ℓ .. 2ℓ) does not exceed times the total cost, which is bounded by
Recall the trie of all reversed strings in Sℓ. For a node v of , by size(v), we denote the number of leaves in the subtree of rooted in v. An edge connecting v ≠root() to its parent in is called light if v has a sibling v' satisfying size(v') ≥ size(v) (see Figure 1). In the proof of Lemma 3.1, we observed that the characters S[k] of S ∈ S3ℓ that can be charged correspond to light edges on the path from the root of to the leaf : Whenever S[k], with k ∈[1 .. ℓ], is charged, the edge connecting to its parent is light. We then noted that there are at most log | Sℓ | ≤ log n light edges on each root-to-leaf path in . Here, we charge the characters of strings in S3ℓ in the same way as in Lemma 3.1, but only for units . This implies that only characters S[k] of S ∈ S3ℓ with are charged. It remains to show that any root-to-leaf path in contains at most 3 + log z light edges between a node at depth at least and its child.
Consider a light edge from a node v to its parent u at depth at least . Let v' be a sibling of v satisfying size(v') ≥ size(v), and let Sv, Sv' be the labels of the paths from the root to v and v', respectively. These labels differ on the last position only so, by Fact 2.1, they cannot be both periodic. Let ṽ ∈ {v, v'} be such that Sṽ is not periodic, and let .
Consider the set S of length-ℓ strings corresponding to the leaves in the subtree of rooted at ṽ (i.e., the labels of the root-to-leaf paths passing through ṽ). Define and note that , because is the trie of reversed strings from Sℓ. Let denote the ending positions of the leftmost occurrences in T∞[1 . .) of strings in . By definition, we have an occurrence of ending in T∞ at every position ei with . Now, implies that for every (otherwise, the two close occurrences of would yield . Consequently, at least length-ℓ substrings of T∞[1 . .) have disjoint leftmost occurrences. Because each leftmost occurrence crosses or begins at a phrase boundary of the LZ77 parsing of T, we conclude that , and therefore .
The reasoning above shows that once a root-to-leaf path encounters a light edge connecting a node u at depth at least to its child v, we have size(v) ≤ 4z. The number of the remaining light edges on the path is at most log(size(v)) ≤ 2 + log z by the standard bound applied to the subtree of rooted at v.
THEOREM 3.4. Every string T of length n satisfies .
PROOF. To obtain tighter bounds on the number of irreducible LCP values in [ℓ . . 2ℓ), we consider three cases:
ℓ ≤ log z. We repeat the proof of Lemma 3.1, except that we observe that the number of light edges on each root-to-leaf path in is bounded by ℓ. Thus, the number of irreducible LCP values in [ℓ . . 2ℓ) is (z ℓ).
. We use the bound (z log z) of Lemma 3.3. . We repeat the proof of Lemma 3.3, except that we observe that |S3ℓ| ≤ n. Thus, the number of irreducible LCP values in .
The above upper bounds, applied for every ℓ = 2i with i ∈ [0 . . ⌊log n⌋], yield
3.3. Tight bounds in terms of δ
Let denote the (cyclic) substring complexity of T.13 Note that letting is equivalent because |Sm| ≤ n holds for m ≥ 1, which implies for m ≥ n. We start by noting that δ ≤ z because |Sm| ≤ mz holds for every m ≥ 1, as observed in the proof of Lemma 3.1. Furthermore, |Sm| ≤ mδ holds by definition of δ, so δ can replace z in the proof of Lemma 3.1.
To adapt the proof Lemma 3.3, we need to generalize the observation that at most z substrings from Sℓ may have disjoint leftmost occurrences in T∞[1 . .). This observation is easy because the LZ77 parsing naturally yields a set of z positions (phrase boundaries) in T. The substring complexity δ does not provide such structure, but as the lemma here implies, we can replace z by 3δ in the aforementioned observation. The proof of Lemma 3.5 is a straightforward modification of the argument used in Kociumaka et al.13 (Lemma 6). For completeness, we write down the full reasoning, with technical details tailored to our notation (e.g., Sℓ defined in terms of T∞ rather than T).
LEMMA 3.5 (based on Kociumaka et al.,13 LEMMA 6). For any integer ℓ ≥ 1, the total number of positions in T∞[1 . .) covered by the leftmost occurrences of strings from Sℓ is at most 3δℓ.
PROOF. Let C denote the set of positions in T∞[1 . .) covered by the leftmost occurrences of strings from Sℓ, and let C' = C\[1 .. ℓ). For any i ∈ C' denote S =T∞[i−ℓ+1..i+ℓ], and let S ={Si:i∈C'}⊆S2ℓ. We will show that |S| = |C'|. Let i ∈ C'. First, observe that, due to i ≥ ℓ, the fragment Si is entirely contained in T∞[1 . .). Furthermore, by definition, Si. contains the leftmost occurrence of some S ∈ Sℓ. Thus, this occurrence of Si in T∞[1 . .) must also be the leftmost one in T∞[1 . .). Consequently, the substrings Si for i ∈ C' are distinct.
We have thus shown that |C'| = |S|≤|S2ℓ. Because |S2ℓ| ≤ 2δℓ holds by definition of δ, we obtain |C|<|C'|+ ℓ ≤ |S2ℓ|+ℓ≤(2δ+1)ℓ≤3δℓ. ⅈ
LEMMA 3.6. For every ℓ ∈[1 .. n], the number of irreducible LCP values in [ℓ .. 2ℓ) is (δ log δ).
PROOF. Compared to the proof of Lemma 3.3, we use the bound |S3ℓ| ≤ 3ℓδ instead of |S3ℓ| ≤ 3ℓz. The only other modification required is that, for every light edge connecting a node u of at depth at least to its child v, we need to prove size(v) = (δ).
Let be defined as in the proof of Lemma 3.3. Recall that there are at least strings in Sℓ with disjoint leftmost occurrences in T∞[1 . .). By Lemma 3.5, there are at most 3δ such substrings. Thus, . ⅈ
By replacing the thresholds log z and with log δ and , respectively, in the proof of Theorem 3.4, we immediately obtain a bound in terms of δ.
THEOREM 3.7. Every string T of length n satisfies .
The following construction, analyzed in the full version of this paper, provides tight examples with substring complexity δ covering the whole spectrum between (1) and . For ℓ ≥ 1 and x ∈[0 .. 2ℓ), we set binℓ (x) ∈ {0, 1}ℓ to be the binary representation of x.
LEMMA 3.8. For all integers ℓ ≥ 2 and K ≥ 1, the length n, the substring complexity δ, and the number of runs r in the BWT of a string Tl,K ∈ {$, 0, 1, 2}+, defined with
satisfy n = Θ(2K+ℓℓ, δ=Θ(2ℓ), and r = Ω(2ℓ ℓK).
If , on the other hand, then a trivial upper bound r = (n) is tighter than that of Theorem 3.7. In this case, the following construction, also analyzed in the full version of this paper, provides tight examples.
LEMMA 3.9. For all integers ℓ ≥ 2 and Δ∈Ω(2ℓ)∩(2ℓℓ), the length n, the substring complexity δ, and the number of runs r in the BWT of a string , defined with
satisfy n =Θ(2ℓℓ), δ=Θ(Δ), and r = Ω(2ℓℓ).
Overall, combining Lemmas 3.8 and 3.9, we obtain the following tight lower bound for δ ranging from (1) to Ω(n).
THEOREM 3.10. For every N ≥ 1 and Δ ∈ [1 .. N], there exists a string T whose length n, substring complexity δ, and the number of runs r in the BWT satisfy n=Θ(N), δ=Θ(Δ), and .
3.4. Further combinatorial bounds
By combining Theorem 3.7 with known properties of the substring complexity δ, we obtain the first bound relating the number of BWT runs in the string and its reverse. No such bounds (even polynomial in r log n) were known before.
COROLLARY 3.11. If r and denote the number of runs in the BWT of a length-n text and its reverse, respectively, then .
PROOF. Because the value of δ is the same for the text and its reverse, Theorem 3.7 yields . Combining the results of Kempa and Prezza12 (Theorem 3.9) and Kociumaka et al.13 (Lemma 2) gives δ ≤ r. Consequently, we obtain .
In the full version of this paper, we also characterize the sum of all irreducible LCP values:
THEOREM 3.12. For every string of length n, the sum rΣ of all irreducible LCP values satisfies rΣ = (n log δ).
Due to δ ≤ r, the presented upper bound is always (asymptotically) at least as strong as the previous bound rσ ≤ n log r by Kärkkäinen et al.9 Furthermore, it can be strictly stronger because log δ = o(log r) is possible when δ = logo(1) n. In the full version of this paper, we show that the strings of Lemmas 3.8 and 3.9 yield a matching lower bound:
THEOREM 3.13. For every N ≥ 1 and Δ ∈ [1 .. N], there exists a string T whose length n, substring complexity δ, and sum rΣ of irreducible LCP values satisfy n=Θ(N), δ=Θ(Δ), and rΣ = Θ(n log δ).
In this section, we outline our algorithm that, given the LZ77 parsing of a text T ∈ Σn, computes its run-length compressed BWT in (z polylog n) time. We start with an overview that explains the key concepts. Next, we present two new data structures utilized in our algorithm: the compressed string synchronizing set (Section 4.1) and the compressed wavelet tree (Section 4.2). The sketch of the final algorithm is then presented in Section 4.3.
For any substring Y of T∞, we define
We say that a substring Y of T∞ is left-maximal if there exist distinct symbols a, b ∈ Σ such that the strings aY and bY are also substrings of T∞. The following definition, assuming Σ∩ℕ=∅, plays a key role in our construction.
DEFINITION 4.1 (BWT MODULO ℓ). Let T ∈ Σn, ℓ ≥ 1 be an integer, and Yi= T∞[SA[i]..SA[i]+ℓ) for i ∈[1..n]. We define the string BWTℓ ∈(Σ∪ℕ)n, called the BWT modulo ℓ (of T), as follows. For i ∈ [1 .. n],
The algorithm runs in k = ⌈log n⌉ rounds. For q ∈ [0 .. k), the input to the qth round is RL(BWT2q) and the output is RL(BWT2q+1). At the end of the algorithm, we have RL(BWT2k) = RL(BWT), because X ∈ S2k is never left-maximal for 2k ≥ n.
Informally, in round q, we are given a (run-length compressed) subsequence of BWT that can be determined based on sorting the suffixes only up to their prefixes of length 2q. Notice that BWTℓ[b .. e] ∈ Σ* implies BWTℓ+1 [b .. e] ∈ Σ* (because a prefix of a left-maximal substring is left-maximal). Hence, these subsequences need not be modified until the end of the algorithm (except possibly merging their runs with adjacent runs). For the remaining positions, BWTℓ identifies the (leftmost occurrences of) substrings to be inspected in the qth round with the aim of replacing their corresponding runs in BWTℓ with previously unknown BWT symbols (as defined in BWT2ℓ).
We call a block BWT[b . . e] uniform if all symbols in BWT[b . . e] are equal and nonuniform otherwise. The following lemma ensures the feasibility of the above construction.
LEMMA 4.2. For any ℓ ≥ 1, it holds |RL(BWTℓ)| < 2r.
PROOF. Denote RL(BWTℓ) = ((c1, λ1, …,(ch,λh)), letting λ0 = 0. By definition of BWTℓ, if ci ∈ ℕ, then the block BWT(λi-1 .. λi] is nonuniform. Thus, there are at most r – 1 runs of symbols from ℕ in BWTℓ.
On the other hand, ci ∈ Σ and cj ∈ Σ, with i < j, cannot both belong to the same run in BWT. If this was true, then either ci+1 ∈ Σ (which implies ci+1 = ci, contradicting the definition of RL(BWTℓ) ), or ci+1 ∈ ℕ, which is impossible because BWT(λi .. λi+1] would then be nonuniform. Thus, there are at most r runs of symbols from Σ in BWTℓ. ⅈ
4.1. Compressed string synchronizing sets
Our algorithm builds on the notion of string synchronizing sets, recently introduced by Kempa and Kociumaka.11 Synchronizing sets are one of the most powerful techniques for sampling suffixes. In the uncompressed setting, they are the key in obtaining time-optimal solutions to multiple problems, such as a state-of-the-art BWT construction algorithm for texts over small alphabets.11 In this section, we introduce a notion of compressed string synchronizing sets. Our construction is the first implementation of synchronizing sets in the compressed setting and thus of independent interest.
We start with the definition of basic synchronizing sets.
DEFINITION 4.3 (τ-SYNCHRONIZING SET11). Let T ∈ Σn be a string and let be a parameter. A set S⊆[1..n−2τ+1] is called a τ-synchronizing set of T if it satisfies the following consistency and density conditions:
In most applications, we want to minimize |S|. However, the density condition imposes a lower bound for strings of length n ≥ 3τ – 1 that do not contain substrings of length 3τ – 1 that are periodic with period at most . Therefore, we cannot hope to achieve an upper bound improving in the worst case upon the following one.
THEOREM 4.4 (Kempa and Kociumaka11). For any string T of length n and parameter , there exists a τ-synchronizing set S of size . Moreover, such S can be (deterministically) constructed in (n) time.
Storing S for compressible strings presents the following challenge: Although is sometimes possible, it is not implied by z ≪ n. For example, as noted by Bille et al.,3 Thue–Morse strings satisfy z = (log n) yet they contain no periodic substring X with , and thus their synchronizing sets satisfy . This prevents us from keeping the plain representation of S when .
We therefore exploit a different property of compressible strings: That all their substrings Y satisfy lpos(Y) , where ej is the last position of the jth phrase in the LZ77 parsing of T. By consistency of S, it suffices to store . To decide whether i ∈ S, we then locate i' = lpos(T[i .. i + 2τ) and check if . This motivates the following (more general) definition.
DEFINITION 4.5 (COMPRESSED τ-SYNCHRONIZING SET). Let S be a τ-synchronizing set of string T[1 . . n] for some , and, for every j ∈[1 .. z], let ej denote the last position of the jth phrase in the LZ77 parsing of T. For k ∈ ℤ≥2, we define the compressed representation of S as
In the full version of this paper, we prove that every text T has a synchronizing set S with a small compressed representation, and we show how to efficiently compute such S from the LZ77 parsing of T.
THEOREM 4.6. There exists a Las-Vegas randomized algorithm that, given the LZ77 parsing of a string T ∈ Σn, a parameter , and a constant k ∈ ℤ≥2, constructs in (z log5 n) time a compressed representation compk(S) of a τ-synchronizing set S of T satisfying |compk(S)|≤ 72kz.
4.2. Compressed wavelet trees
Along with string synchronizing sets, wavelet trees,8 originally invented for text indexing, play a central role in our algorithm. Unlike virtually all prior applications of wavelet trees, ours uses a sequence of very long strings (up to Θ(n) symbols). This approach is feasible because all strings are substrings of the text, which is stored in the LZ77 representation. In this section, we describe this novel variant of wavelet trees, dubbed here compressed wavelet trees. In particular, we prove an upper bound on their size, describe an efficient construction from the LZ77-compressed text, and show how to augment them to support some fundamental queries.
Let Σ be an alphabet of size σ ≥ 1. Consider a string W[1 . . m] over the alphabet Σℓ so that W is a sequence of m ≥ 0 strings of length ℓ ≥ 0 over the alphabet Σ. The wavelet tree of W is the trie of Σ with each node vX associated to a string BX ∈ Σ* defined here based on W. We let denote the node-set of .
With each node vX ∈ V(), we associate an increasing sequence IX [1 . . h] of primary indices such that
Based on IX, we define BX ∈ Σ* so that, for i ∈ [1 . . h],
if |X| < ℓ, and BX = ε if |X| = ℓ. In other words, BX is a string containing the symbol at position |X| + 1 for each string of W that is prefixed by X. Importantly, the symbols in BX occur in the same order as these strings occur in W.
As typically done in the applications of wavelet trees, we only explicitly store the strings BX. The values of primary indices IX are retrieved using additional data structures, based on the following observation.
LEMMA 4.7 (Grossi et al.8). Let X ∈ Σd, where d ∈ [0 . . ℓ). For every c ∈ Σ and j ∈ [1 . . |IXc|], we have IXc [j] = IX [i], where BX[i] is the jth occurrence of c in BX.
We define the compressed wavelet tree c of W as the wavelet tree of W in which all strings BX have been run-length compressed and, except , all nodes vX satisfying |RL(BX)| ≤ 1 have been removed (the unary paths are collapsed into single edges with their labels concatenated). The shape and edge labels of the resulting tree are identical to the compacted trie of {W[1], …, W[m]}.
We store the edge labels of c as pointers to substrings in W. We assume that each edge of c and each element of RL(BX) can be encoded in (1) space. Because |RL(BY)| ≥ 1 holds for every internal node vY ∈ V(c) and, unless |V(c)| = 1, each leaf vZ in c can be injectively mapped to an element of RL(BZ') for the parent vZ' of vZ, the tree c uses space.
THEOREM 4.8. If c is the compressed wavelet tree of a sequence W of length-l strings, then .
PROOF. Let m = |W|, k = |RL(W)| ≤ m, and k' = |{W[i] : i ∈ [1 . . m]}|. Due to |V(Tc)|≤ 1+2k'=(1 + k), we can focus on nodes vx∈V(Tc) such that |RL(BX)| ≥ 2.
The proof resembles that of Lemma 3.1. With each X ∈ Σ* such that |RL(BX)| ≥ 2, we associate |RL(BX)| – 1 units of cost and charge them to individual elements of W. We then show that each run in RL(W) is in total charged at most 2 log k' units of cost. Consequently,
Consider X ∈ Σd with |RL(BX)| ≥ 2; note that d < ℓ. Let RL(BX) = ((c1, …, (ch, λh)). Observe that if we let p0 = IX[λi] and p1 = IX[λi + 1] for some i ∈ [1 . . h), then W[p0][d + 1] = ci ≠ ci + 1 = W[p1][d + 1]. Moreover, BX [λi] ≠ BX [λi + 1] implies W[p0 + 1] ≠ W[p0] and W[p1 – 1] ≠ W[p1]. The ith unit of cost is charged to W[pt], where t ∈ {0, 1} is chosen depending on the sizes of subtrees of c rooted at the children of vX, so that the subtree containing vw[pt] has at most as many leaves as the subtree containing vw[p1−t].
Now, consider a run W[b . . b'] = Yδ in RL(W). For a single depth d, the run could be charged at most twice, with at most one unit assigned to W[b] due to p1 = b and at most one unit assigned to W[b'] due to p0 = b', both for X = Y[1 . . d]. Moreover, note that the subtree size on the path from vY to the root vε of c doubles for every depth d for which the run was charged. Thus, the total charge of the run is at most 2 log k' units. ⅈ
The key operation that we want to support on c is to compute the value IX [q] given q ∈ [1 . . |IX|] and a pointer to vx ∈ V(Tc). Let W[1 . . m] be a sequence of substrings of T∞ of common length ℓ. Notice that if we have access to T, then the sequence W can be encoded in (1 + |RL(W)|) space. Namely, it suffices to store the length ℓ and the sequence RL((lpos(W[i])) i ∈ [1 .. m]). In the full version of this paper, we show that, given such encoding of W and the LZ77 parsing of T, the compressed wavelet tree of W supporting fast primary index queries can be constructed efficiently.
THEOREM 4.9. Given the LZ77 representation of T[1 . . n] and a sequence W[1 .. m] of m ≤ n same-length substrings of T∞, represented as RL((lpos(W[i]))i ∈[1 .. m]), the compressed wavelet tree of W, supporting primary index queries in time (log4 n), can be constructed in time ((z+|RL(W)|) log2 n).
4.3. The algorithm
We are now ready to sketch the procedure constructing the sequences RL(BWT2q), where q ∈ [0 . . ⌈log n⌉].
Let q ∈ [4 . . ⌈log n⌉) (values q ∈ [0 . . 3] are handled separately) and let ℓ = 2q. We show how to compute RL(BWT2ℓ) given RL(BWTℓ) and the LZ77 parsing of T. The main idea of the algorithm is as follows.
Let S be a τ-synchronizing set of T, where . As noted earlier, BWTℓ [j] ∈ Σ implies BWT2ℓ [j] ∈ Σ. Let BWTℓ [y . . y'] ∈ ℕ+ be a run in BWTℓ. By definition of BWTℓ, the suffixes of T∞ starting at positions i ∈ SA[y . . y'] share a common prefix of length ℓ ≥ 3τ. Thus, assuming that S∩[i..i+τ) = ⌽ holds for all i ∈ SA[y . . y'] (the periodic case is handled separately), by the consistency of S, all text positions i ∈ SA[y . . y'] share a common offset Δ with i + Δ = min (S∩[i..i+τ)). This lets us deduce the order of length-21 prefixes T[i . . i+2ℓ) based on the order of strings T[i + Δ . . i + 2ℓ) starting at synchronizing positions. For this, from the sorted list of fragments T[s . . s + 2ℓ) across s ∈ S, we extract, using a wavelet tree, those preceded by T[i . . i + Δ) (a prefix common to T∞[i . .) for i ∈ SA[y . . y']). Importantly, the synchronizing positions s sharing T[s – τ . . s + 2ℓ) can be processed together; hence, by Theorem 4.6, applied for k = 7 so that 2ℓ ≤ kτ, it suffices to use (z) distinct substrings.
In the full version of the paper, we provide the details of the above construction and obtain the following result.
THEOREM 4.10. There exists a Las-Vegas randomized algorithm that, given the LZ77 parsing of a text T of length n, computes its run-length compressed Burrows-Wheeler transform in time ((r + z) log6 n) = (z log8 n).
This work was performed when the first author was a post-doctoral scholar at the University of California, Berkeley (supported by NSF grants no. 1652303 and 1934846, and an Alfred P. Sloan Fellowship grant), whereas the second author was a postdoctoral scholar at Bar-Ilan University, Israel (supported by ISF grants no. 1278/16 and 1926/19, a BSF grant no. 2018364, and an ERC grant MPM under the EU's Horizon 2020 Research and Innovation Programme, agreement no. 683064).
1. Amir, A., Iliopoulos, C.S., Radoszewski, J. Two strings at Hamming distance 1 cannot be both quasiperiodic. Inf. Process. Lett. 128, (2017), 54–57.
2. Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M. Composite repetition-aware data structures. In CPM (2015), Springer, Cham, Switzerland, 26–39.
3. Bille, P., Gagie, T., Gørtz, I.L., Prezza, N. A separation between RLSLPs and LZ77. J. Discrete Algorithms 50, (2018), 36–39.
4. Burrows, M., Wheeler, D.J. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation. Palo Alto, CA, 1994.
5. Gagie, T., Navarro, G., Prezza, N. Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. arXiv 1809.02792 (2018).
6. Gagie, T., Navarro, G., Prezza, N. On the approximation ratio of Lempel–Ziv parsing. In LATIN (2018), Springer, Cham, Switzerland, 490–503.
7. Gagie, T., Navarro, G., Prezza, N. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67, 1 (2020), 1–54.
8. Grossi, R., Gupta, A., Vitter, J.S. High-order entropy-compressed text indexes. In SODA (2003), ACM/SIAM, Philadelphia, PA, USA, 841–850.
9. Kärkkäinen, J., Kempa, D., Piătkowski, M. Tighter bounds for the sum of irreducible LCP values. Theor. Comput. Sci. 656, (2016), 265–278.
10. Kempa, D. Optimal construction of compressed indexes for highly repetitive texts. In SODA (2019), SIAM, Philadelphia, PA, USA, 1344–1357.
11. Kempa, D., Kociumaka, T. String synchronizing sets: Sublinear-time BWT construction and optimal LCE data structure. In STOC (2019), ACM, New York, NY, USA, 756–767.
12. Kempa, D., Prezza, N. At the roots of dictionary compression: String attractors. In STOC (2018), ACM, New York, NY, USA, 827–840.
13. Kociumaka, T., Navarro, G., Prezza, N. Towards a definitive measure of repetitiveness. In LATIN (2020), Springer, Cham, Switzerland, 207–219.
14. Kreft, S., Navarro, G. On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483 (2013), 115–133.
15. Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, Cambridge, UK, 2015.
16. Manzini, G. An analysis of the Burrows-Wheeler transform. J. ACM 48, 3 (2001), 407–430.
17. Navarro, G. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Comput. Surv. 54, 2 (2021), 1–31.
18. Navarro, G. Indexing highly repetitive string collections, part II: Compressed indexes. ACM Comput. Surv. 54, 2 (2021), 1–32.
19. Ohlebusch, E. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, Bremen, Germany, 2013.
20. Ohno, T., Sakai, K., Takabatake, Y., Tomohiro, I, Sakamoto, H. A faster implementation of online RLBWT and its application to LZ77 parsing. J. Discrete Algorithms, (2018), 52-53:18–28.
21. Pevsner, J. Bioinformatics and Functional Genomics, 3rd edn. Wiley-Blackwell, Chichester, UK, 2015.
22. Policriti, A., Prezza, N. From LZ77 to the run-length encoded Burrows-Wheeler transform, and back. In CPM (2017), Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 17:1–17:10.
23. Policriti, A., Prezza, N. LZ77 computation based on the run-length encoded BWT. Algorithmica 80, 7 (2018), 1986–2011.
24. Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G. Run-length compressed indexes are superior for highly repetitive sequence collections. In SPIRE (2008), Springer, Berlin, Heidelberg, Germany, 164–175.
25. Ziv, J., Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3 (1977), 337–343.
a. The choice for LZ77 as a representative in this class follows from the fact that most other methods are NP-hard to optimize, whereas LZ77 admits a simple linear-time compression.
b. For integers i, j ∈ ℤ, we denote [i . . j] = {k ∈ ℤ : i ≤ k ≤ j}, [i . . j) = {k ∈ ℤ : i ≤ k < j}, and (i . . j] = {k ∈ ℤ : i < k < j}.
The original version of this paper was presented at FOCS 2020 and is available at https://arxiv.org/abs/1910.10631
To view the accompanying Technical Perspective, visit doi.acm.org/10.1145/3531443
The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.
No entries found