Flash-based solid-state drives (SSDs) are a key component in most computer systems, thanks to their ability to support parallel I/O at sub-millisecond latency and consistently high throughput. At the same time, due to the limitations of the flash media, they perform writes out-of-place, often incurring a high internal overhead which is referred to as write amplification. Minimizing this overhead has been the focus of numerous studies by the systems research community for more than two decades. The abundance of system-level optimizations for reducing SSD write amplification, which are typically based on experimental evaluation, stands in stark contrast to the lack of theoretical algorithmic results in this problem domain. To bridge this gap, we explore the problem of reducing write amplification from an algorithmic perspective, considering it in both offline and online settings. In the offline setting, we present a near-optimal algorithm. In the online setting, we first consider algorithms that have no prior knowledge about the input and show that in this case, the greedy algorithm is optimal. Then, we design an online algorithm that uses predictions about the input. We show that when predictions are relatively accurate, our algorithm significantly improves over the greedy algorithm. We complement our theoretical findings with an empirical evaluation of our algorithms, comparing them with the state-of-the-art scheme. The results confirm that our algorithms exhibit improved performance for a wide range of input traces.
Flash-based solid-state drives (SSDs) have gained a central role in the infrastructure of large-scale datacenters, as well as in commodity servers and personal devices. Unlike traditional hard disks, SSDs contain no moving mechanical parts, which allows them to provide much higher bandwidth and lower latencies, especially for random I/O. The main limitation of flash media is its inability to support update-in-place: after data has been written to a physical location, that location has to be erased (cleaned) before new data can be written to it. In particular, SSDs support read and write operations in the granularity of pages (typically sized 4KB–16KB), while erasures are performed on entire blocks, often containing hundreds of pages.
To address this limitation, writes are performed out-of-place: whenever a page is written, its existing copy is marked as invalid, and the new data is written to a clean location. The flash translation layer (FTL), which is part of the SSD firmware, maintains the mapping of logical page addresses to physical locations (slots). To facilitate out-of-place writes, the physical capacity of an SSD is larger than the logical capacity exported to the application. This "spare" capacity is referred to as the device's overprovisioning. When the clean pages are about to be exhausted, a garbage collection (GC) process is responsible for reclaiming space: it chooses a victim block for erasure, rewrites any valid pages still written on this block to new locations, and then erases the block.
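To make these mechanics concrete, the following is a minimal sketch of a page-mapped FTL with out-of-place writes and greedy garbage collection. All names (SimpleFTL, Block, and so on) are illustrative rather than taken from any real firmware or from the paper, and the sketch assumes the number of logical pages is smaller than the number of physical slots (i.e., that there is some overprovisioning).

```python
# A minimal sketch of out-of-place writes with a page-mapped FTL and a greedy
# garbage collector.  Illustrative only; assumes fewer logical pages than slots.
CLEAN, VALID, INVALID = "clean", "valid", "invalid"

class Block:
    def __init__(self, n):
        self.slots = [CLEAN] * n     # per-slot state
        self.pages = [None] * n      # logical page stored in each slot

class SimpleFTL:
    def __init__(self, num_blocks, pages_per_block):
        self.n = pages_per_block
        self.blocks = [Block(self.n) for _ in range(num_blocks)]
        self.map = {}                # logical page -> (block id, slot id)
        self.rewrites = 0            # internal rewrites caused by GC

    def _clean_slot(self):
        for b, blk in enumerate(self.blocks):
            for s, state in enumerate(blk.slots):
                if state == CLEAN:
                    return b, s
        return None

    def write(self, page):
        if page in self.map:                         # out-of-place update:
            ob, osl = self.map[page]                 # invalidate the old copy
            self.blocks[ob].slots[osl] = INVALID
        loc = self._clean_slot()
        if loc is None:                              # no clean slot left
            self._garbage_collect()
            loc = self._clean_slot()
        b, s = loc
        self.blocks[b].slots[s] = VALID
        self.blocks[b].pages[s] = page
        self.map[page] = (b, s)

    def _garbage_collect(self):
        # Greedy victim selection: the block with the fewest valid slots.
        vid = min(range(len(self.blocks)),
                  key=lambda i: self.blocks[i].slots.count(VALID))
        victim = self.blocks[vid]
        survivors = [victim.pages[s] for s in range(self.n)
                     if victim.slots[s] == VALID]
        self.blocks[vid] = Block(self.n)             # erase the victim block
        for p in survivors:                          # rewrite its valid pages
            del self.map[p]
            self.rewrites += 1
            self.write(p)
```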
Block erasures typically require several milliseconds, orders of magnitude longer than page reads or writes. In addition, repeated erasures gradually wear out the flash media. As the technology advances and the capacity of flash-based SSDs increases, their lifetime—maximum number of erasures per block—decreases. This limitation is crucial, especially in datacenter settings, where SSDs absorb a large portion of latency-critical I/O. Thus, the efficiency of an SSD management algorithm is usually measured by its write amplification (WA)—the ratio between the total number of page writes in the system, including internal page rewrites, and the number of writes issued in the input sequence.
Mechanisms to reduce write amplification have been studied extensively by the systems community. Most mechanisms are based on the greedy algorithm, which chooses the block with the minimal number of valid pages (the min-valid block) as the victim for erasure. Several mathematical models have been proposed for deriving the write amplification of the greedy algorithm as a function of overprovisioning, under various assumptions about the workload distribution.2 The optimality of the greedy algorithm under uniform page accesses was proven in Yang et al.,15 but it is known that its efficiency deteriorates when the workload distribution is skewed.
Garbage collection is typically conceived of as a process of selecting the optimal block to clean, whether based solely on occupancy or on predictions of future behavior. However, it seems that efficient garbage collection is also a matter of deciding where to place incoming data. Advanced techniques that separate frequently written (hot) pages from rarely written (cold) pages have been proposed. The key idea behind these techniques is that placing pages with similar write frequencies in the same block increases the likelihood that all pages within that block will be overwritten by writes with temporal proximity. Several methods for predicting page temperatures based on access context and history have proven useful in practice,7, 11 and the effectiveness of hot/cold data separation has further motivated the development and proliferation of advanced storage hardware and interfaces.
1.1. Our model
We assume that P logical pages are stored in S physical SSD slots, where pages and slots are of the same size. The slots are organized in B blocks, each consisting of n slots. The relation between the physical capacity and the logical capacity is expressed through a fixed spare factor, defined by α = (S − P)/S. Consider, for example, the SSD in Figure 1. In this toy example, the SSD consists of five blocks of four slots each, and stores ten logical pages. Thus, the spare factor is α = (20 − 10)/20 = 0.5.
Figure 1. Example SSD with five blocks and 10 logical pages. First, page p1 is written to a clean slot, and its previous location is invalidated. Then, b5 is cleaned, and pages p2, p3 are rewritten.
The most recent version of a page is considered valid, whereas all other versions are considered invalid. At every time t, each slot in the SSD is in exactly one of the following states: valid, invalid, or clean. A slot is called valid if it stores a valid version of a page. A valid slot that stores page p becomes invalid when an update request for p arrives. Following a clean operation of block b, each slot in b is rendered clean.
At every time t, a new update request for some page p arrives, and the new page must be written to a clean slot. When cleaning block b, each valid page stored in b must be rewritten to a clean slot. Thus, the SSD manager should make two types of decisions: (1) which block to clean and when, and (2) which slot to allocate to each page. In Figure 1, an update request for page p1 arrives, and the SSD manager writes it to a slot in b3 (note that p1 could also be written to a slot in b2 or b4). Then, the SSD manager decides to clean b5, and p2, p3 are rewritten to b4, b3, respectively.
Problem statement. In the SSD Management Problem, we are given an input in the form of a request sequence σ = (σ1, σ2, ..., σT) of length T. The objective is to minimize the number of rewrites during the service of σ. We evaluate the performance of an algorithm ALG given an input σ using the write amplification metric, which is defined by
WAALG(σ) = (T + Z)/T,
where Z is the number of rewrites performed by ALG during the service of σ.
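As a toy illustration of this metric, the following snippet reuses the SimpleFTL sketch from the introduction (so the numbers are only indicative) to replay a uniformly random update stream on the Figure 1 device and report WAALG(σ) = (T + Z)/T.

```python
# Toy usage of the SimpleFTL sketch above: replay a uniformly random update
# stream on the Figure 1 device (5 blocks of 4 slots, 10 logical pages).
import random

random.seed(0)
ftl = SimpleFTL(num_blocks=5, pages_per_block=4)
T = 10_000
for _ in range(T):
    ftl.write(random.randrange(10))          # uniform accesses to 10 pages

print("write amplification:", (T + ftl.rewrites) / T)   # (T + Z) / T
```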
1.2. Our contribution
When developing online algorithms for SSD management, a fundamental issue is what kind of "extra" information about the input sequence is useful for improving performance. A relatively recent study6 observes that pages should ideally be grouped in blocks according to their death times, that is, pages that are about to be accessed at nearby times in the future should be placed in the same block. A more recent study applies this principle in a heuristic FTL.3 We call this rule grouping by death time. When all the pages in a block become invalid at approximately the same time, write amplification can be minimized by choosing the block whose pages have all been invalidated as the victim block. In a sense, this is precisely what hot/cold data separation strives to achieve; however, hot/cold separation can be viewed as grouping by lifetime, while two pieces of data can have the same lifetime (i.e., hotness) but distant death times. While hot/cold data separation has become standard practice, grouping by death time is only now coming to the attention of the systems research community.
Death times can be predicted using machine-learning techniques3; however, despite the success of machine-learned systems across many application domains, they tend to exhibit errors when deployed. A natural question is then how equipping online algorithms with a machine-learned oracle can benefit SSD performance. In particular, we look for an algorithm that: (1) performs better as the oracle's error decreases; and (2) performs at least as well as the best online algorithm that has no knowledge of the future, independently of the oracle's performance on a particular instance.
This approach is very much in line with recent trends of applying machine-learned predictions to improve the performance of online algorithms.8, 9, 12 The work of Lykouris and Vassilvitskii8 is particularly relevant to our setting. They study the classic paging problem assuming the availability of a machine-learned oracle that, for each requested page, predicts the next time it will appear in the request sequence. They design a competitive online algorithm using these predictions and analyze its competitive factor as a function of the oracle's error. When predictions are reasonably accurate, their algorithm circumvents known lower bounds on the competitive factor of online paging.
In this work we study the SSD management problem from an algorithmic perspective, considering it in both offline and online settings. A key parameter used in our bounds is the spare factor, denoted by α. This factor expresses the relation between the physical capacity of the SSD and the logical capacity. We now present our main technical results.
Myopic algorithms. In Section 2, we study online algorithms having no prior knowledge about the request sequence, for example, the well-known greedy algorithm. We analyze this algorithm and prove that no deterministic algorithm can provide better performance in this setting.
Offline setting. In Section 4, we consider an offline setting in which the request sequence is known in advance. We propose a novel placement technique that uses information about the death times of the pages to improve the performance of the SSD. We apply this technique and design a near-optimal algorithm whose write amplification given any input is not greater than 1 + ε. This result highlights the fact that death times are very useful for minimizing write amplification. Moreover, this is the first time that an offline algorithm for SSD management has been analyzed, and from a systems perspective, having a near-optimal offline algorithm is very useful for benchmarking (similarly to the role that Belady's algorithm plays in paging).
We also consider the hardness of the offline SSD management problem. While it is not known whether this problem is solvable in polynomial time, we take a step in resolving this issue and link it to the problem of fragmented coloring on interval graphs, which has been previously studied in Diwan et al.5 Even though the complexity of the latter problem remains open, we extend previous results by providing an efficient algorithm for the special case in which the number of colors is fixed. Details can be found in the full version of the paper.
Online setting. In Section 5, we consider a middle ground setting in which we assume the availability of an oracle that predicts the death time of each accessed page. We design an online algorithm whose performance is given as a function of the prediction error η, and we prove a worst-case bound on its write amplification for any input σ.
Note that our algorithm exhibits a graceful degradation in performance as a function of the average prediction error. Furthermore, our algorithm is robust, that is, its performance is never worse than that guaranteed by the greedy algorithm, even when the prediction error is large. When predictions are perfect (η = 0), the write amplification is bounded by 1 + ε, similar to the offline setting. However, ε here has a larger value, since the online model is less informative.
Experimental evaluation. In Section 7, we empirically compare our online and offline algorithms to state-of-the-art practical SSD management with hot/cold data separation. Our results confirm that our algorithms exhibit improved performance for a wide range of input traces; the highest benefit is achieved for traces with modest or no skew, which are the worst-case inputs for algorithms with hot/cold data separation.
Summary. The abundance of system-level optimizations for reducing write amplification, which are usually based on experimental evaluation, stands in stark contrast to the lack of theoretical algorithmic results in this problem domain. We present the first theoretical analysis and systematic evaluation of algorithms that observe the rule of grouping by death time, laying the groundwork for the development of a new class of algorithms that the systems community has not yet considered. Our results may also motivate the theory community to address the algorithmic aspects of SSD-related design challenges.
We consider here myopic algorithms, which are online algorithms that have no access to information about future requests. Intuitively, in this setting, cleaning a block that minimizes the number of valid slots is advantageous since it requires fewer rewrites and frees more slots for reuse. In this section, we prove that this intuition is indeed correct.
The well-known greedy algorithm for SSD management works as follows: it waits until the SSD becomes full, and then cleans the min-valid block, that is, the block that minimizes the number of valid slots. We now provide a simple upper bound on the write amplification guaranteed by the greedy algorithm on any input and claim that no deterministic myopic algorithm can provide a better guarantee. The proofs for the following theorems can be found in the full version of this paper.
THEOREM 1. For any input σ, WAgreedy(σ) ≤ 1/α.
THEOREM 2. Let ALG be a deterministic myopic algorithm for SSD management. Then, for any ε > 0, there exists an input σ such that WAALG(σ) ≥ 1/α − ε.
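To see where the 1/α bound comes from (the formal proofs are in the full version), consider a back-of-the-envelope argument that is consistent with the model above. Whenever the device becomes full, at least αS of the S slots are invalid, so the min-valid block contains at least αS/B = αn invalid slots. Cleaning it therefore rewrites at most (1 − α)n valid pages and frees room for at least αn new user writes before the next cleaning is needed, so each period of αn user writes incurs at most (1 − α)n rewrites, giving WAgreedy(σ) ≤ (αn + (1 − α)n)/(αn) = 1/α.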
2.1. Randomized lower bound
Randomization is a common approach to circumvent an adversary that "deliberately" feeds a bad input to a deterministic algorithm. A natural question is whether randomization can be used to circumvent the lower bound presented in Theorem 2 against an adversary that is oblivious to the random choices made by the algorithm. Surprisingly enough, it turns out that the power of randomization in the context of myopic algorithms for SSD management is very limited, as follows.
Yao's principle16 states that the expected cost of a randomized algorithm on a worst-case input is lower bounded by the expected performance of the best deterministic algorithm for a given probability distribution over the input. This principle is a powerful tool for proving lower bounds on the performance of randomized algorithms.
In previous work, Desnoyers4 investigated the expected write amplification of several algorithms under the assumption that accesses are uniformly distributed among the pages. In particular, he showed that the write amplification of the greedy algorithm under such a distribution is approximately 1/(2α). This result is consistent with prior work that evaluated the performance of the greedy algorithm empirically.1 The optimality of the greedy algorithm under a uniform distribution assumption was proven in Yang et al.15 Hence, we have by Yao's principle that no randomized myopic algorithm can guarantee an expected write amplification that is less than approximately 1/(2α). This leaves only a small gap for improvement in the randomized setting.
The inherent limitation of myopic algorithms in general, and of the greedy algorithm in particular, stems from two main reasons. The first is that choosing a victim block according to the number of valid slots may not be an optimal heuristic. For example, it is reasonable to delay the cleaning of a block whose pages are about to be accessed in the near future, even if it contains a small number of valid slots. The second reason, on which we focus in this work, is that myopic algorithms cannot make informed decisions on where to write pages. From the point of view of a myopic algorithm, all pages are semantically identical. Obviously, this limitation can be exploited by an appropriate adversary. In this section, we propose the notion of death time as a measure that can distinguish between pages. Then, we demonstrate how grouping pages with similar death times in the same block can benefit the performance of the SSD. The ideas presented in this section form the basis for both the offline and online algorithms presented in the sequel.
Definition 1. The death time of page p at time t, denoted by y(p, t), is the time of the next access to p after time t.
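Exact death times can be computed for a recorded trace with a single backward pass; the helper below is our own sketch (reused in later examples), in which a request that is never followed by another access to the same page gets an infinite death time.

```python
def annotate_death_times(trace):
    """trace: list of page ids indexed by time t = 0, ..., T-1.
    Returns death[t] = the next time trace[t]'s page is accessed after t."""
    INF = float("inf")
    next_access = {}                  # page -> the next time it is accessed
    death = [INF] * len(trace)
    for t in range(len(trace) - 1, -1, -1):
        p = trace[t]
        death[t] = next_access.get(p, INF)
        next_access[p] = t
    return death

# Example: annotate_death_times(["p1", "p2", "p1"]) -> [2, inf, inf]
```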
When a block stores pages with different death times, there is a time window between the first and the last page invalidations within which the block contains both valid and invalid slots. We call such a time window a zombie window and a block in a zombie window a zombie block.6 In general, larger zombie windows lead to increased odds of a zombie block being chosen as a victim, increasing the number of rewrites.
Zombie windows can be reduced if pages with similar death times are placed in the same block. This rule is referred to as grouping by death time.6 However, algorithms relying on the death times of pages must be equipped with information about the future. This information can be exact (Section 4) or estimated by a prediction oracle (Section 5). We now introduce the basic principle of our death-time-aware placement technique.
Definition 2. Let R be a set of (page, time) tuples, let B be a set of blocks, and let f be a function that given a tuple (p, t) ∈ R, outputs the predicted death time of page p at time t. Consider a function A that maps each tuple (p, t) ∈ R to a block in B. We say that A allocates space according to f if for any (p1, t1), (p2, t2), (p3, t3) ∈ R such that f (p1, t1) < f(p2, t2) < f(p3, t3), if A(p1, t1) = A(p3, t3) = b then A(p2, t2) = b as well.
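A simple way to obtain an allocation that satisfies Definition 2, assuming the relevant requests and their predictions are at hand, is to sort the requests by predicted death time and fill the blocks contiguously. The helper below is a sketch under that assumption; its names are ours.

```python
def allocate_by_death_time(requests, predict, blocks, slots_per_block):
    """requests: list of (page, time) tuples; predict(page, time) -> predicted
    death time; blocks: list of block ids, each with slots_per_block free slots.
    Returns a mapping (page, time) -> block id that allocates space according
    to predict: each block receives a consecutive run of predicted death times."""
    ordered = sorted(requests, key=lambda pt: predict(*pt))
    return {pt: blocks[i // slots_per_block] for i, pt in enumerate(ordered)}
```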
Figure 2 demonstrates an allocation according to the (exact) death times. The first two rows describe the request sequence. Each entry in the third row contains the death time (DT) of the corresponding page. For example, p1 is written at time t = 1, and this version of p1 dies at time t = 3 when an update request for p1 arrives. In this simple scenario, the set B consists of three blocks, two slots each. The slots are allocated to the first six accessed pages according to their death times. For example, since (p1, 1) and (p4, 5) die first, they are mapped to the same block.
Figure 2. An allocation according to the exact death times. Each entry in the third row contains the death time (DT) of the corresponding page. Since (p1, 1) and (p4, 5) die first, they are mapped to the same block.
We finish this section by introducing the idea lying at the core of our algorithms. A set of blocks whose slots are allocated to pages according to their (predicted) death times is called a batch, and the total number of batches is denoted by Δ. Observe that when allocating slots according to the exact death times of the pages, each batch may contain at most one zombie block. Hence, Δ can function as a measure of the disorder induced by the page locations: we expect that the greater the value of Δ, the greater the number of zombie blocks. The idea is that we maintain low values of Δ by reordering all the pages when its value becomes too high. Since such a reordering requires many rewrites, we also ensure that this operation is done infrequently.
In this section, we address the SSD management problem in an offline setting in which the request sequence is fully known in advance. We design an algorithm that applies the placement technique introduced in Section 3. Then we prove that our algorithm performs a negligible number of rewrites (relative to the input length), highlighting both the importance of the placement policy (rather than the victim selection strategy) and the usefulness of death times for minimizing write amplification.
4.1. The offline algorithm
We now describe the proposed offline algorithm. We assume that the SSD is initially clean. In the initialization step, we fill the SSD with the first S accessed pages according to their death times. Now, we perform a sequence of iterations. In each iteration, we first choose the set Bvictim of completely invalid blocks as victims (that is, blocks containing only invalid slots), and clean them. Then, the subsequent accessed pages are written to Bvictim according to their death times. As soon as the device becomes full again, a new iteration begins.
Observe that in each iteration, a new allocation is calculated. Thus, after j iterations, the number of batches can be at most j (note that the number of batches can be smaller than the number of iterations if the pages of an entire batch have been invalidated in subsequent iterations). In order to keep Δ low, when it exceeds the predefined value α·B, a reordering step is invoked. In this step, all pages are reordered in blocks according to their death times and Δ is reset to 1. The iterations performed between two reordering steps form a phase. From now on, iteration j in phase i is simply referred to as iteration (i,j).
Figure 3 presents a schematic example of the first phase in the offline algorithm. All slots are initially clean. In the initialization step, the pages are written to the SSD according to their death times, such that pages that die first are written to the left-most block. Note that due to overprovisioning, some slots in the left blocks are already invalidated during initialization. When the device becomes full, the first phase begins. In each iteration, all completely invalid blocks are cleaned. Then, the next accessed pages are written to the clean slots, forming a new batch. As soon as Δ ≥ αB, a reordering step is invoked.
Figure 3. Schematic illustration of the first phase in the offline algorithm. Each iteration consists of a clean step and a write step. As soon as Δ ≥ 2.5, a reordering step is invoked.
Note that in general, reordering the pages according to their death times might require a huge number of rewrites. However, under specific circumstances, the reordering process can be done efficiently. In the full version of this paper, we propose a simple procedure that rewrites each page only once. A formal version of the algorithm is given in Algorithm 1.
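The following is a runnable simplification of this scheme (Algorithm 1 in the paper is the authoritative description; the bookkeeping, helper names, and the handling of the clean space left after a reordering are our own choices). It reuses the annotate_death_times helper sketched after Definition 1 and assumes the trace touches fewer distinct pages than there are physical slots.

```python
def offline_ssd(trace, B, n, alpha):
    """Replay the trace on a device with B blocks of n slots and spare factor
    alpha, returning the resulting write amplification (T + Z) / T."""
    death = annotate_death_times(trace)
    blocks = [{"used": 0, "valid": set()} for _ in range(B)]
    loc, dt = {}, {}          # page -> block of its valid copy, its death time
    rewrites, t = 0, 0

    def serve_batch(batch_blocks):
        # Serve the next requests into the given clean capacity, placed by
        # death time (earliest deaths go to the first block of the batch).
        nonlocal t
        room = sum(n - blocks[b]["used"] for b in batch_blocks)
        reqs = list(range(t, min(t + room, len(trace))))
        t += len(reqs)
        for i in sorted(reqs, key=lambda i: death[i]):
            p = trace[i]
            if p in loc:                                # invalidate the old copy
                blocks[loc[p]]["valid"].discard(p)
            b = next(b for b in batch_blocks if blocks[b]["used"] < n)
            blocks[b]["used"] += 1
            blocks[b]["valid"].add(p)
            loc[p], dt[p] = b, death[i]

    serve_batch(list(range(B)))                         # initialization step
    delta = 1                                           # number of batches
    while t < len(trace):
        victims = [b for b in range(B)
                   if blocks[b]["used"] == n and not blocks[b]["valid"]]
        for b in victims:                               # clean step: no rewrites
            blocks[b] = {"used": 0, "valid": set()}
        serve_batch(victims)                            # write step: one new batch
        delta += 1
        if delta >= alpha * B:                          # reordering step
            order = sorted(loc, key=lambda p: dt[p])
            rewrites += len(order)                      # one rewrite per valid page
            blocks = [{"used": 0, "valid": set()} for _ in range(B)]
            for i, p in enumerate(order):
                blocks[i // n]["used"] += 1
                blocks[i // n]["valid"].add(p)
                loc[p] = i // n
            serve_batch(list(range(B)))                 # refill the clean space
            delta = 1
    return (len(trace) + rewrites) / len(trace)
```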
THEOREM 3. For any input σ, WAOFF(σ) ≤ 1 + 2(1 − α)/(α(αB − 1)).
PROOF. We assume that in each reordering step each page is rewritten at most once, with a total of up to P rewrite operations. Since rewrites occur only in the reordering step, our analysis focuses on examining the frequency with which this step occurs.
Recall that when the device is full, the number of invalid slots in the whole device is exactly α · S. By allocating slots according to the exact death times of the pages, we ensure that each batch can contain at most one zombie block. Hence, the number of completely invalid blocks when the device is full is at least αB – Δ. This way, our condition for ending a phase guarantees that for any phase i, at least (αB – j) · n new requests are served during iteration (i, j). Thus, the total number Tph of requests served in each phase satisfies:
Tph ≥ ∑j (αB − j) · n = n · αB(αB − 1)/2,
where the sum ranges over the iterations j = 1, ..., αB of the phase.
We obtain that for any input σ (using P = (1 − α)S = (1 − α)Bn),
WAOFF(σ) ≤ 1 + P/Tph ≤ 1 + 2(1 − α)Bn/(n · αB(αB − 1)) = 1 + 2(1 − α)/(α(αB − 1)),
which completes the proof. □
In this section, we extend our discussion to online algorithms that have access to death-time predictions and present an algorithm based on our placement technique with improved performance.
5.1. Approach
We first discuss several challenges arising when trying to apply our placement technique in the online setting and present our approach for addressing them.
Challenge 1: In the online setting, information about the death times of the pages is not available.
Approach: We assume the availability of a prediction oracle, such that each request arrives with its predicted death time. Such an oracle can be derived from machine learning prediction techniques; however, since it might exhibit errors when deployed, the performance of the proposed algorithm is given as a function of the oracle's error. We emphasize that our algorithm is completely agnostic to the way this oracle works and treats it as a complete black box.
Challenge 2: Since predictions arrive together with requests, we do not have the predictions for the next accessed pages in advance. Therefore, an allocation based on their death times cannot be calculated.
Challenge 3: When prediction errors occur and zombie blocks must be chosen as victims, the pages they store should be reordered according to their predicted death times. However, reordering these pages might require an excessively large number of rewrites, harming the performance of the algorithm.
Approach: In order to address Challenges 2–3, we augment our model, assuming the availability of an auxiliary memory of size M. We assume that this memory is much faster than the SSD, allowing us to neglect implications concerning its management. In practice, such memory is typically attached to the SSD controller and is used for the FTL as well as for write buffering, although its size is limited.
We (logically) partition the memory into two zones: a buffer zone of size αM and a victim zone of size (1–α)M. When a zombie block is cleaned, the pages it stores are retrieved to the victim zone. When a new request for page p arrives, instead of writing p to the SSD immediately, it is first stored in the buffer zone. As soon as the buffer zone becomes full, an allocation is calculated, and then all pages stored in the memory are moved to the SSD according to their predicted death times.
5.2. The online algorithm
Our goal is to design an algorithm whose performance improves as the prediction error decreases. We achieve this objective by treating the oracle's output as the truth; this corresponds to a policy that allocates slots to pages according to their predicted death times. At the same time, we are interested in a robust algorithm, whose performance guarantees are comparable to those of the best online algorithm (i.e., the greedy algorithm), regardless of the performance of the oracle. In order to limit the effect of prediction errors, the predictions are used only in the placement policy. Specifically, we apply a greedy selection of min-valid blocks as victims, regardless of the predictions.
We now provide a short description of the proposed algorithm, whose formal description appears in Algorithm 2. Similarly to the offline case, our algorithm operates in phases. Each phase contains several iterations and terminates with a reordering of the pages on the entire SSD. Each iteration consists of: (1) a service process, in which incoming requests are served; (2) a garbage collection process, in which victim blocks are chosen and cleaned; and (3) a placement process, in which pages are moved from the memory to the SSD.
In the service process (line 4 in Algorithm 2), when an update request for page p arrives, we store the new version of p in the buffer zone. If the previous version of p was stored in the memory, we overwrite it; otherwise, we invalidate the corresponding slot in the SSD. This process continues until the buffer zone becomes full, and then garbage collection begins. In the garbage collection process (lines 5–6 in Algorithm 2), we first choose the set Bvictim of min-valid blocks as victims (note that the physical capacity of the victim blocks is equal to that of the entire memory). We then clean these blocks, and retrieve valid pages they store to the victim zone (note that, since we clean min-valid blocks, the victim zone is guaranteed to have enough room for these pages). In the placement process (line 7 in Algorithm 2), we move the pages from the memory to the SSD according to their predicted death times.
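The sketch below captures one such iteration in simplified form, operating on plain dictionaries; the parameter names (M for the memory size, alpha for the spare factor) and the bookkeeping are our own choices, and Algorithm 2 remains the authoritative description.

```python
def online_iteration(stream, oracle, blocks, loc, M, n, alpha):
    """stream: iterator of (page, time) requests;
    oracle(page, time) -> predicted death time;
    blocks: list of {"valid": {page: predicted_death}} (one dict per block);
    loc: page -> index of the block holding its valid copy, or "MEM".
    Returns the number of rewrites performed by this iteration."""
    buffer = {}                                   # buffer zone (size alpha*M)
    # (1) Service: absorb requests until the buffer zone fills up.
    while len(buffer) < int(alpha * M):
        p, t = next(stream)
        if loc.get(p) not in (None, "MEM"):
            del blocks[loc[p]]["valid"][p]        # invalidate the SSD copy
        buffer[p] = oracle(p, t)                  # a repeated page is overwritten
        loc[p] = "MEM"
    # (2) Garbage collection: clean the min-valid blocks whose total capacity
    #     equals the memory size; their valid pages go to the victim zone.
    #     (When the device is full, these hold at most (1 - alpha)*M pages.)
    victims = sorted(range(len(blocks)),
                     key=lambda b: len(blocks[b]["valid"]))[:M // n]
    victim_zone, rewrites = {}, 0
    for b in victims:
        victim_zone.update(blocks[b]["valid"])
        rewrites += len(blocks[b]["valid"])
        blocks[b]["valid"] = {}                   # erase the block
    # (3) Placement: move everything in memory to the cleaned blocks, ordered
    #     by predicted death time so that similar death times share a block.
    pending = {**victim_zone, **buffer}
    for i, p in enumerate(sorted(pending, key=pending.get)):
        b = victims[i // n]
        blocks[b]["valid"][p] = pending[p]
        loc[p] = b
    return rewrites
```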
When the condition for ending a phase is satisfied, we perform a reordering step. An efficient implementation for this step can be found in the full version of this paper. In contrast to the offline case, the online algorithm gets a fixed confidence parameter 0 < ρ < 1 as input. This parameter should express the user's degree of confidence in the oracle's predictions, where higher confidence is expressed through higher values of ρ. The condition for ending a phase depends on the value of ρ, where higher values of ρ lead to longer phases. As a consequence, higher values of ρ increase the dependency of the performance on the magnitude of prediction errors and decrease the overhead imposed by the reordering steps. In the extreme case, if ρ is very close to 1, then we no longer bound the performance of the algorithm as a function of the error's magnitude. On the other hand, if ρ is very close to 0, then the algorithm performs an infinite sequence of reordering steps without serving any request, resulting in an unbounded write amplification.
5.3. Performance modeling
The above adaptations require us to remodel the performance of algorithms. Obviously, write amplification guarantees may now be given as a function of the oracle's error and the memory size.
Error model. Denote by ŷ (p, t) the predicted death time of page p at time t. We define the prediction error on a specific request (p, t) by ηt = |y(p, t) − ŷ(p, t)|, and the total error by η = ∑t ηt. Note that we do not make any assumptions about the oracle's error; specifically, our results will be parameterized as a function of the error magnitude, and not its distribution. We note that in practice, our results will be with respect to the observed value of η (i.e., the prediction errors that "interfere" with the SSD management), rather than its actual value, which might be much higher.
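Given true death times (for example, from the annotate_death_times helper above) and the oracle's predictions aligned by request index, the total error is just a sum of absolute differences. The sketch below leaves the handling of requests with an infinite true death time (pages never accessed again) to the caller, e.g., by capping them at T.

```python
def total_error(true_deaths, predicted_deaths):
    """eta = sum over all requests of |y(p, t) - yhat(p, t)|."""
    return sum(abs(y - yhat) for y, yhat in zip(true_deaths, predicted_deaths))
```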
Objective. Observe that when page overwrites occur, some page versions are never written to the SSD. Hence, we redefine the write amplification of an algorithm ALG given input σ by:
WAALG(σ) = #system writes / #user writes,
where #system writes counts every write of a page to the SSD and #user writes = T. In particular, note that now write amplification can be smaller than one (due to page overwrites in memory). In any case, under these conventions, minimizing the write amplification remains our objective.
We obtain the following bound on the write amplification of the proposed algorithm.
THEOREM 4. For any input σ,
The proof of this theorem is in the spirit of Theorem 3; however, it involves more details due to the dependence on prediction errors. A complete proof can be found in the full version of this paper.
We complement our theoretical findings with an experimental evaluation. Our goal is to demonstrate how our algorithms compare with the state-of-the-art practical algorithms for SSD management. For the purpose of this evaluation, we built a simple SSD simulator similar to the one used in Yadgar et al.14
6.1. Algorithms
We implemented the three algorithms, greedy, offline, and online, according to their description in Sections 2, 4, and 5, respectively. We ran our online algorithm with three values of ρ: 0.25, 0.5, and 0.75. For brevity, here we discuss only the results for ρ = 0.5.
We also implemented an SSD management algorithm with hot/cold data separation (hotcold). Hotcold classifies pages as hot, warm, or cold based on the death-time predictions used by our online algorithm. It writes the pages into disjoint sets (partitions) of blocks, with greedy SSD management applied within each partition. Blocks are allocated to partitions on demand, so partition sizes adjust dynamically to the input; this is suitable when using death times, because a page may be classified with a different temperature each time it is written. This is on par with how practical hot/cold data separation schemes classify page temperatures dynamically.
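For illustration, a three-way temperature classification can be derived from the same predicted death times; the thresholds below are our own placeholders (the paper does not specify how hotcold draws the hot/warm/cold boundaries), and each resulting partition is then managed with greedy garbage collection.

```python
def classify(predicted_death, now, hot_horizon=1_000, warm_horizon=100_000):
    """Map a predicted death time to a temperature; the horizons are illustrative."""
    remaining = predicted_death - now
    if remaining <= hot_horizon:
        return "hot"
    if remaining <= warm_horizon:
        return "warm"
    return "cold"
```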
6.2. Memory usage
We extended the standard greedy and hotcold algorithms to use the available memory as a write-back cache. The greedy algorithm, which does not rely on death-time predictions, manages the cache with least-recently-used (LRU) replacement. Namely, in the case of a cache "hit," a copy of the page is already in the memory and is simply overwritten there. In the case of a cache "miss," the least-recently accessed page in the memory is written to the SSD and evicted from the memory; the valid copy of the requested page on the SSD is invalidated, and its new copy is written in the memory. Hotcold uses the memory in a similar way, but gives priority to hot and warm pages in the memory: when the memory is full, a new page can cause an old page to be evicted only if its own temperature is equal to or warmer than that of the old page. For a fair comparison, we also implemented an extension of the offline algorithm that uses the memory as a write buffer, with Belady's optimal policy (evicting the page whose next access is farthest in the future) for page eviction.
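A minimal write-back LRU cache of the kind described above might look as follows (a sketch; the returned value is the page that must be flushed to the SSD, and hotcold's temperature-priority eviction would replace the plain LRU choice).

```python
from collections import OrderedDict

class WriteBackLRU:
    """Write-back cache with least-recently-used eviction."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()            # page -> cached (dirty) data

    def write(self, page):
        """Returns the evicted page that must be written to the SSD,
        or None on a cache hit or while there is still free room."""
        if page in self.pages:                # hit: overwrite in memory
            self.pages.move_to_end(page)
            self.pages[page] = True
            return None
        evicted = None
        if len(self.pages) >= self.capacity:  # miss with full memory: evict LRU
            evicted, _ = self.pages.popitem(last=False)
        self.pages[page] = True               # the new copy lives in memory
        return evicted
```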
6.3. Predictions and errors
We pre-processed all traces in order to annotate each page request with its death time. To evaluate the efficiency of our algorithm, we also created erroneous versions of those annotations. Specifically, we annotated each request σt for page p with a random value that is uniformly distributed between t and its correct death time y (p, t). Intuitively, we expect that in practice, prediction errors will be larger for pages whose next access is farther in the future.
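The erroneous annotations can be reproduced with a few lines; the handling of requests whose true death time is infinite (the page is never accessed again) is our own choice here.

```python
import random

def add_prediction_errors(true_deaths, seed=0):
    """Replace each true death time y(p, t) with a value drawn uniformly at
    random from [t, y(p, t)]; infinite death times are left unchanged."""
    rng = random.Random(seed)
    return [y if y == float("inf") else rng.uniform(t, y)
            for t, y in enumerate(true_deaths)]
```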
6.4. Input traces
We used two trace types: synthetic and real. The synthetic traces were generated according to uniform and Zipf distributions. We also used a synthetic trace which sequentially writes entire files. The real traces were taken from servers at MSR Cambridge,10 available via SNIA.13 These traces are extensively used by the storage systems community to evaluate caching, scheduling, and SSD management algorithms. The real-world traces exhibit a high skew, that is, a small percentage of their pages is responsible for a large portion of their accesses.
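The uniform and Zipf traces can be generated along the following lines; the Zipf skew parameter is illustrative rather than the value used in the paper.

```python
import random

def uniform_trace(P, T, seed=0):
    rng = random.Random(seed)
    return [rng.randrange(P) for _ in range(T)]

def zipf_trace(P, T, skew=1.0, seed=0):
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) ** skew for i in range(P)]
    return rng.choices(range(P), weights=weights, k=T)
```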
6.5. SSD parameters
In our experiments, we compared the performance of the above algorithms on an SSD with 128-page blocks and varying spare factors and memory sizes. For brevity, we present here only the results for α = 0.1. The full details of our implementation and trace characteristics, as well as additional experimental results, can be found in the full version of this paper.
7.1. Accurate predictions
In our first experiment, we compared our proposed algorithms to the two practical online algorithms, greedy and hotcold, on all traces, with increasing memory sizes and accurate death time predictions. Figure 4 shows the write amplification of these algorithms. The results confirm the common knowledge that hot/cold separation can significantly reduce write amplification with respect to greedy, and that this reduction increases with the skew in the workload. Our online algorithm outperforms greedy on all the input traces and all settings, and it usually outperforms hotcold as well. The offline algorithm consistently achieves the lowest write amplification.
Figure 4. The write amplification of greedy, hotcold, and our online and offline algorithms with varying memory sizes and accurate death time predictions. The memory size is given as the percentage of the SSD's physical size.
The write amplification is highest for the Uniform trace: since all the pages have the same update frequency, the probability of a cache hit is low for all algorithms. Our online algorithm exhibits a significant improvement over both greedy and hotcold for this trace, and its write amplification is higher than that of the offline algorithm by only 7%-23%.
The skewed accesses in the Zipf and real traces (prxy_0, proj_0, and src1_2) imply that the memory can absorb a high portion of the requests, even if it only stores the most recently used pages. Indeed, the write amplification is lower than 1 in most parameter combinations for those traces. Our online algorithm is mostly better than greedy and hotcold; its advantage is larger in settings with modest memory sizes, where the effectiveness of the memory as a cache is limited.
The Sequential trace presents our online algorithm with a challenge. The large sequential writes always generate at least one block containing only invalid slots; hence, the greedy and hotcold algorithms simply erase one such block at a time, which is sufficient for serving the entire trace without rewrites. The online algorithm, on the other hand, always cleans a fixed number of victim blocks whose total capacity equals the memory size, which sometimes results in unnecessary rewrites (especially for the larger memory sizes). For this reason, the three online algorithms are comparable on this trace.
7.2. The effect of prediction errors
In the next experiment, we evaluate the performance of our online algorithm using erroneous death-time predictions. We omit greedy from this experiment, because it does not use the death time annotations in the input, and is therefore not affected by prediction errors. Figure 5 shows the write amplification of hotcold and our online algorithm. As we expected, both algorithms are sensitive to prediction errors, because they rely on the predictions in order to (1) organize the pages in the SSD; and (2) select pages for eviction. Our online algorithm is more sensitive to errors than hotcold because its allocation decisions are made with finer granularity.
Figure 5. The write amplification of hotcold and our online algorithm with varying memory sizes with (Error) and without (NoError) prediction errors. The memory size is given as the percentage of the SSD's physical size.
Quantitatively, the prediction errors roughly double the write amplification of our online algorithm on both Uniform and Zipf traces. Prediction errors also affect our online algorithm with the Sequential trace: they cause it to spread pages from the same file on more blocks than necessary. We note, however, that sequential accesses in practice will not experience this extreme error scenario, as this is a relatively easy access pattern to predict. Hotcold is hardly affected by prediction errors in this trace because, in the worst case, a file will be split into at most three continuous slot ranges, one in each temperature.
On the prxy_0 trace, our online algorithm outperforms hotcold even with prediction errors. The other real traces, proj_0 and src1_2, contain a significant percentage of sequential accesses; indeed, real-world workloads often include a small set of hot pages and long sequential updates of large files or logs. For sequential access, the effectiveness of the memory as a cache is limited, which explains the higher write amplification of all the algorithms for those traces. The write amplification of our online algorithm is mostly higher than that of hotcold, which performs better on sequential accesses in the presence of prediction errors.
7.3. Consequences
Our offline algorithm achieves the lowest write amplification in all simulations, indicating the effectiveness of our death-time-aware placement technique. Its highest benefit is achieved for traces with modest or no skew, which are the worst-case inputs for the greedy and hotcold algorithms. Our online algorithm emphasizes worst-case guarantees, resulting in an excessive cost in the case of sequential accesses, which are handled naturally by the log-structured organization of greedy. On the other hand, in traces that do not present frequent sequential patterns, our online algorithm mostly outperforms other algorithms, especially when predictions are accurate — emphasizing that minimizing the oracle's error is important in order to take advantage of our technique.
In this paper, we initiated a comprehensive study of the SSD management problem. We first defined the problem formally and then explored it in both offline and online settings from a worst-case algorithmic perspective. Our paper opens up several intriguing research directions. The first one is resolving the complexity of the offline SSD management problem. We suspect that this problem is NP-hard, and thus designing a good approximation algorithm for it is an interesting avenue for further research. Another interesting direction is improving the dependency of our online algorithm on both the prediction error and the size of the auxiliary memory. Empirically, our online algorithm successfully deals with a wide range of input traces, and hence can be integrated with state-of-the-art algorithms to improve SSD performance.
1. Agarwal, R., Marrow, M. A closed-form expression for write amplification in NAND flash. In 2010 IEEE Globecom Workshops (2010), IEEE, Miami, FL, USA, 1846–1850.
2. Bux, W., Iliadis, I. Performance of greedy garbage collection in flash-based solid-state drives. Perform. Eval. 67, 11 (Nov. 2010), 1172–1186.
3. Chakraborttii, C., Litz, H. Reducing write amplification in flash by death-time prediction of logical block addresses. In Proceedings of the 14th ACM International Conference on Systems and Storage (2021), ACM, Haifa, Israel.
4. Desnoyers, P. Analytic models of SSD write performance. ACM Trans. Storage 10, 2 (Mar. 2014), 1–25.
5. Diwan, A., Pal, S., Ranade, A. Fragmented coloring of proper interval and split graphs. Discrete Appl. Math. 193 (2015), 110–118.
6. He, J., Kannan, S., Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H. The unwritten contract of solid state drives. In Proceedings of the 12th European Conference on Computer Systems (EuroSys'17) (2017), ACM, NY, 127–144.
7. Hsieh, J.-W., Kuo, T.-W., Chang, L.-P. Efficient identification of hot data for flash memory storage systems. ACM Trans. Storage 2, 1 (Feb. 2006), 22–40.
8. Lykouris, T., Vassilvitskii, S. Competitive caching with machine learned advice. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, volume 80 of Proceedings of Machine Learning Research (2018), PMLR, Stockholm, Sweden, 3302–3311.
9. Mitzenmacher, M., Vassilvitskii, S. Algorithms with predictions. In Beyond the Worst-Case Analysis of Algorithms. T. Roughgarden, ed. Cambridge University Press, Cambridge, UK, 2020, 646–662.
10. Narayanan, D., Donnelly, A., Rowstron, A. Write off-loading: Practical power management for enterprise storage. ACM Trans. Storage 4, 3 (Nov. 2008), 1–23.
11. Park, D., Du, D.H. Hot data identification for flash-based storage systems using multiple Bloom filters. In 27th IEEE Symposium on Mass Storage Systems and Technologies (MSST) (2011), IEEE, Denver, CO, USA.
12. Purohit, M., Svitkina, Z., Kumar, R. Improving online algorithms via ML predictions. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018 (December 3–8, 2018, Montréal, Canada) (2018), Curran Associates, Inc., Montreal, Canada, 9684–9693.
13. SNIA IOTTA Trace Repository. MSR Cambridge Traces, 2007. http://iotta.snia.org/traces/block-io/388.
14. Yadgar, G., Gabel, M., Jaffer, S., Schroeder, B. SSD-based workload characteristics and their performance implications. ACM Trans. Storage 17, 1 (Jan. 2021), 1–26.
15. Yang, Y., Misra, V., Rubenstein, D. On the optimality of greedy garbage collection for SSDs. SIGMETRICS Perform. Eval. Rev. 43, 2 (Sept. 2015), 63–65.
16. Yao, A.C.-C. Probabilistic computations: Toward a unified measure of complexity. In 18th Annual Symposium on Foundations of Computer Science (SFCS) (1977), IEEE, Los Alamitos, CA, USA, 222–227.
To view the accompanying Technical Perspective, visit doi.acm.org/10.1145/3596204
The original version of this paper was published in Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2021.