Current processor trends of integrating more cores with wider single-instruction, multiple-data (SIMD) units, along with a deeper and more complex memory hierarchy, have made it increasingly challenging to extract performance from applications. Some believe that traditional approaches to programming do not apply to these modern processors and that radical new languages must therefore be designed. In this paper, we question this thinking and offer evidence in support of traditional programming methods and the performance-versus-programming-effort effectiveness of multi-core processors and upcoming many-core architectures in delivering significant speedups and close-to-optimal performance for commonly used parallel computing workloads.
We first quantify the extent of the "Ninja gap," which is the performance gap between naively written C/C++ code that is parallelism unaware (often serial) and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X) for a 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap if left unaddressed will inevitably increase. We show how a set of well-known algorithmic changes coupled with advancements in modern compiler technology can bring down the Ninja gap to an average of just 1.3X. These changes typically require low programming effort, as compared to the very high effort in producing Ninja code. We show equally encouraging results for the upcoming Intel® Xeon Phi™ architecture which has more cores and wider SIMD. We thus demonstrate that we can contain the otherwise uncontrolled growth of the Ninja gap and offer a more stable and predictable performance growth over future architectures, offering strong evidence that radical language changes are not required.
Performance scaling across processor generations has previously relied on increasing clock frequency. Programmers could ride this trend and did not have to make significant code changes to see improved performance. However, clock frequency scaling has hit the power wall,16 and the free lunch for programmers is over.
Recent techniques for increasing processor performance have focused on integrating more cores with wider single-instruction, multiple-data (SIMD) units, while simultaneously making the memory hierarchy deeper and more complex. While the peak compute and memory bandwidth of recent processors have been increasing, it has become more challenging to extract performance out of these platforms. This has led to a situation where only a small number of expert programmers ("Ninja programmers") are capable of harnessing the full power of modern multi-/many-core processors, while the average programmer only obtains a small fraction of this performance. We define the term "Ninja gap" as the performance gap between naively written, parallelism-unaware (often serial) code and best-optimized code on modern multi-/many-core processors.
There have been many recent publications8, 14, 17, 24 that show 10–100X performance improvements for real-world applications through optimized, platform-specific parallel implementations, proving that a large Ninja gap exists. Such optimization typically requires high programming effort and may have to be repeated for each processor generation; however, these papers do not comment on the effort involved. In this paper, we aim to quantify the extent of the Ninja gap, analyze its causes, and investigate how much of the gap can be bridged with low effort using traditional C/C++ programming languages.a
We first quantify the extent of the Ninja gap. We use a set of real-world applications that require high throughput (and inherently have a large amount of parallelism to exploit). We choose throughput applications because they form an increasingly important class of applications7 and because they offer the most opportunity for exploiting architectural resources, leading to large Ninja gaps if naive code does not take advantage of these resources. We measure the performance of our benchmarks on a variety of platforms across different generations: 2-core Conroe, 4-core Nehalem, 6-core Westmere, Intel® Xeon Phi™,b and the NVIDIA C2050 GPU. Figure 1 shows the Ninja gap for our benchmarks on three CPU platforms: a 2.4 GHz 2-core E6600 Conroe, a 3.33 GHz 4-core Core i7 975 Nehalem, and a 3.33 GHz 6-core Core i7 X980 Westmere. The figure shows that there is up to a 53X gap between naive C/C++ code and best-optimized code for a recent 6-core Westmere CPU. The figure also shows that this gap has been increasing across processor generations: it grows from 5–20X on a 2-core Conroe system (average of 7X) to 20–53X on Westmere (average of 25X). This is in spite of micro-architectural improvements that have reduced the need for, and impact of, performing various optimizations.
We next analyze the sources of the large performance gap. There are many reasons why naive code performs badly. First, the code may not be parallelized, and compilers do not automatically identify parallel regions. This means that the increasing core count is not utilized in naive code, while the optimized code takes full advantage of it. Second, the code may not be vectorized, thus under-utilizing the increasing SIMD widths. While auto-vectorization has been studied for a long time, there are many difficult issues such as dependency analysis, memory alias analysis and control flow analysis which prevent compilers from vectorizing outer loops, loops with gathers (irregular memory accesses) and even innermost loops where dependency and alias analysis fails. A third reason for large performance gaps may be that the code is bound by memory bandwidth—this may occur, for instance, if the code is not blocked for cache hierarchies—resulting in cache misses.
Our analysis of code written by Ninja programmers shows that such programmers put in significant effort, using threading technologies such as pthreads along with low-level intrinsics for vectorization, to obtain performance. This can result in very complex code, especially when vectorizing irregular loops. In this work, we show that we can leverage recent compiler technologies that enable parallelization and vectorization of code with relatively low programmer effort. Parallelization can be achieved using OpenMP pragmas over the loop to be parallelized, so that the programmer avoids writing complex threading code. For simple loops without dependencies, automatic loop parallelization is also possible; we assume the use of pragmas in this work. For vectorization, recent compilers such as the Intel® Composer XE 2011 version have introduced a pragma that lets the programmer force loop vectorization, circumventing the need for dependency and alias analysis. This version of the compiler also has the ability to vectorize outer-level loops, and the Intel® Cilk™ Plus feature11 helps the programmer use this new functionality when it is not triggered automatically.c These features allow programmers to move away from lower-level intrinsics and/or assembly code and can immensely boost performance. Using these features, we show that the Ninja gap reduces to an average of 2.95X for Westmere. The remaining gap is either a result of bandwidth bottlenecks in the code or of the code being only partially vectorized due to irregular memory accesses. This remaining gap, while relatively small, will nevertheless inevitably increase on future architectures with growing SIMD widths and decreasing bandwidth-to-compute ratios.
In order to bridge the remaining gap, programmer intervention is required. Current compilers do not automate algorithmic-level changes that involve modifying memory layout; these must be performed manually by the programmer. We identify and suggest three critical algorithmic changes: blocking for caches, bandwidth-/SIMD-friendly data layouts, and, in some cases, choosing an alternative SIMD-friendly algorithm. An important class of algorithmic changes involves blocking the data structures to fit in the cache, thus reducing memory bandwidth pressure. Another class of changes involves eliminating the use of memory gather/scatter operations. Such irregular memory operations can increase latency and bandwidth usage, as well as limit the scope of compiler vectorization. A common data layout change is to convert data structures written in an Array of Structures (AOS) representation to a Structure of Arrays (SOA) representation. This helps prevent gathers when accessing one field of the structure across the array elements, and helps the compiler vectorize loops that iterate over the array. Finally, in some cases the code cannot be vectorized due to back-to-back dependencies between loop iterations, and a different, SIMD-friendly algorithm may need to be chosen.
Performing algorithmic changes does require programmer effort and insight, and we expect that education and training in parallel programming will play a big role in enabling programmers to develop better parallel algorithms. The payoffs are large: we show that after performing algorithmic changes, there is an average performance gap of only 1.3X between best-optimized and compiler-generated code. Moreover, this effort can be amortized across different processor generations and also across different computing platforms such as GPUs. Since the code has then been optimized for the underlying hardware trends of increasing core counts, widening SIMD, and slowly increasing bandwidth, a small and predictable performance gap will remain across future architectures. We demonstrate this by repeating our experiments on the Intel® Xeon Phi™ Knights Corner co-processor architecture, a recent x86-based many-core platform. We show that the Ninja gap is almost the same (1.2X). In fact, the addition of hardware gather support makes programmability easier for at least one benchmark. The combination of algorithmic changes coupled with modern compiler technology is an important step toward enabling programmers to ride the trend of parallel processing using traditional programming.
For our study, we analyze the compute and memory characteristics of recently proposed benchmark suites2, 3, 5 and choose a representative set of benchmarks from the suite of throughput computing applications. Throughput workloads deal with processing large amounts of data in a given amount of time, and require a fast response time for all the data processed. These include workloads from the areas of High Performance Computing, Financial Services, Image Processing, Databases, etc.5 Throughput computing applications have plenty of data- and thread-level parallelism, and have been identified as one of the most important classes of future applications, with compute and memory characteristics influencing the design of current and upcoming multi-/many-core processors. They also offer the most opportunity for exploiting architectural resources, leading to large Ninja gaps if naive code does not take advantage of increasing computational resources. We formulated a representative set of benchmarks, described below, that cover this wide range of throughput computing application domains. We capture the key computational kernels where most time is spent in throughput computing applications. As such, reducing the Ninja gap for our benchmarks will also translate to the applications themselves.
Ninja Performance: Table 1 provides details of the representative dataset sizes for each of the benchmarks. For each benchmark there exists a corresponding best-performing code, whose performance numbers have been previously citedd on platforms different from those used in our study. In order to perform a fair comparison, we implemented and aggressively optimized (including the use of intrinsics/assembly code) the benchmarks, and obtained performance comparable to the best reported numbers on the corresponding platform. This code was then executed on our platforms to obtain the best-optimized performance numbers we use in this paper. Table 1 (column 3) shows the best-optimized (Ninja) performance for all the benchmarks on the Intel® Core™ i7 X980. For the rest of the paper, Ninja performance refers to the performance numbers obtained by executing this code on our platforms.
In this section, we take each of the benchmarks described in Section 2, and attempt to bridge the Ninja gap starting with naively written code with low programming effort. For a detailed performance analysis, we refer the reader to our ISCA paper.23
Platform: We measured performance on a 3.3 GHz 6-core Intel® Core™ i7 X980 (code-named Westmere; peak compute: 158 GFlops, peak bandwidth: 30 GBps). The cores feature an out-of-order super-scalar micro-architecture with 2-way simultaneous multi-threading (SMT) and 4-wide SIMD units that support a wide range of instructions. Each core has an individual 32 KB L1 and 256 KB L2 cache. The cores share a 12 MB last-level cache (LLC). Our system has 12 GB RAM and runs SuSE Enterprise Linux (ver. 11). We use the Intel® Composer XE 2011 compiler.
Methodology: For each benchmark, we attempt to first get good single thread performance by exploiting instruction and data level parallelism. To exploit data level parallelism, we measure the SIMD scaling for each benchmark by running the code with auto-vectorization enabled/disabled (-no-vec compiler flag). If SIMD scaling is not close to peak, we analyze the generated code to identify architectural bottlenecks. We then obtain thread level parallelism by adding OpenMP pragmas to parallelize the benchmark and evaluate thread scaling—again evaluating bottlenecks. Finally, we make necessary algorithmic changes to overcome the bottlenecks.
Compiler pragmas used: We use OpenMP for thread-level parallelism, and use the auto-vectorizer or recent technologies such as the array notations introduced as part of Intel® Cilk™ Plus (hereafter referred to as array notations) for data parallelism. Details about the specific compiler techniques are available in Tian et al.26 The directives we add to the code are the OpenMP parallel for pragma for threading and #pragma simd (or array notations) for vectorization; on the command line, we use vectorization flags such as -xSSE4.2 and -xAVX (and -no-vec to disable vectorization).
Summary: In this section, we analyzed each benchmark, and reduced the Ninja gap to within 1.1–1.6X by applying necessary algorithmic changes coupled with the latest compiler technology.
In this section, we identify the steps taken to bridge the Ninja gap with low programmer effort. The key steps are, first, to perform a set of well-known algorithmic optimizations to overcome scaling bottlenecks, and, second, to use the latest compiler technology for vectorization and parallelization. We now summarize our findings with respect to the gains we achieve in each step.
4.1 Algorithmic Changes
Algorithmic changes do require programmer effort and some insight, but are essential to avoid vectorization issues and bandwidth bottlenecks. Figure 5 shows the performance improvements due to the set of well-known algorithmic optimizations that we describe below.
AOS to SOA conversion: A common optimization that helps prevent gathers and scatters in vectorized code is to convert data structures from Array-Of-Structures (AOS) to Structure-Of-Arrays (SOA). Keeping a separate array for each field allows contiguous memory accesses when the code is vectorized. AOS structures require gathers and scatters, which can both reduce SIMD efficiency and introduce extra bandwidth and latency for memory accesses. The presence of a hardware gather/scatter mechanism does not eliminate the need for this transformation, because gather/scatter accesses commonly incur significantly higher bandwidth and latency than contiguous loads. Such transformations are advocated for a variety of architectures, including GPUs.19 Figure 5 shows that for our benchmarks, AOS to SOA conversion helped by an average of 1.4X.
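As a minimal sketch of the layout change (the Body structure, field names, and squared_norms function are hypothetical, used only for illustration):

    enum { N = 1024 };

    /* AOS: the fields of one body are adjacent in memory; a loop vectorized
       over i must gather x (and y, z) from strided locations. */
    struct BodyAOS { float x, y, z, m; };

    /* SOA: each field is a contiguous array; the same loop now issues
       unit-stride loads, which the compiler can vectorize directly. */
    struct BodiesSOA { float x[N], y[N], z[N], m[N]; };

    void squared_norms(const struct BodiesSOA *b, float d2[N]) {
        for (int i = 0; i < N; i++)   /* auto-vectorizes with contiguous accesses */
            d2[i] = b->x[i] * b->x[i] + b->y[i] * b->y[i] + b->z[i] * b->z[i];
    }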
Blocking: Blocking is a well-known optimization that can help avoid bandwidth bottlenecks in a number of applications. The key idea is to exploit the inherent data reuse available in the application by ensuring that data remains in caches across multiple uses, both in the spatial domain (1-D, 2-D, or 3-D), and temporal domain.
In terms of code change, blocking involves a combination of loop splitting and interchange. The code snippet in Figure 6a shows an example of blocking the NBody code. There are two loops (body1 and body2) iterating over all bodies. The original code on the top streams through the entire set of bodies in the inner loop, and must load the body2 value from memory in each iteration. The blocked code is obtained by splitting the body2 loop into an outer loop iterating over bodies in multiples of BLOCK and an inner body22 loop iterating over elements within the block. This code reuses a set of BLOCK body2 values across multiple iterations of the body1 loop. If BLOCK is chosen such that this set of values fits in cache, memory traffic is brought down by a factor of BLOCK. In terms of performance (Figure 5), we achieve an average improvement of 1.6X (up to 4.3X for LBM and the 7-point stencil).
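As an illustrative sketch in the spirit of Figure 6a (force() is a placeholder for the actual body-body interaction, and n is assumed to be a multiple of BLOCK):

    #define BLOCK 256   /* chosen so BLOCK values of pos fit in cache */

    static float force(float a, float b) { return b - a; }   /* placeholder interaction */

    /* Original: the inner body2 loop streams over all n bodies for every
       body1, so every pos[body2] is re-loaded from memory n times. */
    void nbody_naive(int n, const float *pos, float *acc) {
        for (int body1 = 0; body1 < n; body1++)
            for (int body2 = 0; body2 < n; body2++)
                acc[body1] += force(pos[body1], pos[body2]);
    }

    /* Blocked: split body2 into chunks of BLOCK; each chunk stays in cache
       while it is reused across all body1 iterations, cutting memory
       traffic by roughly a factor of BLOCK. */
    void nbody_blocked(int n, const float *pos, float *acc) {
        for (int block = 0; block < n; block += BLOCK)
            for (int body1 = 0; body1 < n; body1++)
                for (int body22 = block; body22 < block + BLOCK; body22++)
                    acc[body1] += force(pos[body1], pos[body22]);
    }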
SIMD-friendly algorithms: In some cases, the naive algorithm cannot easily be vectorized, either due to back-to-back dependencies between loop iterations or due to the heavy use of gathers and scatters in the code. A different, more SIMD-friendly algorithm may then be required. Figure 6b shows an example of this in MergeSort. The code on the left shows the traditional algorithm, where only two elements are merged at a time and the minimum is written out. There are back-to-back dependencies due to the array increment operations, and hence the code cannot be vectorized. The code on the right implements a SIMD-friendly merging network,6 which merges two sequences of SIMD-width (S) sized elements using a sequence of min, max, and interleave operations. This code auto-vectorizes, with each highlighted line corresponding to one SIMD instruction. The new code does perform more computation (by a constant factor of log(S)), but still yields a gain of 2.3X for 4-wide SIMD. Since such algorithmic changes involve a tradeoff between total computation and SIMD-friendliness, the decision to use them must be taken consciously by the programmer.
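As an illustrative scalar sketch of the idea behind the merging network (hand-written for S = 4, not the exact code from Figure 6b), the merge of two sorted runs becomes a fixed bitonic pattern of min/max operations that a vectorizing compiler can map onto SIMD min, max, and interleave instructions:

    #include <math.h>

    /* Merge two sorted 4-element runs a and b into out[0..7]. */
    void merge4(const float a[4], const float b[4], float out[8]) {
        float r[4], lo[4], hi[4], l2[4], h2[4];
        for (int i = 0; i < 4; i++) r[i] = b[3 - i];   /* reverse b: (a, r) is bitonic */
        for (int i = 0; i < 4; i++) {                  /* stage 1 (stride 4) */
            lo[i] = fminf(a[i], r[i]);                 /* the 4 smallest values */
            hi[i] = fmaxf(a[i], r[i]);                 /* the 4 largest values */
        }
        for (int i = 0; i < 2; i++) {                  /* stage 2 (stride 2) */
            l2[i] = fminf(lo[i], lo[i + 2]); l2[i + 2] = fmaxf(lo[i], lo[i + 2]);
            h2[i] = fminf(hi[i], hi[i + 2]); h2[i + 2] = fmaxf(hi[i], hi[i + 2]);
        }
        for (int i = 0; i < 4; i += 2) {               /* stage 3 (stride 1) */
            out[i]     = fminf(l2[i], l2[i + 1]); out[i + 1] = fmaxf(l2[i], l2[i + 1]);
            out[i + 4] = fminf(h2[i], h2[i + 1]); out[i + 5] = fmaxf(h2[i], h2[i + 1]);
        }
    }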
Summary: Using well-known algorithmic techniques, we get an average of 2.4X performance gain on 6-core Westmere. With increasing cores, SIMD width, and compute-to-bandwidth ratios, gains due to algorithmic changes will further increase.
4.2 Compiler Technology
Once the algorithmic changes have been made, we show the impact of the parallelization and vectorization technology present in recent compilers on bridging the Ninja gap.
Parallelization. We parallelize our benchmarks using OpenMP pragmas, typically over the outermost loop. OpenMP offers a portable solution that allows the programmer to specify the number of threads to launch, thread affinities to cores, and thread-private/shared variables. Since throughput benchmarks offer significant TLP (typically in an outer for loop), we generally use an omp parallel for pragma. One example is shown in Figure 7a for complex 1D convolution. The use of SMT threads can help hide latency in the code; hence we sometimes obtain more than 6X scaling on our 6-core system.
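As an illustrative sketch in the spirit of Figure 7a (the kernel and array names are simplified stand-ins in SOA form, not the exact benchmark code), a single pragma over the outermost loop is the only threading-related change:

    /* Complex 1D convolution, SOA layout; compile with OpenMP enabled. */
    void conv1d_complex(int n, int k,
                        const float *in_re, const float *in_im,   /* n + k values */
                        const float *k_re,  const float *k_im,    /* k values */
                        float *out_re, float *out_im) {           /* n values */
        /* Distribute the outer loop over output points across cores. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            float re = 0.0f, im = 0.0f;
            for (int j = 0; j < k; j++) {
                re += in_re[i + j] * k_re[j] - in_im[i + j] * k_im[j];
                im += in_re[i + j] * k_im[j] + in_im[i + j] * k_re[j];
            }
            out_re[i] = re;
            out_im[i] = im;
        }
    }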
Vectorization.
SSE versus AVX: Figure 8a shows the benefit from inner and outer loop auto-vectorization on Westmere, once the proper algorithmic changes are made. We also compare it to the SIMD scaling of the manually best-optimized code, and show scaling on AVX (8-wide SIMD) in Figure 8b using a 4-core 3.4 GHz Intel® Core™ i7-2600K Sandy Bridge system. We use the same compiler, changing only the compilation flag from -xSSE4.2 to -xAVX. In terms of performance, we obtain on average 2.75X SIMD scaling using compiled code, which is within 10% of the 2.9X scaling using best-optimized code. With 8-wide AVX, we obtain 4.9X and 5.5X scaling (again very close to each other) using compiled and best-optimized code, respectively.
Our overall SIMD scaling for best-optimized code is good for most of our benchmarks, the exceptions being MergeSort, TreeSearch, and BackProjection. As explained in Section 4.1, we performed algorithmic changes in MergeSort and TreeSearch at the expense of performing more operations, resulting in lower-than-linear speedups. BackProjection does not scale linearly due to the presence of unavoidable gathers/scatters in the code, which limits its SIMD scaling to 1.8X on SSE (2.7X on AVX).
Inner loop vectorization: Most of our benchmarks vectorize over the inner loop of the code, either through compiler auto-vectorization or by adding #pragma simd when dependence or memory-alias analysis fails. This pragma is a directive to the compiler that the loop must be (and is safe to be) vectorized. Figure 7a shows an example where this pragma is used. Our average speedup for inner loop vectorization is 2.2X for SSE (3.6X on AVX).
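As a minimal stand-alone sketch (scale_add is a hypothetical function, not one of our kernels), the pragma asserts that vectorization is safe even though the compiler cannot prove that the pointers do not alias:

    /* Without the pragma the compiler must assume out may alias a or b and
       will typically not vectorize; the pragma asserts that the loop must
       be, and is safe to be, vectorized. */
    void scale_add(float *out, const float *a, const float *b, int n) {
        #pragma simd
        for (int i = 0; i < n; i++)
            out[i] = a[i] + 0.5f * b[i];
    }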
Outer loop vectorization: Vectorizing an outer-level loop is challenging: induction variables need to be analyzed across multiple loop levels, and loop control flow such as zero-trip tests and exit conditions has to be converted into conditional/predicated execution over multiple vector elements. The array notations technology helps the programmer avoid these complications without elaborate program changes. We gain benefits from outer-loop vectorization in LIBOR, where the inner loop is only partially vectorizable. We use array notations to vectorize the outer (completely independent) loop (Figure 7b). The scalar code is modified to change the outer loop index to reflect the vectorization and to compute results for multiple iterations in parallel. Note that the programmer declares arrays of size S (the SIMD width), and the X[a:b] notation stands for accessing b elements of X starting at index a. It is usually straightforward to change scalar code to array notations code. This change results in high speedups of 3.6X on SSE (7.5X on AVX).
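As an illustrative sketch in the spirit of Figure 7b (a simplified Monte Carlo-style accumulation rather than the actual LIBOR kernel; the function, array names, and model are hypothetical, and npaths is assumed to be a multiple of S), the outer loop over independent paths is stepped by S and each trip computes S iterations at once using array sections:

    #define S 8   /* SIMD width: 8 single-precision lanes with AVX */

    void walk(int npaths, int nsteps, const float *rate,
              const float *z /* npaths values */, float *result) {
        for (int p = 0; p < npaths; p += S) {
            float v[S];
            v[:] = 1.0f;                               /* broadcast to all S lanes */
            for (int t = 0; t < nsteps; t++)           /* inner loop runs per lane */
                v[:] = v[:] * (1.0f + rate[t] * z[p:S]);
            result[p:S] = v[:];                        /* store S results at once */
        }
    }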
Figure 9 shows the relative performance of best-optimized code versus compiler-generated code before and after algorithmic changes. We assume that the programmer has put in the effort to introduce the pragmas and compiler directives described in the previous sections. There is a 3.5X average gap between compiled code and best-optimized code before we perform algorithmic changes. This gap is primarily due to the compiled code being bound by memory bandwidth or having low SIMD efficiency. After we perform the algorithmic changes described in Section 4.1, this gap shrinks to an average of 1.4X. The only benchmark with a significant remaining gap is TreeSearch, where the compiler vectorizes the outer loop with gathers. The rest show 1.1–1.4X Ninja gaps, primarily due to extra instructions generated by the compiler (additional spill/fill code and loads/stores); these are hard problems where the compiler relies on heuristics.
Impact of Many-core Architectures: In order to test the Ninja gap on many-core platforms, we performed the same experiments on the Intel® Xeon Phi™ "Knights Corner" co-processor (KNC), which has 60/61 cores on a single die; each core features an in-order micro-architecture with 4-way SMT and a 512-bit SIMD unit.
Figure 10 shows the Ninja performance gap for KNC as well as for Westmere. The average Ninja gap for KNC is only 1.2X, which is almost the same as (in fact slightly smaller than) that on the CPUs. The main difference between the two comes from TreeSearch, which benefits from the hardware gather support on Xeon Phi and is close in performance (1.1X) to the best-optimized code.
The remaining Ninja gap after algorithmic changes remains small and stable across KNC and the CPUs, in spite of the much larger number of cores and wider SIMD on KNC. This is because our algorithmic optimizations focused on resolving vectorization and bandwidth bottlenecks in the code. Once these issues have been taken care of, future architectures will be able to exploit increasing hardware resources, yielding stable and predictable performance growth.
The algorithmic optimizations that we described in Section 4.1 are applicable to a variety of architectures including GPUs. A number of previous publications19, 21 have discussed the optimizations needed to obtain best performance on GPUs. In this section, we show the impact of the same algorithmic optimizations on GPU performance. We use the NVIDIA C2050 Tesla GPU for this study.
Although GPUs have hardware gather/scatter support, best coding practices (e.g., the CUDA C programming guide19) state the need to avoid uncoalesced global memory accesses, including converting data structures from AOS to SOA to reduce latency and bandwidth usage. GPUs also require blocking optimizations, which here refers to the transfer and management of data in the shared memory (or caches) of the GPU. Finally, the use of SIMD-friendly algorithms greatly benefits GPUs, which have a wider SIMD width than current CPUs. The average performance gain from algorithmic optimizations is 3.8X (Figure 11), higher than the 2.5X gain on CPUs, since GPUs have more SMs and a larger SIMD width, and hence sub-optimal algorithmic choices have a larger impact on performance.
There are a number of published papers that show 10–100X performance gains over previous work using carefully tuned code.1, 8, 14, 17, 24 Lee et al.15 summarized relevant hardware architecture features and a platform-specific software optimization guide on CPU and GPUs. While these works show the existence of a large Ninja performance gap, they do not describe the programming effort or how to bridge the Ninja gap.
In this work, we analyze the sources of the Ninja gap and use traditional programming models to bridge it with low programmer effort. A previous version of this paper was published as Satish et al.23 Production compilers have recently started to support parallelization and vectorization technology that has been published in compiler research. Examples include OpenMP for parallelization, available in recent GCC and ICC compilers, as well as auto-vectorization technology18 dealing with alignment constraints and outer loop vectorization. These technologies have been made available through straightforward pragmas and technology such as array notations, a part of Intel® Cilk™ Plus.11
However, naively written code may not scale even with compiler support, since it is bottlenecked by architectural features such as memory bandwidth or gathers/scatters, or because the algorithm cannot be vectorized. In such cases, algorithmic changes such as blocking, SOA conversion, and SIMD-friendly algorithms are required. Various techniques have been proposed to assist with these algorithmic changes, using compiler-assisted optimization, cache-oblivious algorithms, or specialized languages. Such changes usually require programmer intervention and effort, but carry over across a number of architectures and generations. For instance, a number of papers have shown the impact of similar algorithmic optimizations on GPUs.21
While our work focuses on traditional programming models, radical programming model changes have also been proposed to bridge the gap. Recent suggestions include domain-specific languages (a survey is available in Fowler9), the Berkeley View project,2 and OpenCL for programming heterogeneous systems. There have also been library-oriented approaches such as Intel® Threading Building Blocks, the Intel® Math Kernel Library, the Microsoft Parallel Patterns Library (PPL), etc. We believe these are orthogonal to, and can be used in conjunction with, traditional models.
There is also a body of literature that adopts auto-tuning as an approach to bridging the gap.8, 25 Auto-tuning results can be significantly worse than the best-optimized code. For example, for the auto-tuned stencil computation,8 our best-optimized code is 1.5X better in performance. Since our Ninja gap for stencil is only 1.1X, our compiled code performs 1.3X better than the auto-tuned code. We expect our compiled results to be generally competitive with auto-tuned results, while offering the advantage of using standard tool-chains that ease portability across processor generations.
In this work, we showed that there is a large Ninja performance gap of 24X for a set of real-world throughput computing benchmarks on a recent multi-core processor. This gap, if left unaddressed, will inevitably increase. We showed how a set of simple and well-known algorithmic techniques coupled with advancements in modern compiler technology can bring down the Ninja gap to an average of just 1.3X. These changes only require low programming effort, as compared to the very high effort in producing Ninja code.
1. Arora, N., Shringarpure, A., Vuduc, R.W. Direct N-body Kernels for multicore platforms. In ICPP (2009), 379–387.
2. Asanovic, K., Bodik, R., Catanzaro, B., Gebis, J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., et al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, 2006.
3. Bienia, C., Kumar, S., Singh, J.P., Li, K. The PARSEC benchmark suite: Characterization and architectural implications. In PACT (2008), 72–81.
4. Brace, A., Gatarek, D., Musiela, M. The market model of interest rate dynamics. Mathematical Finance 7, 2 (1997), 127–155.
5. Chen, Y.K., Chhugani, J., et al. Convergence of recognition, mining and synthesis workloads and its implications. Proc. IEEE 96, 5 (2008), 790–807.
6. Chhugani, J., Nguyen, A.D., et al. Efficient implementation of sorting on multi-core SIMD CPU architecture. PVLDB 1, 2 (2008), 1313–1324.
7. Dally, W.J. The end of denial architecture and the rise of throughput computing. Keynote speech at the Design Automation Conference (2010).
8. Datta, K. Auto-tuning Stencil Codes for Cache-based Multicore Platforms. PhD thesis, EECS Department, University of California, Berkeley (Dec 2009).
9. Fowler, M. Domain Specific Languages, 1st edn. Addison-Wesley Professional, Boston, MA, 2010.
10. Giles, M.B. Monte Carlo Evaluation of Sensitivities in Computational Finance. Technical report. Oxford University Computing Laboratory, 2007.
11. Intel. A quick, easy and reliable way to improve threaded performance, 2010. software.intel.com/articles/intel-cilk-plus.
12. Ismail, L., Guerchi, D. Performance evaluation of convolution on the cell broadband engine processor. IEEE PDS 22, 2 (2011), 337–351.
13. Kachelrieß, M., Knaup, M., Bockenbach, O. Hyperfast perspective cone-beam backprojection. IEEE Nuclear Science 3 (2006), 1679–1683.
14. Kim, C., Chhugani, J., Satish, N., et al. FAST: fast architecture sensitive tree search on modern CPUs and GPUs. In SIGMOD (2010). 339–350.
15. Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., et al. Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. In ISCA (2010). 451–460.
16. Mudge, T.N. Power: A first-class architectural design constraint. IEEE Computer 34, 4 (2001), 52–58.
17. Nguyen, A., Satish, N., et al. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In SC10 (2010). 1–13.
18. Nuzman, D., Henderson, R. Multi-platform auto-vectorization. In CGO (2006). 281–294.
19. Nvidia. CUDA C Best Practices Guide 3, 2 (2010).
20. Podlozhnyuk, V. Black–Scholes option pricing. Nvidia, 2007. http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/BlackScholes/doc/BlackScholes.pdf.
21. Ryoo, S., Rodrigues, C.I., Baghsorkhi, S.S., Stone, S.S., Kirk, D.B., Hwu, W.M.W. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In PPoPP (2008). 73–82.
22. Satish, N., Kim, C., Chhugani, J., et al. Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In SIGMOD (2010). 351–362.
23. Satish, N., Kim, C., Chhugani, J., Saito, H., Krishnaiyer, R., Smelyanskiy, M., et al. Can traditional programming bridge the Ninja performance gap for parallel computing applications? In ISCA (2012). 440–451.
24. Smelyanskiy, M., Holmes, D., et al. Mapping high-fidelity volume rendering to CPU, GPU and many-core. IEEE TVCG 15, 6 (2009), 1563–1570.
25. Sukop, M.C., Thorne, D.T., Jr. Lattice Boltzmann Modeling: An Introduction for Geoscientists and Engineers, 2006.
26. Tian, X., Saito, H., Girkar, M., Preis, S., Kozhukhov, S., Cherkasov, A.G., Nelson, C., Panchenko, N., Geva, R. Compiling C/C++ SIMD extensions for function and loop vectorization on multicore-SIMD processors. In IPDPS Workshops (2012). 2349–2358.
a. Since measures of ease of programming such as programming time or lines of code are largely subjective, we show code snippets with the code changes required to achieve performance.
b. Intel, Xeon and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.
c. For more complete information about compiler optimizations, see the optimization notice at http://software.intel.com/en-us/articles/optimization-notice/.
d. The best reported numbers are cited from recent top-tier publications in the area of Databases, HPC, Image processing, etc. To the best of our knowledge, there does not exist any faster performing code for any of the benchmarks.
The original version of this paper was published in the Proceedings of the 39th Annual International Symposium on Computer Architecture (June 2012). IEEE Computer Society, Washington, D.C., 440–451.
Figure 1. Growing performance gap between naive serial C/C++ code and best-optimized code on 2-core Conroe (CNR), 4-core Nehalem (NHM), and 6-core Westmere (WSM) systems.
Figure 2. Breakdown of Ninja Performance Gap in terms of Instruction (ILP), Task (TLP), and Data Level Parallelism (DLP) before and after algorithm changes for NBody, BackProjection, Stencil, and LBM. The algorithm change involves blocking.
Figure 3. Breakdown of the Ninja gap for (a) TreeSearch and MergeSort and (b) 2D convolution and VR. The benchmarks in (a) require rethinking the algorithms to be more SIMD-friendly, while those in (b) do not require any algorithmic changes.
Figure 4. Breakdown of Ninja Performance Gap for Libor, Complex 1D convolution, and BlackScholes. All benchmarks require AOS to SOA conversion to obtain good SIMD scaling.
Figure 5. Benefit of three different algorithmic changes to our benchmarks normalized to code before any algorithmic change. The effect of algorithmic changes is cumulative.
Figure 6. Code snippets showing algorithmic changes for (a) blocking in NBody and (b) SIMD-friendly MergeSort algorithm.
Figure 7. Code snippets showing compiler techniques for (a) Parallelization and inner loop vectorization in complex 1D convolution and (b) Outer loop vectorization in LIBOR. Note that the code changes required are small and can be achieved with low programmer effort.
Figure 8. Breakdown of benefits from inner and outer loop vectorization on (a) SSE and (b) AVX. We also compare to the best-optimized performance.
Figure 9. Relative performance between the best-optimized code, the compiler-generated code after algorithmic change, and the compiler-generated code before algorithmic change. Performance is normalized to the compiled code after algorithmic change.
Figure 10. Gap between best-optimized and compiler-generated code after algorithmic changes for Intel® Xeon Phi™ and CPUs.
Figure 11. Benefit of the algorithmic changes described in Figure 6 on the NVIDIA Tesla C2050 GPU.
Table 1. Various benchmarks and the respective datasets used, along with best optimized (Ninja) performance on Core i7 X980.