Communications of the ACM

The Ninja Project


When the Java programming language was introduced by Sun Microsystems in 1995, there was a perception (properly founded at the time) that its many benefits came at a significant performance cost. The related deficiencies were especially apparent in numerical computing. Our own measurements in 1997 with second-generation Java Virtual Machines (JVMs) found differences in performance of up to one hundredfold relative to C and Fortran. Initial experience with poor performance caused many developers of high-performance numerical applications to reject Java out of hand as a platform for their applications.

Despite the more recent progress in Java optimization, the performance of commercially available Java platforms is still not on par with state-of-the-art Fortran and C compilers. Programs using complex arithmetic exhibit particularly poor performance. Moreover, today's Java platforms are incapable of automatically applying important optimizations to numerical code, including loop transformations and automatic parallelization [12]. Nevertheless, we find no technical barriers to high-performance computing in Java. To prove this thesis, we developed a prototype Java environment, called Numerically INtensive JAva, or NINJA, that has demonstrated that Java can obtain Fortran-like performance on a variety of problems in scientific and technical computing. NINJA has addressed such high-performance programming issues as dense and irregular matrix computations, calculations with complex numbers, automatic loop transformations, and automatic parallelization. The NINJA techniques are straightforward to implement and allow reuse of existing optimization components already deployed by software vendors for other languages [9], thus lowering the economic barriers to Java's acceptance in numerically intensive applications.

The next challenge for numerically intensive computing in Java is convincing developers and managers in this domain that Java's benefits can be obtained with performance comparable to highly tuned Fortran and C. Once they accept that Java's poor performance is only an artifact of particular implementations of the language, and that there are no technical barriers to achieving excellent numerical performance, NINJA-derived techniques will allow vendors and researchers to quickly deliver high-performance Java platforms to program developers.


Sources of Java Performance Difficulties

Among the many difficulties associated with optimizing numerical code in Java, we've identified three characteristics that are, in a way, unique to the language: the lack of truly rectangular, regularly shaped arrays; mandatory exception checks for null-pointer and out-of-bounds array accesses, combined with a precise exception model; and weak support for complex numbers and other arithmetic systems.

Arrays in Java. Unlike Fortran and C, Java has no direct support for truly rectangular multidimensional arrays. Java allows some simulation of multidimensional arrays through "arrays of arrays." Figure 1(a) shows an array of arrays used to simulate a rectangular 2D array. In this case, all rows have the same length, but arrays of arrays can be used to construct far more complicated structures, as in Figure 1(b); the only way to determine the minimum length of a row is to examine all rows. In contrast, determining the size of a true rectangular array, as in Figure 1(c), requires looking at only a few parameters. Moreover, the shape of an array of arrays can change during computation. While there are other possible solutions, the simplest by far is a data structure whose rectangular shape is explicit and fixed, as in the rectangular 2D arrays in Figure 1(c).

Arrays of arrays may also have complicated aliasing patterns, with both intra- and inter-array aliasing, as in Figure 1(b). Alias analysis, which is important for compiler optimizations, is extremely difficult for arrays of arrays but is much easier for true multidimensional arrays like those in Figure 1(c) (Z and T). There is no intra-array aliasing for true multidimensional arrays; inter-array aliasing can be determined through simple tests [12].
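To make these hazards concrete, the following minimal sketch (ordinary Java, our own illustration) builds both a rectangular-looking array of arrays and a ragged one whose rows alias each other; a compiler seeing only the static type double[][] must assume either situation is possible.

    public class ArrayOfArraysDemo {
        public static void main(String[] args) {
            // (a) An array of arrays simulating a rectangular 2D array:
            // every row happens to have length 4, but nothing enforces it.
            double[][] rect = new double[3][4];

            // (b) The same static type can also describe a ragged structure
            // with intra-array aliasing: rows 1 and 2 are the same object.
            double[][] ragged = new double[3][];
            ragged[0] = new double[2];
            ragged[1] = new double[5];
            ragged[2] = ragged[1];
            ragged[1][0] = 42.0;
            System.out.println(ragged[2][0]); // prints 42.0 -- the rows alias

            // Finding the minimum row length requires examining every row.
            int min = Integer.MAX_VALUE;
            for (int i = 0; i < ragged.length; i++)
                min = Math.min(min, ragged[i].length);
            System.out.println("minimum row length = " + min); // prints 2
        }
    }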


Java's performance difficulties can be solved through a careful combination of language and compiler techniques.


The Java exception model. Java requires that every array access be checked: the array reference must not be null, and the index must be within bounds. Java's exception model further states that when the execution of a piece of code causes an exception, all effects of instructions prior to the exception must be "visible," while effects of instructions after the exception should not be [6]. This requirement harms performance in two ways. First, checking the validity of array references contributes to runtime overhead. Second, code reordering in general, and loop iteration reordering in particular, is prohibited, thus preventing almost all optimizations for numerical codes.
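The constraint is easy to see in a small, runnable example (our own illustration, not code from NINJA): when the loop below faults at i = 5, the language guarantees exactly which writes are visible, so any reordering of the iterations would be observable and is therefore illegal unless the compiler proves the loop cannot throw.

    public class PreciseExceptions {
        public static void main(String[] args) {
            double[] a = new double[10];
            double[] b = new double[5]; // shorter than the loop assumes
            try {
                for (int i = 0; i < 10; i++) {
                    a[i] = 1.0;
                    b[i] = 2.0;         // throws when i reaches 5
                }
            } catch (ArrayIndexOutOfBoundsException e) {
                // Guaranteed state here: a[0..5] == 1.0, b[0..4] == 2.0,
                // and all later elements untouched. Running the loop
                // backward or in blocks would leave a different state,
                // so such transformations are forbidden unless the
                // runtime checks are proven unnecessary.
                System.out.println("a[5] = " + a[5] + ", b[4] = " + b[4]);
            }
        }
    }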

Complex numbers in Java. From a numerical perspective, Java has direct support only for real numbers; Fortran and C++ both efficiently support other arithmetic systems, including complex numbers. This efficiency results from the ability to define low-cost data structures that can be allocated on the stack or in registers. Java, in contrast, requires that any nonprimitive data type be represented as a full-fledged object. Complex numbers are typically implemented as objects of a class Complex, so every time an arithmetic operation generates a new complex value, a new Complex object is allocated and the old value is invalidated and subject to garbage collection.

These three difficulties—arrays, exceptions, and complex numbers—at the core of Java's performance deficiencies prevent the application of mature compiler optimization technology to Java and thus prevent Java from being truly competitive with more established languages, including Fortran and C.


Java Performance Solutions

Performance can be improved through a combination of language and compiler techniques. For example, we developed new class libraries that enrich the language, along with compiler techniques that take advantage of these new constructs to perform automatic optimizations. Above all, we maintain full Java portability across all virtual machines.

The Array package and semantic expansion. To attack the absence of truly rectangular multidimensional arrays in Java, we defined an Array package with multidimensional arrays (denoted here as Arrays with a capital A) of various types and ranks. For example, doubleArray2D implements a 2D Array of double-precision floating-point numbers, whereas ComplexArray3D implements a 3D Array of complex numbers (see the sidebar "Array Package for Java"). Several access and manipulation operations are defined for the Array data types. The Arrays have an immutable rectangular and dense shape that simplifies testing for aliases and facilitates the optimization of runtime checks. The Array classes are written in fully compliant Java code and can be run on any JVM, ensuring the portability of programs written using the Array package.

Array elements can be accessed via the get and set element operations. For example, A.get(i,j) retrieves the value of element (i,j) of a 2D Array A. Similarly, A.set(i,j,x) sets the value of that element to the value of x. The runtime overhead of a method invocation for each element access would be unacceptable for high-performance computing. This problem is avoided through a compiler technique known as semantic expansion, in which the compiler looks for specific method calls and substitutes efficient code for each call. For the Array get and set operations, this substitution yields code as efficient as multidimensional array accesses in Fortran and C. This technique allows programs using the Array package to achieve high performance when executed on JVMs recognizing that package.
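As an illustration, here is a matrix multiply written against the get/set interface. The get and set operations are as described in the text; the stub doubleArray2D class below (dense row-major storage, a size() extent query, and a two-argument constructor) is our own stand-in so the sketch is self-contained, not the actual package definition.

    // Stand-in for the Array package's doubleArray2D: dense storage and an
    // immutable rectangular shape, which is what makes semantic expansion
    // and alias tests straightforward.
    class doubleArray2D {
        private final double[] data;
        private final int rows, cols;
        doubleArray2D(int rows, int cols) {
            this.rows = rows; this.cols = cols;
            this.data = new double[rows * cols];
        }
        int size(int dim) { return dim == 0 ? rows : cols; }
        double get(int i, int j) { return data[i * cols + j]; }
        void set(int i, int j, double x) { data[i * cols + j] = x; }
    }

    public class MatMul {
        // On a NINJA-style JVM, each get/set call below would be expanded
        // inline into a direct indexed load or store.
        static doubleArray2D multiply(doubleArray2D a, doubleArray2D b) {
            int m = a.size(0), n = b.size(1), k = a.size(1);
            doubleArray2D c = new doubleArray2D(m, n);
            for (int i = 0; i < m; i++)
                for (int j = 0; j < n; j++) {
                    double s = 0.0;
                    for (int l = 0; l < k; l++)
                        s += a.get(i, l) * b.get(l, j);
                    c.set(i, j, s);
                }
            return c;
        }
    }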

The Complex class and semantic expansion. We also defined a complex number class as part of the Array package, along with methods implementing arithmetic operations on complex numbers. Again, semantic expansion is used to convert calls to these methods into code that directly manipulates complex number values, instead of full-fledged Complex objects. Values are converted to objects in a lazy manner upon encountering an operation that may require OO functionality. Thus, the programmer continues to treat complex numbers as objects (maintaining the clean semantics of the original language), while the compiler transparently transforms them into values for efficiency.
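A minimal sketch of such a class follows (our illustration; the names and details of the actual Complex class in the Array package may differ). Under ordinary JVM semantics, an expression such as a.times(b).plus(c) allocates two short-lived objects; semantic expansion instead keeps the (re, im) pair in registers and materializes an object lazily, only when one is demanded.

    public final class Complex {
        public final double re, im;
        public Complex(double re, double im) { this.re = re; this.im = im; }

        public Complex plus(Complex other) {
            return new Complex(re + other.re, im + other.im);
        }
        public Complex times(Complex other) {
            return new Complex(re * other.re - im * other.im,
                               re * other.im + im * other.re);
        }
        // Each arithmetic method above allocates a fresh object; a
        // semantically expanding compiler rewrites calls to these known
        // methods into direct floating-point operations on the two
        // double values.
    }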

Versioning for safe and alias-free regions. The Array package allows a compiler to perform simple transformations that eliminate the performance problems caused by Java's runtime checks and exception model. The idea is to create regions of code guaranteed to be free of exceptions. Once these exception-free, or "safe," regions are created, the compiler can apply code-reordering optimizations [12]. The safe regions are created by the versioning of loop nests. For each optimized loop nest, the compiler creates two versions—safe and unsafe—guarded by a runtime test. The test is constructed so that if all Arrays in the loop nest are valid (not null), and if all the indexing operations inside the loop generate inbound accesses, then the test evaluates to true. In that case, the safe version of the loop is executed; otherwise, the unsafe version is executed. Since the safe version cannot cause an exception, explicit runtime checks are omitted from the code.

We take the versioning approach a step further. Application of automatic loop transformation (and parallelization) techniques by a compiler generally requires alias disambiguation among the various arrays (single or multidimensional) referenced in a loop nest. With the Array package, it is easy to determine whether two arrays are distinct. Using these concepts, we can further specialize the safe version of a loop nest into two variants—one in which all arrays are guaranteed to be distinct (no aliasing), and one in which there may be aliasing between arrays. Mature loop optimization techniques are easily applied to the safe and alias-free regions.

Figure 2 shows an example of the versioning transformations for creating safe and alias-free regions; Figure 2(a) illustrates the original code, explicitly showing all null pointer and array bounds runtime checks being performed. (Operations checknull and checkbounds are actually implicit at the Java and "bytecode" levels but explicit here for illustration purposes.) Figure 2(b) illustrates the versioned code. A simple test for the values of the A and B pointers and a comparison of loop bounds and array extents can determine whether or not the loop will be free of exceptions.
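For concreteness, here is what the versioning looks like when written out by hand in Java (an illustration of the transformation, using plain arrays of arrays; NINJA generates the equivalent internally). Note how clumsy the guard is for arrays of arrays (this sketch checks only matching rows for aliasing; a complete test would need all row pairs), whereas with the Array package's true rectangular Arrays the same guard collapses to a few pointer and extent comparisons.

    public class Versioning {
        static void scale(double[][] a, double[][] b, int n, int m) {
            // Runtime guard: references non-null, extents cover the loop
            // bounds, and corresponding rows are distinct objects.
            boolean safe = a != null && b != null
                    && a.length >= n && b.length >= n && a != b;
            for (int i = 0; safe && i < n; i++)
                safe = a[i] != null && b[i] != null
                        && a[i].length >= m && b[i].length >= m
                        && a[i] != b[i];
            if (safe) {
                // Safe, alias-free version: no exception can occur here,
                // so the checks are gone and the compiler is free to
                // reorder, tile, or parallelize these iterations.
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < m; j++)
                        a[i][j] = 2.0 * b[i][j];
            } else {
                // Unsafe version: the original code, with all of Java's
                // implicit null and bounds checks (and exceptions) intact.
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < m; j++)
                        a[i][j] = 2.0 * b[i][j];
            }
        }
    }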

Libraries for numerical computing. Optimized libraries are an important vehicle for achieving high performance in numerical applications. One approach is to make existing native libraries available to Java programmers through the Java Native Interface [3]. Another approach is to develop new libraries entirely in Java (see the sidebar "Numerical Linear Algebra in Java").


Implementation and Results

We have implemented these ideas in the NINJA prototype, which is based on the IBM XL family of compilers. In these compilers, front-ends for different languages transform programs into a common intermediate representation, called W-Code. The Toronto Portable Optimizer (TPO) W-Code-to-W-Code transformer performs classical optimizations, including constant propagation and dead code elimination, as well as high-level loop transformations based on aggressive dataflow analysis. Finally, the transformed W-Code is converted into optimized machine code by an architecture-specific back-end. Semantic expansion of the Array package methods [1] is implemented within the IBM compiler for the Java front-end [11]. Safe region creation and alias versioning are implemented in TPO.

NINJA translates machine-independent Java bytecode into machine-specific executable code in a separate step, prior to execution. Nothing prevents the techniques described here from being used in a dynamic compiler. Moreover, by using the quasi-static dynamic compilation model [10], the more expensive (in terms of compiler overhead) optimization and analysis techniques can be performed offline, sharply reducing the effect of compilation overhead.

To evaluate the performance effects of these techniques, we recently used a suite of benchmarks and a production data mining application [1, 8]. We compared the performance produced by the NINJA compiler with that of the IBM Development Kit for Java version 1.1.6 and the IBM XLF Fortran compiler on a variety of platforms.

Sequential execution results. Figure 3(a) summarizes results for eight real arithmetic benchmarks when running in strictly sequential (single-threaded) mode. The numbers at the top of the bars indicate actual Mflops. For the Java 1.1.6 version, arrays are implemented as double[][]. The NINJA version uses doubleArray2D Arrays from the Array package, optimized with semantic expansion. For six of the benchmarks (matmul, microdc, lu, cholesky, bsom, and shallow), the performance of the Java version (with the Array package and NINJA compiler) is 80% or more of the performance of the Fortran version. This high performance is due to well-known loop transformations, enabled by our techniques, that enhance data locality.


The impediments to widespread adoption of Java for numerically intensive computing are economic and social, not technical.


Results for complex arithmetic benchmarks. Figure 3(b) summarizes results for five complex arithmetic benchmarks (fft, matmul, lu, cfd, and microac). For the Java 1.1.6 version, complex arrays are represented using a Complex[][] array of Complex objects. The NINJA version uses ComplexArray2D Arrays from the Array package and semantic expansion. In all cases, we observed significant performance improvement between the Java 1.1.6 and NINJA versions ranging from a factor of 35 (1.7 to 60.5Mflops for cfd) to a factor of 75 (1.2 to 89.5Mflops for matmul). We achieved Java performance ranging from 55% (microac) to 85% (fft and cfd) of fully optimized Fortran code.

Parallel execution results. Loop parallelization is another important transformation enabled by safe region creation and alias versioning; we applied it automatically to the eight real arithmetic Java benchmarks. Figure 3(c) shows speedup results (relative to single-processor performance) of the parallel code optimized with NINJA. The compiler was able to parallelize loops in each of the eight benchmarks. Significant speedups (better than 50% efficiency on four processors) were obtained in six of them (matmul, microdc, lu, shallow, bsom, and fft).

Results for parallel libraries. To further evaluate the effectiveness of our solutions, we applied NINJA to a production data mining application. In this case, we used a parallel version of the Array package with multithreading to exploit parallelism within the Array operations. The user application was strictly sequential code, and all parallelism was exploited transparently to the application programmer; results are in Figure 3(d). The conventional (Java arrays) version of the application achieves only 26Mflops, compared to 120Mflops for the Fortran version. The single-processor Java version with the Array package (bar Array x 1) achieves 109Mflops. Moreover, when run on a multiprocessor, the performance of the Array package version scales with the number of processors (bars Array x 2, Array x 3, and Array x 4 for execution on two, three, and four processors, respectively), achieving almost 300Mflops on four processors.


Conclusion

There are no serious technical impediments to adopting Java as a major language for numerically intensive computing. The techniques we've developed and presented here are straightforward to implement and allow exploitation of existing compiler optimization techniques. Java itself has many features, including simpler pointers and flexibility in choosing object layouts, that facilitate these optimization techniques.

The impediments to Java's widespread adoption for numerically intensive computing are instead economic and social, including vendor unwillingness to commit the resources to developing product-quality compilers; application developer reluctance to make the transition to new languages for developing new codes; and the widespread view that Java is simply not suited for technical computing. The consequences of this situation include a large pool of programmers being underutilized and millions of lines of code being developed using languages inherently more difficult and less safe to use than Java. Maintaining these non-Java programs will be a burden on scientists and application developers for decades to come.

We hope the concepts and results presented here help overcome these impediments and accelerate acceptance of Java to the benefit of the general computing community.


References

1. Artigas, P., Gupta, M., Midkiff, S., and Moreira, J. Automatic loop transformations and parallelization for Java. In Proceedings of the 2000 International Conference on Supercomputing (Santa Fe, NM, May 8–11). ACM Press, New York, 2000, 1–10.

2. Boisvert, R., Dongarra, J., Pozo, R., Remington, K., and Stewart, G. Developing numerical libraries in Java. In Proceedings of the ACM 1998 Workshop on Java for High-Performance Network Computing (Stanford University, Palo Alto, CA, Feb. 28–Mar. 1). ACM Press, New York, 1998; see www.cs.ucsb.edu/conferences/java98.

3. Casanova, H., Dongarra, J., and Doolin, D. Java access to numerical libraries. In Proceedings of the ACM 1997 Workshop on Java for Science and Engineering Computation (Las Vegas, June 21). ACM Press, New York, 1997; see www.npac.syr.edu/users/gcf/03/javaforsc/acmspecissue/latestpapers.html.

4. Chatterjee, S., Jain, V., Lebeck, A., Mundhra, S., and Thottethodi, M. Nonlinear array layouts for hierarchical memory systems. In Proceedings of the 1999 International Conference on Supercomputing (Rhodes, Greece, June 20–25). ACM Press, New York, 1999, 444–453.

5. Dongarra, J., Duff, I., Sorensen, D., and van der Vorst, H. Solving Linear Systems on Vector and Shared Memory Computers. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1991.

6. Gosling, J., Joy, B., and Steele, G. The Java Language Specification. Addison-Wesley, Reading, MA, 1996.

7. Gustavson, F. Recursion leads to automatic variable blocking for dense linear algebra algorithms. IBM J. Res. Develop. 41, 6 (Nov. 1997), 737–755.

8. Moreira, J., Midkiff, S., Gupta, M., Artigas, P., Snir, M., and Lawrence, R. Java programming for high-performance numerical computing. IBM Syst. J. 39, 1 (2000), 21–56.

9. Sarkar, V. Automatic selection of high-order transformations in the IBM XL Fortran compilers. IBM J. Res. Develop. 41, 3 (May 1997), 233–264.

10. Serrano, M., Bordawekar, R., Midkiff, S., and Gupta, M. Quicksilver: A quasi-static compiler for Java. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA'00) (Minneapolis, MN, Oct. 15–20). ACM Press, New York, 2000, 66–82.

11. Seshadri, V. IBM high-performance compiler for Java. AIXpert Mag. (Sept. 1997); see www.developer.ibm.com/library/aixpert.

12. Wolfe, M. High-Performance Compilers for Parallel Computing. Addison-Wesley, Reading, MA, 2000.


Authors

José E. Moreira ([email protected]) is a research staff member and manager, Modular System Software, at the IBM T.J. Watson Research Center, Yorktown Heights, NY.

Samuel P. Midkiff ([email protected]) is a research staff member at the IBM T.J. Watson Research Center, Yorktown Heights, NY.

Manish Gupta ([email protected]) is a research staff member and manager, High-Performance and Cellular Programming Environments, at the IBM T.J. Watson Research Center, Yorktown Heights, NY.

Pedro V. Artigas ([email protected]) is a doctoral candidate in the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Peng Wu ([email protected]) is a research staff member at the IBM T.J. Watson Research Center, Yorktown Heights, NY.

George Almasi ([email protected]) is a postdoctoral fellow at the IBM T.J. Watson Research Center, Yorktown Heights, NY.


Figures

Figure 1. Examples of (a) an array of arrays simulating a 2D array; (b) an array of arrays in a more irregular structure; and (c) a rectangular 2D array.

Figure 2. Creation of safe and alias-free regions through automatic compiler versioning.

Figure 3. Performance results of applying our Java optimization techniques to various cases.



©2001 ACM  0002-0782/01/1000  $5.00
