GPU Computing

The combination of low power consumption, large memory bandwidth, high flop rate, and low cost makes GPUs a promising platform for desktop scientific computing. The large latency is problematic for strong scaling, but many candidate computations suit this type of throughput device, especially since interconnect latency looks set to drop rapidly. Moreover, a large fraction of future supercomputers will likely also employ accelerators (as three of the top ten machines in the world already do). Over the last four decades of scientific computing, the most successful strategy for accommodating new hardware has been the production of high-quality libraries, such as MPI and PETSc. We propose to develop not only parallel algebraic solvers, but also discretization and multiphysics libraries, suitable for hybrid machines with significant accelerator components.

Our strategy is to use a combination of low-level library routines and code generation techniques to produce implementations of our higher-level API. We use the Thrust, Cusp, and CUSPARSE libraries from NVIDIA, together with the ViennaCL library, to provide linear algebra and more basic operations such as sorting and keyed reductions. For operations that are more variable, such as those that depend on the weak form, we use code generation and dynamic compilation, as provided by OpenCL. The key point is to circumscribe the domain so that the generation process becomes tractable. In generating FEM integration routines, we produce not only different block sizes, but also different organizations of memory writes and computation.
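
As a concrete, hedged illustration of the library side of this strategy, the sketch below uses Thrust to sum per-element contributions keyed by a global degree-of-freedom index, i.e. the sort plus keyed reduction mentioned above; the array contents and sizes are invented for the example.

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>

    int main() {
      // Hypothetical element contributions, each tagged with a global dof index.
      int    h_keys[] = {2, 0, 1, 2, 0, 1};
      double h_vals[] = {0.5, 1.0, 0.25, 1.5, 2.0, 0.75};
      thrust::device_vector<int>    keys(h_keys, h_keys + 6);
      thrust::device_vector<double> vals(h_vals, h_vals + 6);

      // Sort contributions by dof index, then reduce each run of equal keys,
      // yielding one summed value per global dof (3 unique dofs here).
      thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
      thrust::device_vector<int>    outKeys(3);
      thrust::device_vector<double> outVals(3);
      thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                            outKeys.begin(), outVals.begin());
      return 0;
    }

This sort-then-reduce pattern lets assembly-like scatters run on a throughput device without atomic updates.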

With Andy Terrel and Karl Rupp, we are exploring code transformations that will let us optimize more general calculations automatically. In particular, we are reorganizing the traversal and using code transformation tools to explore quadrature-based methods for FEM residual and Jacobian evaluation. In our upcoming paper, we show that excellent vectorization is possible with CUDA and OpenCL, even for forms with variable coefficients.
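
The sketch below illustrates the kernel organization we have in mind for such quadrature-based evaluation; it is a minimal example rather than generated code, geometric factors are omitted, and the sizes NQ and NB and the pointwise nonlinearity f0(u) = u^2 are assumptions made for the illustration. Threads first act as quadrature points, staging pointwise values in shared memory, and then switch roles to act as basis functions, so each output entry is accumulated by a single thread and no inter-thread reduction is needed.

    #define NQ 4   /* quadrature points per cell (assumed) */
    #define NB 4   /* basis functions per cell (assumed)   */

    /* Element residual F_i = sum_q w_q B[q][i] f0(u_q), with one thread block
       per cell and blockDim.x >= max(NQ, NB). */
    __global__ void residualKernel(const double *coef, /* [cells][NB] element coefficients */
                                   const double *B,    /* [NQ][NB]    basis tabulation     */
                                   const double *w,    /* [NQ]        quadrature weights   */
                                   double *elemVec)    /* [cells][NB] element residuals    */
    {
      __shared__ double fq[NQ];     /* f0(u) at each quadrature point */
      const int cell = blockIdx.x;
      const int tid  = threadIdx.x;

      /* Phase 1: threads play the role of quadrature points. */
      if (tid < NQ) {
        double uq = 0.0;
        for (int b = 0; b < NB; ++b) uq += B[tid*NB + b]*coef[cell*NB + b];
        fq[tid] = uq*uq;            /* assumed pointwise term f0(u) = u^2 */
      }
      __syncthreads();

      /* Phase 2: threads switch roles and play basis functions; each thread
         owns one output entry, so no reduction among threads is required. */
      if (tid < NB) {
        double sum = 0.0;
        for (int q = 0; q < NQ; ++q) sum += w[q]*B[q*NB + tid]*fq[q];
        elemVec[cell*NB + tid] = sum;
      }
    }

Launching one block per cell with max(NQ, NB) threads evaluates all element residuals concurrently; the generated kernels described above additionally vary the block size and the organization of memory writes.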

Papers

  • Our upcoming paper with Andy Terrel and Karl Rupp uses a pattern we call thread transposition, which lets us evaluate integrals efficiently using quadrature without reductions among threads, while vectorizing simultaneously over quadrature points and basis functions. We apply this to FEM residual and Jacobian evaluation.
  • This paper in TOMS on integration of weak forms over affine mesh geometries uses exact integration rather than quadrature, based on the geometric-analytic splitting first developed in this paper (the form of the splitting is sketched after this list). We focus on low-order discretizations, which currently lack efficient integration strategies, and show results for the P1 Laplacian in 3D and the elasticity operator in 2D, achieving more than 100 GF.
  • A description of the GPU implementation in PETSc was written with Victor Minden and Barry Smith. PETSc is organized as a class library with classes for vectors, matrices, Krylov methods, preconditioners, nonlinear solvers, and differential equation integrators. A new subclass of the vector class performs its operations on NVIDIA GPUs, and a new sparse matrix subclass performs matrix-vector products on the GPU. The Krylov methods, nonlinear solvers, and integrators in PETSc run unchanged in parallel using these new subclasses, and they can be used transparently from existing PETSc application codes in C, C++, Fortran, or Python (a minimal usage sketch appears after this list). The implementation uses the Thrust and Cusp C++ packages from NVIDIA.
  • In this paper, we developed an efficient method to calculate a high-dimensional (6D) contraction with substantial memory reuse. This kind of operation arises in a nonlinear averaging step used in classical density functional theory (DFT) for biological systems; in particular, it calculates the reference density in the RFD method of Dirk Gillespie.
  • Larry Zheng developed a structured multigrid code that solves the variable-viscosity Stokes equations using the discretization of Taras Gerya.
  • This paper uses the FMM to solve a boundary element problem arising in bioelectrostatics on 512 GPUs of the DEGIMA cluster at Nagasaki. The BIBEE approximation is used to accelerate the BEM solve. The applications demonstrated include the electrostatics of protein-drug binding and several multi-million atom systems consisting of hundreds to thousands of copies of lysozyme molecules. The parallel scalability of the software was studied on 128 nodes, each with 4 GPUs. Careful tuning resulted in strong scaling with a parallel efficiency of 0.5 on 512 GPUs. The largest run, with over 20 million atoms and one billion unknowns, required only one minute on 512 GPUs.
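
For reference, the geometric-analytic splitting mentioned above takes the following standard form for the Laplacian on an affine element T with Jacobian J (a sketch of the notation rather than a quotation from the paper):

    \int_T \nabla\phi_i \cdot \nabla\phi_j \, dx
      = |\det J| \, (J^{-1} J^{-T})_{\mu\nu}
        \int_{\hat T} \frac{\partial \hat\phi_i}{\partial \hat x_\mu}
                      \frac{\partial \hat\phi_j}{\partial \hat x_\nu} \, d\hat x
      = G_{\mu\nu}(T) \, K^{\mu\nu}_{ij}

The analytic tensor K^{\mu\nu}_{ij} is computed once on the reference element, while only the small geometric tensor G_{\mu\nu}(T) varies from element to element, which is what makes exact integration attractive for low-order discretizations on the GPU.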
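
Finally, a minimal sketch of what using the PETSc GPU classes looks like from an application code; the Cusp-era type name VECCUSP is assumed here (newer PETSc releases use CUDA-based types), and the same choice can be made at runtime with options such as -vec_type cusp without touching the source.

    #include <petscvec.h>

    int main(int argc, char **argv)
    {
      Vec            x;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
      ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
      ierr = VecSetSizes(x, PETSC_DECIDE, 100);CHKERRQ(ierr);
      /* Select the GPU vector implementation; all subsequent vector
         operations, and the solvers built on them, then run on the GPU. */
      ierr = VecSetType(x, VECCUSP);CHKERRQ(ierr);
      ierr = VecSet(x, 1.0);CHKERRQ(ierr);
      ierr = VecDestroy(&x);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }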