# The Wonderful People Who Pay My Students

- 2017-2020, $386,000, Platform for Advanced Scientific Computing (PASC), ETHZ

This is a PASC Software Development Proposal for the development of Salvus, a spectral-element software package built from the ground up with state-of-the-art techniques and tools. By leveraging the PETSc library (https://www.mcs.anl.gov/petsc/) and its DMPlex finite-element API (Lange et al., 2015), Salvus has been capable of running in parallel from the start. At its core, PETSc provides a distributed vector and matrix API, with specific routines added to support finite-element methods, allowing us to write a fully functioning, parallelised wave propagation package for acoustic and elastic domains in less than 10'000 lines of code, with performance comparable to SPECFEM3D.

Here we propose to position Salvus as a top wavefield modelling and inversion package in the exascale era. To ensure performance of Salvus on today's and tomorrow's supercomputing platforms, our work will focus on cross-architecture developments, code and I/O optimisation, and systematic testing and validation. This will be complemented by actions to increase and broaden the usability and impact of Salvus. They include workflow developments, the implementation of frequency-domain solvers, and extensions of the physics that can be modelled.

- 2017-2018, $100,000, Partnership for Edge Physics Simulation, U.S. DOE Contract

Solver Development for SCREAM and EPSIProf. Knepley will aid Dr. Mark Adams in developing advanced nonlinear solver capability for plasma physics simulations. In addition, he will provide mesh management and discretization support in PETSc for LBNL's solver development for the SCREAM and EPSI projects. - 2015-2018, $240,000, DOE Applied Math Research, U.S. DOE Contract DE-AC02-06CH11357

Extending PETSc's Composable Hierarchically Nested Linear SolversThe changing landscape of both scientific application needs (more complex physical models and coupling of diverse models) and high-performance computing systems (many-core node architectures, hybrid combinations of GPU and conventional processors, and high node counts) requires continued innovation in mathematical algorithms for solving sparse linear systems and high-quality, scalable software implementations of these algorithms. To this end, we propose to study composable, hierarchical, nested solvers—that is, solvers composed of multiple levels of nested algorithms and data models to exploit architectural features and/or problem-specific structure.

We focus on five complementary subprojects: (1) algorithmic and software infrastructure for nested block methods; (2) scalable solvers for tree-based structured grid adaptive mesh refinement; (3) advanced multigrid methods for multiphysics applications; (4) eigenanalysis of composite linear solvers, allowing one to understand what portions of the solution space are being handled well by the solver and which require additional preconditioning; and (5) software for understanding and analyzing composable, hierarchical, nested solvers in relation to their constituent parts, as well as user composition of such solvers. An essential crosscutting aspect of this work is the development of flexible and extensible software infrastructure to support these hierarchical algorithms.

The robust and scalable linear solvers will benefit numerous PDE-based applications, including gyrokinetics, geodynamics, neutron transport, subsurface flow, ice sheet modeling, solid mechanics, and earthquake simulation. Thus, this work will have an immediate and powerful impact on a broad range of important Office of Science and other DOE work.

- 2015-2020, $262,655, NSF SI2-SSI: 1450339

Scalable Infrastructure for Enabling Multiscale and Multiphysics Applications in Fluid Dynamics, Solid Mechanics, and Fluid-Structure InteractionMany biological and biomedical systems involve the interaction of a flexible structure and a fluid. These systems range from the writhing and coiling of DNA, to the beating and pumping of cilia and flagella, to the flow of blood in the body, to the locomotion of fish, insects, and birds. This project aims to develop advanced software infrastructure for performing dynamic computer simulations of such biological and biomedical systems. To facilitate the deployment of this software in a range of scientific and engineering applications, this project will develop new software capabilities in concert with new computer models that use the software. Specific application domains to be advanced in this project include models of aquatic locomotion that can be used to understand the neural control of movement and ultimately to develop new treatments for neurological pathologies such as spinal cord injuries, and models that simulate the interaction between the electrophysiology of the heart and the contractions of the heart that pump blood throughout the body, which could lead to improved approaches to treating heart disease. The software to be developed within the project is freely available online and is used by a number of independent research groups in a variety of scientific and engineering domains. It is being actively used in projects that model different aspects of cardiovascular dynamics, such as platelet aggregation and the dynamics of natural and prosthetic heart valves, and in projects that study other biological problems, including cancer dynamics, insect flight, aquatic locomotion, and the dynamics of phytoplankton. The software is also being applied to non-biological problems, including nanoscale models of colloidal suspensions and models of active particles. The improved methods and software to be developed in this project will thereby have a broad and sustained impact on a large number of ongoing research efforts in the biological and biomedical sciences and other scientific and engineering disciplines.

The immersed boundary (IB) method is a broadly applicable framework for modeling and simulating fluid-structure interaction (FSI). The IB method was introduced to model the fluid dynamics of heart valves, and subsequent development initially focused on simulating cardiac fluid dynamics. This methodology is broadly useful, however, and has been applied to a variety of problems in which a fluid flow interacts with immersed structures, including elastic bodies, bodies with known or prescribed deformational kinematics, and rigid bodies. Extensions of the IB method have also been developed to model electrophysiological systems and systems with chemically active structures. To improve the efficiency of the IB method, the PI has developed adaptive versions of the IB method that employ structured adaptive mesh refinement (AMR) to deploy high spatial resolution only where needed. These methods have been implemented within the IBAMR software framework, which provides parallel implementations of the IB method and its extensions that leverage high-quality computational libraries including SAMRAI, PETSc, and libMesh. This project will further extend the IBAMR software by implementing modeling and discretization technologies required by the research applications of current and prospective users of the software, by developing improved solver infrastructure facilitated by the implementation of native support for structured AMR discretizations in the PETSc library, and by integrating with existing high-quality software tools for model development, deployment, and analysis. IBAMR is freely distributed online and is used within a number of independent research groups both to the further development of the IB method and also to its application to simulate diverse problems in fluid dynamics and FSI. By enhancing IBAMR, this project will also enhance the ability of these and other researchers to construct detailed models without requiring those researchers to develop the significant software infrastructure needed to perform such simulations. This project will also develop general-purpose support for AMR discretizations in PETSc, a software library with thousands of active users, ~400 downloads per month, and numerous applications. The work of this project will help to grow the IBAMR user community of students and researchers by developing UI tools for building models, running simulations, and analyzing results. Students will be actively engaged in all aspects of the project, including code, method, and model development.

- 2015-2017, $400,000, Intel Parallel Computing Center

Extending PETSc with Adaptive Mesh Refinement and Optimal Solvers, Applied to PFLOTRAN, and Optimized for Modern Intel ProcessorsWe will extend and optimize PETSc for structured adaptive mesh refinement (SAMR) discretizations with initial deployment to the premier open source, state-of-the-art massively parallel subsurface flow and reactive transport code – PFLOTRAN. This work is significant in that it radically changes algorithms, data access, and low level computational organization in order to maximize performance and scalability on modern Intel architectures, and encapsulates this knowledge in the PETSc libraries for the broadest possible impact.

PETSc will provide a new solver interface to SAMR, enabling the efficient representation of multi-scale phenomenon, while maintaining the simplicity of structured grid kernel computations. These kernel computations are amenable to the high number of vector lanes characteristic of current and future Intel architectures, and we will aggressively vectorize the existing code. Moreover, we propose to use the most asymptotically efficient solvers for equations of interests to PFLOTRAN such as Richards’ equations: matrix-free full approximation scheme (FAS) nonlinear full multigrid (FMG) methods. FMG has the potential to solve nonlinear systems of equations to truncation error accuracy with a few (5-20) work units or residual calculations. These formulations have a unique opportunity to leverage modern architectures to deliver fast, accurate, versatile solvers for subsurface flows and other PETSc applications.

- 2012-2015, $117,710, NSF SI2-SSE: 1147680

SPIKE --- An Implementation of a Recursive Divide-and-Conquer Parallel Strategy for Solving Large Systems of Linear EquationsDrs. Negrut, Sameh, and Knepley will investigate, produce, and maintain a methodology and its software implementation that leverage emerging heterogeneous hardware architectures to solve billion-unknowns linear systems in a robust, scalable, and efficient fashion. The two classes of problems targeted under this project are banded dense and sparse general linear systems.

This project is motivated by the observation that the task of solving a linear system is one of the most ubiquitous ingredients in the numerical solution of Applied Mathematics problems. It is relied upon for the implicit integration of Ordinary Differential Equation (ODE) and Differential Algebraic Equation (DAE) problems, in the numerical solution of Partial Differential Equation (PDE) problems, in interior point optimization methods, in least squares approximations, in solving eigenvalue problems, and in data analysis. In fact, the vast majority of nonlinear problems in Scientific Computing are solved iteratively by drawing on local linearizations of nonlinear operators and the solution of linear systems. Recent advances in (a) hardware architecture; i.e., the emergence of General Purpose Graphics Processing Unit (GP-GPU) cards, and (b) scalable solution algorithms, provide an opportunity to develop a new class of parallel algorithms, called SPIKE, which can robustly and efficiently solve very large linear systems of equations.

Drawing on its divide-and-conquer paradigm, SPIKE builds on several algorithmic primitives: matrix reordering strategies, dense linear algebra operations, sparse direct solvers, and Krylov subspace methods. It provides a scalable solution that can be deployed in a heterogeneous hardware ecosystem and has the potential to solve billion-unknown linear systems in the cloud or on tomorrow?s exascale supercomputers. Its high degree of scalability and improved efficiency stem from (i) optimized memory access pattern owing to an aggressive pre-processing stage that reduces a generic sparse matrix to a banded one through a novel reordering strategy; (ii) good exposure of coarse and fine grain parallelism owing to a recursive, divide-and-conquer solution strategy; (iii) efficient vectorization in evaluating the coupling terms in the divide-and-conquer stage owing to a CPU+GPU heterogeneous computing approach; and (iv) algorithmic polymorphism, given that SPIKE can serve both as a direct solver or an effective preconditioner in an iterative Krylov-type method.

In Engineering, SPIKE will provide the Computer Aided Engineering (CAE) community with a key component; i.e., fast solution of linear systems, required by the analysis of complex problems through computer simulation. Examples of applications that would benefit from this technology are Structural Mechanics problems (Finite Element Analysis in car crash simulation), Computational Fluid Dynamics problems (solving Navier-Stokes equations in the simulation of turbulent flow around a wing profile), and Computational Multibody Dynamics problems (solving Newton-Euler equations in large granular dynamics problems).

SPIKE will also be interfaced to the Portable, Extensible Toolkit for Scientific Computation (PETSc), a two decades old flexible and scalable framework for solving Science and Engineering problems on supercomputers. Through PETSc, SPIKE will be made available to a High Performance Computing user community with more than 20,000 members worldwide. PETSc users will be able to run SPIKE without any modifications on vastly different supercomputer architectures such as the IBM BlueGene/P and BlueGene/Q, or the Cray XT5. SPIKE will thus run scalably on the largest machines in the world and will be tuned for very different network and hardware topologies while maintaining a simple code base.

The experience collected and lessons learned in this project will augment a graduate level class, ?High Performance Computing for Engineering Applications? taught at the University of Wisconsin-Madison. A SPIKE tutorial and research outcomes will be presented each year at the International Conference for High Performance Computing, Networking, Storage and Analysis. A one day High Performance Computing Boot Camp will be organized each year in conjunction with the American Society of Mechanical Engineers (ASME) conference and used to disseminate the software outcomes of this effort. Finally, this project will shape the research agendas of two graduate students working on advanced degrees in Computational Science.

- 2012-2015, $240,000, DOE Applied Math Research, U.S. DOE Contract DE-AC02-06CH11357

Extending PETSc's Composable Hierarchically Nested Linear SolversWe propose to focus on the following four complementary subprojects, which have the common theme of addressing issues in scalability and robustness through hierarchical algorithms:

- Scalable preconditioners for Stokes and Karush-Kuhn-Tucker (KKT) systems;
- Unstructured geometric-algebraic multigrid methods;
- Hierarchical Krylov methods designed to scale Krylov solvers up to million-core simulations; and
- Interactive eigenanalysis of composite linear solvers, allowing one to understand what portions of the solution space are being handled well by the solver and which require additional preconditioning procedures.

- 2010-2015, $650,000, Subcontract, NSF EAR-0949446

The Computational Infrastructure for GeodynamicsCIG is fully described at http://www.geodynamics.org.

- 2009-2014, $550,000, DOE Math-CS Institute, U.S. DOE Contract DE-AC02-06CH11357

Nonlinear Algorithms to Circumvent the Memory Bandwidth Limitations of Implicit PDE SimulationsSparse matrix computations, which form the computational kernel of many DOE simulation codes, are severely memory bandwidth limited. This bottleneck holds true for single core systems, multicore systems, the new general-purpose graphics processing systems (GP GPUs), and the slightly more exotic IBM Roadrunner Cell processor. Under the best of circumstances one cannot achieve more than 15 to 20 percent of the processors' performance in sparse matrix computations because the processors are waiting on memory loads up to 85 percent of the time. With current architectural trends this situation is unlikely to change; if anything, it will get worse.

Sparse matrix computations arise because the paradigm for solving nonlinear partial differential equations (PDEs) is to iteratively perform a global linearization and then solve a sparse matrix linear problem (that is, Newton's method). The linearization involves computing a sparse matrix approximation to a nonlinear problem, which is (because of its large size) stored in main memory. The linear solution process requires constantly moving the sparse matrix entries from main memory into the CPU, where a small number of computations are performed using those values, before more values need to be loaded, thus allowing no register or cache reuse of the sparse matrix entries.

To better utilize computer systems, we need to

**eliminate**the "global linearize, then linear solve" paradigm and replace it with a "partially linearize, then solve" paradigm that acknowledges the memory hierarchy and moves the best possible CPU utilization, from 15% to over 75%, resulting in run times that are one-fifth as long. To achieve this transformation will require the joint efforts of mathematicians and computer scientists. The full approximation scheme (FAS), also called nonlinear multigrid, provides an algorithmic framework for the "partially linearize, then solve" approach and will form the starting point for our work. Loosely speaking, we propose a transition from Newton-multigrid algorithms to multigrid-Newton.**This project will develop the mathematical algorithms, analysis, and computer science techniques (including parsing, code generation, code optimization, and automatic differentiation) to move the "partially linearize, then solve" paradigm from concept to practice.**This research will result in PDE-based simulations running**up to five times faster**than traditional linear algebra approaches on the next generation of computer systems. Preliminary work indicates the "partially linearize, then solve" approach has great potential. The two key research questions we will answer under this work are as follows:- How will we maintain roughly the same convergence rates with the nonlinear schemes as with Newton-multigrid? Obviously if the convergence rates decay so badly that we end up doing many more floating-point operations, then the higher utilization is not a good measure of performance.
- How will we provide a software system that end users will accept and use? This likely means not requiring a radical shifting of the way application scientists develop code. But it may involve a radical shift in the way we develop ``libraries'' for implicit methods for PDEs.

- 2010-2012, $45,000, Subcontract, Army Research Office W911NF-09-1-0488

Classical Density Functional Theory of Fluids: Ions at a Dielectric InterfaceThe overall (three-year) goal of the proposed work is to develop a classical density functional theory (DFT) of ions and charged polymers near dielectric interfaces in a three-dimensional system. Over the 3 years of the grant, we have had great success and some setbacks, as might be expected for any research project. These include:

- The biggest success was publication of the paper describing the general theory in Journal of Physical Chemistry Letters. This is the big theory paper to come out of this project and represents the most important goal of the proposed work (i.e., constructing the functional itself). In the year since publication, it has already been cited 3 times, indicating an impact of the work.
- Numerical implementation of an efficient algorithm to solve the dielectric DFT equations is still being implemented and work on this front will continue beyond the completion of the grant. This is the biggest setback of the project. The finding of an efficient algorithm to numerically solve the dielectric DFT equations was much more difficult than aniticipated. The root of the problem is that the reaction field is not a radial interaction between ions. Because of this, fast Fourier transforms (FFTs) can no longer be used and one must be clever in making the program scale with \(N \log N\) rather than \(N^2\) where \(N\) is the number of grid points to be solved.
- On the other hand, one great numerical success was in implementing fast computational methods for the reaction in 3D. Matt Knepley at the Computational Institute of the University of Chicago, through a subcontract of this grant, has implemented a Fast Multipole Method (FMM) and published a paper on this topic.
- An exciting offshoot of the DFT work was a collaboration with Dezso Boda in Hungary that resulted an alternative method (not using DFT at all) of ions at dielectric interfaces that we developed. The specific goal of the proposed work is to develop a DFT of ions at a dielectric interface and this has been done (as just mentioned). The broader goal of the proposed to work is develop methods that any scientist can use when working with ions at a dielectric interface, especially in three dimensional geometries. We have started to develop another method with Dr. Boda, a long-time collaborator, called the Local Equilibrium Grand Canonical Monte Carlo (LE-GCMC). Because we have already developed a fast method to include dielectrics in Monte Carlo, this method will allow for computing ion currents (currently at the drift-diffusion level, but not limited to that) in complex 3D geometries that will be very challenging with DFT. The grant sponsored Dr. Bodaís one month visit to the PIís institution to work on this. It is important to note that this work is in addition to the DFT development; LE-GCMC came out of discussions that Dr. Boda and the PI had in discussing the dielectric DFT. Moreover, Dr. Boda's visit helped greatly in moving toward an efficient algorithm to numerically solve the dielectric DFT equations.
- Another offshoot of this work is the hope of coupling the new dielectric DFT functional to experiments. This work, which will continue beyond the completion of the grant, is with experimentalists working on nanoáuidic devices. This resulted in 2 papers (a third is in preparation), including one in the prestigeous Nano Letters.

The final report is available here.

- 2001-2011, $1,200,000, DOE SciDAC ISIC

Towards Optimal Petascale SimulationsTOPS is fully described at http://www.scidac.gov/math/TOPS.html.

- 2009-2011, $90,000, NSF Grant OCI-0850680

Mechanical Transformation of Knowledge to LibrariesThe proposed project approaches the problem of library development by a fundamentally different approach -- one in which the knowledge abstractions at an appropriate level are represented and then this knowledge is converted into implementations optimized for different target platforms and circumstances. The primary intellectual merit here is the automation of high quality numerical code generation.

The broader impact is that abstraction driven automation for library developers can significantly cut down the "software gap", delay between development of new architectures and new software that maximizes its use. The impact of this will be broad across all disciplines that use HPC.

- 2005-2010, $650,000, Subcontract, NSF EAR-0426271

The Computational Infrastructure for GeodynamicsCIG is fully described at http://www.geodynamics.org.

- 2005-2007, $210,000, Subcontract, DOE Reactor Core Modeling