Analytical Cost Metrics: Days of Future Past

As we move towards the exascale era, new architectures must be capable of running massive computational problems efficiently. Scientists and researchers continuously invest in tuning the performance of extreme-scale computational problems. These problems arise in almost all areas of computing, ranging from big data analytics, artificial intelligence, search, machine learning, virtual/augmented reality, computer vision, and image/signal processing to computational science and bioinformatics. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has now expanded to also incorporate power/energy efficiency. Therefore, the major challenge we face in computing systems research is: "How do we solve massive-scale computational problems in the most time/power/energy efficient manner?" Architectures are constantly evolving, making current performance-optimization strategies less applicable and requiring new strategies to be invented. The solution is for new architectures, new programming models, and applications to advance together. Doing this is, however, extremely hard: there are too many design choices in too many dimensions. We propose the following strategy to solve the problem: (i) Models: develop accurate analytical models (e.g., execution time, energy, silicon area) to predict the cost of executing a given program, and (ii) Complete System Design: simultaneously optimize all the cost models for the programs (computational problems) to obtain the most time/area/power/energy efficient solution. Such an optimization problem evokes the notion of codesign.

The key element of our approach is to exploit multiple forms of domain specificity [1]. First, we tackle a specific family of computations that is nevertheless very important in many embedded systems. This class of computations, called dense stencils, includes the compute-intensive parts of many applications such as computational fluid dynamics, neural networks, medical imaging, smart cameras, image processing kernels, simulation of physical systems relevant to realistic visualization, as well as the solution of partial differential equations (PDEs) that arise in many cyber-physical systems such as automobile control and avionics.
Second, we target NVIDIA GPUs, which are vector-parallel programmable accelerators. Such components are now becoming the de facto standard in most embedded platforms and MPSoCs, since they provide lightweight parallelism and energy/power efficiency. We further argue that they will become ubiquitous for the following reason: any device on the market today that has a screen (essentially, any device, period) has to render images, and GPUs are the natural platforms for this processing (for speed and efficiency). So all systems will have an accelerator by default. If the system then needs any additional dense stencil computations, the natural place to perform them in the most speed/power/energy efficient manner is the accelerator.
The third element of domain specificity is that we exploit a formalism called the polyhedral model as the tool to map dense stencil computations to GPU accelerators. Developed over the past thirty years [13][14][15][16][17], it has matured into a powerful technology, now incorporated into gcc, llvm, and commercial compilers [Rstream, IBM]. Tools targeting GPUs are also available [18,19].
Thus, we formulate the domain-specific optimization problem: simultaneously optimize compilation and hardware/architectural parameters to compile stencil computations to GPUs.
Previously, we presented [20] an approach to solve the above problem as follows:

Develop Models
(a) Time Model [21] We show that the elements of domain specificity can be combined to develop simple, analytical (as well as accurate) models for the execution time of tiled stencil codes on GPUs, and that these models can be used to solve the optimal tile size selection problem. Our model was able to predict tile sizes that achieve 30% of theoretical machine peak on the NVIDIA Maxwell GTX 980 and Titan X.
(b) Area Model [20] We develop a simple, analytical model for the silicon area of programmable accelerator architectures, and calibrate it using NVIDIA Maxwell-class GPUs. Our model proved to be accurate to within 2% when validated.
(c) Energy Model [22] We also developed energy models, as explicit analytic functions of a set of compiler and hardware parameters, that predict energy consumption by analyzing the source code. We used these energy models to obtain optimal solutions to the tile size selection problem. In [20], we combine the proposed execution time model [21] and the area model with a workload characterization of stencil codes to formulate a mathematical optimization problem that minimizes a common objective function of all the hardware and compiler parameters. We propose a set of Pareto-optimal designs, representing optimal combinations of the parameters, that would allow up to 126% improvement in performance (GFlops/sec).
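To illustrate the Pareto step, the sketch below filters non-dominated (area, time) design points; the design names and numbers are invented for illustration, not points from our actual study:

```python
def pareto_front(designs):
    """Keep the (name, area, time) points that are not dominated:
    no other design is at least as good in both silicon area and
    execution time and strictly better in at least one."""
    front = []
    for name, area, time in designs:
        dominated = any(
            a <= area and t <= time and (a < area or t < time)
            for _, a, t in designs
        )
        if not dominated:
            front.append((name, area, time))
    return front
```

For example, a design that is both larger and slower than some other design is filtered out; the remaining points form the trade-off curve a designer chooses from.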

Codesign
Despite domain specificity, the problem remains difficult. Even when done by hand for a single target architecture and an application kernel, it is more art than science. Although smart designers and implementers have worked for many decades on such problems for the "application/architecture du jour," each one was usually a point solution [23][24][25]. Designers invested blood, sweat, and tears to find the best implementation, used it to solve their problem of interest, usually published a paper explaining the design, and moved on. Their invested effort, particularly the trade-offs they made and the lessons they learned, is lost: future designers are left to reinvent wheels.
The high-level objective is to optimize stencil codes while tuning the hardware accelerator (GPUs), developing a complete ecosystem. The goal is to map stencils to hardware accelerators automatically and provably optimally, using time and/or energy as the objective function.
The idea is to obtain provably optimal mappings through rigorous mathematical optimizations.
The proposed approach can have the following benefits.
• Automation with Optimality: the most time/power/energy efficient implementations can be derived, reducing programmers' effort. Compilation tools can be used to guide the optimal choice of transformations which will, in turn, optimize the performance of the workloads such as deep learning, image rendering, cyber-physical systems, autonomous vehicle systems, etc.
• Future proofing: Porting applications to new GPU architectures will require less effort.
Instead of a redesign of each program, our methods can be used to develop new parallelization strategies and transformations, refine/redefine objective functions and constraints, and re-target the compiler. This one-time effort can then be amortized over many application kernels.
• Codesign: By casting hardware/architectural parameters as variables in the mathematical optimization framework, we can solve for their optimal values. This will enable us to systematically explore alternate GPU architectures and simultaneously tune compilation parameters. Such a codesign approach will help speed up the research work and the chip design process. The cost models can be used to quickly recognize the performance sinks and help identify the design flaws in its early stages saving billions of dollars.
Generalization

Future exascale high-performance computing (HPC) systems are expected to be increasingly heterogeneous, consisting of several multi-core CPUs and a large number of accelerators, special-purpose hardware that will increase the computing power of the system in a very energy-efficient way [5]. Consequently, highly specialized coprocessors will become much more common in exascale than in the current HPC systems. Such specialized, energy-efficient accelerators are also an important component in many diverse systems beyond HPC: gaming machines, general purpose workstations, tablets, phones and other media devices. In order to attain exascale level performance, accelerators have to become even more energy-efficient, and experts anticipate that a large part of this must come through increased specialization [1].
Our approach can be used to solve the problem of system codesign by applying the proposed accelerator codesign techniques to all classes of programs, optimizing for all the parameters simultaneously. We provide a proof of concept of our approach, which is a stepping stone towards solving the larger problem of transforming GPUs into accelerators for HPC systems. HPC developers and tool builders have long used certain abstractions to improve the performance of applications. Figure 1 illustrates the typical cycle: for every application, a profiler is used to gather performance data, which is analyzed to derive a performance model. The performance model is used to predict performance and select an optimization strategy. The optimal transformation strategy is then applied to generate code, and this cycle repeats until satisfactory performance is obtained.
1. The main novelty of our work comes as a consequence of some of the exascale challenges [1]. For exascale system design, various architectures, programs, and transformation strategies must be explored simultaneously in order to find the optimum. We add performance models to this design space and provide a unified view of the optimization space. Figure 1.3 shows this view (more details in Section 2.2).

2. The above design space, however, is very large, has too many parameters, and is too complicated for precise models to be developed. Therefore, we exploit domain specificity and identify regions where optimization across multiple axes becomes possible.
We show how analytical cost models can be used to optimize the performance of domain-specific programs using transformation strategies for a given architecture in Chapter 2. The rest of this document is organized as follows: Chapter 2 explains our proposed approach. Chapters 3, 4, and 5 discuss the work that has been accomplished. Bottleneck analysis is explained in detail in Chapter 6. Chapter 7 discusses the relevant literature. Finally, Chapter 8 concludes the work.

Domain Specificity
As we move to address the challenges of exascale computing, one approach that has shown promise is domain specificity: the adaptation of application, compilation, parallelization, and optimization strategies to narrower classes of domains. An important representative of such a domain is Stencil Computations, a class of typically compute-bound parts of many applications such as partial differential equation (PDE) solvers, numerical simulations in domains like oceanography, aerospace, climate and weather modeling, computational physics, materials modeling, simulations of fluids, and signal and image-processing algorithms. One of the thirteen Berkeley dwarfs/motifs [26] is "structured mesh computations," which are nothing but stencils.
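To make the term concrete, a dense stencil updates every grid point using a fixed pattern of neighboring points from the previous time step. A minimal, purely illustrative Jacobi-style 2D kernel:

```python
def jacobi2d_step(grid):
    """One time step of a Jacobi-style 2D stencil: each interior point
    is replaced by the average of its four neighbors from the previous
    step (boundary points are held fixed). Illustrative only."""
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so all reads see the old step
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return new
```

Iterating this update over many time steps produces exactly the uniform-dependence loop nests that tiling and the polyhedral tools discussed below optimize.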
Many dynamic programming algorithms also exhibit a similar dependence pattern. The importance of stencils has been noted by a number of researchers, indicated by the recent surge of research projects and publications on this topic, ranging from optimization methods for implementing such computations on a range of target architectures, to Domain Specific Languages (DSLs) and compilation systems for stencils [27][28][29][30][31][32][33][34][35][36][37][38][39]. Workshops and conferences devoted exclusively to stencil acceleration have recently emerged.
A second aspect of domain specificity is reflected in the emergence of specialized architectures, called accelerators, for executing the compute-intensive parts of many computations. They include GPGPUs (general-purpose computing on graphics processing units) and other coprocessors (Intel Xeon Phi, Knights Landing, etc.). Initially they were "special purpose," limited to highly optimized image rendering libraries occurring in graphics processing. Later, researchers realized that these processors could be used for more general computations, and, eventually, the emergence of tools like CUDA and OpenCL enabled general purpose parallel programming on these platforms.
Exploiting the specificity of the applications and the specificity of target architectures leads to domain-specific tools that map very high level program specifications to highly tuned and optimized implementations on the target architecture. Many such tools exist, both academic research prototypes and production systems.
As indicated earlier, our domain specificity comes in multiple flavors. First, we investigate only stencil computations. They belong to a class of programs called uniform dependence computations, which are themselves a proper subset of "affine loop programs." Such programs can be analyzed and parallelized using a powerful methodology called the polyhedral model [13][14][15][16][17][48][49][50], and many tools are widely available, e.g., PPCG, developed by the group at ENS, Paris [51]. Second, we tackle a specific target platform, namely a single GPU accelerator. PPCG includes a module that targets GPUs and incorporates a sophisticated code generator developed by Grosser et al. [18] that employs a state-of-the-art tiling strategy called hybrid hexagonal-classic tiling. An open source compiler implementing this strategy is also available; we refer to it henceforth as the HHC compiler.

Comparison with Polyhedral Methods
The landscape described (in Figure 1.3) allows us to place our work in context. Although our methods are domain specific, considering an extreme point of the landscape, with CPUs as the architecture and the full set of polyhedral programs, allows us to compare with conventional compilation.
The optimization problem a compiler "solves" is: pick transformation parameters so as to optimize the program property of interest, typically execution time. Since it has a single (or a small handful of) predetermined strategies, it solves only a limited kind of mathematical optimization problem.
The objective function is a surrogate for execution time. Now consider PLuTO [50], a state-of-the-art polyhedral compiler based on a mathematical representation of both programs and transformations. Because it considers only polyhedral programs and transformations, its optimization problems are rigorous. By default, PLuTO targets multicore CPUs, and uses a transformation strategy that combines one level of tiling, loop fusion, and (loop/wavefront) parallelization of tiles. It solves a mathematical optimization problem where the schedule parameters (coefficients of tiling and schedule hyperplanes) are the unknown variables, and the cost function is the number of linearly independent tiling hyperplanes, combined with a quantitative measure of the length of the inter-tile dependences. This is, again, a surrogate for the total execution time, and leads to solutions that, while reasonable, are not provably optimal.
Moreover, parameters like tile sizes, vectorization and inter-tile schedule are chosen using simple heuristics, and are not part of the optimization.

Limitations of current domain-specific compilation
Consider how polyhedral compilation has recently evolved. Bondhugula et al. [52] proposed an extension of PLuTO for periodic stencil computations, and Bandishti et al. [53] developed another extension to allow concurrent start. Since the objective functions in these strategies are all surrogates for the execution time, there is no way to compare across strategies; the authors leave the choice of strategy to the user, via compiler flags. Recently it was shown in [54], both quantitatively and empirically, that while concurrent start may be faster for iteration spaces with a certain aspect ratio of the program size parameters, the best performance for the same program with a different aspect ratio is provided by the basic PLuTO algorithm.
As another example, Grosser et al. [18] proposed a novel combination of hexagonal and classic tiling for stencil programs on GPUs. They demonstrated, only empirically, performance gains compared to previous strategies, but did not quantitatively explore the cases where HHC was better.
As a consequence, polyhedral compilation remains difficult. Every time a new strategy is developed, the authors publish a paper, and empirically show that their results are better than previous ones. They usually do not provide a quantitative, analytical comparison, thereby preventing a better, collective understanding of how to solve the bigger, global problem. Our intention is not to criticize the field: the problems today are difficult enough that significant effort is needed for even developing such "point solutions." Our approach is a step towards addressing these limitations.

Design Landscape
To place our work into context, to precisely formulate the problems we address, and to describe the approach we take to solve them, we show the design landscape of domain-specific optimization problems. It has six dimensions, organized into three planes (Fig. 2.1 (a), (b), and (c)). Each plane has two axes: instances and features. The feature axis may be hierarchical and parameterized. The program plane (Fig. 2.1(a)) consists of instances of dense stencil computations, such as a Convolutional Neural Net (a machine learning kernel), Heat-3D (a stencil computation from computational science), and Clustal-3D (a dynamic programming kernel from bioinformatics). Because of domain specificity, each program is compactly described with a small set of features, such as: (i) a set of iteration spaces, (ii) a set of data spaces, (iii) a set of dependences, (iv) a set of computational operators (e.g., loop bodies), and (v) one or more size parameters.
The transformation plane (Fig. 2.1(b)) defines the space of compiler transformations that can be applied to the program. Domain specificity again allows us to consider only a few instances, e.g., time skewing [55,56], diamond tiling [53], diamond prisms [36,57], or hybrid hexagonal-classic (HHC) tiling [18,58,59]. Transformation strategies are (potentially) hierarchical, and each level of the hierarchy represents a partitioning. They are also specified by a set of features, each of which is a mathematical function: (i) tile shape, specified by the so-called "tiling hyperplanes," (ii) tile schedule, (iii) processor mapping, specifying which (virtual/physical) processor in the hardware hierarchy will execute a tile, and (iv) memory allocation, specifying where its inputs and outputs are stored. Note that the schedule usually also has components to specify when tile inputs are read and when tile outputs are written. The transformation plane features are also parameterized: mapping function coefficients, tile sizes, etc., are viewed as parameters.
The architecture plane (Fig. 2.1(c)) is specified analogously. Together, the three planes allow us to formulate multi-metric optimization (e.g., simultaneously optimizing for both time and energy) as well as HW/SW codesign.
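Schematically, one point in this landscape combines coordinates from all three planes. A minimal encoding is sketched below; the field names are hypothetical illustrations, not our actual representation:

```python
from dataclasses import dataclass

@dataclass
class ProgramPoint:          # program plane: instance plus features
    name: str                # e.g. "Heat-3D"
    size_params: dict        # e.g. {"N": 4096, "T": 100}

@dataclass
class StrategyPoint:         # transformation plane
    name: str                # e.g. "hybrid hexagonal-classic"
    tile_sizes: tuple        # parameterized feature; hyperplane
                             # coefficients, schedules, and mappings
                             # would be further fields in a full encoding

@dataclass
class ArchitecturePoint:     # architecture plane (fields hypothetical)
    num_sms: int
    shared_mem_bytes: int

@dataclass
class DesignPoint:           # one point in the six-dimensional landscape
    P: ProgramPoint
    S: StrategyPoint
    A: ArchitecturePoint
```

A cost model is then a function of a DesignPoint, which is what makes the joint optimizations of the following sections expressible.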

Approach
We now describe our overall approach and how it can lead to the benefits mentioned earlier (i.e., automatic optimal mappings, future proofing, and codesign). First, we develop analytical models for execution time and energy for a given program and a transformation strategy on a fixed architecture. We also develop silicon area models for GPU architectures and show their use in chip area prediction. Second, we show how these models can be used for performance optimization.
And finally, we show how to formulate mathematical optimization problems using such cost models to solve the problem of software-hardware codesign. We show our initial results to justify our claims, and identify remaining challenges in later chapters.

Models and Validation

Execution Time Model and Prediction of GPGPU Stencils
We develop an execution time model that predicts execution time of transformed stencil codes.
Our model is a simple set of analytical functions that predict the execution time of the generated code. It is deliberately optimistic, since we are targeting modeling and parameter selections yielding highly optimized codes. We experimentally validate the model on a number of 2D and 3D stencil codes, and show that the root mean square error in the execution time is less than 10% for the subset of the codes that achieve performance within 20% of the best.
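To give a flavor of such a model, the sketch below is a roofline-style simplification with invented constants, not the calibrated model itself: per-tile cost is the maximum of a compute term and a memory term (with a one-point halo), and tiles execute in waves across the streaming multiprocessors:

```python
# Deliberately optimistic sketch of a tiled-stencil execution-time model.
# All constants (flop rates, bandwidth, SM count) are illustrative
# placeholders, not measured machine parameters.

def tile_time(tx, ty, flops_per_point=8.0, peak_flops=4.6e12,
              bytes_per_point=8.0, bandwidth=224e9, overhead=1e-6):
    """Optimistic time for one tile: max of compute and memory cost,
    where memory traffic includes a one-point halo, plus a fixed
    per-tile launch/synchronization overhead."""
    compute = tx * ty * flops_per_point / peak_flops
    memory = (tx + 2) * (ty + 2) * bytes_per_point / bandwidth
    return max(compute, memory) + overhead

def total_time(nx, ny, nt, tx, ty, num_sms=16):
    """Tiles run in waves across the SMs; total time = waves * tile time."""
    tiles = -(-nx // tx) * -(-ny // ty) * nt      # ceiling divisions
    waves = -(-tiles // num_sms)
    return waves * tile_time(tx, ty)

def best_tile(nx, ny, nt, candidates, smem_bytes=48 * 1024):
    """Pick the modeled-fastest tile size that fits in shared memory."""
    feasible = [(tx, ty) for tx, ty in candidates
                if (tx + 2) * (ty + 2) * 8 <= smem_bytes]
    return min(feasible, key=lambda t: total_time(nx, ny, nt, *t))
```

Even this toy model captures the key trade-off: larger tiles amortize halo traffic and per-tile overhead, but are bounded by the shared-memory capacity constraint.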

Energy Model and Prediction of GPGPU Stencils
As with the analytical execution time model, we develop a methodology for modeling the energy efficiency of tiled nested-loop codes running on a graphics processing unit (GPU) and use it to predict energy consumption. We assume that a highly optimized and parameterized version of a tiled nested-loop code, either written by an expert programmer or automatically produced by a polyhedral compilation tool, is given to us as input. We then model the energy consumption as an analytical function of a set of parameters characterizing the software and the GPU hardware.
Most previous attempts at GPU energy modeling were based on low-level machine models that were then used to model whole programs through simulations, or were analytical models that required low-level details. In contrast, our approach develops analytical models based on (i) machine and architecture parameters, (ii) program size parameters as found in the polyhedral model, and (iii) tiling parameters. Our model therefore allows prediction of the energy consumption with respect to a set of parameters of interest. We illustrate the framework on three nested-loop codes: Smith-Waterman, and one-dimensional and two-dimensional Jacobi stencils, and analyze the accuracy of the resulting models. With an optimal choice of model parameters, the RMS error is less than 4%. Two factors allow us to attain this high accuracy. The first is domain specificity: we focus only on tilable nested-loop codes. The second is that we decouple the energy model from a model of the execution time, which is a known hard problem.
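Schematically, such a decoupled model charges each operation class a per-event energy and counts events analytically as a function of the tile size. The coefficients below are invented placeholders, not our calibrated values:

```python
def energy_model(nx, nt, tx, e_flop=2e-11, e_dram=5e-10, e_shared=5e-11,
                 flops_per_point=6):
    """Energy of a tiled 1D stencil sweep as an analytical function of
    the tile size tx. Compute energy scales with grid points, DRAM
    energy with per-tile halo traffic, and shared-memory energy with
    in-tile neighbor reuse. All per-event energies are placeholders."""
    points = nx * nt
    tiles = -(-nx // tx) * nt                 # ceiling division
    compute = points * flops_per_point * e_flop
    dram = tiles * (tx + 2) * e_dram          # each tile loads tx + halo
    shared = points * 3 * e_shared            # three neighbor reads/point
    return compute + dram + shared
```

Because the expression is a closed-form function of tx, energy-optimal tile sizes can be found analytically or by a trivial search, without simulating the program.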

Area Model and Chip Area Prediction of GPUs
We also develop an analytic model for the total silicon area of a GPU accelerator. We faced some difficulties in deriving an acceptable analytical model, as silicon data had to be reverse engineered from extremely limited public-domain resources. As a general observation, within each GPU family there is little diversity in the parameter configurations. For the Maxwell family of GPUs, the GTX 980 and Titan X chips were chosen as two sufficiently distinct points to calibrate our analytical models. The calibration itself was performed by evaluating die photomicrographs, publicly available information about the NVIDIA GTX 980 (Maxwell series) GPU, and other generally accepted memory architecture models. The model was validated by comparing its predictions with known data on the Maxwell-series Titan X GPU. We found the model prediction to be accurate to within 2%, though this number is not significant.
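The calibration step can be illustrated by fitting a two-parameter linear area model to two known configurations. The sketch below uses approximate public die figures for the GTX 980 (GM204, 16 SMs, roughly 398 mm²) and Titan X (GM200, 24 SMs, roughly 601 mm²), and is far coarser than the actual reverse-engineered model:

```python
def calibrate_area_model(chips):
    """Fit area = a * num_sms + b from two (num_sms, die_area_mm2)
    calibration points and return a predictor. The slope is the
    incremental area per SM; the intercept captures fixed area such
    as memory controllers and I/O."""
    (s1, a1), (s2, a2) = chips
    a = (a2 - a1) / (s2 - s1)
    b = a1 - a * s1
    return lambda sms: a * sms + b

# Approximate public figures: (SM count, die area in mm^2)
predict = calibrate_area_model([(16, 398.0), (24, 601.0)])
```

With only two calibration points the fit is exact by construction; the real validation, as described above, requires held-out chips and richer per-component terms.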
Next, we develop mathematical objective functions to illustrate the use of these models in performance optimization and later we will show the same for software-hardware codesign.

Compilation and its optimization subspaces
To address the limitations of current domain-specific compilation noted in Section 2.1.2, we now describe our approach to systematically exploring well-defined regions of the design landscape using exact (not surrogate) objective functions. Each problem instance has an objective function that represents (is not just a surrogate for) the metric we seek to optimize: execution time (M_T), power (M_P), energy (M_E), etc.
It is a function of all the parameters of this three-dimensional point. Other parameters, e.g., the number of processors, the memory capacity, etc., may define a feasible space where this function is valid.
Our approach is based on the hypothesis that domain specificity of both the programs and the architecture allows us to develop such functions. Note that the objective function cannot be a surrogate; it must be the actual cost metric of interest. Under this hypothesis, our entire strategy can be summarized as the collective solution of multiple optimization problems with common objective function(s). We will discuss two such common objective functions, M_T for execution time and M_E for energy (expressed as GOPs/sec or GOPs/joule), in detail in the following chapters.
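For a single strategy, optimizing the exact metric reduces to a constrained minimization. A generic sketch, where `model` and `feasible` are placeholders for a metric such as M_T and its feasibility predicate F_T:

```python
def autotune(model, feasible, candidates):
    """Minimize an exact cost model M(p) over the feasible candidates.
    `model` and `feasible` stand in for a metric (e.g. M_T) and its
    feasible-space predicate (e.g. F_T); both are placeholders."""
    best, best_cost = None, float("inf")
    for p in candidates:
        if not feasible(p):
            continue
        cost = model(p)
        if cost < best_cost:
            best, best_cost = p, cost
    return best, best_cost
```

In practice the enumeration would be replaced by the analytical solution of the optimization problem; exhaustive search over a small candidate set is shown only to make the formulation concrete.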
2. (Auto) Super-Tuning: The next step is to extend the optimization across multiple strategies, say S_1 and S_2. Given two separate optimizations formulated as (a) minimize M_T1(P, S_1, A) subject to F_T1(P, S_1, A), and (b) minimize M_T2(P, S_2, A) subject to F_T2(P, S_2, A), we can formulate the problem of optimizing across strategies in two ways: (i) take the minimum of the two optimizations, min(minimize M_T1(P, S_1, A), minimize M_T2(P, S_2, A)), or (ii) solve separate optimization problems, depending on the intersections and differences of the feasible spaces of each one.
This can be extended to a set of strategies, S = {S_i}. Although the second option is not very scalable (the number of sub-problems grows exponentially with the number of strategies), it is reasonable for a small number of strategies; e.g., it would let us automatically choose between time skewing, diamond tiling, diamond prisms, and HHC.
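Option (i) is straightforward to sketch: tune each strategy with its own exact model and feasible space, then take the overall minimum. The strategy names and models below are placeholders:

```python
def super_tune(strategies, candidates):
    """Option (i) of super-tuning: `strategies` maps a strategy name
    (e.g. "diamond", "hhc"; names illustrative) to a (model, feasible)
    pair. Returns the best (strategy, params, cost) overall."""
    best = (None, None, float("inf"))
    for name, (model, feasible) in strategies.items():
        for p in candidates:
            if feasible(p) and model(p) < best[2]:
                best = (name, p, model(p))
    return best
```

This is exactly the min-of-minima formulation: each inner loop solves one per-strategy problem, and the running minimum selects across strategies.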
3. Multi-Metric (Auto) Tuning: The above optimizations account for only one performance metric, which leads to a single objective function. One might want to optimize for more than one metric. Consider a multi-metric optimization such as the energy-delay product. The optimization problem can be formulated as: minimize M_T(P, S, A) × M_E(P, S, A), subject to F_T(P, S, A) ∩ F_E(P, S, A). Note that the feasible space consists of the intersection of the feasible spaces of time and energy.
The program parameters (e.g., problem sizes) and the features (e.g., tile sizes) of the selected strategy (e.g., diamond tiling) are the parameters of the multi-metric objective function.
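The energy-delay formulation can be sketched by composing the two exact models over the intersected feasible space; all four functions here are placeholders for the analytical models:

```python
def edp_tune(time_model, energy_model, feas_t, feas_e, candidates):
    """Minimize the energy-delay product M_T(p) * M_E(p) over the
    intersection of the time and energy feasible spaces. Returns None
    if the intersection is empty."""
    feasible = [p for p in candidates if feas_t(p) and feas_e(p)]
    if not feasible:
        return None
    return min(feasible, key=lambda p: time_model(p) * energy_model(p))
```

Other multi-metric combinations (e.g. weighted sums) drop in by changing only the key function; the intersected feasible space is unchanged.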

4. Multi-Metric (Auto) Super-Tuning: The above multi-metric objective function can be extended to multiple strategies, say S_1 and S_2, with one multi-metric optimization problem per strategy. As in (Auto) Super-Tuning, we can formulate the problem of optimizing across strategies in two ways: (i) take the minimum of the two optimizations, or (ii) solve separate optimization problems, depending on the intersections and differences of the feasible spaces of each one.
The methods can be extended to a set of strategies, S = {S i }.
Thus, our approach would allow us to deliver on the promise of automatic and provably optimal compilation, for any point in the program × transformations/strategies plane for a given performance metric.
Polyhedral compilation revisited: As noted before, current polyhedral compilers target a fairly broad class of programs, and make choices like tiling hyperplanes and shapes, and (inter- and intra-) tile schedules. They do this using classic scheduling algorithms [16,17,50] that rely on (integer) linear programming with surrogate objective functions. Tile sizes are chosen subsequently via auto-tuning.

Codesign and its optimization subspaces
Codesign-the simultaneous design of hardware and software-has two common interpretations. System codesign is the problem of simultaneously designing hardware, runtime system, compilers, and programming environments of entire computing systems, typically in the context of large-scale, high-performance computing (HPC) systems and supercomputers. Application codesign, also called hardware-software codesign, is the problem of systematically and simultaneously designing a dedicated hardware platform and the software to execute a single application (program). The proposed approach is applicable in both contexts.
Application Codesign and its optimization subspaces

This problem seeks to optimize the common performance metric for the set of programs on the given architecture. We treat A not as parameters but as unknowns in the generalized optimization problem; argmin_A M(P, S, A) gives us the optimal architecture for the set of program instances. Thus we simultaneously solve for architecture and compilation, thereby resolving the codesign problem. This single-metric optimization problem can also be extended to a multi-metric optimization.
This is particularly useful in large system designs, where the transformation strategy is fixed and more than one performance metric is critical for system design. Note that we show multi-metric optimization for two cost metrics; it can be extended to more than two cost metrics as needed.
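Casting architectural parameters as unknowns turns tuning into a joint search over compiler and hardware parameters. A schematic enumeration, where the parameter spaces and the cost model are illustrative stand-ins:

```python
import itertools

def codesign(model, compiler_space, arch_space, feasible):
    """argmin over (S, A): jointly enumerate compiler parameters S and
    architecture parameters A, minimizing a shared cost model M(S, A).
    All spaces and the model here are illustrative placeholders."""
    best, best_cost = None, float("inf")
    for s, a in itertools.product(compiler_space, arch_space):
        if not feasible(s, a):
            continue
        cost = model(s, a)
        if cost < best_cost:
            best, best_cost = (s, a), cost
    return best, best_cost
```

The same structure generalizes to sets of programs and strategies by adding further factors to the product, which is precisely why analytical (rather than simulated) cost models are essential: each point must be cheap to evaluate.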

4. Multi-Metric Super System Codesign: The above multi-metric system codesign can be further extended to consider multiple strategies. This would be the ultimate goal of system codesign: optimizing across the full program, transformation, and architecture planes for multiple performance metrics.

Bottleneckology
Our approach to system design seems deceptively simple; in reality, it is a very hard problem.
Exploiting resources to their full capacity is one of the objectives when optimizing for performance. Bottleneck analysis helps in studying performance sinks and design flaws. There are many ways of using the cost models to perform bottleneck analysis; in particular, they can be used to identify the resources that have been saturated and the ones that have slack.
We refer to this slack and saturation of the resources as Bottleneckology. We study this in three ways: (i) investigate codesign-tradeoffs, (ii) perform overhead analysis, and (iii) explore the effect of hyperthreading. More details are provided in Chapter 6.
In the next chapters (3, 4, 5, and 6), we discuss our work in more detail.
Model predictions are used to estimate the execution time, energy consumption, power consumption, etc. of a program. Cost metrics appear either in the objective function as a quantity to be optimized or in the constraints. We will illustrate the use of a few of the many metrics: (i) execution time models and (ii) energy models as costs in the objective functions, and (iii) memory access models and (iv) silicon area models as constraints on the objective function.

Execution Time Model for GPGPU Stencils
We develop an execution time model for GPGPU stencils that guides the optimal choice of compiler parameters (tile sizes). Our model is a simple set of analytical functions that predict the execution time of the generated code. It is deliberately optimistic, since we are targeting modeling and parameter selections that yield highly optimized codes. We experimentally validate the model on a number of 2D and 3D stencil codes, and show that the root mean square error in the execution time is less than 10% for the subset of the codes that achieve performance within 20% of the best.
We show the following.
• We develop a simple analytical model to predict the execution time of a tiled stencil program and apply it to codes generated by the HHC compiler. The model is an analytic function of program, machine, and compiler parameters that are easily available statically, and one stencil-specific parameter that is obtained by running a handful of micro-benchmarks.
It is deliberately optimistic and also ignores the effect of some parameters.
• Although our model may not accurately predict the performance for all tile size combinations, it is very accurate for the ones that matter, i.e., those that give top performance. To show this, we generated more than 60,000 programs for two modern target platforms (NVIDIA GTX 980 and Titan X), four 2D stencil codes (Jacobi2D, Heat2D, Laplacian2D, and Gradient2D), and two 3D stencils (Heat3D and Laplacian3D) over a range of ten input/problem sizes, and a wide range of tile sizes and thread counts (the HHC compiler inputs) for each platform-stencil-size combination.
As we expected, the root-mean-square error (RMSE) over the entire data set was "disappointingly" over 100%. However, when we restricted ourselves to the data points that have an execution time within 20% of the best value for that particular platform-stencil-size combination, the RMSE dropped to less than 10%, which we consider very good.
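The restriction procedure itself is simple to state in code. The sketch below computes relative-error RMSE only over configurations within 20% of the best measured time; the data values are invented:

```python
import math

def rmse_near_best(measured, predicted, slack=0.20):
    """Relative-error RMSE restricted to configurations whose measured
    execution time is within `slack` of the best measured time.
    `measured` and `predicted` map a configuration to a time."""
    best = min(measured.values())
    near = [k for k, t in measured.items() if t <= best * (1 + slack)]
    errs = [(predicted[k] - measured[k]) / measured[k] for k in near]
    return math.sqrt(sum(e * e for e in errs) / len(errs))
```

Slow configurations, where the optimistic model is knowingly wrong, are thus excluded by construction, which is exactly why the restricted RMSE is the meaningful figure for tile-size selection.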
Our overall methodology is applicable, with simple extensions, to more general programs, e.g., those that fit the polyhedral model. But achieving high GPU utilization requires efficient GPU codes to start with, which are very hard and time-consuming to produce manually, especially in higher dimensions. The highly optimized HHC-generated codes we use for testing and validation have a few thousand lines of CUDA code each, and we generated tens of thousands of such codes in our experimental analysis. So our methodology is not limited to the HHC compiler (in fact, we have applied it successfully to manually generated 1D stencil codes), but the use of HHC (or a similar compiler) was necessary to produce, for our experiments, a large number of GPU codes that are also very efficient.

Energy Model for Tiled Nested-Loop Codes
Energy efficiency has been recognized as one of the biggest challenges on the roadmap to higher performance (exascale) systems, for a number of reasons including cost, reliability, energy conservation, and environmental impact. The most powerful computers today consume megawatts of power, enough to power small towns, at a cost of millions of dollars per year. And these estimates do not include the cost of cooling, which can be almost as high as the cost of computing itself [60].
In addition, the cost of building a power provisioning facility ranges from $10 to $22 per deployed IT watt [61], and every 10°C increase in temperature results in a doubling of the system failure rate, thereby reducing the reliability of HPC systems [62]. Designing accurate models for energy efficiency can help better predict the power and energy requirements of an application and help developers optimize the parameters of their codes for better energy efficiency on HPC systems.
The goal of our work is to introduce a new approach for modeling the energy cost as an analytical function of tunable software parameters in a way that is both simple and accurate. Having such a model will allow the energy efficiency to be optimized with respect to (a subset of) the tunable parameters by solving the corresponding analytical optimization problem.
Our modeling approach targets tiled nested-loop code segments, which are the most compute-intensive portions of many application codes and which also allow a high degree of parallelism. To be more specific, we focus our analysis on a subclass of tiled nested-loop codes called dense stencils, which occur frequently in the numerical solution of PDEs and in many other contexts such as high-end graphics, signal and image processing, numerical simulation, scientific computing, and bioinformatics. We chose stencils for our case studies because they allow us to model the entire class hierarchically, with a single generic model representing the whole class, while stencil-dependent model parameters have to be specified separately for each stencil of interest to complete its model. (However, the approach is applicable to any other class of nested-loop codes that allows tiling.) We completely develop and validate the detailed models (including the stencil-dependent parameters) of three specific stencils. Models for other stencils can be developed in a similar way with a relatively small amount of extra work.
In order to efficiently optimize stencils on accelerators, we aim to represent the amount of energy consumed as an analytic function of the software parameters. We assume that the input codes have been analyzed and optimized with respect to parallelism and data-access efficiency by appropriate skewing and tiling transformations, say by a polyhedral code generator.
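The intended use of such a model can be illustrated with a toy sketch: a hypothetical analytic energy function of the tile sizes, minimized by direct search over a small parameter grid. The constants and the form of the function below are illustrative assumptions only, not our actual model:

```python
import itertools

# Hypothetical model constants (illustrative, not measured values).
E_OP = 1.0e-9      # energy per arithmetic operation (J)
E_MEM = 2.0e-8     # energy per off-chip word transferred (J)
S1 = S2 = 4096     # problem size in the two space dimensions
T = 100            # number of time steps

def energy(t1, t2, tT):
    """Toy analytic energy: the compute term is constant in total, while
    the memory term shrinks as tiles grow (more reuse per transfer)."""
    ops = S1 * S2 * T                                   # total stencil updates
    tiles = (S1 // t1) * (S2 // t2) * (T // tT)         # number of tiles
    words = tiles * (t1 * t2 + 2 * tT * (t1 + t2))      # per-tile data loads
    return ops * E_OP + words * E_MEM

# Because the model is analytic, the optimum is found by evaluating a cheap
# function instead of running the code for every configuration.
grid = [8, 16, 32, 64]
best = min(itertools.product(grid, grid, [2, 4, 8]), key=lambda p: energy(*p))
```

In this toy setting the search correctly favors the largest tiles, since they minimize off-chip traffic; the real model adds the capacity constraints discussed below.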
Our specific contributions are as follows.
• Our energy model predicts energy efficiency by analyzing source code only, unlike other approaches [63,64] that rely on parameters computed by running benchmarks for each individual code. We do use micro-benchmarks, but they are used to characterize hardware, rather than codes.
• We are not aware of any previous work combining the polyhedral method with energy modeling. Our approach allows optimization of codes that are already very efficient, having been significantly improved by the polyhedral method and by advanced tiling strategies such as hexagonal and hybrid tilings [18].
• Our model is very accurate (one version with RMS error ≤ 17.14% and another with RMS error ≤ 4%), with precision similar to or higher than that of existing simulation-based models such as GPUSimPow [65].

Memory Access Model for GPGPU Stencils
We develop memory access models [21,22] for GPGPU stencils and use them in our execution time and energy models. The memory models appear in two specific contexts. First, the total number of memory accesses made by a tile is used to model the data transfer time of a tile, and the data movement requirement of a wavefront is likewise modeled to calculate the data transfer time of wavefronts. In the energy models, we use the memory access equations to determine the amount of data transfer required and combine it with the energy consumption per data transfer to calculate the total energy consumption of a tile.
Second, the memory footprint of a tile appears as a constraint on the formulated objective function for optimization. The memory requirements of a tile must not exceed the shared memory capacity of the GPU. This in turn constrains the tile sizes and the feasible space.
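As an illustration, the shared-memory constraint amounts to a simple feasibility filter over candidate tile sizes. The capacity, word size, halo width, and double-buffering assumption below are hypothetical, chosen only to make the sketch concrete:

```python
# Hypothetical parameters: 48 KB of shared memory per SM, 8-byte values,
# and double buffering that halves the usable capacity (an assumption).
SHARED_MEM_BYTES = 48 * 1024
WORD_BYTES = 8

def footprint_bytes(t1, t2, halo=1):
    """Shared-memory footprint of a 2D tile including a one-point halo."""
    return (t1 + 2 * halo) * (t2 + 2 * halo) * WORD_BYTES

def feasible(t1, t2, double_buffered=True):
    cap = SHARED_MEM_BYTES // (2 if double_buffered else 1)
    return footprint_bytes(t1, t2) <= cap

# Only tile sizes that fit in shared memory survive into the search space.
candidates = [(t1, t2) for t1 in (16, 32, 64, 128)
                       for t2 in (16, 32, 64, 128) if feasible(t1, t2)]
```

The same filter, with the real capacities and footprints, is what bounds the feasible space in the optimization problems described here.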
These memory models are then used for codesign optimization [20]. In software-hardware codesign, the memory models appear both in the objective function, as part of the time model equations (i.e., to calculate the data transfer time), and in the constraints, where memory capacity defines the feasible space.

Silicon Area Model for GPUs
We develop an analytic model for the total silicon area of a GPU accelerator. We faced some difficulties in deriving an acceptable analytical model, as silicon data had to be reverse engineered from extremely limited public domain resources. As a general observation, within each GPU family there is little diversity in the parameter configurations. For the Maxwell family of GPUs, the GTX 980 and Titan X chips were chosen as two sufficiently distinct points to calibrate our analytical models. The calibration itself was performed by evaluating die photomicrographs, publicly available information about the NVIDIA GTX 980 (Maxwell series) GPU, and other generally accepted memory architecture models. The model was validated by comparing its predictions with known data on the Maxwell series Titan X GPU. We found the model prediction to be accurate to within 2%, though this number should not be over-interpreted: although the many configurations of any family of GPUs are spread out, they come from binning only a small number of distinct dies, so we ended up calibrating our model on one die and validating it on only one other. In the next chapter, we will show the use of these cost models for tile size selection.
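The shape of such an area model can be sketched as a sum of per-component contributions. All coefficients below are hypothetical placeholders, not the reverse-engineered values used in our calibration:

```python
# Illustrative component areas (hypothetical values, not measured data).
AREA_PER_SM = 12.0          # mm^2 of core logic per streaming multiprocessor
AREA_PER_KB_SHARED = 0.05   # mm^2 per KB of shared memory in each SM
AREA_PER_KB_L2 = 0.04       # mm^2 per KB of chip-wide L2 cache
AREA_FIXED = 80.0           # mm^2 for memory controllers, I/O, scheduling

def die_area(n_sm, shared_kb_per_sm, l2_kb):
    """Analytic die-area estimate as a sum of per-component contributions."""
    return (AREA_FIXED
            + n_sm * (AREA_PER_SM + shared_kb_per_sm * AREA_PER_KB_SHARED)
            + l2_kb * AREA_PER_KB_L2)

# Example query: a hypothetical 16-SM design with 96 KB/SM and 2 MB of L2.
area = die_area(n_sm=16, shared_kb_per_sm=96, l2_kb=2048)
```

Calibration then amounts to fitting these coefficients to the areas of known dies, and validation to checking the prediction on a die held out of the fit.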

Chapter 4 Tuning
An important element of compilation tools is a step called (auto)tuning: empirical evaluation of the actual performance of a (hopefully small) set of code instances over a range of mapping parameters. This enables the compilation system to choose these parameters optimally for actual "production runs" on real data/inputs. Modern architectures are extremely complicated, with sophisticated hardware features that interact in unpredictable ways, especially since the latency of operations is unpredictable because of the deep memory hierarchy. It is widely believed that, because of this, autotuning is unavoidable for obtaining good performance.
Our work challenges this. In particular, we make the case that domain specificity can have a third important benefit: it enables us to develop a good analytical model to predict the performance of specific types of codes on specific types of target architectures. We can then use the model to optimally choose the mapping parameters (notably tile sizes).
In order to address the challenges of exascale computing, many experts believe that a software-hardware co-design approach, where the software and the corresponding hardware are jointly co-developed and co-optimized, will be a "critical necessity" [66]. Since the architectures of exascale systems are in flux, it is important to develop rigorous methods to map high-level specifications of computations to diverse target architectures, ranging from multi-core CPUs, many-core GPUs, and accelerators, over heterogeneous nodes of such CPU-GPU combinations, to large distributed systems of many such nodes. In the overwhelming majority of cases, the mismatch between data communication patterns and hardware architecture prevents the efficient exploitation of all available computing resources, and peak performance is almost impossible to achieve. Worse still, it is often not clear to the user when the point of diminishing returns is reached.
We address a key step of the optimization, namely mapping the software representation onto the hardware, and choosing the mapping parameters to optimize an objective function representing the performance, i.e., the execution time. In its full generality, the optimal mapping problem is a discrete non-linear optimization problem, known to be NP-hard [67] and hence very difficult to solve efficiently. We therefore use a number of simplifying assumptions, as is common in the literature. A number of parameters can be specified as inputs to a compiler, e.g., the tile sizes.
These parameters have a tremendous influence on the performance of the code. The problem we tackle here is how to select these parameters optimally.

Tune for Speed
To test the predictive abilities of our execution time model, we evaluated the model over the entire feasible space (for each platform-stencil-size combination) and obtained the tile sizes that were within 10% of the best predicted execution time. There were fewer than 200 such points. We called the HHC compiler with these tile sizes and observed among this set a performance improvement of 9% on average, with a maximum of 17%. Prajapati et al. [21] illustrate the predictive power of our execution time models in detail.
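The selection step described above can be sketched as follows, with a toy predictor and an illustrative tile-size grid standing in for the real model and feasible space:

```python
def shortlist(configs, predict, slack=0.10):
    """Keep only configurations whose *predicted* time is within `slack`
    of the best predicted time; only these are compiled and run."""
    times = {c: predict(c) for c in configs}
    best = min(times.values())
    return [c for c, t in times.items() if t <= best * (1 + slack)]

# Toy predictor over (t1, t2, tT) tile sizes (illustrative only): larger
# spatial tiles amortize transfers, deeper time tiles add a small cost.
predict = lambda c: 1.0 / (c[0] * c[1]) + 0.01 * c[2]
configs = [(t1, t2, tT) for t1 in (8, 16, 32)
                        for t2 in (8, 16, 32) for tT in (2, 4)]
survivors = shortlist(configs, predict)
```

The model thus replaces an exhaustive empirical sweep with a cheap analytical pre-filter, leaving only a small shortlist for the compiler to actually build and time.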
We have two messages. The main one is that, contrary to widespread belief, it is possible to construct good analytical cost functions to drive performance tuning for GPGPUs. This can significantly reduce the space that autotuners need to explore. The second message is that it may be necessary to revisit some of the "conventional wisdom" when choosing tile size parameters.
Our model is very accurate for predicting the times of problem instances whose performance is within 20% of the optimal and, hence, it can be used to find values for tunable parameters that will give near optimal performance.
We would like to note that our techniques can be easily extended to other types of stencils.

Optimize for Energy
Our proposed optimization methods can also be applied to optimize for energy. Our energy models represent the energy consumption as an explicit analytic function of a set of software and hardware parameters describing the specifics of the implementation and the hardware. Our results indicate that optimizing for energy largely coincides with optimizing for speed, as expected from the folklore; thus a user could apply both optimizations rather than having to choose one or the other. Finally, we use our energy model to select the optimal tile size for energy efficiency, and we report the number of non-optimal tile size selections and hence the error in energy due to the selection of non-optimal tile sizes. Prajapati et al. [22] describe our energy models, the optimization methods, the results, and the experimental validation in detail.
"Design is not just what it looks like and feels like. Design is how it works." - Steve Jobs

Software-hardware codesign is one of the proposed enabling technologies for exascale computing and beyond [6]. Currently, hardware and software design are done largely separately. Hardware manufacturers design and produce a high-performance computing (HPC) system with great computing potential and deliver it to customers, who then try to adapt their application codes to run on the new system. But because of a typical mismatch between hardware and software structure and parameters, such codes are often only able to run at a small fraction of the total performance the new hardware can reach. Hence, optimizing both the hardware and software parameters simultaneously during hardware design is considered a promising way to achieve better hardware usage efficiency, thereby enabling leadership-class HPC availability at a more manageable cost and energy efficiency.
The design of HPC systems and supercomputers is by no means the only scenario where such optimization problems occur. The execution platforms of typical consumer devices like smart phones and tablets consist of very heterogeneous Multi-Processor Systems-on-Chip (MPSoCs) and the design challenges for them are similar.
Despite the appeal of an approach that simultaneously optimizes for software and hardware, its implementation represents a formidable challenge because of the huge search space. Previous approaches [68][69][70] pick a hardware model H from the hardware design space and a software model S from the software design space, map S onto H, estimate the performance of the mapping, and iterate until a desirable quality is achieved. But not only can each of the software and hardware design spaces be huge, each iteration also takes a long time, since finding a good mapping of S onto H and estimating the performance of the resulting implementation are themselves challenging computational problems.
We propose a new approach to the software-hardware codesign problem that avoids these pitfalls by considerably shrinking the design space and making its exploration possible by formulating the optimization problem in a way that allows the use of existing powerful optimization solvers. We apply the methodology to programmable accelerators, namely Graphics Processing Units (GPUs), and to stencil codes. The key element of our approach is to exploit multiple forms of domain-specificity. Our main contributions are:
• We propose a new approach to software-hardware codesign that is computationally feasible and provides interesting insights.
• We combine our area model with a workload characterization of stencil codes, and our previously proposed execution time model [21] to formulate a mathematical optimization problem that maximizes a common objective function of the hardware and software parameters.
• Our analysis provides interesting insights. We produce a set of Pareto-optimal designs that represent optimal combinations of hardware and compiler parameters. They allow for up to 33% improvement in performance as measured in GFLOPs/sec.
We develop a framework for software-hardware codesign that allows the simultaneous optimization of software and hardware parameters. It assumes analytical models for performance, for which we use execution time, and for cost, for which we choose the chip area. We make use of the execution time model from Prajapati et al. [21], which predicts the execution times of a set of stencil programs. For the chip area, we develop an analytical model that estimates the chip area of parameterized designs from the Maxwell GPU architecture. Our model is reasonably accurate for estimating the total die area based on individual components such as the number of SMs, the number of vector units, the sizes of memories, etc.
We formulate a codesign optimization problem using the time model and our area model to optimize the compiler and architecture parameters simultaneously. We predict performance improvements of 104% and 69% for 2D stencils, and of 123% and 126% for 3D stencils, over the existing Maxwell GTX 980 and Titan X architectures, respectively.
The main focus is on the methodology; specifically, to develop a software-hardware codesign framework and to illustrate how models built using it can be used for efficient exploration of the design space, for identifying Pareto-optimal configurations, and for analyzing design tradeoffs. The same framework, possibly with some modifications, could be used for codesign on other types of hardware platforms (instead of GPUs), other types of software kernels (instead of the set of stencils we chose, or even non-stencil kernels), and other kinds of performance and cost criteria (e.g., energy as cost). Also, with work focused on the individual elements of the framework, the execution time and chip area models we used could be replaced by ones with better features in certain aspects or scenarios.
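The identification of Pareto-optimal configurations in such a design-space sweep reduces to a standard dominance filter, sketched below on hypothetical (area, performance) points rather than our actual sweep data:

```python
def pareto_front(designs):
    """Designs are (area_mm2, gflops) pairs; keep those not dominated by any
    other design. A dominator has area no larger and performance no smaller,
    and is strictly better in at least one of the two."""
    front = []
    for a, p in designs:
        dominated = any(a2 <= a and p2 >= p and (a2 < a or p2 > p)
                        for a2, p2 in designs)
        if not dominated:
            front.append((a, p))
    return sorted(front)

# Hypothetical (area, performance) points from a design-space sweep.
designs = [(400, 900), (425, 1100), (450, 1150), (450, 1000), (500, 1150)]
front = pareto_front(designs)
```

Designs off the front, such as a configuration that is both larger and slower than another, are discarded; the front itself is what a designer then trades off against an area budget.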
The analyses from our work indicate the following accelerator design recommendations, for the chosen performance and cost criteria and application profile:
• Remove caches completely, and
• Use the area previously devoted to caches to add more cores on the chip.
• The more precise the workload characterization and the specific area model parameters, the more useful the conclusions drawn from the study.
Hardware resources such as memories are often expensive and must be utilized wisely. In the next chapter, we discuss how to use cost models to identify the resource requirements for optimal performance via bottleneck analysis.
We explore bottleneckology in three ways: studying codesign trade-offs, performing overhead analysis, and investigating the effect of hyperthreading. The next three sections discuss these in more detail.

Codesign Trade-offs
Workload Sensitivity Table 6.1 illustrates the architectural parameters of the best performing designs for each of the six benchmarks (2D and 3D stencils), for an area budget between 425 and 450 mm^2. Observe how the parameters of the best architecture differ significantly across benchmarks. There are also differences in the achieved performance for each benchmark, but that is to be expected, since the main computation in the stencil loop body has a different number of operations across the benchmarks.

Shared Memory Requirements
We can also observe in Table 6.1 that there are marked differences between the optimal architecture configurations for 2D and 3D stencils. 3D stencils seem to require larger shared memory (≥ 96 kB/SM) compared to 2D stencils (≤ 24 kB/SM). Indeed, for designs with less than 48 kB, the performance was nowhere near optimal for 3D stencil programs. Comparing the optimal configuration for the Heat 2D stencil with that of the Heat 3D stencil (both have an equal total die area of 447 mm^2), we observe that the amount of shared memory required for Heat 3D is 16 times larger than that for Heat 2D. Also note that 3D stencils require a higher number of vector units per SM for optimal performance.
Resource Allocation Another interesting perspective is seen in Figure 6.1, which plots the Pareto-optimal design points in blue and all other non-Pareto configurations in orange. The axes show the relative percentages of the chip area devoted to memory and to vector units. We notice that the optimal designs (blue points) lie in a relative cluster. This phenomenon is even more marked for 3D stencils. At present, we do not have a clear explanation for why the points cluster in this manner, and we plan to mine this data to determine patterns, if any.

Overhead Analysis
As discussed before, compiler-generated code has a number of parameters, such as tile sizes, that are then tuned via empirical exploration. Our execution time model [21] guides this choice. The execution time model uses a machine-dependent parameter, called C_iter, which is the execution time of one iteration of the loop body per vector unit, provided that all the necessary data is available in shared memory. To measure C_iter, one needs to generate a random set of codes for a given stencil with different tile sizes, modify these codes to remove global memory accesses, and then take the average of the empirically measured execution times. The process is very time-consuming and requires expertise in the Hybrid Hexagonal Tiling code generator [18]. Moreover, C_iter depends on both the machine and the program, so its value changes as machine and program parameters vary. It is, therefore, very difficult to model or measure C_iter. Note also that the value of C_iter is used to evaluate the objective function that finds the optimal tile size. This limits the use of the execution time model for optimal tile size selection, because the crucial model parameter C_iter has to be empirically measured.
To address this problem, we propose a closed-form solution that is completely independent of the machine and program parameters. We are able to analytically predict optimal tile sizes that are portable across platforms and valid for all Jacobi-like stencils. We modify the objective function from Prajapati et al. [21] and develop a cost function that is independent of C_iter. For a 2D stencil, our closed-form solution suggests maximizing the size of the hexagonal face of a tile subject to some constraints. This allows us to significantly narrow down the tile size design space.
The mathematical optimization problem in Prajapati et al. [21] for a given 2D stencil minimizes the model-predicted execution time T_alg over the tile sizes t_S1, t_S2, and t_T. T_alg is an explicit function of N_w, the number of wavefronts; T_sync, the synchronization time for a wavefront; T_prism, the time to execute a tile; n_SM, the number of processors; w, the size of a wavefront; and k, the number of tiles that execute simultaneously. For the full formulation, please refer to Prajapati et al. [21].
In addition to the time for computation, T_alg includes the time taken by data transfers and the time for inter-tile and intra-tile synchronizations. We are interested only in those tile sizes that give optimal performance; therefore, our tiles will be compute bound. Consider an ideal machine in which the execution time, T_ideal, is determined only by the total computation: it depends on the problem sizes S_1 and S_2 in the space dimensions, the problem size T in the time dimension, and the number of vector units n_V. Such an ideal machine is free of all synchronization delays and takes no additional time for data transfers. On a real machine, we would like to obtain performance close to T_ideal, but there is always an overhead price. Substituting T_alg and T_ideal with their respective expressions shows that, instead of minimizing the execution time (as in equation 6.1), we can minimize the overhead. To minimize equation 6.5, we need to maximize t_S1 + t_T/2. This suggests that we should increase the size of the hexagonal face of the tile as much as possible. Notice that t_S2 does not appear in equation 6.5.
For the above formulation, we assumed that the tiles are compute bound. We therefore need a mechanism to first prune the tile size design space, restricting it to compute-bound tiles only, and then use the above cost function to further reduce the search space.
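The two-stage pruning can be sketched as follows. The capacity, the cost ratios, the compute-bound test, and the face-size proxy below are all illustrative assumptions standing in for the real machine constants and cost function:

```python
# Illustrative pruning + ranking, with hypothetical constants.
SHARED_WORDS = 6 * 1024   # assumed shared-memory capacity in words
C_COMPUTE = 1.0           # relative cost of one point update
C_TRANSFER = 8.0          # relative cost of loading one word

def compute_bound(t_s1, t_s2, t_t):
    """Toy criterion: a tile is compute bound if its arithmetic work
    dominates its data transfers and its footprint fits in shared memory."""
    work = t_s1 * t_s2 * t_t * C_COMPUTE
    traffic = (t_s1 * t_s2 + 2 * t_t * (t_s1 + t_s2)) * C_TRANSFER
    return work >= traffic and t_s1 * t_s2 <= SHARED_WORDS

def face(t_s1, t_t):
    # Proxy for the hexagonal-face size; note t_s2 plays no role here.
    return t_s1 + t_t / 2

# Stage 1: prune to compute-bound tiles; stage 2: maximize the face size.
candidates = [(a, b, c) for a in (16, 32, 64) for b in (16, 32, 64)
              for c in (8, 16, 32) if compute_bound(a, b, c)]
best = max(candidates, key=lambda t: face(t[0], t[2]))
```

Even on this toy grid, the pruning discards most configurations before the cost function is ever evaluated, which is the point of the mechanism.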

The Effect of Hyperthreading
Our results in [21] suggest that we should revisit the "conventional wisdom" that says the optimal tiling strategy is to choose the largest possible tile size that fits, i.e., whose memory footprint matches the available capacity. First of all, this falls into the trap of precluding the overlap of computation and communication (the "hyperthreading effect"). But this can be avoided by explicitly accounting for hyperthreading. Indeed, our GPU platforms preclude such large sizes by disallowing the data footprint of a thread block to exceed half the shared memory capacity.
Thus, the hyperthreading-adjusted "conventional wisdom" would still seek to maximize tile volume subject to the half-capacity constraint: the best strategy would be the largest tile volume for the given footprint. Our model and experimental data suggest otherwise: an even higher hyperthreading factor turns out to yield the best performance. We still do not know why, and this is a subject of ongoing investigation.

Stencil Computations and Code Generation
At the algorithmic level, most stencil applications are compute bound in the sense that the ratio of the total number of operations to the total number of memory locations touched can always be made "sufficiently large," because it is an asymptotically increasing quantity. We may therefore expect that such codes can be optimized to achieve very high performance relative to machine peak.
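A quick back-of-the-envelope count makes this concrete for a 2D Jacobi-style sweep (the 5-ops-per-update figure is a rough illustrative count):

```python
# Operations-to-footprint ratio of an iterated 2D Jacobi sweep (toy count).
def intensity(n, t):
    """n x n grid, t time steps: total operations grow as n*n*t, while the
    memory locations touched stay ~2*n*n (input + output grids), so the
    ratio grows linearly with the number of time steps t."""
    ops = 5 * n * n * t          # 5-point stencil: ~5 ops per point update
    locations = 2 * n * n        # two grids, reused across all time steps
    return ops / locations

ratios = [intensity(1024, t) for t in (1, 10, 100)]
```

The ratio is unbounded in t, which is why the computation is compute bound at the algorithmic level; the catch, discussed next, is that naive implementations re-read the grids every time step and never realize this reuse.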
However, naive implementations turn out to be memory bound. Therefore, many authors seek to exploit data locality for these programs [53,71,72]. One successful technique is called time tiling [53,57,[73][74][75][76][77][78], an advanced form of loop tiling [73,79,80]. Time tiling first partitions the whole computation space into tiles extending in all dimensions, and then optionally executes these tiles in a so-called "45 degree wavefront" fashion. We assume, like most of the work in the literature, that dense stencil programs are compute bound after time tiling. However, due to the intricate structure of time-tiled code, writing it by hand is challenging. Automatic code generation is an attractive solution and has been an active research topic.
For iterative stencils a large set of optimizing code generation strategies have been proposed.
Pochoir [38] is a CPU-only code generator for stencil computations that exploits reuse along the time dimension by recursively dividing the computation into trapezoids. Diamond tiling [81], hybrid-hexagonal tiling [18], and Overtile [82] are all tiling strategies that exploit reuse along the time dimension while ensuring a balanced amount of coarse-grained parallelism throughout the computation. While the first has only been evaluated on CPU systems, the last two tiling schemes have been implemented to target GPUs. Overtile uses redundant computation, whereas hybrid-hexagonal tiling uses hexagonal tiles to avoid the need for redundant computation and the increased shared memory that would otherwise be required to store temporary values. Another time tiling strategy has been proposed with 3.5D blocking by Nguyen et al. [83], who manually implemented kernels that use two-dimensional space tiling plus streaming along one space dimension, with tiling along the time dimension, to target both CPUs and GPUs. A somewhat orthogonal stencil optimization has been proposed by Henretty et al., who use data-layout transformations to avoid redundant non-aligned vector loads on CPU platforms.

Performance Modeling
All of the previously discussed frameworks either come with their own auto-tuning framework or require auto-tuning to derive optimal tile sizes. For stencil graphs, which are directed acyclic graphs (DAGs) of non-iterated stencil kernels, various DSL compilers have been proposed. Halide [84] and Stella [85] are two DSLs, from the contexts of image processing and weather modeling respectively, that separate the specification of the stencil computation from the execution schedule, which allows for the specification of platform-specific execution strategies derived either by platform experts or by automatic tuning. Both DSLs support various hardware targets, including CPUs and GPUs. Polymage [86] also provides a stencil graph DSL, this time for CPUs only, but pairs it with an analytical performance model for the automatic computation of optimal tile size and fusion choices. With MODESTO [87], an analytical performance model has been proposed that models multiple cache levels and fusion strategies for both GPUs and CPUs as they arise in the context of Stella.
For stencil GPU code generation strategies that use redundant computations in combination with ghost zones, an analytical performance model has been proposed [88] that automatically derives "optimal" code generation parameters. Yotov et al. [89] showed more than ten years ago that an analytical performance model for matrix multiplication kernels can generate code that is performance-wise competitive with empirically tuned code generated by ATLAS [90], but stencil computations were not considered at that point. Shirako et al. [91] use cache models to derive lower and upper bounds on cache traffic, which they use to bound the search space of empirical tile-size tuning. Their work does not consider any GPU-specific properties, such as shared memory sizes and their impact on the available parallelism. In contrast to tools for tuning, Hong and Kim [92] present a precise GPU performance model that shares many of the GPU parameters we use. It is highly accurate, low level, and requires analyzing the PTX assembly code.
Patus [93] provides an auto-tuning environment for stencil computations that can target CPU and GPU hardware. It does not use software-managed memories and does not consider any time tiling strategies.
Renganarayana et al. [94] identify positivity as a common property shared by the parameters used by tile size selection methods and show that this property can be used to derive efficient and scalable tile size selection frameworks.

Chip Reverse Engineering and Area Modeling
Chip area modeling can formally be considered a branch of semiconductor reverse engineering, which is a well-researched subject. Torrence et al. [95] give an overview of the various techniques used for chip reverse engineering. The packaged chips are usually decapped, and the wafer die within is photographed layer by layer; the layers are exposed in reverse order after physical or chemical exfoliation. Degate [96], for example, is a well-known open source software package that helps analyze die photographs layer by layer. The reverse engineering process can be coarse-grained, identifying just the functional macro-blocks, or very fine-grained, identifying standard-cell interconnections and hence actual logic-gate netlists. Degate is often used in association with catalogs of known standard cell gate layouts, such as those compiled by Silicon Zoo [97]. Courbon et al. [98] provide a case study of how a modern flash memory chip can be reverse engineered using targeted scanning electron microscope imagery. For chip area modeling, one is only interested in the comparatively easier task of demarcating the interesting functional blocks within the die.

Energy Modeling
GPU power/energy modeling is a very active area: a recent survey article on the topic [99] cites almost 150 references. We only discuss the most relevant work here. The model we present complements Mittal and Vetter [99] by enabling us to find the optimal parameters (i.e., tile sizes) for the energy-efficient execution of stencil-like programs. Hong and Kim [100] present a GPU power model to predict the optimal number of GPU cores needed to achieve the peak memory bandwidth for a kernel.
They use an analytical model to predict the execution time [101], which enables static prediction of the power consumption. However, they predict the minimum number of cores required for a program to achieve the peak memory bandwidth of the GPU. While this approach may work for memory-bandwidth-bound programs, it is unlikely to produce good results for compute-bound programs like tiled stencil computations. Our model is much simpler, because it does not depend on warp- and thread-level parameters or on the number of PTX instructions. Nagasaka et al. [63] model the GPU power of kernels using performance counters. Lim et al. [102], GPUWattch [64], and GPUSimPow [103] are simulation-based power models. McPAT [104] is the basis for Lim et al. [102], and GPUWattch [64] uses GPGPU-Sim [105] to simulate execution time. Simulation- and performance-counter-based models require execution (or simulation) of the program to predict the power consumption; therefore, they are not feasible when decisions about optimal software parameters must be made at compile time. We do run some micro-benchmarks, but only to determine the parameters of a GPU architecture; the power consumption of a given program can then be predicted statically, without running the program.
There are studies [106,107] focused on reducing the energy consumption of heterogeneous systems; their models determine how to balance the load between the CPU and the GPU so as to reduce overall energy consumption. Our study, in contrast, focuses solely on modeling the energy consumption of GPUs.

Codesign
Application codesign is a well-established discipline and has seen active research for well over two decades [108][109][110][111][112]. The essential idea is to start with a program (or a program representation, say in the form of a CDFG, a Control/Data Flow Graph) and then map it to an abstract hardware description, often represented as a graph of operators and storage elements. The challenge that makes codesign significantly harder than compilation is that the hardware is not fixed, but must also be synthesized. Most systems involve a search over a design space of feasible solutions, and various techniques are used to solve this optimization problem: tabu search and simulated annealing [113,114], and integer linear programming [115].
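As a toy illustration of such design-space search, a simulated-annealing loop over a two-parameter hardware space might look as follows. The design space, cost function, and schedule are all made up for illustration:

```python
# Sketch: simulated annealing over a hypothetical hardware design space
# (core count, cache size). The cost function is a made-up proxy for a
# time * area objective, not a real model.
import math
import random

random.seed(0)

CORES = range(1, 33)                    # hypothetical core counts
CACHE_KB = (64, 128, 256, 512, 1024)    # hypothetical cache sizes

def cost(cores, cache_kb):
    """Made-up objective: modeled execution time times a silicon area proxy."""
    time = 1.0 / cores + 0.002 * (1024 / cache_kb)
    area = cores * 2.0 + cache_kb * 0.01
    return time * area

def anneal(steps=5000, t0=1.0, alpha=0.999):
    state = (random.choice(CORES), random.choice(CACHE_KB))
    best, best_c = state, cost(*state)
    t = t0
    for _ in range(steps):
        # Propose a random candidate design point
        cand = (random.choice(CORES), random.choice(CACHE_KB))
        dc = cost(*cand) - cost(*state)
        # Accept improvements always, worsenings with Boltzmann probability
        if dc < 0 or random.random() < math.exp(-dc / t):
            state = cand
            if cost(*state) < best_c:
                best, best_c = state, cost(*state)
        t *= alpha  # cool down
    return best, best_c

best_design, best_cost = anneal()
```

Real codesign systems search far richer spaces, but the accept/reject structure is the same.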
There is some recent work on accurately modeling the design space, especially for regular, i.e., affine control, programs [23][24][25]. However, all current approaches solve the optimization problem for a single program at a time. To the best of our knowledge, no one has previously considered the generalized application codesign problem, seeking a solution for a suite of programs.
There are multiple publications on codesign related to exascale computing, but they focus on different aspects. For instance, Dosanjh et al. [116] focus on methodological aspects of exploring the design space, including architectural testbeds, the choice of mini-applications to represent application codes, and tools. The ExaSAT framework [117] was developed to automatically extract parameterized performance models from source code using compiler analysis techniques. Performance analysis techniques and tools targeting exascale and codesign are discussed in [6].
Kuck et al. [118,119] analyze and model program hot-spots. They develop computational capacity models and propose an approach for the HW/SW codesign of computer systems. Hardware/software measurements of computational capacity (based on bandwidth usage) and power consumption (based on hardware counters) are used to find optimal solutions to various codesign problems and to evaluate codesign trade-offs. Their models are theoretical, however: they are illustrated by numerical examples and are not validated on real hardware.
Our work contributes to knowledge in the following ways:

The Unified View of the Polyhedral Design Landscape
We put together all the parameters to be considered for performance optimization in a single unified landscape (Chapter 2). The landscape (shown in Figure 2.1) considers program, architecture, and compiler parameters, and combines them with various cost metrics. This view lets us identify pockets of domain specificity and allows us to study performance improvement across all cost metrics.
Analytical Models
We develop analytical cost models (Chapter 3) for execution time, energy, memory accesses, and the silicon area of a chip. Our models are reasonably accurate and help predict the associated costs. We argue that these models can be used to break the HPC application performance-improvement cycle.
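To give a flavor of what such models look like, here is a back-of-the-envelope memory-access model for a tiled 2D stencil sweep. The halo width, problem sizes, and the assumption that every tile reloads its full footprint from DRAM are illustrative simplifications:

```python
# Sketch: estimate DRAM loads for one sweep of an n x n 2D stencil,
# tiled into tile x tile blocks, each loading its footprint plus a halo.
# Purely illustrative; a real model would account for caching and reuse.
import math

def tiled_stencil_dram_loads(n, tile, halo=1):
    """Estimated element loads for one tiled sweep (illustrative model)."""
    tiles_per_dim = math.ceil(n / tile)
    loads_per_tile = (tile + 2 * halo) ** 2  # tile footprint incl. halo
    return tiles_per_dim ** 2 * loads_per_tile

# Larger tiles amortize the redundant halo loads:
small = tiled_stencil_dram_loads(n=4096, tile=16)
large = tiled_stencil_dram_loads(n=4096, tile=64)
```

Even this crude model captures the qualitative trade-off that drives tile-size selection: halo overhead shrinks as tiles grow, until capacity constraints intervene.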

Mathematical Optimization Approach
We formulate mathematical optimization problems to address some of the challenges of exascale. We show how these optimizations can be used for performance tuning (Chapter 4) and accelerator codesign (Chapter 5). Using GPGPU stencil computations and a polyhedral code generator, we present a proof of concept [20] and a novel optimization approach to accelerator codesign.
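In the same spirit, though far simpler than the formulations in Chapters 4 and 5, choosing a tile size by minimizing a modeled cost subject to a hardware constraint can be sketched as follows. The cost model, the candidate tile sizes, and the 48 KB shared-memory capacity are illustrative assumptions:

```python
# Sketch: pick the tile size minimizing a modeled execution time,
# subject to a shared-memory capacity constraint. Cost model and
# capacity are made-up illustrations, not measured values.
SHARED_MEM_BYTES = 48 * 1024  # assumed per-block shared memory
BYTES_PER_ELEM = 8            # double precision

def modeled_time(tile):
    """Made-up cost: redundant halo loads vs. per-tile launch overhead."""
    halo_overhead = (tile + 2) ** 2 / tile ** 2  # halo width 1 assumed
    launch_overhead = 1.0 / tile                 # fewer, larger tiles are cheaper
    return halo_overhead + launch_overhead

def feasible(tile):
    """Tile (plus halo) must fit in shared memory."""
    return (tile + 2) ** 2 * BYTES_PER_ELEM <= SHARED_MEM_BYTES

candidates = [t for t in (8, 16, 32, 64, 128) if feasible(t)]
best_tile = min(candidates, key=modeled_time)
```

The actual formulations optimize several cost models simultaneously and over a whole suite of programs, but the constrained-minimization shape is the same.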

Limitations of our approach
Our approach is limited to the narrow area of domain-specific applications, the polyhedral model, and GPU-like programmable hardware accelerators. It can, however, be extended to other sets of programs, architectures, and transformations by identifying other domain-specific regions in the design landscape. More work is needed to extend our approach across different regions of domain specificity.

Open Questions
Among their many uses, the analytical cost models can be further explored to answer important performance-related questions. We list some of them below:
• Using the analytical execution time and energy models, we can find out: (i) what happens when the input parameters change; (ii) what happens when a different number of processors is used; (iii) what the largest possible problem size on a given architecture is; and (iv) when the efficiency drops.
• Silicon area models can be used, together with the other cost models, to answer performance-tuning questions such as: (i) the sensitivity of the optimal tile size to the problem size; (ii) the sensitivity of the optimal tile size across different codes; (iii) whether the folklore that optimizing for time equals optimizing for energy holds; and (iv) the reasons for poor performance.
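For instance, the question "What is the largest possible problem size on a given architecture?" reduces, under a simple memory-capacity model, to a one-line calculation. The 16 GB capacity and the two-buffer, double-precision layout are illustrative assumptions:

```python
# Sketch: largest square 2D stencil grid that fits in device memory,
# assuming double precision and two buffers (input + output).
# The 16 GB capacity is an illustrative assumption.
DEVICE_MEM_BYTES = 16 * 1024**3
BYTES_PER_ELEM = 8
NUM_BUFFERS = 2

def max_problem_side(mem_bytes=DEVICE_MEM_BYTES):
    """Side length of the largest n x n grid that fits in memory."""
    elems = mem_bytes // (BYTES_PER_ELEM * NUM_BUFFERS)
    return int(elems ** 0.5)

n = max_problem_side()  # side length of the largest feasible n x n grid
```

The other questions admit similarly mechanical answers once the relevant cost model is in hand.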
The answers to these questions are helpful in two situations: first, for performance portability when moving from one architecture to another; and second, for obtaining insights that point to promising areas for future research.