## Abstract

While most recent breakthroughs in scientific research rely on complex simulations carried out in large-scale supercomputers, the power draft and energy spent for this purpose is increasingly becoming a limiting factor to this trend. In this paper, we provide an overview of the current status in energy-efficient scientific computing by reviewing different technologies used to monitor power draft as well as power- and energy-saving mechanisms available in commodity hardware. For the particular domain of sparse linear algebra, we analyse the energy efficiency of a broad collection of hardware architectures and investigate how algorithmic and implementation modifications can improve the energy performance of sparse linear system solvers, without negatively impacting their performance.

## 1. Introduction

Linear algebra has been at the heart of the biannual Top500 (the Top500 List, http://www.top500.org/.) and Green500 (the Green 500 List, http://www.green500.org/.) rankings of supercomputers since their respective inception in 1993 and 2005. In particular, the Top500 list classifies computers according to their raw performance (in terms of GFLOPS, or millions of floating-point arithmetic operations per second) when solving a linear system of equations via the LU factorization.^{1} As a response to the increasing concern about the energy draw by high performance computing (HPC) facilities, the scientific computing community has welcomed the Green500 list, which puts the emphasis on power efficiency by using a performance–power ratio rather than using raw performance (concretely, the GFLOPS/W), where the performance is again based on the aforementioned HPL benchmark.

The egregious benefits that current petascale applications deliver, and those that are expected from the problems to be tackled in the upcoming exascale decade [1–4], motivate the construction of enormous and complex HPC systems. If we extrapolate the technologies underlying the current no. 1 in the Top500 and Green500 lists to build an exascale machine, the outcomes are systems that would dissipate 408 and 312 MW, respectively.

While HPL has long been criticized for not representing the actual performance of the applications currently running on HPC systems, the importance of linear algebra is far beyond just being at the base of this benchmark or modern candidates to replace it [5]. Indeed, the solution of dense and sparse linear systems is ubiquitous in scientific computing, being at the bottom of the ‘food chain’ for many computational applications. Therefore, any effort towards improving the performance and energy efficiency of this type of operation will surely have a strong positive impact on the countless numerical simulations that build upon them.

In this paper, we first provide a general overview of current research and developments on energy-aware HPC by reviewing software- and hardware-based power and energy evaluation methods, metrics used for quantifying the resource efficiency of numerical algorithms, and power-saving mechanisms integrated into hardware. We then specifically focus on sparse linear algebra, comparing the efficiency of different system architectures when solving sparse systems of linear equations using the conjugate gradient (CG) method, and then showing how to improve the energy efficiency of sparse linear system solvers based on this method on multicore and manycore architectures.

## 2. Energy-efficient computing

In everyday usage, there is often an inadequate distinction between the terms ‘power’ and ‘energy’, though the physical difference is significant. In particular, energy is the integral of power over time and, therefore, energy measures the amount of ‘work’ necessary to perform a task, whereas power is the rate at which work is performed and, therefore, the rate at which energy is used.

When talking about the resource demand of HPC facilities especially, the fundamental difference between power optimization and energy optimization is neglected. The term power optimization refers to adapting the power draft of the system running a specific application to some desired value, given, for example, by an upper bound or smooth transition between different values. These restrictions are in most cases motivated by the infrastructure.

When optimizing the system/application combination with respect to energy, the aim is instead to decrease the product of time and (average) power, which can be achieved by reducing execution time, diminishing average power draft or a combination of both. Thus, in some cases, it may be possible to attain a lower energy utilization at the expense of a higher power draft, or vice versa.

### (a) Power metrics

As infrastructure or user demands may impose restrictions on the power draft, new metrics accounting for the key factors of runtime, power draft and total energy consumption of a system/application combination are essential for optimization. While GFLOPS/W is the de facto standard to measure the energy efficiency of a computing system (the Green 500 List, http://www.green500.org/; the Top500 List, http://www.top500.org/), recent work [6] provides strong evidence that this metric tends to promote power-hungry algorithms that deliver high sustained performance, when, in reality, the objective should be the reduction of the total energy, preferably along with the minimization of the time-to-solution (TTS).

To avoid this effect, one can rely on the energy-delay product (EDP) [7], which combines both factors, TTS and energy, into a single figure of merit, thus promoting algorithms that are more power-friendly. The FFTSE [6] progresses further along this line by considering the product between the energy and a function *f* of TTS. Different cost functions for *f* thus lead to distinct cases: *f*(*t*)=1, with *t*=TTS, renders FTTSE as simply equivalent to the energy, whereas FTTSE is analogous to EDP when *f*(*t*)=*t*. More interestingly, the linear model *f*(*t*)=*α*⋅*t*, with *α* a scalar, allows one to weight (penalize) for time increases; and the exponential case *f*(*t*)=e^{α(t−τ)} exerts a stronger pressure when the TTS exceeds a certain threshold *τ*.

### (b) Technologies for power measurement

A common approach to measuring power and energy consumption is to leverage a hardware device, in general a wattmeter, connected between the electric power outlet and the power supply unit (PSU) of the system. Alternatively, the measurement point can be inside the system itself, connecting the hardware device to the lines that run from the PSU to the different components (motherboard, disk, network card, etc.). These power measurement devices can be mainly classified according to four criteria: nature, location (internal or external, as we just observed), accuracy and sampling rate.

*Nature*: some devices measure the active power, that is, average power consumption obtained from the energy measured during a second, whereas other devices instead provide the instantaneous electric power that is obtained by testing the current and voltage of the machine.

*Location*: the point of measurement has significant consequences. Internal supervision cards measure power consumption inside the machine (i.e. downstream from the PSU) unlike other devices which sample externally (upstream of the PSU). This implies that power measurements from such internal cards refer to direct current (DC) power and do not take into account the power wasted by the PSU. By contrast, external devices measure the overall alternating current (AC) power, and therefore include the PSU inefficiencies, which can represent a non-negligible portion of the total consumption of the machine. On the other hand, the external power meters, in general, operate at a lower sampling rate than their internal counterparts.

*Accuracy* and *sampling rate*: the variability of these two dimensions can be large: from devices that offer a measurement uncertainty of ±7 W and/or a sampling rate of 1 Hz (sample per second)—in general too large to capture power variations of few watts and/or a few seconds—to extremely accurate, though in general highly expensive, elements with error margins below ±0.1 W and sampling rates in the range 10 kHz and above.

Recent processor technologies are increasingly equipped with diverse types of counters that provide power measurements, either extracted from real data or estimated via a model. This is the case, for example, of the *running average power limit* (RAPL) model, available in the Intel Sandy Bridge microarchitecture; the *NVIDIA management library* (NVML) for graphics processors; and the *thermal power management device* (TPMD) and the *automated measurement of systems for temperature and energy reporting* (Amester) for IBM's Power 7 processors. Unfortunately, these technologies can be leveraged to measure only the power consumed by the processors, not the whole machine. Moreover, in some cases (e.g. RAPL), this is an intrusive solution as it is necessary to access the energy counters via a daemon program running on the same machine that is monitored, thus introducing unwanted noise into the measurement process.

The general conclusions from El Mehdi Diouri *et al*. [8] indicate that, thanks to their high sample rate, the internal measurement devices allow for better identification of some power fluctuations that the external wattmeters are not able to capture. However, a high sample rate is not always necessary to understand the evolution of the power consumption during the execution of a benchmark. Moreover, monitoring very large-scale distributed systems with internal wattmeters is not practicable, as the integration of those is usually not only expensive but also very difficult. For these reasons, other sources such as RAPL, NVML or TPMD are becoming very appealing for monitoring purposes.

### (c) Software for power measurement

An excellent survey on hardware and software for power profiling is given in [9]. Focusing on the software tools, PowerPack [10] is able to identify which lines feed different components such as disks, memory, network interface controllers, processors, etc. This information is then offered to the user who can gain insights on where and how applications consume power. PowerPack uses a user-friendly interface, and targets applications running on single-node platforms, though PowerPack's information can be ‘manually’ aggregated for parallel message passing interface (MPI) applications.

HDTrace [11] is a package to trace and simulate the power–performance footprint of MPI programs on a cluster. This software supports MPICH2 and the parallel virtual file system. Recently, it has also been extended to identify power hot spots in these applications, using information from commercial AC wattmeters.

*pmlib* [12] is a framework of tools used to analyse the power dissipation and the energy consumption of parallel MPI and/or multi-threaded scientific applications that comprise (i) a simple API (application programing interface) to interact with a number of power measurement devices, (ii) a library and a collection of drivers to capture the power dissipated during the execution of an application in a separate system, and (iii) an interface with integrated ` Extrae`+

`packages (the Paraver Project, http://www.bsc.es/plantillaA.php?cat_id=485.) that permits an interactive analysis of a graphical trace, relating the power dissipation per node/core to information on the core activity.`

*Paraver*### (d) Energy-efficient processor technology

Current processor technology uses tools and mechanisms originally designed for embedded systems and the mobile market to use energy more efficiently. As of today, most processors adhere to the advanced configuration and power interface (ACPI) standard [13], which defines a collection of states the system can be set to depending on the workload, and thus offers a tool to tune the power usage to the current needs.

From the processor point of view, ACPI defines the so-called performance states (or P-states), which define the operating voltage–frequency pair of the processor. Vendors customize their products to feature a varying number of P-states, depending on the type of application. Additionally, in some cases, the P-state can only be set for all cores in the socket, whereas others permit each core to operate in a different P-state. In the past few years, the cost and latency of transitioning between P-states has been decreasing, and the management of the P-states has been incorporated into the operating systems (e.g. via Linux governors). State P0 corresponds to the highest voltage–frequency pair and, therefore, the most power-consuming state. Subsequent states are enumerated as P1, P2, etc.

ACPI includes an additional energy-saving mechanism in the form of the processor or CPU power states (C-states). This tool defines how and when certain processor components can be switched off to improve energy conservation. In state C0, the processor is active. In ‘deeper’ states (C1, C1E, C2 and so on) the clock signal is shut down to certain components (e.g. the floating-point units or certain levels in the cache hierarchy). Thus, a deeper C-state can potentially yield higher energy savings, but the latency to transition to the active C0 state will also be longer. While the P-states can be operated by the user (e.g. via utilities such as cpufreq), for security, the C-states are only accessible to the hardware and the operating system. In this sense, the only possibility for the programmer to interact with these states is by setting the appropriate conditions in the application so that the operating system can promote the processor, when idle, to an energy efficient C-state. Section 3 offers two relevant examples of this.

## 3. Energy-efficient solution of sparse linear systems

### (a) Preliminaries

Sparse linear algebra is in many cases characterized by the solution of linear equations with sparse coefficient matrices. For this purpose, iterative methods are often preferred to direct solvers as they usually provide high accuracy solution approximations with a significantly lower computational effort [14,15]. Although a large variety of iterative methods exist, those based on Krylov subspaces^{2} have shown superior performance for many applications, therefore becoming the de facto standard in many sparse linear algebra software libraries.

The CG method, introduced by Hestenes & Stiefel in 1952 [15], is among the most effective Krylov subspace solvers for symmetric positive definite (SPD) systems. As all experiments in this section will be based on this method, we shortly recall the algorithm and its main characteristics next. For a linear system *Ax*=*b*, where is an SPD coefficient matrix, contains the independent terms and is the solution, the method is algorithmically formulated in figure 1, where the user-defined parameters *maxres* and *maxiter* set upper bounds, respectively, on the relative residual for the computed approximation to the solution *x*_{k} and the maximum number of iterations.

In practical applications, the computational cost of the CG method is dominated by the matrix–vector multiplication *z*_{k}:=*Ad*_{k}. Given a sparse matrix *A* with *n*_{z} non-zero entries, this operation roughly requires 2*n*_{z} floating-point arithmetic operations (flops). Additionally, the loop body contains several vector operations (for the updates of *x*_{k+1}, *r*_{k+1}, *d*_{k+1} and the computation of *ρ*_{k} and *β*_{k+1}) that cost *O*(*n*) flops each. A detailed derivation of the CG method can be found in [16], including a broad convergence analysis.

In the following sections, we apply this algorithm to generate a solution approximation to different linear systems where the SPD coefficient matrices are either chosen from the University of Florida Matrix Collection (UFMC),^{3} corresponding to finite-element discretizations of several structural problems arising in mechanics, or derived from a finite difference discretization of the three-dimensional Laplace problem. A few key parameters from these matrices are collected in table 1 while sparsity plots can be found online.^{2} In all cases, the vector of independent terms *b* is initialized so that all the entries of the solution vector *x* equal 1, and the CG iteration is started with the initial guess *x*_{0}≡0.

All experiments (except for those in §3*c*) were performed using IEEE single-precision (SP) arithmetic. While the use of double-precision (DP) arithmetic is in general mandatory for the solution of sparse linear systems, in [17] it is shown how the use of mixed SP–DP arithmetic, in combination with iterative refinement, leads to improved execution time and energy consumption for GPU-accelerated solver-platforms. The underlying idea of a hybrid SP–DP solver, with iterative refinement, is visualized in figure 2 as an iterative process where the DP solution approximation is updated by the solution of an error correction equation based on the residual and solved in SP. Note that a major conclusion from [17] is that, with this kind of solver, the impact on the global execution time and energy consumption by the parts that are performed in DP arithmetic is negligible. Therefore, it is natural to study the cost of the SP solver. Unfortunately, there exists only a DP version of the solver evaluated in §3*c* so that, for those experiments, we had to rely on the more conventional DP arithmetic.

### (b) Time–power–energy trade-offs

#### (i) General overview

Sparse linear algebra in general, and sparse linear system solvers in particular, are characterized by a high number of memory accesses compared with a low number of flops. Therefore, for many computing systems, the actual sustained performance peak that can be attained in a given platform is dictated by the bandwidth between the computational units and the level of the memory where the data resides. Hardware designs often try to account for this issue by either providing sophisticated memory hierarchies consisting of multiple cache levels, and/or data streaming mechanisms, such as those in GPUs [21].

Aiming for higher energy efficiency, many multicore systems allow for modification of the core clock rate (via the P-states). Although decreasing operating frequency for memory-bound operations may, in principle, yield lower energy usage while preserving performance, this is not true for all architectures, as processor frequency and memory throughput rate are sometimes coupled. Decreasing the number of working cores on multicore platforms (thus enabling the idle hardware to transition to a sleep C-state) may have a similar effect: if the memory bandwidth is independent of the number of active cores, for this type of computations, we may be able to decrease the power draft without impacting performance by deactivating a subset of the cores.

In summary, the key factor that determines the energy efficiency for sparse linear algebra applications is the underlying paradigm concerning the hardware complexity. An architecture composed of a few complex cores, featuring high clock frequencies and deep cache hierarchies, provides high floating-point performance and greatly benefits from data locality, at the price of a high power draft. On the other hand, a low-power processor consisting of simple cores and an unsophisticated memory hierarchy, operating at moderate frequencies, aims for a lower power draft at the expense of extended execution time. Both paradigms, though, have to compete with the data-streaming approach taken by the manycore GPUs.

#### (ii) Experimental analysis

Next, we summarize the key results and observations of our energy efficiency evaluation of the CG method for the solution of sparse linear equation systems on several architectures representative of today's HPC landscape [22]. Following a recent approach that aims at designing an exascale architecture from low-power and embedded components (the Mont Blanc Project, http://montblanc-project.eu), we included several low-power processors in the comparison. One crucial challenge in conducting a fair comparison is choosing an appropriate level of hardware-dependent optimizations to the code as, for example, different systems usually benefit from distinct layouts of the data structures. In particular, data-streaming hardware usually exhibits significant performance increases for coalesced memory access patterns. To account for this issue, we adopt the ELLPACK sparse matrix storage format for GPU systems, while we rely on the standard compressed sparse row (CSR) format for the multicore architectures [23,24]. More details on the implementation of the sparse matrix–vector (SpMV) kernels underlying the CG method, for the CSR and ELLPACKT formats, can be found in [22].

In the following experiments, we execute 1000 iterations of the CG solver on the systems listed along with their characteristics in table 2, for the subset of the UFMC test matrices introduced in table 1, which fit in the main memory of all architectures (bmw, crank and F1), as well as the Laplace A159 matrix. All power measurements are collected using a WattsUp?Pro wattmeter. In figure 3, we report the performance (figure 3*a*,*c*) and energy efficiency (figure 3*b*,*d*) for the different test platforms when optimizing the operating frequency for either runtime performance (figure 3*a*,*b*) or energy efficiency (figure 3*c*,*d*).

*Optimization with respect to time.* In terms of runtime performance, the Tesla K20 (kep) and the Tesla C2050 (fer) are superior to all other architectures by almost by an order of magnitude (see figure 3*a*). However, the performance of these two GPU-based systems strongly depends on the matrix sparsity pattern, a sensitivity that is also shared by the Quadro 1000M (qdr), which outperforms the C6678 DSP (tic) only for those cases where the use of the ELLPACK format does not incur significant overhead on the GPUs. While tic shows some variation on performance as well as the performance-per-watt ratio depending on the matrix structure, its energy efficiency is unmatched by any other architecture. The two other low-power processors, arm and Atom (iat), show better ratios than the conventional general-purpose processors (Interlagos ail, Magny-Cours amc, Nehalem inh, and Sandy-Bridge isb), but only for some matrix cases do they show higher energy efficiency than the GPUs. For fer and qdr, it is interesting to note that, although they yield very different performance, their performance-per-watt ratios are almost equal [22].

*Optimization with respect to the net energy efficiency.* If we optimize the configurations with respect to energy consumption, it is, in many cases, beneficial to reduce the operating voltage and frequency. Particularly, significant improvements can be observed for the low-power processors as these architectures provide not only higher optimization potential, but also exhibit superior energy efficiency compared with the conventional general-purpose CPUs or GPUs [22]. Again, the tic exhibits the best energy efficiency, followed by arm, with the latter still superior to the GPUs or the general-purpose CPUs (see figure 3*d*). iat is, in terms of energy efficiency, competitive with the fer and qdr GPU-based systems. Unfortunately, the decreased voltage and frequency, which results in the superior energy efficiency, also incurs an extended run time as the reduced watt-per-iteration values for tic, arm and iat come at the price of significantly higher execution times; see the performance decrease from figure 3*a*,*b* to figure 3*c*,*d*.

### (c) Leveraging the CPU states on manycore systems

The results in §3*b* illustrate that GPUs are among the most energy-efficient hardware architectures for sparse linear algebra. Despite the fact that these hardware accelerators rarely offer any user-controlled power-saving mechanism, their vast number of floating-point arithmetic units, in combination with an outstanding bandwidth to memory and a data-streaming architecture, results in highly favourable performance/watt ratios. However, as GPUs cannot be used as stand-alone systems, one also has to take the power draft of the host into account. In the end, the energy efficiency depends on the performance/watt ratio of the GPU and the question becomes how to save power for the host.

In this section, we show how to leverage the CPU states in order to improve the energy balance of a GPU-based iterative solver. For this purpose, we evaluate different implementations of the CG method on a platform equipped with both a multicore processor and a GPU. First, we consider a straightforward CG-CUBLAS implementation that consists of individual routines that are either taken from the CUBLAS library or self-written kernels, reflecting specific numerical operations of the method (compare the left-hand side implementation in figure 5 with algorithm 1). This natural approach has not only the advantage of the clear representation of the mathematical description of the method, but also provides the possibility of replacing individual kernels for alternative codes, and, for these reasons, will later be used as a reference implementation.

One major disadvantage of this initial implementation becomes obvious when monitoring the power draft of the CPU, which is in charge of launching the GPU kernels for the vector- and matrix–vector operations. Counter to what one might expect, the CPU does not enter a power-efficient state during idle periods, but instead remains in a power-oblivious polling loop waiting for the GPU kernel to finish. In past work, we leveraged an idle-wait approach [25] to avoid this behaviour, though the same effect can be achieved by running the code in the CUDA blocking mode, which suspends the CPU thread when waiting for the GPU to finish a task. While the suspended threads of the CPU can now be promoted by the hardware into an inactive power-friendly C-state, e.g. C1 (halt) or deeper [13], the reactivation cost may cause some runtime overhead; see figure 4, which illustrates the power benefits and runtime overhead of the blocking mode for the CG-CUBLAS implementation applied to the audi test matrix. Hence, the generally superior power efficiency of the CUDA blocking mode comes along with a certain price in terms of performance, which may negatively affect energy consumption. As the runtime is, in general, regarded as the crucial figure, the challenge is thus to recover the performance of the polling mode while leveraging the power-saving mechanisms embedded in the hardware by operating in blocking mode. The key to this lies in developing a new implementation of the CG method that breaks down the rigid structure of the original code, so that several kernels can be gathered and merged together, with the result offering a higher computational intensity.

Let us reconsider the loop of the original CG-CUBLAS implementation, in the left-hand side of figure 5, corresponding to algorithm 1. We can observe that the loop body comprises a number of calls to CUBLAS kernels inside a loop that can be potentially invoked from hundreds to thousands of times. This large number of kernel calls and the frequent data transfers between GPU and main memory are major sources of overhead. Reducing the number of calls by merging kernels and avoiding synchronization owing to data transfers are both crucial to alleviating the time overhead, but can also render a lower number of activations of the CPU, yielding a more efficient usage of the C-states and, consequently, improved energy efficiency. (We note that this technique can be applied not only to the CG method but, in general, the same approach can be applied in many other codes that off-load most of the computations to the accelerator as a sequence of short GPU kernels.)

In [26], we describe how to decrease the kernel count by gathering multiple operations into a single GPU kernel, and how to reduce the host-device communication by keeping the outcome from reduction operations on the GPU. The resulting implementation, hereafter referred to as ‘merge’, is shown in the right-hand side of figure 5. An enhanced version, ‘merge-*s*’, that checks the convergence criterion not every single but every *s* iterations of the CG loop, is also proposed in [26]. While, on average, this technique incurs *s*/2 additional iterations of the solver, it will also decrease the transfers from GPU to main memory as (11/*s*) of the transfers/synchronization are economized.

The benefits of the modifications visualized in figure 5 are summarized in figure 6 for all test matrices listed in table 1, showing the average variation of runtime (figure 6*a*) and CPU energy (figure 6*b*) in polling mode (left) and blocking mode (right), with respect to the CG-CUBLAS reference implementation running in polling mode. The target platform for this experiment was equipped with an Intel Core i7-3770K (four cores at 3.5 GHz) and 16 GB of RAM, connected to an NVIDIA GeForce GTX480 ‘Fermi’ GPU. Power was measured using an internal DC wattmeter connected to the 12 V lines that run from the PSU to the motherboard. For further details about the experiment set-up and the results, see [26].

Concerning the performance objective, we deduce from the left-hand side of figure 6*a* that the merged implementations running in polling mode outperform the reference polling-CUBLAS implementation. The blocking mode may still have a negative impact on the execution time, and in the worst case we observe an increase of about 10%. However, this only occurs for very specific cases where the time for the matrix and vector operations is short compared with the cost of reactivating CPU cores, whereas, on average, the performance of the reference CUBLAS implementation can be matched. When the blocking-merge implementation is slower than the polling-CUBLAS code, shifting to blocking-merge-10 compensates for this, making the difference negligible. In all cases, the blocking-merge-10 implementation exhibits higher performance gains with respect to the reference than the blocking-merge code. In summary, figure 6*a* demonstrates that the performance goal is fulfilled as the merge-10 implementation in blocking mode matches or even outperforms CG-CUBLAS in polling mode.

Figure 6*b* illustrates the CPU energy consumption of the merged implementations, compared against those of the CG-CUBLAS code. While the merged variants executed in polling mode (left-hand side) may, on average, render some appealing savings owing to the reduced execution time, the clear winners are the same variants operating in blocking mode (right-hand side), which yield a reduction of the CPU energy draft of about 25–30%. Hence, the reformulated CG meets the twofold goal of rendering a significant reduction of the total energy consumption while retaining the runtime performance.

### (d) Leveraging the CPU states on multicore systems

Here, we revisit the solution of sparse linear systems via ILUPACK (incomplete LU package^{4}), a numerical software based on Krylov iterative methods. The package implements multilevel ILU preconditioners for general, symmetric indefinite and Hermitian positive definite systems, in combination with inverse-based ILUs and Krylov subspace solvers.

The parallelization of ILUPACK is analysed in [27], where the task concurrency intrinsic to the preconditioned iterative solution of the sparse linear systems is efficiently exploited on multicore architectures. Specifically, in this approach, parallelism is exposed by means of nested dissection applied to the adjacency graph representing the non-zero connectivity of the sparse coefficient matrix of the linear system. By recursively splitting this graph, the result is a hierarchy of independent subgraphs that is highly amenable to parallelization.

This type of concurrency can be expressed as a (directed) task dependency tree, where nodes represent concurrent tasks and arcs specify dependencies among them. The parallel execution of this tree on multicore processors is then orchestrated by a runtime which dynamically maps tasks to threads (cores) in order to improve load balance requirements during the computation of the ILU preconditioner and the subsequent solution of the triangular linear systems.

From the implementation point of view, the runtime keeps a shared queue of ready tasks (i.e. tasks with their dependencies fulfilled), which is initialized with the tasks corresponding to the independent subgraphs (leaves of the tree). Processor threads update this queue as they complete the execution of tasks allocated to them. The access to shared data structures is carefully controlled via low-cost synchronization primitives.

In the following, we evaluate the exploitation of the C-states made by two implementations of the runtime underlying ILUPACK. For this purpose, we use a scalable SPD sparse linear system of dimension *n*=*N*^{3}, resulting from the finite difference discretization of the partial differential equation −*Δu*=*f* in a three-dimensional unit cube *Ω*=[0,1]^{3}, with Dirichlet boundary conditions *u*=*g* on ∂*Ω*, using a uniform mesh of size *h*=1/(*N*+1). We set *N*=252, which yields a linear system with about 16×10^{6} unknowns and 111×10^{6} non-zero entries in the coefficient matrix. The tests reported next were performed using IEEE DP arithmetic. The target server contains a single AMD Opteron 6128 processor (eight cores) and 24 GB of RAM. This processor features three operating or C-states (C0, C1 and C1E). Samples were collected in this case using an internal wattmeter. Unfortunately, there exists no version of ILUPACK using SP computations, so we have to use DP for the experiments in this section.

In the previous paragraphs, we exposed how the task-parallel calculation of the preconditioner in ILUPACK is organized as a binary tree and bottom-up dependencies. The subsequent iterative process basically requires the solution of two triangular linear systems per iteration, with tasks that are also organized as binary task trees, but with bottom-up (lower triangular system) or top-down (upper triangular system) dependencies. When these tasks are dynamically mapped to a multicore platform by the runtime, the execution can be expected to exhibit a number of time periods during which certain cores are idle, depending on the number of tasks of the tree, their computational complexity, the number of cores of the system, etc. As in the previous case, where we outsourced computations to the accelerator (see §3*c*), it is basically these idle periods that we could expect the operating system to leverage, by promoting the corresponding cores into a power-saving C-state (sleep mode).

In the original runtime underlying ILUPACK, upon encountering no tasks ready to be executed, ‘idle’ threads simply perform a ‘busy-wait’ (polling) on a condition variable, until a new task is available. This strategy prevents the operating system from promoting the associated cores into a power-saving C-state, because the threads are not idle but doing useless work by actively waiting for tasks.

As an alternative to the power-hungry strategy, a power-aware version of the runtime underlying ILUPACK can apply an ‘idle-wait’ (blocking) whenever a thread does not encounter a task ready for execution and, thus, becomes inactive. Note that, as previously mentioned in §2*d*, setting the necessary conditions for the operating system to promote the cores into a power-saving C-state is all we can do, as we cannot explicitly enforce the transition from the user side.

The question is whether the use of the power-aware runtime comes with a performance penalty which masks the energy benefits. Table 3 compares the two runtimes, showing that this is not the case for both the computation of the preconditioner and iterative solution. Indeed, for a negligible increase in the total execution time, from 286.28 to 287.91 s, we observe reductions in the (average) power from 240.17 to 227.16 W for the preconditioner; and from 269.27 to 230.80 W for the solver. The outcome is a more than 10% decrease of the total energy from 75 163.74 to 67 758.84 J.

## 4. Concluding remarks

In this paper, we have reviewed the current status in energy-efficient HPC, focusing the analysis in the particular domain of sparse linear algebra. Specifically, we have experimentally compared different architectures using the solution of sparse linear systems via the CG method as the key benchmark. Furthermore, we have identified current challenges that stand in the path to effectively lower energy utilization, and we have also assessed the potential of energy-aware implementations. One key observation of our studies is that the energy efficiency of linear system solvers can be improved, without impacting runtime performance, by carefully adapting the algorithm's implementation to the target hardware. Doing so allows for better exploitation of the hardware-featured power-saving techniques. As a more general conclusion of this work, we believe that, in the context of the energy challenge of exascale computing, hand-in-hand development of computing architectures and simulation algorithms is crucial for achieving efficient resource utilization.

## Acknowledgements

Enrique S. Quintana-Ortí was supported by project CICYT TIN2011-23283 of the Ministerio de Ciencia e Innovación and FEDER and the EU Project FP7 318793 ‘EXA2Green’. This work summarizes some key insights gained from a collaboration during the last few years. We thank the long list of colleagues that were involved in this cooperation.

## Footnotes

One contribution of 14 to a Theme Issue ‘Stochastic modelling and energy-efficient computing for weather and climate prediction’.

↵1 High Performance Linpack (HPL, http://www.netlib.org/benchmark/hpl/).

↵2 While, in principle, Krylov subspace methods belong to the class of direct solvers, as they approximate the solution in a subspace expanded in every iteration by an additional search direction until the complete space is generated, they are almost always used as iterative solvers providing a very good approximation of the solution after a few iterations.

- © 2014 The Author(s) Published by the Royal Society. All rights reserved.