A ‘Cool’ Load Balancer for Parallel Applications

Osman Sarood
Dept. of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801, USA
sarood1@illinois.edu

Laxmikant V. Kale
Dept. of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801, USA
kale@illinois.edu

ABSTRACT
Meeting the power requirements of the huge exascale machines of the future will be a major challenge. Our focus in this paper is on minimizing cooling power, and we propose a technique that uses a combination of DVFS and temperature-aware load balancing to constrain core temperatures as well as save cooling energy. Our scheme is specifically designed to suit parallel applications, which are typically tightly coupled. The temperature control comes at the cost of execution time, and we try to minimize the timing penalty.

We experiment with three applications (with different power utilization profiles), run on a 128-core (32-node) cluster with a dedicated air conditioning unit. We calibrate the efficacy of our scheme based on three metrics: ability to control average core temperatures thereby avoiding hot spot occurrence, timing penalty minimization, and cooling energy savings. Our results show cooling energy savings of up to 57% with a timing penalty of 19%.

1. INTRODUCTION
Cooling energy is a substantial part of the total energy spent by a High Performance Computing (HPC) computer room or a data center. According to some reports, this can be as high as 50% [21], [9], [19] of the total energy budget. It is deemed essential to keep the computer room adequately cold in order to prevent processor cores from overheating beyond their safe thresholds. For one thing, continuous operation at higher temperatures can permanently damage processor chips. Also, processor cores operating at higher temperatures consume more power while running identical computations at the same speeds, due to the positive feedback loop between temperature and power [8].

Cooling is therefore needed to dissipate the energy consumed by a processor chip, and thus to prevent overheating of the core. This consumed energy has a static and dynamic component. The dynamic component increases as the cube of the frequency at which the core is run. Therefore, an alternative way of preventing overheating is to reduce the frequency. Modern processors and operating systems support such frequency control (e.g., DVFS). Using such controls, it becomes possible to run a computer in a room with high ambient temperature, by simply reducing frequencies whenever temperature goes above a threshold.
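
As a rough worked example of the cubic relationship, assume the dynamic component scales as $f^3$ (the symbol $P_{\text{dyn}}$ is shorthand introduced only for this illustration). Dropping from 2.4GHz to 1.2GHz, the highest and lowest frequency levels of the testbed described later, would then cut the dynamic component to about one eighth:

$$\frac{P_{\text{dyn}}(1.2\,\text{GHz})}{P_{\text{dyn}}(2.4\,\text{GHz})} = \left(\frac{1.2}{2.4}\right)^3 = \frac{1}{8}$$

This estimate covers only the dynamic component and says nothing about the static one.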

However, this method of temperature control is problematic for HPC applications, which are tightly coupled. If only one of the cores is slowed down by 50%, the entire application will slow down by 50% due to dependencies between computations on different processors. This is further exacerbated when the dependencies are global, such as when global reductions are used at a high frequency. Since individual processors may overheat at different rates and at different points in time, and since physical aspects of the room and the air flow may create regions which tend to be hotter, the situation where only a small subset of processors is operating at a reduced frequency will be quite common. For HPC applications, this method is therefore not suitable as it is.
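
The effect of a single throttled core can be illustrated with a minimal sketch. It assumes an idealized bulk-synchronous iteration in which every core must wait for the slowest one; the function name `iteration_time`, the eight cores and the equal work split are hypothetical and chosen only for illustration.

```python
def iteration_time(work_per_core, freq_ghz):
    """Time of one tightly coupled iteration: all cores wait for the slowest."""
    return max(w / f for w, f in zip(work_per_core, freq_ghz))

work = [1.0] * 8                # equal work on 8 hypothetical cores
full = [2.4] * 8                # every core at full frequency
one_slow = [1.2] + [2.4] * 7    # a single core throttled to half frequency

# Ratio of iteration times with and without the throttled core.
slowdown = iteration_time(work, one_slow) / iteration_time(work, full)
print(slowdown)
```

In this idealized model the whole iteration takes twice as long even though seven of the eight cores still run at full speed.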

The question we address in this paper is whether we can substantially reduce cooling energy without a significant timing penalty. Our approach involves a temperature-aware dynamic load balancing strategy. In some preliminary work presented at a workshop [18], we have shown the feasibility of the basic idea in the context of a single eight-core node. The contributions of this paper include the development of a scalable load-balancing strategy demonstrated on a 128-core machine, in a controlled machine room, and with explicit power measurements. Via experimental data, we show that cooling energy can be reduced by up to 57% with a timing penalty of 19%.

We begin in Section 2 by introducing the frequency control method, and documenting the timing penalty it imposes on HPC applications. In Section 3, we describe our temperature-aware load balancer. It leverages object-based overdecomposition and the load-balancing framework in the Charm++ runtime system. Section 4 outlines the experimental setup for our work. We then describe (Section 5) performance data to show that, with our strategy, the temperatures are retained within the requisite limits, while the timing penalties are small. Some interesting issues that arise in understanding how different applications react to temperature control are analyzed in Section 6. Section 7 undertakes a detailed analysis of the impact of our strategies on machine energy and cooling energy. Section 8 summarizes related work and sets our work in its context, which is followed by a summary in Section 9.
2. CONSTRAINING CORE TEMPERATURES

Unrestrained, core temperatures can soar very high. The most common way to deal with this in today’s HPC centers is through the use of additional cooling arrangements. But as we have already mentioned, cooling itself accounts for around 50% [21, 3, 19] of the total energy consumption of a data center, and this can rise even higher with the formation of hot spots. To motivate the technique of this paper, we start with a study of the interactions of core temperatures in parallel applications with the cooling settings of their surroundings. We run Wave2D, a finite differencing application, for ten minutes on 128 cores in our testbed. We provide more details of the testbed in Section 4. The cooling knob in this experiment was controlled by setting the computer room air conditioning (CRAC) to different temperature settings. Figure 1 shows the average core temperatures and the maximum difference of any core from the average for these runs. We used DVFS to keep core temperatures under 44°C by periodically checking core temperatures and reducing the frequency by one level whenever a core got hot. The experiment was repeated for five different CRAC set points. The results in Figure 2 show the normalized execution time and machine energy. Normalization is done with respect to the run where all cores run at full frequency without DVFS. The high timing penalty (seen in Figure 2), coupled with an increase in machine energy, makes it infeasible for the HPC community to use such a technique. Now that we have established that DVFS on its own cannot efficiently control core temperatures without incurring unacceptably high timing penalties, we propose our approach to ameliorate the deficiencies of using DVFS without load balancing.

3. TEMPERATURE AWARE LOAD BALANCING

In this section, we propose a novel technique based on task migration that can efficiently control core temperatures and simultaneously minimize the timing penalty. In addition, it also ends up saving total energy. Although our technique should work well with any parallel programming language which allows object migration, we chose Charm++ for our tests and implementation because it allows simple and straightforward task migration. We introduce Charm++, followed by a description of our temperature aware load balancing technique.

3.1 Charm++

Charm++ is a parallel programming runtime system that works on the principle of processor virtualization. It provides a methodology where the programmer divides the program into small computations (objects or tasks) which are distributed amongst the available processors by the runtime system [5]. Each of these small computations is a migratable C++ object that can reside on any processor. The runtime keeps track of the execution time for all these tasks and logs them in a database which is used by a load balancer. The aim of load balancing is to ensure equal distribution of computation and communication load amongst the processors. Charm++ uses the load balancing database to keep track of how much work each task is doing. Based on this information, the load balancer in the runtime system determines if there is a load imbalance and, if so, migrates objects from an overloaded processor to an underloaded one [26]. The load balancing decision is based on the principle of persistence heuristic, according to which computation and communication loads tend to persist with time for a certain class of iterative applications. Charm++ load balancers have proved to be very successful with iterative applications such as NAMD [14].
Algorithm 1 Temperature Aware Refinement Load Balancing

1: At node $i$ at start of step $k$
2: if $T_i^k > T_{\text{max}}$ then
3:   $\text{decreaseOneLevel}(C_i)$ // reduce by 0.13GHz
4: else
5:   $\text{increaseOneLevel}(C_i)$ // increase by 0.13GHz
6: end if
7: At Master core
8: for each task $i \in \{1, \ldots, n\}$ do
9:   $\text{ticks}^i_{k-1} = \text{taskTime}^i_{k-1} \times f^{m^i_{k-1}}_{k-1}$
10:   $\text{totalTicks} = \text{totalTicks} + \text{ticks}^i_{k-1}$
11: end for
12: for each core $i \in \{1, \ldots, p\}$ do
13:   $\text{ticks}^i_{k-1} = \text{coreTime}^i_{k-1} \times f^i_{k-1}$
14:   $\text{freqSum} = \text{freqSum} + f_k^i$
15: end for
16: $\text{createOverHeapAndUnderSet}()$
17: while $\text{overHeap}$ NOT NULL do
18:   $\text{donor} = \text{overHeap}\text{->}\text{deleteMaxHeap}()$
19:   $(\text{bestTask}, \text{bestCore}) = \text{getBestCoreAndTask}(\text{donor}, \text{underSet})$
20:   $m_k^{\text{bestTask}} = \text{bestCore}$
21:   $\text{ticks}_{\text{donor}} = \text{ticks}_{\text{donor}} - \text{bestSize}$
22:   $\text{ticks}_{\text{bestCore}} = \text{ticks}_{\text{bestCore}} + \text{bestSize}$
23:   $\text{updateHeapAndSet}()$
24: end while
25: procedure isHeavy($i$)
26:   return $\text{ticks}^i_{k-1} > (1 + \text{tolerance}) \times \text{totalTicks} \times f_k^i / \text{freqSum}$
27: procedure isLight($i$)
28:   return $\text{ticks}^i_{k-1} < \text{totalTicks} \times f_k^i / \text{freqSum}$

3.2 Refinement based temperature aware load balancing

We now describe our refinement based, temperature-aware load balancing scheme, which combines DVFS with intelligent load balancing of tasks according to frequencies in order to minimize the execution time penalty. The general idea is to let each core work at the maximum possible frequency as long as it is within the maximum temperature threshold. Currently, we use DVFS on a per-chip instead of a per-core basis, as the hardware did not allow us to do otherwise. When we change the frequency of all the cores on the chip, the core input voltage also drops, resulting in power savings. This raises a question: what condition should trigger a change in frequency? In our earlier work [18], we did DVFS when any core on a chip crossed the temperature threshold. But our recent results show that basing the DVFS decision on the average temperature of the chip provides better temperature control. Another important decision is how much the frequency should be lowered in case a chip exceeds the maximum threshold. Modern day processors come with a set of frequencies (frequency levels) at which they can operate. Our testbed had 10 different frequency levels from 1.2GHz to 2.4GHz (each step differs by 0.13GHz). In our scheme, we change the frequency by only one level at each decision time.
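
The per-chip decision rule can be sketched in a few lines. This is only an illustration of the logic described above, not the implementation: the names `LEVELS_GHZ` and `next_level` are hypothetical, the levels are assumed to be evenly spaced between 1.2GHz and 2.4GHz, and the actual frequency change (done through `cpufreq` on the testbed) is not shown.

```python
NUM_LEVELS = 10
F_MIN_GHZ, F_MAX_GHZ = 1.2, 2.4

# Assumed evenly spaced levels; the spacing works out to roughly 0.13GHz.
LEVELS_GHZ = [F_MIN_GHZ + i * (F_MAX_GHZ - F_MIN_GHZ) / (NUM_LEVELS - 1)
              for i in range(NUM_LEVELS)]

def next_level(level, core_temps_c, t_max_c):
    """Return the new frequency level index for one chip.

    The decision uses the average temperature of the chip's cores and
    moves by at most one level, clamped to the available range.
    """
    avg = sum(core_temps_c) / len(core_temps_c)
    if avg > t_max_c:
        return max(level - 1, 0)            # too hot: one level down
    return min(level + 1, NUM_LEVELS - 1)   # otherwise: one level up

hot = next_level(9, [45.0, 46.0, 44.0, 47.0], 44.0)    # average 45.5, above 44
cool = next_level(8, [40.0, 41.0, 42.0, 43.0], 44.0)   # average 41.5, below 44
```

Note that a single hot core does not trigger a step down here as long as the chip average stays at or below the threshold, which is the difference from the earlier per-core trigger.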

The pseudocode for our scheme is given in Algorithm 1 with the descriptions of variables and functions given in Table 1. The application specifies a maximum temperature threshold and a time interval at which the runtime periodically checks the temperature and determines whether any node has crossed that threshold. The variable $k$ in Algorithm 1 refers to the interval number the application is currently in. Our algorithm starts with each node computing the average temperature for all cores present on it, i.e., $T_i^k$. Once the average temperature has been computed, each node compares it against the maximum temperature threshold ($T_{\text{max}}$). If the average temperature is greater than $T_{\text{max}}$, all cores on that chip shift one frequency level down. However, if the average temperature is less than $T_{\text{max}}$, we increase the frequency level of all the cores on that chip (lines 1-6). Once the frequencies have been changed, we need to take into account the speed differential with which each core can execute instructions. We start by gathering the load information from the load balancing database for each core and task. In Charm++, this load information is maintained in milliseconds. Hence, in order to neutralize the frequency difference amongst the loads of each task and core, we convert the load times into clock ticks by multiplying the load for each task and core by the frequency at which it was running (lines 8-15). It is important to note that without this conversion, it would be incorrect to compare the loads, and load balancing would result in inefficient schedules. Even with this conversion, the calculations are not completely accurate, but they give much better estimates. We also compute the total number of ticks required for all the tasks (line 10) for calculating the weighted averages according to the new core frequencies. Once the ticks are calculated, we create a max heap, i.e., overHeap, for overloaded cores and a set, i.e., underSet, for underloaded cores (line 16). The categorization of overloaded and underloaded cores is done by the isHeavy and isLight procedures (lines 25-28). A core is overloaded if its currently assigned ticks are greater than what it should be assigned, i.e., a weighted average of totalTicks according to the core's new frequency (line 26). Notice the $1+\text{tolerance}$ factor in the expression at line 26. We use this in order to do refinement only for cores that are overloaded by some considerable margin. We set tolerance to 0.03 for all our experiments. This means that a core is considered to be overloaded if its currently assigned ticks are greater than its average weighted ticks by a factor of 1.03. A similar check is in place for the isLight procedure (lines 27-28), but there we do not include tolerance as it does not matter.
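
The tick conversion and the two classification checks can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the millisecond-to-second scaling in `to_ticks` is added here only so that the result is in clock ticks (a constant factor that does not affect any comparison).

```python
TOLERANCE = 0.03  # value reported for the experiments

def to_ticks(time_ms, freq_hz):
    """Convert a load measured in milliseconds into clock ticks."""
    return time_ms * 1e-3 * freq_hz

def target_ticks(total_ticks, new_freq_hz, freq_sum_hz):
    """Frequency-weighted share of the total work for one core."""
    return total_ticks * new_freq_hz / freq_sum_hz

def is_heavy(core_ticks, total_ticks, new_freq_hz, freq_sum_hz, tolerance=TOLERANCE):
    """Overloaded: more than (1 + tolerance) times the weighted share."""
    return core_ticks > (1 + tolerance) * target_ticks(total_ticks, new_freq_hz, freq_sum_hz)

def is_light(core_ticks, total_ticks, new_freq_hz, freq_sum_hz):
    """Underloaded: less than the weighted share (no tolerance)."""
    return core_ticks < target_ticks(total_ticks, new_freq_hz, freq_sum_hz)

# Hypothetical example: two cores each ran 100 ms at 2.4GHz in the last step;
# core 1 has now been throttled to 1.2GHz.
ticks = [to_ticks(100.0, 2.4e9), to_ticks(100.0, 2.4e9)]
new_freq = [2.4e9, 1.2e9]
total, freq_sum = sum(ticks), sum(new_freq)

core0_light = is_light(ticks[0], total, new_freq[0], freq_sum)
core1_heavy = is_heavy(ticks[1], total, new_freq[1], freq_sum)
```

In this example the throttled core is classified as overloaded and the full-speed core as underloaded, even though both spent the same wall-clock time computing in the previous step.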

Table 1: Description of variables used in Algorithm 1

<table>
<thead>
<tr>
<th>Variable</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>$n$</td>
<td>number of tasks in application</td>
</tr>
<tr>
<td>$p$</td>
<td>number of cores</td>
</tr>
<tr>
<td>$T_{\text{max}}$</td>
<td>maximum temperature allowed</td>
</tr>
<tr>
<td>$C_i$</td>
<td>set of cores on same chip as core $i$</td>
</tr>
<tr>
<td>$\text{taskTime}_k^i$</td>
<td>execution time of task $i$ during step $k$ (in ms)</td>
</tr>
<tr>
<td>$\text{coreTime}_k^i$</td>
<td>time spent by core $i$ executing tasks during step $k$ (in ms)</td>
</tr>
<tr>
<td>$f_k^i$</td>
<td>frequency of core $i$ during step $k$ (in Hz)</td>
</tr>
<tr>
<td>$m_k^i$</td>
<td>core number assigned to task $i$ during step $k$</td>
</tr>
<tr>
<td>$\text{ticks}_k^i$</td>
<td>num. of ticks taken by $i$th task/core during step $k$</td>
</tr>
<tr>
<td>$T_i^k$</td>
<td>average temperature of node $i$ at start of step $k$ (in °C)</td>
</tr>
<tr>
<td>$\text{overHeap}$</td>
<td>heap of overloaded cores</td>
</tr>
<tr>
<td>$\text{underSet}$</td>
<td>set of underloaded cores</td>
</tr>
<tr>
<td>$p_k$</td>
<td>$(f_1^k, f_2^k, \ldots, f_n^k)$</td>
</tr>
</tbody>
</table>

Once the max heap for the overloaded cores and the set for underloaded cores are ready, we start with the load balancing. We pop the max element (the core with the maximum number of ticks) out of `overHeap` (referred to as `donor`). Next, we call the procedure `getBestCoreAndTask()` which selects the best task to donate to the best underloaded core. The `bestTask` is the largest task currently assigned to `donor` such that it does not overload a core from the `underSet`. The `bestCore` is the one which will remain underloaded after being assigned the `bestTask`. After determining the `bestTask` and `bestCore`, we do the migration by recording the task mapping (line 20) and updating the `donor` and `bestCore` with the number of ticks in `bestTask` (lines 21-22). We then call `updateHeapAndSet` (line 23), which rechecks the `donor` for being overloaded. If it is, we reinsert it into `overHeap`. It also checks `donor` for being underloaded so that it is added to the `underSet` in case it has ended up with too little load. This completes the migration of one task from an overloaded core to an underloaded core. We repeat this procedure until `overHeap` is empty. It is important to notice that the value of tolerance can affect the overhead of our load balancing. If that value is too large, it might ignore load imbalance, whereas if it is too small, it can result in a lot of overhead for object migration. We have noticed that any value from 0.01 to 0.05 performs equally well.
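
The migration loop described above can be sketched as a small, self-contained function. This is an illustrative reading rather than the Charm++ implementation: the function name `refine` and its dictionary-based inputs are hypothetical, a task is assumed to "fit" on an underloaded core if that core does not become overloaded afterwards, and ties between candidate cores are broken by picking the relatively least loaded one.

```python
import heapq

def refine(task_ticks, assignment, new_freq, tolerance=0.03):
    """Move tasks from overloaded to underloaded cores.

    task_ticks: {task: ticks}, assignment: {task: core}, new_freq: {core: Hz}.
    Returns a new {task: core} mapping.
    """
    assignment = dict(assignment)
    total = sum(task_ticks.values())
    freq_sum = sum(new_freq.values())
    target = {c: total * f / freq_sum for c, f in new_freq.items()}
    load = {c: 0.0 for c in new_freq}
    for t, c in assignment.items():
        load[c] += task_ticks[t]

    def heavy(c):
        return load[c] > (1 + tolerance) * target[c]

    def light(c):
        return load[c] < target[c]

    # Max heap of overloaded cores (negated loads) and set of underloaded cores.
    over = [(-load[c], c) for c in new_freq if heavy(c)]
    heapq.heapify(over)
    under = {c for c in new_freq if light(c)}

    while over:
        _, donor = heapq.heappop(over)
        best = None
        donor_tasks = sorted((t for t, c in assignment.items() if c == donor),
                             key=lambda t: task_ticks[t], reverse=True)
        for t in donor_tasks:  # largest task first
            fits = [c for c in under
                    if load[c] + task_ticks[t] <= (1 + tolerance) * target[c]]
            if fits:
                best = (t, min(fits, key=lambda c: load[c] / target[c]))
                break
        if best is None:
            continue  # nothing can be moved off this donor
        task, core = best
        assignment[task] = core
        load[donor] -= task_ticks[task]
        load[core] += task_ticks[task]
        if not light(core):
            under.discard(core)
        if heavy(donor):
            heapq.heappush(over, (-load[donor], donor))
        elif light(donor):
            under.add(donor)
    return assignment

# Hypothetical example: six equal tasks, three per core; core 1 is throttled.
task_ticks = {t: 1.0 for t in range(6)}
old = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
result = refine(task_ticks, old, {0: 2.4e9, 1: 1.2e9})
tasks_on_core0 = sum(1 for c in result.values() if c == 0)
```

In the example, one task moves from the throttled core to the full-speed core, giving a 4:2 split that matches the 2:1 frequency ratio. The loop terminates because a receiving core never becomes overloaded and each migration strictly reduces the donor's load.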

4. EXPERIMENTAL SETUP

The primary objective of this work is to constrain core temperature and save energy spent on cooling. Our scheme ensures that all the cores stay below a user-defined maximum threshold. We want to emphasize that all results reported in this work are actual measurements and not simulations. We have used a 160-core (40-node, single-socket) testbed equipped with a dedicated CRAC. Each node is a single-socket machine with an Intel Xeon X3430 chip. It is a quad-core chip supporting 10 different frequency levels ranging from 1.2GHz to 2.4GHz. We use 128 cores out of the 160 cores available for all the runs that we report. All the nodes run Ubuntu 10.04 and we use the `cpufreq` module in order to use DVFS. The nodes are interconnected using a 48-port gigabit ethernet switch. We use the Liebert Power unit installed with the rack to get power readings for the machines.

The CRAC in our testbed is an air cooler that uses centrally chilled water for cooling the air. The CRAC manipulates the flow of chilled water to achieve the temperature set point prescribed by the operator. The exhaust air ($T_{\text{hot}}$), i.e., the hot air coming in from the machine room, is compared against the set point and the flow of the chilled water is adjusted accordingly to cover the difference in the temperatures. This model of cooling is favorable considering that the temperature control is responsive to the thermal load (as it tries to bring the exhaust air to the temperature set point) instead of the room inlet temperature [9]. The machines and the CRAC are located in the Computer Science department of the University of Illinois at Urbana-Champaign. We were fortunate enough to not only be able to use DVFS on all the available cores but to also change the CRAC set points.

There isn’t a straightforward way of measuring the exact power draw of the CRAC, as it uses the chilled water to cool the air which in turn is cooled centrally for the whole building. This made it impossible for us to use a power meter. However, this isn’t unusual as most data centers use similar cooling designs. Instead of using a power meter, we installed temperature sensors at the outlet and inlet of the CRAC. These sensors measure the air temperature coming from and going out to the machine room.

The heat dissipated into the air is affected by core temperatures and the CRAC has to cool this air for maintaining a constant room temperature. The power consumed by CRAC ($P_{\text{ac}}$) to bring the temperature of exhaust air ($T_{\text{hot}}$) down to the cool inlet air ($T_{\text{in}}$) is [9]:

$$P_{\text{ac}} = c_{\text{air}} \times f_{\text{ac}} \times (T_{\text{hot}} - T_{\text{in}}) \quad (1)$$

where $c_{\text{air}}$ is the heat capacity constant and $f_{\text{ac}}$ is the constant flow rate of the cooling system. Although we are not using a power meter, our results are very accurate because there is no interference from other heat sources, as is the case with larger data centers. At these data centers, jobs from other users running on nearby nodes might dissipate more heat, which would distort the cooling energy estimation for our experiments.
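
Because $c_{\text{air}}$ and $f_{\text{ac}}$ are constant, Equation 1 lets two runs be compared using only the measured air temperatures. The sketch below illustrates this; the function names are hypothetical and the temperatures in the example are made up purely for illustration, not taken from our measurements.

```python
def cooling_power(c_air, f_ac, t_hot_c, t_in_c):
    """Equation 1: P_ac = c_air * f_ac * (T_hot - T_in)."""
    return c_air * f_ac * (t_hot_c - t_in_c)

def relative_cooling_savings(t_hot_base, t_in_base, t_hot_new, t_in_new):
    """Fractional reduction in cooling power versus a baseline run.

    c_air and f_ac are the same in both runs, so they cancel in the ratio.
    """
    return 1.0 - (t_hot_new - t_in_new) / (t_hot_base - t_in_base)

# Made-up temperatures (in degrees C), for illustration only.
savings = relative_cooling_savings(t_hot_base=25.0, t_in_base=15.0,
                                   t_hot_new=27.0, t_in_new=21.0)
```

The ratio compares cooling power at a point in time; comparing cooling energy additionally requires accounting for how long each run takes.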

To the best of our knowledge, this is the first work on constraining core temperatures and showing its benefit in cooling energy savings in the HPC community, in contrast to most of the earlier work, which emphasized savings in machine power consumption using DVFS. Most importantly, our work is unique in using load balancing to mitigate the effects of transient speed variations in HPC workloads.

We demonstrate the effectiveness of our scheme by using three applications having different utilization and power profiles. The first is a canonical benchmark, Jacobi2D, that uses a 5-point stencil to average values in a 2D grid using 2D decomposition. The second application, Wave2D, uses a finite differencing scheme to calculate pressure information over a discretized 2D grid. The third application, Mol3D, is from molecular dynamics and is a real world application to simulate large biomolecular systems. For Jacobi2D and Wave2D, we choose problem sizes of 22,000x22,000 and 30,000x30,000 grids respectively. For Mol3D, however, we ran a system containing 92,224 atoms. We did an initial run of these applications without DVFS with the CRAC working at 13.9°C and noted the maximum average core temperature reached for all 128 cores. We then used our temperature aware load balancer to keep the core temperatures at 44°C, which was the maximum average temperature reached in the case of Jacobi2D (this was the lowest peak average temperature amongst all three applications). While keeping the threshold fixed at 44°C, we decreased the cooling by increasing the CRAC set point. In order to gauge the effectiveness of our scheme, we compared it with the scheme in which DVFS is used to constrain core temperatures without using any load balancing (we refer to it as w/o TempLDB throughout the paper).

5. TEMPERATURE CONTROL AND TIMING PENALTY

Temperature control is important for cooling energy considerations since it determines the heat dissipated into the air, which the CRAC is responsible for removing. In addition, core temperatures and the power consumption of a machine are related by a positive feedback loop, so that an increase in either of them causes an increase in the other [8]. Our earlier work [18] shows evidence of this, where we ran Jacobi2D on a single node with 8 cores and measured the machine power consumption along with core temperatures. The results showed that an increase in core temperature can cause an increase of up to 9% in machine power consumption, and this figure can be huge for large data centers. For our testbed in this work, Figure 3 shows the average temperature for all 128 cores over a period of 10 minutes using our temperature aware load balancing. The CRAC was set to 21.1°C for these experiments. The horizontal line is drawn as a reference to show the maximum temperature threshold (44°C) used by our load balancer. As we can see, irrespective of how large the temperature gradient is, our scheme is able to restrain core temperature to within 1°C of the threshold. For example, core temperatures for Mol3D and Wave2D reach the threshold, i.e., 44°C, much sooner than Jacobi2D. But all three applications stay very close to 44°C after reaching the threshold.

Figure 3: Average core temperature with CRAC set point at 21.1°C

Figure 4: Max difference in core temperatures for Wave2D

Temperature variation across nodes is another very important factor. Spatial temperature variation is known to cause hot spots which can drastically increase the cooling costs of a data center. To get some insight into hot spot formation, we performed some experiments on our testbed with different CRAC set points. Each experiment was run for 10 minutes. Figure 4 shows the maximum difference any core has from the average core temperature for Wave2D when run with different CRAC set points. The Without DVFS run refers to all cores working at full frequency and no temperature control at core-level. We observed that for the case of Without DVFS run, the maximum difference is due to one specific node getting hot and continuing to be so throughout the execution, i.e., a hot spot. On the other hand, with our scheme, no single core is allowed to get a lot hotter than the maximum threshold. Currently, for all our experiments, we do temperature measurement and DVFS after every 6-8 seconds. More frequent DVFS would result in more execution time penalty since there is some overhead of doing task migration to balance the loads. We will return to these overheads later in this section.

The above experimental results showed the efficacy of our scheme in terms of limiting core temperatures. However, as shown in Section 2, this comes at the cost of execution time. We now use savings in execution time penalty as a metric to establish the superiority of our temperature aware load balancer in comparison to using DVFS without any load balancing. For this, we study the normalized execution times, $t_{\text{norm}}$, with and without our temperature aware load balancer, for all three applications under consideration. We define $t_{\text{norm}}$ as follows:

$$t_{\text{norm}} = \frac{t_{\text{LB}}}{t_{\text{base}}}$$

where $t_{\text{LB}}$ represents the execution time for the temperature aware load balanced run and $t_{\text{base}}$ is the execution time without DVFS, so that all cores work at maximum frequency. The value of $t_{\text{norm}}$ for the w/o TempLDB run is calculated in a similar manner, except that we use $t_{\text{NoLB}}$ instead of $t_{\text{LB}}$. We experiment with different CRAC set points. All the experiments were performed by actually changing the CRAC set point and allowing the room temperature to stabilize before any experimentation and measurements were done. To minimize errors, we averaged the execution times over three similar runs. Each run takes longer than 10 minutes to allow fair comparison between applications. The results of this experiment are summarized in Figure 5. The results show that our scheme consistently performs better than the w/o TempLDB scheme, as manifested by the smaller timing penalties for all CRAC set points. As we go on reducing the cooling (i.e., increasing the CRAC set point), we can see degradation in the execution times, i.e., an increase in timing penalty. This is not unexpected and is a direct consequence of the fact that the cores heat up in less time and scale down to a lower frequency, thus taking longer to complete the same run. It is interesting to observe from Figure 5 that the difference between our scheme and the w/o TempLDB scheme is small to start with but grows as we increase the CRAC set point. This is because when the machine room is cooler, the cores take longer to heat up in the first place. As a result, even the cores falling in the hot spot area do not become so hot that they go to a very low frequency (we decrease frequency in steps of 0.13GHz). But as we keep on decreasing the cooling, the hot spots become more and more visible, so much so that when the CRAC set point is 25.6°C, Node 10 (the hot spot in our testbed) runs at the minimum possible frequency almost throughout the experiment. Our scheme does not suffer from this problem since it intelligently assigns loads by taking core frequencies into account. But without our load balancer, the execution time increases greatly (refer to Figure 5 for CRAC set point 25.6°C). This increase happens because, in the absence of load balancing, execution time is determined by the slowest core, i.e., the core with the minimum frequency.
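
For concreteness, the normalization can be computed as in the sketch below. The function names are hypothetical, the timing penalty is assumed here to mean the percentage by which $t_{\text{norm}}$ exceeds 1, and the run times in the example are illustrative values, not measurements.

```python
def normalized_time(run_times_s, t_base_s):
    """t_norm: mean execution time over repeated runs divided by the base time."""
    return (sum(run_times_s) / len(run_times_s)) / t_base_s

def timing_penalty_percent(run_times_s, t_base_s):
    """Percentage increase in execution time relative to the base run."""
    return (normalized_time(run_times_s, t_base_s) - 1.0) * 100.0

# Illustrative values only: three repeated runs against a 600 s base run.
runs = [710.0, 714.0, 718.0]
t_norm = normalized_time(runs, 600.0)
penalty = timing_penalty_percent(runs, 600.0)
```

With these illustrative inputs, the mean run time is 714 s, giving $t_{\text{norm}} = 1.19$, i.e., a 19% timing penalty.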

For a more detailed look at our scheme’s sequence of actions, we use Projections [6], a performance analysis tool from the Charm++ infrastructure. Projections provides a visual demonstration of multiple performance data including processor timelines showing their utilization. We carried out an experiment on 16 cores instead of 128 and used Projections to highlight the salient features of our scheme. We worked with a smaller number of cores since it would have been difficult to visually understand a 128-core timeline. Figure 7 shows the timelines and corresponding utilization for all 16 cores throughout the execution of Wave2D. Both runs in the figure had DVFS enabled. The upper run, i.e., the top 16 lines, is the one where Wave2D is executed without temperature aware load balancing, whereas the lower part, i.e., the bottom 16 lines, repeated the same execution with our temperature aware load balancing. The length of the timeline indicates the total time taken by an experiment. The green and pink colors show the computations, whereas the white regions represent idle time. Notice that the execution time with temperature aware load balancing is much less than without it. To see how processors spend their time, we zoomed into the boxed part of Figure 7 and reproduced it in Figure 8. This represents 2 iterations of Wave2D. The zoomed part belongs to the run without temperature aware load balancing. We can see that because of DVFS, the first four cores work at a lower frequency than the remaining 12 cores. These cores, therefore, take longer to complete their tasks as compared to the remaining 12 cores (longer pink and green portions on the first 4 cores). The remaining 12 cores finish the work quickly and then keep on waiting for the first 4 cores to complete their tasks (depicted by white spaces towards the end of each iteration). These results clearly suggest that the timing penalty is dictated by the slowest cores. We also substantiate this by providing Figure 6, which shows the minimum frequency of any core during a w/o TempLDB run (CRAC set point at 23.3°C). We can see from Figure 5 that Wave2D and Mol3D have higher penalties as compared to Jacobi2D. This is because the minimum frequency reached in these applications is lower than that reached in Jacobi2D.

Figure 5: Normalized execution time with and without Temperature Aware Load Balancing

Figure 6: Minimum frequency for all three applications

Figure 7: Projections timeline with and without Temperature Aware Load Balancing for Wave2D

Figure 8: Zoomed Projections for 2 iterations

We now discuss the overhead associated with our temperature aware load balancing. As outlined in Algorithm 1, our scheme has to measure core temperatures, use DVFS, decide new assignments and then exchange tasks according to the new schedule. The major overhead in our scheme comes from the last item, i.e., the exchange of tasks. In comparison, temperature measurements, DVFS, and load balancing decisions take negligible time. To gauge the communication load we incur on the system, we run an experiment with each of the three applications for ten minutes and count the number of tasks migrated at each step when we check core temperatures. Figure 9 shows these numbers as percentages for all three applications. As we can see, the numbers are too small to make any significant difference. The small overhead of our scheme is also highlighted by its superiority over the w/o TempLDB scheme, which does temperature control through DVFS but no load balancing (and so, no object migration). However, we do acknowledge that the amount of memory used per task can change this overhead greatly.

One important observation to be made from this figure is that although the total power consumption of the entire machine is smaller for Jacobi2D as compared to the other two applications and hence has more transitions in its frequency. We explain and verify these application-specific differences in power consumption in the next section.

6. UNDERSTANDING APPLICATION REACTION TO TEMPERATURE CONTROL

One of the reasons we chose to work with three different applications was to be able to understand how application-specific characteristics react to temperature control. In this section, we highlight some of our findings and try to provide a comprehensive and logical explanation for them.

We start by referring back to Figure 5, which shows that Wave2D suffers the highest timing penalty, followed by Mol3D and Jacobi2D. Our intuition was that this difference could be explained by the frequencies at which each application runs along with their CPU utilizations (see Table 2). Figure 10 shows the average frequency across all 128 cores during the execution time for each application. We were surprised by Figure 10 because it showed that both Wave2D and Mol3D run at almost the same average frequency throughout the execution time, and yet Wave2D ends up having a much higher penalty than Mol3D. Upon investigation, we found that Mol3D is less sensitive to frequency than Wave2D. To further gauge the sensitivity of our applications to frequency, we ran a set of experiments in which each application was run at all available frequency levels. Figure 11 shows the results, where execution times are normalized with respect to a base run in which all 128 cores run at the maximum frequency, i.e., 2.4GHz. We can see from Figure 11 that Wave2D has the steepest curve, indicating its sensitivity to frequency. On the other hand, Mol3D is the least sensitive to frequency, as shown by its small slope. This gave us one explanation for the higher timing penalties for Wave2D as compared to the other two.

However, if we use this line of reasoning only, then Jacobi2D, which is more sensitive to frequency (as shown by Figure 11) and has a higher utilization (Table 2), should have a higher timing penalty than Mol3D. Figure 5 suggests otherwise. Moreover, the average power consumption of Jacobi2D is also higher than that of Mol3D (see Table 2), which should imply that cores get hotter sooner while running Jacobi2D than with Mol3D and shift to a lower frequency level. On the contrary, Figure 10 shows Jacobi2D running at a much higher frequency than Mol3D. These counterintuitive results could only be explained in terms of CPU power consumption, which is higher for Mol3D than for Jacobi2D. To summarize, these results suggest that although the total power consumption of the entire machine is smaller for Mol3D, the proportion consumed by the CPU is higher than it is for Jacobi2D.
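The frequency sensitivity discussed above can be made concrete as the slope of normalized execution time against frequency. The following is a minimal sketch under stated assumptions: the function names, the two application profiles, and all timings are hypothetical and are not measurements from our testbed; only the 2.4GHz maximum frequency comes from the text.

```python
def normalized_times(times_by_freq):
    """Normalize execution times by the time at the highest frequency."""
    base = times_by_freq[max(times_by_freq)]
    return {f: t / base for f, t in times_by_freq.items()}


def sensitivity(times_by_freq):
    """Least-squares slope of normalized time against frequency (per GHz).

    A more negative slope means execution time grows faster as the
    frequency is lowered, i.e. the application is more frequency sensitive.
    """
    norm = normalized_times(times_by_freq)
    freqs = list(norm)
    n = len(freqs)
    mean_f = sum(freqs) / n
    mean_t = sum(norm.values()) / n
    cov = sum((f - mean_f) * (norm[f] - mean_t) for f in freqs)
    var = sum((f - mean_f) ** 2 for f in freqs)
    return cov / var


# Hypothetical timings in seconds, keyed by frequency in GHz (not measured data).
steep_app = {2.4: 100.0, 2.0: 120.0, 1.6: 150.0, 1.2: 200.0}
flat_app = {2.4: 100.0, 2.0: 105.0, 1.6: 112.0, 1.2: 122.0}

print("steep app slope:", sensitivity(steep_app))
print("flat app slope:", sensitivity(flat_app))
```

In this sketch the application with the steeper (more negative) slope pays a larger timing penalty for the same drop in frequency, which is the relationship Figure 11 is used to read off.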

![Figure 9: Percent objects migrated during temperature aware load balancer run](image)

![Figure 10: Average frequency for all three applications with CRAC at 23.3°C](image)

![Figure 11: Normalized execution time for different frequency levels](image)

Table 2: Performance counters for one core

<table>
<thead>
<tr>
<th>Counter Type</th>
<th>Wave2D</th>
<th>Jacobi2D</th>
<th>Mol3D</th>
<th>Wave2D</th>
</tr>
</thead>
<tbody>
<tr>
<td>Execution Time (secs)</td>
<td>340</td>
<td>955</td>
<td>539</td>
<td>540</td>
</tr>
<tr>
<td>MFLOP/s</td>
<td>240</td>
<td>252</td>
<td>97</td>
<td>292</td>
</tr>
<tr>
<td>Traffic L1-L2 (MB/s)</td>
<td>155</td>
<td>10,500</td>
<td>3,044</td>
<td>1,000</td>
</tr>
<tr>
<td>Traffic L2-DRAM (MB/s)</td>
<td>539</td>
<td>97</td>
<td>577</td>
<td>577</td>
</tr>
<tr>
<td>Cache misses to DRAM (billions)</td>
<td>4</td>
<td>0.72</td>
<td>4.22</td>
<td>4.22</td>
</tr>
<tr>
<td>CPU Utilization (%)</td>
<td>87</td>
<td>83</td>
<td>93</td>
<td>93</td>
</tr>
<tr>
<td>Power (W)</td>
<td>2472</td>
<td>2553</td>
<td>2558</td>
<td>2558</td>
</tr>
<tr>
<td>Memory Footprint (% of memory)</td>
<td>8.1</td>
<td>2.4</td>
<td>8.0</td>
<td>8.0</td>
</tr>
</tbody>
</table>

Figure 12: Timing penalty for different CRAC set points

For some mathematical backing to our claims, we look at the following expression for core temperatures [9]:

\[
T_{cpu} = \alpha T_{ac} + \beta P_i + \gamma
\]

Here \( T_{cpu} \) is the core temperature, \( T_{ac} \) is the temperature of the air coming from the cooling unit, and \( P_i \) is the power consumed by the chip. \( \alpha, \beta \) and \( \gamma \) depend on heat capacity and airflow, and can be treated as constants since our CRAC maintains a constant airflow. This expression shows that core temperatures depend on the power consumption of the chip rather than that of the whole machine, and therefore it is possible that the cores get hotter earlier for Mol3D than for Jacobi2D due to higher CPU power consumption.
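To illustrate the linear temperature model, the sketch below evaluates it for two chips that see the same cold-air temperature but draw different chip power. The constants `ALPHA`, `BETA`, `GAMMA` and both power values are placeholders chosen only for illustration; they are assumptions, not values fitted to our testbed. The 23.3°C cold-air temperature is the CRAC setting used in Figure 10.

```python
def core_temperature(t_ac, chip_power, alpha, beta, gamma):
    """Linear model T_cpu = alpha * T_ac + beta * P_i + gamma."""
    return alpha * t_ac + beta * chip_power + gamma


# Placeholder constants (assumed, not fitted to real measurements).
ALPHA, BETA, GAMMA = 1.0, 0.25, 5.0
T_AC = 23.3  # cold-air temperature in deg C

# Two hypothetical applications with different chip (CPU) power in watts.
lower_chip_power = core_temperature(T_AC, 50.0, ALPHA, BETA, GAMMA)
higher_chip_power = core_temperature(T_AC, 60.0, ALPHA, BETA, GAMMA)

print("lower chip power :", lower_chip_power)
print("higher chip power:", higher_chip_power)
```

Under this model, the application with the higher chip power reaches a higher core temperature at the same cold-air temperature, regardless of how much power the rest of the node draws.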

So far, we have provided some logical and mathematical explanations for our counterintuitive results. To verify these explanations more rigorously, we ran all three applications on 128 cores using the performance measurement capabilities of PerfSuite [7] and collected the performance counter data summarized in Table 2. We can see that, compared to Jacobi2D, Mol3D faces fewer cache misses and has 10 times more traffic between the L1 and L2 caches (counter type 'Traffic L1-L2'). This concentration of activity on the chip is consistent with a larger share of Mol3D's power being consumed by the CPU.

7. ENERGY SAVINGS

This section is dedicated to a performance analysis for our temperature aware load balancing in terms of energy consumption. We first look at machine energy and cooling energy separately and then combine them to look at the total energy.

7.1 Machine Energy Consumption

Figure 13 shows the normalized machine energy consumption ($\epsilon_{\text{norm}}$), calculated as:

$$\epsilon_{\text{norm}} = \frac{\epsilon_{\text{LB}}}{\epsilon_{\text{base}}}$$

where $\epsilon_{\text{LB}}$ represents the energy consumed by the temperature aware load balanced run and $\epsilon_{\text{base}}$ is the energy consumed without DVFS, with all cores working at maximum frequency. $\epsilon_{\text{norm}}$ for the w/o TempLDB run is calculated in a similar way, with $\epsilon_{\text{LB}}$ replaced by $\epsilon_{\text{NoLB}}$, the energy consumed by the w/o TempLDB run. The static power of the CPU, along with the power consumed by the power supply, memory, hard disk and motherboard, mainly forms the idle power of a machine. A node of our testbed has an idle power of 40W, which represents 40% of the total power when the machine is working at full frequency assuming 100% CPU utilization. It is this high idle/base power which inflates the total machine energy consumption in the case of w/o TempLDB runs, as shown in Figure 13. This is because for every extra second of penalty in execution time, we pay an extra 40 J per node in addition to the dynamic energy consumed by the CPU. Considering this, our scheme does well to keep the normalized machine energy consumption close to 1, as shown in Figure 13.
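The effect of idle power on normalized machine energy can be sketched with a simple per-node model. The 40W idle power and the implied 100W full-load power come from the text; treating dynamic power as a constant average over the run, and all run times and reduced dynamic power values, are assumptions made only for illustration.

```python
IDLE_POWER_W = 40.0   # per-node idle power reported in the text
FULL_POWER_W = 100.0  # implied by idle power being 40% of full-load power


def node_energy(time_s, dynamic_power_w, idle_power_w=IDLE_POWER_W):
    """Energy in joules for one node: idle plus (assumed constant) dynamic power."""
    return (idle_power_w + dynamic_power_w) * time_s


def normalized_energy(run_time_s, run_dynamic_w, base_time_s):
    """Energy of a run divided by the energy of the max-frequency base run."""
    base = node_energy(base_time_s, FULL_POWER_W - IDLE_POWER_W)
    return node_energy(run_time_s, run_dynamic_w) / base


# Hypothetical runs: lower frequency cuts dynamic power but stretches time.
small_penalty = normalized_energy(run_time_s=120.0, run_dynamic_w=40.0, base_time_s=100.0)
large_penalty = normalized_energy(run_time_s=150.0, run_dynamic_w=30.0, base_time_s=100.0)

print("20% timing penalty:", small_penalty)
print("50% timing penalty:", large_penalty)
```

In this sketch the run with the larger timing penalty ends up above 1 even though its dynamic power is lower, because every extra second costs the full idle power of the node.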

We can better understand why the w/o TempLDB run consumes much more energy than our scheme if we refer back to Figure 8. We can see that although the lower 12 cores are idle after they are done with their tasks (white portion enclosed in the rectangle), they still consume idle power, thereby increasing the total energy consumed.

7.2 Cooling Energy Consumption

While there exists some literature discussing techniques for saving cooling energy, those solutions are not applicable to HPC, where applications are tightly coupled. Our aim in this work is to come up with a framework for analyzing cooling energy consumption specifically from the perspective of HPC systems. Based on such a framework, we can design mechanisms to save cooling energy that are particularly suited to HPC applications. We now refer to Equation 1 to infer that $T_{\text{hot}}$ and $T_{\text{ac}}$ are enough to compare energy consumption for the CRAC, as the remaining terms are constants.

So we come up with the following expression for normalized cooling energy ($\epsilon^{\text{cool}}_{\text{norm}}$):

$$\epsilon^{\text{cool}}_{\text{norm}} = \frac{T^{\text{LB}}_{\text{hot}} - T^{\text{LB}}_{\text{ac}}}{T^{\text{base}}_{\text{hot}} - T^{\text{base}}_{\text{ac}}} \times t_{\text{norm}}$$

where $T^{\text{LB}}_{\text{hot}}$ represents the temperature of hot air leaving the machine room (entering the CRAC) and $T^{\text{LB}}_{\text{ac}}$ represents the temperature of the cold air entering the machine room when using the temperature aware load balancer. Similarly, when running all the cores at maximum frequency without any DVFS, $T^{\text{base}}_{\text{hot}}$ is the temperature of hot air leaving the machine room and $T^{\text{base}}_{\text{ac}}$ is the temperature of the cold air entering the machine room. $t_{\text{norm}}$ is the normalized time for the temperature aware load balanced run. Notice that we include the timing penalty in our cooling energy model so that we incorporate the additional time for which cooling must be done.
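The normalized cooling energy can be sketched as a small function. This sketch assumes that the ratio of temperature differences is multiplied by the normalized time, since cooling energy is cooling power integrated over the run; the `_lb` suffix denotes the temperature aware load balanced run and `_base` the run at maximum frequency without DVFS. All temperatures and the timing penalty below are hypothetical values, not readings from our machine room.

```python
def normalized_cooling_energy(t_hot_lb, t_ac_lb, t_hot_base, t_ac_base, t_norm):
    """Ratio of hot/cold air temperature differences, scaled by normalized time."""
    delta_base = t_hot_base - t_ac_base
    if delta_base <= 0:
        raise ValueError("base run must have hot air warmer than cold air")
    return (t_hot_lb - t_ac_lb) / delta_base * t_norm


# Hypothetical readings in deg C and a hypothetical 20% timing penalty.
example = normalized_cooling_energy(
    t_hot_lb=29.0, t_ac_lb=24.0,      # load balanced run, warmer cold-air supply
    t_hot_base=30.0, t_ac_base=20.0,  # base run at maximum frequency
    t_norm=1.2,
)
print("normalized cooling energy:", example)
```

A value below 1 indicates a cooling energy saving relative to the base run, even after paying for the longer execution time.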

Figure 14 shows the normalized cooling energy both with and without the temperature aware load balancer. We can see from the figure that both schemes end up saving some cooling energy, but temperature aware load balancing outperforms the w/o TempLDB scheme by a significant margin. Our temperature readings showed that the difference between $T_{\text{hot}}$ and $T_{\text{ac}}$ was very close in both cases, i.e., our scheme and the w/o TempLDB scheme, and the savings in our scheme were a result of savings from $t_{\text{norm}}$.

7.3 Total Energy Consumption

Although most data centers report cooling to account for 50% [21, 3, 19] of total energy, we decided to take a conservative figure of 40% [12] for it in our calculations of total energy. Figure 15 shows the percentage of total energy we save and the corresponding timing penalty we end up paying for it. Although it seems that Wave2D does not give us much room to decrease its timing penalty and energy, we would like to mention that our maximum threshold of 44°C was very conservative for it. On the other hand, results from Mol3D and Jacobi2D are very encouraging in the sense that a user who is willing to sacrifice some execution time can save a considerable amount of energy while keeping core temperatures in check. The timing penalty can be reduced further by choosing a suitable maximum threshold, as our current constraints are very strict considering that we do not allow any core to go above the threshold.
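One way to combine the two normalized energies into a total is a weighted sum using the 40% cooling share. This is a sketch under an assumption: the text states the 40% figure but does not spell out the combination, so the weighted-sum form, the function names, and the example inputs are all hypothetical.

```python
COOLING_SHARE = 0.40  # share of baseline total energy attributed to cooling


def normalized_total_energy(machine_norm, cooling_norm, cooling_share=COOLING_SHARE):
    """Weighted sum of normalized machine and cooling energy (assumed form)."""
    return (1.0 - cooling_share) * machine_norm + cooling_share * cooling_norm


def percent_saved(machine_norm, cooling_norm, cooling_share=COOLING_SHARE):
    """Total energy saved relative to the base run, in percent."""
    return (1.0 - normalized_total_energy(machine_norm, cooling_norm, cooling_share)) * 100.0


# Hypothetical inputs: machine energy slightly above 1, cooling energy well below 1.
total = normalized_total_energy(machine_norm=1.02, cooling_norm=0.60)
print("normalized total energy:", total)
print("percent saved:", percent_saved(1.02, 0.60))
```

With these placeholder inputs, a small increase in machine energy is outweighed by the cooling savings, which is the kind of trade-off Figure 15 reports.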

To quantify the energy savings achievable with our technique, we plot normalized time against normalized energy (Figure 16). The figure shows data points for both our scheme and the w/o TempLDB scheme. We can see that for each CRAC set point, our scheme moves the corresponding w/o TempLDB point towards the left (reducing energy consumption) and down (reducing timing penalty). The slope of these curves gives the number of seconds by which the execution time increases for each joule of energy saved. As we can see, Jacobi2D has a higher potential for energy savings than Mol3D because of its lower MFLOP/s.

8. RELATED WORK

Most researchers from HPC have focused on minimizing machine energy consumption as opposed to cooling energy [16, 1, 23, 10]. In [17], Rountree et al. exploit the load imbalance present in applications to save machine energy consumption. Our scheme focuses on improving the load imbalance rather than exploiting it. Given a target program, a DVFS enabled cluster, and constraints on power consumption, the authors of [20] come up with a frequency schedule that minimizes execution time while staying within the power constraints. Our work differs in that we base our DVFS decisions on core temperatures for saving cooling energy, whereas they devise frequency schedules according to the task schedule irrespective of core temperatures. Their scheme works with load balanced applications only, whereas ours has no such constraint. In fact, one of the major features of our scheme is that it strives to achieve a good load balance. A runtime system named PET (Performance, power, energy and temperature management), by Hanson et al. [4], tries to maximize performance while respecting power, energy and temperature constraints. Our goal is similar to theirs, but we achieve it in a multicore environment, which adds an additional dimension of load balancing.

The work of Banerjee et al. [1] comes closest to ours in the sense that they also try to minimize cooling costs in an HPC data center. But their focus is on controlling the CRAC set points rather than the core temperatures. In addition, they need to know the job start and end times beforehand to come up with the correct schedule, whereas our technique does not rely on any pre-runs. Merkel et al. [11] also explore the idea of task migration from hot to cold cores. However, they do not do it for parallel applications and therefore do not have to deal with complications in task migration decisions caused by synchronization primitives. In another work, Tang et al. [23] have proposed a way to decrease cooling and avoid hot spots by minimizing the peak inlet temperature from the machine room through intelligent task assignment. But their work is based on a small-scale data center simulation, while ours consists of experimental results on a reasonably large testbed.

Work related to cooling energy optimization and hot-spot avoidance has been done extensively in non-HPC data centers [2, 13, 24, 25, 22]. But most of this work relies on placing jobs such that jobs expected to generate more heat are placed on nodes located in relatively cooler areas of the machine room and vice versa. Rajan and Yu [15] discuss the effectiveness of system throttling for temperature aware scheduling. They claim system throttling rules to be the best one can achieve under certain assumptions. But one of their assumptions, non-migratability of tasks, is clearly not true for the HPC applications we target. Another recent approach is used by Le et al. [9], where they switch machines on and off in order to minimize total energy while meeting the core temperature constraints. However, they do not consider parallel applications.

9. CONCLUSION

We experimentally showed that it is possible to save both cooling energy and total energy in our small data center when running tightly coupled parallel applications. Our technique not only saved cooling energy but also minimized the timing penalty associated with it. Our approach was conservative in that we set hard limits on absolute values of core temperature. However, our technique can readily be applied to constrain core temperatures within a specified temperature range, which can result in a much lower timing penalty. We carried out a detailed analysis to reveal the relationship between application characteristics and the timing penalty that can be expected when core temperatures are constrained. Our technique was also able to identify and neutralize a hot spot in our testbed.

We plan to extend our work by incorporating critical path analysis of parallel applications in order to make sure that we always try to keep all tasks on the critical path on the fastest cores. This would further reduce our timing penalty and possibly reduce machine energy consumption. We also plan to extend our work so that, instead of using DVFS to constrain core temperatures, we apply it to meet a certain maximum power threshold that a data center wishes not to exceed.

Acknowledgments

We are thankful to Prof. Tarek Abdelzaher for letting us use the testbed for experimentation under grant NSF CNS 09-58314. We also thank Josephine Geigner and Shehla Saleem Rana for their valuable help in writing and proofreading the paper.

10. REFERENCES


