# High-Level Hardware Feature Extraction for GPU Performance **Prediction of Stencils**

Anonymous Author(s)

# Abstract

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

High-level functional programming abstractions have started to show promising results for HPC (High-Performance Computing). Approaches such as LIFT, Futhark or Delite have shown it is possible to have both, high-level abstractions and performance, even for HPC workloads such as stencils. In addition, these high-level functional abstractions can also be used to represent programs, and their optimized variants, within the compiler itself. However, such high-level approaches rely heavily on the compiler to optimize programs which is notoriously hard when targeting GPUs.

Compilers either use hand-crafted heuristics to direct the optimizations or iterative compilation to search the optimization space. The first approach has fast compile times, however, it is not performance-portable across different devices and requires a lot of human effort to build the heuristics. Iterative compilation, on the other hand, has the ability to search the optimization space automatically and adapts to different devices. However, this process is often very time consuming as thousands of variants have to be evaluated. Performance models based on statistical techniques have been proposed to speedup the optimization space exploration. However, they rely on low-level hardware features, in the form of performance counters or low-level static code features.

Using the LIFT framework, this paper demonstrates how lowlevel, GPU-specific features are extractable directly from a highlevel functional representation. The LIFT IR (Intermediate Representation) is in fact a very suitable choice since all optimization choices are exposed at the IR level. This paper shows how to extract low-level features such as number of unique cache lines accessed per warp, which is crucial for building accurate GPU performance models. Using this approach, we are able to speedup the exploration of the space by a factor 2000x on an AMD GPU and 450x on Nvidia on average across many stencil applications.

GPGPU'20, February 23, 2020, San Diego, CA, USA 2020.

# 1 Introduction

Recent years have witnessed the emergence of high-level approaches for high-performance computing such as Accelerate [21], Futhark [9], Delite [4], LIFT [34] and AnyDSL [18]. They enable programmers to write hardware-agnostic code while putting the burden on the compiler to extract performance. Tuning a compiler is very laborious and time-consuming, especially when considering accelerators such as GPUs (Graphics Processing Units) and this process has to be repeated for every new hardware generation.

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

99

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

LIFT proposes to use rewriting [33] to solve this problem. Rewriting for compiler optimizations is an approach first proposed in 2001 in the Haskell compiler [29]. LIFT's rewrite rules attempt to define the set of all possible algorithmic and, crucially, hardware-specific optimizations. Rewrite rules liberate compiler writers from having to implement hard-coded optimizations and make it easy to extend the compiler. Optimizations are simply implemented as rules and a generic rewriting engine explores the space automatically.

However, this approach results in a large optimization space. The optimization process takes a few hours for stencils on GPUs [8], even when using an efficient auto-tuner [1]. In response, this paper develops an automatic performance model predicting the best optimized program variant using static features from the high-level LIFT IR. This removes the necessity for compiling and running programs which accounts for the majority of the exploration time.

The use of performance modeling for GPUs is not novel [10, 11, 25, 26, 37]. However, to the best of our knowledge, this is the first paper to show how information about low-level GPU-specific features is extractable from a high-level functional IR. This paper demonstrates that a high-level IR is amenable to the extraction of low-level information useful for predicting performance using high-level semantic information. It also shows how cache locality information is extractable at this level. This relies on the use of the rich information stored in the LIFT type system together with the ability to reason about array indices in a symbolic manner.

Using the extracted features, a performance predictor is built using machine-learning. This leads to a highly accurate model for the stencil domain, an important class of high-performance code. The model achieves a correlation of 0.8 and 0.9 on GPUs from Nvidia and AMD, respectively. Using the model to search the space requires less than 5 runs in the majority of the cases to achieve performance within 90% of the best available. In comparison, a random search requires 100s of runs in the majority of the cases.

To summarize, the paper makes three contributions:

- It shows how low-level GPU hardware features are extracted from a high-level functional IR;
- It presents a simple machine-learning model that predicts program performance;
- It shows that the model is able to drastically reduce exploration time of the optimization space.

The rest of this paper is organized as follows: Section 2 motivates this work while Section 3 presents background information about OpenCL and LIFT. Section 4 explains how low-level hardware

122

123

124

125

126

127

128

129

130

131

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180



237

238

239

240



**Figure 1.** LIFT compilation and exploration. a) The current approach compiles and executes all transformed expressions. b) The new strategy ranks the transformed expressions with a model and only compile and execute the best ones.

features are extracted from the high-level LIFT IR and Section 5 presents the performance model. Section 6 analyses the features and the model performance while Section 7 shows that the model is able to speedup drastically the optimization space exploration. Finally, Section 8 discusses related work and Section 9 concludes.

# 2 Motivation

**Current LIFT Exploration** LIFT [33] explores the GPU optimization space using rewrite rules. Figure 1a presents an overview. First, a high-level expression representing the program is used as an input to the compiler. This generic high-level expression does not encode any optimizations. Then, the rewriting takes place and the LIFT exploration module applies rewrite rules to search the space randomly. This results in a set of *transformed* expressions where optimizations have been applied and parallelism has been mapped.

The transformed expressions are then fed into the LIFT code generator which produces OpenCL kernels. These kernels are compiled with the vendor-provided OpenCL compiler into binaries. Finally, all binaries are executed, the performance is recorded and the best found kernel is reported.

This process is time consuming as it produces a large number of kernels (1,000 for this paper). In addition, every LIFT generated kernel is executable with a different number of threads leading up to 10,000 kernel executions.

**Time breakdown** Figure 2 shows the percentages of the time spent in the different stages of the current LIFT compilation and exploration. Unsurprisingly, the last part of LIFT's workflow, the kernel execution, requires by far the most time (up to 90%). For this paper, executing all kernels for a single application, including the



**Figure 2.** Time breakdown for the LIFT exploration process. *Kernel* generation includes time to rewrite and compile LIFT expressions to OpenCL kernels. *Kernel compilation* is the vendor-provided OpenCL compiler time. *Kernel execution* is the time required to execute all generated kernels.

exploration of thread configurations took up to 41 minutes while all kernels were generated in less than a minute, which is about 2% of the overall time.

**Using a Performance Predictor for Exploration** The major bottleneck for exploration is clearly the OpenCL compilation and execution time of the generated kernels, which represent 98% of total time. This paper addresses this bottleneck by using a trained performance predictor directly on the transformed LIFT expression. Figure 1b shows how the exploration strategy is modified with a performance model.

Once the transformed expressions have been produced, the idea is to extract features that are informative about performance. These features are fed into a predictive model which almost instantaneously ranks the transformed expressions. Then, the transformed expression with the fastest predicted performance is selected, the corresponding kernel generated, compiled and finally executed.

While this approach seems very simple, the challenges are twofold. First, we need to identify features that are informative about performance, such as memory access patterns. Then, they need to be extracted from the high-level functional LIFT IR. As we will see, the LIFT IR encodes all the required information to calculate low-level GPU-specific features. The next section gives background information about the LIFT IR while Section 4 will discuss feature extraction.

# 3 Background

This section introduces OpenCL, the existing LIFT IR and the rewrite systems used to produce efficient OpenCL kernels.

### 3.1 OpenCL

An OpenCL program (*kernel*) is executed by multiple threads (*work-items*) organized in *work-groups*, providing a two-level thread hierarchy. Both work-items and work-groups are organized in three dimensional grids identified by unique IDs. On GPUs, multiple work-groups are executable on a core and work-items are scheduled in groups of 32 for Nvidia (warp) or 64 for AMD (wavefront). It is generally desirable to start a large thread number to reach maximum *occupancy*.

OpenCL provides a three-level memory hierarchy: *Global memory* is accessible by all work-items and throughput is maximized when threads in the same warp/wavefront access the same cache line (coalesced accesses). Work-items of the same group communicate via a fast shared *local memory* and each work-item has its own *private memory*.

High-Level Hardware Feature Extraction for GPU Performance Prediction of Stencils GPGPU'20, February 23, 2020, San Diego, CA, USA

# <sup>241</sup> 3.2 Lift IR

LIFT [33, 34] is a functional language based on lambda calculus, offering a small set of reusable primitives. It is a compiler-internal data-parallel intermediate language and is compiled to high-performance OpenCL code. LIFT's distinguishes algorithmic primitives which express what to compute, from OpenCL-specific primitives which express how to compute by explicitly mapping computations to the OpenCL programming model. LIFT's type system supports scalar types (e.g. int, float), tuple-types (denoted as  $U \times T$ ) and array-types (denoted as  $[T]_n$ ) where the array size *n* is part of the type. 

**Algorithmic Primitives** LIFT provides well-known functional primitives defined on arrays as listed below:

 $map: (f: T \to U, in: [T]_n) \to [U]_n$   $reduce: (init: U, f: (U, T) \to U, in: [T]_n) \to [U]_1$   $zip: (in1: [T]_n, in2: [U]_n) \to [T \times U]_n$   $iterate: (in: [T]_n, f: [T]_n \to [T]_n, m: int) \to [T]_n$   $split: (m: int, in: [T]_n) \to [[T]_m]_{n/m}$   $join: (in: [[T]_m]_n) \to [T]_{m \times n}$   $slide: (size: int, step: int, in: [T]_n) \to [[T]_{size}]_{n-size+step}$   $pad: (l: int, r: int, h: (i: int, len: int) \to int,$   $in: [T]_n) \to [T]_{l+n+r}$   $at: (i: Cst, in: [T]_n) \to T$   $get: (i: Cst, in: T_1 \times T_2 \times ...) \to T_i$   $array: (n: int, f: (i: int, n: int) \to T) \to [T]_n$   $userFun: (s1: ScalarT, s2: ScalarT', ...) \to ScalarU$ 

LIFT supports the definition of arbitrary scalar-based sequential OpenCL-C function called *userFun*. These are directly embedded in the generated OpenCL code.

**OpenCL-specific primitives** LIFT'S OpenCL specific primitives expose OpenCL's thread and memory hierarchy. These primitives are used to explicitly dictate how to perform the computation expressed with the algorithmic primitives.

Parallelism is exposed via specialized variations of map: mapGlobal<sub>d</sub>, mapWorkgroup<sub>d</sub>, mapLocal<sub>d</sub> and mapSeg. These primitives directly correspond to OpenCL's thread hierarchy. The computation speci-fied within a OpenCL-specific map is performed by it's particular level and dimension  $d \in \{0, 1, 2\}$  of the thread hierarchy, or exe-cuted sequentially by a single thread (*mapSeq*). OpenCL's memory hierarchy is exposed via *toGlobal(f)*, *toLocal(f)* and *toPrivate(f)*, which specify where the output of the function f is stored in mem-orv. 

# 3.3 Rewriting

LIFT encodes optimizations as semantics-preserving rewrite rules.
 These rules are used to transform a high-level expression written
 using the algorithmic primitives into a transformed expression in
 which parallelism and memory is explicitly exploited. Similar to
 LIFT's primitives, rewrite rules are also categorized into algorithmic
 or OpenCL-specific rules. Algorithmic rules such as the divide-and conquer rule:

$$map(f) \rightarrow join \circ map(map(f)) \circ split(n)$$

| <pre>map(reduce(+,0), slide(3,1, pad(1,1, 0, arg)))</pre> | stencil(arg: [float         | t] <sub>N</sub> ) = |          |    |        |  |
|-----------------------------------------------------------|-----------------------------|---------------------|----------|----|--------|--|
|                                                           | <pre>map(reduce(+,0),</pre> | slide(3,1,          | pad(1,1, | 0, | arg))) |  |

Listing 1. 1D 3pt-stencil example in LIFT.

| transformedStencil(arg: $[float]_N$ ) =           |
|---------------------------------------------------|
| mapWrg(tile =>                                    |
| <pre>mapLcl(toGlobal(reduce(+,0)), slide(3,</pre> |
| <pre>mapLcl(toLocal(id, tile))))</pre>            |
| )(slide(18,16, pad(1,1, 0, arg)))                 |
|                                                   |

Listing 2. 1D transformed 3pt-stencil example in LIFT.

| Туре                           | Feature                                                                                                                                                                                |  |
|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| Parallelism                    | global size (dimensions 0, 1 and 2)<br>local size (dimensions 0, 1 and 2)                                                                                                              |  |
| Memory                         | amount of local memory allocated<br>global stores per thread<br>global loads per thread<br>local stores per thread<br>local loads per thread<br>average cache lines per access per was |  |
| Control Flow & Synchronisation | barriers per thread<br>if statements per thread<br>for loop bodies executed per thread                                                                                                 |  |

Table 1. List of extracted features

create a space of possible algorithmic implementations for the same expression. OpenCL-specific rules such as:

$$map(f) \rightarrow mapGlobal_0(f)$$

map expressions to the OpenCL's programming model.

#### 3.4 Example

Listing 1 shows a 1D 3-point stencil expressed in LIFT [8]. *pad* is applied adding one element ( $\emptyset$ ) to the left and right of the input array arg to implement a simple boundary handling. *slide* creates overlapping neighborhoods of three elements which are summed up using *map* and *reduce*.

Applying rewrite rules leads to Listing 2, where overlapped tiling has been applied. Every tile is processed by a work-group (*mapWrg*) loading all elements to local memory and computing the output using its work-items before storing it in global memory. From this expression high-performance OpenCL code is generated as shown in [8].

# 4 Feature Extraction

This paper proposes a performance model that predicts the performance of transformed LIFT expressions on GPUs in order to identify the best variant. The model relies on static features extracted from the high-level LIFT IR. Although the features are extracted at a highlevel, they capture information about low-level hardware features. They broadly fall into three categories as seen in Table 1.

#### 4.1 Parallelism

For a fixed input size, the number of threads launched influences how much parallelism versus sequential work is performed. We include both global and local thread counts across the three thread dimensions as features. Local thread count affects how large each work-group will be, which may affect data reuse or the number of concurrent groups.

#### 4.2 Memory

This section covers the features related to the amount of memory allocated, number of accesses and access patterns.

# 4.2.1 Local memory usage

One of the important factors that determines performance on a GPU is occupancy. Occupancy is typically maximized when multiple work-group execute concurrently. More concurrent work-groups typically translates to more threads executing concurrently, which ultimately helps hiding memory latency.

The number of work-groups that execute simultaneously on a core depends on the amount of resources used by each workgroup. One important resource is the amount of fast local memory (shared memory) used by the work-group. Therefore, it is crucial to determine this quantity.

Extracting the amount of local memory used in a LIFT program is straightforward. The program is traversed once, collecting memory allocation sizes and summing up these numbers.

| 5    | input :Lambda expression representing a program                   |  |  |  |  |
|------|-------------------------------------------------------------------|--|--|--|--|
|      | output: Numbers of different types of memory accesses.            |  |  |  |  |
| , C( | puntAccesses(lambda)                                              |  |  |  |  |
|      | totalLoad[local] = 0; totalLoad[global] = 0                       |  |  |  |  |
| 3 2  | <pre>2 totalStore[local] = 0; totalStore[global] = 0</pre>        |  |  |  |  |
| 3    | countAccessesExpr( <i>lambda.body</i> , 1)                        |  |  |  |  |
| ) 4  | return {totalLoad,totalStore}                                     |  |  |  |  |
| L C( | <pre>puntAccessesExpr(expr, iterationCount)</pre>                 |  |  |  |  |
| 5    | switch expr do                                                    |  |  |  |  |
| 6    | case fc@FunCall                                                   |  |  |  |  |
| 3 7  | foreach arg in fc.args do                                         |  |  |  |  |
| 8    | countAccessesExpr( <i>arg</i> , <i>iterationCount</i> )           |  |  |  |  |
| 9    | switch expr.f do                                                  |  |  |  |  |
| 10   | case is <i>l@</i> Lambda                                          |  |  |  |  |
| 5    | <pre>countAccessesExpr(l.body, iterationCount);</pre>             |  |  |  |  |
| 7 11 | case is t@toPrivate or t@toLocal or toGlobal                      |  |  |  |  |
| 12   | countAccessesExpr( <i>t.f.body</i> , <i>iterationCount</i> )      |  |  |  |  |
| 13   | case is m@MapSeq or m@MapGlb or m@MapLcl or                       |  |  |  |  |
| 14   | n = fc.input(0).length                                            |  |  |  |  |
| 15   | <pre>countAccessesExpr(m.body, iterationCount * n)</pre>          |  |  |  |  |
| 16   | case is it@Iterate                                                |  |  |  |  |
| 17   | <pre>countAccessesExpr(it.body, iterationCount * it.count);</pre> |  |  |  |  |
| 18   | case is uf@UserFun                                                |  |  |  |  |
| 19   | foreach arg in fc.args do                                         |  |  |  |  |
| 20   | totalLoad[arg.addrsSpace] += iterationCount                       |  |  |  |  |
| 21   | totalStore[arg.addrsSpace] += iterationCount                      |  |  |  |  |
| 22   | otherwise do // Nothing to count;                                 |  |  |  |  |
| 23   | otherwise do // Nothing to count;                                 |  |  |  |  |
| 24   | return counts                                                     |  |  |  |  |
|      |                                                                   |  |  |  |  |

**Algorithm 1:** Pseudo-code for counting the total number of loads/stores for each type of memory.

#### 4.2.2 Number of Memory Accesses

Performance is largely affected by the amount and type of memory
operations. Applications that exhibit large amount of data re-usage
will benefit from exploiting the fast local memory. The program
can simply reuse the data in local memory several times, reducing the number of global memory accesses, resulting in increased
performance.

| example(arg <sub>0</sub> : [float] <sub>N</sub> , arg <sub>1</sub> : [float] <sub>N</sub> ) = |
|-----------------------------------------------------------------------------------------------|
| mapWrg(x =>                                                                                   |
| <pre>mapLcl(toGlobal(multByTwo), mapLcl(toLocal(add)), x)</pre>                               |
| )(split(64, zip(arg <sub>0</sub> , arg <sub>1</sub> )))                                       |
|                                                                                               |

Listing 3. Example for memory access count extraction.

**Algorithm** The LIFT code generator only produces loads and stores to memory when a user function is called. Therefore, counting the number of loads and stores boils down to counting how often each user function is called. As can be seen in Algorithm 1, a depth-first traversal is performed on the IR while keeping track of the number of times the body of patterns generating loops is executed. Once a user-function is reached, the feature extractor simply updates the total number of loads and stores. In addition to this, the extractor keeps track of the type of memory being accessed, local or global, using the *toLocal* and *toGlobal* patterns. The information about the address space is encoded directly into the IR and is populated by another pass that runs prior to feature extraction. The number of global/local loads and stores is then normalized by the number of total threads.

**Example** Consider the program in Listing 3. The algorithm starts with the top-level lambda and soon encounters the *mapWrg* primitive. At this point in the algorithm, line 14, n will be N/64 (the length of the outer dimension of the input after the *split*). The algorithm calls recursively *countAccesseExpr* with N/64 as the *iterationCount*. When visiting either of the *mapLcl* in line 3 of Listing 3, n will this time be 64 (the length of the inner dimension of the input after the *split*).

When the add function is visited, global loads is updated twice, since the add function has two inputs (the tuple is automatically unboxed). Since at this point, the iterationCount is N/64 \* 64 = N, the total number of global loads is N \* 2 and the total number of local stores is N. When the multByTwo function is visited, local reads and global store are both updated once, resulting in N local loads and N global stores.

### 4.2.3 Memory Access Patterns

The way a program accesses memory has a profound impact on performance. GPUs coalesce several memory requests into a single one when threads in the same warp/wavefront access a single cache line (typically 128 bytes). It is, therefore, important to extract information about memory access patterns for building an accurate performance predictor.

**General Algorithm** To determine the total number of cache line reads, the feature extractor recursively traverses the IR, keeping track of the iteration count. When a memory access is encountered, it determines the number of unique cache lines accessed by the warp as follows. First, it generates the actual index expression using the existing mechanism of the LIFT compiler [34]. If the expression contains no thread id, it means all the threads are accessing the same cache line.

When the expression contains a thread id, a new index expression is generated for each thread in the warp by adding a constant to its id (threads in a warp have consecutive ids). Let's denote the original array index expressed as a function of the thread id as *access(tid)*. Given *n*, the number of threads in a warp, the set of array indices

3 4

| 2   |
|-----|
| 1 1 |

accessed by the warp is:

48

48:

483

484

485

486

487

488

489

490

491

492

493

494

495

496

497

498

499

500

501

502

503

504

505

506

507

508

509

510

511

512

513

514

515

516

517

518

519

520

521

522

523

524

525

526

527

528

529

530

531

540

 $\{access(tid + 0), access(tid + 1), \cdots, access(tid + n - 1)\}$ 

This list of indices expresses the different addresses accessed by a warp. Given the cache line size s (expressed as a multiple of data size), we compute the list of cache lines accessed:

 $\{access(tid + 0)/s, access(tid + 1)/s, \cdots, access(tid + n - 1)/s\}$ 

Finally, we can subtract the elements in the list with each other to identify which ones are equal (when the subtraction results in 0) and count the number of unique accesses.

Implementation details The approach explained above is conceptually correct, however, it relies on having the ability to simplify symbolically arithmetic expressions. While the LIFT arithmetic simplifier supports a significant set of simplifications, it is not powerful enough to deal with some simplifications. In such cases the feature extractor might fail to recognize identical accesses. The following paragraphs explain a few workarounds used inside the feature extractor.

The first issue we encountered, is the difficulty in calculating the set of unique cache lines by subtraction. Conceptually, one could take the first access access(tid + 0)/s, subtract every other accesses by it and hope that the algebraic simplifier would be able to return 0 in case where two accesses are identical. Simplifying expressions as simple as

$$(tid + 0)/s - (tid + 1)/s$$

which is 0 when s > 1, is far from trivial given that / represents the integer division.

To overcome this challenge, we modify our approach slightly and add an extra step. Before dividing by s, we first calculate all the relative array accesses as an offset of the first access by simple subtraction. The intuition behind this two-fold. First, it is much easier to simplify a subtraction if it does not contain terms with integer division. Secondly, we only care about the distances between the accesses rather than their absolute location, therefore, we will still be able to identify the number of unique cache line accessed. So if the original accesses are

they become

# ${(tid + 0) - (tid + 0), (tid + 1) - (tid + 0), \cdots}$

 $\{tid + 0, tid + 1, \cdots\}$ 

which simplifies trivially to  $\{0, 1, \dots\}$ . Then, we perform the division as before, which leads to  $\{0/s, 1/s, \dots\}$  which trivially simplifies to  $\{0, 0, \dots\}$ . Now it is much easier to identify the unique cache lines.

532 *Example* Consider the example program from Listing 4. The array 533 index being read for the argument of f is  $i + n * gl_id$  where i is 534 the iteration variable of the *mapSeq* and gl\_id the global thread id. 535 Depending on the split factor *n*, a different number of cache lines 536 will be accessed by a warp. With a split factor of n = 1, a single 537 cache line would be accessed since the accesses within a warp are 538 consecutive. However, if the split factor is larger than the warp size, 539 then each warp will be touching a different cache line.

| stencil(input | :: [ | float]   | N) =     |
|---------------|------|----------|----------|
| MapGlb(Redu   | iceS | Seq(+, ( | 0.0f),   |
| Slide(3,      | 1,   |          |          |
| Pad(1,        | 1,   | Clamp,   | input))) |

Listing 5. Example for a simple stencil program.

541

542

543

544

545

546

547

548

549

550

551

552

553

554

555

556

557

558

559

560

561

562

563

565

566

567

568

569

570

571

572

573

574

575

576

577

578

579

580

581

582

583

584

585

586

587

588

589

590

591

592

593

594

595

596

597

598

599

600

With a cache line of 32 words, 32 threads per warp and 1 word for float, the cache line indices within a warp are:

$$\{(i + n * gl_id), (i + n * (gl_id + 1)), \cdots, (i + n * (gl_id + 31))\}$$

Using the trick presented earlier, we can express all indices as an offset from the first one:

$$\{(i + n * gl_id) - (i + n * gl_id), \\ i + n * (gl_id + 1) - (i + n * gl_id), \\ \dots, \\ i + n * (gl_id + 31)) - (i + n * gl_id)\}$$

which simplifies trivially to:  $\{0, n, \dots, n * 31\}$ . Now dividing by the cache line size, we obtain  $\{0, n/32, \dots, n * 31/32\}$ .

If the split factor *n* is 1, this results in 32 zeros, meaning all the thread in the warp access a single cache line. When the split factor n = 4, this will results in the following list: {0, 0, 0, 0, 1, 1, 1, 1, ..., 7, 7, 7, 7}. Since it has 8 unique values, the warp touches 8 cache lines for this memory access.

#### 4.3 Control Flow and Synchronisation

Another important factor that often limits performance for GPUs is control flow and synchronization. if-then-else and for loop statements produce branching instructions which is notoriously bad for GPU performance. Similarly, barriers are detrimental to performance since execution is alted until all threads have reached the barrier. For this reason, the feature extractor determines the total number of *if-then-else*, for loops and barriers produced by the code generator.

Algorithm This is similar to the algorithm used to count the number of memory operations. It traverses the IR recursively, keeping track of the number of times each function is executed. Whenever a pattern that might produce a loop (e.g. iterate, mapLocal, reduceSeq) is encountered, it checks whether a loop will be emitted and update a global loop counters, taking into account the current iteration count.

The algorithm also detects special cases where loops might not be emitted. There are two cases to consider. First, when a mapSeq iterates over an array of size 1, it is clear that a loop is not required. The second case is more subtle and involves mapLocal, mapWrg or mapGlobal. If the size of the input array is smaller than the number of local threads, workgroups or global threads, respectively, the code generator will emit a if-then-else statement instead of a loop since the loop can at most be executed once per thread or workgroup.

To determine the number of barriers, the algorithm looks at mapLcl as OpenCL only has barriers inside workgroups. The LIFT code generator detects unnecessary barriers [34] and tags the call to mapLcl when it is not required. Therefore, we run this barrier elimination pass before feature extraction and use this information to ignore the *mapLcl* which have been marked as not requiring a barrier.

Listing 6. OpenCL-ish code generated for a simple stencil.

#### 4.4 Use of High-Level Semantic Information

Another practical issue has to do with the *pad* pattern which is used to implement boundary conditions in stencil programs. Listing 5 shows a simple stencil program applying a clamping boundary condition which simply repeats the outermost value in case of outof-bounds accesses. Listing 6 shows the generated pseudo-OpenCL code for this program. The *pad* pattern introduces a lot of ternary operators ?: which check that every memory access is in bound. This operator makes it harder for the simplifier to subtract memory accesses with each other to identify unique cache lines.

To overcome this, we exploit the high-level semantic information available: the padded data is rarely accessed and most accesses are in bound. The feature extractor focuses on the common case by simply ignoring the ternary operator and calculate the index for the common case. Identifying the common case by statically analyzing the OpenCL code is much harder even for this simple example. We would have to predict the common case for two ternary operators whos predicates depends on two opaque function calls (global\_id and global\_size) to the OpenCL library.

### 4.5 Summary

601

602

603

604

605

606

607

608

609

610

611

612

613

614

615

616

617

618

619

620

621

622

623

624

625

626

627

628

629

630

631

632

633

634

635

636

637

638

639

640

641

642

643

644

645

646

647

648

649

650

651

652

653

654

655

This section has shown how low-level GPU-specific features are extracted from the LIFT IR. Memory-related, control flow and synchronization features, are extracted using information about the length of arrays from the type. We have seen how the fine-grained memory feature related to cache lines accesses is computed using the power of the LIFT symbolic arithmetic expressions. The next section explains how we build a simple performance model using these features.

# 5 Performance Model

Having seen how hardware-specific information is extracted from the high-level IR, we now focus on the performance model. It is based on k-Nearest Neighbors (kNN), which makes prediction based on the distance between programs in the feature space. Intuitively, LIFT programs that exhibit similar features are likely to have similar performance.

#### 5.1 Output Variable

The prediction output is throughput normalized by the maximum achievable per input/program. This is to ensure that performance is comparable across programs since different programs might exhibit different number of operations.

#### 5.2 Principal Component Analysis

Given that a kNN model works best with a small number of features, we use PCA (Principal Component Analysis) to reduce the
dimensionality of the feature space. Prior to applying PCA, the
features are centered and reduced with a mean of 0 and a standard

deviation of 1. This step is necessary since our features have very different ranges of values. PCA is then applied and we retain the principal components that explain 95% of the variance. In effect, this compresses the feature space by removing redundant features.

# 5.3 K-Nearest Neighbors Model

A k-nearest neighbors model makes a prediction of a new data point by finding the k closest points to it, using Euclidean distance and averaging their responses to make a prediction. In our case, the distance metric is determined by how close the feature vectors are from one another.

The kNN model does not require any special training. The execution time of rewritten LIFT expressions, together with their features, are simply collected and added into a database. When predicting a newly unseen LIFT program, we simply look up the k closest neighbors and average their prediction to form a new prediction. In our experiment, we used k = 5.

# 5.4 Making Predictions

To make a prediction about new programs, we first collect data points from a group of training programs. For each program, we conduct an exploration of their optimization space and store the features and corresponding performance. Given a new program, we proceed as follows:

- 1. For each rewritten program:
  - a. The features are extracted, normalized and projected based on the PCA calculated from the training data;
  - b. The model predicts the performance using the average of the k-nearest neighbors.
- The different rewritten programs are sorted based on the prediction.
- 3. The fastest predicted rewritten program is generated, compiled and executed.

# 6 Experimental Setup

**Platform** The setup consists of two GPUs, an NVIDIA Titan Black and an AMD Radeon R9 295X2. The Nvidia platform uses driver version 367.35 and OpenCL 1.2 (CUDA 8.0.0). The AMD platform uses OpenCL 2.0 AMD-APP (1598.5).

**Benchmarks and Space** We use the 2D stencil benchmarks from [8] listed in Table 2. All experiments are performed using single floating point with matrix sizes from  $512^2$  to  $8192^2$ .

*Model evaluation* The performance model is evaluated using leave-one-out cross-validation, the standard machine learning methodology. When evaluating performance on a given benchmark, the traning data consists of all the data collected from all benchmarks, except the one being tested.

# 7 Feature and Model Analysis

Before looking at how the performance model is used to speedup the optimization space exploration, we first perform an analysis of the features and evaluate the model accuracy.

#### 7.1 Features Analysis

6

We use the redundancy metric to analyze which features are the most informative about performance:

$$R = \frac{I(X, Y)}{H(X) + H(Y)}$$
718
719
719
719
720

661

662

663

664

665

666

667

668

669

670

671

672

673

674

675

676

677

678

679

680

681

682

683

684

685

686

687

688

689

690

691

692

693

694

695

696

697

698

699

700

701

702

703

704

705

706

707

708

709

710

711

712

713

714

715

716

| Benchmark     | Points | Points Used | # grids |
|---------------|--------|-------------|---------|
| Stencil2D     | 9      | 9           | 1       |
| SRAD1         | 9      | 5           | 1       |
| SRAD2         | 9      | 3           | 2       |
| Hotspot2D     | 9      | 5           | 2       |
| Gradient      | 9      | 5           | 1       |
| Jacobi2D 5 pt | 9      | 5           | 1       |
| Jacobi2D 9 pt | 9      | 9           | 1       |
| Gaussian      | 25     | 25          | 1       |

732

753

755

756

757

758

759

760

761

762

763

764

765

766

767

768

769

770

771

772

773

774

775

776

777

778

779

780

Table 2. Stencil benchmarks used in the evaluation.

The redundancy metric normalizes the mutual information by the 733 sum of the entropy of the two variables. This ensures that different 734 735 features can be compared with one another. In our case, we are 736 interested in comparing each feature with the output we wish to predict: performance. A higher value between a certain feature 737 and the output indicates that the feature is useful for performance 738 prediction. 739

Figure 4 shows the normalized mutual information between 740 features and performance. As expected, one of the most important 741 features is the average number of cache lines accessed per warp. 742 This feature, which represents locality, is extremely important for 743 744 stencil benchmarks.

The next most important feature for both machines is the global-745 Size in dimension 1. This feature is directly related to the number of 746 threads that execute and, therefore, the amount of parallel work per-747 748 formed. It is also used to determine if the kernels are launched using 749 a 2D or 1D iteration space (in the 1D case, the globalSize1 will be 1). Then, comes the number of global stores, followed closely by the 750 number of global loads. This basically corresponds to the number 751 of memory accesses performed into the slow global memory. 752

For both platforms, barriers and control flow (for loops) do seem 754 to have only a medium impact on performance, whereas the number of if-statements does not seem very relevant at all. Focusing on the least important features, the number of local loads does not seem to affect performance much. We conjecture that, since local memory is anyway very fast, having fewer or more local loads might not make much of a difference in terms of performance, especially compared to the number of global memory operation.

# 7.2 Benchmark diversity

Figure 3 shows the features of the best point in the space for our benchmarks. As can be seen, some benchmarks share similarities, which is essential for being able to make prediction. However, we also observe quite a lot of diversity.

#### 7.3 Performance Model Correlation

We analyze correlation between the predicted and actual values to measure the model ability at distinguishing between good and bad points. For all programs, the correlation coefficient is in the range [0.7 - 0.9], with average of 0.9 on Nvidia and 0.8 on AMD, which shows the predictor works

#### 7.4 Summary

This section has shown that the most important features for performance prediction on GPUs are related to memory access pattern, amount of parallelism and number of global memory accesses. The section has also shown that the model's predictions correlate highly

with actual performance. The next section shows how the model is used to speedup the optimization space exploration of our benchmarks.

#### **Optimization Space Exploration** 8

# 8.1 Model-based Exploration

This section shows the performance achieved when exploring the space with the predictor. The exploration is conducted by generating 1,000 transformed LIFT expression using rewrites and combining them with 10 different thread-counts on average. This leads to 10,000 design point per program/input pair. For each point, we extract the features and use the model to rank them. We then run the design points from highest predicted performance to lowest.

Figure 5 shows the normalized best performance achieved as a function of the number of points evaluated. It also shows the performance achieved using a purely random evaluation order. Using the predictor, it is possible to very quickly achieve 100% of the performance available in the space for all programs. In comparison, the random strategy struggles to reach even 50% of the performance available in some cases after having explored 3% of the whole space.

#### 8.2 Space Exploration Speedups

Figure 6 shows the exploration speedup when using our model compared to random to achieve 90% of the available performance. The speedup is shown in terms of number of samples and total time it takes to run them. A speedup of 10x means the performance model needs 10x less runs, or 10x less time, than random to achieve 90% of the performance.

As can be seen, using the performance model brings large speedup across all programs. When looking at total number of runs required, on Nvidia, the performance model approach requires 35x less runs than random. On AMD, there is an even bigger savings, since the model requires 77x less runs than random. When it comes to total time, the model based approach is a staggering 450x and 2000x faster than random for Nvidia and AMD respectively.

#### 8.3 Detailed Results

This last section shows more details per program/input. Figure 7 shows the actual number of runs required to reach 90% of the performance across programs and input. As can be seen, only one run is necessary in the majority of the cases for Nvidia and two for AMD. In contrast, random needs over 60 runs for Nvidia and over 180 for AMD in most cases.

The average number of runs using the model is 3 for Nvidia and 5 for AMD. In comparison, random requires on average 97 runs for Nvidia and 240 for AMD. These results clearly shows that the performance model is working extremely well in the majority of the cases.

Interestingly, there are a couple of outliers programs/input size combination that requires over 30 runs for the model-based approach. In both cases, stencil2d on Nvidia and srad1 on AMD, this is when the largest or smallest input sizes are used. We believe that in such cases, the behavior of these programs probably changes drastically with the input size. For instance, the data might actually fit entirely in the cache for the smallest input size of stencil2d and, therefore, change drastically the behavior of the application for this input size. Since our features have no notion of working-set size, the model might be unable to pick up this change of behavior.



Figure 4. Normalized mutual information (redundancy) between each feature and performance.

However, even in such cases, the model-based exploration is still ahead of random. For stencil2d, the model needs 31 runs while random needs 691, a 21x speedup!

#### Related Work

Auto-Tuning OpenTuner [1] is a framework for domain-specific multi-objective auto-tuners. CLTune [24] is a generic auto-tuner for OpenCL kernels. ATF [30] is a language-independent auto-tuning framework which supports inter-parameter constraints. These autotuning approaches attempt to find good implementations using online search which is orthogonal to our approach. In fact, autotuners can be easily coupled with a performance predictor.

Analytical Performance Modelling CuMAPz [15] is a compile time analysis tool that helps programmers increase the memory per-formance of CUDA programs. It estimates the effects of performance-critical memory behaviours, such as data reuse, coalesced accesses, channel skew, bank conflict and branch divergence. GROPHECY [22] uses the MWP-CWP model [11] (Memory Warp Parallelism - Com-putation Warp Parallelism) to estimate the GPU performance of skeleton-based applications. GPUPerf [32] is an enhanced version of the analytical MWP-CWP model with added metrics and a way of understanding performance bottlenecks. The boat hull model [25] 

rithm classification and produces a roofline model for each class of

distance theory with parallel execution, memory latency, limited associativity, miss-status holding-registers and warp divergence. COMPASS [17] introduces a language for creating analytical performance models that analyze the amount of floating point and memory operations based on static code features. Coloured petri nets [19] have been proposed for GPGPU performance modelling. Another approach [3] builds an analytical performance model to determine the lower bound on execution time. Low-level GPU ISA solving and assembly microbenchmarking [37] has been used to

predict execution time and determine resource bottlenecks for a

Analytical models describe low-level details of the hardware to model performance using a model written by a hardware expert. They typically use low-level kernel representations to make their predictions. In contrast, our approach based on machine-learning is fully automatic and works by extracting features at a much higher level.

Statistical Performance Modelling Early work [7] extracts static code features and uses machine learning to predict the performance of optimization sequences.Principal component analysis, cluster analysis and regression modelling have been used [14] to generate predictive models for GPUs and CPUs. Predictive modelling has also been applied in polyhedral compilation [28] to predict speedups for different combinations of polyhedral transformations from hardware performance counters. Graph-based program characterization [27] has also been used for polyhedral compilation to predict the speedups of optimization sequences. Clustering on similarity of a graph-based intermediate representation has been used [6] to cluster similar programs. Another approach [35] uses machine learning models trained on assembly level features to choose a good combination of transformations for vectorization.

All these approaches use hardware counters, low-level code features, assembly-level features or compiler data structures to predict speedups or optimization sequences. In contrast, our work shows how we can extract features at a much higher-level and still predict performance accurately.

MaSiF [5] uses PCA and kNN to auto-tune skeleton parameters for programs written using TBB and FastFlow. Stargazer [12] uses step-wise linear regression together with cubic splines to estimate the performance of programs on different GPU designs in GPGPU-Sim [2]. Starchart [13] uses random sampling and building



Figure 5. Achieved performance when exploring the space for a 4K input size using a model trained on other programs.



**Figure 6.** Reduction in the number of samples and corresponding time required to explore to reach at least 90% of the available performance (average = geometric mean).

regression trees to divide the whole optimization space into smaller subspaces.

These approaches try to directly predict the effect tunable parameters have on the performance. However, they rely on the fact that the number of parameters is fixed and known in advance. In contrast, our approach predicts the performance independently of the number of parameters in the program.

*Artificial Intelligence for Compilers* Genetic programming has been used [16] to generate features for predicting loop unrolling factors. Others [23] have proposed ways of generating program

features out of simple ones. Features are encoded as numeric relations and new ones are generated by joining existing relations and aggregating them. Support Vector Machines have also been used in compilers [31]. Machine learning has also been used to automatically learn compiler heuristics. [36] A neural-network cascade [20] is used to determine the amount of thread coarsening to apply to OpenCL programs for different GPUs.

Machine learning models in compilers traditionally use features extracted from a deeper stage in the compilation pipeline. Our work instead extracts them at a considerably higher-level from a functional IR.

# 10 Conclusions

This paper has demonstrated that it is possible to extract low-level hardware-specific features from the LIFT high-level functional IR. We have seen how type information, such as array length, is useful for computing certain features. The ability to reason symbolically about array indices also enables the extraction of very fined-grained features such as the number of accessed cache lines per warp. To the best of our knowledge, this is the first time a paper has shown how low-level features can be extracted at such high level, without requiring any profiling or performance counters.

The paper has also demonstrated how a simple performance model is built to make accurate performance predictions about different program variants. Using an Nvidia and AMD GPUs, and stencil applications, we have seen that the model is able to predict points in the space that are within 90% of the best within one or two runs in the majority of the cases. When compared to a random search strategy, the model requires on average 77x less runs than random on AMD and 35x less on Nvidia which translates to time savings of 2000x and 450x respectively.

## References

- Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman P. Amarasinghe. 2014. Open-Tuner: an extensible framework for program autotuning. In *PACT*. ACM. https: //doi.org/10.1145/2628071.2628092
- [2] Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In *ISPASS*. IEEE. https://doi.org/10.1109/ISPASS.2009.4919648
- [3] Ulysse Beaugnon, Antoine Pouille, Marc Pouzet, Jacques Pienaar, and Albert Cohen. 2017. Optimization Space Pruning Without Regrets. In CC. ACM. https: //doi.org/10.1145/3033019.3033023

#### GPGPU'20, February 23, 2020, San Diego, CA, USA

