Wednesday, August 31, 2016

#GSoC 2016 Final report

Hi everyone!

I have successfully completed Google Summer of Code 2016, and below is a copy of my final report.

Dear LLVM contributors,

during Google Summer of Code 2016 I worked on the "Improvement of vectorization process in Polly" project.

I planned to add a mode that uses the same Polly tool flow but treats statements containing only matrix multiplication of the following form in a special way:

C := AB + C, where C, A, and B are matrices of sizes m × n, m × k, and k × n, respectively.

It was also planned to implement determination of vectorization factors and profitability of vectorization.

We decided to adjust the goals and focus only on the optimization of matrix multiplication and its generalization. As a result, we optimize a class of statements (see [1] for further details) that also contains matrix multiplication.

The results of evaluating the corresponding code, compiled with r279395 of Clang on an Intel Core i7-3820 (Sandy Bridge), for different values of m, n, and k and different compiler options can be found at the following link [2].

According to the figure [2], the algorithm for computing the optimal values of the matrix multiplication parameters can be improved. One option is a new algorithm for computing the parameter Mc (see [1] for further details). If, for example, we manually set Mc equal to Nc, we get the following results [3]. This improvement is planned for the future.

The class also contains general matrix multiplication of the form C := αAB + βC, where C, A, and B are matrices of sizes m × n, m × k, and k × n, respectively. For example, in the case of m = n = 1056, k = 1024, α = 32412, and β = 2123, the corresponding code compiled with r278666 of Clang with the following options

clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME -march=native -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true -DPOLYBENCH_USE_SCALAR_LB -mllvm -polly-target-cache-level-associativity=8,8 -mllvm -polly-target-cache-level-sizes=32768,262144 -mllvm -polly-target-throughput-vector-fma=1 -mllvm -polly-target-latency-vector-fma=8

takes about 0.30 seconds on the Intel Core i7-3820 (Sandy Bridge). If the described optimization is disabled, it takes about 0.75 seconds, and about 2.1 seconds if Polly is not used.

Furthermore, the optimization also applies to the following examples of code:

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1
Loop 2   for j = 0, …, UpperBound2 - 1 in steps of 1
Loop 3     for k = 0, …, UpperBound3 - 1 in steps of 1
             C[i][j] +=  32412 Op1 B[k][j]
           endfor
         endfor
       endfor

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1
Loop 2   for j = 0, …, UpperBound2 - 1 in steps of 1
Loop 3     for k = 0, …, UpperBound3 - 1 in steps of 1
             C[i][j] = C[i][j] Op1 A[i][k] Op2 B[k][j]
           endfor
         endfor
       endfor

where Opi is an operation from the set {+, -, /, *} (see [4], [5], [6], [7] for results of their evaluations).

The described optimization is implemented in [8], [9], [10], [11], [12], [13], [14] that are already committed and [15] that is under review. For further details about the project, what code got merged, what code did not get merged, and what is left to do, please see my blog posts [1], [16].

I am planning to continue working on the project as part of my Ph.D. thesis and implement the following:

1. Prefetching of certain cache lines

The BLIS implementation of matrix-matrix multiplication prefetches only the first cache line before traversing a given interval of memory. This suggests that the implementation relies on hardware prefetching for the subsequent lines [17]. Consequently, applying the same technique could improve the described optimization.

2. Reduction of the number of target-architecture parameters that need to be specified on the command line

At the moment, to use the implemented optimization, one has to specify target-architecture parameters such as the throughput of vector instructions per clock cycle (see [1] for further details). It would be preferable to reduce the number of such parameters by using the target information already available in LLVM (e.g., TargetTransformInfo::getArithmeticInstrCost).

3. Generalization of the determination of the transformation parameters

At the moment we use the algorithm described in "Analytical Modeling is Enough for High Performance BLIS" [1], which was designed to optimize a subclass of the class and could cause performance regressions if applied to all statements of the class. We could try to create a new algorithm specialized for the whole class.

I would like to thank Michael Kruse, Albert Cohen, Johannes Doerfert and the LLVM community for their comments, reviews, and ideas, and for the opportunity to work on this project.

I would like to especially thank Tobias Grosser for his encouragement and excellent support.

Refs.:

[1] - http://romangareev.blogspot.ru/2016/06/gsoc-2016-report-i.html
[2] - https://drive.google.com/file/d/0B2Wloo-931AoVTRINXp0dDZPSjQ/view?usp=sharing
[3] - https://drive.google.com/file/d/0B2Wloo-931AoWG9jTjg3aEVxRG8/view?usp=sharing
[4] - https://docs.google.com/spreadsheets/d/1RtioBokRePdX2zdFxGxOxTbtY_WnI4moYFCfbujr7S8/pubhtml?gid=0&single=true
[5] - https://docs.google.com/spreadsheets/d/1hUtXH9JsgYUhHCdydVtE7ayNupc9ViZuAG8z_JKevP0/pubhtml?gid=0&single=true
[6] - https://docs.google.com/spreadsheets/d/1NQHqCuhc1A8pVJtN3FuTHrQ2gcyXUYs5T2L8tMXqLE4/pubhtml?gid=0&single=true
[7] - https://docs.google.com/spreadsheets/d/1nr4qZpG5V--Cq5cTLX1TLA7OpfdFqhIGr6-IFuoq1yw/pubhtml?gid=0&single=true
[8] - http://reviews.llvm.org/D21140
[9] - http://reviews.llvm.org/D21491
[10] - http://reviews.llvm.org/D20575
[11] - https://reviews.llvm.org/D22187
[12] - https://reviews.llvm.org/D22828
[13] - https://reviews.llvm.org/D23740
[14] - https://reviews.llvm.org/D23759
[15] - https://reviews.llvm.org/D23260
[16] - http://romangareev.blogspot.ru/2016/08/gsoc-2016-report-ii.html
[17] - http://lists.llvm.org/pipermail/llvm-dev/2016-May/100229.html

Monday, August 22, 2016

#GSoC 2016 Report II

Hi everyone!

As was mentioned previously, my next task is to implement the packing transformation. It can be described as a data-layout transformation that introduces a new array, copies data to it, and changes memory access locations to reference the array (see [1], p. 6 for further details).

At the time the packing transformation was implemented, Polly supported only one data-layout transformation, namely changing memory access locations. The remaining tasks were to implement the creation of empty arrays and the copying of data to them. However, it was first necessary to identify the memory access locations that should be changed.

For the following code, which could correspond to matrix-matrix multiplication,

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1
Loop 2   for j = 0, …, UpperBound2 - 1 in steps of 1
Loop 3     for k = 0, …, UpperBound3 - 1 in steps of 1
             C[i][j] += A[i][k] * B[k][j]
           endfor
         endfor
       endfor

the following access functions can be built:

S[i, j, k]->A[i][k], S[i, j, k]->B[k][j].

According to the algorithm used to optimize matrix multiplication, described in "Analytical Modeling is Enough for High Performance BLIS" [1], they should be changed to the following access relations:

S[O0, O1, O2, O3, O4, O5, O6, O7, O8]->Ac[O5 + K * O3, O6] and

S[O0, O1, O2, O3, O4, O5, O6, O7, O8]->Bc[O5 + K * O4, O7], respectively.

The transformation helps to access the elements of the operands of a matrix multiplication after creation of the BLIS micro- and macro-kernels. The described approach is implemented in [3] and is already available from the main repository.

The next task was to create an empty array. To do this, we use an object of the ScopArrayInfo class, which is designed to store information (e.g., element type and dimension sizes) about arrays in the SCoP (see [2] for further details). In our case, the array is not originally part of the SCoP. Furthermore, it is allocated only during the CodeGeneration pass, which takes the mathematical model derived and optimized by Polly and translates it back to LLVM-IR. The described approach is implemented in [3] and already available from the main repository.

The last task was to copy data to the newly created arrays according to the packing transformation.

To implement it, we decided to introduce a new type of SCoP statement that has one read memory access and one write memory access (in that order): data is loaded from the location described by the read access and written to the location described by the write access. To describe the domain of such a statement, we use extension nodes of the integer set library (isl) [4].

The described approach is implemented in [5]. It yields the following execution-time improvements on the benchmarks from PolyBench [6] that correspond to matrix multiplications.

Performance Improvements - Execution Time

Benchmark                                                            Δ        Previous  Current
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm     -70.59%  7.1680    2.1080
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm     -68.85%  4.7640    1.4840
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm   -66.61%  2.3720    0.7920

The results obtained from the Polly buildbot perf-x86_64-penryn-O3-polly are available at the following link [7].

Although it is already implemented, we continue to improve the patch and the related Polly infrastructure. In particular, we are planning to make it possible to introduce such SCoP statements via JSCoP, which helps to test new optimizations without any knowledge of compiler internals [8]. During this GSoC we have already extended JSCoP to allow the user to declare new arrays and to reference these arrays from access expressions. This is implemented in [9] and already available from the main repository.

Refs.:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[2] - http://llvm.org/viewvc/llvm-project/polly/trunk/include/polly/ScopInfo.h?view=markup
[3] - https://reviews.llvm.org/D22187
[4] - http://isl.gforge.inria.fr
[5] - https://reviews.llvm.org/D23260
[6] - http://web.cs.ucla.edu/~pouchet/software/polybench/
[7] - https://gareevroman.github.io
[8] - http://www.infosun.fim.uni-passau.de/cl/arbeiten/grosser-d.pdf
[9] - https://reviews.llvm.org/D22828