# Accelerated Remote Sensing Data Retrievals using Multi-Core and GPU Architectures

The generic design of the invlib library separates the mathematical formulation of the retrieval from the underlying linear algebra implementation. This allows for the use of optimized, hardware-specific linear algebra libraries to accelerate the data retrieval. As an extension of my thesis work, I have added sparse matrix and vector types that use the Intel MKL library and the CUDA cuBLAS and cuSPARSE libraries for linear algebra.

In the following I will present some important considerations for the implementation of remote sensing data retrievals on multi-core CPUs and GPUs as well as performance results obtained with invlibs newly developed MKL and CUDA matrix and vector types. I will concentrate on retrieval implementations that use a conjugate gradient solver for the linear subproblem of the optimization method and whose Jacobian as well as the covariance matrices can be efficiently represented by sparse matrices. This is the case for the retrieval of data from the MATS satellite mission [1].

All computations described below have been performed on a 20-core *Intel Xeon E5-2650 v3* and a *NVIDIA K40 GPU*.

## Sparse Matrix Representations

The three most common sparse matrix representations are the *coordinate*, the *compressed row* and the *compressed column* format.

In coordinate format (**coo**), a matrix is given by three arrays: An *element array* containing the non-zero elements of the matrix, as well as two integer arrays of the same length containing the row and column indices corresponding to the elements in the element array. Numerical libraries generally require the elements to be ordered. The Intel MKL library, for example, requires the elements to be stored in *row-major order*, i.e. first ordered with respect to the row index and then with respect to the column index.

In compressed row format (**csr**) the same matrix is given by three arrays in row-major order, with the column index array replaced by an integer array of length \(n\) containing at position \(i\) the index of the element array that contains the first element of the \(i\)th column of the matrix. In addition to that, both the Intel MKL as well as the cuSPARSE library require this *column start* or *column pointer* array to contain an additional element that points behind the last element in the last row of the matrix. For simple use cases this is just the number of non-zero elements of the matrix.

The compressed column format (**csc**) is the counterpart of the csr format. Here the matrix is described by an element array in column-major order, a column index array and a *row start* or *row pointer array*. This is analogous to the column start array of the csr format in that it contains at the \(i\)th position the index of the element array which contains the first element of the \(i\)th row.

An important thing to note is that a matrix \(\mathbf{A}\) given in csr (csc) format is identical to its transpose \(\mathbf{A}^T\) in csc (csr) format. For this reason, the cuSPARSE library provides matrix vector multiplication functions only for the csr representation. These can be used for the csc format by exchanging non-transposed and transposed matrix vector multiplication.

## The Linear Subproblem

The linear subproblem refers to the linear system that needs to be solved in each step of the optimization loop that computes the MAP estimator for the retrieved state. This linear system usually takes the form

\begin{align} \left ( \mathbf{K}^T\mathbf{S}_\epsilon\mathbf{K} + (1 + \lambda) \mathbf{S}_\alpha\ \right) \mathbf{x} &= \mathbf{v} \end{align}with \(\lambda = 0\) for the Gauss-Newton method and \(\lambda \geq 0\) for the Levenberg-Marquardt method.

The advantage of the CG solver is that it does not require the matrix describing the linear system to be evaluated, which would require too much computing and memory resources for large retrievals. Instead the CG method only requires this matrix to be multiplied by a vector from the right. Each step of the conjugate gradient solver requires the computation of four matrix vector products of which one involves a transposed matrix. The matrix vector products are critical for the performance of the CG method and are thus of primary interest here.

## Sparse Matrix Vector Multiplication

To obtain an impression of the performance of the different implementations and matrix representations, benchmarks of the matrix vector product and the transposed matrix vector product for random sparse matrices have been computed. To this end, the compute time for the evaluation of products of the form

\begin{align} \mathbf{A}\mathbf{A}\mathbf{A}\mathbf{A}\mathbf{A}\mathbf{x} \\ \mathbf{A}^T\mathbf{A}^T\mathbf{A}^T\mathbf{A}^T\mathbf{A}^T\mathbf{x} \end{align}has been measured. The matrix \(\mathbf{A}\) has been chosen as sparse \(n \times n\) matrix in the given representation, filled with \(n\) elements at randomly chosen positions. For the benchmark, the mean time to solution and its standard deviation have been computed for 100 products, each with a different random matrix \(\mathbf{A}\). The results are displayed in the two plots below.

Two important things are to note here. First of all, it can be seen that the coordinate format, which is supported only by the MKL library, performs worst for the intel types in both benchmarks. It does therefore not seem to be of practical interest for high-performance use cases.

Second and more importantly, it can be seen that for both libraries the best performance for a given operation is achieved either by the csc or the csr format. For the matrix vector multiplication the best performance is achieved by the csr format, while the csc format performs worse. For transposed matrix vector multiplication it is the opposite: The best performance is achieved using the csc format, while csr performs badly. For the CUDA implementation, the difference between csr and csr performance is more pronounced. This is likely due to the increased cost of atomic operations on those architectures (See the hint at page 17 of the cuSPARSE manual).

Since for the data retrieval both non-transposed as well as transposed matrix vector products have to be computed, I introduced an additional matrix representation which combines the csr and the csc format. As can be seen from the plots above, this *hybrid* (**hy**) format minimizes the combined time for the non-transposed and the transposed matrix vector product.

## MATS Benchmark Results

Finally also the performance of the different types for the MATS data retrieval has been evaluated. Since the retrieval problem is linear, finding the MAP estimator is reduced to solving the linear subproblem. For the benchmark the compute time for executing 1000 steps of the CG method has been measured. The results are displayed in the plot below.

The best performance here is obtained on the NVIDIA GPU, but only if the hybrid matrix representation is used. In this case the performance on the GPU is about 3 times better than the best CPU performance. For the CPU the advantage of the hybrid representation is negligible compared to the csr format. For comparison, also the performance of the Eigen 3 library, which has initially been used for the data retrieval, is displayed in the figure.

### Scaling

All CPU computations so far have been performend with a maximum number of 10 threads on the Intel 20-core CPU. To investigate the scaling with respect to the number of available compute cores the MATS benchmark has been repeated while varying the number of compute threads. The relative speedup obtained with respect to serial execution is displayed in the plot below. From the plot it can be seen that neither the coo format nor the Eigen types support thread-parallelism.

## Summary

The invlib library has been extended with sparse matrix and vector types that use the Intel MKL and the CUDA cuBLAS and cuSPARSE libraries for linear algebra. In addition to the standard sparse matrix representations (coordinate, compressed row and compressed column format) a hybrid format has been introduced that minimizes the combined compute time for non-transposed and transposed matrix vector products. Using this matrix representation, a three times speed up can be obtained by performing the retrievals on an NVIDIA K40 as compared to an *Intel Xeon E5-2650 v3* 20-core CPU.

## Acknowledgements

All computation described above have been performed at the Chalmers Centre for Computational Science and Engineering (C3SE) provided by the Swedish National Infrastructure for Computing (SNIC).

## Bibliography

[1] J. Gumbel, N. Ahlgren, N. Larsson, F. v. Schéele, MATS mission definition phase report, J. Phys. Theor. Appl. 6 (2014) 202–241.