Numerical linear algebra

In [1]:
versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

Introduction

  • Topics in numerical algebra:

    • BLAS
    • solve linear equations $\mathbf{A} \mathbf{x} = \mathbf{b}$
    • regression computations $\mathbf{X}^T \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^T \mathbf{y}$
    • eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{x}$
    • generalized eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{B} \mathbf{x}$
    • singular value decompositions $\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T$
    • iterative methods for numerical linear algebra
  • Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the building blocks of most statistical computing tasks (optimization, MCMC).

  • Our major goals (or learning objectives) are to

    1. know the complexity (flop count) of each task
    2. be familiar with the BLAS and LAPACK functions (what they do)
    3. not re-invent the wheel by implementing these dense linear algebra subroutines yourself
    4. understand the need for iterative methods
    5. apply appropriate numerical algebra tools to various statistical problems
  • All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra.

    • Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See documentation.

BLAS

  • BLAS stands for basic linear algebra subprograms.

  • See netlib for a complete list of standardized BLAS functions.

  • There are many implementations of BLAS.

    • Netlib provides a reference implementation.
    • Matlab uses Intel's MKL (Math Kernel Library). The MKL implementation is the gold standard on the market. It is not open source, but the compiled library is free for Linux and macOS.
    • Julia uses OpenBLAS. OpenBLAS is the best open-source implementation.
  • There are 3 levels of BLAS functions.

Level | Example Operation | Name | Dimension | Flops
1 | $\alpha \gets \mathbf{x}^T \mathbf{y}$ | dot product | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$
1 | $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | axpy | $\alpha \in \mathbb{R}$, $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$
2 | $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | gaxpy | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$
2 | $\mathbf{A} \gets \mathbf{A} + \mathbf{y} \mathbf{x}^T$ | rank one update | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$
3 | $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | matrix multiplication | $\mathbf{A} \in \mathbb{R}^{m \times p}$, $\mathbf{B} \in \mathbb{R}^{p \times n}$, $\mathbf{C} \in \mathbb{R}^{m \times n}$ | $2mnp$
  • Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z).
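
Julia exposes thin wrappers to many of these routines in LinearAlgebra.BLAS. A minimal sketch mapping the three levels to concrete calls (the sizes below are arbitrary and chosen only for illustration):

    using LinearAlgebra, Random

    Random.seed!(123)
    n = 100
    x, y = rand(n), rand(n)
    A, B, C = rand(n, n), rand(n, n), zeros(n, n)

    dot(x, y)                                 # level 1: dot product, 2n flops (ddot)
    BLAS.axpy!(2.0, x, y)                     # level 1: y ← y + αx (daxpy)
    BLAS.gemv!('N', 1.0, A, x, 1.0, y)        # level 2: y ← y + Ax (dgemv)
    BLAS.ger!(1.0, y, x, A)                   # level 2: A ← A + yxᵀ (dger)
    BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)   # level 3: C ← C + AB (dgemm)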

Examples

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

Some operations appear to be level-3 but are in fact level-2.

Example 1. A common operation in statistics is column scaling or row scaling $$ \begin{eqnarray*} \mathbf{A} &=& \mathbf{A} \mathbf{D} \quad \text{(column scaling)} \\ \mathbf{A} &=& \mathbf{D} \mathbf{A} \quad \text{(row scaling)}, \end{eqnarray*} $$ where $\mathbf{D}$ is diagonal. For example, in generalized linear models (GLMs), the Fisher information matrix takes the form
$$ \mathbf{X}^T \mathbf{W} \mathbf{X}, $$ where $\mathbf{W}$ is a diagonal matrix with the observation weights on its diagonal.

Column and row scalings are essentially level-2 operations!

In [2]:
using BenchmarkTools, LinearAlgebra, Random

Random.seed!(123) # seed
n = 2000
A = rand(n, n) # n-by-n matrix
d = rand(n)  # n vector
D = Diagonal(d) # diagonal matrix with d as diagonal
Out[2]:
2000×2000 Diagonal{Float64, Vector{Float64}}:
 0.140972   ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅ 
  ⋅        0.143596   ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅        0.612494   ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅        0.0480573      ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
 ⋮                                        ⋱                      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅            0.882415   ⋅         ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅        0.450904   ⋅ 
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅        0.814614
In [3]:
Dfull = convert(Matrix, D) # convert to full matrix
Out[3]:
2000×2000 Matrix{Float64}:
 0.140972  0.0       0.0       0.0        …  0.0       0.0       0.0
 0.0       0.143596  0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.612494  0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0480573     0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 ⋮                                        ⋱                      
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.0
 0.0       0.0       0.0       0.0           0.882415  0.0       0.0
 0.0       0.0       0.0       0.0           0.0       0.450904  0.0
 0.0       0.0       0.0       0.0           0.0       0.0       0.814614
In [4]:
# this is calling BLAS routine for matrix multiplication: O(n^3) flops
# this is SLOW!
@benchmark $A * $Dfull
Out[4]:
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     102.632 ms (0.00% GC)
  median time:      108.968 ms (0.00% GC)
  mean time:        115.806 ms (3.95% GC)
  maximum time:     181.946 ms (0.23% GC)
  --------------
  samples:          44
  evals/sample:     1
In [5]:
# dispatch to special method for diagonal matrix multiplication.
# columnwise scaling: O(n^2) flops
@benchmark $A * $D
Out[5]:
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  4
  --------------
  minimum time:     8.917 ms (0.00% GC)
  median time:      9.596 ms (0.00% GC)
  mean time:        11.821 ms (19.84% GC)
  maximum time:     80.565 ms (87.89% GC)
  --------------
  samples:          424
  evals/sample:     1
In [6]:
# in-place: avoid allocate space for result
# rmul!: compute matrix-matrix product AB, overwriting A, and return the result.
@benchmark rmul!($A, $D)
Out[6]:
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  2
  --------------
  minimum time:     5.033 ms (0.00% GC)
  median time:      9.243 ms (0.00% GC)
  mean time:        10.230 ms (0.00% GC)
  maximum time:     18.942 ms (0.00% GC)
  --------------
  samples:          489
  evals/sample:     1
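
The same idea applies to the GLM Fisher information $\mathbf{X}^T \mathbf{W} \mathbf{X}$: keeping $\mathbf{W}$ as a Diagonal makes the weighting a level-2 scaling rather than a level-3 multiply. A hedged sketch (the names X, w and the sizes n, p are made up for illustration):

    using LinearAlgebra, Random

    Random.seed!(123)
    n, p = 5000, 100
    X = randn(n, p)
    w = rand(n)                    # observation weights
    W = Diagonal(w)                # never materialize the n×n weight matrix
    info = transpose(X) * (W * X)  # O(np) scaling followed by a 2np² flop gemm
    # avoid: transpose(X) * diagm(w) * X, which forms a dense n×n matrix first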

Note: in R or Matlab, diag(d) creates a full matrix. Be cautious when using the diag function: do we really need a full diagonal matrix?

In [7]:
using RCall

R"""
d <- runif(5)
diag(d)
"""
Out[7]:
RObject{RealSxp}
          [,1]      [,2]     [,3]      [,4]   [,5]
[1,] 0.3306936 0.0000000 0.000000 0.0000000 0.0000
[2,] 0.0000000 0.4492671 0.000000 0.0000000 0.0000
[3,] 0.0000000 0.0000000 0.734508 0.0000000 0.0000
[4,] 0.0000000 0.0000000 0.000000 0.9110102 0.0000
[5,] 0.0000000 0.0000000 0.000000 0.0000000 0.4198
In [8]:
using MATLAB

mat"""
d = rand(5, 1)
diag(d)
"""
ArgumentError: Package MATLAB not found in current path:
- Run `import Pkg; Pkg.add("MATLAB")` to install the MATLAB package.


Stacktrace:
 [1] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:871
 [2] eval
   @ ./boot.jl:360 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094

Example 2. The inner product between two matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$ is often written as $$ \text{trace}(\mathbf{A}^T \mathbf{B}), \text{trace}(\mathbf{B} \mathbf{A}^T), \text{trace}(\mathbf{A} \mathbf{B}^T), \text{ or } \text{trace}(\mathbf{B}^T \mathbf{A}). $$ These appear to be level-3 operations (matrix multiplication with $O(m^2n)$ or $O(mn^2)$ flops).

In [9]:
Random.seed!(123)
n = 2000
A, B = randn(n, n), randn(n, n)

# slow way to evaluate this thing
@benchmark tr(transpose($A) * $B)
Out[9]:
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     104.078 ms (0.00% GC)
  median time:      116.660 ms (0.00% GC)
  mean time:        136.762 ms (1.64% GC)
  maximum time:     250.691 ms (4.25% GC)
  --------------
  samples:          37
  evals/sample:     1

But $\text{trace}(\mathbf{A}^T \mathbf{B}) = \langle \text{vec}(\mathbf{A}), \text{vec}(\mathbf{B}) \rangle$. The latter is just a dot product (a level-1 operation) with $O(mn)$ flops.

In [10]:
@benchmark dot($A, $B)
Out[10]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.598 ms (0.00% GC)
  median time:      3.627 ms (0.00% GC)
  mean time:        3.727 ms (0.00% GC)
  maximum time:     8.622 ms (0.00% GC)
  --------------
  samples:          1340
  evals/sample:     1
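
A quick sanity check of the identity on small matrices (the names A2, B2 and the 3×4 size are just for illustration):

    # trace(AᵀB) equals the inner product of vec(A) and vec(B)
    A2, B2 = randn(3, 4), randn(3, 4)
    tr(transpose(A2) * B2) ≈ dot(A2, B2)   # true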

Example 3. Similarly $\text{diag}(\mathbf{A}^T \mathbf{B})$ can be calculated in $O(mn)$ flops.

In [11]:
# slow way to evaluate this thing: O(n^3)
@benchmark diag(transpose($A) * $B)
Out[11]:
BenchmarkTools.Trial: 
  memory estimate:  30.53 MiB
  allocs estimate:  3
  --------------
  minimum time:     130.037 ms (0.00% GC)
  median time:      140.912 ms (0.00% GC)
  mean time:        157.125 ms (1.50% GC)
  maximum time:     310.582 ms (3.53% GC)
  --------------
  samples:          33
  evals/sample:     1
In [12]:
# smarter: O(n^2)
@benchmark Diagonal(vec(sum($A .* $B, dims=1)))
Out[12]:
BenchmarkTools.Trial: 
  memory estimate:  30.53 MiB
  allocs estimate:  5
  --------------
  minimum time:     9.171 ms (0.00% GC)
  median time:      10.793 ms (0.00% GC)
  mean time:        13.928 ms (18.92% GC)
  maximum time:     35.328 ms (45.95% GC)
  --------------
  samples:          359
  evals/sample:     1

To avoid allocating intermediate arrays altogether, we can write a double loop or apply the dot function to column views.

In [13]:
using LoopVectorization

function diag_matmul!(d, A, B)
    m, n = size(A)
    @assert size(B) == (m, n) "A and B should have same size"
    fill!(d, 0)
    @avx for j in 1:n, i in 1:m
        d[j] += A[i, j] * B[i, j]
    end
#     for j in 1:n
#         @views d[j] = dot(A[:, j], B[:, j])
#     end
    Diagonal(d)
end

d = zeros(eltype(A), size(A, 2))
@benchmark diag_matmul!($d, $A, $B)
Out[13]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.103 ms (0.00% GC)
  median time:      3.315 ms (0.00% GC)
  mean time:        3.713 ms (0.00% GC)
  maximum time:     22.960 ms (0.00% GC)
  --------------
  samples:          1346
  evals/sample:     1
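
A quick correctness check of diag_matmul! against the naive computation, on small inputs (the names A2, B2, d2 are made up for illustration):

    A2, B2 = randn(5, 3), randn(5, 3)
    d2 = zeros(3)
    diag_matmul!(d2, A2, B2) ≈ Diagonal(diag(transpose(A2) * B2))   # true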

Memory hierarchy and level-3 fraction

The key to high performance is effective use of the memory hierarchy. This is true on all architectures.

  • Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.

  • Numbers everyone should know
Operation | Time
L1 cache reference | 0.5 ns
L2 cache reference | 7 ns
Main memory reference | 100 ns
Read 1 MB sequentially from memory | 250,000 ns
Read 1 MB sequentially from SSD | 1,000,000 ns
Read 1 MB sequentially from disk | 20,000,000 ns

Source: https://gist.github.com/jboner/2841832

  • For example, the Xeon X5650 CPU has a theoretical peak throughput of 128 GFLOPS in double precision but a maximum memory bandwidth of only 32 GB/s (see the back-of-envelope calculation at the end of this subsection).

  • Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog?
    Answer: use high-level BLAS as much as possible.

BLAS | Dimension | Mem. Refs. | Flops | Ratio
Level 1: $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $3n$ | $2n$ | 3:2
Level 2: $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n \times n}$ | $n^2$ | $2n^2$ | 1:2
Level 3: $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | $\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{R}^{n \times n}$ | $4n^2$ | $2n^3$ | 2:n
  • Higher-level BLAS (level 3 or 2) makes more effective use of the arithmetic logic units (ALUs) by keeping them busy: the surface-to-volume effect.
    See Dongarra slides.

  • A distinction between LAPACK and LINPACK (older versions of R used LINPACK) is that LAPACK makes use of higher-level BLAS as much as possible (usually by smart blocking/partitioning) to increase the so-called level-3 fraction.

  • To appreciate the effort that goes into an optimized BLAS implementation such as OpenBLAS (evolved from GotoBLAS), see the Quora question, especially the video. The bottom line is:

Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.
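
A back-of-envelope calculation with the Xeon X5650 numbers above (double precision, 8 bytes per floating-point number) illustrates the surface-to-volume effect; the matrix size n = 2000 is chosen arbitrarily:

    # can memory traffic keep the floating-point units busy?
    peak_flops = 128e9        # Xeon X5650 theoretical peak, double precision (flop/s)
    bandwidth  = 32e9         # maximum memory bandwidth (bytes/s)
    # level-1 axpy: 3n memory refs per 2n flops ⇒ 1.5 refs/flop × 8 bytes = 12 bytes/flop
    (bandwidth / 12) / peak_flops            # ≈ 0.02: memory feeds only ~2% of peak
    # level-3 gemm: 4n² refs per 2n³ flops ⇒ (16/n) bytes/flop; with n = 2000
    (bandwidth / (16 / 2000)) / peak_flops   # ≈ 31: memory can feed far more than peak ⇒ compute-bound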

Effect of data layout

  • Data layout in memory affects algorithmic efficiency too. It is much faster to move contiguous chunks of data than to read/write scattered data.

  • Storage mode: column-major (Fortran, Matlab, R, Julia) vs row-major (C/C++).

  • A cache line is the minimum unit of data that can be transferred between cache and memory.

    • x86 CPUs: 64 bytes
    • ARM CPUs: 32 bytes

  • Accessing a column-major matrix by rows ($ij$ looping) causes lots of cache misses.

  • Take matrix multiplication as an example $$ \mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}, \quad \mathbf{A} \in \mathbb{R}^{m \times p}, \mathbf{B} \in \mathbb{R}^{p \times n}, \mathbf{C} \in \mathbb{R}^{m \times n}. $$ Assume the storage is column-major, such as in Julia. There are 6 variants of the algorithms according to the order in the triple loops.

    • jki or kji looping:
      # innermost loop
        for i = 1:m
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
      
    • ikj or kij looping:
      # innermost loop
        for j = 1:n
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
      
    • ijk or jik looping:
      # innermost loop
        for k = 1:p
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
      
  • We pay attention to the innermost loop, where the vector calculation occurs. The associated stride when accessing the three matrices in memory (assuming column-major storage) is
Variant | A Stride | B Stride | C Stride
$jki$ or $kji$ | Unit | 0 | Unit
$ikj$ or $kij$ | 0 | Non-Unit | Non-Unit
$ijk$ or $jik$ | Non-Unit | Unit | 0

Clearly the $jki$ and $kji$ variants are preferred.

In [14]:
"""
    matmul_by_loop!(A, B, C, order)

Overwrite `C` by `A * B`. `order` indicates the looping order for triple loop.
"""
function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)
    
    m = size(A, 1)
    p = size(A, 2)
    n = size(B, 2)
    fill!(C, 0)
    
    if order == "jki"
        @inbounds for j = 1:n, k = 1:p, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kji"
        @inbounds for k = 1:p, j = 1:n, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ikj"
        @inbounds for i = 1:m, k = 1:p, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kij"
        @inbounds for k = 1:p, i = 1:m, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ijk"
        @inbounds for i = 1:m, j = 1:n, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "jik"
        @inbounds for j = 1:n, i = 1:m, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
end

using Random

Random.seed!(123)
m, p, n = 2000, 100, 2000
A = rand(m, p)
B = rand(p, n)
C = zeros(m, n);
  • $jki$ and $kji$ looping:
In [15]:
using BenchmarkTools

@benchmark matmul_by_loop!($A, $B, $C, "jki")
Out[15]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     74.823 ms (0.00% GC)
  median time:      87.405 ms (0.00% GC)
  mean time:        97.224 ms (0.00% GC)
  maximum time:     197.993 ms (0.00% GC)
  --------------
  samples:          52
  evals/sample:     1
In [16]:
@benchmark matmul_by_loop!($A, $B, $C, "kji")
Out[16]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     464.418 ms (0.00% GC)
  median time:      479.294 ms (0.00% GC)
  mean time:        486.607 ms (0.00% GC)
  maximum time:     554.250 ms (0.00% GC)
  --------------
  samples:          11
  evals/sample:     1
  • $ikj$ and $kij$ looping:
In [17]:
@benchmark matmul_by_loop!($A, $B, $C, "ikj")
Out[17]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.662 s (0.00% GC)
  median time:      2.753 s (0.00% GC)
  mean time:        2.753 s (0.00% GC)
  maximum time:     2.844 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1
In [18]:
@benchmark matmul_by_loop!($A, $B, $C, "kij")
Out[18]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.838 s (0.00% GC)
  median time:      2.841 s (0.00% GC)
  mean time:        2.841 s (0.00% GC)
  maximum time:     2.843 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1
  • $ijk$ and $jik$ looping:
In [19]:
@benchmark matmul_by_loop!($A, $B, $C, "ijk")
Out[19]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     507.060 ms (0.00% GC)
  median time:      512.543 ms (0.00% GC)
  mean time:        513.585 ms (0.00% GC)
  maximum time:     525.603 ms (0.00% GC)
  --------------
  samples:          10
  evals/sample:     1
In [20]:
@benchmark matmul_by_loop!($A, $B, $C, "ijk")
Out[20]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     509.894 ms (0.00% GC)
  median time:      519.896 ms (0.00% GC)
  mean time:        521.933 ms (0.00% GC)
  maximum time:     548.736 ms (0.00% GC)
  --------------
  samples:          10
  evals/sample:     1
  • Julia wraps the BLAS library for matrix multiplication. We see the BLAS routine wins hands down (multi-threading, cache blocking and SIMD kernels, higher level-3 fraction by block outer products).
In [21]:
@benchmark mul!($C, $A, $B)
Out[21]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.165 ms (0.00% GC)
  median time:      10.710 ms (0.00% GC)
  mean time:        10.179 ms (0.00% GC)
  maximum time:     14.958 ms (0.00% GC)
  --------------
  samples:          492
  evals/sample:     1
In [22]:
# direct call of BLAS wrapper function
@benchmark LinearAlgebra.BLAS.gemm!('N', 'N', 1.0, $A, $B, 0.0, $C)
Out[22]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.187 ms (0.00% GC)
  median time:      7.671 ms (0.00% GC)
  mean time:        7.989 ms (0.00% GC)
  maximum time:     12.011 ms (0.00% GC)
  --------------
  samples:          626
  evals/sample:     1

Exercise: annotate the loops in matmul_by_loop! with @avx and benchmark again.

BLAS in R

  • Tip for R users: the standard R distribution from CRAN ships with a slow, un-optimized reference BLAS/LAPACK library.
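
To check which BLAS/LAPACK an R installation links against, sessionInfo() reports the library paths in recent R versions (R ≥ 3.4); a minimal sketch using RCall:

    using RCall

    R"""
    sessionInfo()   # look for the BLAS: and LAPACK: lines in the output
    """
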
In [23]:
using RCall

R"""
library(dplyr)
library(bench)
bench::mark($A %*% $B) %>%
  print(width = Inf)
""";
┌ Warning: RCall.jl: 
│ Attaching package: ‘dplyr’
│ 
│ The following objects are masked from ‘package:stats’:
│ 
│     filter, lag
│ 
│ The following objects are masked from ‘package:base’:
│ 
│     intersect, setdiff, setequal, union
│ 
└ @ RCall /Users/huazhou/.julia/packages/RCall/eRsxl/src/io.jl:160
# A tibble: 1 x 13
  expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 `#JL`$A %*% `#JL`$B    251ms    287ms      3.49    30.5MB     3.49     2     2
  total_time result                       memory                 time        
    <bch:tm> <list>                       <list>                 <list>      
1      573ms <dbl[,2000] [2,000 × 2,000]> <Rprofmem[,3] [1 × 3]> <bch:tm [2]>
  gc              
  <list>          
1 <tibble [2 × 3]>
┌ Warning: RCall.jl: Warning: Some expressions had a GC in every iteration; so filtering is disabled.
└ @ RCall /Users/huazhou/.julia/packages/RCall/eRsxl/src/io.jl:160
  • Re-building R from source against OpenBLAS or MKL immediately boosts linear algebra performance in R. Google "build R with MKL" to get started. Similarly, Julia can be built with MKL; a quick way to check which BLAS a Julia session uses is shown at the end of this subsection.

  • Matlab uses MKL. Usually it's very hard to beat Matlab in terms of linear algebra.

In [24]:
using MATLAB

mat"""
f = @() $A * $B;
timeit(f)
"""
ArgumentError: Package MATLAB not found in current path:
- Run `import Pkg; Pkg.add("MATLAB")` to install the MATLAB package.


Stacktrace:
 [1] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:871
 [2] eval
   @ ./boot.jl:360 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
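
On the Julia side, one can check which BLAS the current session links against; in Julia 1.6 this is BLAS.vendor() (newer releases provide BLAS.get_config() instead):

    using LinearAlgebra

    LinearAlgebra.BLAS.vendor()   # :openblas64 for a stock Julia build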

Avoid memory allocation: some examples

  1. Transposing a matrix is an expensive memory operation.
    • In R, the command
      t(A) %*% x
      
      will first transpose A and then perform the matrix multiplication, causing an unnecessary memory allocation.
    • Julia is smart enough to avoid explicitly transposing the matrix whenever possible: transpose(A) is a lazy wrapper around A.
In [25]:
using Random, LinearAlgebra, BenchmarkTools
Random.seed!(123)

n = 1000
A = rand(n, n)
x = rand(n);
In [26]:
typeof(transpose(A))
Out[26]:
Transpose{Float64, Matrix{Float64}}
In [27]:
fieldnames(typeof(transpose(A)))
Out[27]:
(:parent,)
In [28]:
# same data in transpose(A) and the original matrix A
pointer(transpose(A).parent), pointer(A)
Out[28]:
(Ptr{Float64} @0x00007fe887000000, Ptr{Float64} @0x00007fe887000000)
In [29]:
# dispatch to BLAS
# does *not* actually transpose the matrix
@benchmark transpose($A) * $x
Out[29]:
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     206.314 μs (0.00% GC)
  median time:      277.555 μs (0.00% GC)
  mean time:        308.237 μs (0.00% GC)
  maximum time:     7.062 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [30]:
# pre-allocate result
out = zeros(size(A, 2))
@benchmark mul!($out, transpose($A), $x)
Out[30]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     215.436 μs (0.00% GC)
  median time:      276.495 μs (0.00% GC)
  mean time:        291.696 μs (0.00% GC)
  maximum time:     2.387 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [31]:
# or call BLAS wrapper directly
@benchmark LinearAlgebra.BLAS.gemv!('T', 1.0, $A, $x, 0.0, $out)
Out[31]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     75.687 μs (0.00% GC)
  median time:      103.846 μs (0.00% GC)
  mean time:        115.749 μs (0.00% GC)
  maximum time:     1.645 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  2. Broadcasting in Julia achieves vectorized code without creating intermediate arrays.

    Suppose we want to calculate the elementwise maximum of the absolute values of two large arrays. In R or Matlab, the command

    max(abs(X), abs(Y))
    

    will create two intermediate arrays and then one result array.

In [32]:
using RCall
Random.seed!(123)
X, Y = rand(1000, 1000), rand(1000, 1000)

R"""
library(dplyr)
library(bench)
bench::mark(max(abs($X), abs($Y))) %>%
  print(width = Inf)
""";
# A tibble: 1 x 13
  expression                           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 max(abs(`#JL`$X), abs(`#JL`$Y))   5.84ms   6.31ms      149.    15.3MB     82.2
  n_itr  n_gc total_time result    memory                 time         
  <int> <dbl>   <bch:tm> <list>    <list>                 <list>       
1    38    21      256ms <dbl [1]> <Rprofmem[,3] [2 × 3]> <bch:tm [59]>
  gc               
  <list>           
1 <tibble [59 × 3]>

In Julia, dot operations are fused so no intermediate arrays are created.

In [33]:
# no intermediate arrays created, only result array created
@benchmark max.(abs.($X), abs.($Y))
Out[33]:
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     1.354 ms (0.00% GC)
  median time:      1.876 ms (0.00% GC)
  mean time:        2.629 ms (21.04% GC)
  maximum time:     22.173 ms (64.41% GC)
  --------------
  samples:          1850
  evals/sample:     1

Pre-allocating the result array gets rid of memory allocation altogether.

In [34]:
# no memory allocation at all!
Z = zeros(size(X)) # zero matrix of same size as X
@benchmark $Z .= max.(abs.($X), abs.($Y)) # .= (vs =) is important!
Out[34]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.193 ms (0.00% GC)
  median time:      1.465 ms (0.00% GC)
  mean time:        1.682 ms (0.00% GC)
  maximum time:     35.152 ms (0.00% GC)
  --------------
  samples:          2968
  evals/sample:     1
  3. Views (via view or the @view macro) avoid creating an extra copy of matrix data.
In [35]:
Random.seed!(123)
A = randn(1000, 1000)

# sum entries in a sub-matrix
@benchmark sum($A[1:2:500, 1:2:500])
Out[35]:
BenchmarkTools.Trial: 
  memory estimate:  488.39 KiB
  allocs estimate:  2
  --------------
  minimum time:     77.908 μs (0.00% GC)
  median time:      271.204 μs (0.00% GC)
  mean time:        299.615 μs (11.03% GC)
  maximum time:     9.859 ms (95.91% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [36]:
# view avoids creating a separate sub-matrix
@benchmark sum(@view $A[1:2:500, 1:2:500])
Out[36]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     146.091 μs (0.00% GC)
  median time:      149.635 μs (0.00% GC)
  mean time:        159.444 μs (0.00% GC)
  maximum time:     1.090 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

An alternative is the @views macro, which can be useful in such operations.

In [37]:
@benchmark @views sum($A[1:2:500, 1:2:500])
Out[37]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     144.778 μs (0.00% GC)
  median time:      145.064 μs (0.00% GC)
  mean time:        154.738 μs (0.00% GC)
  maximum time:     1.067 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
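
The @views macro can also wrap a whole loop or function body, turning every slice inside it into a view. A minimal sketch (the function name colsum! is made up for illustration):

    # every A[:, j] below is a view, so no column copies are allocated
    function colsum!(s, A)
        @views for j in axes(A, 2)
            s[j] = sum(A[:, j])
        end
        return s
    end

    s = zeros(size(A, 2))
    colsum!(s, A)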