GPU Computing in Julia¶

This session introduces GPU computing in Julia.

GPGPU¶

GPUs are ubiquitous in modern computers. The following table lists GPUs found in today's typical computer systems.

NVIDIA GPUs	Tesla K80	GTX 1080	GT 650M
Computers	servers, cluster	desktop	laptop
Main usage	scientific computing	daily work, gaming	daily work
Memory	24 GB	8 GB	1 GB
Memory bandwidth	480 GB/sec	320 GB/sec	80 GB/sec
Number of cores	4992	2560	384
Processor clock	562 MHz	1.6 GHz	0.9 GHz
Peak DP performance	2.91 TFLOPS	257 GFLOPS	
Peak SP performance	8.73 TFLOPS	8228 GFLOPS	691 GFLOPS
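
The peak single-precision figures in the table can be roughly reproduced from the core count and clock rate, assuming each core completes one fused multiply-add (2 floating-point operations) per clock cycle. The following is a minimal sketch under that assumption; the helper name `peak_sp_gflops` is hypothetical and not part of any package.

```julia
# Theoretical peak SP throughput in GFLOPS, assuming one fused
# multiply-add (2 floating-point operations) per core per clock cycle.
# `peak_sp_gflops` is an illustrative helper, not a library function.
peak_sp_gflops(cores, clock_ghz) = 2 * cores * clock_ghz

# Core counts and clocks are taken from the table above.
k80     = peak_sp_gflops(4992, 0.562)
gtx1080 = peak_sp_gflops(2560, 1.6)
gt650m  = peak_sp_gflops(384, 0.9)

println("Tesla K80: ", round(k80, digits = 1), " GFLOPS")
println("GTX 1080:  ", round(gtx1080, digits = 1), " GFLOPS")
println("GT 650M:   ", round(gt650m, digits = 1), " GFLOPS")
```

The GT 650M and GTX 1080 estimates land on or near the tabulated values. The Tesla K80 estimate at 562 MHz falls well below the listed 8.73 TFLOPS, which presumably reflects a higher boost clock than the one shown in the table.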

GPU architecture vs CPU architecture.

GPUs contain hundreds to thousands of processing cores on a single card; several cards can fit in a desktop PC
Each core carries out the same operations in parallel on different input data -- single program, multiple data (SPMD) paradigm
Extremely high arithmetic throughput, provided one can transfer the data onto and results off of the processors quickly
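
The SPMD pattern can be sketched in plain Julia on the CPU: one scalar function (the "single program") is applied independently to every element (the "multiple data"). This is only an illustration of the programming model, not GPU code; the function name `kernel` is hypothetical.

```julia
# The "single program": one scalar function applied to every element.
# `kernel` is an illustrative name, not a GPU API.
kernel(x, y) = log(y + sin(x))

x = rand(Float32, 4, 4)
y = ones(Float32, 4, 4)

# Explicit loop: every iteration is independent of the others, so on a
# GPU each one could be handled by a separate thread.
z_loop = similar(x)
for i in eachindex(x, y)
    z_loop[i] = kernel(x[i], y[i])
end

# Broadcasting expresses the same computation without writing the loop.
z_bcast = kernel.(x, y)

println(z_loop ≈ z_bcast)
```

Because no iteration depends on another, the work can be split across as many cores as are available, which is what makes this pattern a good fit for GPUs.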

GPGPU in Julia¶

GPU support by Julia is under active development. Check JuliaGPU for currently available packages.

There are at least three paradigms for programming GPUs in Julia.

CUDA is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: cuBLAS, cuRAND, cuSPARSE, cuSOLVER, cuDNN, ...

The CuArrays.jl package allows defining arrays on Nvidia GPUs and overloads many common operations. CuArrays.jl supports Julia v1.0+.
OpenCL is a standard supported by multiple manufacturers (Nvidia, AMD, Intel, Apple, ...), but lacks some libraries essential for statistical computing.

The CLArrays.jl package allows defining arrays on OpenCL devices and overloads many common operations.
ArrayFire is a high-performance library that works on both the CUDA and OpenCL frameworks.

The ArrayFire.jl package wraps the library for Julia.
Warning: The most recent Apple operating system, macOS 10.15 (Catalina), does not support CUDA yet.

I'll illustrate using CuArrays on my Linux box running CentOS 7. It has an NVIDIA GeForce RTX 2080 Ti OC with 11 GB GDDR6 (14 Gbps) and 4352 cores.

versioninfo()

Julia Version 1.5.0
Commit 96786e22cc (2020-08-01 23:44 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9920X CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 8

Query GPU devices in the system¶

using CuArrays, CUDAdrv

# check available devices on this machine and show their capability
for device in CuArrays.devices()
    @show capability(device)
end

capability(device) = v"7.5.0"

Transfer data between main memory and GPU¶

# generate data on CPU
x = rand(Float32, 3, 3)
# transfer data from CPU to GPU
xd = CuArray(x)

3×3 CuArray{Float32,2,Nothing}:
 0.0778205  0.831817  0.20406
 0.295554   0.359421  0.548004
 0.455887   0.67262   0.818847

# generate array on GPU directly
yd = ones(CuArray{Float32}, 3, 3)

3×3 CuArray{Float32,2,Nothing}:
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0

# collect data from GPU to CPU
x = collect(xd)

3×3 Array{Float32,2}:
 0.0778205  0.831817  0.20406
 0.295554   0.359421  0.548004
 0.455887   0.67262   0.818847

Linear algebra¶

using BenchmarkTools, LinearAlgebra

n = 1024
# on CPU
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)
# on GPU
xd = CuArray(x)
yd = CuArray(y)
zd = CuArray(z)

# SP matrix multiplication on GPU
@benchmark mul!($zd, $xd, $yd)

BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     2.972 μs (0.00% GC)
  median time:      176.412 μs (0.00% GC)
  mean time:        169.835 μs (0.00% GC)
  maximum time:     185.467 μs (0.00% GC)
  --------------
  samples:          3271
  evals/sample:     9

# SP matrix multiplication on CPU
@benchmark mul!($z, $x, $y)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.376 ms (0.00% GC)
  median time:      7.520 ms (0.00% GC)
  mean time:        7.612 ms (0.00% GC)
  maximum time:     12.965 ms (0.00% GC)
  --------------
  samples:          657
  evals/sample:     1

We see a roughly 40-50x speedup in this matrix multiplication example (about 43x by median time: 7.520 ms on CPU vs. 176.412 μs on GPU).
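
To put the timings in perspective, the median times reported above can be converted into floating-point rates, using the conventional count of 2n³ operations for an n×n matrix product. This is a rough sketch: it treats the reported medians as full execution times, and whether the GPU timing includes device synchronization is not established by the output shown.

```julia
n = 1024
flops = 2 * n^3          # conventional operation count for an n×n by n×n product

t_gpu = 176.412e-6       # median GPU time in seconds (from the output above)
t_cpu = 7.520e-3         # median CPU time in seconds (from the output above)

speedup    = t_cpu / t_gpu
gpu_tflops = flops / t_gpu / 1e12
cpu_gflops = flops / t_cpu / 1e9

println("speedup:  ", round(speedup, digits = 1), "x")
println("GPU rate: ", round(gpu_tflops, digits = 2), " TFLOPS")
println("CPU rate: ", round(cpu_gflops, digits = 1), " GFLOPS")
```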

# cholesky on Gram matrix
xtxd = xd'xd + I
@benchmark cholesky($(Symmetric(xtxd)))

BenchmarkTools.Trial: 
  memory estimate:  3.81 KiB
  allocs estimate:  90
  --------------
  minimum time:     818.295 μs (0.00% GC)
  median time:      825.061 μs (0.00% GC)
  mean time:        826.854 μs (0.13% GC)
  maximum time:     6.486 ms (51.50% GC)
  --------------
  samples:          6046
  evals/sample:     1

xtx = collect(xtxd)
@benchmark cholesky($(Symmetric(xtx)))

BenchmarkTools.Trial: 
  memory estimate:  4.00 MiB
  allocs estimate:  4
  --------------
  minimum time:     1.971 ms (0.00% GC)
  median time:      2.018 ms (0.00% GC)
  mean time:        2.480 ms (5.54% GC)
  maximum time:     14.605 ms (0.00% GC)
  --------------
  samples:          2016
  evals/sample:     1

The GPU speedup of Cholesky on this example is moderate: about 2.4x by median time (2.018 ms on CPU vs. 825.061 μs on GPU).
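
The reason for adding `I` to the Gram matrix before factoring can be checked on the CPU with the standard library alone. The sketch below uses a smaller matrix (n = 64, an arbitrary choice for illustration) and verifies that the Cholesky factor reproduces the input up to rounding error.

```julia
using LinearAlgebra

n = 64                       # small size chosen only for illustration
x = rand(Float32, n, n)

# x'x is positive semidefinite; adding I shifts every eigenvalue up by 1,
# which keeps the matrix safely positive definite for the factorization.
A = Symmetric(x'x + I)

F = cholesky(A)

# The lower-triangular factor L satisfies L * L' ≈ A up to Float32 rounding.
println(F.L * F.L' ≈ A)
```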

Elementwise operations on GPU¶

# elementwise function on GPU arrays
fill!(yd, 1)
@benchmark $zd .= log.($yd .+ sin.($xd))

BenchmarkTools.Trial: 
  memory estimate:  4.00 KiB
  allocs estimate:  82
  --------------
  minimum time:     9.527 μs (0.00% GC)
  median time:      27.800 μs (0.00% GC)
  mean time:        25.060 μs (0.00% GC)
  maximum time:     113.848 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

# elementwise function on CPU arrays
x, y, z = collect(xd), collect(yd), collect(zd)
@benchmark $z .= log.($y .+ sin.($x))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.301 ms (0.00% GC)
  median time:      14.314 ms (0.00% GC)
  mean time:        14.318 ms (0.00% GC)
  maximum time:     14.398 ms (0.00% GC)
  --------------
  samples:          350
  evals/sample:     1

The GPU brings a large speedup (>500x by median time: 14.314 ms on CPU vs. 27.800 μs on GPU) to the bulk evaluation of elementary math functions.
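
Elementwise operations do little arithmetic per byte moved, so memory traffic is a natural yardstick. The rough estimate below converts the median GPU time into an effective memory rate, assuming each of the three arrays is read or written exactly once and that the reported median reflects the full execution time.

```julia
n = 1024
bytes = 3 * n^2 * sizeof(Float32)   # read xd and yd, write zd (assumed once each)

t_gpu = 27.800e-6                   # median GPU time in seconds (from the output above)
t_cpu = 14.314e-3                   # median CPU time in seconds (from the output above)

speedup  = t_cpu / t_gpu
gpu_gbps = bytes / t_gpu / 1e9      # effective memory rate in GB/sec

println("speedup:            ", round(speedup, digits = 1), "x")
println("effective GPU rate: ", round(gpu_gbps, digits = 1), " GB/sec")
```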