Cholesky Decomposition

  • A basic tenet in numerical analysis:

The structure should be exploited whenever solving a problem.

Common structures include: symmetry, positive (semi)definiteness, sparsity, Kronecker product, low rank, ...

  • LU decomposition (Gaussian Elimination) is not used in statistics so often because most of time statisticians deal with positive (semi)definite matrix.

  • That's why I often criticize use of the solve() function in base R code, which inverts a matrix using LU decomposition. First, most likely we don't need a matrix inverse. Second, most likely we are dealing with a pd/psd matrix and should use Cholesky.

  • For example, in the normal equation $$ \mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y} $$ for linear regression, the coefficient matrix $\mathbf{X}^T \mathbf{X}$ is symmetric and positive semidefinite. How to exploit this structure?

In [1]:
versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

Cholesky decomposition

  • Theorem: Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be symmetric and positive definite. Then $\mathbf{A} = \mathbf{L} \mathbf{L}^T$, where $\mathbf{L}$ is lower triangular with positive diagonal entries and is unique.
    Proof (by induction):
    If $n=1$, then $\ell = \sqrt{a}$. For $n>1$, the block equation $$ \begin{eqnarray*} \begin{pmatrix} a_{11} & \mathbf{a}^T \\ \mathbf{a} & \mathbf{A}_{22} \end{pmatrix} = \begin{pmatrix} \ell_{11} & \mathbf{0}_{n-1}^T \\ \mathbf{l} & \mathbf{L}_{22} \end{pmatrix} \begin{pmatrix} \ell_{11} & \mathbf{l}^T \\ \mathbf{0}_{n-1} & \mathbf{L}_{22}^T \end{pmatrix} \end{eqnarray*} $$ has solution $$ \begin{eqnarray*} \ell_{11} &=& \sqrt{a_{11}} \\ \mathbf{l} &=& \ell_{11}^{-1} \mathbf{a} \\ \mathbf{L}_{22} \mathbf{L}_{22}^T &=& \mathbf{A}_{22} - \mathbf{l} \mathbf{l}^T = \mathbf{A}_{22} - a_{11}^{-1} \mathbf{a} \mathbf{a}^T. \end{eqnarray*} $$
    Now $a_{11}>0$ (why?), so $\ell_{11}$ and $\mathbf{l}$ are uniquely determined. $\mathbf{A}_{22} - a_{11}^{-1} \mathbf{a} \mathbf{a}^T$ is positive definite because $\mathbf{A}$ is positive definite (why?). By induction hypothesis, $\mathbf{L}_{22}$ exists and is unique.

  • The constructive proof completely specifies the algorithm:

  • Computational cost: $$ \frac{1}{2} [2(n-1)^2 + 2(n-2)^2 + \cdots + 2 \cdot 1^2] \approx \frac{1}{3} n^3 \quad \text{flops} $$ plus $n$ square roots. Half the cost of LU decomposition by utilizing symmetry.

  • In general Cholesky decomposition is very stable. Failure of the decomposition simply means $\mathbf{A}$ is not positive definite. It is an efficient way to test positive definiteness.

Pivoting

  • When $\mathbf{A}$ does not have full rank, e.g., $\mathbf{X}^T \mathbf{X}$ with a non-full column rank $\mathbf{X}$, we encounter $a_{kk} = 0$ during the procedure.

  • Symmetric pivoting. At each stage $k$, we permute both row and column such that $\max_{k \le i \le n} a_{ii}$ becomes the pivot. If we encounter $\max_{k \le i \le n} a_{ii} = 0$, then $\mathbf{A}[k:n,k:n] = \mathbf{0}$ (why?) and the algorithm terminates.

  • With symmetric pivoting: $$ \mathbf{P} \mathbf{A} \mathbf{P}^T = \mathbf{L} \mathbf{L}^T, $$ where $\mathbf{P}$ is a permutation matrix and $\mathbf{L} \in \mathbb{R}^{n \times r}$, $r = \text{rank}(\mathbf{A})$.

LAPACK and Julia implementation

Example: positive definite matrix

In [2]:
using LinearAlgebra

A = [4.0 12 -16; 12 37 -43; -16 -43 98]
Out[2]:
3×3 Matrix{Float64}:
   4.0   12.0  -16.0
  12.0   37.0  -43.0
 -16.0  -43.0   98.0
In [3]:
# Cholesky without pivoting
Achol = cholesky(Symmetric(A))
Out[3]:
Cholesky{Float64, Matrix{Float64}}
U factor:
3×3 UpperTriangular{Float64, Matrix{Float64}}:
 2.0  6.0  -8.0
  ⋅   1.0   5.0
  ⋅    ⋅    3.0
In [4]:
typeof(Achol)
Out[4]:
Cholesky{Float64, Matrix{Float64}}
In [5]:
dump(Achol)
Cholesky{Float64, Matrix{Float64}}
  factors: Array{Float64}((3, 3)) [2.0 6.0 -8.0; 12.0 1.0 5.0; -16.0 -43.0 3.0]
  uplo: Char 'U'
  info: Int64 0
In [6]:
# retrieve the lower triangular Cholesky factor
Achol.L
Out[6]:
3×3 LowerTriangular{Float64, Matrix{Float64}}:
  2.0   ⋅    ⋅ 
  6.0  1.0   ⋅ 
 -8.0  5.0  3.0
In [7]:
# retrieve the upper triangular Cholesky factor
Achol.U
Out[7]:
3×3 UpperTriangular{Float64, Matrix{Float64}}:
 2.0  6.0  -8.0
  ⋅   1.0   5.0
  ⋅    ⋅    3.0
In [8]:
b = [1.0; 2.0; 3.0]
A \ b # this does LU; wasteful!!!; 2/3 n^3 + 2n^2
Out[8]:
3-element Vector{Float64}:
 28.58333333333338
 -7.666666666666679
  1.3333333333333353
In [9]:
Achol \ b # two triangular solves; only 2n^2 flops
Out[9]:
3-element Vector{Float64}:
 28.583333333333332
 -7.666666666666666
  1.3333333333333333
In [10]:
det(A) # this does LU; wasteful!!! (2/3) n^3
Out[10]:
35.99999999999994
In [11]:
det(Achol) # cheap
Out[11]:
36.0
In [12]:
inv(A) # this does LU! (2/3) n^3 + (4/3) n^3
Out[12]:
3×3 Matrix{Float64}:
  49.3611   -13.5556     2.11111
 -13.5556     3.77778   -0.555556
   2.11111   -0.555556   0.111111
In [13]:
inv(Achol) # (4/3) n^3
Out[13]:
3×3 Matrix{Float64}:
  49.3611   -13.5556     2.11111
 -13.5556     3.77778   -0.555556
   2.11111   -0.555556   0.111111

Example: positive semi-definite matrix.

In [14]:
using Random

Random.seed!(123) # seed
A = randn(5, 3)
A = A * transpose(A) # A has rank 3
Out[14]:
5×5 Matrix{Float64}:
  1.97375    2.0722    1.71191    0.253774   -0.544089
  2.0722     5.86947   3.01646    0.93344    -1.50292
  1.71191    3.01646   2.10156    0.21341    -0.965213
  0.253774   0.93344   0.21341    0.393107   -0.0415803
 -0.544089  -1.50292  -0.965213  -0.0415803   0.546021
In [15]:
Achol = cholesky(A, Val(true)) # 2nd argument requests pivoting
RankDeficientException(1)

Stacktrace:
 [1] chkfullrank
   @ /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/cholesky.jl:639 [inlined]
 [2] #cholesky!#127
   @ /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/cholesky.jl:295 [inlined]
 [3] cholesky!(A::Matrix{Float64}, ::Val{true}; tol::Float64, check::Bool)
   @ LinearAlgebra /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/cholesky.jl:321
 [4] #cholesky#131
   @ /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/cholesky.jl:402 [inlined]
 [5] cholesky(A::Matrix{Float64}, ::Val{true})
   @ LinearAlgebra /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/cholesky.jl:402
 [6] top-level scope
   @ In[15]:1
 [7] eval
   @ ./boot.jl:360 [inlined]
 [8] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [16]:
Achol = cholesky(A, Val(true), check=false) # turn off checking pd
Out[16]:
CholeskyPivoted{Float64, Matrix{Float64}}
U factor with rank 4:
5×5 UpperTriangular{Float64, Matrix{Float64}}:
 2.4227  0.855329   0.38529    -0.620349     1.24508
  ⋅      1.11452   -0.0679895  -0.0121011    0.580476
  ⋅       ⋅         0.489935    0.4013      -0.463002
  ⋅       ⋅          ⋅          1.49012e-8   0.0
  ⋅       ⋅          ⋅           ⋅           0.0
permutation:
5-element Vector{Int64}:
 2
 1
 4
 5
 3
In [17]:
rank(Achol) # determine rank from Cholesky factor
Out[17]:
4
In [18]:
rank(A) # determine rank from SVD (default), which is more numerically stable
Out[18]:
3
In [19]:
Achol.L
Out[19]:
5×5 LowerTriangular{Float64, Matrix{Float64}}:
  2.4227      ⋅           ⋅         ⋅           ⋅ 
  0.855329   1.11452      ⋅         ⋅           ⋅ 
  0.38529   -0.0679895   0.489935   ⋅           ⋅ 
 -0.620349  -0.0121011   0.4013    1.49012e-8   ⋅ 
  1.24508    0.580476   -0.463002  0.0         0.0
In [20]:
Achol.U
Out[20]:
5×5 UpperTriangular{Float64, Matrix{Float64}}:
 2.4227  0.855329   0.38529    -0.620349     1.24508
  ⋅      1.11452   -0.0679895  -0.0121011    0.580476
  ⋅       ⋅         0.489935    0.4013      -0.463002
  ⋅       ⋅          ⋅          1.49012e-8   0.0
  ⋅       ⋅          ⋅           ⋅           0.0
In [21]:
Achol.p
Out[21]:
5-element Vector{Int64}:
 2
 1
 4
 5
 3
In [22]:
# P A P' = L U
A[Achol.p, Achol.p]  Achol.L * Achol.U
Out[22]:
true

Applications

  • No inversion mentality: Whenever we see matrix inverse, we should think in terms of solving linear equations. If the matrix is positive (semi)definite, use Cholesky decomposition, which is twice cheaper than LU decomposition.

Multivariate normal density

Multivariate normal density $\text{MVN}(0, \Sigma)$, where $\Sigma$ is p.d., is $$ \, - \frac{n}{2} \log (2\pi) - \frac{1}{2} \log \det \Sigma - \frac{1}{2} \mathbf{y}^T \Sigma^{-1} \mathbf{y}. $$

  • Method 1: (a) compute explicit inverse $\Sigma^{-1}$ ($2n^3$ flops), (b) compute quadratic form ($2n^2 + 2n$ flops), (c) compute determinant ($2n^3/3$ flops).

  • Method 2: (a) Cholesky decomposition $\Sigma = \mathbf{L} \mathbf{L}^T$ ($n^3/3$ flops), (b) Solve $\mathbf{L} \mathbf{x} = \mathbf{y}$ by forward substitutions ($n^2$ flops), (c) compute quadratic form $\mathbf{x}^T \mathbf{x}$ ($2n$ flops), and (d) compute determinant from Cholesky factor ($n$ flops).

Which method is better?

In [23]:
# this is a person w/o numerical analysis training
function logpdf_mvn_1(y::Vector, Σ::Matrix)
    n = length(y)
    - (n//2) * log(2π) - (1//2) * logdet(Symmetric(Σ)) - (1//2) * transpose(y) * inv(Σ) * y
end

# this is a computing-savvy person 
function logpdf_mvn_2(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky(Symmetric(Σ))
    - (n//2) * log(2π) - (1//2) * logdet(Σchol) - (1//2) * abs2(norm(Σchol.L \ y))
end

# better memory efficiency - input Σ is overwritten
function logpdf_mvn_3(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky!(Symmetric(Σ)) # Σ is overwritten
    - (n//2) * log(2π) - (1//2) * logdet(Σchol) - (1//2) * dot(y, Σchol \ y)
end
Out[23]:
logpdf_mvn_3 (generic function with 1 method)
In [24]:
using BenchmarkTools, Distributions, Random

Random.seed!(123) # seed

n = 1000
# a pd matrix
Σ = convert(Matrix{Float64}, Symmetric([i * (n - j + 1) for i in 1:n, j in 1:n]))
y = rand(MvNormal(Σ)) # one random sample from N(0, Σ)

# at least they should give the same answer
@show logpdf_mvn_1(y, Σ)
@show logpdf_mvn_2(y, Σ)
Σc = copy(Σ)
@show logpdf_mvn_3(y, Σc);
logpdf_mvn_1(y, Σ) = -4878.375103770505
logpdf_mvn_2(y, Σ) = -4878.375103770553
logpdf_mvn_3(y, Σc) = -4878.375103770553
In [25]:
@benchmark logpdf_mvn_1($y, $Σ)
Out[25]:
BenchmarkTools.Trial: 
  memory estimate:  15.78 MiB
  allocs estimate:  10
  --------------
  minimum time:     33.638 ms (0.00% GC)
  median time:      36.739 ms (0.00% GC)
  mean time:        38.293 ms (2.76% GC)
  maximum time:     56.384 ms (5.95% GC)
  --------------
  samples:          131
  evals/sample:     1
In [26]:
@benchmark logpdf_mvn_2($y, $Σ)
Out[26]:
BenchmarkTools.Trial: 
  memory estimate:  15.27 MiB
  allocs estimate:  8
  --------------
  minimum time:     8.095 ms (0.00% GC)
  median time:      10.639 ms (0.00% GC)
  mean time:        11.505 ms (10.06% GC)
  maximum time:     27.415 ms (26.83% GC)
  --------------
  samples:          435
  evals/sample:     1
In [27]:
@benchmark logpdf_mvn_3($y, $Σc) setup=(copy!(Σc, Σ)) evals=1
Out[27]:
BenchmarkTools.Trial: 
  memory estimate:  7.98 KiB
  allocs estimate:  3
  --------------
  minimum time:     5.614 ms (0.00% GC)
  median time:      5.782 ms (0.00% GC)
  mean time:        6.011 ms (0.00% GC)
  maximum time:     8.993 ms (0.00% GC)
  --------------
  samples:          734
  evals/sample:     1
  • To evaluate same multivariate normal density at many observations $y_1, y_2, \ldots$, we pre-compute the Cholesky decomposition ($n^3/3$ flops), then each evaluation costs $n^2$ flops.

Linear regression

  • Cholesky decomposition is one approach to solve linear regression
    $$ \widehat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}. $$

  • Assume $\mathbf{X} \in \mathbb{R}^{n \times p}$ and $\mathbf{y} \in \mathbb{R}^n$.

    • Compute $\mathbf{X}^T \mathbf{X}$: $np^2$ flops
    • Compute $\mathbf{X}^T \mathbf{y}$: $2np$ flops
    • Cholesky decomposition of $\mathbf{X}^T \mathbf{X}$: $\frac{1}{3} p^3$ flops
    • Solve normal equation $\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}$: $2p^2$ flops
    • If need standard errors, another $(4/3)p^3$ flops
  • Total computational cost is $np^2 + (1/3) p^3$ (without s.e.) or $np^2 + (5/3) p^3$ (with s.e.) flops.

Further reading