Due May 14 @ 11:59PM
versioninfo()
People often think linear regression on a dataset with millions of observations is a big data problem. Having now learnt various methods for solving linear regression, we should realize that, with the right choice of algorithm, it is a problem that can be handled by any modest computer.
Download the flight data from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HG7NV7. Do not put data files in Git; you will lose points if you do. For grading purposes (reproducibility), we will assume that the data files are in a subfolder `flights`.
;ls -l flights
Find out how many data points are in each year.
We are interested in how the arrival delay of a flight, `ArrDelay`, depends on the distance traveled (`Distance`), departure delay (`DepDelay`), weekday (`DayOfWeek`), and airline (`UniqueCarrier`).
We want to fit a linear regression `ArrDelay ~ 1 + Distance + DepDelay + DayOfWeek + UniqueCarrier` using data from 1987-2008. Treat `DayOfWeek` as a factor with 7 levels, using dummy coding with 1 (Monday) as the base level. Treat `UniqueCarrier` as a factor with 8 levels: "AA", "AS", "CO", "DL", "NW", "UA", "US", and "WN", using dummy coding with "AA" as the base level.
Will the design matrix $\mathbf{X}$ (in double precision) fit into the memory of your computer?
Assume your computer has limited memory, say only 1GB. Review the Summary of Linear Regression and choose one method in the table to solve the linear regression.
Report the estimated regression coefficients $\widehat \beta$, estimated variance $\widehat \sigma^2 = \sum_i (y_i - \widehat y_i)^2 / (n - p)$, and standard errors of $\widehat \beta$.
Hint: It took my laptop about 10-11 minutes to import data and fit linear regression.
Go to your resume/cv and claim you have experience performing analytics on data with 100 million observations.
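As a rough guide for the memory question above, here is a back-of-envelope sketch. The row count of roughly 1.2 × 10^8 is an assumption (the exact count comes from the data); the 16 columns follow from the model specification: intercept, Distance, DepDelay, 6 DayOfWeek dummies, and 7 UniqueCarrier dummies.
n = 1.2e8          # assumed number of rows after cleaning; verify against your own data
p = 16             # 1 + 1 + 1 + 6 + 7 columns implied by the dummy coding above
n * p * 8 / 2^30   # gigabytes needed to hold X in double precision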
The following code explores the data in 2003 and generates the design matrix and response vector for that year. Feel free to use this code in your solution.
using CodecBzip2, CSV, DataFrames, Distributions, LinearAlgebra,
Serialization, StatsModels, SweepOperator
ENV["COLUMNS"] = 200
# Print first 10 lines of 2003 data.
run(pipeline(`bunzip2 -c flights/2003.csv.bz2`, `head`))
# how many data points in 2003?
open("flights/2003.csv.bz2", "r") do io
countlines(Bzip2DecompressorStream(io))
end
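A minimal sketch for counting the data points in every year, extending the snippet above (runtime is dominated by decompression):
# count lines in each year's compressed file; subtract 1 for the header row
for year in 1987:2008
    nrows = open("flights/$year.csv.bz2", "r") do io
        countlines(Bzip2DecompressorStream(io)) - 1
    end
    println(year, ": ", nrows)
end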
# # figure out which airlines appear in each year of 1987-2008
# airlines = Vector{Vector{String}}(undef, 22)
# @time for year in 1987:2008
# println("year $year")
# filename = "flights/" * string(year) * ".csv.bz2"
# df = open(filename, "r") do io
# CSV.File(
# Bzip2DecompressorStream(io),
# select = ["UniqueCarrier"],
# types = Dict("UniqueCarrier" => String),
# missingstring = "NA"
# ) |> DataFrame
# end
# airlines[year - 1986] = unique(df[!, :UniqueCarrier])
# end
# intersect(airlines...) |> sort
# load 2003 data into DataFrame
@time df = open("flights/2003.csv.bz2", "r") do io
CSV.File(
Bzip2DecompressorStream(io),
select = ["DayOfWeek", "UniqueCarrier", "ArrDelay",
"DepDelay", "Distance"],
types = Dict(
"DayOfWeek" => UInt8,
"UniqueCarrier" => String,
"ArrDelay" => Float64,
"DepDelay" => Float64,
"Distance" => UInt16
),
missingstring = "NA"
) |> DataFrame
end
# how many rows?
size(df, 1)
# drop rows with missing values
dropmissing!(df)
size(df, 1)
# filter out rows not in the airline list
airlines = ["AA", "AS", "CO", "DL", "NW", "UA", "US", "WN"]
filter!(row -> row[:UniqueCarrier] ∈ airlines, df)
size(df, 1)
# model frame for year 2003
mf = ModelFrame(
@formula(ArrDelay ~ 1 + DayOfWeek + Distance + DepDelay + UniqueCarrier),
df,
contrasts = Dict(
:DayOfWeek => StatsModels.DummyCoding(base = 1, levels = UInt8.(1:7)),
:UniqueCarrier => StatsModels.DummyCoding(
base = "AA",
levels = ["AA", "AS", "CO", "DL", "NW", "UA", "US", "WN"]
)
)
)
# generate the covariate matrix X for year 2003
X = modelmatrix(mf)
# generate the response vector Y for year 2003
y = df[!, :ArrDelay]
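One way to stay within 1 GB, sketched here under the assumption that you pick the normal-equations/sweep-operator route from the table (other methods are equally valid), is to fold each year's block [X y] into a small Gram matrix and sweep it after the last year. For the 2003 block generated above:
# running Gram matrix of [X y], accumulated across years; only (p+1) × (p+1) doubles are kept
p = size(X, 2)
G = zeros(p + 1, p + 1)
# fold in the 2003 block; a full solution loops over 1987-2008 (reading a year in chunks
# if the whole year's X does not fit in the 1 GB budget)
G[1:p, 1:p]    .+= X' * X
G[1:p, p + 1]  .+= X' * y
G[p + 1, p + 1] += dot(y, y)
# after the last year, sweep!(G, 1:p) leaves β̂ in G[1:p, end] and the residual sum of
# squares in G[end, end], from which σ̂² and the standard errors follow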
We are going to try different numerical methods learnt in class on the Google PageRank problem.
Let $\mathbf{A} \in \{0,1\}^{n \times n}$ be the connectivity matrix of $n$ web pages with entries $$ a_{ij}= \begin{cases} 1 & \text{if page $i$ links to page $j$} \\ 0 & \text{otherwise} \end{cases}. $$ $r_i = \sum_j a_{ij}$ is the out-degree of page $i$; that is, $r_i$ is the number of links on page $i$. Imagine a random surfer exploring the space of $n$ pages according to the following rules.
- From a page $i$ with $r_i > 0$: with probability $p$, the surfer follows one of the $r_i$ outgoing links, chosen uniformly at random; with probability $1 - p$, the surfer teleports to one of the $n$ pages, chosen uniformly at random.
- From a dangling page $i$ with $r_i = 0$: the surfer teleports to one of the $n$ pages, chosen uniformly at random.
The process defines a Markov chain on the space of $n$ pages. Write the transition matrix $\mathbf{P}$ of the Markov chain as a sparse matrix plus rank 1 matrix.
According to standard Markov chain theory, the (random) position of the surfer converges to the stationary distribution $\mathbf{x} = (x_1,\ldots,x_n)^T$ of the Markov chain. $x_i$ has the natural interpretation of the proportion of time the surfer visits page $i$ in the long run. Therefore $\mathbf{x}$ serves as the page ranks: a higher $x_i$ means page $i$ is visited more often. It is well known that $\mathbf{x}$ is the left eigenvector corresponding to the top eigenvalue 1 of the transition matrix $\mathbf{P}$. That is, $\mathbf{P}^T \mathbf{x} = \mathbf{x}$. Therefore $\mathbf{x}$ can be solved as an eigen-problem. It can also be cast as solving a linear system. Since the row sums of $\mathbf{P}$ are 1, the matrix $\mathbf{I} - \mathbf{P}^T$ is rank deficient, so we can replace the first equation with $\sum_{i=1}^n x_i = 1$.
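Concretely, this is just the previous paragraph written in matrix form: replace the first row of $\mathbf{I} - \mathbf{P}^T$ by a row of ones and the first entry of the zero right-hand side by 1,
$$
\begin{pmatrix}
1 & 1 & \cdots & 1 \\
-p_{12} & 1 - p_{22} & \cdots & -p_{n2} \\
\vdots & \vdots & & \vdots \\
-p_{1n} & -p_{2n} & \cdots & 1 - p_{nn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix},
$$
where $p_{ij}$ denotes the $(i,j)$ entry of $\mathbf{P}$.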
Hint: For iterative solvers, we don't need to replace the 1st equation. We can use the matrix $\mathbf{I} - \mathbf{P}^T$ directly if we start with a vector with all positive entries.
Obtain the connectivity matrix `A` from the `SNAP/web-Google` data in the MatrixDepot package.
using MatrixDepot
md = mdopen("SNAP/web-Google")
# display documentation for the SNAP/web-Google data
mdinfo(md)
# connectivity matrix
A = md.A
Compute summary statistics:
- How much memory does the sparse matrix `A` take? If converted to a `Matrix{Float64}` (don't do it!), how much memory will it take?
- Visualize the sparsity pattern of `A` or of a submatrix, e.g., `A[1:10000, 1:10000]`.

Hint: For plots, you can use the UnicodePlots.jl package (`spy`, `histogram`, etc.), which is fast for large data.
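A minimal sketch of how these summaries might be computed, assuming A is the sparse matrix loaded above and the UnicodePlots package is installed:
using UnicodePlots

n = size(A, 1)
Base.summarysize(A)          # memory (bytes) the sparse matrix actually occupies
8 * n^2 / 2^30               # GiB a dense Matrix{Float64} of the same size would require
spy(A[1:10_000, 1:10_000])   # sparsity pattern of the top-left 10000 × 10000 block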
Consider the following methods to obtain the page ranks of the `SNAP/web-Google` data.
For the LU approach, estimate (1) the memory usage and (2) how long it will take assuming that the LAPACK functions can achieve the theoretical throughput of your computer.
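One rough way to produce these two estimates; the 100 GFLOPS peak throughput below is a placeholder assumption, so substitute a figure appropriate for your own machine:
n = size(A, 1)
8 * n^2 / 2^30               # (1) GiB needed to store a dense n × n Float64 matrix for LU
flops = (2 / 3) * n^3        # leading-order flop count of an LU factorization
flops / 100e9 / 3600         # (2) hours at an assumed peak throughput of 100 GFLOPS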
Set the teleportation parameter at $p = 0.85$. Consider the following methods for solving the PageRank problem.
For iterative methods, we have many choices in Julia. See a list of existing Julia packages for linear solvers at this page. The start-up code below uses the KrylovKit.jl package. You can use other packages if you prefer. Make sure to utilize the special structure of $\mathbf{P}$ (sparse + rank 1) to speed up the matrix-vector multiplication.
Let's implement a type `PageRankImPt` that mimics the matrix $\mathbf{M} = \mathbf{I} - \mathbf{P}^T$. For iterative methods, all we need to provide are methods for evaluating $\mathbf{M} \mathbf{v}$ and $\mathbf{M}^T \mathbf{v}$ for an arbitrary vector $\mathbf{v}$.
using BenchmarkTools, LinearAlgebra, SparseArrays, Revise
# a type for the matrix M = I - P^T in PageRank problem
struct PageRankImPt{TA <: Number, IA <: Integer, T <: AbstractFloat} <: AbstractMatrix{T}
A :: SparseMatrixCSC{TA, IA} # adjacency matrix
telep :: T
# working arrays
# TODO: whatever intermediate arrays you may want to pre-allocate
end
# constructor
function PageRankImPt(A::SparseMatrixCSC, telep::T) where T <: AbstractFloat
n = size(A, 1)
# TODO: initialize and pre-allocate arrays
PageRankImPt(A, telep)
end
LinearAlgebra.issymmetric(::PageRankImPt) = false
Base.size(M::PageRankImPt) = size(M.A)
# TODO: implement this function for evaluating M[i, j]
Base.getindex(M::PageRankImPt, i, j) = M.telep
# overwrite `out` by `(I - Pt) * v`
function LinearAlgebra.mul!(
out :: Vector{T},
M :: PageRankImPt{<:Number, <:Integer, T},
v :: Vector{T}) where T <: AbstractFloat
# TODO: implement mul!(out, M, v)
sleep(1e-2) # wait 10 ms, as if your code took 10 ms to run
return out
end
# overwrite `out` by `(I - P) * v`
function LinearAlgebra.mul!(
out :: Vector{T},
Mt :: Transpose{T, PageRankImPt{TA, IA, T}},
v :: Vector{T}) where {TA<:Number, IA<:Integer, T <: AbstractFloat}
M = Mt.parent
# TODO: implement mul!(out, transpose(M), v)
sleep(1e-2) # wait 10 ms, as if your code took 10 ms to run
out
end
To check correctness, note that $$ \mathbf{M}^T \mathbf{1} = (\mathbf{I} - \mathbf{P}) \mathbf{1} = \mathbf{0}, $$ because each row of $\mathbf{P}$ sums to 1, and $$ \mathbf{M} \mathbf{x} = (\mathbf{I} - \mathbf{P}^T) \mathbf{x} = \mathbf{0} $$ for the stationary distribution $\mathbf{x}$.
Download the solution file `pgrksol.csv.gz`. Do not put this file in your Git; you will lose points if you do. You can add a line `pgrksol.csv.gz` to your `.gitignore` file.
using CodecZlib, DelimitedFiles
isfile("pgrksol.csv.gz") || download("https://raw.githubusercontent.com/ucla-biostat-257-2021spring/ucla-biostat-257-2021spring.github.io/master/hw/hw3/pgrksol.csv.gz")
xsol = open("pgrksol.csv.gz", "r") do io
vec(readdlm(GzipDecompressorStream(io)))
end
You will lose all 35 points (Steps 1 and 2) if the following statements throw an `AssertionError`.
M = PageRankImPt(A, 0.85)
n = size(M, 1)
@assert transpose(M) * ones(n) ≈ zeros(n)
@assert M * xsol ≈ zeros(n)
We want to benchmark the hot functions `mul!` to make sure they are efficient and allocate little memory.
M = PageRankImPt(A, 0.85)
n = size(M, 1)
v, out = ones(n), zeros(n)
bm_mv = @benchmark mul!($out, $M, $v) setup=(fill!(out, 0); fill!(v, 1))
bm_mtv = @benchmark mul!($out, $(transpose(M)), $v) setup=(fill!(out, 0); fill!(v, 1))
You will lose 1 point for each 100 bytes of memory allocation. So the points you will get are
clamp(10 - median(bm_mv).memory / 100, 0, 10) +
clamp(10 - median(bm_mtv).memory / 100, 0, 10)
Hint: My median run times are 30-40 ms and memory allocations are 0 bytes.
Let's first try to solve the PageRank problem by the GMRES method for solving linear equations.
using KrylovKit
# normalize in-degrees to be the start point
x0 = vec(sum(A, dims = 1)) .+ 1.0
x0 ./= sum(x0)
# right hand side
b = zeros(n)
# warm up (compilation)
linsolve(M, b, x0, issymmetric = false, isposdef = false, maxiter = 1)
# time the GMRES solve; linsolve returns the solution vector and convergence info
(x_gmres, info), time_gmres, = @timed linsolve(M, b, x0, issymmetric = false, isposdef = false)
Check correctness. You will lose all 20 points if the following statement throws an `AssertionError`.
@assert norm(x_gmres - xsol) < 1e-8
GMRES should be reasonably fast. The points you'll get are
clamp(20 / time_gmres * 20, 0, 20)
Hint: My runtime is about 7-8 seconds.
Next, let's solve the PageRank problem by the Arnoldi method for solving eigen-problems.
# warm up (compilation)
eigsolve(M, x0, 1, :SR, issymmetric = false, maxiter = 1)
# output is complex eigenvalue/eigenvector
(vals, vecs, info), time_arnoldi, = @timed eigsolve(M, x0, 1, :SR, issymmetric = false)
Check correctness. You will lose all 20 points if the following statement throws an `AssertionError`.
@assert abs(real(vals[1])) < 1e-8
x_arnoldi = abs.(real.(vecs[1]))
x_arnoldi ./= sum(x_arnoldi)
@assert norm(x_arnoldi - xsol) < 1e-8
Arnoldi should be reasonably fast. The points you'll get are
clamp(20 / time_arnoldi * 20, 0, 20)
Hint: My runtime is about 11-12 seconds.
List the top 20 pages you found and their corresponding PageRank score. Do they match the top 20 pages ranked according to in-degrees?
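One possible way to pull out the top 20 pages and compare them with the in-degree ranking (a sketch, assuming x_arnoldi and A from above):
# pages with the 20 largest PageRank scores
top20 = partialsortperm(x_arnoldi, 1:20, rev = true)
hcat(top20, x_arnoldi[top20])            # page index alongside its PageRank score
# top 20 pages by in-degree, for comparison
indeg = vec(sum(A, dims = 1))
top20_indeg = partialsortperm(indeg, 1:20, rev = true)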
Go to your resume/cv and claim you have experience performing analysis on a network of one million nodes.