versioninfo()
bit = binary digit (a portmanteau coined by statistician John Tukey). byte = 8 bits. The Julia function Base.summarysize shows the amount of memory (in bytes) used by an object.
x = rand(100, 100)
Base.summarysize(x)
The varinfo() function prints all variables in the workspace and their sizes.
varinfo() # similar to Matlab whos()
Plain text files, such as .jl, .r, .c, .cpp, .ipynb, .html, .tex, ..., are kept in ASCII (or a superset of it).
# integers 0, 1, ..., 127 and the corresponding ASCII characters
[0:127 Char.(0:127)]
# integers 128, 129, ..., 255 and corresponding extended ascii character
[128:255 Char.(128:255)]
Unicode encodings UTF-8, UTF-16, and UTF-32 support many more characters, including foreign-language characters; the first 128 code points coincide with ASCII.
UTF-8 is currently the dominant character encoding on the internet.
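As a quick illustration (a minimal sketch), an ASCII character occupies one byte under UTF-8, while a character such as β occupies two:

```julia
s = "aβ"               # 'a' is ASCII, 'β' is U+03B2
@show length(s)        # number of characters: 2
@show sizeof(s)        # number of bytes: 3 (1 for 'a', 2 for 'β')
@show codeunits(s)     # the raw UTF-8 bytes
```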
# \beta-<tab>
β = 0.0
# \beta-<tab>-\hat-<tab>
β̂ = 0.0
Fixed-point number system is a computer model for integers $\mathbb{Z}$.
The number of bits and method of representing negative numbers vary from system to system.
The integer type in R has $M=32$ or 64 bits, determined by the machine word size. Julia has even more integer types: (U)Int8, (U)Int16, (U)Int32, (U)Int64, and (U)Int128; a Signed or Unsigned integer can be $M = 8, 16, 32, 64$, or 128 bits. Using Tom Breloff's Plots.jl and GraphRecipes.jl packages, we can visualize the type tree under Integer.
using GraphRecipes, Plots
#pyplot(size=(800, 600))
gr(size=(600, 400))
theme(:default)
plot(Integer, method=:tree, fontsize=4)
First bit indicates sign: 0 for nonnegative numbers, 1 for negative numbers.
Two's complement representation for negative numbers: the bit pattern of a negative 64-bit integer x is the same as that of the unsigned integer 2^64 + x.
@show typeof(18)
@show bitstring(18)
@show bitstring(-18)
@show bitstring(UInt64(Int128(2)^64 - 18)) == bitstring(-18)
@show bitstring(2 * 18) # shift bits of 18
@show bitstring(2 * -18); # shift bits of -18

Arithmetic (addition, subtraction, multiplication) on integers is exact, except for the possibility of overflow and underflow.
Range of representable integers by $M$-bit signed integer is $[-2^{M-1},2^{M-1}-1]$.
typemin(T) and typemax(T) give the lowest and highest representable numbers of a type T, respectively.
typemin(Int64), typemax(Int64)
for T in [Int8, Int16, Int32, Int64, Int128]
println(T, '\t', typemin(T), '\t', typemax(T))
end
for t in [UInt8, UInt16, UInt32, UInt64, UInt128]
println(t, '\t', typemin(t), '\t', typemax(t))
end
BigInt
The Julia BigInt type is arbitrary precision.
@show typemax(Int128)
@show typemax(Int128) + 1 # modular arithmetic!
@show BigInt(typemax(Int128)) + 1;
R reports NA for integer overflow and underflow.
Julia outputs the result according to modular arithmetic.
@show typemax(Int32)
@show typemax(Int32) + Int32(1); # modular arithmetic!
using RCall
R"""
.Machine$integer.max
"""
R"""
M <- 32
big <- 2^(M-1) - 1
as.integer(big)
"""
R"""
as.integer(big+1)
"""
Floating-point number system is a computer model for real numbers.
Most computer systems adopt the IEEE 754 standard, established in 1985, for floating-point arithmetic.
For the history, see an interview with William Kahan.
In the scientific notation, a real number is represented as $$\pm d_0.d_1d_2 \cdots d_p \times b^e.$$ In computer, the base is $b=2$ and the digits $d_i$ are 0 or 1.
Normalized vs denormalized numbers. For example, decimal number 18 is $$ +1.0010 \times 2^4 \quad (\text{normalized})$$ or, equivalently, $$ +0.1001 \times 2^5 \quad (\text{denormalized}).$$
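As a sketch of this decomposition, Julia's built-in significand and exponent functions recover the normalized representation of a float:

```julia
x = 18.0
@show significand(x)   # 1.125, i.e., binary 1.0010
@show exponent(x)      # 4
@show significand(x) * 2.0^exponent(x)  # recovers 18.0
```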
In the floating-point number system, the computer stores the sign bit, the significand (mantissa), and the exponent.
using GraphRecipes, Plots
#pyplot(size=(800, 600))
gr(size=(600, 400))
theme(:default)
plot(AbstractFloat, method=:tree, fontsize=4)

Double precision (64 bits = 8 bytes) numbers are the dominant data type in scientific computing.
In Julia, Float64 is the type for double precision numbers.
First bit is sign bit.
$p=52$ significant bits.
11 exponent bits: $e_{\max}=1023$, $e_{\min}=-1022$, bias=1023.
$e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers.
range of magnitude: $10^{\pm 308}$ in decimal because $\log_{10} (2^{1023}) \approx 308$.
precision: $- \log_{10}(2^{-52}) \approx 15$ decimal digits.
println("Double precision:")
@show bitstring(Float64(18)) # 18 in double precision
@show bitstring(Float64(-18)); # -18 in double precision

In Julia, Float32 is the type for single precision numbers.
First bit is sign bit.
$p=23$ significant bits.
8 exponent bits: $e_{\max}=127$, $e_{\min}=-126$, bias=127.
$e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers.
range of magnitude: $10^{\pm 38}$ in decimal because $\log_{10} (2^{127}) \approx 38$.
precision: $- \log_{10}(2^{-23}) \approx 7$ decimal digits.
println("Single precision:")
@show bitstring(Float32(18.0)) # 18 in single precision
@show bitstring(Float32(-18.0)); # -18 in single precision

In Julia, Float16 is the type for half precision numbers.
First bit is sign bit.
$p=10$ significant bits.
5 exponent bits: $e_{\max}=15$, $e_{\min}=-14$, bias=15.
$e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers.
range of magnitude: $10^{\pm 4}$ in decimal because $\log_{10} (2^{15}) \approx 4$.
precision: $- \log_{10}(2^{-10}) \approx 3$ decimal digits.
println("Half precision:")
@show bitstring(Float16(18)) # 18 in half precision
@show bitstring(Float16(-18)); # -18 in half precision
@show bitstring(Inf) # Inf in double precision
@show bitstring(-Inf); # -Inf in double precision
Exponent $e_{\max}+1$ plus a nonzero mantissa means NaN. NaN could be produced from 0 / 0, 0 * Inf, ...
By the IEEE standard, NaN ≠ NaN, and multiple distinct NaN bit patterns exist. Test whether a number is NaN with the isnan function.
@show bitstring(0 / 0) # NaN
@show bitstring(0 * Inf); # NaN
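A quick check of these NaN semantics:

```julia
@show NaN == NaN      # false: NaN compares unequal to everything, including itself
@show isnan(0 / 0)    # true
@show isnan(0 * Inf)  # true
```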
@show bitstring(0.0); # 0 in double precision
@show nextfloat(0.0) # next representable number
@show bitstring(nextfloat(0.0)); # denormalized
Rounding is necessary whenever a number has more significand bits than the storage format allows. The IEEE default rounding mode is round to nearest, with ties to even (RoundNearest in Julia). For example, the decimal number 0.1 cannot be represented exactly as a binary floating point number:
$$ 0.1 = 1.10011001... \times 2^{-4} $$
# half precision Float16, ...110(011...) rounds down to 110
@show bitstring(Float16(0.1))
# single precision Float32, ...100(110...) rounds up to 101
@show bitstring(0.1f0)
# double precision Float64, ...001(100..) rounds up to 010
@show bitstring(0.1);
For a number with mantissa ending in ...001(100..., all 0 digits after), it is a tie and is rounded to ...010 to make the last mantissa bit even.
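One well-known consequence of this rounding (a minimal illustration) is that familiar decimal identities can fail:

```julia
@show 0.1 + 0.2                 # 0.30000000000000004
@show 0.1 + 0.2 == 0.3          # false: both sides carry different rounding errors
@show isapprox(0.1 + 0.2, 0.3)  # true: compare with a tolerance instead
```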
Single precision: range $\pm 10^{\pm 38}$ with precision up to 7 decimal digits.
Double precision: range $\pm 10^{\pm 308}$ with precision up to 16 decimal digits.
Floating-point numbers do not occur uniformly over the real number line: each power-of-2 interval contains the same number of representable numbers, so the spacing between neighbors grows with magnitude.
Machine epsilons are the spacings of numbers around 1:
$$\epsilon_{\min}=b^{-p}, \quad \epsilon_{\max} = b^{1-p}.$$

@show eps(Float32) # machine epsilon for a floating point type
@show eps(Float64) # same as eps()
# eps(x) is the spacing after x
@show eps(100.0)
@show eps(0.0) # graceful underflow
# nextfloat(x) and prevfloat(x) give the neighbors of x
@show x = 1.25f0
@show prevfloat(x), x, nextfloat(x)
@show bitstring(prevfloat(x)), bitstring(x), bitstring(nextfloat(x));
.Machine in R contains numerical characteristics of the machine.
R"""
.Machine
"""
Julia provides Float16 (half precision), Float32 (single precision), Float64 (double precision), and BigFloat (arbitrary precision).
For double precision, the range is $\pm 10^{\pm 308}$. In most situations, underflow (magnitude of result less than $10^{-308}$) is preferred over overflow (magnitude of result larger than $10^{308}$). Overflow produces $\pm \infty$, while underflow yields zeros or denormalized numbers.
E.g., the logit link function is
$$p = \frac{\exp (x^T \beta)}{1 + \exp (x^T \beta)} = \frac{1}{1+\exp(- x^T \beta)}.$$
The former expression can easily lead to Inf / Inf = NaN, while the latter expression leads to graceful underflow.
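A sketch of the two expressions at a large linear predictor (the value 800 here is an arbitrary illustration):

```julia
xb = 800.0                          # a large value of x'β (hypothetical)
p_naive  = exp(xb) / (1 + exp(xb))  # exp(800) overflows, giving Inf / Inf
p_stable = 1 / (1 + exp(-xb))       # exp(-800) underflows gracefully to 0
@show p_naive    # NaN
@show p_stable   # 1.0
```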
The floatmin and floatmax functions give the smallest and largest finite positive normal numbers representable by a type, respectively.
for T in [Float16, Float32, Float64]
println(T, '\t', floatmin(T), '\t', floatmax(T), '\t', typemin(T),
'\t', typemax(T), '\t', eps(T))
end
BigFloat in Julia offers arbitrary precision.
@show precision(BigFloat)
@show floatmin(BigFloat)
@show floatmax(BigFloat);
@show BigFloat(π); # default precision for BigFloat is 256 bits
# set precision to 1024 bits
setprecision(BigFloat, 1024) do
@show BigFloat(π)
end;
@show a = 2.0^30
@show b = 2.0^-30
@show a + b == a
a = 1.2345678f0
@show bitstring(a) # rounding
b = 1.2345677f0
@show bitstring(b)
@show a - b # correct result should be 1e-7
Floating-point numbers may violate many algebraic laws we are familiar with, such as the associative and distributive laws. See Homework 1 problems.
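For instance (a minimal sketch), addition of Float64 values is not associative:

```julia
lhs = (0.1 + 0.2) + 0.3
rhs = 0.1 + (0.2 + 0.3)
@show lhs        # 0.6000000000000001
@show rhs        # 0.6
@show lhs == rhs # false
```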
Textbook treatment, e.g., Chapter II.2 of Computational Statistics by James Gentle (2010).
What every computer scientist should know about floating-point arithmetic by David Goldberg (1991).