Exercise 4.1¶
Prove that the Python function gauss_sample defined in Chapter 4 indeed generates samples of a Gaussian distribution with mean $c$ and covariance $\Sigma$.
Solution: Let's first recall the function from the course notes:
import numpy as np
from numpy.random import default_rng
rng = default_rng(0) # good practice for reproducibility
def gauss_sample(N, c, Sigma):
"""
returns N samples from a multivariate Gaussian
with mean c and covariance matrix Sigma
"""
m = len(c)
L = np.linalg.cholesky(Sigma)
z = rng.normal(size=(m, N))
x = c[:,None] + np.dot(L, z)
return x
Let $L$ be the Cholesky factor such that $\Sigma = L L^T$. We calculate the covariance of a generated vector $x = c + L z$, where $z$ is i.i.d. standard normal:
$$ \mathbb{E}[(x-c)(x-c)^T] = \mathbb{E}[L z z^T L^T] = L \mathbb{E}[z z^T] L^T = L I L^T = \Sigma. $$
Exercise 4.2¶
Recall the definition of a covariance matrix. Implement a version of the NumPy function np.cov yourself (you may call it my_cov, for example). Then go to the NumPy documentation of np.cov (https://numpy.org/doc/stable/reference/generated/numpy.cov.html) and click on the [source] link of that page. After some searching, you should be able to view the Python implementation of np.cov. Can you identify the part where the actual computation happens?
# Solution
import numpy as np
def my_cov(X):
"""
estimates the covariance matrix of the (real-valued) data matrix X
"""
_, N = X.shape
Xc = X - X.mean(axis=1).reshape(-1,1) # mean centering
return np.dot(Xc,Xc.T)/(N-1)
# testing code
N = 1000
c = np.array([4, -2])
Sigma = np.array([[2, 0.5], [0.5, 1]])
X = gauss_sample(N, c, Sigma)
print('covariance matrix estimated from data:\n', my_cov(X))
# Note: this will only work for truly 2D NumPy arrays;
# try my_cov(X[0,:]) to see it fail vs my_cov(X[0:1,:])
covariance matrix estimated from data: [[1.9100027 0.55528846] [0.55528846 1.0743857 ]]
Exercise 4.3¶
Recall Example 4.1 from Chapter 4 that produces a contour plot of the gauss_density.
That code uses a nested for-loop over all entries in the matrices X0 and X1 to compute the corresponding entry in Z. This is inefficient for large matrices for at least two reasons:
- every call of
gauss_densitycomputes the same normalisation constant anew, even though this does not depend onx - the numerical libraries underlying NumPy can perform vector and matrix operations much more efficiently than element-wise operations. Rewriting code to make best use of vector and matrix operations is called vectorization or array programming.
Write a function gauss_density_2d(X0, X1, c, Sigma) that accepts NumPy matrices X0, X1 of 2D point coordinates and returns Z in one go. Can you avoid using any for-loops? Perform a timing comparison of both approaches for evaluating the Gaussian density function.
# Solution
def gauss_density_2d(X0, X1, c, Sigma):
# get distance from c for each X0,X1 pair
X0c = X0 - c[0]
X1c = X1 - c[1]
# matrix vector product inv(Sigma)*[X0c;X1c] using
# formula for inverse of 2x2 matrix (may fail for ill-conditioned Sigma)
det = Sigma[0,0]*Sigma[1,1] - Sigma[1,0]*Sigma[0,1] # det(Sigma)
Sigmainv = [[Sigma[1,1] , -Sigma[0,1]], [-Sigma[1,0] , Sigma[0,0]]]/det
SigmainvX0c = Sigmainv[0,0]*X0c + Sigmainv[0,1]*X1c
SigmainvX1c = Sigmainv[1,0]*X0c + Sigmainv[1,1]*X1c
# inner product of [X0c;X1c] with Sigmainv*[X0c;X1c]
ip = X0c*SigmainvX0c + X1c*SigmainvX1c
m = 2
norm_const = 1 / (np.sqrt((2 * np.pi)**m * det))
return norm_const * np.exp(-0.5 * ip)
import matplotlib.pyplot as plt
from time import time
st = time()
# contour plot of Gaussian density function in 2d
x0 = np.linspace(-2, 10, 100)
x1 = np.linspace(-7, 3, 100)
X0, X1 = np.meshgrid(x0, x1)
Z1 = gauss_density_2d(X0, X1, c, Sigma)
print("evaluating Z took", time()-st, "seconds")
plt.contourf(X0, X1, Z1, levels=[0, 0.01, 0.05, 0.1, 0.2], cmap=plt.cm.Blues);
plt.colorbar();
plt.scatter(X[0,:], X[1,:], s=2, c='r')
plt.axis('tight'), plt.title('Gaussian density'); plt.show();
evaluating Z took 0.0 seconds
Exercise 4.4¶
What is the interpretation of the empirical error $\hat R(h)\approx 0.0016$ in Example 4.2, in terms of individual data points?
Solution: The empirical error using the zero-one loss function corresponds to the fraction of data points that have been misclassified. In this case, about $0.0016\times 10000 = 16$ data points.
Exercise 4.5¶
Try increasing N in the code of Example 4.2. You'll notice that the empirical error converges to a value $\approx 0.002112$. Write down a formula for the precise limiting value.
Solution:
In the code of Example 4.3 we already calculate the (non-empirical) generalisation error but for the $L_2$ loss function. By simply changing from loss_L2 to loss_01 one obtains the desired value.
An alternative approach to arrive at this number, perhaps more intuitively, is to ask for each Gaussian: what is the probability that a data point of class $i$ lies outside (or inside, then subtract from one) the rectangular decision region? This can be written as a double integral. By summing the contributions for all classes, one also arrives at the formula for the generalisation error $$ R(h) := \int_{\mathcal Z} L(y, h(x)) \mathrm{d} \mu_{\mathcal Z}(x,y). $$
Below's a code demonstrating the convergence.
# We first need to copy over some functions and variables from the course notes.
# list of centers and covariance matrices for each class
c_ = { 0: np.array([4, -2]),
1: np.array([10, 3]),
2: np.array([5, 4]) }
Sigma_ = { 0: np.array([[2, 0.5], [0.5, 1]]),
1: np.array([[0.3, 0.2], [0.2, 0.8]]),
2: np.array([[1, 0.5], [0.5, 1]]) }
def multiple_classes(N, c_, Sigma_):
"""
returns N samples of 2d data from multiple classes
"""
nr_classes = len(c_)
n0 = N//nr_classes
X = np.zeros((2, 0))
y = np.zeros(0)
for i in range(nr_classes):
if i == nr_classes-1:
n0 = N - (nr_classes-1)*n0
X = np.hstack((X, gauss_sample(n0, c_[i], Sigma_[i])))
y = np.hstack((y, i*np.ones(n0)))
# important: always shuffle the order of your data points
p = rng.permutation(N)
X = X[:,p]
y = y[p]
return X, y
def h_coord_split(x):
if x[0] > 8:
return 1
if x[1] > 1:
return 2
return 0
loss_01 = lambda y, yp: y != yp
def empirical_err(h, loss, X, y):
s = 0
for n in range(len(y)):
s += loss(y[n], h(X[:,n]))
return s/len(y)
# Convergence of the empirical error to
# the true generalisation error (sqrt(N))
limit = 0.00211238
rng = default_rng(0)
for N in [10, 100, 1000, 10000, 10000, 1000000]:
Xnew, ynew = multiple_classes(N, c_, Sigma_)
val = empirical_err(h_coord_split, loss_01, Xnew, ynew)
print(N, val, np.abs(val - limit))
10 0.0 0.00211238 100 0.0 0.00211238 1000 0.001 0.00111238 10000 0.0021 1.2380000000000203e-05 10000 0.0026 0.0004876199999999998 1000000 0.002096 1.6379999999999867e-05
Exercise 4.6¶
In Section 4.5 we have implemented the Bayes hypothesis h_best_L2 and computed its generalisation error. As discussed there, h_best_L2 does not necessarily attain integer values in $\{0,1,2\}$. Can you come up with a best possible hypothesis $h$ that takes only values in $\{0,1,2\}$? And compute its generalisation error?
Solution: Recall that a best $L^2$ hypothesis $h_\text{best}$ is given as the weighted sum of the class labels, weighted by their probability of appearance. If we only want integer labels, we should return for each $x$ a most likely label, that is, $$ h_\text{best int} := \arg \max_{i=0,1,2} \{ \pi_i(x) \}. $$ If two labels are equally likely with highest probability, it does not matter which one is chosen.