Exercise 4.1¶

Prove that the Python function gauss_sample defined in Chapter 4 indeed generates samples of a Gaussian distribution with mean $c$ and covariance $\Sigma$.

Solution: Let's first recall the function from the course notes:

In [1]:
import numpy as np
from numpy.random import default_rng
rng = default_rng(0)    # good practice for reproducibility

def gauss_sample(N, c, Sigma):
    """
    returns N samples from a multivariate Gaussian 
    with mean c and covariance matrix Sigma
    """
    m = len(c)
    L = np.linalg.cholesky(Sigma)
    z = rng.normal(size=(m, N))
    x = c[:,None] + np.dot(L, z)
    return x

Let $L$ be the Cholesky factor such that $\Sigma = L L^T$. We calculate the covariance of a generated vector $x = c + L z$, where $z$ is i.i.d. standard normal:

$$ \mathbb{E}[(x-c)(x-c)^T] = \mathbb{E}[L z z^T L^T] = L \mathbb{E}[z z^T] L^T = L I L^T = \Sigma. $$

Exercise 4.2¶

Recall the definition of a covariance matrix. Implement a version of the NumPy function np.cov yourself (you may call it my_cov, for example). Then go to the NumPy documentation of np.cov (https://numpy.org/doc/stable/reference/generated/numpy.cov.html) and click on the [source] link of that page. After some searching, you should be able to view the Python implementation of np.cov. Can you identify the part where the actual computation happens?

In [2]:
# Solution
import numpy as np

def my_cov(X):
    """
    estimates the covariance matrix of the (real-valued) data matrix X
    """
    _, N = X.shape
    Xc = X - X.mean(axis=1).reshape(-1,1) # mean centering
    return np.dot(Xc,Xc.T)/(N-1)

# testing code
N = 1000
c = np.array([4, -2])
Sigma = np.array([[2, 0.5], [0.5, 1]])
X = gauss_sample(N, c, Sigma)
print('covariance matrix estimated from data:\n', my_cov(X))

# Note: this will only work for truly 2D NumPy arrays; 
# try my_cov(X[0,:]) to see it fail vs my_cov(X[0:1,:])
covariance matrix estimated from data:
 [[1.9100027  0.55528846]
 [0.55528846 1.0743857 ]]

Exercise 4.3¶

Recall Example 4.1 from Chapter 4 that produces a contour plot of the gauss_density.

That code uses a nested for-loop over all entries in the matrices X0 and X1 to compute the corresponding entry in Z. This is inefficient for large matrices for at least two reasons:

  1. every call of gauss_density computes the same normalisation constant anew, even though this does not depend on x
  2. the numerical libraries underlying NumPy can perform vector and matrix operations much more efficiently than element-wise operations. Rewriting code to make best use of vector and matrix operations is called vectorization or array programming.

Write a function gauss_density_2d(X0, X1, c, Sigma) that accepts NumPy matrices X0, X1 of 2D point coordinates and returns Z in one go. Can you avoid using any for-loops? Perform a timing comparison of both approaches for evaluating the Gaussian density function.

In [3]:
# Solution
def gauss_density_2d(X0, X1, c, Sigma):
    # get distance from c for each X0,X1 pair
    X0c = X0 - c[0]  
    X1c = X1 - c[1]
    # matrix vector product inv(Sigma)*[X0c;X1c] using 
    # formula for inverse of 2x2 matrix (may fail for ill-conditioned Sigma)
    det = Sigma[0,0]*Sigma[1,1] - Sigma[1,0]*Sigma[0,1]  # det(Sigma)
    Sigmainv = [[Sigma[1,1] , -Sigma[0,1]], [-Sigma[1,0] , Sigma[0,0]]]/det  
    SigmainvX0c = Sigmainv[0,0]*X0c + Sigmainv[0,1]*X1c
    SigmainvX1c = Sigmainv[1,0]*X0c + Sigmainv[1,1]*X1c
    # inner product of [X0c;X1c] with Sigmainv*[X0c;X1c]
    ip = X0c*SigmainvX0c + X1c*SigmainvX1c
    m = 2
    norm_const = 1 / (np.sqrt((2 * np.pi)**m * det))
    return norm_const * np.exp(-0.5 * ip)

import matplotlib.pyplot as plt
from time import time
st = time()

# contour plot of Gaussian density function in 2d
x0 = np.linspace(-2, 10, 100)
x1 = np.linspace(-7, 3, 100)
X0, X1 = np.meshgrid(x0, x1)

Z1 = gauss_density_2d(X0, X1, c, Sigma)
print("evaluating Z took", time()-st, "seconds")
plt.contourf(X0, X1, Z1, levels=[0, 0.01, 0.05, 0.1, 0.2], cmap=plt.cm.Blues);
plt.colorbar();
plt.scatter(X[0,:], X[1,:], s=2, c='r')
plt.axis('tight'), plt.title('Gaussian density'); plt.show();
evaluating Z took 0.0 seconds
No description has been provided for this image

Exercise 4.4¶

What is the interpretation of the empirical error $\hat R(h)\approx 0.0016$ in Example 4.2, in terms of individual data points?

Solution: The empirical error using the zero-one loss function corresponds to the fraction of data points that have been misclassified. In this case, about $0.0016\times 10000 = 16$ data points.

Exercise 4.5¶

Try increasing N in the code of Example 4.2. You'll notice that the empirical error converges to a value $\approx 0.002112$. Write down a formula for the precise limiting value.

Solution: In the code of Example 4.3 we already calculate the (non-empirical) generalisation error but for the $L_2$ loss function. By simply changing from loss_L2 to loss_01 one obtains the desired value.

An alternative approach to arrive at this number, perhaps more intuitively, is to ask for each Gaussian: what is the probability that a data point of class $i$ lies outside (or inside, then subtract from one) the rectangular decision region? This can be written as a double integral. By summing the contributions for all classes, one also arrives at the formula for the generalisation error $$ R(h) := \int_{\mathcal Z} L(y, h(x)) \mathrm{d} \mu_{\mathcal Z}(x,y). $$

Below's a code demonstrating the convergence.

In [4]:
# We first need to copy over some functions and variables from the course notes.

# list of centers and covariance matrices for each class
c_ = { 0: np.array([4, -2]), 
       1: np.array([10, 3]), 
       2: np.array([5, 4]) }
Sigma_ = { 0: np.array([[2, 0.5], [0.5, 1]]), 
           1: np.array([[0.3, 0.2], [0.2, 0.8]]), 
           2: np.array([[1, 0.5], [0.5, 1]]) }

def multiple_classes(N, c_, Sigma_):
    """
    returns N samples of 2d data from multiple classes
    """
    nr_classes = len(c_)
    n0 = N//nr_classes
    X = np.zeros((2, 0))
    y = np.zeros(0)
    for i in range(nr_classes):
        if i == nr_classes-1:
            n0 = N - (nr_classes-1)*n0
        X = np.hstack((X, gauss_sample(n0, c_[i], Sigma_[i])))
        y = np.hstack((y, i*np.ones(n0)))

    # important: always shuffle the order of your data points
    p = rng.permutation(N)
    X = X[:,p]
    y = y[p]
    return X, y

def h_coord_split(x):
    if x[0] > 8:
        return 1
    if x[1] > 1:
        return 2
    return 0

loss_01 = lambda y, yp: y != yp

def empirical_err(h, loss, X, y):
    s = 0
    for n in range(len(y)):
        s += loss(y[n], h(X[:,n]))
    return s/len(y)
In [5]:
# Convergence of the empirical error to 
# the true generalisation error (sqrt(N))

limit = 0.00211238
rng = default_rng(0) 

for N in [10, 100, 1000, 10000, 10000, 1000000]:
    Xnew, ynew = multiple_classes(N, c_, Sigma_)
    val = empirical_err(h_coord_split, loss_01, Xnew, ynew)
    print(N, val, np.abs(val - limit))
10 0.0 0.00211238
100 0.0 0.00211238
1000 0.001 0.00111238
10000 0.0021 1.2380000000000203e-05
10000 0.0026 0.0004876199999999998
1000000 0.002096 1.6379999999999867e-05

Exercise 4.6¶

In Section 4.5 we have implemented the Bayes hypothesis h_best_L2 and computed its generalisation error. As discussed there, h_best_L2 does not necessarily attain integer values in $\{0,1,2\}$. Can you come up with a best possible hypothesis $h$ that takes only values in $\{0,1,2\}$? And compute its generalisation error?

Solution: Recall that a best $L^2$ hypothesis $h_\text{best}$ is given as the weighted sum of the class labels, weighted by their probability of appearance. If we only want integer labels, we should return for each $x$ a most likely label, that is, $$ h_\text{best int} := \arg \max_{i=0,1,2} \{ \pi_i(x) \}. $$ If two labels are equally likely with highest probability, it does not matter which one is chosen.