Exercise 7.1¶
Suppose that the distinct plane points $(x_i,y_i)$ for $i=1,\ldots,n$ are to be fit using a linear function without intercept, $\hat{f}(x)=w x$. Use calculus to find a formula for the value of $w$ that minimizes the sum of squared residuals,
$$ L(w) = \sum_{i=1}^n \bigl(y_i - \hat{f}(x_i)\bigr)^2. $$
Solution: Let's take the derivative of $L(w)$ with respect to $w$:
$$ \frac{dL(w)}{dw} = \frac{d}{dw} \sum_{i=1}^n \bigl(y_i - w x_i\bigr)^2 = \sum_{i=1}^n 2 \bigl(y_i - w x_i\bigr)(-x_i). $$
Simplifying and setting $\frac{dL(w)}{dw} = 0$ gives
$$ \frac{dL(w)}{dw} = -2 \sum_{i=1}^n x_i y_i + 2w \sum_{i=1}^n x_i^2 = 0, $$
which we can solve for $w$: $$ w = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}. $$
Exercise 7.2¶
Suppose that $x_1=-2$, $x_2=1$, and $x_3=2$. Define $w$ as in Exercise 7.1, and define the predicted values $\hat{y}_k=w x_k$ for $k=1,2,3$. Express each $\hat{y}_k$ as a combination of the three values $y_1$, $y_2$, and $y_3$, which remain arbitrary. (This is a special case of a general fact about linear regression: each prediction is a linear combination of the training values.)
Solution: Using the formula for $w$ derived in Exercise 7.1 and $\hat{y}_k = w x_k$, we can write
$$ \hat{y}_k = x_k \cdot \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \sum_{i=1}^n \left(x_k \frac{x_i}{\sum_{j=1}^n x_j^2} \right) y_i. $$
Clearly, $\hat{y}_k$ is a linear combination of $y_1,y_2,y_3$ and the coefficients are given by the terms in parenthesis, which you can evaluate for $x_1 = -2$, $x_2 = 1$, $x_3 = 2$.
Exercise 7.3¶
Using the formulas derived in Section 7.1, show that the point $(\bar{x},\bar{y})$ always lies on the linear regression line. (Hint: You only have to show that $\hat{f}(\bar{x}) = \bar{y}$, which can be done without first solving for $a$ and $b$.)
Solution: This follows from the second equation in the system (7.3), which reads
$$ a \left(\sum_{i=1}^n x_i\right) + b n = \sum_{i=1}^n y_i. $$
Dividing both sides by $n$, we see that $\hat{f}(\bar{x}) = \bar{y}$.
Exercise 7.4¶
Prove that minimizing $\| \mathbf{X} \mathbf{w} - \mathbf y\|_2 $ for $\mathbf{w}$ is equivalent to finding a solution to the normal equations $\mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y}$.
What can we say about the set of solutions $\{ \mathbf{w} \}$?
Solution: This is Theorem 5.4 in the semester 1 materials, but here is a slightly more detailed proof. The objective is to minimize the squared Euclidean norm, which we can write in terms of an inner product: $$ L(\mathbf{w}) = \| \mathbf{X} \mathbf{w} - \mathbf{y} \|_2^2 = (\mathbf{X} \mathbf{w} - \mathbf{y})^T (\mathbf{X} \mathbf{w} - \mathbf{y}). $$
Expanding this quadratic form we find $$ L(\mathbf{w}) = \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} - 2 \mathbf{w}^T \mathbf{X}^T \mathbf{y} + \mathbf{y}^T \mathbf{y}. $$
To minimize $L(\mathbf{w})$, we need the gradient $\nabla_{\mathbf{w}} L(\mathbf{w})$. It turns out that $$ \nabla_{\mathbf{w}} \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} = 2 \mathbf{X}^T \mathbf{X} \mathbf{w}, \quad \nabla_{\mathbf{w}} \mathbf{w}^T \mathbf{X}^T \mathbf{y} = \mathbf{X}^T \mathbf{y}. $$
We only verify the first expression. Consider the partial derivative of $\mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w}$ with respect to $w_i$: $$ \partial_{w_i} \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} = \partial_{w_i} \sum_{j=1}^d \sum_{k=1}^d w_j (\mathbf{X}^T \mathbf{X})_{jk} w_k = 2 \sum_{j=1}^d (\mathbf{X}^T \mathbf{X})_{ij} w_j. $$
The right-hand side is two times the $i$-th entry of $\mathbf{X}^T \mathbf{X} \mathbf{w}$, as required.
Hence, overall, we have $$ \nabla_{\mathbf{w}} L(\mathbf{w}) = 2 \mathbf{X}^T \mathbf{X} \mathbf{w} - 2 \mathbf{X}^T \mathbf{y}. $$
At the minimum, the gradient is necessarily zero, resulting in the normal equations.
If $\mathbf{X}^T \mathbf{X}$ is nonsingular, there is a unique solution $\mathbf{w}$ given by $\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$. Students who have taken MATH36001 Matrix Analysis, will recognize that this can be written using the Moore-Penrose pseudoinverse: $\mathbf{w} = \mathbf{X}^\dagger \mathbf{y}$.
If $\mathbf{X}^T \mathbf{X}$ is singular, all minimizing weight vectors are given as $$ \mathbf{w} = \mathbf{w}_P + \mathbf{w}_0, $$ where $\mathbf{w}_P$ is a particular solution to the normal equations and $\mathbf{w}_0$ is any right null vector of $\mathbf{X}$ (i.e., $\mathbf{X}\mathbf{w}_0 = \mathbf{0}$). The set of all such $\mathbf{w}$ is an affine subspace of $\mathbb{R}^d$. The particular solution $\mathbf{w}_P = \mathbf{X}^\dagger \mathbf{y}$ is the unique least squares minimizer of smallest Euclidean norm $\|\mathbf{w}_P \|_2$.
Exercise 7.5¶
Suppose that $$ \begin{split} \mathbf x &= [-2, 0, 1, 3] \\ \mathbf y &= [4, 1, 2, 0]. \end{split} $$
Find the (a) MSE, (b) MAE, and (c) coefficient of determination on this set for the regression function $\hat{f}(x)=1-x$.
Solution: First evaluate the predictions $$ \hat{\mathbf y} = [3, 1, 2, -2], $$ then apply the formulas from Section 7.1.2.
Exercise 7.6¶
Suppose for $d=3$ features you have the $n=4$ sample vectors $$ \mathbf x_1 = [1,0,1], \quad \mathbf x_2 = [-1,2,2],\quad \mathbf x_3=[3,-1,0], \quad \mathbf x_4 = [0,2,-2], $$
and a multilinear regression computes the weight vector $\mathbf w = [2,1,-1]$.
Find (a) the matrix-vector product $\mathbf X\mathbf w$, and (b) the predictions of the regressor on the sample vectors.
Solution: This is basic calculation using the feature matrix $$ \mathbf{X} = \begin{bmatrix} 1 & 0 & 1 \\ -1 & 2 & 2 \\ 3 & -1 & 0 \\ 0 & 2 & -2 \end{bmatrix}. $$
Note that the predictions of the regressor are given by the elements in the vector $\mathbf X\mathbf w$, so (a) and (b) are the same thing.
Exercise 7.7¶
Suppose that values $y_i$ for $i=1,\ldots,n$ are to be fit to 2D sample vectors using a multilinear regression function $\hat{f}(\mathbf x)=w_1 x_1 + w_2 x_2$. Define the sum-of-squared-residuals loss function $$ L(w_1,w_2) = \sum_{i=1}^n \bigl(y_i - \hat{f}(\mathbf x_i)\bigr)^2. $$
Show that by holding $w_1$ constant and taking a derivative with respect to $w_2$, and then holding $w_2$ constant and taking a derivative with respect to $w_1$, at the minimum loss we must have $$ \begin{split} \left(\sum X_{i,1}^2 \right) w_1 + \left(\sum X_{i,1} X_{i,2} \right) w_2 &= \sum X_{i,1}\, y_i, \\ \left(\sum X_{i,1} X_{i,2} \right) w_1 + \left(\sum X_{i,2}^2 \right) w_2 &= \sum X_{i,2} \, y_i, \end{split} $$
where $X_{i,1}$ and $X_{i,2}$ are the entries in the $i$-th row of the feature matrix $\mathbf X$. (In each case above the sum is from $i=1$ to $i=n$.)
Solution: We can write the loss function in terms of the elements of $\mathbf X$ as
$$ L(w_1,w_2) = \sum_{i=1}^n \bigl(y_i - X_{i,1} w_1 - X_{i,2} w_2 \bigr)^2. $$
Now compute the partial derivatives $\partial_{w_1} L$ and $\partial_{w_2} L$ and set them to zero to get the two stated formulas.
Exercise 7.8¶
If we fit the model $\hat{f}(x)=w x$ to the single data point $(2,6)$, then the ridge loss is $$ L(w) = (2w-6)^2 + \alpha w^2, $$
where $\alpha$ is a nonnegative constant. When $\alpha = 0$, it's clear that $w=3$ is the minimizer of $L(w)$.
Show that if $\alpha>0$, then $L'(w)$ is zero at a value of $w$ in the interval $(0,3)$. (This shows that the optimum weight choice decreases in the presence of the regularization penalty.)
Solution: We calculate
$$ L'(w) = 2(2w - 6)w + 2 \alpha w. $$
Now solve the quadratic equation $L'(w) = 4w^2 + (2\alpha - 6)w = 0$ to find the optimum $w$.
Exercise 7.9¶
If we fit the model $\hat{f}(x)=w x$ to the single data point $(2,6)$, then the LASSO loss is $$ L(w) = (2w-6)^2 + \alpha |w|, $$
where $\alpha$ is a nonnegative constant. When $\alpha = 0$, it's clear that $w=3$ is the global minimizer of $L(w)$. Below you will show that the minimizer is less than this if $\alpha > 0$.
(a) Show that if $w < 0$ and $\alpha>0$, then $L'(w)$ can never be zero.
(b) Show that if $w>0$ and $0<\alpha < 24$, then $L'(w)$ has a single root in the interval $(0,3)$.
Solution:
For $w \neq 0$, the derivative of $L$ is $$ L'(w) = 4(2w-6) + \alpha \cdot \text{sgn}(w), $$ where $\text{sgn}(w)$ is the sign function: $$ \text{sgn}(w) = \begin{cases} 1 & \text{if } w > 0, \\ -1 & \text{if } w < 0. \end{cases} $$
(a) For $w < 0$ we have $$ L'(w) = 4(2w-6) - \alpha. $$ For $L'(w)$ to be zero, we would need: $$ w = \frac{24 + \alpha}{8}. $$ However, since $w < 0$ and $\frac{24 + \alpha}{8} > 0$ for $\alpha > 0$, it is impossible for $L'(w)$ to be zero when $w < 0$.
(b) If $w > 0$, $$ L'(w) = 4(2w-6) + \alpha. $$ Setting $L'(w)$ to zero, we obtain $$ w = \frac{24 - \alpha}{8} $$ and so $\alpha < 24$. The root $w = \frac{24 - \alpha}{8}$ clearly lies in the interval $(0,3)$.
Exercise 7.10¶
For each function on two-dimensional vectors, either prove that it is linear or produce a counterexample that shows it cannot be linear.
(a) $\hat{f}(\mathbf x) = x_1 x_2$
(b) $\hat{f}(\mathbf x) = x_2$
(c) $\hat{f}(\mathbf x) = x_1 + x_2 + 1$
Solution:
For (a), note that $\hat{f}(\alpha \mathbf x) = \alpha^2 \hat{f}(\mathbf x)$ which violates the linearity condition for all $\alpha \not\in\{0, 1\}$.
(b) be written as $\hat{f}(\mathbf x) = \mathbf w^T \mathbf x$ with $\mathbf{w}^T = [0,1]$, hence linear.
The function $\hat{f}$ in (c) does not strictly follow the conditions required of a linear function (it's affine), but we could make it a linear function on three-dimensional vectors $\mathbf{x}$ by adding the feature $x_0=1$ and setting $\mathbf{w}^T = [1,1,1]$.
Exercise 7.11¶
Given the data set $(x_i,y_i)=\{(0,-1),(1,1),(2,3),(3,0),(4,3)\}$, find the MAE-based $Q$ score for the following hypothetical decision tree splits.
(a) $x \le 0.5, \qquad$ (b) $x \le 1.5, \qquad$ (c) $x \le 2.5,\qquad$ (d) $x \le 3.5$.
# Solution
# This is similar to Exercise 5.5, but now for regression.
import numpy as np
data = { (0,-1), (1,1), (2,3), (3,0), (4,3) }
def H(S):
""""mean abolute error of a list"""
S = np.array(S)
g = np.mean(np.abs(S - np.median(S)))
return g
for theta in [0.5, 1.5, 2.5, 3.5]:
SL = [ y for (x,y) in data if x<=theta ]
SR = [ y for (x,y) in data if x>theta ]
Q = len(SL)*H(SL) + len(SR)*H(SR)
print("{:2} {:32} {:32} {:3.1f}".format(theta, str(SL), str(SR), Q))
0.5 [-1] [3, 3, 1, 0] 5.0 1.5 [1, -1] [3, 3, 0] 5.0 2.5 [3, 1, -1] [3, 0] 7.0 3.5 [3, 1, -1, 0] [3] 5.0
Exercise 7.12¶
Here are values (labels) on an integer lattice.
Let $\hat{f}(x_1,x_2)$ be the kNN regressor using $k=4$, Euclidean metric, and mean averaging. In each case below, a function $g(t)$ is defined from values of $\hat{f}$ along a vertical or horizontal line. Carefully sketch a plot of $g(t)$ for $-2\le t \le 2$.
(a) $g(t) = \hat{f}(1.2,t)$
(b) $g(t) = \hat{f}(t,-0.75)$
(c) $g(t) = \hat{f}(t,1.6)$
(d) $g(t) = \hat{f}(-0.25,t)$
Exercise 7.13¶
Here are some label values and probabilistic predictions by a logistic regressor:
$$ \begin{split} \mathbf y &= [0,0,1,1], \\ \hat{\mathbf{p}} &= [\tfrac{3}{4},0,1,\tfrac{1}{2}]. \end{split} $$
Using base-2 logarithms, calculate the cross-entropy loss for these predictions.
Solution: Apply formula (7.14) from the lecture notes.
Exercise 7.14¶
Let $\mathbf X=[[-1],[0],[1]]$ and $\mathbf y=[0,1,0]$. This small dataset is to be fit to a probabilistic predictor $\hat{p}(x) = \sigma(w x)$ for scalar weight $w$.
(a) Let $L(w)$ be the cross-entropy loss function using natural logarithms. Show that $$ L'(w) = \frac{e^w-1}{e^w+1}. $$
(b) Explain why part (a) implies that $w=0$ is the global minimizer of the loss $L$.
(c) Using the result of part (b), simplify the optimum predictor function $\hat{p}(x)$.
Solution:
(a) Recall that the cross-entropy loss function is $$ L(w) = -\sum_{i=1}^n \left[ y_i \ln(\hat{p}(x_i)) + (1-y_i) \ln(1-\hat{p}(x_i)) \right], $$ where $\hat{p}(x) = \sigma(w x)$ with the logistic function $\sigma(z) = \dfrac{1}{1+e^{-z}}$.
For the given dataset, the loss becomes: $$ L(w) = -\left[ \ln(1-\sigma(-w)) + \ln(\sigma(0)) + \ln(1-\sigma(w)) \right]. $$
We need to compute the $L'(w)$. Let's only focus on the first term in the square brackets, $$ \frac{d}{dw} \ln(1-\sigma(-w)) = \frac{-\sigma'(-w)}{1-\sigma(-w)} \cdot (-1) = \sigma(-w). $$ Overall, $$ L'(w) = \sigma(-w) - \sigma(w). $$ Using the property $\sigma(-z) = 1-\sigma(z)$, we get: $$ L'(w) = (1-\sigma(w)) - \sigma(w) = 1 - 2\sigma(w) = \frac{e^w - 1}{e^w + 1}. $$
(b) At $w=0$, we have: $$ L'(0) = \frac{e^0 - 1}{e^0 + 1} = \frac{0}{2} = 0. $$
For $w > 0$, $e^w > 1$, so $L'(w) > 0$, indicating that $L(w)$ is increasing. For $w < 0$, $e^w < 1$, so $L'(w) < 0$, indicating that $L(w)$ is decreasing. Thus, $w=0$ is the global minimizer of $L(w)$.
(c) With the optimal $w=0$, the optimal predictor is $$ \hat{p}(x) = \sigma(w x) = \sigma(0) = \frac{1}{2}, $$ for all values of $x$.
Exercise 7.15¶
Let $\mathbf X=[[-1],[1]]$ and $\mathbf y=[0,1]$. This small dataset is fit to a probabilistic predictor $\hat{p}(x) = \sigma(w x)$ for scalar weight $w$. Without regularization, the best fit takes $w\to\infty$, which makes the predictor become infinitely steep at $x=0$. To combat this behavior, let $L$ be the cross-entropy loss function with LASSO penalty, i.e.,
$$ L(w) = -\ln[1-\hat{p}(-1)] - \ln[\hat{p}(1)] + \alpha |w|, $$
for a positive regularization constant $\alpha$.
(a) Show that $L'$ is never zero for $w < 0$.
(b) Show that if $0 <\alpha <2$, then $L'$ has a zero at
$$ w = \ln\left( \frac{2}{\alpha}-1 \right). $$
(c) Show that $w$ from part (b) is a decreasing function of $\alpha$. (Therefore, increasing $\alpha$ makes the predictor less steep as a function of $x$.)
Solution:
(a) For $w < 0$, the derivative of $L(w)$ is given by $$ L'(w) = -\frac{\sigma'(-w)}{1-\sigma(-w)} - \frac{\sigma'(w)}{\sigma(w)} - \alpha. $$ Using the property $\sigma'(w) = \sigma(w)\sigma(-w)$ and $\sigma(-w) = 1-\sigma(w)$, this can be simplified to $$ L'(w) = -2 \sigma(-w) - \alpha. $$ Since $\sigma(-w) >0$ and $\alpha > 0$, $L'(w) \neq 0$.
(b) For $w > 0$, the derivative of $L(w)$ is: $$ L'(w) = -2 \sigma(-w) + \alpha. $$ Setting $L'(w) = 0$, we obtain $$ \frac{-2}{1 + e^w} + \alpha = 0, $$ which we can solve for $w$ to obtain the desired result.
(c) Differentiation yields $$ \frac{dw}{d\alpha} = -\frac{2}{\alpha^2 \left(\frac{2}{\alpha} - 1\right)}. $$ Since $\frac{2}{\alpha} - 1 > 0$ for $0 < \alpha < 2$, it follows that $\frac{dw}{d\alpha} < 0$. Thus, $w$ is a decreasing function of $\alpha$.
Exercise 7.16¶
As for classification, one can ask the question about what would be the Bayes hypothesis for a given dataset $(x_1,y_1),\ldots,(x_n,y_n)$, defined by a function $y = h(x)$ that minimizes some loss over the whole dataset. If the dataset satisfies a functional relationship, that is $x_i = x_j \Rightarrow y_i = y_j$, then the Bayes hypothesis $h$ is simply defined by $h(x_i) := y_i$. This then means that if there are no two features $x_i,x_j$ which are identical, then the Bayes hypothesis achieves zero loss.
Use the below code to load the star_classification dataset and think of an efficient way to check whether there are two features $x_i,x_j$ which are identical ($i\neq j$).
(Hint: It can be done on $O(n\log n)$ operations using sorting.)
import pandas as pd
import numpy as np
astro = pd.read_csv("_datasets/star_classification.csv")
astro.replace(-9999, np.nan, inplace=True)
astro.dropna(inplace=True)
astro.head()
X = astro[["u", "g", "r", "i", "z", "redshift"]].values
y = astro["class"].values
Solution: First, let us look at a very inefficient way to check whether there are two features $x_i,x_j$ which are identical. That would be a nested loop to do all pairwise comparisons, which would take $O(n^2)$ time.
# check if the feature rows in X are unique
# extremely inefficient!
for i in range(X.shape[0]):
for j in range(i + 1, X.shape[0]):
if np.array_equal(X[i], X[j]):
print(f"Duplicate rows found: {i}, {j}")
You could try running this code to see how long it takes. (It will likely be in the order of minutes.)
A much better way is to use an idea called hashing. We will first transform each row of $X$ to a scalar by multiplying it by a random probing vector. Identical rows will then have the same scalar hash value. We can then sort the hash values and check for duplicates in $O(n\log n)$ time.
The below code only takes a few miliseconds to run and confirms that the star_classification features are in functional relationship and hence there exists a Bayes hypothesis with zero loss for classification and regression:
# generate a random vector
v = np.random.rand(X.shape[1])
# compute X*v (random probing or hashing)
probe = np.dot(X, v)
# sort the probe vector and take difference of consecutive elements
d = np.diff(np.sort(probe))
# check if there is a zero difference between consecutive probes
np.any(d==0)
# as the answer is fals
np.False_