Exercise 7.1
Suppose that the distinct plane points $(x_i,y_i)$ for $i=1,\ldots,n$ are to be fit using a linear function without intercept, $\hat{f}(x)=w x$. Use calculus to find a formula for the value of $w$ that minimizes the sum of squared residuals,
$$ L(w) = \sum_{i=1}^n \bigl(y_i - \hat{f}(x_i)\bigr)^2. $$
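A formula obtained from the calculus can be checked numerically. The sketch below is one way to do that: it uses made-up data (the arrays `x` and `y` are illustrative and not from the text) and NumPy's `np.linalg.lstsq` to compute the least-squares slope of the no-intercept model. A correct closed-form answer should agree with `w_ref` up to rounding.

```python
import numpy as np

# Illustrative data (an assumption; any distinct points will do).
x = np.array([-2.0, 1.0, 2.0, 4.0])
y = np.array([1.0, 0.5, 3.0, 5.0])

def loss(w):
    """Sum of squared residuals for the model f(x) = w * x."""
    return np.sum((y - w * x) ** 2)

# Least-squares solution with a single feature column and no intercept.
w_ref = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0]

# The loss should not decrease when w is nudged away from w_ref.
print(w_ref, loss(w_ref), loss(w_ref - 1e-3), loss(w_ref + 1e-3))
```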
Exercise 7.2
Suppose that $x_1=-2$, $x_2=1$, and $x_3=2$. Define $w$ as in Exercise 7.1, and define the predicted values $\hat{y}_k=w x_k$ for $k=1,2,3$. Express each $\hat{y}_k$ as a combination of the three values $y_1$, $y_2$, and $y_3$, which remain arbitrary. (This is a special case of a general fact about linear regression: each prediction is a linear combination of the training values.)
Exercise 7.3
Using the formulas derived in Section 7.1, show that the point $(\bar{x},\bar{y})$ always lies on the linear regression line. (Hint: You only have to show that $\hat{f}(\bar{x}) = \bar{y}$, which can be done without first solving for $a$ and $b$.)
Exercise 7.4
Prove that minimizing $\| \mathbf{X} \mathbf{w} - \mathbf y\|_2$ over $\mathbf{w}$ is equivalent to finding a solution to the normal equations $\mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y}$.
What can we say about the set of solutions $\{ \mathbf{w} \}$?
Exercise 7.5
Suppose that $$ \begin{split} \mathbf x &= [-2, 0, 1, 3] \\ \mathbf y &= [4, 1, 2, 0]. \end{split} $$
Find the (a) MSE, (b) MAE, and (c) coefficient of determination on this set for the regression function $\hat{f}(x)=1-x$.
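A hand computation here can be cross-checked with a few lines of NumPy. The sketch below assumes the usual definitions: MSE and MAE are means over the samples, and the coefficient of determination is $1 - \mathrm{SS}_\text{res}/\mathrm{SS}_\text{tot}$, where $\mathrm{SS}_\text{tot}$ is measured about the mean of $\mathbf y$. If the text uses different conventions, adjust accordingly.

```python
import numpy as np

x = np.array([-2.0, 0.0, 1.0, 3.0])
y = np.array([4.0, 1.0, 2.0, 0.0])

y_hat = 1 - x                    # predictions of f(x) = 1 - x
resid = y - y_hat                # residuals

mse = np.mean(resid ** 2)        # mean squared error
mae = np.mean(np.abs(resid))     # mean absolute error

# Coefficient of determination: 1 - SS_res / SS_tot.
r2 = 1 - np.sum(resid ** 2) / np.sum((y - np.mean(y)) ** 2)

print(mse, mae, r2)
```

Note that the coefficient of determination is not guaranteed to be nonnegative when the regression function was not fit to the data by least squares.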
Exercise 7.6
Suppose for $d=3$ features you have the $n=4$ sample vectors $$ \mathbf x_1 = [1,0,1], \quad \mathbf x_2 = [-1,2,2],\quad \mathbf x_3=[3,-1,0], \quad \mathbf x_4 = [0,2,-2], $$
and a multilinear regression computes the weight vector $\mathbf w = [2,1,-1]$.
Find (a) the matrix-vector product $\mathbf X\mathbf w$, and (b) the predictions of the regressor on the sample vectors.
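The relationship between the two parts can be seen numerically. This sketch assumes that the sample vectors form the rows of $\mathbf X$ and that the regressor has no intercept, so that the prediction on a sample is the inner product of that sample with $\mathbf w$. It computes the product once as a matrix-vector product and once row by row.

```python
import numpy as np

# Rows of X are the sample vectors x_1, ..., x_4.
X = np.array([[1, 0, 1],
              [-1, 2, 2],
              [3, -1, 0],
              [0, 2, -2]])
w = np.array([2, 1, -1])

Xw = X @ w                                          # matrix-vector product
by_rows = np.array([np.dot(row, w) for row in X])   # one inner product per sample

print(Xw)
print(by_rows)
```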
Exercise 7.7
Suppose that values $y_i$ for $i=1,\ldots,n$ are to be fit to 2D sample vectors using a multilinear regression function $\hat{f}(\mathbf x)=w_1 x_1 + w_2 x_2$. Define the sum-of-squared-residuals loss function $$ L(w_1,w_2) = \sum_{i=1}^n \bigl(y_i - \hat{f}(\mathbf x_i)\bigr)^2. $$
Show that by holding $w_1$ constant and taking a derivative with respect to $w_2$, and then holding $w_2$ constant and taking a derivative with respect to $w_1$, at the minimum loss we must have $$ \begin{split} \left(\sum X_{i,1}^2 \right) w_1 + \left(\sum X_{i,1} X_{i,2} \right) w_2 &= \sum X_{i,1}\, y_i, \\ \left(\sum X_{i,1} X_{i,2} \right) w_1 + \left(\sum X_{i,2}^2 \right) w_2 &= \sum X_{i,2} \, y_i, \end{split} $$
where $X_{i,1}$ and $X_{i,2}$ are the entries in the $i$-th row of the feature matrix $\mathbf X$. (In each case above the sum is from $i=1$ to $i=n$.)
Exercise 7.8
If we fit the model $\hat{f}(x)=w x$ to the single data point $(2,6)$, then the ridge loss is $$ L(w) = (2w-6)^2 + \alpha w^2, $$
where $\alpha$ is a nonnegative constant. When $\alpha = 0$, it's clear that $w=3$ is the minimizer of $L(w)$.
Show that if $\alpha>0$, then $L'(w)$ is zero at a value of $w$ in the interval $(0,3)$. (This shows that the optimum weight choice decreases in the presence of the regularization penalty.)
Exercise 7.9
If we fit the model $\hat{f}(x)=w x$ to the single data point $(2,6)$, then the LASSO loss is $$ L(w) = (2w-6)^2 + \alpha |w|, $$
where $\alpha$ is a nonnegative constant. When $\alpha = 0$, it's clear that $w=3$ is the global minimizer of $L(w)$. Below you will show that the minimizer is less than this if $\alpha > 0$.
(a) Show that if $w < 0$ and $\alpha>0$, then $L'(w)$ can never be zero.
(b) Show that if $w>0$ and $0<\alpha < 24$, then $L'(w)$ has a single root in the interval $(0,3)$.
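Before working through the calculus, it can help to see the behavior numerically. The sketch below evaluates the LASSO loss on a fine grid and reports the grid point with the smallest loss for a few sample values of $\alpha$. The grid range and the particular values of $\alpha$ are arbitrary choices made for illustration, and a grid search only approximates the true minimizer.

```python
import numpy as np

def lasso_loss(w, alpha):
    """LASSO loss for the model f(x) = w * x on the single point (2, 6)."""
    return (2 * w - 6) ** 2 + alpha * np.abs(w)

# Grid on [-1, 4] with spacing 1e-4 (an arbitrary illustrative choice).
w_grid = np.linspace(-1, 4, 50001)

minimizers = {}
for alpha in [0.0, 4.0, 12.0, 20.0]:
    minimizers[alpha] = w_grid[np.argmin(lasso_loss(w_grid, alpha))]
    print(alpha, minimizers[alpha])
```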
Exercise 7.10
For each function on two-dimensional vectors, either prove that it is linear or produce a counterexample that shows it cannot be linear.
(a) $\hat{f}(\mathbf x) = x_1 x_2$
(b) $\hat{f}(\mathbf x) = x_2$
(c) $\hat{f}(\mathbf x) = x_1 + x_2 + 1$
Exercise 7.11
Given the data set $\{(x_i,y_i)\}=\{(0,-1),(1,1),(2,3),(3,0),(4,3)\}$, find the MAE-based $Q$ score for each of the following hypothetical decision tree splits.
(a) $x \le 0.5, \qquad$ (b) $x \le 1.5, \qquad$ (c) $x \le 2.5,\qquad$ (d) $x \le 3.5$.
Exercise 7.12
Here are values (labels) on an integer lattice.
Let $\hat{f}(x_1,x_2)$ be the kNN regressor using $k=4$, Euclidean metric, and mean averaging. In each case below, a function $g(t)$ is defined from values of $\hat{f}$ along a vertical or horizontal line. Carefully sketch a plot of $g(t)$ for $-2\le t \le 2$.
(a) $g(t) = \hat{f}(1.2,t)$
(b) $g(t) = \hat{f}(t,-0.75)$
(c) $g(t) = \hat{f}(t,1.6)$
(d) $g(t) = \hat{f}(-0.25,t)$
Exercise 7.13
Here are some label values and probabilistic predictions by a logistic regressor:
$$ \begin{split} \mathbf y &= [0,0,1,1], \\ \hat{\mathbf{p}} &= [\tfrac{3}{4},0,1,\tfrac{1}{2}]. \end{split} $$
Using base-2 logarithms, calculate the cross-entropy loss for these predictions.
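The per-sample contributions can be checked numerically with the sketch below. It picks out, for each sample, the probability that the model assigned to the label that actually occurred, which avoids ever forming a product like $0\cdot\log 0$. Whether the loss is reported as the sum or the mean of these terms depends on the definition being used, so both are printed.

```python
import numpy as np

y = np.array([0, 0, 1, 1])
p_hat = np.array([0.75, 0.0, 1.0, 0.5])

# Probability assigned to the observed label: p_hat if y = 1, else 1 - p_hat.
p_true = np.where(y == 1, p_hat, 1 - p_hat)

# Per-sample cross-entropy terms, using base-2 logarithms.
terms = -np.log2(p_true)

print(terms, terms.sum(), terms.mean())
```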
Exercise 7.14
Let $\mathbf X=[[-1],[0],[1]]$ and $\mathbf y=[0,1,0]$. This small dataset is to be fit to a probabilistic predictor $\hat{p}(x) = \sigma(w x)$ for scalar weight $w$.
(a) Let $L(w)$ be the cross-entropy loss function using natural logarithms. Show that $$ L'(w) = \frac{e^w-1}{e^w+1}. $$
(b) Explain why part (a) implies that $w=0$ is the global minimizer of the loss $L$.
(c) Using the result of part (b), simplify the optimum predictor function $\hat{p}(x)$.
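The derivative formula in part (a) can be sanity-checked with a centered finite difference. This sketch assumes that the cross-entropy loss is the sum (not the mean) of the per-sample terms; with a mean, the derivative would pick up a factor of $1/3$. The step size `h` and the test values of $w$ are arbitrary choices.

```python
import numpy as np

def sigma(t):
    """Logistic function."""
    return 1 / (1 + np.exp(-t))

x = np.array([-1.0, 0.0, 1.0])
y = np.array([0, 1, 0])

def loss(w):
    """Cross-entropy loss (sum over samples, natural logarithm)."""
    p = sigma(w * x)
    return -np.sum(np.where(y == 1, np.log(p), np.log(1 - p)))

def claimed_derivative(w):
    """The formula for L'(w) stated in part (a)."""
    return (np.exp(w) - 1) / (np.exp(w) + 1)

h = 1e-6  # finite-difference step
for w in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    fd = (loss(w + h) - loss(w - h)) / (2 * h)
    print(w, fd, claimed_derivative(w))
```

Agreement of the last two columns at a handful of points is evidence for the formula, not a proof of it.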
Exercise 7.15
Let $\mathbf X=[[-1],[1]]$ and $\mathbf y=[0,1]$. This small dataset is fit to a probabilistic predictor $\hat{p}(x) = \sigma(w x)$ for scalar weight $w$. Without regularization, the best fit takes $w\to\infty$, which makes the predictor become infinitely steep at $x=0$. To combat this behavior, let $L$ be the cross-entropy loss function with LASSO penalty, i.e.,
$$ L(w) = -\ln[1-\hat{p}(-1)] - \ln[\hat{p}(1)] + \alpha |w|, $$
for a positive regularization constant $\alpha$.
(a) Show that $L'$ is never zero for $w < 0$.
(b) Show that if $0 <\alpha <1$, then $L'$ has a zero at the positive value
$$ w = \ln\left( \frac{2}{\alpha}-1 \right). $$
(c) Show that $w$ from part (b) is a decreasing function of $\alpha$. (Therefore, increasing $\alpha$ makes the predictor less steep as a function of $x$.)
Exercise 7.16
As in classification, one can ask what the Bayes hypothesis would be for a given dataset $(x_1,y_1),\ldots,(x_n,y_n)$: a function $y = h(x)$ that minimizes some loss over the whole dataset. If the dataset satisfies a functional relationship, that is, $x_i = x_j \Rightarrow y_i = y_j$, then the Bayes hypothesis $h$ is simply defined by $h(x_i) := y_i$. In particular, if no two feature vectors $x_i,x_j$ with $i\neq j$ are identical, then the Bayes hypothesis achieves zero loss.
Use the code below to load the star_classification dataset, and think of an efficient way to check whether any two feature vectors $x_i,x_j$ with $i\neq j$ are identical.
(Hint: It can be done in $O(n\log n)$ operations using sorting.)
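import pandas as pd
import numpy as np

# Load the data, treat -9999 as a missing value, and drop incomplete rows.
astro = pd.read_csv("_datasets/star_classification.csv")
astro.replace(-9999, np.nan, inplace=True)
astro.dropna(inplace=True)
astro.head()

# Feature matrix (one row per sample) and label vector.
X = astro[["u", "g", "r", "i", "z", "redshift"]].values
y = astro["class"].values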
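One possible reading of the sorting hint is sketched below on a small made-up array, so that it runs without the data file. The helper name `has_duplicate_rows` and the array `X_toy` are hypothetical and not part of the text. The idea is that after the rows are sorted lexicographically, any identical rows must sit next to each other, so a single pass over adjacent pairs suffices. The comparison uses exact floating-point equality.

```python
import numpy as np

def has_duplicate_rows(A):
    """Return True if two rows of the 2D array A are identical."""
    # np.lexsort treats the LAST key as the primary one, so the columns
    # are passed in reverse order to sort rows starting from column 0.
    order = np.lexsort(A.T[::-1])
    A_sorted = A[order]
    # After sorting, identical rows must be adjacent.
    same_as_next = np.all(A_sorted[1:] == A_sorted[:-1], axis=1)
    return bool(np.any(same_as_next))

# Made-up stand-in for the feature matrix X.
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [1.0, 2.0], [0.0, 5.0]])

print(has_duplicate_rows(X_toy))      # True: rows 0 and 2 coincide
print(has_duplicate_rows(X_toy[1:]))  # False: the remaining rows are distinct
```

With the dataset loaded, the same check would be `has_duplicate_rows(X)`. The sort dominates the cost, and the adjacent comparison is a single linear pass.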