For these exercises, you may use computer help to work on a problem, but your solution should be self-contained without reference to computer output (unless stated otherwise). Use the Jupyter Notebook itself to typeset your solutions as markdown cells.

Exercise 3.1

The following parts are about the sample set of $n$ values ($n>2$) $$ 0, 0, 0, \ldots, 0, 1000. $$

(That is, there are $n-1$ copies of 0 and one copy of 1000.)

  1. Show that the sample mean is $1000/n$.
  2. Find the sample median when $n$ is odd.
  3. Show that the corrected sample variance $s_{n-1}^2$ is $10^6/n$.
  4. Find the sample z-scores of all the values.

Solution:

  1. The sample mean is $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i = \frac{1}{n} (0+\cdots+0+1000) = 1000/n.$

  2. The sample median for odd $n$ (and, more generally, for all $n\geq 3$) is $0$, because more than half of the sorted values equal $0$.

  3. $s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 = \frac{1}{n-1} \big[\underbrace{ (1000/n)^2 + \cdots + (1000/n)^2 }_{n-1 \text{ terms}} + (1000-1000/n)^2 \big]$

    $\quad =(1000/n)^2 + \frac{(1000n - 1000)^2}{n^2 (n-1)} = (1000/n)^2 + \frac{1000^2 (n-1)^2}{n^2 (n-1)} = 1000^2 / n$

  4. The values $x_i = 0$ become $z_i = (x_i - \bar{x})/s_{n-1} = -(1000/n)/\sqrt{1000^2 / n} = -\frac{1}{\sqrt{n}}.$

    The value $x_n = 1000$ becomes $z_n = (1000-1000/n)/\sqrt{1000^2 / n} = (n-1)/\sqrt{n}$.
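
As a quick numerical sanity check of parts 1 to 4 (not a substitute for the derivations), the sketch below builds the sample for one hypothetical size, $n = 9$, and compares the computed statistics with the formulas above. The choice of $n$ and all variable names are illustrative assumptions.

```python
import math
import statistics

n = 9  # hypothetical sample size, chosen only for illustration
sample = [0.0] * (n - 1) + [1000.0]

mean = statistics.mean(sample)       # should match 1000/n
median = statistics.median(sample)   # should be 0
var = statistics.variance(sample)    # corrected variance (divides by n-1)
s = math.sqrt(var)
z_scores = [(x - mean) / s for x in sample]

print("mean:    ", mean, "vs", 1000 / n)
print("median:  ", median)
print("variance:", var, "vs", 1000**2 / n)
print("z (0):   ", z_scores[0], "vs", -1 / math.sqrt(n))
print("z (1000):", z_scores[-1], "vs", (n - 1) / math.sqrt(n))
```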

Exercise 3.2

Suppose that the samples $x_1,\ldots,x_n$ have the sample z-scores $z_1,\ldots,z_n$.

  1. Show that $\displaystyle \sum_{i=1}^n z_i = 0.$

  2. Show that $\displaystyle \sum_{i=1}^n z_i^2 = n-1.$

Solution:

  1. This has been shown in the course notes, chapter 3, following Theorem 3.2.

  2. We have $$ \sum_{i=1}^n z_i^2 = \sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s}\right)^2 = \frac{1}{s^2} \sum_{i=1}^n (x_i - \bar{x})^2. $$ By definition, the sample variance $s^2$ is given by $$ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2, $$ so that $\sum_{i=1}^n (x_i - \bar{x})^2 = (n-1)s^2$ and therefore $\sum_{i=1}^n z_i^2 = n-1$.
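
Both identities can be spot-checked numerically. The sketch below uses an arbitrary, randomly generated sample (the seed, size, and distribution are assumptions made only for illustration) and the $1/(n-1)$ definition of the sample standard deviation.

```python
import random
import statistics

random.seed(0)
x = [random.gauss(10, 3) for _ in range(25)]  # arbitrary illustrative sample
n = len(x)

xbar = statistics.mean(x)
s = statistics.stdev(x)  # sample standard deviation, 1/(n-1) definition
z = [(xi - xbar) / s for xi in x]

sum_z = sum(z)
sum_z2 = sum(zi**2 for zi in z)
print("sum of z-scores:        ", sum_z)             # zero up to rounding
print("sum of squared z-scores:", sum_z2, "vs", n - 1)
```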

Exercise 3.3

Define 8 points on an ellipse by $x_k=a\cos(\theta_k)$ and $y_k=b\sin(\theta_k)$, where $a$ and $b$ are positive and $$ \theta_1= \frac{\pi}{4}, \theta_2 = \frac{\pi}{2}, \theta_3 = \frac{3\pi}{4}, \ldots, \theta_8 = 2\pi. $$ Let $u_1,\ldots,u_8$ and $v_1,\ldots,v_8$ be the z-scores of the $x_k$ and the $y_k$, respectively. Show that the points $(u_k,v_k)$, $k=1,\ldots,8$, all lie on a circle centered at the origin. (By extension, standardizing points into z-scores is sometimes called sphering them.)

Solution: First, we calculate the means $$ \bar{x} = \frac{1}{8} \sum_{k=1}^8 x_k = \frac{1}{8} \sum_{k=1}^8 a \cos(\theta_k) $$

$$ \bar{y} = \frac{1}{8} \sum_{k=1}^8 y_k = \frac{1}{8} \sum_{k=1}^8 b \sin(\theta_k) $$

Since the angles $\theta_k = k\pi/4$ are equally spaced over one full period, the cosines and the sines each sum to zero, and so $\bar{x} = \bar{y} = 0$.

Next, we calculate the sample variances $$ s_x^2 = {\frac{1}{8} \sum_{k=1}^8 (x_k - \bar{x})^2} = {\frac{1}{8} \sum_{k=1}^8 (a \cos(\theta_k))^2} = {\frac{a^2}{8} \sum_{k=1}^8 \cos^2(\theta_k)} $$

$$ s_y^2 = {\frac{1}{8} \sum_{k=1}^8 (y_k - \bar{y})^2} = {\frac{1}{8} \sum_{k=1}^8 (b \sin(\theta_k))^2} = {\frac{b^2}{8} \sum_{k=1}^8 \sin^2(\theta_k)}. $$

(One can also use the unbiased variants with just a small modification.)

Since $\cos^2(\theta) + \sin^2(\theta) = 1$, the two sums of squares add up to 8, and because the angles are equally spaced over one full period, the two sums are equal. Hence $$ \sum_{k=1}^8 \cos^2(\theta_k) = \sum_{k=1}^8 \sin^2(\theta_k) = 4. $$

Thus, $$ s_x = \frac{a}{\sqrt{2}}, \qquad s_y = \frac{b}{\sqrt{2}}. $$

We can now calculate the z-scores:

$$ u_k = \frac{x_k - \bar{x}}{s_x} = \frac{a \cos(\theta_k)}{a / \sqrt{2}} = \sqrt{2} \cos(\theta_k) $$

$$ v_k = \frac{y_k - \bar{y}}{s_y} = \frac{b \sin(\theta_k)}{b / \sqrt{2}} = \sqrt{2} \sin(\theta_k) $$

Thus, the points $(u_k, v_k)$ all lie on a circle of radius $\sqrt{2}$ centered at the origin.
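
The result can be checked numerically for particular semi-axes. In this sketch the values of `a` and `b` are arbitrary assumptions, and the z-scores use the $1/n$ variance, as in the solution above; every computed radius should come out as $\sqrt{2}$ up to rounding.

```python
import math
import statistics

a, b = 3.0, 0.5  # arbitrary positive semi-axes, for illustration only
thetas = [k * math.pi / 4 for k in range(1, 9)]
x = [a * math.cos(t) for t in thetas]
y = [b * math.sin(t) for t in thetas]

def z_scores(values):
    """Standardize using the 1/n (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

u, v = z_scores(x), z_scores(y)
radii = [math.hypot(uk, vk) for uk, vk in zip(u, v)]
print(radii)
print("sqrt(2) =", math.sqrt(2))
```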

Exercise 3.4

Given a population of values $x_1,x_2,\ldots,x_n$, define the function $$ r_2(x) = \sum_{i=1}^n (x_i-x)^2. $$

Show using calculus that $r_2$ is minimized at $x=\mu$, the population mean. (The idea is that minimizing $r_2$ is a way to find the "most representative" value for the dataset.)

Solution: Consider the derivative

$$ r_2'(x) = - \sum_{i=1}^n 2 (x_i - x) = 2 n x - 2\sum_{i=1}^n x_i, $$

whose only root is at $x = \frac{1}{n} \sum_{i=1}^n x_i = \mu$. Since $r_2''(x) = 2 n > 0$ for every $x$, the function $r_2$ is convex, and this critical point is its global minimum.
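
A brute-force check of this conclusion is sketched below: it evaluates $r_2$ on a fine grid and compares the grid minimizer with the mean. The data values and the grid are hypothetical choices made only for illustration.

```python
import statistics

values = [2.0, 3.5, 7.0, 8.0, 11.5]  # arbitrary illustrative population
mu = statistics.fmean(values)

def r2(x):
    """Sum of squared deviations of the data from x."""
    return sum((xi - x) ** 2 for xi in values)

# Evaluate r2 on a grid over [0, 15] with spacing 0.01 and pick the minimizer.
grid = [i / 100 for i in range(0, 1501)]
best = min(grid, key=r2)
print("population mean:", mu)
print("grid minimizer: ", best)
```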

Exercise 3.5

Suppose that $n=2k-1$ and a population has values $x_1,x_2,\ldots,x_{n}$ in sorted order, so that the median is equal to $x_k$. Define the function $$ r_1(x) = \sum_{i=1}^n |x_i - x|. $$

(This function is called the total absolute deviation of $x$ from the population.) Show that $r_1$ has a global minimum at $x=x_k$ by way of the following steps.

  1. Explain why the derivative of $r_1$ is undefined at every $x_i$. Consequently, all of the $x_i$ are critical points of $r_1$.

  2. Determine $r_1'$ within each interval $(-\infty,x_1),\, (x_1,x_2),\, (x_2,x_3),$ and so on. Explain why this shows that there cannot be any additional critical points to consider.

    (Note: you can replace the absolute values with a piecewise definition of $r_1$, where the formula for the pieces changes as you cross over each $x_i$.)

  3. By considering the $r_1'$ values between the $x_i$, explain why it must be that $$ r_1(x_1) > r_1(x_2) > \cdots > r_1(x_k) < r_1(x_{k+1}) < \cdots < r_1(x_n). $$

Solution:

  1. The derivative of $r_1$ is undefined at every $x_i$ because the absolute value term $|x_i - x|$ has a corner at $x = x_i$: its slope jumps from $-1$ to $+1$ there. Every term's slope can only jump upward, so these jumps cannot cancel in the sum, and $r_1$ also has a corner at each $x_i$. Therefore, all of the $x_i$ are critical points of $r_1$.

  2. To determine $r_1'$ within each interval, we consider the piecewise definition of $r_1$. For $x$ in the interval $(x_{i-1}, x_i)$, the function $r_1$ can be written as: $$ r_1(x) = \sum_{j=1}^{i-1} (x - x_j) + \sum_{j=i}^n (x_j - x). $$ Taking the derivative within this interval, we get: $$ r_1'(x) = \sum_{j=1}^{i-1} 1 - \sum_{j=i}^n 1 = (i-1) - (n-i+1) = 2i - n - 2. $$ Since $r_1'$ is constant within each interval $(x_{i-1}, x_i)$, there cannot be any additional critical points to consider.

  3. By considering the values of $r_1'$ between the $x_i$, we observe that:

    • For $x < x_1$, $r_1'(x) = - n < 0$.
    • For $x \in (x_1, x_2)$, $r_1'(x) = -n + 2 < 0$.
    • For $x \in (x_2, x_3)$, $r_1'(x) = -n + 4 < 0$.
    • $\ldots$
    • For $x \in (x_{k-1}, x_k)$, $r_1'(x) = 2 k - n - 2 = -1 < 0$.
    • For $x \in (x_k, x_{k+1})$, $r_1'(x) = 2(k+1) - n - 2 = 2 k - n = +1 > 0$.
    • $\ldots$
    • For $x > x_n$, $r_1'(x) = +n > 0$.

    Therefore, $r_1(x)$ decreases as $x$ approaches $x_k$ from the left and increases as $x$ moves away from $x_k$ to the right. This implies that: $$ r_1(x_1) > r_1(x_2) > \cdots > r_1(x_k) < r_1(x_{k+1}) < \cdots < r_1(x_n). $$ Hence, $r_1$ has a global minimum at $x = x_k$.
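
The slope formula $r_1'(x) = 2i - n - 2$ on $(x_{i-1}, x_i)$ can be illustrated on a small example. The sketch below assumes a hypothetical population of $n = 5$ distinct values (so $k = 3$). Because $r_1$ is linear between consecutive data points, the difference quotient between neighbors equals the derivative on that interval.

```python
values = sorted([1.0, 2.5, 4.0, 7.0, 12.0])  # hypothetical data, n = 5, k = 3
n = len(values)
k = (n + 1) // 2
median = values[k - 1]

def r1(x):
    """Total absolute deviation of x from the population."""
    return sum(abs(xi - x) for xi in values)

# Slope of r1 on each interval (x_{i-1}, x_i), i = 2, ..., n.
slopes = [(r1(hi) - r1(lo)) / (hi - lo) for lo, hi in zip(values, values[1:])]
predicted = [2 * i - n - 2 for i in range(2, n + 1)]
r1_at_points = [r1(x) for x in values]

print("slopes:   ", slopes)      # -3, -1, 1, 3 for this data
print("predicted:", predicted)
print("r1 at data points:", r1_at_points)
print("minimum attained at the median:", min(values, key=r1) == median)
```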

Exercise 3.6

This problem is about the dataset $$ 1, 3, 4, 5, 5, 6, 7, 8. $$

  1. Make a table of the values of the ECDF $\hat{F}(t)$ at the values $t=0,1,2,\ldots,10$.

  2. Carefully sketch the ECDF of the dataset over the interval $[0,10]$.

  3. Make a table of counts $c_k$ for the bins $(0,2],(2,4],(4,6],(6,8],(8,10]$.

  4. Sketch a histogram of the dataset using the bins from part 3.

  5. Verify the equation (3.6) in Chapter 3 for the bins from part 3.

In [6]:
# Solution
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
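import seaborn as sns
from IPython.display import display  # built into Jupyter; explicit import lets the cell also run as a script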

# Dataset
data = np.array([1, 3, 4, 5, 5, 6, 7, 8])

# 1. Make a table of the values of the ECDF at the values t=0,1,2,...,10
t_values = np.arange(11)
ecdf_values = [(data <= t).sum() / len(data) for t in t_values]
ecdf_table = pd.DataFrame({'t': t_values, 'ECDF': ecdf_values})
print(ecdf_table)

# 2. Sketch the ECDF of the dataset over the interval [0,10]
plt.step(t_values, ecdf_values, where='post')
plt.xlabel('t')
plt.ylabel('ECDF')
plt.title('ECDF of the dataset')
plt.grid(True)
plt.show()

# 3. Make a table of counts
bins = np.arange(0,12,2)
table = pd.DataFrame({'data':data})
cuts = pd.cut(table["data"], bins, right=True)
counts_table = cuts.value_counts().sort_index().reset_index()
display(counts_table)

# 4. Histogram - careful about bin edges
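# Histogram bins are half-open [left, right); shifting the edges slightly to the
# right makes each data value equal to a right edge fall into the bin (left, right].
sns.displot(data, bins=bins + 1e-14);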
     t   ECDF
0    0  0.000
1    1  0.125
2    2  0.125
3    3  0.250
4    4  0.375
5    5  0.625
6    6  0.750
7    7  0.875
8    8  1.000
9    9  1.000
10  10  1.000
[Figure: step plot titled "ECDF of the dataset", with t on the horizontal axis and ECDF on the vertical axis.]
      index  data
0    (0, 2]     1
1    (2, 4]     2
2    (4, 6]     3
3    (6, 8]     2
4   (8, 10]     0
[Figure: histogram of the dataset over the bins (0,2], (2,4], (4,6], (6,8], (8,10].]

Exercise 3.7

Suppose that a distribution has continuous PDF $f(t)$ and CDF $F(t)$, and that $F(a)=0$ and $F(b)=1$. Explain why $$ \int_a^b f(t)\,dt = 1. $$

Solution: Since $f$ is continuous, the CDF $F$ is an antiderivative of $f$, so the Fundamental Theorem of Calculus gives $F(b) = F(a) + \int_a^b f(t)\,dt.$ Given that $F(a) = 0$ and $F(b) = 1$, we can write: $$ \int_a^b f(t)\,dt = F(b) - F(a) = 1 - 0 = 1. $$

Exercise 3.8

Suppose that a distribution has PDF $$ f(t) = \begin{cases} 0, & |t| > 1, \\ \tfrac{1}{2}(1+t), & |t| \le 1. \end{cases} $$ Find a formula for its CDF. (Hint: It's a piecewise formula. First find it for $t< -1,$ then for $-1 \le t \le 1,$ and finally for $t>1$.)

Solution:

We simply apply the definition $$ F(t) = \int_{-\infty}^t f(s) \, ds. $$

For $t < -1$, $F(t) = 0$. For $t \in [-1,1]$, we have $$ F(t) = \int_{-1}^t \frac{1}{2} (1+s)\, ds = \left[ \frac{1}{2} s + \frac{1}{4} s^2 \right]_{-1}^t = \frac{1}{2} t + \frac{1}{4} t^2 + \frac{1}{4}. $$

Finally, for $t\geq 1$ we have $F(t) = F(1) = 1$.

Combining all three cases, we get the piecewise formula for the CDF: $$ F(t) = \begin{cases} 0, & t < -1, \\ \frac{1}{2} t + \frac{1}{4} t^2 + \frac{1}{4}, & -1 \le t \le 1, \\ 1, & t > 1. \end{cases} $$
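
The piecewise formula can be compared against a direct numerical integration of the PDF. The sketch below uses a simple midpoint rule; the lower limit, step count, and test points are assumptions chosen for illustration, and agreement is only expected up to the discretization error.

```python
def pdf(t):
    """PDF from the exercise: (1 + t)/2 on [-1, 1], zero elsewhere."""
    return 0.5 * (1 + t) if abs(t) <= 1 else 0.0

def cdf(t):
    """Piecewise CDF derived in the solution."""
    if t < -1:
        return 0.0
    if t > 1:
        return 1.0
    return 0.5 * t + 0.25 * t**2 + 0.25

def cdf_numeric(t, lower=-2.0, steps=20000):
    """Midpoint-rule approximation of the integral of pdf from lower to t."""
    h = (t - lower) / steps
    return h * sum(pdf(lower + (i + 0.5) * h) for i in range(steps))

test_points = (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5)
for t in test_points:
    print(f"t = {t:5.2f}   formula = {cdf(t):.6f}   numeric = {cdf_numeric(t):.6f}")
```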

Exercise 3.9

What is the median of the normal distribution whose PDF is given by equation (3.12) in Chapter 3? The answer is probably intuitively clear, but you should make a mathematical argument (though it does not require difficult calculations).

Note: there is no simple antiderivative formula for the PDF, and you do not need it anyway.

Solution:

For a normal distribution with mean $\mu$ and standard deviation $\sigma$, the probability density function (PDF) is given by:

$$ f(t) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(t - \mu)^2}{2\sigma^2}\right). $$

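The median $m$ is the value of $t$ for which $F(t) = \int_{-\infty}^t f(s) \, ds = 0.5$. The PDF is symmetric around the mean $\mu$; that is, $f(\mu+s) = f(\mu-s)$. The substitution $s \mapsto 2\mu - s$ therefore gives $\int_{-\infty}^{\mu} f(s)\,ds = \int_{\mu}^{\infty} f(s)\,ds$. These two integrals sum to 1, so each equals 0.5; that is, $F(\mu) = 0.5$. Since $f > 0$ everywhere, $F$ is strictly increasing, so $\mu$ is the only such point and the median is $m = \mu$.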
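
For a numerical illustration, the normal CDF can be written in terms of the error function as $F(t) = \tfrac12\left[1 + \operatorname{erf}\left(\frac{t-\mu}{\sigma\sqrt{2}}\right)\right]$, which Python's standard library exposes as `math.erf`. The sketch below uses arbitrary, assumed values of $\mu$ and $\sigma$ to check that $F(\mu) = 0.5$ and that $F(\mu+s) + F(\mu-s) = 1$, which is the symmetry used in the argument.

```python
import math

def normal_cdf(t, mu, sigma):
    """Normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 1.7, 0.6  # arbitrary illustrative parameters
at_mean = normal_cdf(mu, mu, sigma)
symmetry_sums = [
    normal_cdf(mu + s, mu, sigma) + normal_cdf(mu - s, mu, sigma)
    for s in (0.1, 0.5, 2.0)
]
print("F(mu) =", at_mean)
print("F(mu+s) + F(mu-s):", symmetry_sums)
```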

Exercise 3.10

This exercise is about the same set of sample values as Exercise 3.1. Suppose the 2σ-outlier criterion is applied using the sample mean and sample variance.

  1. Show that regardless of $n$, the value 0 is never an outlier.

  2. Show that the value 1000 is an outlier if $n \ge 6$.

Solution:

  1. We have already calculated the sample mean $\bar{x} = 1000/n$ and the corrected sample variance $s_{n-1}^2 = 1000^2/n$ in Exercise 3.1. The sample standard deviation is therefore $$ s_{n-1} = \sqrt{s_{n-1}^2} = \frac{1000}{\sqrt{n}}. $$ In the 2σ-outlier criterion, a value is considered an outlier if it lies outside the interval $\left( \bar{x} - 2\sigma, \bar{x} + 2\sigma \right)$, where here $\sigma = s_{n-1}$. We have $$ \bar{x} - 2\sigma = \frac{1000}{n} - 2 \cdot \frac{1000}{\sqrt{n}} = \frac{1000}{n} - \frac{2000}{\sqrt{n}} < 0 $$ for any $n \geq 1$ (because $1/n \le 1/\sqrt{n} < 2/\sqrt{n}$), while $\bar{x} + 2\sigma > 0$. Hence the value 0 always lies within the interval and is never an outlier.

  2. We have $$ \bar{x} + 2\sigma = \frac{1000}{n} + 2 \cdot \frac{1000}{\sqrt{n}} = \frac{1000}{n} + \frac{2000}{\sqrt{n}}. $$

    For 1000 to be an outlier, we need: $$ 1000 > \frac{1000}{n} + \frac{2000}{\sqrt{n}}. $$

    Rearranging, we get: $$ n - 2\sqrt{n} - 1 > 0. $$

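    This is a quadratic inequality in $u = \sqrt{n}$. The roots of $u^2 - 2u - 1 = 0$ are $u = 1 \pm \sqrt{2}$, so the inequality holds exactly when $\sqrt{n} > 1 + \sqrt{2}$, that is, when $n > 3 + 2\sqrt{2} \approx 5.83$. Since $n$ is an integer, this means $n \ge 6$. Therefore, the value 1000 is an outlier if $n \ge 6$.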
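
Both parts can be spot-checked by applying the criterion directly to the sample for a range of sizes. In this sketch the range of $n$ and the helper name `is_outlier` are illustrative assumptions; the mean and standard deviation are computed from the data rather than from the formulas above.

```python
import statistics

def is_outlier(value, sample):
    """2-sigma criterion using the sample mean and the 1/(n-1) standard deviation."""
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)
    return abs(value - mean) > 2 * s

results = {}
for n in range(3, 31):
    sample = [0.0] * (n - 1) + [1000.0]
    results[n] = (is_outlier(0.0, sample), is_outlier(1000.0, sample))

smallest_n = min(n for n, (_, flagged) in results.items() if flagged)
print("0 is ever an outlier:", any(zero for zero, _ in results.values()))
print("smallest n for which 1000 is an outlier:", smallest_n)
```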

Exercise 3.11

Define a population by $$ x_i = \begin{cases} 1, & 1 \le i \le 11, \\ 2, & 12 \le i \le 14,\\ 4, & 15 \le i \le 22, \\ 6, & 23 \le i \le 32. \end{cases} $$ (That is, there are 11 values of 1, 3 values of 2, 8 values of 4, and 10 values of 6.)

  1. Find the median of the population.

  2. Find the smallest interval containing all non-outlier values according to the 1.5 IQR criterion.

Solution:

  1. To find the median, we need to determine the middle value of the sorted population. The population has 32 values in total. The median is the average of the 16th and 17th values.

    The sorted population is: $$ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6 $$

    The 16th and 17th values are both 4. Therefore, the median is: $$ \text{Median} = \frac{4 + 4}{2} = 4 $$

  2. First, we need to calculate the first quartile ($Q_{25}$) and the third quartile ($Q_{75}$), which are 1 and 6, respectively.

    The interquartile range is $I = Q_{75} - Q_{25} = 6 - 1 = 5$.

    According to the 1.5 IQR criterion, the lower bound is $$ Q_{25} - 1.5 \times I = 1 - 1.5 \times 5 = 1 - 7.5 = -6.5, $$ while the upper bound is $$ Q_{75} + 1.5 \times I = 6 + 1.5 \times 5 = 6 + 7.5 = 13.5. $$

    Every value in the population lies between these bounds. Hence, there are no outliers, and the smallest interval containing all non-outlier values is $[1,6]$.
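
The computation above can be reproduced with Python's `statistics` module. Quartile conventions differ between sources; the sketch below assumes the `"inclusive"` method of `statistics.quantiles`, which may not be the convention used in the course notes. For this population the neighboring sorted values around each quartile position are equal, so the result does not depend on the interpolation.

```python
import statistics

population = [1] * 11 + [2] * 3 + [4] * 8 + [6] * 10

median = statistics.median(population)
# Assumed quartile convention: "inclusive" interpolation.
q25, _, q75 = statistics.quantiles(population, n=4, method="inclusive")
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr

non_outliers = [x for x in population if lower <= x <= upper]
interval = (min(non_outliers), max(non_outliers))

print("median:", median)
print("Q25, Q75, IQR:", q25, q75, iqr)
print("outlier bounds:", (lower, upper))
print("smallest interval containing all non-outliers:", interval)
```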

Exercise 3.12

Prove that two sample sets have a Pearson correlation coefficient equal to 1 if they have identical z-scores. (Hint: Use the results of Exercise 3.2.)

Solution:

Let the two sample sets be $\{ x_i\}$ and $\{y_i\}$ with identical z-scores $$ z_{i} = \frac{x_i - \bar{x}}{s_x} = \frac{y_i - \bar{y}}{s_y}. $$

By formula (3.16) from the course notes, the Pearson correlation coefficient is $$ r_{xy} = \frac{1}{n-1} \sum_{i=1}^n \left(\frac{x_i-\bar{x}}{s_x}\right) \left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{1}{n-1} \sum_{i=1}^n z_i^2. $$

By Exercise 3.2, $\sum_{i=1}^n z_i^2 = n-1$, and therefore $r_{xy}=1$.
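
As an illustration, two sample sets related by $y_i = c\,x_i + d$ with $c > 0$ have identical z-scores. The sketch below uses hypothetical data with $c = 3$ and $d = 7$ and evaluates the z-score form of the correlation coefficient used above.

```python
import statistics

def z_scores(values):
    """Sample z-scores using the 1/(n-1) standard deviation."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return [(v - mean) / s for v in values]

x = [1.0, 4.0, 2.0, 8.0, 5.0]     # hypothetical sample
y = [3.0 * xi + 7.0 for xi in x]  # positive rescaling plus shift: same z-scores

zx, zy = z_scores(x), z_scores(y)
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)
print("z-scores agree:", all(abs(a - b) < 1e-12 for a, b in zip(zx, zy)))
print("r_xy =", r)
```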

Exercise 3.13

Suppose that two sample sets satisfy $y_i=-x_i$ for all $i$. Prove that the Pearson correlation coefficient between the sets equals $-1$.

Solution: Given that $y_i = -x_i$ for all $i$, we have:

$$ \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i = \frac{1}{n} \sum_{i=1}^n (-x_i) = -\bar{x} $$

Substituting $y_i = -x_i$ and $\bar{y} = -\bar{x}$ into the formula for $r_{xy}$, we get:

$$ r_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(-x_i - (-\bar{x}))}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (-x_i - (-\bar{x}))^2}} $$

Simplifying the numerator:

$$ \sum_{i=1}^n (x_i - \bar{x})(-x_i + \bar{x}) = -\sum_{i=1}^n (x_i - \bar{x})^2 $$

Simplifying the denominator:

$$ \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (-x_i + \bar{x})^2} = \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} = \sum_{i=1}^n (x_i - \bar{x})^2 $$

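Dividing the numerator by the denominator (which is nonzero as long as the $x_i$ are not all equal), we find that the Pearson correlation coefficient is $r_{xy} = -1$.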
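
The same conclusion can be observed numerically. The sketch below implements the ratio-of-sums formula used in this solution as a hypothetical helper `pearson` and applies it to arbitrary illustrative data with $y_i = -x_i$.

```python
import math

def pearson(x, y):
    """Pearson correlation via the ratio-of-sums formula."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    den = math.sqrt(sum((a - xbar) ** 2 for a in x)) * math.sqrt(
        sum((b - ybar) ** 2 for b in y)
    )
    return num / den

x = [1.0, 4.0, 2.0, 8.0, 5.0]  # hypothetical sample
y = [-xi for xi in x]
r = pearson(x, y)
print("r_xy =", r)
```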

Exercise 3.14

Download and solve the sample test available under Assessments.