For these exercises, you may use computer help to work on a problem, but your solution should be self-contained without reference to computer output (unless stated otherwise). Use the Jupyter Notebook itself to typeset your solutions as markdown cells.

Exercise 3.1

The following parts are about the sample set of $n$ values ($n>2$) $$ 0, 0, 0, \ldots, 0, 1000. $$

(That is, there are $n-1$ copies of 0 and one copy of 1000.)

  1. Show that the sample mean is $1000/n$.
  2. Find the sample median when $n$ is odd.
  3. Show that the corrected sample variance $s_{n-1}^2$ is $10^6/n$.
  4. Find the sample z-scores of all the values.
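
Before writing out the algebra for parts 1 and 3, it can help to confirm the claimed formulas numerically for a few values of $n$. The following is a minimal sketch using only Python's standard library; the helper name `spike_sample` is hypothetical and not part of the exercise. It is a sanity check, not a substitute for the written argument.

```python
from statistics import mean, variance

def spike_sample(n, spike=1000):
    """Return n-1 copies of 0 followed by one copy of `spike` (hypothetical helper)."""
    return [0] * (n - 1) + [spike]

for n in (3, 5, 10):
    xs = spike_sample(n)
    # statistics.variance divides by n-1, so it is the corrected sample variance.
    print(n, mean(xs), 1000 / n, variance(xs), 10**6 / n)
```

In each printed row, the second and third entries should agree, as should the fourth and fifth.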

Exercise 3.2

Suppose that the sample values $x_1,\ldots,x_n$ have sample z-scores $z_1,\ldots,z_n$.

  1. Show that $\displaystyle \sum_{i=1}^n z_i = 0.$

  2. Show that $\displaystyle \sum_{i=1}^n z_i^2 = n-1.$
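
Both identities can be checked on an arbitrary sample before proving them. This sketch assumes that the sample z-scores are computed with the corrected standard deviation $s_{n-1}$, which is what makes the second sum equal $n-1$; the function name `zscores` and the sample values are illustrative only.

```python
from statistics import mean, stdev

def zscores(xs):
    """Sample z-scores, assuming the corrected standard deviation s_{n-1}."""
    m, s = mean(xs), stdev(xs)  # stdev divides by n-1
    return [(x - m) / s for x in xs]

sample = [2.0, 3.5, 5.0, 7.25, 11.0]  # arbitrary illustrative values
z = zscores(sample)
print(sum(z))                                   # should be 0 up to rounding
print(sum(v * v for v in z), len(sample) - 1)   # the two numbers should agree
```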

Exercise 3.3

Define 8 points on an ellipse by $x_k=a\cos(\theta_k)$ and $y_k=b\sin(\theta_k)$, where $a$ and $b$ are positive and $$ \theta_1= \frac{\pi}{4},\ \theta_2 = \frac{\pi}{2},\ \theta_3 = \frac{3\pi}{4},\ \ldots,\ \theta_8 = 2\pi. $$ Let $u_1,\ldots,u_8$ and $v_1,\ldots,v_8$ be the z-scores of the $x_k$ and the $y_k$, respectively. Show that the points $(u_k,v_k)$, $k=1,\ldots,8$, all lie on a circle centered at the origin. (By extension, standardizing points into z-scores is sometimes called sphering them.)
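
A quick numerical experiment can make the claim plausible before you prove it. The sketch below picks arbitrary semi-axes (the values of `a` and `b` are assumptions chosen only for illustration), standardizes each coordinate with the corrected standard deviation, and prints the distance of each standardized point from the origin.

```python
import math
from statistics import mean, stdev

def zscores(xs):
    """Sample z-scores, assuming the corrected standard deviation s_{n-1}."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

a, b = 3.0, 0.5  # arbitrary positive semi-axes, chosen for illustration
thetas = [k * math.pi / 4 for k in range(1, 9)]  # pi/4, pi/2, ..., 2*pi

u = zscores([a * math.cos(t) for t in thetas])
v = zscores([b * math.sin(t) for t in thetas])

# Distance of each standardized point (u_k, v_k) from the origin.
radii = [math.hypot(uk, vk) for uk, vk in zip(u, v)]
print(radii)
```

If the claim holds, all eight printed distances agree up to rounding, whatever positive values are chosen for `a` and `b`.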

Exercise 3.4

Given a population of values $x_1,x_2,\ldots,x_n$, define the function $$ r_2(x) = \sum_{i=1}^n (x_i-x)^2. $$

Show using calculus that $r_2$ is minimized at $x=\mu$, the population mean. (The idea is that minimizing $r_2$ is a way to find the "most representative" value for the dataset.)
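
As a sanity check on the claim, one can evaluate $r_2$ at the mean and at a few nearby points. This is only a sketch on made-up data (the list `data` and the offsets are assumptions); it does not replace the calculus argument the exercise asks for.

```python
from statistics import fmean

def r2(x, data):
    """Sum of squared deviations of the data from the candidate value x."""
    return sum((xi - x) ** 2 for xi in data)

data = [1.0, 4.0, 4.5, 7.0, 10.0]  # arbitrary illustrative population
mu = fmean(data)

# Compare r2 at the mean with r2 at points slightly to either side.
offsets = (-1.0, -0.1, -0.01, 0.01, 0.1, 1.0)
for d in offsets:
    print(d, r2(mu + d, data) - r2(mu, data))
```

Every printed difference should be positive, and it should shrink as the offset approaches zero.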

Exercise 3.5

Suppose that $n=2k-1$ and a population has values $x_1,x_2,\ldots,x_{n}$ in sorted order, so that the median is equal to $x_k$. Define the function $$ r_1(x) = \sum_{i=1}^n |x_i - x|. $$

(This function is called the total absolute deviation of $x$ from the population.) Show that $r_1$ has a global minimum at $x=x_k$ by way of the following steps.

  1. Explain why the derivative of $r_1$ is undefined at every $x_i$. Consequently, all of the $x_i$ are critical points of $r_1$.

  2. Determine $r_1'$ within each interval $(-\infty,x_1),\, (x_1,x_2),\, (x_2,x_3),$ and so on. Explain why this shows that there cannot be any additional critical points to consider.

    (Note: you can replace the absolute values with a piecewise definition of $r_1$, where the formula for the pieces changes as you cross over each $x_i$.)

  3. By considering the $r_1'$ values between the $x_i$, explain why it must be that $$ r_1(x_1) > r_1(x_2) > \cdots > r_1(x_k) < r_1(x_{k+1}) < \cdots < r_1(x_n). $$
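
The chain of inequalities in part 3 can be observed on a small example before you justify it in general. The following sketch uses a made-up population of $n=5$ distinct sorted values (so $k=3$); the data are an assumption for illustration only.

```python
def r1(x, data):
    """Total absolute deviation of x from the data."""
    return sum(abs(xi - x) for xi in data)

data = [1, 2, 4, 7, 11]            # sorted, n = 5, so the median is data[2] = 4
values = [r1(xi, data) for xi in data]
print(values)

# Index of the data point where r1 is smallest.
print(values.index(min(values)))
```

The printed values should decrease up to the median and increase afterward.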

Exercise 3.6

This problem is about the dataset $$ 1, 3, 4, 5, 5, 6, 7, 8. $$

  1. Make a table of the values of the ECDF $\hat{F}(t)$ at the values $t=0,1,2,\ldots,10$.

  2. Carefully sketch the ECDF of the dataset over the interval $[0,10]$.

  3. Make a table of counts $c_k$ for the bins $(0,2],(2,4],(4,6],(6,8],(8,10]$.

  4. Sketch a histogram of the dataset using the bins from part 3.

  5. Verify the equation (3.6) in Chapter 3 for the bins from part 3.

Exercise 3.7

Suppose that a distribution has continuous PDF $f(t)$ and CDF $F(t)$, and that $F(a)=0$ and $F(b)=1$. Explain why $$ \int_a^b f(t)\,dt = 1. $$

Exercise 3.8

Suppose that a distribution has PDF $$ f(t) = \begin{cases} 0, & |t| > 1, \\ \tfrac{1}{2}(1+t), & |t| \le 1. \end{cases} $$ Find a formula for its CDF. (Hint: It's a piecewise formula. First find it for $t< -1,$ then for $-1 \le t \le 1,$ and finally for $t>1$.)
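
Once you have a candidate formula, you can compare it against a numerical integral of the PDF. The sketch below approximates the CDF with a midpoint rule; the function names `pdf` and `cdf_numeric` and the step count are assumptions made for illustration, not part of the exercise.

```python
def pdf(t):
    """The PDF from the exercise: (1 + t)/2 on [-1, 1], zero elsewhere."""
    return 0.5 * (1 + t) if abs(t) <= 1 else 0.0

def cdf_numeric(t, steps=10_000):
    """Midpoint-rule approximation of the integral of pdf from -1 to t."""
    if t <= -1:
        return 0.0
    upper = min(t, 1.0)            # the PDF vanishes beyond t = 1
    h = (upper + 1.0) / steps      # width of each subinterval
    return h * sum(pdf(-1.0 + (i + 0.5) * h) for i in range(steps))

for t in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(t, cdf_numeric(t))
```

Evaluate your piecewise formula at the same values of $t$ and compare it with the printed approximations.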

Exercise 3.9

What is the median of the normal distribution whose PDF is given by equation (3.12) in Chapter 3? The answer is probably intuitively clear, but you should make a mathematical argument (though it does not require difficult calculations).

Note: there is no simple antiderivative formula for the PDF, and you do not need it anyway.

Exercise 3.10

This exercise is about the same set of sample values as Exercise 3.1. Suppose the 2σ-outlier criterion is applied using the sample mean and sample variance.

  1. Show that regardless of $n$, the value 0 is never an outlier.

  2. Show that the value 1000 is an outlier if $n \ge 6$.
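
To see where the threshold $n \ge 6$ comes from, it can help to tabulate which values are flagged for small $n$. This sketch assumes the 2σ criterion flags a value when its sample z-score exceeds 2 in absolute value, with the z-score computed from the sample mean and the corrected standard deviation; the function name `flagged` is hypothetical.

```python
from statistics import mean, stdev

def flagged(n, threshold=2.0):
    """Map each distinct value to True if |z-score| > threshold (assumed criterion)."""
    xs = [0] * (n - 1) + [1000]
    m, s = mean(xs), stdev(xs)  # stdev uses the n-1 denominator
    return {x: abs(x - m) / s > threshold for x in set(xs)}

for n in range(3, 11):
    print(n, flagged(n))
```

The table only suggests the pattern; the exercise still asks for an argument that covers every $n$.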

Exercise 3.11

Define a population by $$ x_i = \begin{cases} 1, & 1 \le i \le 11, \\ 2, & 12 \le i \le 14,\\ 4, & 15 \le i \le 22, \\ 6, & 23 \le i \le 32. \end{cases} $$ (That is, there are 11 values of 1, 3 values of 2, 8 values of 4, and 10 values of 6.)

  1. Find the median of the population.

  2. Find the smallest interval containing all non-outlier values according to the 1.5 IQR criterion.

Exercise 3.12

Prove that two sample sets have a Pearson correlation coefficient equal to 1 if they have identical z-scores. (Hint: Use the results of Exercise 3.2.)

Exercise 3.13

Suppose that two sample sets satisfy $y_i=-x_i$ for all $i$. Prove that the Pearson correlation coefficient between the sets equals $-1$.
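
The claims in Exercises 3.12 and 3.13 can both be checked on a small example. The sketch below assumes the Pearson coefficient is computed as the sum of products of sample z-scores divided by $n-1$, which is equivalent to the usual covariance-based formula; the data and the function names are illustrative assumptions.

```python
from statistics import mean, stdev

def zscores(xs):
    """Sample z-scores, assuming the corrected standard deviation s_{n-1}."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

def pearson(xs, ys):
    """Pearson correlation as the sum of z-score products over n-1 (assumed form)."""
    zx, zy = zscores(xs), zscores(ys)
    return sum(p * q for p, q in zip(zx, zy)) / (len(xs) - 1)

x = [1.0, 2.0, 4.0, 7.0, 11.0]           # arbitrary illustrative sample
same_z = [2 * xi + 3 for xi in x]        # has the same z-scores as x
negated = [-xi for xi in x]              # the setting of Exercise 3.13

print(pearson(x, same_z))
print(pearson(x, negated))
```

The first value should be 1 and the second should be $-1$, up to rounding.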

Exercise 3.14

Download and solve the sample test available under Assessments.