Exercise 6.1¶

Write a function median_tree(X, y, max_depth=5, level=0) that takes as inputs an $n \times d$ NumPy array $X$ corresponding to $n$ data points in $d$ dimensions, and a Python list $y$ of labels. The parameter max_depth is optional and takes the value $5$ by default. Using only plain Python without any modules other than NumPy, implement the decision tree fitting process we have learned about in the lectures. The fitted tree is returned by the function median_tree in form of a nested sequence of Python dictionaries, each of which is of the form

tree = { 
    'level': int,
    'subsetX': np.array,
    'subsety': list,
    'gini': float, 
    'split_coord': int,
    'split_value': float,
    'left': tree,
    'right': tree
}

Starting with the full sets subsetX = X and subsety = y at the root node of the tree (the "parent node" at level 0), the splitting of data into tree branches is performed by selecting the coordinate split_coord (an integer in $0,1,\ldots,d-1$) that gives the best total impurity for the two child nodes (left and right) when the data is split at the median value of that coordinate (stored in split_value). The children of a node at depth level have depth level+1 and the splitting stops when either (i) a node has only data points with the same label or (ii) level==max_depth. In this case, assign None to the appropriate values in the dictionary.

Exercise 6.2¶

Write a function median_tree_predict(tree, X) that takes as input a tree structure (nested dictionaries) as returned by median_tree, and an $m\times d$ NumPy array $X$. The function follows the split_coord and split_value information provided by the tree to return a majority-vote prediction for each data point (each row) in X. The return value is a Python list with $m$ elements corresponding to the predicted labels.

Exercise 6.3¶

Using the penguins dataset available from seaborn, perform an experiment like in Example 5.15 using both sklearn's DecisionTreeClassifier and your own median_tree function. Using a 80-20 shuffled train-test split with a fixed random state, produce a table comparing the accuracy of both classifiers for maximal tree depths $1,2,\ldots,5$.

Exercise 6.4¶

We have seen with $k$-nearest neighbour classification that the scale of the features matters, and that standardization can significantly improve the performance of such classifiers. Explain why feature standardization is in general not needed with decision trees.

Exercise 6.5¶

One (dumb) algorithm for binary classification is to always predict the positive outcome.

(a) Considered over all possible training sets, does this method have a high training variance or low training variance? Explain.

(b) Considered over all possible testing sets, does this method have a high testing bias or low testing bias? Explain.

Exercise 6.6¶

Suppose you are trying to fit this ground-truth one-dimensional classifier defined on positive integers:

$$ y = f(m) = \begin{cases} 0, & \text{ if $m$ is odd}, \\ 1, & \text{ if $m$ is even}. \end{cases} $$

Here are three individual classifiers:

$$ \begin{aligned} \hat{f}_a(m) &= 1 \text{ for all $m$}, \\ \hat{f}_b(m) &= \begin{cases} 0, & \text{ if $m = 1,2,5,6,9,10,\dots$}, \\ 1, & \text{ if $m = 3,4,7,8,11,12,\dots$ }, \end{cases} \\ \hat{f}_c(m) &= \begin{cases} 0, & \text{ if $m = 1,4,5,8,9,12,\dots$}, \\ 1, & \text{ if $m = 2,3,6,7,10,11,\dots$}. \end{cases} \end{aligned} $$

Let the testing set be $1,2,3,\ldots,12$.

(a) Compute the accuracy of each classifier on the testing set.

(b) Compute the accuracy of the majority-vote ensemble of the three classifiers on the testing set. That is, the classifier with $\hat{f}(m) = 1$ if at least two of the member classifiers are 1, and $\hat{f}(m) = 0$ otherwise.

Exercise 6.7¶

For data with very high-dimensional feature space, it can be beneficial to apply a technique called dimension reduction before applying any machine learning algorithm. Assuming our data matrix $X$ is of size $n\times d$ ($n$ data points in $d$ dimensions), in linear dimension reduction we aim to find a $d\times \hat d$ matrix $R$ with $\hat d < d$ so that $\hat X := X R$ preserves some essential features of the data (while reducing its dimension).

Research a technique called principal component analysis (PCA). Using only NumPy and no other modules, write a function mypca(X, dhat) that takes as input a data matrix $X$ and a positive integer dhat, and then returns the matrix $R$.

Rerun the experiment from Example 5.25 with the spambase dataset, but reduce the dimension of the feature space from $57$ to $1,2,3,4,5$, respectively. Do this by computing the dimension-reducing matrix $R$ for the training set, train a decision tree on the dimension-reduced training set, and then apply the same transformation to the test data before making predictions. Try if z-normalizing the data improves the performance. If so, is this in contradiction to Exercise 6.4? Write down a summary of your observations.

Exercise 6.8¶

Pulsars are detected by scanning for radio frequencies from deep space. However, the vast majority of collected signals are actually terrestrial noise. The following dataset collects 8 statistical features from radio signals and their classifications as noise (class 0) or pulsar (class 1).

  1. Load the dataset pulsars.csv. Let y be the class column of the dataset and let X be all the remaining columns. Find the proportion of the samples that are actually pulsars.

  2. Using a random state 19716, split the data 80% / 20% into testing and training sets.

  3. Fit a pipeline with standardization scaling and a kNN classifier with $k=8$ to the training data. Since we want to minimize false positives, find its precision score on the test set.

  4. Perform a grid search on the pipeline defined in step 3, including a search over $k$ from 3 through 20 and over 'uniform' and 'distance' for weights, using 6 folds in cross-validation and precision as the scoring. Find the best parameters and find the precision score on the test set.

  5. Using the best model from step 4, make a bagging classifier using 200 estimators, 50% max features and max samples, and random state 302. Find its precision score.

In [1]:
import numpy as np
import pandas as pd
signals = pd.read_csv("_datasets/pulsars.csv")

# TODO: Provide your solution code here.
In [3]:
# Testing
assert X.shape == (17898, 8)
assert y.shape == (17898,)
assert np.isclose(pulsar_fraction, 0.091574478)

assert X_train.shape == (14318, 8)
assert y_test.sum() == 326

assert np.isclose(knn_score, 0.92982, rtol=1e-5)

assert type(best_params) == dict, "Get the best parameters from the fitted model"
assert grid_score > knn_score, "Score should have improved"
assert np.isclose(grid_score, 0.93684, rtol=1e-5)

assert ensemble_score > grid_score, "Score should have improved"
assert np.isclose(ensemble_score, 0.952206, rtol=1e-5)