Exercise 6.1¶

Write a function median_tree(X, y, max_depth=5, level=0) that takes as inputs an $n \times d$ NumPy array $X$ corresponding to $n$ data points in $d$ dimensions, and a Python list $y$ of labels. The parameter max_depth is optional and takes the value $5$ by default. Using only plain Python without any modules other than NumPy, implement the decision tree fitting process we have learned about in the lectures. The fitted tree is returned by the function median_tree in form of a nested sequence of Python dictionaries, each of which is of the form

tree = { 
    'level': int,
    'subsetX': np.array,
    'subsety': list,
    'gini': float, 
    'split_coord': int,
    'split_value': float,
    'left': tree,
    'right': tree
}

Starting with the full sets subsetX = X and subsety = y at the root node of the tree (the "parent node" at level 0), the splitting of data into tree branches is performed by selecting the coordinate split_coord (an integer in $0,1,\ldots,d-1$) that gives the best total impurity for the two child nodes (left and right) when the data is split at the median value of that coordinate (stored in split_value). The children of a node at depth level have depth level+1 and the splitting stops when either (i) a node has only data points with the same label or (ii) level==max_depth. In this case, assign None to the appropriate values in the dictionary.

In [ ]:
# Solution
# There are various ways to construct a tree, but the most straightforward is to use recursion.
# The function `median+tree` will repeatedly call itself to return the left-right branches on the current tree,
# until no more splitting is needed.

import numpy as np

def gini(S):
    """"Gini impurity of a list. See Chapter 5 exercises."""
    g = 0
    unique_el = set(S)
    for el in unique_el:
        p = S.count(el)/len(S)
        g += p*(1-p)
    return g

def median_tree(X, y, max_depth=5, level=0):

    # if y is pure or max_depth reached => no more splitting
    if len(set(y)) == 1 or level == max_depth:
        return {
            'level': level,
            'subsetX': X,
            'subsety': y,
            'gini': gini(y),
            'split_coord': None,
            'split_value': None,
            'left': None,
            'right': None
        }
    
    # otherwise try splitting
    n, d = X.shape
    best_total = float('inf') # total impurity of split
    best_split = None
    
    # test each coordinate and find the best one
    for coord in range(d):
        median = np.median(X[:, coord])
        left_mask = X[:, coord] <= median
        right_mask = X[:, coord] > median
        
        left_y = [y[i] for i in range(n) if left_mask[i]]
        right_y = [y[i] for i in range(n) if right_mask[i]]

        total = len(left_y) * gini(left_y) + len(right_y) * gini(right_y)
        
        if total < best_total:
            best_total = total
            best_split = (coord, median, left_mask, right_mask)
    
    coord, median, left_mask, right_mask = best_split
    left_tree = median_tree(X[left_mask,:], [y[i] for i in range(n) if left_mask[i]], max_depth, level + 1)
    right_tree = median_tree(X[right_mask,:], [y[i] for i in range(n) if right_mask[i]], max_depth, level + 1)
    
    return {
        'level': level,
        'subsetX': X,
        'subsety': y,
        'gini': gini(y),
        'split_coord': coord,
        'split_value': median,
        'left': left_tree,
        'right': right_tree
    }

# Example:
X = np.array([[2, 3], [10, 15], [3, 4], [8, 9], [6, 7]])
y = [0, 1, 0, 1, 0]
y = ['a', 'b', 'a', 'b', 'a']
tree = median_tree(X, y, max_depth=2)
print(tree)
{'level': 0, 'subsetX': array([[ 2,  3],
       [10, 15],
       [ 3,  4],
       [ 8,  9],
       [ 6,  7]]), 'subsety': ['a', 'b', 'a', 'b', 'a'], 'gini': 0.48, 'split_coord': 0, 'split_value': 6.0, 'left': {'level': 1, 'subsetX': array([[2, 3],
       [3, 4],
       [6, 7]]), 'subsety': ['a', 'a', 'a'], 'gini': 0.0, 'split_coord': None, 'split_value': None, 'left': None, 'right': None}, 'right': {'level': 1, 'subsetX': array([[10, 15],
       [ 8,  9]]), 'subsety': ['b', 'b'], 'gini': 0.0, 'split_coord': None, 'split_value': None, 'left': None, 'right': None}}

Exercise 6.2¶

Write a function median_tree_predict(tree, X) that takes as input a tree structure (nested dictionaries) as returned by median_tree, and an $m\times d$ NumPy array $X$. The function follows the split_coord and split_value information provided by the tree to return a majority-vote prediction for each data point (each row) in X. The return value is a Python list with $m$ elements corresponding to the predicted labels.

In [2]:
# Solution
def median_tree_predict(tree, X):
    def predict_single(tree, x):
        if tree['split_coord'] is None:
            return max(set(tree['subsety']), key=tree['subsety'].count)
        if x[tree['split_coord']] <= tree['split_value']:
            return predict_single(tree['left'], x)
        else:
            return predict_single(tree['right'], x)
    
    return [predict_single(tree, x) for x in X]

# Example usage:
yhat = median_tree_predict(tree, X)
print(yhat)
['a', 'b', 'a', 'b', 'a']

Exercise 6.3¶

Using the penguins dataset available from seaborn, perform an experiment like in Example 5.15 using both sklearn's DecisionTreeClassifier and your own median_tree function. Using a 80-20 shuffled train-test split with a fixed random state, produce a table comparing the accuracy of both classifiers for maximal tree depths $1,2,\ldots,5$.

In [6]:
# Solution
import pandas as pd
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

penguins = sns.load_dataset("penguins")
penguins = penguins.dropna()
features = [
  "bill_length_mm",
  "bill_depth_mm",
  "flipper_length_mm",
  "body_mass_g"
]
X = penguins[features]
y = penguins["species"]

X_train, X_test, y_train, y_test = train_test_split(
  X, y,
  test_size=0.2,
  shuffle=True,
  random_state=19716
)

for max_depth in range(1,6):

  print("max_depth =", max_depth)

  dtree = DecisionTreeClassifier(max_depth=max_depth)

  dtree.fit(X_train, y_train)
  yhat = dtree.predict(X_test)
  print(f"   accuracy of DecisionTreeClassifier is {accuracy_score(y_test, yhat):.1%}")

  # same as np.mean(yhat == y_test.values)

  tree = median_tree(X_train.values, list(y_train.values), max_depth=max_depth)
  yhat = median_tree_predict(tree, X_test.values)
  print(f"   accuracy of median_tree classifier is {accuracy_score(y_test, yhat):.1%}")
max_depth = 1
   accuracy of DecisionTreeClassifier is 82.1%
   accuracy of median_tree classifier is 76.1%
max_depth = 2
   accuracy of DecisionTreeClassifier is 92.5%
   accuracy of median_tree classifier is 85.1%
max_depth = 3
   accuracy of DecisionTreeClassifier is 92.5%
   accuracy of median_tree classifier is 82.1%
max_depth = 4
   accuracy of DecisionTreeClassifier is 94.0%
   accuracy of median_tree classifier is 94.0%
max_depth = 5
   accuracy of DecisionTreeClassifier is 95.5%
   accuracy of median_tree classifier is 92.5%

Exercise 6.4¶

We have seen with $k$-nearest neighbour classification that the scale of the features matters, and that standardization can significantly improve the performance of such classifiers. Explain why feature standardization is in general not needed with decision trees.

Solution: In decision trees as we have introduced them here, splitting is based on Gini impurity (which only depends on labels $y$, not the $X$ features) and the coordinates of the data matrix $X$, which are used in isolation. Hence, any scaling of the columns of $X$ will not affect the resulting tree structure.

Exercise 6.5¶

One (dumb) algorithm for binary classification is to always predict the positive outcome.

(a) Considered over all possible training sets, does this method have a high training variance or low training variance? Explain.

(b) Considered over all possible testing sets, does this method have a high testing bias or low testing bias? Explain.

Solution:

(a) The method of always predicting the positive outcome has low training variance. This is because the model does not change regardless of the training data it is given. It always predicts the same outcome (positive), so there is no variability in the model's predictions across different training sets.

(b) The method of always predicting the positive outcome has high testing bias. This is because the model does not take into account the actual features of the data and always predicts the same outcome. As a result, the model will consistently make incorrect predictions for the negative class, leading to a high bias in its predictions. The only exception is when the true labels are heavily skewed towards the positive class. In this case, the classifier has a low bias ``by luck''.

Exercise 6.6¶

Suppose you are trying to fit this ground-truth one-dimensional classifier defined on positive integers:

$$ y = f(m) = \begin{cases} 0, & \text{ if $m$ is odd}, \\ 1, & \text{ if $m$ is even}. \end{cases} $$

Here are three individual classifiers:

$$ \begin{aligned} \hat{f}_a(m) &= 1 \text{ for all $m$}, \\ \hat{f}_b(m) &= \begin{cases} 0, & \text{ if $m = 1,2,5,6,9,10,\dots$}, \\ 1, & \text{ if $m = 3,4,7,8,11,12,\dots$ }, \end{cases} \\ \hat{f}_c(m) &= \begin{cases} 0, & \text{ if $m = 1,4,5,8,9,12,\dots$}, \\ 1, & \text{ if $m = 2,3,6,7,10,11,\dots$}. \end{cases} \end{aligned} $$

Let the testing set be $1,2,3,\ldots,12$.

(a) Compute the accuracy of each classifier on the testing set.

(b) Compute the accuracy of the majority-vote ensemble of the three classifiers on the testing set. That is, the classifier with $\hat{f}(m) = 1$ if at least two of the member classifiers are 1, and $\hat{f}(m) = 0$ otherwise.

Solution:

(a) The testing accuracy of all three individual classifiers $\hat f_a, \hat f_b, \hat f_c$ is $6/12$ (simply count the number of cases they make the right prediction and divide by 12).

(b) It's easiest to make a table for that:

image.png

The ensemble classifier $\hat f$ is correct in $9/12$ of the cases, so the accuracy has increased.

Exercise 6.7¶

For data with very high-dimensional feature space, it can be beneficial to apply a technique called dimension reduction before applying any machine learning algorithm. Assuming our data matrix $X$ is of size $n\times d$ ($n$ data points in $d$ dimensions), in linear dimension reduction we aim to find a $d\times \hat d$ matrix $R$ with $\hat d < d$ so that $\hat X := X R$ preserves some essential features of the data (while reducing its dimension).

Research a technique called principal component analysis (PCA). Using only NumPy and no other modules, write a function mypca(X, dhat) that takes as input a data matrix $X$ and a positive integer dhat, and then returns the matrix $R$.

Rerun the experiment from Example 5.25 with the spambase dataset, but reduce the dimension of the feature space from $57$ to $1,2,3,4,5$, respectively. Do this by computing the dimension-reducing matrix $R$ for the training set, train a decision tree on the dimension-reduced training set, and then apply the same transformation to the test data before making predictions. Try if z-normalizing the data improves the performance. If so, is this in contradiction to Exercise 6.4? Write down a summary of your observations.

In [11]:
# Solution
import numpy as np

def mypca(X, dhat):
    X = X - X.mean(axis=0) # mean center
    U, S, Vh = np.linalg.svd(X, full_matrices=False)
    R = Vh[0:dhat,:].T
    return R

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

spam = pd.read_csv("_datasets/spambase.csv")
X = spam.drop("class", axis=1)
y = spam["class"] == 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2, 
    shuffle=True, random_state=302
    )

dtree = DecisionTreeClassifier(max_depth=7)

for variant in {0, 1}:

    if variant == 0:
        print("without z-normalization:")
    else:
        print("\nwith z-normalization:")
        # z-normalize
        mu = X_train.mean(axis=0)
        sigma = X_train.std(axis=0)
        X_train = (X_train - mu)/sigma
        X_test = (X_test - mu)/sigma   # important: normalise test set with training set mean and std

    for dhat in range(1,6):
        R = mypca(X_train, dhat)
        dtree.fit(X_train@R, y_train)
        yhat = dtree.predict(X_test@R)
        print("   dhat =", dhat, "- accuracy =", (yhat == y_test).mean())
without z-normalization:
   dhat = 1 - accuracy = 0.6699239956568946
   dhat = 2 - accuracy = 0.7339847991313789
   dhat = 3 - accuracy = 0.7328990228013029
   dhat = 4 - accuracy = 0.8306188925081434
   dhat = 5 - accuracy = 0.8610206297502715

with z-normalization:
   dhat = 1 - accuracy = 0.8393051031487514
   dhat = 2 - accuracy = 0.8686210640608035
   dhat = 3 - accuracy = 0.8642779587404995
   dhat = 4 - accuracy = 0.8762214983713354
   dhat = 5 - accuracy = 0.8946796959826275

Solution: Overall, the data appears to be very amenable to dimension reduction, with the decision tree achieving high accuracy on the test set even for relatively small $\hat d$. We also find that z-normalisation improves the accuracy of predictions on the dimension-reduced data significantly. For example, for $\hat d=1$ we obtain an accuracy of about 84% with z-normalisation, and without z-normalization only about 67%. This is not in contradiction to Exercise 6.4, as we perform standarization before dimension reduction. Indeed, PCA is very sensitive to standardization and here it seems to be beneficial to do it. There should be no advantage in doing z-normalization after the PCA and before fitting the tree.

Exercise 6.8¶

Pulsars are detected by scanning for radio frequencies from deep space. However, the vast majority of collected signals are actually terrestrial noise. The following dataset collects 8 statistical features from radio signals and their classifications as noise (class 0) or pulsar (class 1).

  1. Load the dataset pulsars.csv. Let y be the class column of the dataset and let X be all the remaining columns. Find the proportion of the samples that are actually pulsars.

  2. Using a random state 19716, split the data 80% / 20% into testing and training sets.

  3. Fit a pipeline with standardization scaling and a kNN classifier with $k=8$ to the training data. Since we want to minimize false positives, find its precision score on the test set.

  4. Perform a grid search on the pipeline defined in step 3, including a search over $k$ from 3 through 20 and over 'uniform' and 'distance' for weights, using 6 folds in cross-validation and precision as the scoring. Find the best parameters and find the precision score on the test set.

  5. Using the best model from step 4, make a bagging classifier using 200 estimators, 50% max features and max samples, and random state 302. Find its precision score.

In [1]:
import numpy as np
import pandas as pd
signals = pd.read_csv("_datasets/pulsars.csv")

# TODO: Provide your solution code here.
In [2]:
# Solution

# part 1
X = signals.drop(columns=["class"])
y = signals["class"]
pulsar_fraction = y.mean()
print(f"{pulsar_fraction:.1%} of the samples are pulsars.\n")

# part 2
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=19716)

assert X_train.shape == (14318, 8)
assert y_test.sum() == 326

# part 3
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score
from sklearn.neighbors import KNeighborsClassifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=8))
])

pipeline.fit(X_train, y_train)
knn_score = precision_score(y_test, pipeline.predict(X_test))
print(f"Test score of kNN (k=8): {knn_score:.2%}\n")

# part 4
from sklearn.model_selection import GridSearchCV

param_grid = {
    'knn__n_neighbors': np.arange(3, 21),
    'knn__weights': ['uniform', 'distance'],
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=6, scoring='precision', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
grid_score = precision_score(y_test, grid_search.predict(X_test))

print(f"Best parameters: {best_params}")
print(f"Test score with those parameters: {grid_score:.2%}\n")

# part 5
from sklearn.ensemble import BaggingClassifier
ensemble = BaggingClassifier( 
    grid_search.best_estimator_,
    max_samples=0.5,
    max_features=0.5,
    n_estimators=200,
    random_state=302
    )

ensemble.fit(X_train, y_train)
ensemble_score = precision_score(y_test, ensemble.predict(X_test))
print(f"Test score with bagging: {ensemble_score:.2%}")
9.2% of the samples are pulsars.

Test score of kNN (k=8): 92.98%

Best parameters: {'knn__n_neighbors': 14, 'knn__weights': 'uniform'}
Test score with those parameters: 93.68%

Test score with bagging: 95.22%
In [3]:
# Testing
assert X.shape == (17898, 8)
assert y.shape == (17898,)
assert np.isclose(pulsar_fraction, 0.091574478)

assert X_train.shape == (14318, 8)
assert y_test.sum() == 326

assert np.isclose(knn_score, 0.92982, rtol=1e-5)

assert type(best_params) == dict, "Get the best parameters from the fitted model"
assert grid_score > knn_score, "Score should have improved"
assert np.isclose(grid_score, 0.93684, rtol=1e-5)

assert ensemble_score > grid_score, "Score should have improved"
assert np.isclose(ensemble_score, 0.952206, rtol=1e-5)