In [18]:
# Carefully modify the two string variables below. Ensure there are no typos.

student_id = "12345678" # set this to your student ID

student_mail = "firstname.lastname@student.manchester.ac.uk" # your email address

Coursework 2

This coursework test contains several Jupyter Notebook cells with the comment # TODO. This is where you type the code for your solutions. Do not alter any of the other cells.

It is good practice to include markdown cells explaining your work, but in this test they won't be marked.

Here are some tips:

  • Do not alter the names of the predefined variables and functions, such as h_best, astro_scores, etc. The (return) values of these variables and functions will inform the marking. Renaming them or failing to follow the problem description will result in a loss of marks.

  • Ensure that functions return values, not merely print them. Each function should have at least one occurrence of the return keyword, followed by a variable of the type required by the question.

  • Do not hard-code any solution variables. All problems must be solved by computer code using the data in the provided CSV file. For example, do not simply define a variable astro_scores = 1234 with a fixed value. Your Jupyter Notebook should produce results with a modified data file that has the same format but different numerical (or NaN) values.

  • Avoid inefficient computations. Ensure that each cell can be run in about 20 seconds on a modern laptop. This allowance is generous and not a target; normally, each cell should run in just a few seconds or even milliseconds. Long-running cells will be timed out, which will result in a loss of marks.

  • Submit this test as a single .ipynb file using Canvas. You can simply keep the name test2-2026.ipynb. There is some basic testing code at the end that checks some parts of the coursework.

    Strict deadline: Monday, 30th of March 2026, at 1pm. There are no automatic extensions.
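The tips on returning values and on not hard-coding results can be illustrated with a minimal sketch. The function names `column_mean_printed` and `column_mean` and the toy data are hypothetical and not part of the coursework.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a CSV file; not the coursework data.
toy = pd.DataFrame({"score": [1.0, 2.0, np.nan, 4.0]})

def column_mean_printed(df, col):
    # Prints the result but returns None, so the value cannot be checked afterwards.
    print(df[col].mean())

def column_mean(df, col):
    # Computes the result from the data (NaN values are skipped) and returns it as a float.
    return float(df[col].mean())

result = column_mean(toy, "score")
```

Because `column_mean` reads its answer from the data frame, it keeps working if the values in the data change, whereas a fixed assignment such as `result = 2.33` would not.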

Note on independent work

You must complete all coursework tests on your own, but you are allowed to use online resources and all course notes and exercise solutions. The course notes from chapters 1 to 5 contain all that is required to solve the problems below. You are not allowed to ask other humans for help. In particular, you are not allowed to send, give, or receive code or markdown content to/from classmates or others.

The University Guidelines for Academic Malpractice apply: http://documents.manchester.ac.uk/display.aspx?DocID=2870

Important: Even if you are the originator of the work (and not the one who copied), the University Guidelines hold you equally responsible for this case of academic malpractice, and you may lose all coursework marks (or even be assigned 0 marks for the course).

Start of test

In [19]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

import numpy as np
import pandas as pd

Problem 1a

Consider a supervised learning problem on a $d$-dimensional feature space $\mathcal{X}=\mathbb{Z}^d$ with integer coordinates, and a $J$-dimensional label space $\mathcal{Y} = \mathbb{R}^J$ (notation as in the semester 1 lecture notes). Assume that the loss function is

image-3.png

Further assume that we are given data pairs $(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N) \in \mathcal{X}\times \mathcal{Y}$.

Now consider the best hypothesis given the data, i.e., the optimal hypothesis $h$ that minimizes the empirical error

image.png

Implement a function h_best(x, X, Y) that evaluates this best hypothesis for a given feature point x.

The inputs of h_best(x, X, Y) are

  • a $d$-dimensional NumPy vector x which can always be assumed (without checking) to be among the feature vectors $\{x_1,\ldots,x_N\}$

  • an $N\times d$ NumPy matrix X of the features

  • an $N\times J$ NumPy matrix Y of the labels

The function returns a $J$-dimensional NumPy vector.

All data types are standard floats, even if we only ever use integer values in x and X.

In [20]:
h_best = None

# TODO: Provide your solution code here that defines the function `h_best`

Problem 1b

Write a function best_err(X, Y) that, given the loss function and data as in Problem 1a, returns the empirical error of the best hypothesis as a floating point number.

(The function best_err may of course make use of h_best by calling it.)

In [22]:
best_err = None

# TODO: Provide your solution code here that defines the function `best_err`

Problem 1c

Devise data matrices X_1c and Y_1c with $N=5$ data points such that best_err(X_1c, Y_1c) returns the value $0.2$.

In [24]:
X_1c, Y_1c = None, None

# TODO: Provide your solution code here that defines the arrays X_1c and Y_1c

Problem 1d

Devise data matrices X_1d and Y_1d with $N=5$ data points such that best_err(X_1d, Y_1d) returns the value $0.0$.

In [26]:
X_1d, Y_1d = None, None

# TODO: Provide your solution code here that defines the arrays X_1d and Y_1d

Problem 2a

Using only plain Python with no modules except NumPy, write a function my_knn(x, X, k) that takes as inputs a $d$-dimensional NumPy vector x and an $N\times d$ NumPy array X (each row corresponding to a data point). The parameter k is a positive integer. The function returns a Python list with the indices of the $k$ nearest neighbours to x in X, where the distance between two $d$-dimensional vectors $\mathbf u = [u_0,u_1,\ldots,u_{d-1}]$ and $\mathbf v = [v_0,v_1,\ldots,v_{d-1}]$ is measured as

image-2.png

with $p(i)$ taking the value $2$ for odd $i$, and $1$ for even $i$.

The indices in the returned list should be ordered by nondecreasing distance of the data points to x, i.e., X[my_knn(x, X, 1)[0],:] is a data point closest to x. If there are multiple points in X with the exact same distance to x, the returned indices should be increasing.

Example: Assume that $k=4$ and the nearest neighbours to x are X[7,:], X[2,:], X[9,:], X[0,:] with distances $1.2,5.3,3.1,1.2$, respectively. Then the returned list should be [0, 7, 9, 2].
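The ordering rule in this example can be checked with a short sketch. It only reorders the four candidate indices and distances quoted above; it does not compute any distances, and the variable names are illustrative.

```python
# Candidate indices and their distances to x, taken from the example above.
indices = [7, 2, 9, 0]
distances = [1.2, 5.3, 3.1, 1.2]

# Sort by distance first and by index second, so that points at exactly
# the same distance come out in increasing index order.
ordered = [i for _, i in sorted(zip(distances, indices))]
print(ordered)  # [0, 7, 9, 2]
```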

In [28]:
my_knn = None

# TODO: Provide your solution code here that defines the function `my_knn`

Problem 2b

Using only plain Python with no modules except NumPy, write a function my_knn_predict(x, X, k, y) that takes the same inputs as the function in Problem 2a, as well as a Python list y with N elements (the labels of each data point). The function then returns a label (an element of y) that appears most frequently among the $k$ nearest neighbours. If there are multiple labels with the same number of votes, a label with an associated feature closest to x is preferred.

The function my_knn_predict may of course make use of my_knn by calling it.

Example: Assume that $k=5$ and the labels of the nearest neighbours sorted by nondecreasing distance from x are ['c', 'b', 'a', 'a', 'b']. In this case b should be returned, as b is one of the most frequent labels, and the nearest data point labelled b is at least as close to x as the nearest data point labelled a.
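The tie-breaking rule in this example can be traced with a minimal sketch that starts from the already sorted label list. It uses a plain dictionary for counting, and the variable names are illustrative.

```python
# Labels of the k = 5 nearest neighbours, sorted by nondecreasing distance (from the example).
labels = ['c', 'b', 'a', 'a', 'b']

# Count the votes for each label with a plain dictionary.
votes = {}
for lab in labels:
    votes[lab] = votes.get(lab, 0) + 1

top = max(votes.values())
# Among the labels with the top vote count, the one that occurs first
# in distance order wins the tie.
winner = next(lab for lab in labels if votes[lab] == top)
print(winner)  # b
```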

In [30]:
my_knn_predict = None

# TODO: Provide your solution code here that defines the function `my_knn_predict`

Problem 3a

We will now work with some astronomical observation data used to classify celestial objects. Some missing values are given as $-9999$, and those are removed first.

In [ ]:
# do not change code in this cell
astro = pd.read_csv("_datasets/star_classification.csv")
astro.replace(-9999, np.nan, inplace=True)
astro.dropna(inplace=True)
astro.head()
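The effect of the two cleaning steps can be seen on a small made-up table. The column names mimic the data set, but the values and the name `toy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the values are made up for illustration.
toy = pd.DataFrame({
    "u": [23.9, -9999.0, 25.3],
    "g": [22.3, 24.8, 22.8],
    "class": ["A", "B", "A"],
})

# Mark the sentinel value -9999 as missing, then drop every row with a missing entry.
cleaned = toy.replace(-9999, np.nan).dropna()
print(len(toy), len(cleaned))  # 3 2
```

Note that dropna keeps the original row labels, so the index of the cleaned table is no longer consecutive.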

We only retain the columns named u, g, r, i, z, and redshift, and let y be the column class. We then split into training and testing data as usual.

In [33]:
# do not change code in this cell
X = astro[["u", "g", "r", "i", "z", "redshift"]]
y = astro["class"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( 
    X, y, 
    test_size=0.2,
    shuffle=True,
    random_state=3383 
)

For each value $k=2,3,\ldots,6$, train a kNN classifier with $k$ neighbours, in a pipeline with z-score standardization, on the training set. Find the accuracy score of the trained model on the test set. Produce a series astro_scores indexed by values of $k$ whose values are the accuracy scores.
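The expected shape of such a series, with the values of $k$ as its index, can be sketched as follows. The function `placeholder_score` is a hypothetical stand-in that returns made-up numbers, not accuracy scores, and no model is trained here.

```python
import pandas as pd

def placeholder_score(k):
    # Hypothetical stand-in for "train with k neighbours and score the model";
    # the returned number is NOT a real accuracy.
    return 1.0 / k

ks = range(2, 7)  # k = 2, 3, ..., 6
scores = pd.Series({k: placeholder_score(k) for k in ks})
print(list(scores.index))  # [2, 3, 4, 5, 6]
```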

In [34]:
astro_scores = None

# TODO: Provide your solution code here that defines `astro_scores`

Problem 3b

Build your own sklearn Pipeline called astro_pipe that classifies the data from Problem 3a. You can use any scaling functions or models as part of this pipeline. Tune the parameters to achieve the highest possible classification accuracy on test sets made up of 20% of the overall data.

There are two requirements:

  • training and prediction should each take less than 20 seconds to avoid timeouts

  • your final pipeline should be called astro_pipe and it should provide the usual astro_pipe.fit() and astro_pipe.predict() methods
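One way to check the time requirement is to measure a call with the standard library. The helper `timed_call` below is a hypothetical name, and the workload is a stand-in rather than a pipeline.

```python
import time

def timed_call(fn, *args, **kwargs):
    # Hypothetical helper: runs fn and returns its result together with
    # the elapsed wall-clock time in seconds.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; with the coursework objects one might instead call
# timed_call(astro_pipe.fit, X_train, y_train).
result, seconds = timed_call(sorted, range(100_000, 0, -1))
print(f"{seconds:.3f} s")
```

Timings vary between machines, so it is sensible to leave a comfortable margin below the 20 second limit.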

In [36]:
astro_pipe = None

# TODO: Provide your solution code here that defines `astro_pipe`

End of test

You can use the tests below to get an indication of whether parts of your work return the right data types.

In [ ]:
try: 
    import re
    assert re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', student_mail) and 'firstname' not in student_mail
    print("OKAY - student_mail appears to be valid")
except:
    print("WARN - student_mail could not be verified")

try: 
    assert callable(h_best)
    print("OKAY - h_best should be a function")
except:
    print("FAIL - h_best should be a function")

import numpy as np
X = np.array([[1,2,3],[2,-3,4],[1,2,3],[2,2,2]])
Y = np.array([[1,2],[1,2],[3,4],[1,2]])
x = np.array([1,2,3])

try:
    val = h_best(x, X, Y)
    assert val is not None
    print("OKAY - h_best returns a value")
except:
    print("FAIL - h_best does not return a value")

try: 
    assert callable(best_err)
    print("OKAY - best_err should be a function")
except:
    print("FAIL - best_err should be a function")

try:
    val = best_err(X, Y)
    assert val is not None
    print("OKAY - best_err returns a value")
except:
    print("FAIL - best_err does not return a value")

try: 
    assert type(X_1c) == np.ndarray
    assert type(X_1d) == np.ndarray
    print("OKAY - X_1c, X_1d should be NumPy arrays")
except:
    print("FAIL - X_1c, X_1d should be NumPy arrays")

try: 
    assert type(Y_1c) == np.ndarray
    assert type(Y_1d) == np.ndarray
    print("OKAY - Y_1c, Y_1d should be NumPy arrays")
except:
    print("FAIL - Y_1c, Y_1d should be NumPy arrays")

try: 
    assert callable(my_knn)
    print("OKAY - my_knn should be a function")
except:
    print("FAIL - my_knn should be a function")

try:
    val = my_knn(x, X, k=2)
    assert val is not None
    print("OKAY - my_knn returns a value")
except:
    print("FAIL - my_knn does not return a value")

try: 
    assert callable(my_knn_predict)
    print("OKAY - my_knn_predict should be a function")
except:
    print("FAIL - my_knn_predict should be a function")

try:
    # one label per row of X (hypothetical labels, used only for this check)
    val = my_knn_predict(x, X, 2, ['a', 'b', 'a', 'c'])
    assert val is not None
    print("OKAY - my_knn_predict returns a value")
except:
    print("FAIL - my_knn_predict does not return a value")

try: 
    assert type(astro_scores) == pd.Series
    print("OKAY - astro_scores should be a pandas series")
except:
    print("FAIL - astro_scores should be a pandas series")

try: 
    assert callable(astro_pipe.fit) and callable(astro_pipe.predict)
    print("OKAY - astro_pipe should provide fit and predict methods")
except:
    print("FAIL - astro_pipe should provide fit and predict methods")