In [1]:

# Carefully modify the below two string variables. Ensure there are no typos.

student_id = "12345678" # set this to your student ID

student_mail = "firstname.lastname@student.manchester.ac.uk" # your email address

Coursework 3

This coursework test contains several Jupyter Notebook cells with the comment # TODO. This is where you type the code for your solutions. Do not alter any of the other cells.

It is good practice to include markdown cells explaining your work, but in this test they won't be marked.

Here are some tips:

Do not alter the names of the predefined variables and functions, such as itls, CoD, etc. The (return) values of these variables and functions will inform the marking. Renaming them or failing to follow the problem description will result in loss of marks.
Ensure that functions return values, not merely print them. Each function should have at least one occurrence of the return keyword, followed by a variable of the type required by the question.
Do not hard-code any solution variables. All problems must be solved by computer code using the data in the provided CSV file. For example, do not simply define a variable cod = 1234 with a fixed value. Your Jupyter Notebook should produce results with modified data that has the same format but different numerical (or NaN) values.
Avoid inefficient computations. Ensure that each cell can be run in about 20 seconds on a modern laptop, unless a different timeout is stated. Long-running cells will be timed out, which will result in loss of marks.
Submit this test as a single .ipynb file using Canvas. You can simply keep the name test3-2026.ipynb. There is some basic testing code at the end that verifies some parts of the coursework.

Strict deadline: Monday, 11th of May 2026, at 1pm. There are no automatic extensions.

Note on independent work

You need to complete all coursework tests on your own, but you are allowed to use online resources and all course notes and exercise solutions. The course notes from chapters 1 to 7 contain all that is required to solve the problems below. You are not allowed to ask other humans for help. In particular, you are not allowed to send, give, or receive code or markdown content to/from classmates and others.

The University Guidelines for Academic Malpractice apply: http://documents.manchester.ac.uk/display.aspx?DocID=2870

Important: Even if you are the originator of the work (and not the one who copied), the University Guidelines hold you equally responsible for this case of academic malpractice, and you may lose all coursework marks (or even be assigned 0 marks for the course).

Start of test

In [2]:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

import numpy as np
import pandas as pd

Problem 1a

We have seen that LASSO tends to produce sparse weight vectors that approximately minimize $\| \mathbf X \mathbf w - \mathbf y\|_2^2$.

An alternative approach to solving the sparse regression problem is iterative thresholding. In this approach, we first compute a minimum-norm least squares solution $\mathbf w^{(0)}$ that minimizes $\| \mathbf X \mathbf w - \mathbf y\|_2^2$ without regularisation. Then, we identify the components of $\mathbf w^{(0)}$ that are below a certain threshold $\tau\geq 0$ in modulus, and remove the corresponding features from $\mathbf{X}$ to obtain $\mathbf{X}^{(1)}$. We then minimize $\| \mathbf X^{(1)} \mathbf w - \mathbf y\|_2^2$ to get $\mathbf{w}^{(1)}$, and so on, until all remaining components of the weight vector $\mathbf{w}^{(k)}$ are $\geq \tau$ in modulus.

Implement this iterative thresholding algorithm in a function itls(X, y, tau) using only plain Python and NumPy. Here, X is an $n\times d$ NumPy array, y is a NumPy vector of length $n$, and tau is the threshold (a floating point number $\geq 0$). The function should return the (sparse) $d$-dimensional weight vector $\mathbf w$.
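As a hint on the building blocks only, the following minimal sketch shows a single thresholding step on made-up toy data. It assumes that `np.linalg.lstsq` is an acceptable way to obtain the minimum-norm least squares solution; the names `X_toy`, `y_toy`, `tau_toy`, `w0` and `keep` are hypothetical and not part of the test. It is not the full iteration and does not define itls.

```python
import numpy as np

# Made-up toy data: the third column duplicates the first, so X_toy is rank-deficient
# and the least squares problem has infinitely many minimisers.
X_toy = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 1.0, 1.0]])
y_toy = np.array([2.0, 0.1, 2.1])

# Among all least squares minimisers, lstsq returns the one of smallest 2-norm.
w0, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# Boolean mask of the components that are at least tau_toy in modulus.
tau_toy = 0.5
keep = np.abs(w0) >= tau_toy

# Reduced feature matrix for the next least squares solve.
X1 = X_toy[:, keep]
```

Here the weight of the duplicated feature is split evenly between columns 0 and 2, the second component falls below the threshold, and `X1` keeps two of the three columns.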

In [3]:

itls = None

# TODO: Provide your solution code here that defines the function `itls`

Problem 1b

Write a function itls1(X, y, tau) as in the previous problem but with the following modification: at each iteration, instead of zeroing all components of the weight vector below the threshold, only the single component which is below the threshold and smallest in modulus is set to zero. (If there are several components of the same minimal modulus, only the one with the smallest index should be zeroed.)

As above, X is an $n\times d$ NumPy array, y is a NumPy vector of length $n$, and tau is the threshold (a floating point number $\geq 0$). The function should return the (sparse) $d$-dimensional weight vector $\mathbf w$.
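The tie-breaking rule above matches the documented behaviour of `np.argmin`, which returns the index of the first occurrence of the minimum. A small illustration on a made-up weight vector (the names `w_toy`, `tau_toy`, `idx` and `below` are hypothetical):

```python
import numpy as np

# Made-up weights: components 1 and 2 share the same minimal modulus 0.5.
w_toy = np.array([3.0, -0.5, 0.5, 2.0])
tau_toy = 1.0

# argmin returns the first index at which the minimum is attained.
idx = int(np.argmin(np.abs(w_toy)))

# Only a component that is actually below the threshold would be zeroed.
below = bool(np.abs(w_toy[idx]) < tau_toy)
```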

In [5]:

itls1 = None

# TODO: Provide your solution code here that defines the function `itls1`

Problem 1c

Produce a pandas dataframe CoD with two columns itls and itls1 and an index with values $5,10,15,\ldots,50$ corresponding to the $\tau$ threshold. The entries of CoD correspond to the coefficient of determination (CoD) of the predictions produced with the weights from itls and itls1, respectively, on the dataset loaded below.
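For reference, and assuming the usual definition of the coefficient of determination, $R^2 = 1 - \sum_i (y_i - \hat y_i)^2 / \sum_i (y_i - \bar y)^2$ (the same quantity that `sklearn.metrics.r2_score` computes), a minimal NumPy sketch looks as follows. The function name `coefficient_of_determination` and the toy numbers are hypothetical.

```python
import numpy as np

def coefficient_of_determination(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for one-dimensional targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up example: one of four predictions is off by 1.
r2_toy = coefficient_of_determination([1, 2, 3, 4], [1, 2, 3, 5])
```

Note that this quantity can be negative when the predictions fit worse than the constant prediction $\bar y$.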

In [7]:

from sklearn import datasets
diabetes = datasets.load_diabetes(as_frame=True)["frame"]
X = diabetes.drop("target", axis=1).values
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0) # z-normalize
y = diabetes["target"].values

In [8]:

CoD = None

# TODO: Provide code to compute the dataframe CoD using the loaded data X, y.

Problem 2

The code below loads a corpus of 11,314 newsgroups posts on 20 topics and vectorizes them by counting the 500 most frequently occurring words. This results in a NumPy feature array X_train of size $11{,}314 \times 500$. Associated with this is an array y_train with entries in $\{ 0,1,\ldots, 19 \}$ labelling the topic of each of the posts.

Train an sklearn pipeline newspipe that provides a newspipe.predict(X) method that, when given an $\ell \times 500$ NumPy array of features X, returns an $\ell \times 20$ NumPy array of predictions. Each row of the returned array should be a probability vector, indicating the probability that a newsgroup post belongs to each of the 20 categories.

Notes:

The first call of fetch_20newsgroups might take several minutes to run as the corpus needs to be downloaded from the web. This happens only once.
If you're curious about the individual posts, they are stored in the list corpus. The 500 most frequent words (of at least 3 characters length) are stored in the array words.
Training and prediction times should each be below 40 seconds on a standard laptop to avoid timeouts.
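To make the required output format concrete, here is a small, hypothetical checker (not part of the marking code) that tests whether an array has 20 columns and rows that are probability vectors, that is, non-negative entries summing to 1. The name `is_probability_matrix` and the tolerance are assumptions.

```python
import numpy as np

def is_probability_matrix(P, n_classes=20, tol=1e-8):
    """Return True if P is 2-D with n_classes columns and every row is a probability vector."""
    P = np.asarray(P, dtype=float)
    return (P.ndim == 2
            and P.shape[1] == n_classes
            and bool(np.all(P >= 0))
            and bool(np.allclose(P.sum(axis=1), 1.0, atol=tol)))

uniform = np.full((3, 20), 0.05)   # three rows, each spreading probability evenly
labels = np.array([0, 5, 19])      # class labels, not probabilities
```

With these inputs, `is_probability_matrix(uniform)` holds, whereas a plain vector of class labels such as `labels` does not have the required shape.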

In [10]:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

corpus, y_train = fetch_20newsgroups(return_X_y=True)
vectorizer = CountVectorizer(token_pattern=r"\b[a-zA-Z]{3,}\b", max_features=500)
X_train = vectorizer.fit_transform(corpus)
words = vectorizer.get_feature_names_out()
X_train = np.array(X_train.todense())

In [11]:

newspipe = None

# TODO: Provide code to define newspipe.

Problem 3

Consider the dataframes eur and usd loaded below, which store historical GBP-EUR and GBP-USD exchange rates. For each day, the dataframes list the exchange rate at trade closing time, the daily highest and lowest rates, and the rate at trade opening time. The dataframe rates combines the data from eur and usd into a single dataframe. The task is to predict a day's exchange rates using only information from previous days.

Write two functions train_models and predict_rates as follows.

The function train_models(rates_t) takes as input a training dataframe rates_t of historical exchange rates. The format of rates_t is the same as that of rates, but the date range might be different. For example, your train_models function should work if only historical data from 2017 until 2023 is provided. However, you can always assume that a continuous subset of at least 500 rows is provided. The function returns a dictionary models of trained models. This might be a single model or multiple models; the choice is completely up to you. The only requirement is that all model training happens in that function and completes in at most 40 seconds on a standard laptop (for the full dataset of all historical values).

The function predict_rates(models, rates_h) takes as inputs the dictionary of trained models and a dataframe rates_h of historical data which you can assume to be a continuous subset of at least 50 rows from rates. The function returns a Python list of eight floating point values, corresponding to next trading day predictions of [EUR_Close, EUR_High, EUR_Low, EUR_Open, USD_Close, USD_High, USD_Low, USD_Open]. It should not take longer than 10 seconds to return a prediction.

Notes:

Training time on the full dataset should be below 40 seconds on a standard laptop, and prediction time should be at most 10 seconds.
Even though train_models will be provided with at least 500 trading days of data, you do not need to use all of them for the training. Similarly for the prediction.

In [ ]:

# do not change code in this cell
eur = pd.read_csv('_datasets/gbp-eur2.csv', index_col='Date')
usd = pd.read_csv('_datasets/gbp-usd2.csv', index_col='Date')
rates = pd.concat([eur, usd], axis=1, keys=['EUR', 'USD'])
rates.columns = ['_'.join(col).strip() for col in rates.columns.values] 
rates.dropna(inplace=True)
rates.head()
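To see what the loading cell does to the column names, the following sketch repeats the same steps on two tiny made-up dataframes. It assumes that each CSV file has the columns Close, High, Low and Open (inferred from the order of the required predictions); the names `eur_toy`, `usd_toy`, `rates_toy` and all numbers are hypothetical.

```python
import pandas as pd

cols = ["Close", "High", "Low", "Open"]  # assumed CSV column names

# Made-up rates; the USD frame is missing the first date.
eur_toy = pd.DataFrame([[1.17, 1.18, 1.16, 1.17],
                        [1.18, 1.19, 1.17, 1.17]],
                       index=pd.Index(["2024-08-08", "2024-08-09"], name="Date"),
                       columns=cols)
usd_toy = pd.DataFrame([[1.27, 1.28, 1.26, 1.27]],
                       index=pd.Index(["2024-08-09"], name="Date"),
                       columns=cols)

# Side-by-side concatenation creates two-level column labels such as ('EUR', 'Close').
rates_toy = pd.concat([eur_toy, usd_toy], axis=1, keys=["EUR", "USD"])

# Flatten the two levels into single names such as 'EUR_Close'.
rates_toy.columns = ["_".join(col) for col in rates_toy.columns]

# Drop dates for which either currency has missing values.
rates_toy = rates_toy.dropna()
```

Only the date present in both toy frames survives, and the eight flattened column names appear in the same order as the predictions expected from predict_rates.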

In [14]:

train_models = None
predict_rates = None

# TODO: Define the functions `train_models` and `predict_rates`.

Example: This is how your functions should be usable.

# training data
rates_t = rates.loc['2017-01-01':'2023-12-31']
models = train_models(rates_t)

# get some historic data
rates_h = rates.loc['2024-01-01':'2024-08-09']

# predict next row
lst = predict_rates(models, rates_h)

# lst should contain 8 predictions for 2024-08-12, 
# the next trading day following 2024-08-09

End of test

You can use the tests below to check whether parts of your work return the right data types.

In [ ]:

try: 
    import re
    assert re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', student_mail) and not 'firstname' in student_mail
    print("OKAY - student_mail appears to be valid")
except:
    print("WARN - student_mail could not be verified")

import numpy as np
X = np.array([[1,2,3],[2,-3,4],[1,2,3],[2,2,2]])
Y = np.array([[1,2],[1,2],[3,4],[1,2]])
x = np.array([1,2,3])

try: 
    assert callable(itls)
    print("OKAY - itls should be a function")
except:
    print("FAIL - itls should be a function")

X = np.array([[1,2,3],[2,-3,4],[1,2,3],[2,2,2]])
y = np.array([1,2,3,4])

try:
    w = itls(X, y, 0.2)
    assert type(w) == np.ndarray
    print("OKAY - itls returns a NumPy array")
except:
    print("FAIL - itls does not return a NumPy array")

try: 
    assert callable(itls1)
    print("OKAY - itls1 should be a function")
except:
    print("FAIL - itls1 should be a function")

try: 
    assert type(CoD) == pd.DataFrame
    print("OKAY - CoD should be a pandas dataframe")
except:
    print("FAIL - CoD should be a pandas dataframe")

try: 
    assert callable(newspipe.predict)
    print("OKAY - newspipe provides predict method")
except:
    print("FAIL - newspipe does not provide predict method")

try: 
    assert callable(train_models)
    print("OKAY - train_models should be a function")
except:
    print("FAIL - train_models should be a function")

try: 
    assert callable(predict_rates)
    print("OKAY - predict_rates should be a function")
except:
    print("FAIL - predict_rates should be a function")