{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "64b51d52",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Carefully modify the below two string variables. Ensure there are no typos.\n",
    "\n",
    "student_id = \"12345678\" # set this to your student ID\n",
    "\n",
    "student_mail = \"firstname.lastname@student.manchester.ac.uk\" # your email address"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eef390d5",
   "metadata": {},
   "source": [
    "# Coursework 3\n",
    "\n",
    "This coursework test contains several Jupyter Notebook cells with the comment `# TODO`. This is where you type the code for your solutions. Do not alter any of the other cells. \n",
    "\n",
    "It is good practice to include markdown cells explaining your work, but in this test they won't be marked. \n",
    "\n",
    "Here are some tips:\n",
    "\n",
    "* **Do not alter the names of the predefined variables and functions,** such as `itls`, `CoD`, etc. The (return) values of these variables and functions will inform the marking. Renaming them and failure to follow the problem description will result in loss of marks.\n",
    "\n",
    "* **Ensure that functions *return* values, not merely print them.** Each function should have at least one occurance of the `return` keyword, followed by a variable of the type required by the question.  \n",
    "\n",
    "* **Do not hard-code any solution variables.** All problems must be solved by computer code using the data in the provided CSV file. For example, do *not* simply define a variable `cod = 1234` with a fixed value. Your Jupyter Notebook should produce results with modified data that has the same format but different numerical (or NaN) values.\n",
    "\n",
    "* **Avoid inefficient computations.** Ensure that each cell can be run in about 20 seconds on a modern laptop, except some other timeout is stated. Long-running cells will be timed out which will result in loss of marks.\n",
    "\n",
    "* **Submit this test as a single .ipynb file using Canvas.** You can simply keep the name `test3-2026.ipynb`. There is a basic testing code at the end that verifies some parts of the coursework.\n",
    "\n",
    "   <span style=\"color:blue; font-weight:bold\">Strict deadline: Monday, 11th of May 2026, at 1pm. There are no automatic extensions.</span>\n",
    "\n",
    "### Note on independent work\n",
    "\n",
    "You need to complete all coursework tests independently on your own, but you are allowed to use online resources and all course notes and exercise solutions. The course notes from chapters 1 to 7 contain all that is required to solve the below problems. You are not allowed to ask other humans for help. In particular, you are not allowed to send, give, or receive code or markdown content to/from classmates and others.\n",
    "\n",
    "The University Guidelines for Academic Malpractice apply: http://documents.manchester.ac.uk/display.aspx?DocID=2870\n",
    "\n",
    "**Important: Even if you are the originator of the work** (and not the one who copied), the University Guidelines require that you will be equally responsible for this case of academic malpractice and may lose all coursework marks (or even be assigned 0 marks for the course)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2295b152",
   "metadata": {},
   "source": [
    "# Start of test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "fdc2a253-d922-4124-931c-d4a90308d222",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-34389c5497ccc8a9",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.simplefilter(action='ignore', category=FutureWarning)\n",
    "warnings.simplefilter(action='ignore', category=DeprecationWarning)\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "108b2d6e",
   "metadata": {},
   "source": [
    "## Problem 1a\n",
    "\n",
    "We have seen that LASSO tends to produce sparse weight vectors that approximately minimize $\\| \\mathbf X \\mathbf w - \\mathbf y\\|_2^2$. \n",
    "\n",
    "An alternative approach to solving the sparse regression problem is *iterative thresholding*. In this approach, we first compute a minimum-norm least squares solution $\\mathbf w^{(0)}$ that minimizes $\\| \\mathbf X \\mathbf w - \\mathbf y\\|_2^2$ without regularisation. Then, we identify the components of $\\mathbf w^{(0)}$ that are below a certain threshold $\\tau\\geq 0$ in modulus, and remove the corresponding features from $\\mathbf{X}$ to obtain  $\\mathbf{X}^{(1)}$. We then minimize  $\\| \\mathbf X^{(1)} \\mathbf w - \\mathbf y\\|_2^2$ to get $\\mathbf{w}^{(1)}$, and so on, until all components in the weight vector $\\mathbf{w}^{(k)}$ are $\\geq \\tau$ in modulus. \n",
    "\n",
    "Implement this iterative thresholding algorithm in a function `itls(X, y, tau)` using only plain Python and NumPy. Here, `X` is an $n\\times d$ NumPy array, `y` is a NumPy vector of length $n$, and `tau` is the threshold (a floating point number $\\geq 0$). The function should return the (sparse) $d$-dimensional weight vector $\\mathbf w$."
   ]
  },
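  {
   "cell_type": "markdown",
   "id": "a3f1c9e2",
   "metadata": {},
   "source": [
    "As a compact restatement of the iteration described in Problem 1a, the steps can be sketched in formulas. The index set $S^{(k)}$ of retained features and the submatrix $\\mathbf X_{S^{(k)}}$ (the columns of $\\mathbf X$ with indices in $S^{(k)}$) are notation introduced only for this sketch and do not appear in the problem statement. Starting from $S^{(0)} = \\{1,\\ldots,d\\}$, each step computes\n",
    "\n",
    "$$\n",
    "\\mathbf w^{(k)} = \\arg\\min_{\\mathbf w} \\| \\mathbf X_{S^{(k)}} \\mathbf w - \\mathbf y \\|_2^2, \\qquad S^{(k+1)} = \\{ j \\in S^{(k)} : |w^{(k)}_j| \\geq \\tau \\},\n",
    "$$\n",
    "\n",
    "where $w^{(k)}_j$ denotes the component of $\\mathbf w^{(k)}$ belonging to feature $j$, and for $k=0$ the minimum-norm minimizer is taken, as stated in the problem. The iteration stops once $S^{(k+1)} = S^{(k)}$. Assuming that removed features receive weight zero, which matches the required $d$-dimensional sparse output, the returned vector has the entries of the final $\\mathbf w^{(k)}$ at the indices in $S^{(k)}$ and zeros elsewhere."
   ]
  },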
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6b3dc6c4-7547-4614-b0a7-7bab6a2d05e6",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "ducks-mean",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "itls = None\n",
    "\n",
    "# TODO: Provide your solution code here that defines the function `itls`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd05ba1c",
   "metadata": {},
   "source": [
    "## Problem 1b\n",
    "\n",
    "Write a function `itls1(X, y, tau)` like in the previous problem but with the following modification: at each iteration, instead of zeroing *all* components of the weight vector below the threshold, only the single component which is below the threshold *and smallest in modulus* is set to zero. (If there are several components of the same minimal modulus, only the one with the smallest index should be zeroed.) \n",
    "\n",
    "As above, `X` is an $n\\times d$ NumPy array, `y` is a NumPy vector of length $n$, and `tau` is the threshold (a floating point number $\\geq 0$). The function should return the (sparse) $d$-dimensional weight vector $\\mathbf w$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c5208686",
   "metadata": {},
   "outputs": [],
   "source": [
    "itls1 = None\n",
    "\n",
    "# TODO: Provide your solution code here that defines the function `itls1`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7882535d",
   "metadata": {},
   "source": [
    "## Problem 1c\n",
    "\n",
    "Produce a pandas dataframe `CoD` with two columns `itls` and `itls1` and an index with values $5,10,15,\\ldots,50$ corresponding to the $\\tau$ threshold. The entries of `CoD` correspond to the coefficient of determination (CoD) of the predictions produced with the weights from `itls` and `itls1`, respectively, on the dataset loaded below."
   ]
  },
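  {
   "cell_type": "markdown",
   "id": "b7d24e10",
   "metadata": {},
   "source": [
    "For reference, a common definition of the coefficient of determination is sketched here. It follows the convention used by, for example, `sklearn.metrics.r2_score`, and should be checked against the definition in the course notes. The symbols $\\hat y_i$ (the $i$-th entry of the prediction, assumed to be $\\mathbf X \\mathbf w$) and $\\bar y$ (the mean of the entries of $\\mathbf y$) are introduced only for this note:\n",
    "\n",
    "$$\n",
    "R^2 = 1 - \\frac{\\sum_{i=1}^n (y_i - \\hat y_i)^2}{\\sum_{i=1}^n (y_i - \\bar y)^2}.\n",
    "$$\n",
    "\n",
    "Under this convention, $R^2 = 1$ corresponds to a perfect fit, while a model that always predicts $\\bar y$ has $R^2 = 0$."
   ]
  },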
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "9bea7c26",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "diabetes = datasets.load_diabetes(as_frame=True)[\"frame\"]\n",
    "X = diabetes.drop(\"target\", axis=1).values\n",
    "X = (X - np.mean(X, axis=0)) / np.std(X, axis=0) # z-normalize\n",
    "y = diabetes[\"target\"].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ce505852",
   "metadata": {},
   "outputs": [],
   "source": [
    "CoD = None\n",
    "\n",
    "# TODO: Provide code to compute the dataframe CoD using the loaded data X, y."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4befd57",
   "metadata": {},
   "source": [
    "## Problem 2\n",
    "\n",
    "The code below loads a corpus of 11,314 newsgroups posts on 20 topics and *vectorizes* them by counting the 500 most frequently occuring words. This results in a NumPy feature array `X_train` of size $11,314 \\times 500$. Associated with this is an array `y_train` with entries in $\\{ 0,1,\\ldots, 19 \\}$ labelling the topic of each of the posts. \n",
    "\n",
    "Train an sklearn pipeline `newspipe` that provides a `newspipe.predict(X)` method that, when given a $\\ell \\times 500$ NumPy array of features `X`, returns an $\\ell \\times 20$ NumPy array of predicitions. Each row of the returned array should be a probability vector, indicating the probability of a newsgroup post to belong to each of the 20 categories.\n",
    "\n",
    "**Notes:** \n",
    "* The first call of `fetch_20newsgroups` might take several minutes to run as the corpus needs to be downloaded from the web. This happens only once. \n",
    "\n",
    "* If you're curious about the individual posts, they are stored in the list `corpus`. The 500 most frequent words (of at least 3 characters length) are stored in the array `words`.\n",
    "\n",
    "* Training and prediction times should be below 40 seconds on a standard laptop, respectively, to avoid timeouts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "e3346a95",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_20newsgroups\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "corpus, y_train = fetch_20newsgroups(return_X_y=True)\n",
    "vectorizer = CountVectorizer(token_pattern=r\"\\b[a-zA-Z]{3,}\\b\", max_features=500)\n",
    "X_train = vectorizer.fit_transform(corpus)\n",
    "words = vectorizer.get_feature_names_out()\n",
    "X_train = np.array(X_train.todense())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "7c926f19",
   "metadata": {},
   "outputs": [],
   "source": [
    "newspipe = None\n",
    "\n",
    "# TODO: Provide code to define newspipe. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4eedb1cc",
   "metadata": {},
   "source": [
    "## Problem 3\n",
    "\n",
    "Consider the below loaded dataframes `eur` and `gbp` which store historical GBP-EUR and GBP-USD exchange rates. For each day, the dataframes list the exchange rate at trade closing time, the daily highest and lowest rates, and the rate at trade opening time. The dataframe `rates` combines the data from `eur` and `gbp` into a single dataframe. The task is to predict a day's exchange rates *using only information from previous days*.\n",
    "\n",
    "Write two functions `train_models` and `predict_rates` as follows.\n",
    "\n",
    "The function `train_models(rates_t)` takes as input a training dataframe `rates_t` of historical exchange rates. The format `rates_t` is the same as `rates`, but the date range might be different. For example, your `train_models` function should work if only historical data from 2017 until 2023 is provided. But you can always assume that a continuous subset of at least 500 rows is provided. The function returns a dictionary `models` of trained models. This might be a single model or multiple models, completely up to you. The only requirement is that all model training happens in that function and should complete in at most 40 seconds on a standard laptop (for the full dataset of all historical values).\n",
    "\n",
    "The function `predict_rates(models, rates_h)` takes as inputs the dictionary of trained models and a dataframe `rates_h` of historical data which you can assume to be a continuous subset of at least 50 rows from `rates`. The function returns a Python list of eight floating point values, corresponding to next trading day predictions of `[EUR_Close, EUR_High, EUR_Low, EUR_Open, USD_Close, USD_High, USD_Low, USD_Open]`. It should not take longer than 10 seconds to return a prediction.\n",
    "\n",
    "**Notes:** \n",
    "* Training time on the full dataset should be below 40 seconds on a standard laptop, and prediction time should be at most 10 seconds. \n",
    "\n",
    "* Even though `train_models` will be provided with at least 500 trading days of data, you do not need to use all of them for the training. Similarly for the prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c17e46c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# do not change code in this cell\n",
    "eur = pd.read_csv('_datasets/gbp-eur2.csv', index_col='Date')\n",
    "usd = pd.read_csv('_datasets/gbp-usd2.csv', index_col='Date')\n",
    "rates = pd.concat([eur, usd], axis=1, keys=['EUR', 'USD'])\n",
    "rates.columns = ['_'.join(col).strip() for col in rates.columns.values] \n",
    "rates.dropna(inplace=True)\n",
    "rates.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "efc3fd38",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_models = None\n",
    "predict_rates = None\n",
    "\n",
    "# TODO: Define the functions `train_models` and `predict_rates`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5c0ebd8",
   "metadata": {},
   "source": [
    "**Example:** This is how your functions should be usable. \n",
    "```{python}\n",
    "# training data\n",
    "rates_t = rates.loc['2017-01-01':'2023-12-31']\n",
    "models = train_models(rates_t)\n",
    "\n",
    "# get some historic data\n",
    "rates_h = rates.loc['2024-01-01':'2024-08-09']\n",
    "\n",
    "# predict next row\n",
    "lst = predict_rates(models, rates_h)\n",
    "\n",
    "# lst should contain 8 predictions for 2024-08-12, \n",
    "# the next trading day following 2024-08-09\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "415a227a",
   "metadata": {},
   "source": [
    "# End of test"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe30d8cb",
   "metadata": {},
   "source": [
    "You can use the below tests to get an indication if part of your work returns the right data types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "281d2d7e",
   "metadata": {},
   "outputs": [],
   "source": [
    "try: \n",
    "    import re\n",
    "    assert re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$', student_mail) and not 'firstname' in student_mail\n",
    "    print(\"OKAY - student_mail appears to be valid\")\n",
    "except:\n",
    "    print(\"WARN - student_mail could not be verified\")\n",
    "\n",
    "import numpy as np\n",
    "X = np.array([[1,2,3],[2,-3,4],[1,2,3],[2,2,2]])\n",
    "Y = np.array([[1,2],[1,2],[3,4],[1,2]])\n",
    "x = np.array([1,2,3])\n",
    "\n",
    "try: \n",
    "    assert callable(itls)\n",
    "    print(\"OKAY - itls should be a function\")\n",
    "except:\n",
    "    print(\"FAIL - itls should be a function\")\n",
    "\n",
    "X = np.array([[1,2,3],[2,-3,4],[1,2,3],[2,2,2]])\n",
    "y = np.array([1,2,3,4])\n",
    "\n",
    "try:\n",
    "    w = itls(X, y, 0.2)\n",
    "    assert type(w) == np.ndarray\n",
    "    print(\"OKAY - itls returns a NumPy array\")\n",
    "except:\n",
    "    print(\"FAIL - itls does not return a NumPy array\")\n",
    "\n",
    "try: \n",
    "    assert callable(itls1)\n",
    "    print(\"OKAY - itls1 should be a function\")\n",
    "except:\n",
    "    print(\"FAIL - itls1 should be a function\")\n",
    "\n",
    "try: \n",
    "    assert type(CoD) == pd.DataFrame\n",
    "    print(\"OKAY - CoD should be a pandas dataframe\")\n",
    "except:\n",
    "    print(\"FAIL - CoD should be a pandas dataframe\")\n",
    "\n",
    "try: \n",
    "    assert callable(newspipe.predict)\n",
    "    print(\"OKAY - newspipe provides predict method\")\n",
    "except:\n",
    "    print(\"FAIL - newspipe does not provide predict method\")\n",
    "\n",
    "try: \n",
    "    assert callable(train_models)\n",
    "    print(\"OKAY - train_models should be a function\")\n",
    "except:\n",
    "    print(\"FAIL - train_models should be a function\")\n",
    "\n",
    "try: \n",
    "    assert callable(predict_rates)\n",
    "    print(\"OKAY - predict_rates should be a function\")\n",
    "except:\n",
    "    print(\"FAIL - predict_rates should be a function\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  },
  "toc-autonumbering": false,
  "toc-showcode": false,
  "toc-showmarkdowntxt": true
 },
 "nbformat": 4,
 "nbformat_minor": 5
}