{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.1\n",
    "\n",
    "Write a function `median_tree(X, y, max_depth=5, level=0)` that takes as inputs an $n \\times d$ NumPy array $X$ corresponding to $n$ data points in $d$ dimensions, and a Python list $y$ of labels. The parameter `max_depth` is optional and takes the value $5$ by default. Using only plain Python without any modules other than NumPy, implement the decision tree fitting process we have learned about in the lectures. The fitted tree is returned by the function `median_tree` in form of a nested sequence of Python dictionaries, each of which is of the form\n",
    "\n",
    "```\n",
    "tree = { \n",
    "    'level': int,\n",
    "    'subsetX': np.array,\n",
    "    'subsety': list,\n",
    "    'gini': float, \n",
    "    'split_coord': int,\n",
    "    'split_value': float,\n",
    "    'left': tree,\n",
    "    'right': tree\n",
    "}\n",
    "```\n",
    "\n",
    "Starting with the full sets `subsetX = X` and `subsety = y` at the root node of the tree (the \"parent node\" at level 0), the splitting of data into tree branches is performed by selecting the coordinate `split_coord` (an integer in $0,1,\\ldots,d-1$) that gives the best total impurity for the two child nodes (`left` and `right`) when the data is split at the median value of that coordinate (stored in `split_value`). The children of a node at depth `level` have depth `level+1` and the splitting stops when either (i) a node has only data points with the same label or (ii) `level==max_depth`. In this case, assign `None` to the appropriate values in the dictionary."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.2\n",
    "\n",
    "Write a function `median_tree_predict(tree, X)` that takes as input a tree structure (nested dictionaries) as returned by `median_tree`, and an $m\\times d$ NumPy array $X$. The function follows the `split_coord` and `split_value` information provided by the tree to return a majority-vote prediction for each data point (each row) in `X`. The return value is a Python list with $m$ elements corresponding to the predicted labels."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.3\n",
    "\n",
    "Using the `penguins` dataset available from seaborn, perform an experiment like in Example 5.15 using both sklearn's `DecisionTreeClassifier` and your own `median_tree` function. Using a 80-20 shuffled train-test split with a fixed random state, produce a table comparing the *accuracy* of both classifiers for maximal tree depths $1,2,\\ldots,5$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.4\n",
    "\n",
    "We have seen with $k$-nearest neighbour classification that the scale of the features matters, and that standardization can significantly improve the performance of such classifiers. Explain why feature standardization is in general not needed with decision trees."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.5\n",
    "\n",
    "One (dumb) algorithm for binary classification is to always predict the positive outcome. \n",
    "\n",
    "**(a)** Considered over all possible training sets, does this method have a high training variance or low training variance? Explain.\n",
    "\n",
    "**(b)** Considered over all possible testing sets, does this method have a high testing bias or low testing bias? Explain."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Exercise 6.6\n",
    "\n",
    "Suppose you are trying to fit this ground-truth one-dimensional classifier defined on positive integers:\n",
    "\n",
    "$$\n",
    "y = f(m) = \n",
    "    \\begin{cases}\n",
    "    0, & \\text{ if $m$ is odd}, \\\\\n",
    "    1, & \\text{ if $m$ is even}.\n",
    "    \\end{cases}\n",
    "$$\n",
    "\n",
    "Here are three individual classifiers:\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\hat{f}_a(m) &= 1 \\text{ for all $m$}, \\\\ \n",
    "\\hat{f}_b(m) &= \n",
    "    \\begin{cases}\n",
    "    0, & \\text{ if $m = 1,2,5,6,9,10,\\dots$}, \\\\\n",
    "    1, & \\text{ if $m = 3,4,7,8,11,12,\\dots$ },\n",
    "    \\end{cases} \\\\\n",
    "\\hat{f}_c(m) &=\n",
    "    \\begin{cases}\n",
    "    0, & \\text{ if $m = 1,4,5,8,9,12,\\dots$}, \\\\\n",
    "    1, & \\text{ if $m = 2,3,6,7,10,11,\\dots$}.\n",
    "    \\end{cases}\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "Let the testing set be $1,2,3,\\ldots,12$.\n",
    "\n",
    "**(a)** Compute the accuracy of each classifier on the testing set.\n",
    "\n",
    "**(b)** Compute the accuracy of the majority-vote ensemble of the three classifiers on the testing set. That is, the classifier with $\\hat{f}(m) = 1$ if at least two of the member classifiers are 1, and $\\hat{f}(m) = 0$ otherwise.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.7\n",
    "\n",
    "For data with very high-dimensional feature space, it can be beneficial to apply a technique called *dimension reduction* before applying any machine learning algorithm. Assuming our data matrix $X$ is of size $n\\times d$ ($n$ data points in $d$ dimensions), in linear dimension reduction we aim to find a $d\\times \\hat d$ matrix $R$ with $\\hat d < d$ so that $\\hat X := X R$ preserves some essential features of the data (while reducing its dimension).\n",
    "\n",
    "Research a technique called principal component analysis (PCA). Using only NumPy and no other modules, write a function `mypca(X, dhat)` that takes as input a data matrix $X$ and a positive integer `dhat`, and then returns the matrix $R$. \n",
    "\n",
    "Rerun the experiment from Example 5.25 with the *spambase* dataset, but reduce the dimension of the feature space from $57$ to $1,2,3,4,5$, respectively. Do this by computing the dimension-reducing matrix $R$ for the training set, train a decision tree on the dimension-reduced training set, and then apply the same transformation to the test data before making predictions. Try if z-normalizing the data improves the performance. If so, is this in contradiction to Exercise 6.4? Write down a summary of your observations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.8\n",
    "\n",
    "Pulsars are detected by scanning for radio frequencies from deep space. However, the vast majority of collected signals are actually terrestrial noise. The following dataset collects 8 statistical features from radio signals and their classifications as noise (class 0) or pulsar (class 1).\n",
    "\n",
    "1. Load  the dataset `pulsars.csv`. Let `y` be the `class` column of the dataset and let `X` be all the remaining columns. Find the proportion of the samples that are actually pulsars.\n",
    "\n",
    "2. Using a random state `19716`, split the data 80% / 20% into testing and training sets.\n",
    "\n",
    "3. Fit a pipeline with standardization scaling and a kNN classifier with $k=8$ to the training data. Since we want to minimize false positives, find its precision score on the test set.\n",
    "\n",
    "4. Perform a grid search on the pipeline defined in step 3, including a search over $k$ from 3 through 20 and over `'uniform'` and `'distance'` for `weights`, using 6 folds in cross-validation and precision as the scoring. Find the best parameters and find the precision score on the test set.\n",
    "\n",
    "5. Using the best model from step 4, make a bagging classifier using 200 estimators, 50% max features and max samples, and random state `302`. Find its precision score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "signals = pd.read_csv(\"_datasets/pulsars.csv\")\n",
    "\n",
    "# TODO: Provide your solution code here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Testing\n",
    "assert X.shape == (17898, 8)\n",
    "assert y.shape == (17898,)\n",
    "assert np.isclose(pulsar_fraction, 0.091574478)\n",
    "\n",
    "assert X_train.shape == (14318, 8)\n",
    "assert y_test.sum() == 326\n",
    "\n",
    "assert np.isclose(knn_score, 0.92982, rtol=1e-5)\n",
    "\n",
    "assert type(best_params) == dict, \"Get the best parameters from the fitted model\"\n",
    "assert grid_score > knn_score, \"Score should have improved\"\n",
    "assert np.isclose(grid_score, 0.93684, rtol=1e-5)\n",
    "\n",
    "assert ensemble_score > grid_score, \"Score should have improved\"\n",
    "assert np.isclose(ensemble_score, 0.952206, rtol=1e-5)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
