{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "64b51d52",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Carefully modify the below two string variables. Ensure there are no typos.\n",
    "\n",
    "student_id = \"12345678\" # set this to your student ID\n",
    "\n",
    "student_mail = \"firstname.lastname@student.manchester.ac.uk\" # your email address"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eef390d5",
   "metadata": {},
   "source": [
    "# Coursework 1\n",
    "\n",
    "This coursework test contains several Jupyter Notebook cells with the comment `# TODO`. This is where you type the code for your solutions. Do not alter any of the other cells. \n",
    "\n",
    "It is good practice to include markdown cells explaining your work, but in this test they won't be marked. \n",
    "\n",
    "Here are some tips:\n",
    "\n",
    "* **Do not alter the names of the predefined variables and functions,** such as `sales`, `avgPrice`, etc. The values of these variables will inform the marking. Renaming them and failure to follow the problem description will result in loss of marks.\n",
    "\n",
    "* **Do not hard-code any solution variables.** All problems must be solved by computer code using the data in the provided CSV file. For example, do *not* simply define a variable `sales = 1234` with a fixed value. Your Jupyter Notebook should produce results with a modified data file that has the same format but different numerical (or NaN) values.\n",
    "\n",
    "* **Submit this test as a single .ipynb file using Canvas.** You can simply keep the name `test1-2026.ipynb`. There is a basic testing code at the end that verifies some parts of the coursework.\n",
    "\n",
    "   <span style=\"color:blue; font-weight:bold\">Strict deadline: Monday, 9th of March 2026, at 1pm</span>\n",
    "\n",
    "## Note on independent work\n",
    "\n",
    "You need to complete all coursework tests independently on your own, but you are allowed to use online resources and all course notes and exercise solutions. The course notes from chapters 1 to 3 contain all that is required to solve the below problems. You are not allowed to ask other humans for help. In particular, you are not allowed to send, give, or receive code or markdown content to/from classmates and others.\n",
    "\n",
    "The University Guidelines for Academic Malpractice apply: http://documents.manchester.ac.uk/display.aspx?DocID=2870\n",
    "\n",
    "**Important: Even if you are the originator of the work** (and not the one who copied), the University Guidelines require that you will be equally responsible for this case of academic malpractice and may lose all coursework marks (or even be assigned 0 marks for the course)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2295b152",
   "metadata": {},
   "source": [
    "# Start of test\n",
    "\n",
    "We will analyse data relating to UK houseprices. The dataset has been taken from the [Government's house price index website](https://www.gov.uk/government/publications/about-the-uk-house-price-index/about-the-uk-house-price-index). This website also contains a [*data table*](https://www.gov.uk/government/publications/about-the-uk-house-price-index/about-the-uk-house-price-index#data-tables) describing the column names of the dataset.\n",
    "\n",
    "We first load all the required modules and the dataset `UK_houseprices2026.csv` into a data frame `df`.\n",
    "\n",
    "**Important:** \n",
    "\n",
    "* The dataset `UK_houseprices2026.csv` has recently been added to the [_datasets.zip](../_datasets.zip) file. You may have to download it again.\n",
    "\n",
    "* Ensure that `df` does not change throughout the whole notebook. Most exercises use `df` as a starting point to construct other data frames or series.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "fdc2a253-d922-4124-931c-d4a90308d222",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-34389c5497ccc8a9",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.simplefilter(action='ignore', category=FutureWarning)\n",
    "warnings.simplefilter(action='ignore', category=DeprecationWarning)\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "\n",
    "df = pd.read_csv(\"_datasets/UK_houseprices2026.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af4a10eb-93d7-44fa-bbf3-9c3de7025d2e",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-404456cc35ae4faa",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## Problem 1\n",
    "\n",
    "Make a data frame called `sales` from `df` that contains only the first 8 columns. \n",
    "\n",
    "Create a series `avgPrice` that contains the mean prices, but without any NaN elements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6b3dc6c4-7547-4614-b0a7-7bab6a2d05e6",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "ducks-mean",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "sales = None\n",
    "avgPrice = None\n",
    "\n",
    "# TODO: Provide your solution code here that defines `sales` and `avgPrice`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63d064c0",
   "metadata": {},
   "source": [
    "## Problem 2\n",
    "\n",
    "In the data frame `df`, the column *New Sales Volume* corresponds to the number of newly build properties sold in a given month and region, and there is a corresponding column referring to sold existing (old) properties. We would expect these to add up to the number of total sales in the column *Sales Volume*.\n",
    "\n",
    "Compute an integer variable `mismatchCount` that corresponds to the number of rows in `df` for which *Sales Volume* deviates from the sum of new and old sales by more than 0.5.\n",
    "\n",
    "Compute an array `mismatchRegions` that lists the unique regions for which a mismatch of counts occurred."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "4d999032",
   "metadata": {},
   "outputs": [],
   "source": [
    "mismatchCount = None\n",
    "mismatchRegions = None\n",
    "\n",
    "# TODO: Provide your solution code here that defines `mismatchCount` and `mismatchRegions`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab13b0c0",
   "metadata": {},
   "source": [
    "## Problem 3\n",
    "\n",
    "Define a data frame `sales_clean` from `df` by removing all rows whose region name corresponds to either *United Kingdom*, *England*, *Scotland*, *Wales*, *Northern Ireland*. Also remove all rows that contain any NaN's."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "548e510b",
   "metadata": {},
   "outputs": [],
   "source": [
    "sales_clean = None\n",
    "\n",
    "# TODO: Provide your solution code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de436669",
   "metadata": {},
   "source": [
    "## Problem 4\n",
    "\n",
    "Create a series called `soldPerRegion` from `df` listing, for each region, the total number of sales (sum of *Sales Volume*). Note that the region should be the index of that series."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "93697d89",
   "metadata": {},
   "outputs": [],
   "source": [
    "soldPerRegion = None\n",
    "\n",
    "# TODO: Provide your solution code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85909270",
   "metadata": {},
   "source": [
    "## Problem 5\n",
    "\n",
    "We now consider only rows of `df` for the region *Greater Manchester*. Produce a series called `annualAvg` that, for each year, contains the average of mean prices. Note that the years should form the index of that series and they should be of integer data type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "8448ce81",
   "metadata": {},
   "outputs": [],
   "source": [
    "annualAvg = None\n",
    "\n",
    "# TODO: Provide your solution code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e8f6e11",
   "metadata": {},
   "source": [
    "## Problem 6\n",
    "\n",
    "Consider only data which falls into the year 2024 in `df`. For that year, compute a series `detached2024` that, for each region, contains the median *Detached Price*, i.e., the median price of a detached house in 2024. This series may contain some NaNs. \n",
    "\n",
    "Now find all expensive regions in `detached2024` with a *Detached Price* above the $p=0.95$ quantile (i.e., 95th percentile), and list them in a series `detached2024_expensive`.\n",
    "\n",
    "Note that the region name should form the index of both series."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "111f14be",
   "metadata": {},
   "outputs": [],
   "source": [
    "detached2024 = None\n",
    "detached2024_expensive = None\n",
    "\n",
    "# TODO: Provide your solution code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ee52202",
   "metadata": {},
   "source": [
    "## Problem 7\n",
    "\n",
    "Following on from problem 6, write code that determines the largest value of $p\\in [0,1]$ so that *Greater Manchester* is classed as an expensive region (using the 2024 median prices of detached houses, as before). Assign that value to the floating point variable `p_gm`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "70868ff5",
   "metadata": {},
   "outputs": [],
   "source": [
    "p_gm = None\n",
    "\n",
    "# TODO: Provide your solution code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "167f3432",
   "metadata": {},
   "source": [
    "## Problem 8\n",
    "\n",
    "Create a new data frame `df_hpi` from `df` that has an additional column with name *HPI*. This should list the House Price Index (HPI) for each month and region. The HPI is a rescaled version of the mean price in a way that *HPI* takes the value 100 on the 1st of January 2015 in each region."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "8baae2d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_hpi = None\n",
    "\n",
    "# TODO: Provide your solution code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab18db5b",
   "metadata": {},
   "source": [
    "## Problem 9\n",
    "\n",
    "Find out which month is the best to buy a flat. Do so by looking at flat prices and finding, for each year and region, which month lists the cheapest flat price. If two months appear with the same price, take the earlier one for that year and region. Find the most frequent cheapest month overall and assign it to the variable `cheapest_month` as an integer in $1,2,\\ldots,12$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "f5336478",
   "metadata": {},
   "outputs": [],
   "source": [
    "cheapest_month = None\n",
    "\n",
    "# TODO: Provide your solution code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "415a227a",
   "metadata": {},
   "source": [
    "# End of test"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe30d8cb",
   "metadata": {},
   "source": [
    "You can use the below tests to get an indication if part of your work returns the right data types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "281d2d7e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARN - student_mail could not be verified\n",
      "OKAY - dataframe df has not been modified\n",
      "OKAY - sales should be a pandas dataframe\n",
      "OKAY - avgPrice should be a pandas series\n",
      "OKAY - mismatchCount should be an int\n",
      "OKAY - mismatchRegions should be an array\n",
      "OKAY - required variables exist in memory\n"
     ]
    }
   ],
   "source": [
    "try: \n",
    "    import re\n",
    "    assert re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$', student_mail) and not 'firstname' in student_mail\n",
    "    print(\"OKAY - student_mail appears to be valid\")\n",
    "except:\n",
    "    print(\"WARN - student_mail could not be verified\")\n",
    "\n",
    "try: \n",
    "    df_reload = pd.read_csv(\"_datasets/UK_houseprices2026.csv\")\n",
    "    assert df_reload.compare(df).empty\n",
    "    print(\"OKAY - dataframe df has not been modified\")\n",
    "except:\n",
    "    print(\"WARN - dataframe df has been modified\")\n",
    "\n",
    "try: \n",
    "    assert type(sales) == pd.DataFrame\n",
    "    print(\"OKAY - sales should be a pandas dataframe\")\n",
    "except:\n",
    "    print(\"FAIL - sales should be a pandas dataframe\")\n",
    "\n",
    "try: \n",
    "    assert type(avgPrice) == pd.Series\n",
    "    print(\"OKAY - avgPrice should be a pandas series\")\n",
    "except:\n",
    "    print(\"FAIL - avgPrice should be a pandas series\")\n",
    "\n",
    "try: \n",
    "    assert isinstance(mismatchCount, (int, np.integer))\n",
    "    print(\"OKAY - mismatchCount should be an int\")\n",
    "except:\n",
    "    print(\"FAIL - mismatchCount should be an int\")\n",
    "\n",
    "try: \n",
    "    assert isinstance(mismatchRegions, (np.ndarray, list, pd.arrays.StringArray))\n",
    "    print(\"OKAY - mismatchRegions should be an array\")\n",
    "except:\n",
    "    print(\"FAIL - mismatchRegions should be an array\")\n",
    "\n",
    "try:\n",
    "    student_mail; student_id; df; sales; avgPrice; mismatchCount; mismatchRegions; sales_clean; \n",
    "    soldPerRegion; annualAvg; detached2024; detached2024_expensive; p_gm; df_hpi; cheapest_month\n",
    "    print(\"OKAY - required variables exist in memory\")\n",
    "except:\n",
    "    print(\"WARN - at least one of the required variables is not defined. check spelling?\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.2"
  },
  "toc-autonumbering": false,
  "toc-showcode": false,
  "toc-showmarkdowntxt": true
 },
 "nbformat": 4,
 "nbformat_minor": 5
}