In [1]:
# Carefully modify the below two string variables. Ensure there are no typos.

student_id = "12345678" # set this to your student ID

student_mail = "firstname.lastname@student.manchester.ac.uk" # your email address

Coursework 1¶

This coursework test contains several Jupyter Notebook cells with the comment # TODO. This is where you type the code for your solutions. Do not alter any of the other cells.

It is good practice to include markdown cells explaining your work, but in this test they won't be marked.

Here are some tips:

  • Do not alter the names of the predefined variables and functions, such as sales, avgPrice, etc. The values of these variables will inform the marking. Renaming them and failure to follow the problem description will result in loss of marks.

  • Do not hard-code any solution variables. All problems must be solved by computer code using the data in the provided CSV file. For example, do not simply define a variable sales = 1234 with a fixed value. Your Jupyter Notebook should produce results with a modified data file that has the same format but different numerical (or NaN) values.

  • Submit this test as a single .ipynb file using Canvas. You can simply keep the name test1-2026.ipynb. There is a basic testing code at the end that verifies some parts of the coursework.

    Strict deadline: Monday, 9th of March 2026, at 1pm

Note on independent work¶

You need to complete all coursework tests independently on your own, but you are allowed to use online resources and all course notes and exercise solutions. The course notes from chapters 1 to 3 contain all that is required to solve the below problems. You are not allowed to ask other humans for help. In particular, you are not allowed to send, give, or receive code or markdown content to/from classmates and others.

The University Guidelines for Academic Malpractice apply: http://documents.manchester.ac.uk/display.aspx?DocID=2870

Important: Even if you are the originator of the work (and not the one who copied), the University Guidelines require that you will be equally responsible for this case of academic malpractice and may lose all coursework marks (or even be assigned 0 marks for the course).

Start of test¶

We will analyse data relating to UK houseprices. The dataset has been taken from the Government's house price index website. This website also contains a data table describing the column names of the dataset.

We first load all the required modules and the dataset UK_houseprices2026.csv into a data frame df.

Important:

  • The dataset UK_houseprices2026.csv has recently been added to the _datasets.zip file. You may have to download it again.

  • Ensure that df does not change throughout the whole notebook. Most exercises use df as a starting point to construct other data frames or series.

In [2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

import numpy as np
import pandas as pd
import seaborn as sns

df = pd.read_csv("_datasets/UK_houseprices2026.csv")

Problem 1¶

Make a data frame called sales from df that contains only the first 8 columns.

Create a series avgPrice that contains the mean prices, but without any NaN elements.

In [3]:
sales = None
avgPrice = None

# TODO: Provide your solution code here that defines `sales` and `avgPrice`
In [4]:
# Solution
sales = df[["Date", "Region", "Mean Price", "Sales Volume", "New Price", "New Sales Volume", "Old Price", "Old Sales Volume"]]
avgPrice = sales["Mean Price"].dropna()

Problem 2¶

In the data frame df, the column New Sales Volume corresponds to the number of newly build properties sold in a given month and region, and there is a corresponding column referring to sold existing (old) properties. We would expect these to add up to the number of total sales in the column Sales Volume.

Compute an integer variable mismatchCount that corresponds to the number of rows in df for which Sales Volume deviates from the sum of new and old sales by more than 0.5.

Compute an array mismatchRegions that lists the unique regions for which a mismatch of counts occurred.

In [5]:
mismatchCount = None
mismatchRegions = None

# TODO: Provide your solution code here that defines `mismatchCount` and `mismatchRegions`
In [6]:
# Solution
mismatch = (df["Sales Volume"] - df["New Sales Volume"] - df["Old Sales Volume"]).abs() >= 0.5
mismatchCount = mismatch.sum()
mismatchRegions = df[mismatch]['Region'].unique() 

print(mismatchCount, mismatchRegions)
75 <StringArray>
[          'Inner London', 'North Northamptonshire',       'Northern Ireland',
         'United Kingdom',  'West Northamptonshire']
Length: 5, dtype: str

Problem 3¶

Define a data frame sales_clean from df by removing all rows whose region name corresponds to either United Kingdom, England, Scotland, Wales, Northern Ireland. Also remove all rows that contain any NaN's.

In [7]:
sales_clean = None

# TODO: Provide your solution code here
In [8]:
# Solution
sales_clean = df.dropna()
ind = sales_clean['Region'].isin({'United Kingdom', 'England', 'Scotland', 'Wales', 'Northern Ireland'})
sales_clean = sales_clean[~ind]

sales_clean
Out[8]:
Date Region Mean Price Sales Volume New Price New Sales Volume Old Price Old Sales Volume Detached Price Semi-Detached Price Terraced Price Flat Price Cash Price Cash Sales Volume Mortgage Price Mortgage Sales Volume FTB Price FOO Price
0 2014-01-01 Aberdeenshire 192239 411.0 260142.0 62.0 184548.0 349.0 276808.0 168052.0 141761.0 109450.0 179333.0 111.0 199310.0 300.0 150307.0 226869.0
1 2014-02-01 Aberdeenshire 191558 347.0 259718.0 84.0 183857.0 263.0 275641.0 167303.0 141339.0 109561.0 178732.0 79.0 198590.0 268.0 150054.0 225868.0
2 2014-03-01 Aberdeenshire 187155 383.0 253922.0 86.0 179592.0 297.0 270184.0 163266.0 137690.0 106211.0 174452.0 111.0 194095.0 272.0 146308.0 220887.0
3 2014-04-01 Aberdeenshire 189000 513.0 256803.0 100.0 181342.0 413.0 271913.0 165402.0 139679.0 107314.0 176207.0 121.0 195993.0 392.0 148189.0 222767.0
4 2014-05-01 Aberdeenshire 192885 585.0 263044.0 103.0 185020.0 482.0 277360.0 169069.0 142767.0 109078.0 179742.0 140.0 200057.0 445.0 150990.0 227513.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58313 2025-06-01 Yorkshire and The Humber 201490 5015.0 292181.0 261.0 198291.0 4754.0 337742.0 207454.0 162373.0 125473.0 198786.0 1130.0 202679.0 2987.0 173934.0 240595.0
58314 2025-07-01 Yorkshire and The Humber 204361 5473.0 305374.0 34.0 200625.0 5439.0 342020.0 210223.0 164958.0 127510.0 201066.0 1193.0 205812.0 3183.0 176508.0 243854.0
58315 2025-08-01 Yorkshire and The Humber 206107 5471.0 302627.0 19.0 202629.0 5452.0 343290.0 212745.0 166495.0 128073.0 202747.0 1116.0 207587.0 2913.0 178185.0 245644.0
58316 2025-09-01 Yorkshire and The Humber 206581 4745.0 306535.0 18.0 202921.0 4727.0 346628.0 212622.0 166663.0 127992.0 203692.0 748.0 207852.0 1663.0 178353.0 246632.0
58317 2025-10-01 Yorkshire and The Humber 206401 4961.0 313072.0 8.0 202381.0 4953.0 344502.0 213475.0 166443.0 127278.0 203114.0 440.0 207848.0 1061.0 178864.0 245256.0

52678 rows × 18 columns

Problem 4¶

Create a series called soldPerRegion from df listing, for each region, the total number of sales (sum of Sales Volume). Note that the region should be the index of that series.

In [9]:
soldPerRegion = None

# TODO: Provide your solution code here
In [10]:
# Solution
soldPerRegion = df[['Region','Sales Volume']].groupby('Region').sum()['Sales Volume']
soldPerRegion
Out[10]:
Region
Aberdeenshire                52333.0
Adur                         11978.0
Amber Valley                 26016.0
Angus                        24155.0
Antrim and Newtownabbey      25863.0
                              ...   
Wychavon                     25462.0
Wyre                         24007.0
Wyre Forest                  17783.0
York                         37919.0
Yorkshire and The Humber    954733.0
Name: Sales Volume, Length: 405, dtype: float64

Problem 5¶

We now consider only rows of df for the region Greater Manchester. Produce a series called annualAvg that, for each year, contains the average of mean prices. Note that the years should form the index of that series and they should be of integer data type.

In [11]:
annualAvg = None

# TODO: Provide your solution code here
In [12]:
# Solution
df_Date = df.copy()

df_Date = df_Date.loc[df_Date['Region']=='Greater Manchester'][['Date','Mean Price']]
df_Date['Date'] = pd.to_datetime(df_Date['Date'])
df_Date = df_Date.set_index(df_Date['Date'])
annualAvg = df_Date.groupby(df_Date['Date'].dt.year).mean()['Mean Price']

annualAvg
Out[12]:
Date
2014    122045.666667
2015    128763.333333
2016    137953.250000
2017    147450.500000
2018    154871.083333
2019    160932.583333
2020    167812.250000
2021    191004.083333
2022    212646.416667
2023    216807.083333
2024    222817.083333
2025    232475.916667
Name: Mean Price, dtype: float64

Problem 6¶

Consider only data which falls into the year 2024 in df. For that year, compute a series detached2024 that, for each region, contains the median Detached Price, i.e., the median price of a detached house in 2024. This series may contain some NaNs.

Now find all expensive regions in detached2024 with a Detached Price above the $p=0.95$ quantile (i.e., 95th percentile), and list them in a series detached2024_expensive.

Note that the region name should form the index of both series.

In [13]:
detached2024 = None
detached2024_expensive = None

# TODO: Provide your solution code here
In [14]:
# Solution
ind = pd.to_datetime(df['Date']).dt.year==2024
detached2024 = df[ind]
detached2024 = detached2024.groupby('Region')['Detached Price'].median()
ind = detached2024 > detached2024.quantile(0.95)
detached2024_expensive = detached2024[ind]

detached2024_expensive
Out[14]:
Region
Barnet                    1627087.5
Brent                     1325065.5
Camden                    3462403.0
City of Westminster       4368453.5
Ealing                    1350946.5
Elmbridge                 1504700.0
Enfield                   1204151.0
Hackney                   1324363.0
Hammersmith and Fulham    1829783.0
Haringey                  2220648.0
Harrow                    1180033.0
Inner London              1857506.5
Islington                 1670521.0
Kensington and Chelsea    4760293.0
Kingston upon Thames      1235464.5
Merton                    2198556.5
Richmond upon Thames      1703265.5
Southwark                 1710689.5
St Albans                 1181955.0
Wandsworth                2496095.5
Name: Detached Price, dtype: float64

Problem 7¶

Following on from problem 6, write code that determines the largest value of $p\in [0,1]$ so that Greater Manchester is classed as an expensive region (using the 2024 median prices of detached houses, as before). Assign that value to the floating point variable p_gm.

In [15]:
p_gm = None

# TODO: Provide your solution code here
In [16]:
# Solution
gm_value = detached2024['Greater Manchester']
p_gm = (detached2024.dropna() < gm_value).mean() 

p_gm
Out[16]:
np.float64(0.4122137404580153)

Problem 8¶

Create a new data frame df_hpi from df that has an additional column with name HPI. This should list the House Price Index (HPI) for each month and region. The HPI is a rescaled version of the mean price in a way that HPI takes the value 100 on the 1st of January 2015 in each region.

In [17]:
df_hpi = None

# TODO: Provide your solution code here
In [18]:
# Solution
df_hpi = df.copy()

# Filter the dataframe to get the prices on 1st January 2015
base_prices = df_hpi[df_hpi['Date'] == '2015-01-01'][['Region', 'Mean Price']].set_index('Region')

# Calculate the HPI for each row
df_hpi = df_hpi.set_index('Region')
df_hpi['HPI'] = df_hpi.apply(lambda row: (row['Mean Price'] / base_prices.loc[row.name,'Mean Price'])*100, axis=1).values

# Reset the index to bring RegionName back as a column
df_hpi = df_hpi.reset_index()

Problem 9¶

Find out which month is the best to buy a flat. Do so by looking at flat prices and finding, for each year and region, which month lists the cheapest flat price. If two months appear with the same price, take the earlier one for that year and region. Find the most frequent cheapest month overall and assign it to the variable cheapest_month as an integer in $1,2,\ldots,12$.

In [19]:
cheapest_month = None

# TODO: Provide your solution code here
In [20]:
# Solution

df2 = df[['Date','Region','Flat Price']]
df2 = df2[~df2['Flat Price'].isna()]

# Extract year and month from the Date column
df2['Year'] = pd.to_datetime(df2['Date']).dt.year
df2['Month'] = pd.to_datetime(df2['Date']).dt.month

# Find the month with the cheapest FlatPrice for each year and region
cheapest_months = df2.loc[df2.groupby(['Year', 'Region'])['Flat Price'].idxmin().dropna()]

# Find the most frequent month overall
cheapest_month = cheapest_months['Month'].mode()[0]

cheapest_month
Out[20]:
np.int32(1)

End of test¶

You can use the below tests to get an indication if part of your work returns the right data types.

In [21]:
try: 
    import re
    assert re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', student_mail) and not 'firstname' in student_mail
    print("OKAY - student_mail appears to be valid")
except:
    print("WARN - student_mail could not be verified")

try: 
    df_reload = pd.read_csv("_datasets/UK_houseprices2026.csv")
    assert df_reload.compare(df).empty
    print("OKAY - dataframe df has not been modified")
except:
    print("WARN - dataframe df has been modified")

try: 
    assert type(sales) == pd.DataFrame
    print("OKAY - sales should be a pandas dataframe")
except:
    print("FAIL - sales should be a pandas dataframe")

try: 
    assert type(avgPrice) == pd.Series
    print("OKAY - avgPrice should be a pandas series")
except:
    print("FAIL - avgPrice should be a pandas series")

try: 
    assert isinstance(mismatchCount, (int, np.integer))
    print("OKAY - mismatchCount should be an int")
except:
    print("FAIL - mismatchCount should be an int")

try: 
    assert isinstance(mismatchRegions, (np.ndarray, list, pd.arrays.StringArray))
    print("OKAY - mismatchRegions should be an array")
except:
    print("FAIL - mismatchRegions should be an array")

try:
    student_mail; student_id; df; sales; avgPrice; mismatchCount; mismatchRegions; sales_clean; 
    soldPerRegion; annualAvg; detached2024; detached2024_expensive; p_gm; df_hpi; cheapest_month
    print("OKAY - required variables exist in memory")
except:
    print("WARN - at least one of the required variables is not defined. check spelling?")
WARN - student_mail could not be verified
OKAY - dataframe df has not been modified
OKAY - sales should be a pandas dataframe
OKAY - avgPrice should be a pandas series
OKAY - mismatchCount should be an int
OKAY - mismatchRegions should be an array
OKAY - required variables exist in memory