Exercise 2.1
For each type of data, classify it as discrete quantitative, continuous quantitative, categorical, or other.
- How many students are enrolled at a university
- Your favourite day of the week
- How many millimeters of rain fall at an airport during one day
- Weight of a motor vehicle
- Manufacturer of a motor vehicle
- Text of all Google Maps reviews for a restaurant
- Star ratings from all Google Maps reviews for a restaurant
- Size of the living area of an apartment
- DNA nucleotide sequence of a cell
Exercise 2.2
Give the length of each vector or series.
- Morning waking times every day for a week
- Number of siblings (max 12) for each student in a class of 30
- Position and momentum of a roller coaster car
Exercise 2.3
- Create a NumPy vector called primes that consists of the first 8 prime numbers. (You can simply type in the values; nothing fancy is needed to compute them.)
- Create a NumPy vector called squares that consists of the values [1, 4, 9, 16, ..., 1521, 1600].
Note: You can use the code in the "Testing" cell to verify your solution. If primes and squares have the correct values, the testing code should run without errors.
# Testing
assert type(primes) == np.ndarray, "primes should be a numpy array"
assert primes.shape == (8,), "primes should be a 1D array with 8 elements"
assert np.sum(primes) == 77, "Sum of primes should be 77"
assert list(primes) == [2, 3, 5, 7, 11, 13, 17, 19]
assert type(squares) == np.ndarray, "squares should be a numpy array"
assert squares.shape == (40,), "squares should be a 1D array with 40 elements"
assert np.sum(squares) == 22140, "Sum of squares should be 22140"
Exercise 2.4
- Create a matrix C with 27 rows and 44 columns having all entries equal to 5.2.
- Create a matrix D as the 4-by-6 upper-left submatrix of C, making sure that D is a "proper" copy (in the sense that changing an element in one of the matrices does not alter the other matrix).
# Testing
assert type(C) == np.ndarray, "C should be a numpy array"
assert C.shape == (27, 44), "C should be a 2D array with 27 rows and 44 columns"
assert np.all(np.isclose(C, 5.2)), "C should have all entries equal to 5.2"
assert type(D) == np.ndarray, "D should be a numpy array"
assert D.shape == (4, 6), "D should be a 2D array with 4 rows and 6 columns"
assert np.all(np.isclose(D, 5.2)), "D should have all entries equal to 5.2"
assert D.flags['OWNDATA'], "D should be a deep copy, not a view, of C"
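The difference between a view and a "proper" copy can be seen on a small toy array. The following is only a sketch on a made-up 3-by-3 matrix; the names A, sub_view, and sub_copy are illustrative and not part of the exercise.

```python
import numpy as np

A = np.zeros((3, 3))

sub_view = A[:2, :2]          # basic slicing returns a view onto A's memory
sub_copy = A[:2, :2].copy()   # .copy() allocates independent storage

A[0, 0] = 7.0                 # modify the original matrix

print(sub_view[0, 0])         # the view reflects the change
print(sub_copy[0, 0])         # the copy is unaffected
print(sub_view.flags["OWNDATA"], sub_copy.flags["OWNDATA"])
```

The OWNDATA flag used in the testing cell reports whether an array owns its memory, which is why it distinguishes the two cases here.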
Exercise 2.5
Create the matrix R = np.reshape(range(0, 30), (5, 6)). Use a single NumPy slice command to extract the submatrix
$$
S =
\begin{bmatrix}
2 & 3 & 4 \\ 14 & 15 & 16 \\ 26 & 27 & 28
\end{bmatrix}.
$$
# Testing
assert np.all(S == np.array([[2, 3, 4], [14, 15, 16], [26, 27, 28]])), "S is incorrect"
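As a reminder of how slices with a step work, here is a minimal sketch on a different, made-up matrix M (not the matrix R from the exercise). Each axis takes its own start:stop:step slice, and any of the three parts may be omitted.

```python
import numpy as np

# A made-up 4-by-5 matrix holding the integers 0, 1, ..., 19
M = np.arange(20).reshape(4, 5)

# Rows 1 and 3 (start at 1, step 2) and columns 0, 1, 2 (stop is exclusive)
sub = M[1::2, 0:3]
print(sub)
```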
Exercise 2.6
Write a function row_col_mean that accepts a matrix as input and returns two vectors: the mean value in every row and the mean value in every column. So, for example, the input
$$
\begin{bmatrix}
2 & -1 & 2 \\ -3 & 1 & 2
\end{bmatrix}
$$
would produce the outputs $[1, \: 0]$ and $[-0.5,\: 0,\: 2]$.
Important: When asked to write a function that returns values, make sure you do not just print them but use the return keyword instead. In this exercise, your function should return a tuple with two elements.
# Testing
B = np.array([[1, 3, 4],[5, 0, 2]])
assert np.all( np.isclose(row_col_mean(B)[0], np.array([2.6666667, 2.3333333])) ), "Row means are incorrect"
assert np.all( np.isclose(row_col_mean(B)[1], np.array([3, 1.5, 3])) ), "Column means are incorrect"
testA = np.array([[1,3,-4,-9], [5,0,2,2], [3,5,1,0]])
t0, t1 = row_col_mean(testA)
assert np.isclose(t0, np.array([-2.25, 2.25, 2.25])).all()
assert np.isclose(t1, np.array([ 3., 2.66666667, -0.33333333, -2.33333333])).all()
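To illustrate the point about returning values rather than printing them, here is a small sketch of a function that returns a tuple with two elements. The function min_max is a hypothetical example and deliberately computes something other than what the exercise asks for.

```python
import numpy as np

def min_max(v):
    """Return the smallest and the largest entry of a vector as a tuple."""
    return np.min(v), np.max(v)

# The returned tuple can be unpacked into two variables
lo, hi = min_max(np.array([3, -1, 4]))
print(lo, hi)
```

Because the values are returned, the caller can store them and use them in later calculations, which would not be possible if the function only printed them.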
Exercise 2.7
Write Python code that generates a list lst = [-1, 1, -1, 1, ..., -1, 1] with $10^7$ integer elements. Write code that measures the execution time of the command sum(lst), which computes the sum of that list. (There are many ways to measure runtime. You could use the time() function in the time module, or check out the timeit module.)
Now generate a NumPy array from lst with the same elements. Then measure and compare the runtime of performing the same summation in NumPy. Can you think of a reason for the difference in performance?
Repeat the same experiment but now with a list of floats [-1.0, 1.0, -1.0, 1.0, ..., -1.0, 1.0].
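Two common ways of measuring runtime can be sketched as follows. The function work is a hypothetical stand-in for whatever command is being timed, and time.perf_counter() is used here in place of time.time() because it is intended for measuring short durations; both are assumptions of this sketch rather than requirements of the exercise.

```python
import time
import timeit

def work():
    """Hypothetical stand-in for the command being timed."""
    return sorted(range(10_000, 0, -1))

# Option 1: take a timestamp before and after a single run
start = time.perf_counter()
work()
elapsed = time.perf_counter() - start

# Option 2: timeit calls the function `number` times and returns the total time in seconds
repeats = 20
per_call = timeit.timeit(work, number=repeats) / repeats

print(f"single run: {elapsed:.6f} s, average over {repeats} runs: {per_call:.6f} s")
```

Averaging over several runs, as timeit does, tends to give more stable numbers than a single measurement.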
Exercise 2.8
Describe a scheme for creating dummy variables for the days of the week. Use your scheme to encode the vector:
[Tuesday, Sunday, Friday, Tuesday, Monday]
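For comparison, one possible dummy-variable scheme (one 0/1 column per category) can be produced in pandas as in the sketch below. The traffic-light data is made up for illustration, and other schemes, such as using one column fewer than the number of categories, are equally valid.

```python
import pandas as pd

# Made-up categorical data with three possible values
colours = pd.Series(["red", "green", "amber", "red"])

# One 0/1 column per category; columns are ordered alphabetically
dummies = pd.get_dummies(colours, dtype=int)
print(dummies)
```

Each row contains exactly one 1, in the column that matches the original category.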
Exercise 2.9
Use the following code to load weather data measured at Manchester airport and use calculations on the data frame to assign the correct numerical values to the given variables. You will need to import the appropriate module(s) first.
weather = pd.read_csv("_datasets/mcr_airport_weather.csv")
- Display the first 7 rows of the data frame.
- Verify that the columns snow and tsun contain only the value NaN. Have a look at the CSV file to explain why.
- Use pandas methods to assign values to the following variables:
prcp_june = None   # total precipitation in June (float)
range_sept = None  # difference between maximal and minimal September temperature (float)
hottest = None     # hottest day(s) in terms of maximal temperature (data frame)
# Testing
assert np.isclose(prcp_june, 48.2), "prcp_june incorrect"
assert np.isclose(range_sept, 23.0), "range_sept incorrect"
assert type(hottest) == pd.DataFrame, "hottest should be a dataframe"
assert hottest.shape[0] == 2, "hottest should have two rows"
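The kind of calculation needed here can be sketched on a small made-up data frame. The column names day, temp, and rain are assumptions of this sketch and will not match the columns of the weather file; the toy frame only shows how a Boolean mask selects rows by month and how rows attaining a maximum can be extracted.

```python
import pandas as pd

# Made-up data; column names are hypothetical
toy = pd.DataFrame({
    "day": pd.to_datetime(["2021-01-30", "2021-02-01", "2021-02-14", "2021-03-02"]),
    "temp": [4.0, 9.5, 9.5, 7.0],
    "rain": [1.2, 0.0, 3.4, 0.6],
})

# Boolean mask selecting the February rows
in_feb = toy["day"].dt.month == 2

feb_rain = toy.loc[in_feb, "rain"].sum()
feb_range = toy.loc[in_feb, "temp"].max() - toy.loc[in_feb, "temp"].min()

# All rows that attain the overall maximum temperature (there may be ties)
warmest = toy[toy["temp"] == toy["temp"].max()]
print(feb_rain, feb_range)
print(warmest)
```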
Exercise 2.10
- Create a data frame called ratings by loading the file corporate_rating.csv.
- Display the first 5 rows.
- The ratings are ordered AAA, AA, A, BBB, BB, B, CCC, CC, C, D. Create a new column called Rating_number in which each string value in the Rating column is replaced with its ordinal equivalent 1, 2, 3, ..., 10.
- How many unique Name values (company names) are there?
# Testing
assert len(ratings.columns) == 32, "There should be 32 columns"
assert pd.api.types.is_integer_dtype(ratings["Rating_number"]), "Rating_number should be an integer"
assert ratings["Rating_number"].min() == 1, "The minimum rating number should be 1"
assert ratings["Rating_number"].max() == 10, "The maximum rating number should be 10"
assert ratings["Rating_number"].sum() == 8841, "The sum of rating numbers should be 8841"
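Replacing ordered labels by their positions can be sketched with a dictionary and the map method. The data frame and the labels S, M, L below are made up for illustration; only the technique carries over to the exercise.

```python
import pandas as pd

# Made-up data with an ordered categorical column
shirts = pd.DataFrame({"Size": ["M", "S", "L", "M"]})

# Build a label -> position dictionary from the known ordering
order = ["S", "M", "L"]
to_number = {label: k for k, label in enumerate(order, start=1)}

# map looks up each label in the dictionary
shirts["Size_number"] = shirts["Size"].map(to_number)

# Number of distinct labels in a column
n_sizes = shirts["Size"].nunique()
print(shirts)
print(n_sizes)
```

Any label that is missing from the dictionary would be mapped to NaN, which also turns the new column into floats, so it is worth checking the result's dtype.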
Exercise 2.11
There are a number of interesting open data sources in the UK, including
- the government's data service at https://www.data.gov.uk/
- datasets published by the NHS at https://digital.nhs.uk/data-and-information/data-collections-and-data-sets/data-sets
In this exercise we will look at some Greater Manchester public transport information from https://www.data.gov.uk/, namely rail station and tram stop Park & Ride spaces. Load the two CSV files _datasets/rail_park_and_ride_spaces.csv and _datasets/Metrolink_Park_and_Ride_Facilities.csv into pandas data frames rail and metro, respectively.
You will find that each of the data sets has issues. You can use the Jupyter Variable Explorer to explore these. For example, loading the Rail P&R dataset will result in a data frame with Railway Station names that are NaN. The Metrolink dataset, on the other hand, has several of the stop names entered with a question mark ?, several repetitions of the header line within the data part, and many missing values as well.
Unfortunately, such "messy" data is not an exception but rather the rule. Hence it is absolutely crucial to spend time investigating the problems and cleaning up the data, before we can draw any conclusions from it. Here are some of the things you could do to improve the situation.
To clean up rail:
- From rail, drop all rows where the Railway Station is listed as NaN.
- Set Railway Station as the index.
- Create a data frame missing listing all railway stations that have a missing value in the column P&R Spaces.
- Create a data frame rail_clean listing all railway stations that have a valid value in the column P&R Spaces.
- Compute the total number of P&R Spaces from rail_clean (you will notice that the P&R Spaces column needs to be explicitly loaded as, or converted to, a numeric type).
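The pandas tools involved in these steps can be sketched on a tiny made-up data frame. The column names Site and Spaces are hypothetical and chosen to differ from the real data set.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing name, one missing count, counts stored as strings
sites = pd.DataFrame({
    "Site": ["North", np.nan, "South", "East"],
    "Spaces": ["12", "5", np.nan, "30"],
})

# Drop rows without a name, then use the name as the index
sites = sites.dropna(subset=["Site"]).set_index("Site")

# Split into rows with and without a count
has_count = sites["Spaces"].notna()
no_count = sites[~has_count]
with_count = sites[has_count].copy()

# Strings must be converted to numbers before they can be summed
with_count["Spaces"] = with_count["Spaces"].astype(int)
total = with_count["Spaces"].sum()
print(total)
```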
To clean up metro:
- Remove all question marks ? in the column Stop name.
- Set Stop name as the index.
- Convert the values in Total parking to numerical. If this is not possible for a value, then use NaN.
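Removing a stray character from strings and converting with a fallback to NaN might look like the sketch below. The data frame and its column names Stop and Capacity are made up for illustration.

```python
import pandas as pd

# Made-up data with stray question marks and a non-numeric entry
stops = pd.DataFrame({
    "Stop": ["Alpha?", "Beta", "Gamma?"],
    "Capacity": ["40", "n/a", "15"],
})

# regex=False treats "?" as a literal character rather than a regex quantifier
stops["Stop"] = stops["Stop"].str.replace("?", "", regex=False)
stops = stops.set_index("Stop")

# errors="coerce" replaces values that cannot be parsed with NaN
stops["Capacity"] = pd.to_numeric(stops["Capacity"], errors="coerce")
print(stops)
```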
Some possible joining operations:
- Perform an inner join of rail_clean and metro_clean (the cleaned metro data frame) on their indices to get a single data frame listing the stations/stops that appear in both data frames.
- Perform an outer join of rail_clean and metro_clean on their indices to get a single data frame listing all stations/stops with P&R provision.
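The difference between the two kinds of join can be seen on two small made-up data frames. The names left, right, rail_spaces, and tram_spaces are illustrative only; the sketch uses DataFrame.join, which joins on the indices by default.

```python
import pandas as pd

# Two made-up data frames that share only the index label "B"
left = pd.DataFrame({"rail_spaces": [10, 20, 30]}, index=["A", "B", "C"])
right = pd.DataFrame({"tram_spaces": [5, 15]}, index=["B", "D"])

# Inner join: keep only index labels present in both frames
inner = left.join(right, how="inner")

# Outer join: keep all index labels, filling gaps with NaN
outer = left.join(right, how="outer")

print(inner)
print(outer)
```

In the outer join, labels that appear in only one frame get NaN in the columns coming from the other frame.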