ch2-exercises

Exercise 2.1

For each type of data, classify it as discrete quantitative, continuous quantitative, categorical, or other.

How many students are enrolled at a university
Your favourite day of the week
How many millimeters of rain fall at an airport during one day
Weight of a motor vehicle
Manufacturer of a motor vehicle
Text of all Google Maps reviews for a restaurant
Star ratings from all Google Maps reviews for a restaurant
Size of the living area of an apartment
DNA nucleotide sequence of a cell

Exercise 2.2

Give the length of each vector or series.

Morning waking times every day for a week
Number of siblings (max 12) for each student in a class of 30
Position and momentum of a roller coaster car

Exercise 2.3

Create a NumPy vector called primes that consists of the first 8 prime numbers. (You can simply type in the values, nothing fancy needed to compute them.)
Create a NumPy vector called squares that consists of the values [1, 4, 9, 16, ..., 1521, 1600].

Note: You can use the code in the "Testing" cell to verify your solution. If primes and squares have the correct values, the testing code should run without errors.

In [11]:

# Testing 
assert type(primes) == np.ndarray, "primes should be a numpy array"
assert primes.shape == (8,), "primes should be a 1D array with 8 elements"
assert np.sum(primes) == 77, "Sum of primes should be 77"
assert list(primes) == [2, 3, 5, 7, 11, 13, 17, 19]

assert type(squares) == np.ndarray, "squares should be a numpy array"
assert squares.shape == (40,), "squares should be a 1D array with 40 elements"
assert np.sum(squares) == 22140, "Sum of squares should be 22140"
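
Hint: a vector can be typed in directly, or built from a range combined with elementwise arithmetic. The following is a minimal sketch on different data than the exercise asks for; the names odds and cubes are illustrative only.

```python
import numpy as np

# Type the values in directly.
odds = np.array([1, 3, 5, 7, 9])

# Build a pattern from a range: np.arange(1, 6) gives 1, 2, 3, 4, 5,
# and ** acts elementwise on the whole array.
cubes = np.arange(1, 6) ** 3

print(odds.shape)       # (5,)
print(cubes.tolist())   # [1, 8, 27, 64, 125]
```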

Exercise 2.4

Create a matrix C with 27 rows and 44 columns having all entries equal to 5.2.
Create a matrix D as the 4-by-6 upper-left submatrix of C, making sure that D is a "proper" copy (in the sense that changing an element in one of the matrices does not alter the other matrix).

In [13]:

# Testing
assert type(C) == np.ndarray, "C should be a numpy array"
assert C.shape == (27, 44), "C should be a 2D array with 27 rows and 44 columns"
assert np.all(np.isclose(C, 5.2)), "C should have all entries equal to 5.2"

assert type(D) == np.ndarray, "D should be a numpy array"
assert D.shape == (4, 6), "D should be a 2D array with 4 rows and 6 columns"
assert np.all(np.isclose(D, 5.2)), "D should have all entries equal to 5.2"
assert D.flags['OWNDATA'], "D should be a deep copy, not a view, of C"
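
Hint: slicing a NumPy array returns a view that shares memory with the original, while the copy method allocates independent storage. The sketch below illustrates the difference on a small matrix; the names A, sub_view and sub_copy are illustrative and not part of the exercise.

```python
import numpy as np

# np.full creates an array of the given shape filled with a single value.
A = np.full((3, 4), 1.5)

sub_view = A[:2, :2]          # a slice is a view: it shares memory with A
sub_copy = A[:2, :2].copy()   # .copy() allocates independent storage

sub_copy[0, 0] = 9.0          # does not touch A
print(A[0, 0])                # 1.5

sub_view[0, 0] = 7.0          # writes through to A
print(A[0, 0])                # 7.0

print(np.shares_memory(A, sub_view), np.shares_memory(A, sub_copy))  # True False
```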

Exercise 2.5

Create the matrix R = np.reshape(range(0, 30), (5, 6)). Use a single NumPy slice command to extract the submatrix $$ S = \begin{bmatrix} 2 & 3 & 4 \\ 14 & 15 & 16 \\ 26 & 27 & 28 \end{bmatrix}. $$

In [15]:

# Testing
assert np.all(S == np.array([[2, 3, 4], [14, 15, 16], [26, 27, 28]])), "S is incorrect"
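
Hint: a slice has the form start:stop:step, and rows and columns can be sliced independently in one indexing expression. The sketch below uses a different matrix and a different target than the exercise; M and sub are illustrative names.

```python
import numpy as np

M = np.arange(20).reshape(4, 5)   # rows 0..3, columns 0..4

# Rows 1 and 3 (start at 1, step 2); columns 0 and 1 (0 up to, not including, 2).
sub = M[1::2, 0:2]
print(sub.tolist())   # [[5, 6], [15, 16]]
```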

Exercise 2.6

Write a function row_col_mean that accepts a matrix as input and returns two vectors: the mean value in every row and the mean value in every column. So, for example, the input $$ \begin{bmatrix} 2 & -1 & 2 \\ -3 & 1 & 2 \end{bmatrix} $$ would produce the outputs $[1, \: 0]$ and $[-0.5,\: 0,\: 2]$.

Important: When asked to write a function that returns values, make sure you do not just use print but instead use the return keyword. In this exercise, your function should return a tuple with two elements.

In [17]:

# Testing
B = np.array([[1, 3, 4],[5, 0, 2]])
assert np.all( np.isclose(row_col_mean(B)[0], np.array([2.6666667, 2.3333333])) ), "Row means are incorrect"
assert np.all( np.isclose(row_col_mean(B)[1], np.array([3, 1.5, 3])) ), "Column means are incorrect"
testA = np.array([[1,3,-4,-9], [5,0,2,2], [3,5,1,0]])
t0, t1 = row_col_mean(testA)
assert np.isclose(t0, np.array([-2.25, 2.25, 2.25])).all()
assert np.isclose(t1, np.array([ 3., 2.66666667, -0.33333333, -2.33333333])).all()
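
Hint: NumPy reductions take an axis argument, and a function returns a tuple when several values are listed after return. The sketch below shows both ideas with sums rather than means; row_col_sum is a hypothetical helper, not the function requested in the exercise.

```python
import numpy as np

def row_col_sum(A):
    """Return (row sums, column sums) of a 2D array as a tuple."""
    row_sums = A.sum(axis=1)   # collapse the columns: one value per row
    col_sums = A.sum(axis=0)   # collapse the rows: one value per column
    return row_sums, col_sums

A = np.array([[2, -1, 2], [-3, 1, 2]])
r, c = row_col_sum(A)          # unpack the returned tuple
print(r.tolist(), c.tolist())  # [3, 0] [-1, 0, 4]
```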

Exercise 2.7

Write Python code that generates a list lst = [-1, 1, -1, 1, ..., -1, 1] with $10^7$ integer elements. Write code that measures the execution time of the command sum(lst), which computes the sum of that list. (There are many ways to measure runtime. You could use the time() function in the time module, or check out the timeit module.)

Now generate a NumPy array from lst with the same elements. Then measure and compare the runtime of performing the same summation in NumPy. Can you think of a reason for the difference in performance?

Repeat the same experiment but now with a list of floats [-1.0, 1.0, -1.0, 1.0, ..., -1.0, 1.0].
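
Hint: one possible way to time a single call is to read a clock before and after it. The sketch below uses time.perf_counter from the standard library on a much smaller list than the exercise asks for; time_call is a hypothetical helper, and a single measurement like this can fluctuate from run to run.

```python
import time

def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call of func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

values = list(range(100_000))            # much smaller than the exercise's list
total, seconds = time_call(sum, values)
print(total, f"{seconds:.6f} s")
```

The same helper can be applied to any other summation routine, which makes the two timings directly comparable.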

Exercise 2.8

Describe a scheme for creating dummy variables for the days of the week. Use your scheme to encode the vector:

[Tuesday, Sunday, Friday, Tuesday, Monday]
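
Hint: one common scheme, often called one-hot encoding, uses one 0/1 column per category. The sketch below applies pandas' get_dummies to a different categorical variable; the names levels, sizes and dummies are illustrative, and this is only one of several valid schemes.

```python
import pandas as pd

# Fix the category order explicitly so the columns come out in this order.
levels = ["small", "medium", "large"]
sizes = pd.Series(pd.Categorical(["medium", "large", "small", "medium"],
                                 categories=levels))

# One 0/1 column per category; every row contains exactly one 1.
dummies = pd.get_dummies(sizes).astype(int)
print(dummies)
```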

Exercise 2.9

Use the following code to load weather data measured at Manchester airport and use calculations on the data frame to assign the correct numerical values to the given variables. You will need to import the appropriate module(s) first.

weather = pd.read_csv("_datasets/mcr_airport_weather.csv")

Display the first 7 rows of the data frame.
Verify that the columns snow and tsun only contain the value NaN. Have a look at the CSV file to explain why.

Use pandas methods to assign values to the following variables:

prcp_june = None   # total precipitation in June (float)
range_sept = None  # difference between maximal and minimal September temp (float)
hottest = None     # hottest day(s) in terms of maximal temperature (dataframe)

In [20]:

# Testing
assert np.isclose(prcp_june, 48.2), "prcp_june incorrect"
assert np.isclose(range_sept, 23.0), "range_sept incorrect"
assert type(hottest) == pd.DataFrame, "hottest should be a dataframe"
assert hottest.shape[0] == 2, "hottest should have two rows"
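
Hint: the pandas operations involved can be tried out on a small hand-made data frame first. In the sketch below the frame toy and its column names date, tmax, prcp and snow are assumptions made for illustration; check the actual column names and types in the weather data before adapting anything. If a date column is read as text, pd.to_datetime converts it.

```python
import numpy as np
import pandas as pd

# Toy data; the column names date, tmax, prcp and snow are assumptions.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-06-01", "2022-06-02", "2022-07-01", "2022-07-02"]),
    "tmax": [18.0, 21.5, 25.0, 25.0],
    "prcp": [0.4, 1.6, 0.0, 3.0],
    "snow": [np.nan] * 4,
})

all_missing = toy["snow"].isna().all()            # True if every entry is NaN

is_june = toy["date"].dt.month == 6               # boolean mask, one entry per row
june_rain = toy.loc[is_june, "prcp"].sum()        # total over the selected rows

july = toy[toy["date"].dt.month == 7]
july_range = july["tmax"].max() - july["tmax"].min()

warmest = toy[toy["tmax"] == toy["tmax"].max()]   # keeps ties, stays a data frame
```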

Exercise 2.10

Create a data frame called ratings by loading the file corporate_rating.csv.

Display the first 5 rows.
The ratings are ordered AAA, AA, A, BBB, BB, B, CCC, CC, C, D. Create a new column called Rating_number in which each string value in the Rating column is replaced with its ordinal equivalent 1, 2, 3, ..., 10.
How many unique Name values (company names) are there?

In [22]:

# Testing
assert len(ratings.columns) == 32, "There should be 32 columns"
assert pd.api.types.is_integer_dtype(ratings["Rating_number"]), "Rating_number should be an integer"
assert ratings["Rating_number"].min() == 1, "The minimum rating number should be 1"
assert ratings["Rating_number"].max() == 10, "The maximum rating number should be 10"
assert ratings["Rating_number"].sum() == 8841, "The sum of rating numbers should be 8841"
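
Hint: an ordered list of labels can be turned into a lookup dictionary and applied to a column with map. The sketch below uses a toy frame; grades, Grade, Grade_number, order and rank are hypothetical names chosen for illustration.

```python
import pandas as pd

grades = pd.DataFrame({"Grade": ["low", "high", "medium", "low"]})

# Ordered labels; the position in the list defines the rank (starting at 1).
order = ["low", "medium", "high"]
rank = {label: i for i, label in enumerate(order, start=1)}

grades["Grade_number"] = grades["Grade"].map(rank)   # look up each label
n_unique = grades["Grade"].nunique()                 # number of distinct labels
```

Labels that are missing from the dictionary are mapped to NaN, which also turns the column into floats, so it is worth checking the result for missing values.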

Exercise 2.11

There are a number of interesting open data sources in the UK, including

the government's data service at https://www.data.gov.uk/
datasets published by the NHS at https://digital.nhs.uk/data-and-information/data-collections-and-data-sets/data-sets

In this exercise we will look at some Greater Manchester public transport information from https://www.data.gov.uk/, namely rail station and tram stop Park & Ride spaces. Load the two CSV files _datasets/rail_park_and_ride_spaces.csv and _datasets/Metrolink_Park_and_Ride_Facilities.csv into pandas data frames rail and metro, respectively.

You will find that each of the data sets has issues. You can use the Jupyter Variable Explorer to explore these. For example, loading the Rail P&R dataset will result in a data frame with Railway Station names that are NaN. The Metrolink dataset, on the other hand, has several of the stop names entered with a question mark ?, several repetitions of the header line within the data part, and many missing values as well.

Unfortunately, such "messy" data is not an exception but rather the rule. Hence it is absolutely crucial to spend time investigating the problems and cleaning up the data, before we can draw any conclusions from it. Here are some of the things you could do to improve the situation.

To clean up rail:

From rail, drop all rows where the Railway Station is listed as NaN
Set Railway Station as the index
Create a data frame missing listing all railway stations that have a missing value in the column P&R Spaces
Create a data frame rail_clean listing all railway stations that have a valid value in the column P&R Spaces
Compute the total number of P&R Spaces from rail_clean (you will notice that the P&R Spaces column needs to be explicitly loaded or converted to numeric)
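
Hint: the steps above can be rehearsed on a toy frame. In the sketch below the frame toy and the column names Station and Spaces are hypothetical stand-ins, not the names in the rail dataset.

```python
import numpy as np
import pandas as pd

# Toy data; the column names Station and Spaces are hypothetical.
toy = pd.DataFrame({
    "Station": ["Alpha", np.nan, "Beta", "Gamma"],
    "Spaces": ["12", "3", np.nan, "40"],
})

toy = toy.dropna(subset=["Station"])      # drop rows without a station name
toy = toy.set_index("Station")

no_value = toy[toy["Spaces"].isna()]             # rows with a missing count
has_value = toy[toy["Spaces"].notna()].copy()    # rows with a count present

# The counts were stored as text, so convert before summing.
has_value["Spaces"] = pd.to_numeric(has_value["Spaces"])
total = has_value["Spaces"].sum()
print(total)   # 52
```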

To clean up metro (store the cleaned data frame as metro_clean):

Remove all question marks ? in the column Stop name
Set Stop name as the index
Convert the values in Total parking to numerical. If this is not possible for a value, then use NaN.
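
Hint: string cleanup and forgiving numeric conversion can again be tried on a toy frame. In the sketch below the frame toy and the column names Stop and Parking are hypothetical; the extra step that removes a repeated header row is one possible way to handle that issue.

```python
import pandas as pd

# Toy data; the column names Stop and Parking are hypothetical.
toy = pd.DataFrame({
    "Stop": ["Alpha?", "Stop", "Beta", "Gamma?"],
    "Parking": ["20", "Parking", "n/a", "35"],
})

# regex=False treats "?" as a literal character rather than a regex operator.
toy["Stop"] = toy["Stop"].str.replace("?", "", regex=False)

# Drop a repeated header line that ended up among the data rows.
toy = toy[toy["Stop"] != "Stop"].set_index("Stop")

# errors="coerce" turns anything that cannot be parsed into NaN.
toy["Parking"] = pd.to_numeric(toy["Parking"], errors="coerce")
print(toy)
```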

Some possible joining operations:

Perform an inner join of rail_clean and metro_clean on their indices to get a single data frame listing the stations/stops that appear in both data frames.
Perform an outer join of rail_clean and metro_clean on their indices to get a single data frame listing all the stations/stops with P&R provision.
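
Hint: DataFrame.join combines two frames on their indices, and the how argument selects which index labels survive. The sketch below uses two toy frames; left, right and their column names are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"rail_spaces": [12, 40]}, index=["Alpha", "Gamma"])
right = pd.DataFrame({"tram_spaces": [35, 8]}, index=["Gamma", "Delta"])

both = left.join(right, how="inner")     # only labels present in both indices
either = left.join(right, how="outer")   # every label from either index; gaps become NaN

print(both)
print(either)
```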