Review questions

Chapter 1

What is Git and what does GitHub provide on top of Git?
Name the essential user interface components of VS Code.
What are the three types of cell in a Jupyter Notebook?
What is the most common pitfall when working with Jupyter Notebooks?

Chapter 2

Define quantitative and qualitative values. What is the difference between continuous and discrete quantitative data?
Name three types of qualitative data.
Define pandas series and give some examples of series.
Define pandas dataframes and discuss some use cases.
Name two ways dataframes can be combined into a single one and provide some examples of the syntax.
Name three commonly encountered tasks in “data wrangling”.

Chapter 3

Section 3.1: Summary statistics

Define mean, variance, and standard deviation. How do variance and standard deviation differ in terms of units?
What is a z-score, and how is it calculated?
Why are z-scores considered to be dimensionless, and what advantage does this provide when comparing datasets?
Explain the difference between a population and a sample in statistics.
What is an unbiased estimator?
Describe the key conceptual difference between \(s_n\) and \(s_{n-1}\) for estimating population variance.
Define percentile (i.e., quantile), quartiles, and median.
Explain the difference between the sample median calculation for odd and even numbers of observations.
What does the interquartile range represent, and how is it calculated?
Discuss the potential advantage of using the median and IQR over the mean and standard deviation for summarizing a dataset.
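
To check answers to the questions above numerically, the following is a minimal sketch using only Python's standard `statistics` module. The data values are made up for illustration, and the quartile convention is the module's default, which may differ from the one used in the text.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample, chosen for round numbers

mean = statistics.mean(data)
var_n = statistics.pvariance(data)   # divides by n      (s_n squared)
var_n1 = statistics.variance(data)   # divides by n - 1  (s_{n-1} squared)
std_n = statistics.pstdev(data)      # same units as the data; variance has squared units

# z-scores: subtract the mean, divide by the standard deviation (dimensionless)
z_scores = [(x - mean) / std_n for x in data]

# Even number of observations: the median averages the two middle values
median = statistics.median(data)

# Quartiles and interquartile range (default "exclusive" convention of the module)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(mean, var_n, var_n1, std_n, median, iqr)
```

Note that `var_n1` is larger than `var_n`, since the same sum of squared deviations is divided by a smaller number.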

Section 3.2: Distributions

Define the cumulative distribution function (CDF) for a population. Explain why it is a nondecreasing function ranging between 0 and 1.
What is the CDF of a uniform distribution? How does it differ for discrete and continuous cases?
What is an empirical cumulative distribution function (ECDF)?
How does the shape of the ECDF change as the sample size increases?
Define the probability density function (PDF) and explain its relationship to the CDF.
Discuss the concept of a histogram and its normalization.
Describe formulas for computing the mean and variance of a continuous distribution from its PDF.
What are the mean and variance for a continuous uniform distribution over \([0, 1]\)?
What are the characteristics of a standard normal distribution?
How do the mean and variance of a normal distribution affect its PDF?
Discuss the 68-95-99.7 (empirical) rule in the context of the normal distribution.
What does kernel density estimation (KDE) do?

Section 3.4: Grouping

Why might it be useful to analyze data in groups defined by categorical values or other criteria?
What is a facet plot?
Describe the information conveyed by a box plot and a violin plot.
Describe how aggregation works with grouping in data analysis. What are some common aggregation functions?
Describe how transformation works with grouping in data analysis. What are some common transformation functions?
What is the purpose of filtering in grouped data analysis?
Discuss the impact of standardizing data within groups versus across the entire dataset.

Section 3.5: Outliers

What is an informal definition of an outlier?
Why might outliers be of real interest in certain applications?
List some common reasons why outliers might appear in a dataset.
Provide an example where changing a single value in a dataset significantly affects the mean but not the median.
What is the interquartile range (IQR), and how is it calculated?
Describe how outliers are identified using the IQR method.
How are outliers represented in a box plot?
What are the criteria for considering values as outliers in a normal distribution based on standard deviation?
Discuss the potential impact of removing outliers on the analysis of a dataset. When might it be appropriate to remove outliers, and when might it be important to investigate them further?
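
One possible way to explore the mean-versus-median question and the IQR method is sketched below. The data are invented, and the fences at 1.5 times the IQR beyond the quartiles are the common convention, assumed here rather than taken from the text.

```python
import statistics

original = [10, 12, 13, 14, 15]
altered = [10, 12, 13, 14, 150]   # a single value changed

mean_before = statistics.mean(original)
mean_after = statistics.mean(altered)
median_before = statistics.median(original)
median_after = statistics.median(altered)

# Quartiles of the altered data ("inclusive" convention; others give slightly different values)
q1, _, q3 = statistics.quantiles(altered, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: values more than 1.5 * IQR outside the quartiles are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in altered if x < lower_fence or x > upper_fence]

print(mean_before, mean_after)      # the mean shifts substantially
print(median_before, median_after)  # the median does not move
print(outliers)
```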

Section 3.6: Correlation

Define correlation in the context of statistical analysis.
Why is it important to visually inspect data before calculating correlation coefficients?
Define covariance and explain its significance in measuring the relationship between two variables.
Why is covariance not always easy to interpret?
Explain the Pearson correlation coefficient and how it is calculated for both populations and samples.
What do Pearson coefficients of -1, 0, and 1 signify?
How does the Pearson coefficient address the limitations of covariance?
Discuss how outliers can affect the Pearson correlation coefficient.
Provide an example where a single outlier significantly impacts the Pearson coefficient.
What is the Spearman correlation coefficient, and how does it differ from the Pearson coefficient?
How does the Spearman coefficient mitigate the impact of outliers?
Provide an example demonstrating the robustness of the Spearman coefficient against outliers.
Explain how correlation can be assessed when dealing with categorical variables.
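
For the questions comparing Pearson and Spearman coefficients, a small self-contained sketch may help. The helper names `pearson`, `ranks`, and `spearman` are hypothetical, the data are invented, and the ranking helper assumes there are no tied values.

```python
import math

def pearson(x, y):
    """Sample Pearson coefficient: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks 1..n in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman coefficient: Pearson coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, 100]   # last value is an outlier, but the order is preserved

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(r_pearson, r_spearman)
```

Because the outlier changes a value but not the ordering, the ranks are unaffected and the Spearman coefficient remains at 1, while the Pearson coefficient drops well below 1.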

Section 3.7: Cautionary tales

What is the main lesson from the Datasaurus Dozen example?
How can relying solely on summary statistics be misleading?
Explain the difference between correlation and dependence.
Describe an example where two variables are dependent but not correlated.
What is Simpson’s paradox, and how does it manifest in the penguin dataset example?
Can you create a hypothetical example where Simpson’s paradox might lead to incorrect conclusions if not properly understood?
Why is it important to consider both linear and nonlinear relationships when analyzing data?
How can the misuse of statistical methods lead to misconceptions or mistakes?

Chapter 4

Define the terms hypothesis and concept.
What is a nearest neighbor hypothesis?
Explain why not all data is necessarily generated by a concept and how this issue is resolved by considering data-generating distributions.
Define the term generalisation error.
Define the term Bayes hypothesis.
Explain why one-hot encoding is preferable for encoding more than two classes when an \(L_2\) loss function is used.
Define empirical Rademacher complexity and explain why it is generally hard to compute.
Explain the approximation-estimation trade-off and state the approximation-estimation decomposition of the generalisation error.

Chapter 5

Section 5.1

Define a feature vector and label in the context of classification problems.
What distinguishes a binary classification problem from a multiclass problem?
Describe the structure of a feature matrix and a label vector.
Discuss the limitations of assigning arbitrary numerical values to qualitative data. Why is one-hot or dummy encoding considered a better strategy?
Outline the basic steps involved in training and applying a machine learning classifier.
What is a query vector? How is it used in the context of a classifier?
How can the accuracy of a classifier be determined?
Besides accuracy, what other metrics or considerations might be important in evaluating the effectiveness of a classifier?

Section 5.2

Define the concept of generalization in the context of machine learning classifiers.
Explain the purpose of splitting a dataset into training and testing sets. How does this practice help in evaluating the generalization of a classifier?
Describe the significance of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) in assessing classifier performance.
Define accuracy, recall (sensitivity), specificity, precision, and negative predictive value (NPV).
What information does a confusion matrix convey about a classifier’s performance?
Explain the F₁ score and balanced accuracy. How do they provide a more nuanced view of classifier performance compared to using accuracy alone?
How do the concepts of binary classification metrics extend to multiclass classification problems?
Discuss the difference between macro averaging and other methods of averaging precision scores across multiple classes. Why might one averaging method be preferred over another?
Describe a hypothetical situation where a high recall is more important than high precision.
Describe a hypothetical situation where a high precision is more important than high recall.
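
As a way of checking the metric definitions above, the following sketch computes them from the four confusion-matrix counts. The counts are invented, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return common binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)          # sensitivity: fraction of actual positives found
    specificity = tn / (tn + fp)     # fraction of actual negatives found
    precision = tp / (tp + fp)       # fraction of positive predictions that are correct
    npv = tn / (tn + fn)             # fraction of negative predictions that are correct
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "f1": 2 * precision * recall / (precision + recall),  # harmonic mean
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Hypothetical counts for illustration only
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```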

Section 5.3

Describe the process of building a decision tree for classification.
Define Gini impurity and explain its significance in the context of decision trees.
Describe the criteria for partitioning samples in a decision tree.
What is an indicator function, and how is it utilized in expressing Gini impurity?
Explain the concept of a decision boundary in relation to decision trees.
In general terms, how does the depth of a decision tree affect its decision boundary?
What is a greedy algorithm, and how does it apply to the context of finding a decision tree?
What does it mean to say a classifier is interpretable?
Besides Gini impurity, are there other metrics or considerations that might be important in evaluating the partitioning in a decision tree?
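
The Gini impurity questions can be explored with a short sketch such as the one below. It uses the common form of one minus the sum of squared class proportions, which may be written differently in the text, and it scores a split by the size-weighted average impurity of the two parts. Both function names are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 for a pure set)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_impurity(left, right):
    """Size-weighted average impurity of a two-way partition (an assumed scoring rule)."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n

mixed = ["a", "a", "b", "b"]
print(gini_impurity(mixed))                        # evenly mixed two-class set
print(split_impurity(["a", "a"], ["b", "b"]))      # a split that separates the classes
print(split_impurity(["a", "b"], ["a", "b"]))      # a split that does not help
```

A greedy tree builder would, at each step, choose the partition with the lowest score among the candidates it considers.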

Section 5.4

Describe the basic principle behind the k-nearest neighbors algorithm. How does it determine the class of a query point?
What role do distance metrics play in the kNN algorithm? Explain the properties that a distance metric must satisfy.
Define the 2-norm, 1-norm, and infinity-norm. How are these norms used to calculate distances between feature vectors?
How does the choice of \(k\) in the kNN algorithm influence the algorithm’s decision boundary and predictions?
Why is standardization or scaling of features important in kNN?
How is kNN applied to multiclass classification problems?
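
To make the three norms concrete, the sketch below computes the corresponding distances between two feature vectors. The vectors and the function names are made up for illustration.

```python
import math

def distance_2norm(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_1norm(x, y):
    """Manhattan distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def distance_infnorm(x, y):
    """Infinity-norm distance: largest absolute difference in any coordinate."""
    return max(abs(a - b) for a, b in zip(x, y))

u = (1.0, 2.0)
v = (4.0, 6.0)

print(distance_2norm(u, v), distance_1norm(u, v), distance_infnorm(u, v))
```

If one feature were measured on a much larger scale than the other, it would dominate all three distances, which is one motivation for standardizing features before applying kNN.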

Section 5.5

Explain how a probability vector arises from vote-based classification methods. How does it provide more information than a winner-takes-all approach?
Define the ROC curve and explain its significance in evaluating binary classifiers.
What is a decision threshold in the context of probabilistic classifiers? How does changing the threshold affect classification outcomes?
What is the AUC metric?
How can ROC curves and AUC scores be used to compare different classifiers or different settings of the same classifier?
Explain how the concepts of ROC curves and AUC scores extend to multiclass classification problems.

Chapter 6

Section 6.1

How is the bias of a classifier defined in the context of expected prediction error?
Explain the concept of variance in the context of machine learning models. How does variance affect a model’s performance on unseen data?
Explain how bias and variance contribute to the total prediction error of a model. Why do we refer to a “tradeoff” between bias and variance?
How does the size of the training set impact the bias and variance of a machine learning model? Answer using sketches of learning curves.
Discuss how a learner’s capacity to capture complex behavior influences both its bias and variance.
What might a large gap between training and test errors suggest about a model’s ability to generalize? What steps could be taken to reduce the gap?

Section 6.2

Define overfitting in the context of machine learning and explain why it is problematic for model generalization.
Why does a model that perfectly fits the training data not necessarily perform well on new, unseen data?
How does the choice of \(k\) in kNN affect the likelihood of overfitting, particularly with noisy data?
Explain how the depth of a decision tree can contribute to overfitting.
Compare and contrast overfitting and underfitting. How can one identify these issues based on the model’s performance?
How is overfitting related to the concept of variance in a model’s predictions? Discuss the relationship between overfitting and the gap between training and testing performance.
What strategies can be employed to reduce the risk of overfitting in machine learning models?
How does overfitting fit into the broader discussion of the bias-variance tradeoff?
Describe how learning curves can be used to diagnose overfitting.

Section 6.3

Define ensemble methods in machine learning and explain why they are used to improve model performance.
What is bootstrap aggregation (bagging), and how does it help in reducing the variance of a machine learning model?
Precisely how does bagging affect the bias and variance of a collection of \(n\) learners?
Is bagging better with constituent classifiers that have small bias and large variance, or that have large bias and small variance?
What is a random forest?
Which is more likely to improve bagging results: smaller individual training sets, or larger ones? Why?
How does selecting random subsets of features for each tree in an ensemble improve the ensemble performance?
What are the two chief disadvantages of using ensemble methods?

Section 6.4

Why is validation important in the process of selecting optimal hyperparameters and models?
Describe the steps involved in \(k\)-fold cross-validation.
Explain stratified \(k\)-fold cross-validation and when it might be preferred over standard \(k\)-fold cross-validation.
How can cross-validation be used to tune hyperparameters? Describe the process of creating a validation curve and interpreting its results.
Discuss how the variance of cross-validation scores across folds can inform us about a model’s reliability.
What is a grid search for hyperparameter optimization? When does it become impractical?
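
The splitting step of \(k\)-fold cross-validation can be sketched in a few lines. This is a simplified illustration, not a library implementation: the function name `kfold_indices` is hypothetical, and no shuffling or stratification is performed.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold. A model would be trained on each `train` set and scored on the matching `test` set, and the spread of the \(k\) scores gives a rough sense of how stable the model is.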