Review questions

Chapter 1

  • What is Git and what does GitHub provide on top of Git?
  • Name the essential user interface components of VS Code.
  • What are the three types of cell in a Jupyter Notebook?
  • What is the most common pitfall when working with Jupyter Notebooks?

Chapter 2

  • Define quantitative and qualitative values. What is the difference between continuous and discrete quantitative data?
  • Name three types of qualitative data.
  • Define pandas series and give some examples of series.
  • Define pandas dataframes and discuss some use cases.
  • Name two ways dataframes can be combined into a single one and provide some examples of the syntax.
  • Name three commonly encountered tasks in “data wrangling”.
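
For checking an answer to the question on combining dataframes, here is a minimal sketch of two common approaches, assuming concatenation and merging are the pair intended. The tables and column names (`id`, `height`, `mass`) are hypothetical, made up only for illustration.

```python
import pandas as pd

# Hypothetical example tables; the column names are invented for illustration.
left = pd.DataFrame({"id": [1, 2, 3], "height": [1.6, 1.7, 1.8]})
right = pd.DataFrame({"id": [2, 3, 4], "mass": [60, 70, 80]})
more_rows = pd.DataFrame({"id": [5], "height": [1.9]})

# Concatenation stacks frames that share the same columns, row after row.
stacked = pd.concat([left, more_rows], ignore_index=True)

# Merging joins frames on a key column; "inner" keeps only ids found in both.
merged = left.merge(right, on="id", how="inner")

print(stacked)
print(merged)
```

Changing `how` to `"left"`, `"right"`, or `"outer"` changes which unmatched rows are kept.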

Chapter 3

Section 3.1: Summary statistics

  • Define mean, variance, and standard deviation. How do variance and standard deviation differ in terms of units?
  • What is a z-score, and how is it calculated?
  • Why are z-scores considered to be dimensionless, and what advantage does this provide when comparing datasets?
  • Explain the difference between a population and a sample in statistics.
  • What is an unbiased estimator?
  • Describe the key conceptual difference between \(s_n\) and \(s_{n-1}\) for estimating population variance.
  • Define percentile (i.e., quantile), quartiles, and median.
  • Explain the difference between the sample median calculation for odd and even numbers of observations.
  • What does the interquartile range represent, and how is it calculated?
  • Discuss the potential advantage of using the median and IQR over the mean and standard deviation for summarizing a dataset.
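
As a self-check for the questions above, the following sketch computes the main summary statistics using only the Python standard library. The data values are made up, and the quartile convention (`method="inclusive"`) is an assumption; other conventions give slightly different quartiles.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample

mean = statistics.fmean(data)

# Divide by n (the s_n^2 convention) versus n - 1 (the s_{n-1}^2 convention).
var_n = statistics.pvariance(data)
var_n1 = statistics.variance(data)
std_n = statistics.pstdev(data)

# z-scores: subtract the mean, divide by the standard deviation.
# The units cancel, so the result is dimensionless.
z = [(x - mean) / std_n for x in data]

# With an even number of observations, the median averages the two middle values.
median = statistics.median(data)

# Quartiles and interquartile range.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(mean, var_n, var_n1, median, iqr)
```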

Section 3.2: Distributions

  • Define the cumulative distribution function (CDF) for a population. Explain why it is a nondecreasing function ranging between 0 and 1.
  • What is the CDF of a uniform distribution? How does it differ for discrete and continuous cases?
  • What is an empirical cumulative distribution function (ECDF)?
  • How does the shape of the ECDF change as the sample size increases?
  • Define the probability density function (PDF) and explain its relationship to the CDF.
  • Discuss the concept of a histogram and its normalization.
  • Describe formulas for computing the mean and variance of a continuous distribution from its PDF.
  • What are the mean and variance for a continuous uniform distribution over \([0, 1]\)?
  • What are the characteristics of a standard normal distribution?
  • How do the mean and variance of a normal distribution affect its PDF?
  • Discuss the 68-95-99.7 (empirical) rule in the context of the normal distribution.
  • What does kernel density estimation (KDE) do?
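
Two of the ideas above are easy to check numerically. The sketch below defines an ECDF as the fraction of observations at or below a value, and evaluates the 68-95-99.7 rule from the standard normal CDF. The helper name `ecdf` and the sample values are illustrative assumptions.

```python
from statistics import NormalDist


def ecdf(sample, x):
    """Fraction of observations in `sample` that are less than or equal to x."""
    return sum(1 for s in sample if s <= x) / len(sample)


sample = [3, 1, 4, 1, 5]  # made-up observations
values = [ecdf(sample, x) for x in (0, 1, 3, 5)]

# Probability mass within 1, 2, and 3 standard deviations of the mean.
std_normal = NormalDist(mu=0.0, sigma=1.0)
within = [std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)]

print(values)
print([round(p, 4) for p in within])
```

The ECDF is a step function that jumps by a multiple of 1/n at each observed value, which is why it never decreases and stays between 0 and 1.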

Section 3.4: Grouping

  • Why might it be useful to analyze data in groups defined by categorical values or other criteria?
  • What is a facet plot?
  • Describe the information conveyed by a box plot and a violin plot.
  • Describe how aggregation works with grouping in data analysis. What are some common aggregation functions?
  • Describe how transformation works with grouping in data analysis. What are some common transformation functions?
  • What is the purpose of filtering in grouped data analysis?
  • Discuss the impact of standardizing data within groups versus across the entire dataset.

Section 3.5: Outliers

  • What is an informal definition of an outlier?
  • Why might outliers be of real interest in certain applications?
  • List some common reasons why outliers might appear in a dataset.
  • Provide an example where changing a single value in a dataset significantly affects the mean but not the median.
  • What is the Interquartile Range (IQR), and how is it calculated?
  • Describe how outliers are identified using the IQR method.
  • How are outliers represented in a box plot?
  • What are the criteria for considering values as outliers in a normal distribution based on standard deviation?
  • Discuss the potential impact of removing outliers on the analysis of a dataset. When might it be appropriate to remove outliers, and when might it be important to investigate them further?
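
The sketch below illustrates two of the questions above: how a single changed value moves the mean but not the median, and how the IQR method flags outliers. The factor 1.5 is the conventional choice and is an assumption here, as are the data values, the quartile convention, and the helper name `iqr_outliers`.

```python
import statistics

original = [1, 2, 3, 4, 5]
changed = [1, 2, 3, 4, 500]  # one value replaced

# The mean moves a lot; the median does not move at all.
mean_shift = statistics.mean(changed) - statistics.mean(original)
median_shift = statistics.median(changed) - statistics.median(original)


def iqr_outliers(data, factor=1.5):
    """Return values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [x for x in data if x < low or x > high]


flagged = iqr_outliers([10, 12, 12, 13, 14, 15, 16, 100])

print(mean_shift, median_shift, flagged)
```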

Section 3.6: Correlation

  • Define correlation in the context of statistical analysis.
  • Why is it important to visually inspect data before calculating correlation coefficients?
  • Define covariance and explain its significance in measuring the relationship between two variables.
  • Why is covariance not always easy to interpret?
  • Explain the Pearson correlation coefficient and how it is calculated for both populations and samples.
  • What do Pearson coefficients of -1, 0, and 1 signify?
  • How does the Pearson coefficient address the limitations of covariance?
  • Discuss how outliers can affect the Pearson correlation coefficient.
  • Provide an example where a single outlier significantly impacts the Pearson coefficient.
  • What is the Spearman correlation coefficient, and how does it differ from the Pearson coefficient?
  • How does the Spearman coefficient mitigate the impact of outliers?
  • Provide an example demonstrating the robustness of the Spearman coefficient against outliers.
  • Explain how correlation can be assessed when dealing with categorical variables.
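
One possible example for the outlier questions above: a monotone relationship in which the last value is extreme. This sketch computes the Spearman coefficient as the Pearson coefficient of the ranks. The data are made up, and the `ranks` helper assumes there are no tied values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman coefficient: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, 100]  # perfectly monotone, but the last value is extreme

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)

print(round(r_pearson, 3), round(r_spearman, 3))
```

Because ranks ignore how far apart the values are, the extreme value changes the Pearson coefficient but leaves the Spearman coefficient at 1.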

Section 3.7: Cautionary tales

  • What is the main lesson from the Datasaurus Dozen example?
  • How can relying solely on summary statistics be misleading?
  • Explain the difference between correlation and dependence.
  • Describe an example where two variables are dependent but not correlated.
  • What is Simpson’s paradox, and how does it manifest in the penguin dataset example?
  • Can you create a hypothetical example where Simpson’s paradox might lead to incorrect conclusions if not properly understood?
  • Why is it important to consider both linear and nonlinear relationships when analyzing data?
  • How can the misuse of statistical methods lead to misconceptions or mistakes?

Chapter 4

  • Define the terms hypothesis and concept.
  • What is a nearest neighbor hypothesis?
  • Explain why not all data is necessarily generated by a concept and how this issue is resolved by considering data-generating distributions.
  • Define the term generalisation error.
  • Define the term Bayes hypothesis.
  • Explain why one-hot encoding is preferable for encoding more than two classes when an \(L_2\) loss function is used.
  • Define empirical Rademacher complexity and explain why it is generally hard to compute.
  • Explain the approximation-estimation trade-off and state the approximation-estimation decomposition of the generalisation error.

Chapter 5

Section 5.1

  • Define a feature vector and label in the context of classification problems.
  • What distinguishes a binary classification problem from a multiclass problem?
  • Describe the structure of a feature matrix and a label vector.
  • Discuss the limitations of assigning arbitrary numerical values to qualitative data. Why is one-hot or dummy encoding considered a better strategy?
  • Outline the basic steps involved in training and applying a machine learning classifier.
  • What is a query vector? How is it used in the context of a classifier?
  • How can the accuracy of a classifier be determined?
  • Besides accuracy, what other metrics or considerations might be important in evaluating the effectiveness of a classifier?

Section 5.2

  • Define the concept of generalization in the context of machine learning classifiers.
  • Explain the purpose of splitting a dataset into training and testing sets. How does this practice help in evaluating the generalization of a classifier?
  • Describe the significance of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) in assessing classifier performance.
  • Define accuracy, recall (sensitivity), specificity, precision, and negative predictive value (NPV).
  • What information does a confusion matrix convey about a classifier’s performance?
  • Explain the F₁ score and balanced accuracy. How do they provide a more nuanced view of classifier performance compared to using accuracy alone?
  • How do the concepts of binary classification metrics extend to multiclass classification problems?
  • Discuss the difference between macro averaging and other methods of averaging precision scores across multiple classes. Why might one averaging method be preferred over another?
  • Describe a hypothetical situation where a high recall is more important than high precision.
  • Describe a hypothetical situation where a high precision is more important than high recall.
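
To check the metric definitions above, this sketch computes each one from the four confusion-matrix counts of a binary classifier. The counts are made up for illustration.

```python
# Made-up confusion-matrix counts for a binary classifier.
TP, FP, TN, FN = 40, 10, 45, 5

accuracy = (TP + TN) / (TP + FP + TN + FN)
recall = TP / (TP + FN)            # sensitivity: share of actual positives found
specificity = TN / (TN + FP)       # share of actual negatives found
precision = TP / (TP + FP)         # share of positive predictions that are right
npv = TN / (TN + FN)               # share of negative predictions that are right

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Balanced accuracy averages recall and specificity.
balanced_accuracy = (recall + specificity) / 2

print(accuracy, recall, specificity, precision, npv)
print(round(f1, 4), round(balanced_accuracy, 4))
```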

Section 5.3

  • Describe the process of building a decision tree for classification.
  • Define Gini impurity and explain its significance in the context of decision trees.
  • Describe the criteria for partitioning samples in a decision tree.
  • What is an indicator function, and how is it utilized in expressing Gini impurity?
  • Explain the concept of a decision boundary in relation to decision trees.
  • In general terms, how does the depth of a decision tree affect its decision boundary?
  • What is a greedy algorithm, and how does it apply to the context of finding a decision tree?
  • What does it mean to say a classifier is interpretable?
  • Besides Gini impurity, are there other metrics or considerations that might be important in evaluating the partitioning in a decision tree?
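
A minimal sketch of Gini impurity, assuming the common definition of one minus the sum of squared class proportions. Scoring a split by the size-weighted impurity of its two parts is also an assumption about the partitioning criterion, and the helper names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted average impurity of the two sides of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


mixed = gini(["a", "a", "b", "b"])                       # evenly mixed node
pure = gini(["a", "a", "a"])                             # single-class node
good_split = split_impurity(["a", "a"], ["b", "b"])      # separates the classes
bad_split = split_impurity(["a", "b"], ["a", "b"])       # separates nothing

print(mixed, pure, good_split, bad_split)
```

A greedy tree builder would evaluate candidate splits this way and keep the one with the lowest score at each node.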

Section 5.4

  • Describe the basic principle behind the k-nearest neighbors algorithm. How does it determine the class of a query point?
  • What role do distance metrics play in the kNN algorithm? Explain the properties that a distance metric must satisfy.
  • Define the 2-norm, 1-norm, and infinity-norm. How are these norms used to calculate distances between feature vectors?
  • How does the choice of \(k\) in the kNN algorithm influence the algorithm’s decision boundary and predictions?
  • Why is standardization or scaling of features important in kNN?
  • How is kNN applied to multiclass classification problems?
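
The sketch below shows the three norms as distances and a bare-bones kNN classifier that takes a majority vote among the k closest training points. The function names and the toy data are hypothetical, and no feature scaling is applied.

```python
from collections import Counter


def distance(u, v, norm="2"):
    """Distance between feature vectors in the 1-, 2-, or infinity-norm."""
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if norm == "1":
        return sum(diffs)
    if norm == "inf":
        return max(diffs)
    return sum(d * d for d in diffs) ** 0.5


def knn_predict(X, y, query, k=3, norm="2"):
    """Majority label among the k training points closest to `query`."""
    nearest = sorted(range(len(X)), key=lambda i: distance(X[i], query, norm))[:k]
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy training set with two well-separated classes.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["red", "red", "red", "blue", "blue", "blue"]

print(distance([1, 2, 3], [4, 6, 3], "2"))
print(knn_predict(X, y, [1, 1]), knn_predict(X, y, [5, 4]))
```

The same vote works unchanged for more than two classes; an odd k avoids tied votes only in the binary case.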

Section 5.5

  • Explain how a probability vector arises from vote-based classification methods. How does it provide more information than a winner-takes-all approach?
  • Define the ROC curve and explain its significance in evaluating binary classifiers.
  • What is a decision threshold in the context of probabilistic classifiers? How does changing the threshold affect classification outcomes?
  • What is the AUC metric?
  • How can ROC curves and AUC scores be used to compare different classifiers or different settings of the same classifier?
  • Explain how the concepts of ROC curves and AUC scores extend to multiclass classification problems.

Chapter 6

Section 6.1

  • How is the bias of a classifier defined in the context of expected prediction error?
  • Explain the concept of variance in the context of machine learning models. How does variance affect a model’s performance on unseen data?
  • Explain how bias and variance contribute to the total prediction error of a model. Why do we refer to a “tradeoff” between bias and variance?
  • How does the size of the training set impact the bias and variance of a machine learning model? Answer using sketches of learning curves.
  • Discuss how a learner’s capacity to capture complex behavior influences both its bias and variance.
  • What might a large gap between training and test errors suggest about a model’s ability to generalize? What steps could be taken to reduce the gap?

Section 6.2

  • Define overfitting in the context of machine learning and explain why it is problematic for model generalization.
  • Why does a model that perfectly fits the training data not necessarily perform well on new, unseen data?
  • How does the choice of \(k\) in kNN affect the likelihood of overfitting, particularly with noisy data?
  • Explain how the depth of a decision tree can contribute to overfitting.
  • Compare and contrast overfitting and underfitting. How can one identify these issues based on the model’s performance?
  • How is overfitting related to the concept of variance in a model’s predictions? Discuss the relationship between overfitting and the gap between training and testing performance.
  • What strategies can be employed to reduce the risk of overfitting in machine learning models?
  • How does overfitting fit into the broader discussion of the bias-variance tradeoff?
  • Describe how learning curves can be used to diagnose overfitting.

Section 6.3

  • Define ensemble methods in machine learning and explain why they are used to improve model performance.
  • What is bootstrap aggregation (bagging), and how does it help in reducing the variance of a machine learning model?
  • Precisely how does bagging affect the bias and variance of a collection of \(n\) learners?
  • Is bagging better with constituent classifiers that have small bias and large variance, or that have large bias and small variance?
  • What is a random forest?
  • Which is more likely to improve bagging results: smaller individual training sets, or larger ones? Why?
  • How does selecting random subsets of features for each tree in an ensemble improve the ensemble performance?
  • What are the two chief disadvantages of using ensemble methods?

Section 6.4

  • Why is validation important in the process of selecting optimal hyperparameters and models?
  • Describe the steps involved in \(k\)-fold cross-validation.
  • Explain stratified \(k\)-fold cross-validation and when it might be preferred over standard \(k\)-fold cross-validation.
  • How can cross-validation be used to tune hyperparameters? Describe the process of creating a validation curve and interpreting its results.
  • Discuss how the variance of cross-validation scores across folds can inform us about a model’s reliability.
  • What is a grid search for hyperparameter optimization? When does it become impractical?
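
The index bookkeeping behind k-fold cross-validation can be sketched as follows. This version assumes the samples are already in random order, since it does no shuffling or stratification, and the function name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k (train, test) pairs.

    Each index appears in exactly one test fold. No shuffling is done.
    """
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = [j for j in range(n_samples) if j < start or j >= start + size]
        folds.append((train, test))
        start += size
    return folds


folds = kfold_indices(10, 3)
for train, test in folds:
    print(test, train)
```

A model would be fit on each `train` set and scored on the matching `test` set; the mean and spread of the k scores are what a validation curve plots against a hyperparameter.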