Applied Machine Learning


Notes from COMS W4995 Applied Machine Learning, taught by Andreas C. Müller at Columbia University.

01/23/19 Introduction


Conda is a packaging tool and installer that aims to do more than pip: it handles library dependencies outside of the Python packages as well as the Python packages themselves. Conda can also create virtual environments, much like virtualenv does.


01/28/19 git, GitHub and testing


| Usage | Git command |
| --- | --- |
| Changes between working directory and what was last staged | `git diff` |
| Changes between staging area and last commit | `git diff --staged` |
| Current version (most recent commit) | `HEAD` |
| Version before current | `HEAD~1` |
| Changes made in the last commit | `git diff HEAD~1` |
| Changes made in the last 2 commits | `git diff HEAD~2` |
| Compact commit history graph across all branches | `git log --oneline --decorate --all --graph` |
| Move HEAD to `<commit>` (takes the current branch with it) | `git reset --soft <commit>` |
| Move HEAD to `<commit>`, change the index to match (but not the working directory) | `git reset --mixed <commit>` |
| Move HEAD to `<commit>`, change the index and working tree to match | `git reset --hard <commit>` |

Unit tests – a single function does the right thing.

Integration tests – the whole system / process does the right thing.

Non-regression tests – a fixed bug stays fixed (and will not be reintroduced).
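A minimal pytest-style sketch of these test types (the `add` function and the test names are invented for illustration):

```python
# Minimal testing sketch; run with `pytest`. The function under test is illustrative.
def add(a, b):
    return a + b

def test_add():
    # Unit test: a single function does the right thing.
    assert add(2, 3) == 5

def test_add_negative_numbers():
    # Non-regression test: pins down a previously fixed bug so it is not reintroduced.
    assert add(-1, 1) == 0
```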


01/30/19 matplotlib and visualization


```python
import matplotlib.pyplot as plt

ax = plt.gca()   # get current axes
fig = plt.gcf()  # get current figure
```
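A minimal sketch that creates the figure and axes explicitly instead of relying on the current-figure/current-axes state (the plotted data is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots(figsize=(6, 4))  # create a figure and a single axes explicitly
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
plt.show()
```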


02/04/19 Introduction to supervised learning


|  | Naive Nearest Neighbor | Kd-tree Nearest Neighbor |
| --- | --- | --- |
| fit | no time | O(p * n log n) |
| memory | O(n * p) | O(n * p) |
| predict | O(n * p) | O(k * log(n)) for fixed p! |

(n = n_samples, p = n_features)

Parametric model: the number of “parameters” (degrees of freedom) is independent of the data (e.g., linear regression).

Non-parametric model: the degrees of freedom grow with more data (e.g., nearest neighbors).
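A minimal scikit-learn sketch tying this together (the iris data and hyperparameters are just for illustration); `algorithm="kd_tree"` selects the kd-tree variant from the table above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kd-tree: fit builds the tree, predict queries it; the model is non-parametric
# because it keeps the training data around.
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```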


02/06/19 Preprocessing


```python
est.fit_transform(X) == est.fit(X).transform(X)  # mostly
est.fit_predict(X) == est.fit(X).predict(X)      # mostly
```
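For example, with a StandardScaler (a sketch; any transformer would do):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler()

# The two code paths are equivalent for most transformers (fit_transform may be faster).
print(np.allclose(scaler.fit_transform(X),
                  StandardScaler().fit(X).transform(X)))  # True
```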

For high-cardinality categorical features, we can use target-based encoding: replace each category by a statistic of the target within that category (sketched below).
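A minimal pandas sketch of mean target encoding (the column names and values are made up; in practice the per-category means should be computed on training folds only, to avoid target leakage):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "SF", "NY", "LA", "SF", "NY"],
    "y":    [1,    0,    1,    0,    1,    0],
})

# Replace each category by the mean of the target within that category.
means = df.groupby("city")["y"].mean()
df["city_encoded"] = df["city"].map(means)
print(df)
```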

Power Transformation:

\begin{equation} bc_{\lambda}(x) = \begin{cases} \frac{x^{\lambda}-1}{\lambda}, & \mbox{if }\lambda \neq 0 \newline \log(x), & \mbox{if }\lambda = 0 \end{cases} \end{equation}

Only applicable for positive x!
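A minimal sketch with scikit-learn's PowerTransformer; `method="box-cox"` implements the transform above and therefore requires strictly positive inputs (the data here is synthetic):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(100, 1))          # skewed, strictly positive data

pt = PowerTransformer(method="box-cox")   # use "yeo-johnson" if x can be <= 0
X_trans = pt.fit_transform(X)
print(pt.lambdas_)                        # estimated lambda per feature
```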


02/11/19 Linear models for Regression


Imputation Methods (sketched in code below):

  1. Mean/Median
  2. kNN
  3. Regression models
  4. Matrix factorization
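A minimal scikit-learn sketch of the first two strategies (the array is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

mean_imputer = SimpleImputer(strategy="mean")  # 1. mean/median imputation
knn_imputer = KNNImputer(n_neighbors=2)        # 2. kNN imputation
print(mean_imputer.fit_transform(X))
print(knn_imputer.fit_transform(X))
```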

Coefficient of determination $R^2$:

$R^2(y, y') = 1 - \frac{\sum_{i=0}^{n-1}(y_i-y'_i)^2}{\sum_{i=0}^{n-1}(y_i-\bar{y})^2}$

$\bar{y} = \frac{1}{n}\sum_{i=0}^{n-1}y_i$

Can be negative for biased estimators, or on the test set!
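For example, with scikit-learn's `r2_score` (the numbers are arbitrary):

```python
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]
print(r2_score(y_true, y_pred))  # 1 - sum of squared residuals / total sum of squares
```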

Ridge Regression:

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_2|w|_2^2$

Lasso Regression:

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_1|w|_1$

Elastic Net:

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_1|w|_1 + \alpha_2|w|_2^2$
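A minimal scikit-learn sketch of the three penalized objectives (the alpha values and data are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    # The L1 penalty (Lasso, Elastic Net) drives some coefficients exactly to zero;
    # the L2 penalty (Ridge) only shrinks them.
    print(type(model).__name__, "zero coefficients:", (model.coef_ == 0).sum())
```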


02/13/19 Linear models for Classification, SVMs


Logistic Regression:

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1)$

Penalized Logistic Regression:

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1) + |w|_2^2$

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1) + |w|_1$

(Soft Margin) Linear SVM

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \max(0, 1-y_i(w^Tx_i+b)) + |w|_2^2$

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \max(0, 1-y_i(w^Tx_i+b)) + |w|_1$
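A minimal scikit-learn sketch of both penalized objectives (C values and data are illustrative; smaller C means stronger regularization):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

logreg = LogisticRegression(C=1.0, penalty="l2")  # log loss + L2 penalty
svm = LinearSVC(C=1.0)                            # hinge loss + L2 penalty
for model in (logreg, svm):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))
```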

| One vs Rest | One vs One |
| --- | --- |
| n_classes classifiers | n_classes * (n_classes - 1) / 2 classifiers |
| trained on imbalanced datasets of original size | trained on balanced subsets |
| retains some uncertainty | no uncertainty propagated |
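Both reductions are available as explicit wrappers in scikit-learn (a sketch on the iris data, which has 3 classes):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)   # n_classes classifiers
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # n_classes * (n_classes - 1) / 2 classifiers
print(len(ovr.estimators_), len(ovo.estimators_))  # 3 and 3 for iris
```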

Multinomial Logistic Regression

$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} -\sum_{i=1}^n \log(p(y=y_i \mid x_i, w, b))$

$p(y=i \mid x) = \frac{e^{w^T_ix+b_i}}{\sum_{j=1}^{k}e^{w^T_jx+b_j}}$, where $k$ is the number of classes.
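A minimal NumPy sketch of the softmax above (the weights, bias, and input are made up; one weight vector per class):

```python
import numpy as np

W = np.array([[1.0, 0.5],    # class 0
              [-0.5, 1.0],   # class 1
              [0.2, -1.0]])  # class 2
b = np.zeros(3)
x = np.array([0.3, -0.7])

scores = W @ x + b
p = np.exp(scores) / np.exp(scores).sum()  # softmax over the classes
print(p, p.sum())                          # probabilities sum to 1
```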

Kernel SVM: Read the book!


02/18/19 Trees, Forests & Ensembles


Trees are similar to Nearest Neighbors.

Trees and Nearest Neighbors cannot extrapolate.

Trees are not stable: small changes in the training data can produce a very different tree.

Bagging (Bootstrap AGGregation): Generic way to build “slightly different” models.

Generalization in ensembles depends on the strength of the individual classifiers and (inversely) on their correlation; decorrelating them can help, even at the expense of individual strength.

Random Forests randomize in two ways (sketched in code below):

  1. For each tree: pick bootstrap sample of data

  2. For each split: pick random sample of features
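A minimal scikit-learn sketch; bootstrapping is on by default and `max_features="sqrt"` controls the per-split feature subsampling (the dataset is just for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```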


02/20/19 Gradient Boosting, Calibration


Gradient Boosting Algorithm:

$f_1(x) \approx y$

$f_2(x) \approx y - \gamma f_1(x)$

$f_3(x) \approx y - \gamma f_1(x) - \gamma f_2(x)$

Learning rate: $\gamma$
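A minimal sketch of this recursion for squared loss, fitting each new shallow tree to the residual of the current ensemble (the data, tree depth, and learning rate are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

gamma = 0.1                    # learning rate
prediction = np.zeros_like(y)
for _ in range(100):
    residual = y - prediction                           # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += gamma * tree.predict(X)               # add the new weak learner

print(np.mean((y - prediction) ** 2))                   # training MSE shrinks with each stage
```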

Gradient Boosting Advantages:

  1. slower to train than RF, but much faster to predict

  2. very fast implementations exist: XGBoost, LightGBM, pygbm, …

  3. small model size

  4. usually more accurate than Random Forests

When to use tree-based models:

  1. Model non-linear relationships

  2. No need to scale features; little feature engineering required

  3. Single tree: very interpretable (if small)

  4. Random forests very robust, good benchmark

  5. Gradient boosting often gives the best performance with careful tuning

Calibration curve: plot binned predicted probabilities against the observed fraction of positives; a perfectly calibrated classifier lies on the diagonal.

Brier Score (for binary classification): “mean squared error of probability estimate”

$BS = \frac{1}{n}\sum_{i=1}^n (p(y_i)-y_i)^2$ (measures both calibration and accuracy)

Calibrating a classifier (sketched in code below):

  1. Platt Scaling: $f_{platt}(x) = \frac{1}{1+\exp(-w \cdot s(x)-b)}$, where $s(x)$ is the uncalibrated decision score.

  2. Isotonic Regression: learn a monotonically increasing step function in 1d.
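Both methods are available via scikit-learn's CalibratedClassifierCV (a sketch; `method="sigmoid"` is Platt scaling, `method="isotonic"` is isotonic regression):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
prob = calibrated.predict_proba(X_test)[:, 1]
print(brier_score_loss(y_test, prob))  # lower is better
```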


02/25/19 Model Evaluation


confusion matrix:

|  | predicted negative | predicted positive |
| --- | --- | --- |
| negative class | TN | FP |
| positive class | FN | TP |

$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$

$Precision = \frac{TP}{TP+FP}$

$Recall = \frac{TP}{TP+FN}$

$F = 2\frac{precision \times recall}{precision+recall}$

Precision-Recall curve: (x-axis Recall; y-axis Precision)

Average Precision:

$AveP = \sum_{k=1}^n P(k)\Delta r(k)$, where $P(k)$ is the precision at cutoff $k$ and $\Delta r(k)$ is the change in recall from $k-1$ to $k$.

$FPR = \frac{FP}{FP+TN}$

$TPR = \frac{TP}{TP+FN}$

ROC curve: (x-axis FPR; y-axis TPR)

ROC AUC: Area under ROC Curve (Always 0.5 for random/constant prediction)
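A minimal scikit-learn sketch of these metrics (labels and scores are made up):

```python
from sklearn.metrics import (accuracy_score, average_precision_score, confusion_matrix,
                             f1_score, precision_score, recall_score, roc_auc_score)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]  # predicted probability of the positive class

print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(average_precision_score(y_true, y_score),  # area under the precision-recall curve
      roc_auc_score(y_true, y_score))            # area under the ROC curve
```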


03/04/19 Learning with Imbalanced Data



03/06/19 Model Interpretation and Feature Selection