Applied Machine Learning
COMS W4995 Applied Machine Learning by Andreas C. Müller at Columbia University.
01/23/19 Introduction
Conda is a packaging tool and installer that aims to do more than pip: it handles library dependencies outside of Python packages as well as the Python packages themselves. Conda can also create virtual environments, like virtualenv does.
01/28/19 git, GitHub and testing
Usage | Git command |
---|---|
Changes between working directory and what was last staged | git diff |
Changes between staging area and last commit | git diff --staged |
Current Version (most recent commit) | HEAD |
Version before current | HEAD~1 |
Changes made in the last commit | git diff HEAD~1 |
Changes made in the last 2 commits | git diff HEAD~2 |
Compact history: one commit per line, all branches, as a graph | git log --oneline --decorate --all --graph |
Moves HEAD only; staging area and working directory unchanged | git reset --soft |
Moves HEAD and resets the staging area (default) | git reset --mixed |
Moves HEAD, resets the staging area and the working directory | git reset --hard |
Unit tests – function does the right thing.
Integration tests – system / process does the right thing.
Non-regression tests – bug got removed (and will not be reintroduced).
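For example, a minimal pytest sketch of the first and third kind; `mean` and the test names are made up for illustration:

```python
import pytest

def mean(values):
    """Average of a non-empty sequence of numbers."""
    return sum(values) / len(values)

def test_mean_basic():
    # unit test: the function does the right thing on a simple input
    assert mean([1, 2, 3]) == 2

def test_mean_empty_still_raises():
    # non-regression test: an empty input keeps raising instead of silently returning 0
    with pytest.raises(ZeroDivisionError):
        mean([])
```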
01/30/19 matplotlib and visualization
import matplotlib.pyplot as plt
ax = plt.gca() # get current axes
fig = plt.gcf() # get current figure
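A small sketch of the object-oriented interface built on these objects; the plotted data is arbitrary:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# create figure and axes explicitly instead of relying on gca()/gcf()
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot(x, np.sin(x))
axes[0].set_title("line plot")
axes[1].scatter(x, np.cos(x), s=5)
axes[1].set_title("scatter plot")
fig.tight_layout()
plt.show()
```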
02/04/19 Introduction to supervised learning
Naive Nearest Neighbor | Kd-tree Nearest Neighbor |
---|---|
fit: no time | fit: O(p * n log n) |
memory: O(n * p) | memory: O(n * p) |
predict: O(n * p) | predict: O(k * log(n)) FOR FIXED p! |
n=n_samples | p=n_features |
Parametric model: Number of “parameters” (degrees of freedom) independent of data.
Non-parametric model: Degrees of freedom increase with more data.
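A minimal sketch contrasting the two search strategies from the table above, assuming scikit-learn and a synthetic dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=1000, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for algorithm in ["brute", "kd_tree"]:   # naive search vs. kd-tree
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    knn.fit(X_train, y_train)            # the kd-tree is built here
    print(algorithm, knn.score(X_test, y_test))
```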
02/06/19 Preprocessing
est.fit_transform(X) == est.fit(X).transform(X) # mostly
est.fit_predict(X) == est.fit(X).predict(X) # mostly
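For instance, with StandardScaler as the estimator (a sketch; the arrays are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

X_a = StandardScaler().fit_transform(X)     # fit and transform in one call
X_b = StandardScaler().fit(X).transform(X)  # same result here
print(np.allclose(X_a, X_b))                # True for most transformers
```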
For high cardinality categorical features, we can use target-based encoding.
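A bare-bones sketch of the idea with pandas (column and values invented); a practical version would add smoothing and out-of-fold estimates to avoid target leakage:

```python
import pandas as pd

df = pd.DataFrame({"zip_code": ["10025", "10027", "10025", "11201"],
                   "price":    [2000,    1800,    2200,    2500]})

means = df.groupby("zip_code")["price"].mean()      # per-category target mean
df["zip_code_encoded"] = df["zip_code"].map(means)  # replace category by that mean
print(df)
```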
Power Transformation:
\begin{equation} bc_{\lambda}(x) = \begin{cases} \frac{x^{\lambda}-1}{\lambda}, & \mbox{if }\lambda \neq 0 \newline \log(x), & \mbox{if }\lambda = 0 \end{cases} \end{equation}
Only applicable for positive x!
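In scikit-learn this is available as PowerTransformer; a sketch on skewed synthetic data:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))         # strictly positive, right-skewed

pt = PowerTransformer(method="box-cox")   # use "yeo-johnson" if x can be <= 0
X_bc = pt.fit_transform(X)                # roughly Gaussian afterwards
print(pt.lambdas_)                        # the estimated lambda per feature
```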
02/11/19 Linear models for Regression
Imputation methods (see the sketch after this list):
- Mean/Median
- kNN
- Regression models
- Matrix factorization
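A minimal sketch of the simplest option with scikit-learn's SimpleImputer; kNN and model-based imputers (KNNImputer, IterativeImputer in newer releases) follow the same fit/transform pattern:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imp = SimpleImputer(strategy="mean")   # also "median", "most_frequent", "constant"
print(imp.fit_transform(X))            # missing entries replaced by column means
```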
Coefficient of determination $R^2$:
$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^{n-1}(y_i-\hat{y}_i)^2}{\sum_{i=0}^{n-1}(y_i-\bar{y})^2}$
$\bar{y} = \frac{1}{n}\sum_{i=0}^{n-1}y_i$
Can be negative for biased estimators, or when evaluated on the test set!
Ridge Regression:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_2|w|_2^2$
Lasso Regression:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_1|w|_1$
Elastic Net:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_1|w|_1 + \alpha_2|w|_2^2$
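The three penalized regressions in scikit-learn (a sketch with arbitrary regularization strengths; note that sklearn's ElasticNet is parameterized by alpha and l1_ratio rather than two separate alphas):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=30, noise=10, random_state=0)

for model in [Ridge(alpha=1.0),
              Lasso(alpha=0.1),
              ElasticNet(alpha=0.1, l1_ratio=0.5)]:
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))   # R^2 on the training data
```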
02/13/19 Linear models for Classification, SVMs
Logistic Regression:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1)$
Penalized Logistic Regression:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1) + |w|_2^2$
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1) + |w|_1$
(Soft Margin) Linear SVM
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \max(0, 1-y_i(w^Tx_i+b)) + |w|_2^2$
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \max(0, 1-y_i(w^Tx_i+b)) + |w|_1$
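A sketch of the penalized logistic regression and soft-margin linear SVM objectives in scikit-learn; C multiplies the data term, so smaller C means stronger regularization:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = [
    LogisticRegression(C=1.0, penalty="l2"),                      # log loss + L2
    LogisticRegression(C=1.0, penalty="l1", solver="liblinear"),  # log loss + L1
    LinearSVC(C=1.0, loss="hinge"),                               # hinge loss + L2
]
for model in models:
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))
```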
One vs Rest | One vs One |
---|---|
nclasses classifiers | nclasses*(nclasses-1)/2 classifiers |
trained on imbalanced datasets of original size | trained on balanced subsets |
retains some uncertainty | no uncertainty propagated |
Multinomial Logistic Regression
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} -\sum_{i=1}^n \log(p(y=y_i \mid x_i, w, b))$
$p(y=i \mid x) = \frac{e^{w^T_ix+b_i}}{\sum_{j=1}^{k}e^{w^T_jx+b_j}}$, where $k$ is the number of classes.
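A sketch with scikit-learn on the iris dataset; multi_class="multinomial" learns one weight vector per class jointly, as in the softmax above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000)
clf.fit(X, y)
print(clf.coef_.shape)            # (n_classes, n_features): one w_i per class
print(clf.predict_proba(X[:3]))   # softmax probabilities, each row sums to 1
```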
Kernel SVM: Read the book!
02/18/19 Trees, Forests & Ensembles
Trees are similar to nearest neighbors: both are non-parametric and make piecewise-constant predictions over regions of the input space.
Trees and nearest neighbors cannot extrapolate: outside the range of the training data their predictions stay constant.
Trees are not stable: small changes in the training data can produce a very different tree.
Bagging (Bootstrap AGGregation): a generic way to build “slightly different” models from bootstrap samples and average their predictions.
Generalization in ensembles depends on the strength of the individual classifiers and (inversely) on their correlation; decorrelating them can help, even at the expense of individual strength.
Random forests randomize in two ways (see the sketch after this list):
- For each tree: pick a bootstrap sample of the data.
- For each split: pick a random sample of the features.
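A sketch of both kinds of randomization in scikit-learn (synthetic data, arbitrary settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100,
                            bootstrap=True,       # bootstrap sample per tree
                            max_features="sqrt",  # random feature subset per split
                            random_state=0)
rf.fit(X, y)
print(rf.score(X, y))
```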
02/20/19 Gradient Boosting, Calibration
Gradient Boosting Algorithm:
$f_1(x) \approx y$
$f_2(x) \approx y - \gamma f_1(x)$
$f_3(x) \approx y - \gamma f_1(x) - \gamma f_2(x)$
Learning rate: $\gamma$
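A minimal numpy/scikit-learn sketch of this residual-fitting loop for squared loss, with shallow regression trees as the weak learners (names are illustrative; in practice use GradientBoostingRegressor, XGBoost or LightGBM):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    trees, residual = [], y.astype(float)
    for _ in range(n_stages):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        residual = residual - learning_rate * tree.predict(X)  # what is still unexplained
        trees.append(tree)
    return trees

def gradient_boost_predict(trees, X, learning_rate=0.1):
    # sum of the (shrunk) stage predictions
    return learning_rate * np.sum([t.predict(X) for t in trees], axis=0)
```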
Gradient Boosting Advantages:
- Slower to train than random forests, but much faster to predict.
- Very fast implementations available: XGBoost, LightGBM, pygbm, …
- Small model size.
- Usually more accurate than random forests.
When to use tree-based models:
- Model non-linear relationships.
- Insensitive to feature scaling; little need for feature engineering.
- A single tree is very interpretable (if small).
- Random forests are very robust, a good benchmark.
- Gradient boosting often gives the best performance, with careful tuning.
Calibration curve (reliability diagram): bin the predicted probabilities and, for each bin, plot the observed fraction of positives against the mean predicted probability; a well-calibrated classifier lies on the diagonal.
Brier Score (for binary classification): “mean squared error of probability estimate”
$BS = \frac{1}{n}\sum_{i=1}^n (p(y_i)-y_i)^2$, where $p(y_i)$ is the predicted probability of the positive class (measures both calibration and accuracy).
Calibrating a classifier:
Platt Scaling: $f_{platt} = \frac{1}{1+\exp(-ws(x)-b)}$, i.e. a 1d logistic regression fit to the classifier's decision score $s(x)$.
Isotonic Regression: Learn monotonically increasing step function in 1d.
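Both methods are available through CalibratedClassifierCV; a sketch calibrating a linear SVM's scores on synthetic data:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=0)

for method in ["sigmoid", "isotonic"]:     # Platt scaling / isotonic regression
    calibrated = CalibratedClassifierCV(LinearSVC(), method=method, cv=5)
    calibrated.fit(X, y)
    print(method, calibrated.predict_proba(X[:3])[:, 1])
```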
02/25/19 Model Evaluation
confusion matrix:
true class | predicted negative | predicted positive |
---|---|---|
negative class | TN | FP |
positive class | FN | TP |
$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$
$Precision = \frac{TP}{TP+FP}$
$Recall = \frac{TP}{TP+FN}$
$F = 2\frac{precision \times recall}{precision+recall}$
Precision-Recall curve: (x-axis Recall; y-axis Precision)
Average Precision:
$AveP = \sum_{k=1}^n P(k)\Delta r(k)$
$FPR = \frac{FP}{FP+TN}$
$TPR = \frac{TP}{TP+FN}$
ROC curve: (x-axis FPR; y-axis TPR)
ROC AUC: Area under ROC Curve (Always 0.5 for random/constant prediction)
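A sketch computing these metrics on an imbalanced synthetic problem; predict_proba supplies the scores needed for average precision and ROC AUC:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))    # [[TN, FP], [FN, TP]]
print(precision_score(y_test, y_pred),
      recall_score(y_test, y_pred),
      f1_score(y_test, y_pred))
print(average_precision_score(y_test, scores),
      roc_auc_score(y_test, scores))
```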
03/04/19 Learning with Imbalanced Data
03/06/19 Model Interpretation and Feature Selection