Applied Machine Learning
COMS W4995 Applied Machine Learning by Andreas C. Müller at Columbia University.
01/23/19 Introduction
Conda is a packaging tool and installer that aims to do more than pip: it handles library dependencies outside of Python packages as well as the Python packages themselves. Conda can also create virtual environments, like virtualenv does.
01/28/19 git, GitHub and testing
Usage | Git command |
---|---|
Changes between working directory and what was last staged | git diff |
Changes between staging area and last commit | git diff --staged |
Current Version (most recent commit) | HEAD |
Version before current | HEAD~1 |
Changes made in the last commit | git diff HEAD~1 |
Changes made in the last 2 commits | git diff HEAD~2 |
Compact history: one commit per line, all branches, as a graph | git log --oneline --decorate --all --graph |
Moves HEAD only; staging area and working directory unchanged | git reset --soft |
Moves HEAD and resets the staging area (default) | git reset --mixed |
Moves HEAD, resets the staging area and the working directory | git reset --hard |
Unit tests – function does the right thing.
Integration tests – system / process does the right thing.
Non-regression tests – bug got removed (and will not be reintroduced).
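For example, a minimal pytest sketch of the first and third kind; `mean` and the test names are made up for illustration:

```python
import pytest

def mean(values):
    """Average of a non-empty sequence of numbers."""
    return sum(values) / len(values)

def test_mean_basic():
    # unit test: the function does the right thing on a simple input
    assert mean([1, 2, 3]) == 2

def test_mean_empty_still_raises():
    # non-regression test: an empty input keeps raising instead of silently returning 0
    with pytest.raises(ZeroDivisionError):
        mean([])
```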
01/30/19 matplotlib and visualization
import matplotlib.pyplot as plt
ax = plt.gca() # get current axes
fig = plt.gcf() # get current figure
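A small sketch of the object-oriented interface built on these objects; the plotted data is arbitrary:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# create figure and axes explicitly instead of relying on gca()/gcf()
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot(x, np.sin(x))
axes[0].set_title("line plot")
axes[1].scatter(x, np.cos(x), s=5)
axes[1].set_title("scatter plot")
fig.tight_layout()
plt.show()
```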
02/04/19 Introduction to supervised learning
Naive Nearest Neighbor | Kd-tree Nearest Neighbor |
---|---|
fit: no time | fit: O(p * n log n) |
memory: O(n * p) | memory: O(n * p) |
predict: O(n * p) | predict: O(k * log(n)) FOR FIXED p! |
n=n_samples | p=n_features |
Parametric model: Number of “parameters” (degrees of freedom) independent of data.
Non-parametric model: Degrees of freedom increase with more data.
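A minimal sketch contrasting the two search strategies from the table above, assuming scikit-learn and a synthetic dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=1000, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for algorithm in ["brute", "kd_tree"]:   # naive search vs. kd-tree
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    knn.fit(X_train, y_train)            # the kd-tree is built here
    print(algorithm, knn.score(X_test, y_test))
```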
02/06/19 Preprocessing
est.fit_transform(X) == est.fit(X).transform(X) # mostly
est.fit_predict(X) == est.fit(X).predict(X) # mostly
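For instance, with StandardScaler as the estimator (a sketch; the arrays are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

X_a = StandardScaler().fit_transform(X)     # fit and transform in one call
X_b = StandardScaler().fit(X).transform(X)  # same result here
print(np.allclose(X_a, X_b))                # True for most transformers
```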
For high cardinality categorical features, we can use target-based encoding.
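A bare-bones sketch of the idea with pandas (column and values invented); a practical version would add smoothing and out-of-fold estimates to avoid target leakage:

```python
import pandas as pd

df = pd.DataFrame({"zip_code": ["10025", "10027", "10025", "11201"],
                   "price":    [2000,    1800,    2200,    2500]})

means = df.groupby("zip_code")["price"].mean()      # per-category target mean
df["zip_code_encoded"] = df["zip_code"].map(means)  # replace category by that mean
print(df)
```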
Power Transformation:
\begin{equation} bc_{\lambda}(x) = \begin{cases} \frac{x^{\lambda}-1}{\lambda}, & \mbox{if }\lambda \neq 0 \newline \log(x), & \mbox{if }\lambda = 0 \end{cases} \end{equation}
Only applicable for positive x!
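In scikit-learn this is available as PowerTransformer; a sketch on skewed synthetic data:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))         # strictly positive, right-skewed

pt = PowerTransformer(method="box-cox")   # use "yeo-johnson" if x can be <= 0
X_bc = pt.fit_transform(X)                # roughly Gaussian afterwards
print(pt.lambdas_)                        # the estimated lambda per feature
```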
02/11/19 Linear models for Regression
Imputation methods (see the sketch after this list):
- Mean/Median
- kNN
- Regression models
- Matrix factorization
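A minimal sketch of the simplest option with scikit-learn's SimpleImputer; kNN and model-based imputers (KNNImputer, IterativeImputer in newer releases) follow the same fit/transform pattern:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imp = SimpleImputer(strategy="mean")   # also "median", "most_frequent", "constant"
print(imp.fit_transform(X))            # missing entries replaced by column means
```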
Coefficient of determination $R^2$:
$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^{n-1}(y_i-\hat{y}_i)^2}{\sum_{i=0}^{n-1}(y_i-\bar{y})^2}$
$\bar{y} = \frac{1}{n}\sum_{i=0}^{n-1}y_i$
Can be negative for biased estimators, or when evaluated on the test set!
Ridge Regression:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_2|w|_2^2$
Lasso Regression:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_1|w|_1$
Elastic Net:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n |w^Tx_{i} + b - y_i|^2 + \alpha_1|w|_1 + \alpha_2|w|_2^2$
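The three penalized regressions in scikit-learn (a sketch with arbitrary regularization strengths; note that sklearn's ElasticNet is parameterized by alpha and l1_ratio rather than two separate alphas):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=30, noise=10, random_state=0)

for model in [Ridge(alpha=1.0),
              Lasso(alpha=0.1),
              ElasticNet(alpha=0.1, l1_ratio=0.5)]:
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))   # R^2 on the training data
```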
02/13/19 Linear models for Classification, SVMs
Logistic Regression:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} \sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1)$
Penalized Logistic Regression:
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1) + |w|_2^2$
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \log(\exp(-y_i(w^Tx_i+b))+1) + |w|_1$
(Soft Margin) Linear SVM
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \max(0, 1-y_i(w^Tx_i+b)) + |w|_2^2$
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} C\sum_{i=1}^n \max(0, 1-y_i(w^Tx_i+b)) + |w|_1$
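A sketch of the penalized logistic regression and soft-margin linear SVM objectives in scikit-learn; C multiplies the data term, so smaller C means stronger regularization:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = [
    LogisticRegression(C=1.0, penalty="l2"),                      # log loss + L2
    LogisticRegression(C=1.0, penalty="l1", solver="liblinear"),  # log loss + L1
    LinearSVC(C=1.0, loss="hinge"),                               # hinge loss + L2
]
for model in models:
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))
```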
One vs Rest | One vs One |
---|---|
nclasses classifiers | nclasses*(nclasses-1)/2 classifiers |
trained on imbalanced datasets of original size | trained on balanced subsets |
retains some uncertainty | no uncertainty propagated |
Multinomial Logistic Regression
$\min_{w \in \mathbb{R}^p, b \in \mathbb{R}} -\sum_{i=1}^n \log(p(y=y_i \mid x_i, w, b))$
$p(y=i \mid x) = \frac{e^{w^T_ix+b_i}}{\sum_{j=1}^{k}e^{w^T_jx+b_j}}$, where $k$ is the number of classes.
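A sketch with scikit-learn on the iris dataset; multi_class="multinomial" learns one weight vector per class jointly, as in the softmax above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000)
clf.fit(X, y)
print(clf.coef_.shape)            # (n_classes, n_features): one w_i per class
print(clf.predict_proba(X[:3]))   # softmax probabilities, each row sums to 1
```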
Kernel SVM: Read the book!
02/18/19 Trees, Forests & Ensembles
Trees are similar to nearest neighbors: both are non-parametric and make piecewise-constant predictions over regions of the input space.
Trees and nearest neighbors cannot extrapolate: outside the range of the training data their predictions stay constant.
Trees are not stable: small changes in the training data can produce a very different tree.
Bagging (Bootstrap AGGregation): a generic way to build “slightly different” models from bootstrap samples and average their predictions.
Generalization in ensembles depends on the strength of the individual classifiers and (inversely) on their correlation; decorrelating them can help, even at the expense of individual strength.
Random forests randomize in two ways (see the sketch after this list):
- For each tree: pick a bootstrap sample of the data.
- For each split: pick a random sample of the features.
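A sketch of both kinds of randomization in scikit-learn (synthetic data, arbitrary settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100,
                            bootstrap=True,       # bootstrap sample per tree
                            max_features="sqrt",  # random feature subset per split
                            random_state=0)
rf.fit(X, y)
print(rf.score(X, y))
```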
02/20/19 Gradient Boosting, Calibration
Gradient Boosting Algorithm:
$f_1(x) \approx y$
$f_2(x) \approx y - \gamma f_1(x)$
$f_3(x) \approx y - \gamma f_1(x) - \gamma f_2(x)$
Learning rate: $\gamma$
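A minimal numpy/scikit-learn sketch of this residual-fitting loop for squared loss, with shallow regression trees as the weak learners (names are illustrative; in practice use GradientBoostingRegressor, XGBoost or LightGBM):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    trees, residual = [], y.astype(float)
    for _ in range(n_stages):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        residual = residual - learning_rate * tree.predict(X)  # what is still unexplained
        trees.append(tree)
    return trees

def gradient_boost_predict(trees, X, learning_rate=0.1):
    # sum of the (shrunk) stage predictions
    return learning_rate * np.sum([t.predict(X) for t in trees], axis=0)
```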
Gradient Boosting Advantages:
- Slower to train than random forests, but much faster to predict.
- Very fast implementations available: XGBoost, LightGBM, pygbm, …
- Small model size.
- Usually more accurate than random forests.
When to use tree-based models:
- Model non-linear relationships.
- Insensitive to feature scaling; little need for feature engineering.
- A single tree is very interpretable (if small).
- Random forests are very robust, a good benchmark.
- Gradient boosting often gives the best performance, with careful tuning.
Calibration curve (reliability diagram): bin the predicted probabilities and, for each bin, plot the observed fraction of positives against the mean predicted probability; a well-calibrated classifier lies on the diagonal.
Brier Score (for binary classification): “mean squared error of probability estimate”
$BS = \frac{1}{n}\sum_{i=1}^n (p(y_i)-y_i)^2$, where $p(y_i)$ is the predicted probability of the positive class (measures both calibration and accuracy).
Calibrating a classifier:
Platt Scaling: $f_{platt} = \frac{1}{1+\exp(-ws(x)-b)}$, i.e. a 1d logistic regression fit to the classifier's decision score $s(x)$.
Isotonic Regression: Learn monotonically increasing step function in 1d.
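Both methods are available through CalibratedClassifierCV; a sketch calibrating a linear SVM's scores on synthetic data:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=0)

for method in ["sigmoid", "isotonic"]:     # Platt scaling / isotonic regression
    calibrated = CalibratedClassifierCV(LinearSVC(), method=method, cv=5)
    calibrated.fit(X, y)
    print(method, calibrated.predict_proba(X[:3])[:, 1])
```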
02/25/19 Model Evaluation
confusion matrix:
true class | predicted negative | predicted positive |
---|---|---|
negative class | TN | FP |
positive class | FN | TP |
$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$
$Precision = \frac{TP}{TP+FP}$
$Recall = \frac{TP}{TP+FN}$
$F = 2\frac{precision \times recall}{precision+recall}$
Precision-Recall curve: (x-axis Recall; y-axis Precision)
Average Precision:
$AveP = \sum_{k=1}^n P(k)\Delta r(k)$
$FPR = \frac{FP}{FP+TN}$
$TPR = \frac{TP}{TP+FN}$
ROC curve: (x-axis FPR; y-axis TPR)
ROC AUC: Area under ROC Curve (Always 0.5 for random/constant prediction)
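A sketch computing these metrics on an imbalanced synthetic problem; predict_proba supplies the scores needed for average precision and ROC AUC:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))    # [[TN, FP], [FN, TP]]
print(precision_score(y_test, y_pred),
      recall_score(y_test, y_pred),
      f1_score(y_test, y_pred))
print(average_precision_score(y_test, scores),
      roc_auc_score(y_test, scores))
```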
03/04/19 Learning with Imbalanced Data
03/06/19 Model Interpretation and Feature Selection