class: logo-slide

---

class: title-slide

## Topics in Classification

### Applications of Data Science - Class 13

### Giora Simchoni

#### `gsimchoni@gmail.com and add #dsapps in subject`

### Stat. and OR Department, TAU
### 2023-02-23

---
layout: true

<div class="my-footer">
  <span>
    <a href="https://dsapps-2023.github.io/Class_Slides/" target="_blank">Applications of Data Science
    </a>
  </span>
</div>

---

class: section-slide

# Topics in Classification

---

# Life isn't perfect

Let's tackle just a few issues:

- Not enough labelled data, and labelling data is expensive
- Imbalanced Classes

---

class: section-slide

# Active Learning

---

### Got Data?

```r
n <- 20
x1 <- rnorm(n, 0, 1); x2 <- rnorm(n, 0, 1)
t <- 2 - 4 * x1 + 3 * x2
y <- rbinom(n, 1, 1 / (1 + exp(-t)))

glm_mod <- glm(y ~ x1 + x2, family = "binomial")
```

<img src="images/AL-LR-Example-1.png" width="50%" />

---

### Want more?

> The key idea behind *active learning* is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns. An active learner may pose *queries*, usually in the form of unlabeled data instances to be labeled by an *oracle* (e.g., a human annotator). Active learning is well-motivated in many modern machine learning problems, where unlabeled data may be abundant or easily obtained, but labels are difficult, time-consuming, or expensive to obtain.

([Settles, 2010](http://burrsettles.com/pub/settles.activelearning.pdf))

> You want data? Well data costs! (No one, ever)

---

### Where this is going

<img src="images/active_learning_plan.png" style="width: 90%" />

---

### Active Learning Scenarios

1. **Membership Query Synthesis**: You get to choose which (maybe theoretical) points you'd want `\(y\)` labelled for.

2. **Stream-Based Selective Sampling**: You get 1 point at a time and decide which points you'd like to query and which to discard.

3. **Pool-Based Sampling**: You have a large collection of unlabelled points at your disposal; you need to send the "best ones" for labelling.

<img src="images/active_learning_scenarios.png" style="width: 70%" />

---

### Uncertainty Sampling

.insight[
💡 For a 2-class dataset, the observations your model is most uncertain of are...
]

<img src="images/AL-LR-Example2-1.png" width="50%" />

---

### Uncertainty Sampling Measures

Let `\(\hat{y}_i\)` be the predicted classes with the `\(i\)`th highest score (probability), for observations `\(x\)` under some model `\(\theta\)`.

So `\(\hat{y}_1 = \arg\max{P_{\theta}(y|x)}\)` are the actual predicted classes, `\(\hat{y}_2\)` are the second choices, etc.

* Least Confidence: Choose those observations for which `\(P_{\theta}(\hat{y}_1|x)\)` is smallest:

`\(x^*_{LC} = \arg\min{P_{\theta}(\hat{y}_1|x)}\)`

.insight[
💡 For a 2-class balanced dataset, this means...
]

---

* Margin Sampling: Choose those observations for which the margin between the two highest scores is smallest:

`\(x^*_M = \arg\min{P_{\theta}(\hat{y}_1|x) - P_{\theta}(\hat{y}_2|x)}\)`

.insight[
💡 For a 2-class balanced dataset, this means...
]

* Entropy: Choose the observations for which entropy is highest:

`\(x^*_H = \arg\max-{\sum_i P_{\theta}(\hat{y}_i|x) \log[P_{\theta}(\hat{y}_i|x)]}\)`

We will talk more about entropy in Neural Networks; in the code below we will equivalently minimize negative entropy.

.insight[
💡 For a 2-class balanced dataset, this means...
] --- ### Example: The `spotify_songs` data from HW3 ```r spotify_songs <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv') spotify_songs %>% count(playlist_genre) ``` ``` ## # A tibble: 6 × 2 ## playlist_genre n ## <chr> <int> ## 1 edm 6043 ## 2 latin 5155 ## 3 pop 5507 ## 4 r&b 5431 ## 5 rap 5746 ## 6 rock 4951 ``` Let's try to classify the genre of a song! --- We'll take only the 12 audio features as predictors, and choose each `track_id` once (remember each song appears a few times?): ```r library(tidymodels) predictors <- 12:23 spotify_songs <- spotify_songs %>% group_by(track_id) %>% slice_sample(n = 1) %>% ungroup() %>% distinct(track_name, .keep_all = TRUE) %>% select(track_id, track_name, track_artist, playlist_genre, predictors) %>% mutate(playlist_genre = recode(playlist_genre, "r&b" = "rnb")) set.seed(76) sptfy_split_obj <- spotify_songs %>% initial_split(prop = 0.8) sptfy_tr <- training(sptfy_split_obj) sptfy_te <- testing(sptfy_split_obj) ``` --- Plot twist! We only have 20 songs from each genre! ```r set.seed(1) sptfy_tr_small <- sptfy_tr %>% group_by(playlist_genre) %>% slice_sample(n = 20) %>% ungroup() sptfy_tr_small %>% count(playlist_genre) ``` ``` ## # A tibble: 6 × 2 ## playlist_genre n ## <chr> <int> ## 1 edm 20 ## 2 latin 20 ## 3 pop 20 ## 4 rap 20 ## 5 rnb 20 ## 6 rock 20 ``` Muhaha! --- We'll also have a pool of songs to query, `sptfy_tr_large`: ```r sptfy_tr_large <- sptfy_tr %>% anti_join(sptfy_tr_small, by = "track_id") ``` We `bake()` the 3 datasets with the small sample params recipe: ```r sptfy_rec <- recipe(playlist_genre ~ ., data = sptfy_tr_small) %>% update_role(track_id, track_name, track_artist, new_role = "id") %>% step_normalize(all_numeric(), -has_role("id")) %>% step_string2factor(playlist_genre) %>% prep(sptfy_tr_small, strings_as_factors = FALSE) sptfy_tr_small <- juice(sptfy_rec) sptfy_tr_large <- bake(sptfy_rec, new_data = sptfy_tr_large) sptfy_te <- bake(sptfy_rec, new_data = sptfy_te) ``` --- Let's build a simple GBT model: ```r mod_spec <- boost_tree(mode = "classification", trees = 100) %>% set_engine("xgboost", eval_metric = "mlogloss") mod_fit <- mod_spec %>% fit(playlist_genre ~ ., data = sptfy_tr_small %>% select(-track_id, -track_name, -track_artist)) mod_pred <- mod_fit %>% predict(new_data = sptfy_tr_large, type = "prob") mod_pred ``` ``` ## # A tibble: 18,640 × 6 ## .pred_edm .pred_latin .pred_pop .pred_rap .pred_rnb .pred_rock ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.926 0.00230 0.0625 0.00202 0.000854 0.00590 ## 2 0.00472 0.00170 0.0112 0.000684 0.00106 0.981 ## 3 0.0212 0.613 0.00471 0.300 0.0504 0.00994 ## 4 0.00220 0.000616 0.00261 0.000749 0.993 0.000780 ## 5 0.913 0.0296 0.0172 0.0347 0.00192 0.00370 ## 6 0.171 0.237 0.0459 0.514 0.00756 0.0247 ## 7 0.00155 0.00378 0.0154 0.00124 0.970 0.00794 ## 8 0.255 0.351 0.00678 0.270 0.114 0.00356 ## 9 0.00444 0.00184 0.0212 0.000883 0.0409 0.931 ## 10 0.109 0.589 0.125 0.0267 0.118 0.0325 ## # … with 18,630 more rows ``` --- Test accuracy? ```r mod_te_pred_class <- mod_fit %>% predict(new_data = sptfy_te) %>% bind_cols(sptfy_te) mod_te_pred_class %>% accuracy(truth = playlist_genre, estimate = .pred_class) ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy multiclass 0.363 ``` Remember this model was built on 120 of almost 19K available unique songs! 
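---

The simulation shown later in this section compares this kind of querying with random querying (see the slides Rmd files for the full code). A minimal sketch of a single pool-based query round using Least Confidence, assuming the objects defined above; the batch size `k` and the object names are arbitrary choices for illustration:

```r
k <- 60

labeled <- sptfy_tr_small
pool <- sptfy_tr_large

# score the pool with the current model and take the k least confident songs
pool_probs <- mod_fit %>%
  predict(new_data = pool, type = "prob")
lc <- apply(as.matrix(pool_probs), 1, max)
queried <- order(lc)[1:k]

# the "oracle" (here simply the known labels) reveals the genre of the queried songs
labeled <- bind_rows(labeled, pool[queried, ])
pool <- pool[-queried, ]

# refit on the enlarged labeled set and re-check test accuracy
mod_fit2 <- mod_spec %>%
  fit(playlist_genre ~ .,
      data = labeled %>% select(-track_id, -track_name, -track_artist))

mod_fit2 %>%
  predict(new_data = sptfy_te) %>%
  bind_cols(sptfy_te) %>%
  accuracy(truth = playlist_genre, estimate = .pred_class)
```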
--- Test Recall and Precision: ```r mod_te_pred_class %>% group_by(playlist_genre) %>% accuracy(truth = playlist_genre, estimate = .pred_class) %>% select(playlist_genre, recall = .estimate) %>% bind_cols( mod_te_pred_class %>% group_by(.pred_class) %>% accuracy(truth = playlist_genre, estimate = .pred_class) %>% select(precision = .estimate) ) ``` ``` ## # A tibble: 6 × 3 ## playlist_genre recall precision ## <fct> <dbl> <dbl> ## 1 edm 0.454 0.517 ## 2 latin 0.376 0.255 ## 3 pop 0.25 0.249 ## 4 rap 0.318 0.406 ## 5 rnb 0.305 0.312 ## 6 rock 0.472 0.487 ``` --- Build a function which will take each row of predicted probs and return a list of 3 uncertainty metrics: ```r uncertainty_lc <- function(probs) { max(probs) } uncertainty_m <- function(probs) { o <- order(probs, decreasing = TRUE) probs[o[1]] - probs[o[2]] } uncertainty_h <- function(probs) { sum(probs * log(probs + 0.000001)) } uncertainty <- function(...) { probs <- c(...) list( lc = uncertainty_lc(probs), margin = uncertainty_m(probs), entropy = uncertainty_h(probs) ) } ``` --- ```r mod_unc <- mod_pred %>% pmap_dfr(uncertainty) mod_unc ``` ``` ## # A tibble: 18,640 × 3 ## lc margin entropy ## <dbl> <dbl> <dbl> ## 1 0.926 0.864 -0.307 ## 2 0.981 0.969 -0.118 ## 3 0.613 0.313 -0.964 ## 4 0.993 0.990 -0.0514 ## 5 0.913 0.878 -0.406 ## 6 0.514 0.278 -1.25 ## 7 0.970 0.955 -0.172 ## 8 0.351 0.0816 -1.37 ## 9 0.931 0.890 -0.321 ## 10 0.589 0.464 -1.27 ## # … with 18,630 more rows ``` Obviously these are correlated: --- .pull-left[ ```r mod_unc %>% slice_sample(n = 1000) %>% ggplot(aes(lc, margin)) + geom_point() + theme_light() + theme(text = element_text(size = 14)) ``` <img src="images/AL-MR-Lc-Margin-1.png" width="100%" /> ] .pull-right[ ```r mod_unc %>% slice_sample(n = 1000) %>% ggplot(aes(lc, entropy)) + geom_point() + theme_light() + theme(text = element_text(size = 14)) ``` <img src="images/AL-MR-Lc-Entropy-1.png" width="100%" /> ] --- Which are the top 10 songs in terms of each metric the model is most curious about? ```r sptfy_tr_large_with_unc <- sptfy_tr_large %>% bind_cols(mod_unc) %>% select(track_name, track_artist, playlist_genre, lc, margin, entropy) sptfy_tr_large_with_unc %>% slice_min(lc, n = 10) %>% arrange(lc, track_name) ``` ``` ## # A tibble: 10 × 6 ## track_name track_artist playlist_genre lc margin entropy ## <chr> <chr> <fct> <dbl> <dbl> <dbl> ## 1 Deja Ali Aka Mind rap 0.205 5.76e-4 -1.72 ## 2 Bag In Da Box (feat. Orrem… DJ Tune edm 0.211 2.60e-5 -1.74 ## 3 Return of the Mack - C&J E… Mark Morris… rnb 0.216 3.52e-2 -1.78 ## 4 Mad Money Dave Nazza edm 0.227 1.15e-2 -1.64 ## 5 Banda De Camión (Remix) [f… El Alfa latin 0.229 3.24e-2 -1.74 ## 6 Told You So T.I. rap 0.232 1.88e-3 -1.67 ## 7 American Boyfriend Kevin Abstr… rnb 0.234 7.52e-4 -1.73 ## 8 Neptune EDX edm 0.235 1.62e-2 -1.67 ## 9 DJ No Pare (feat. Zion, Da… Justin Quil… latin 0.238 1.09e-2 -1.72 ## 10 tonite LCD Soundsy… pop 0.239 6.48e-3 -1.68 ``` --- ```r sptfy_tr_large_with_unc %>% slice_min(margin, n = 10) %>% arrange(margin, track_name) ``` ``` ## # A tibble: 10 × 6 ## track_name track_artist playlist_genre lc margin entropy ## <chr> <chr> <fct> <dbl> <dbl> <dbl> ## 1 Bag In Da Box (feat. Orrem… DJ Tune edm 0.211 2.60e-5 -1.74 ## 2 Right Now (feat. 
Future, F… Uncle Murda  rap            0.343 9.22e-5  -1.17
##  3 Traidora                    Gente De Zo… latin          0.315 9.25e-5  -1.40
##  4 Dancing In the Dark         Bruce Sprin… rock           0.290 1.04e-4  -1.52
##  5 The Kids Are Coming         Tones and I  pop            0.330 1.21e-4  -1.40
##  6 Legacy - Radio Edit         Nicky Romero edm            0.495 1.73e-4  -0.760
##  7 Cause We Can                SMO          rap            0.288 2.09e-4  -1.61
##  8 My Feelings                 DiRTY RADiO  pop            0.389 2.80e-4  -1.24
##  9 what you doin'              Masked Man   rap            0.315 2.84e-4  -1.36
## 10 Был хулиганом               StaFFord63   rap            0.424 2.95e-4  -1.16
```

---

```r
sptfy_tr_large_with_unc %>%
  slice_min(entropy, n = 10) %>%
  arrange(entropy, track_name)
```

```
## # A tibble: 10 × 6
##    track_name                  track_artist playlist_genre    lc  margin entropy
##    <chr>                       <chr>        <fct>          <dbl>   <dbl>   <dbl>
##  1 Return of the Mack - C&J E… Mark Morris… rnb            0.216 3.52e-2   -1.78
##  2 Want Her                    Mustard      rap            0.263 7.40e-2   -1.75
##  3 When We Face Reality - Khe… Sahar Z      edm            0.258 6.26e-2   -1.74
##  4 Bag In Da Box (feat. Orrem… DJ Tune      edm            0.211 2.60e-5   -1.74
##  5 Banda De Camión (Remix) [f… El Alfa      latin          0.229 3.24e-2   -1.74
##  6 En Su Nota                  Don Omar     latin          0.275 8.87e-2   -1.74
##  7 Isn't It Time               The Babys    rock           0.283 9.48e-2   -1.74
##  8 original me (feat. dan rey… YUNGBLUD     rock           0.248 3.36e-2   -1.73
##  9 Fall in Love                GoldLink     rnb            0.259 3.26e-2   -1.73
## 10 American Boyfriend          Kevin Abstr… rnb            0.234 7.52e-4   -1.73
```

---

So far it's only interesting. Will sending the observations our model is most curious about to the "oracle" prove to increase test accuracy better than random observations?

See full code in slides Rmd files.

<img src="images/Simul-Unc-1.png" width="100%" />

---

### Query by Committee (QBC)

Similar to ensemble models, we have a committee of models: `\(C = \{\theta_1, ..., \theta_C\}\)`

Which observations is the committee most uncertain of? E.g. vote entropy:

`\(x^*_{VE} = \arg\max-{\sum_i \frac{V(\hat{y}_i|x)}{|C|}\log{\frac{V(\hat{y}_i|x)}{|C|}}}\)`

where `\(V(\hat{y}_i|x)\)` is the number of votes for `\(\hat{y}_i\)`.

How do you get a committee?

- Different models
- Bagging
- Same model, different subsets of features
- Same model, different params

---

Let's do 6 GBT models, each receiving 2 different consecutive features:

```r
fit_sub_model <- function(i, tr, te) {
  mod_fit <- mod_spec %>%
    fit(playlist_genre ~ .,
        data = tr %>% select(playlist_genre, (2 + i * 2):(3 + i * 2)))
  mod_fit %>% predict(new_data = te)
}

mod_pred <- map_dfc(1:6, fit_sub_model, tr = sptfy_tr_small, te = sptfy_tr_large)

mod_pred
```

```
## # A tibble: 18,640 × 6
##    .pred_class...1 .pred_class...2 .pred_class...3 .pred_class...4
##    <fct>           <fct>           <fct>           <fct>
##  1 edm             edm             latin           pop
##  2 edm             pop             rnb             edm
##  3 latin           rnb             edm             latin
##  4 rnb             rnb             rock            rap
##  5 rnb             rock            rap             latin
##  6 rnb             rock            rap             edm
##  7 rnb             rap             rock            rap
##  8 rap             rnb             rap             rnb
##  9 edm             latin           latin           rock
## 10 pop             edm             edm             latin
## # … with 18,630 more rows, and 2 more variables: .pred_class...5 <fct>,
## #   .pred_class...6 <fct>
```

---

```r
mod_qbc <- mod_pred %>%
  mutate(probs = pmap(
    select(., starts_with(".pred")),
    function(...)
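      # table(c(...)) counts the committee's 6 votes per class; dividing by 6 turns counts into vote proportions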
table(c(...)) / 6), vote_entropy = map_dbl(probs, uncertainty_h), vote_margin = map_dbl(probs, uncertainty_m)) sptfy_tr_large_with_qbc <- sptfy_tr_large %>% bind_cols(mod_qbc) %>% select(track_name, track_artist, playlist_genre, starts_with(".pred"), vote_entropy) sptfy_tr_large_with_qbc %>% slice_min(vote_entropy, n = 10) %>% arrange(vote_entropy) %>% select(starts_with(".pred")) ``` ``` ## # A tibble: 214 × 6 ## .pred_class...17 .pred_class...18 .pred_class...19 .pred_class...20 ## <fct> <fct> <fct> <fct> ## 1 edm rap rnb rock ## 2 rap rnb latin pop ## 3 rock latin rnb edm ## 4 rnb latin rap edm ## 5 rock rnb edm rap ## 6 edm rap rock latin ## 7 rnb edm rap rock ## 8 rock pop edm rnb ## 9 edm rnb pop rap ## 10 pop rock rnb edm ## # … with 204 more rows, and 2 more variables: .pred_class...21 <fct>, ## # .pred_class...22 <fct> ``` --- Will sending the observations our committee is in most disagreement about to the "oracle" prove to increase test accuracy better than random observations? See full code in slides Rmd files. <img src="images/Simul-QBC-1.png" width="100%" /> --- ### Other Active Learning Metrics - Expected Model Change - Expected Error Reduction - Variance Reduction - And more... --- class: section-slide # Imbalanced Classes --- ### Typical examples of Imbalanced Classes scenarios - Rare diseases: [this](https://www.kaggle.com/c/hivprogression) dataset contains genetic data for 1,000 HIV patients, 206 out of 1,000 patients improved after 16 weeks of therapy - Conversion/Sell/CTR rates: [this](https://www.kaggle.com/c/avazu-ctr-prediction) dataset contains 10 days of Click-Through-Rate data for Avazu mobile ads, ~6.8M clicked out of ~40.4M - Fraud detection: [this](https://www.kaggle.com/mlg-ulb/creditcardfraud) dataset contains credit card transactions for a major European CC, 492 frauds out of 284,807 transactions --- ### What's so difficult about imbalanced classes? ```r okcupid_pets <- as_tibble(read_rds("../data/okcupid3_imp_mice.rds")) idx <- read_rds("../data/okcupid3_idx.rda") train_idx <- idx$train_idx valid_idx <- idx$valid_idx test_idx <- idx$test_idx ok_train <- okcupid_pets[train_idx, ] ok_valid <- okcupid_pets[valid_idx, ] ok_train %>% count(pets) %>% mutate(pct = round(n / sum(n), 2)) ``` ``` ## # A tibble: 2 × 3 ## pets n pct ## <fct> <int> <dbl> ## 1 cats 1624 0.16 ## 2 dogs 8376 0.84 ``` .insight[ 💡 What's a sure way to get 84% accuracy? ] --- ```r mod_glm <- glm(pets ~ ., data = ok_train, family = "binomial") pred_glm <- 1 - predict(mod_glm, ok_valid, type = "response") pred_glm_class <- ifelse(pred_glm > 0.5, "cats", "dogs") true_class <- ok_valid$pets table(true_class, pred_glm_class) ``` ``` ## pred_glm_class ## true_class cats dogs ## cats 66 407 ## dogs 62 2194 ``` ```r report_accuracy_and_auc(true_class, pred_glm) ``` ``` ## Setting direction: controls < cases ``` ``` ## AUC: 0.736 ## ACC: 0.828 ## Cats: Recall: 0.14 ## Precision: 0.516 ## Dogs: Recall: 0.973 ## Precision: 0.844 ``` --- ### Remedies for Imbalanced Classes - Model level - Tuning parameters and Cutoff choice - Cost-aware training: Case weights and Prior probabilities - Data level - Down sampling - Up sampling - Get more data and features from minority class, similar to Active Learning - Change of Framework - Anomaly Detection - One final word of wisdom --- ### Tuning parameters and cutoff choice A general good approach would be: 1. Choose a model to maximize AUC on one part of the training dataset (using resampling) 2. Choose a cutoff score on another part of the training dataset 3. 
Fit the whole thing on the entire training set and check it on the test set

But. You could incorporate your initial goal even into (1), making the cutoff another tuning parameter that would maximize:

- Recall(cats): If never missing a cat person (the minority class) is your job .font80percent[(while maintaining an acceptable level of Precision(cats))]
- Precision(cats): If you don't have room for error when you say a person is a cat person .font80percent[(while maintaining an acceptable level of Recall(cats))]
- Some other metric like F1-score

---

Let us choose a model by maximizing AUC, then present our client with a few potential cutoffs.

We'll begin by splitting our training set into two:

```r
ok_split <- initial_split(ok_train, prop = 0.7, strata = pets)
ok_train1 <- training(ok_split)
ok_train2 <- testing(ok_split)

dim(ok_train1)
```

```
## [1] 6999   38
```

```r
dim(ok_train2)
```

```
## [1] 3001   38
```

---

Use the first training set to choose a GBT model to maximize AUC with 5-fold CV:

```r
mod_gbt_spec <- boost_tree(mode = "classification",
                           mtry = tune(),
                           min_n = tune(),
                           learn_rate = tune(),
                           trees = 1000) %>%
  set_engine("xgboost", eval_metric = "logloss")

gbt_grid <- grid_regular(mtry(range(10, 50)),
                         min_n(range(10, 100)),
                         learn_rate(range(-3, -1)), levels = 3)

rec_gbt <- recipe(pets ~ ., data = ok_train1) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep(ok_train1)

cv_splits <- vfold_cv(juice(rec_gbt), v = 5, strata = pets)
```

---

```r
tune_res <- tune_grid(object = mod_gbt_spec,
                      preprocessor = recipe(pets~., data = juice(rec_gbt)),
                      resamples = cv_splits,
                      grid = gbt_grid,
                      control = control_grid(verbose = TRUE),
                      metrics = metric_set(roc_auc))
```

<img src="images/Imbalanced-AUC-Res-1.png" width="100%" />

---

Fit the best model on all of `ok_train1` and get scores on `ok_train2`:

```r
mod_gbt_spec <- mod_gbt_spec %>%
  update(mtry = 30, trees = 1000, min_n = 10, learn_rate = 0.01)

mod_gbt <- mod_gbt_spec %>%
  fit(pets ~ ., data = juice(rec_gbt))

pred_gbt <- mod_gbt %>%
  predict(new_data = bake(rec_gbt, ok_train2), type = "prob") %>%
  pull(.pred_cats)
```

---

You can use the ROC curve to understand the behavior of the cutoff:

<img src="images/Imbalanced-AUC-ROC-1.png" width="60%" />

---

Maybe better, draw a histogram of the `cats` score and mark the cutoffs there:

<img src="images/Imbalanced-Score-Host-1.png" width="100%" />

---

Lastly, train on the entire training set and evaluate on the test set:

```r
mod_gbt <- mod_gbt_spec %>%
  fit(pets ~ ., data = bake(rec_gbt, ok_train))

pred_gbt <- mod_gbt %>%
  predict(new_data = bake(rec_gbt, ok_valid), type = "prob") %>%
  pull(.pred_cats)

pred_gbt_class <- ifelse(pred_gbt > 0.17, "cats", "dogs")
true_class <- ok_valid$pets

table(true_class, pred_gbt_class)
```

```
##           pred_gbt_class
## true_class cats dogs
##       cats  307  166
##       dogs  649 1607
```

```r
report_accuracy_and_auc(true_class, pred_gbt, cutoff = 0.17)
```

```
## AUC: 0.746
## ACC: 0.701
## Cats: Recall: 0.649
##       Precision: 0.321
## Dogs: Recall: 0.712
##       Precision: 0.906
```

---

### Cost aware training: Case weights

For example, in `glm()` you can simply specify a `weights` param:

> when the elements of weights are positive integers *w_i*, each response *y_i* is the mean of *w_i* unit-weight observations

```r
pets_weights <- rep(1, nrow(ok_train))
pets_weights[which(ok_train$pets == "cats")] <- 5

mod_glm <- glm(pets ~ ., data = ok_train,
* family = "binomial", weights = pets_weights)

pred_glm <- 1 - predict(mod_glm, ok_valid, type = "response")
pred_glm_class <- ifelse(pred_glm > 0.5, "cats", "dogs")
true_class
<- ok_valid$pets ``` --- ```r table(true_class, pred_glm_class) ``` ``` ## pred_glm_class ## true_class cats dogs ## cats 303 170 ## dogs 655 1601 ``` ```r report_accuracy_and_auc(true_class, pred_glm) ``` ``` ## AUC: 0.736 ## ACC: 0.698 ## Cats: Recall: 0.641 ## Precision: 0.316 ## Dogs: Recall: 0.71 ## Precision: 0.904 ``` But this is almost equivalent to up sampling. A more intelligent use of class weights would be something like using the `class.weights` parameter in `e1071::svm()` --- ### Cost aware training: Prior probabilities **Small Detour: Naive Bayes** You know Bayes' Theorem, right? `\(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\)` or `\(posterior = \frac{likelihood \cdot prior}{evidence}\)` So what would be the posterior probability of class `\(C_k\)` given that we've seen observation `\(x_i\)`? `\(P(C_k|x_i) = \frac{P(x_i|C_k)P(C_k)}{P(x_i)}\)` --- `\(P(C_k|x_i) = \frac{P(x_i|C_k)P(C_k)}{P(x_i)}\)` In words: the likelihood of seeing an observation like `\(x_i\)` in all class `\(C_k\)` observations, times the prior of class `\(C_k\)` observations, divided by the evidence seeing an observation like `\(x_i\)` in general. .insight[ 💡 What increases `\(P(C_k|x_i)\)`? What decreases it? ] But if we have, say 100 predictors, each categorical with 2 levels - we'd have to pre-compute `\(2^{100}\)` possibilities for each `\(C_k\)`! --- Enter *Naive* Bayes: Assume that all predictors `\(X\)` are mutually independent, conditional on the class `\(C_k\)`, and so: `\(P(x_i|C_k) = \prod_{j = 1}^pP(x_{ij}|C_k)\)` And so: `\(P(C_k|x_i) = \frac{\prod P(x_{ij}|C_k)P(C_k)}{P(x_i)}\)` And we can further expand: `\(P(x_i) = \sum_k P(x_{i}|C_k)P(C_k)\)` .insight[ 💡 How would you compute `\(P(x_{ij}|C_k)\)` when `\(x_{ij}\)` is continuous? ] --- ```r library(naivebayes) mod_nb <- naive_bayes(pets ~ ., data = ok_train) pred_nb <- predict(mod_nb, ok_valid, type = "prob")[, "cats"] pred_nb_class <- ifelse(pred_nb > 0.5, "cats", "dogs") table(true_class, pred_nb_class) ``` ``` ## pred_nb_class ## true_class cats dogs ## cats 180 293 ## dogs 294 1962 ``` ```r report_accuracy_and_auc(true_class, pred_nb) ``` ``` ## AUC: 0.718 ## ACC: 0.785 ## Cats: Recall: 0.381 ## Precision: 0.38 ## Dogs: Recall: 0.87 ## Precision: 0.87 ``` --- BTW, *are* our features mutually independent? ```r ok_train %>% filter(pets == "cats") %>% select_if(is.numeric) %>% cor() %>% corrplot::corrplot() ``` <img src="images/Imbalanced-Cor-Matrix-1.png" width="50%" /> --- In the context of imbalanced classes you could just give a 5 times more weight to the score of cats by specifying different prior probabilities `\(P(C_k)\)`: ```r mod_nb <- naive_bayes(pets ~ ., data = ok_train, prior = c(5, 1)) pred_nb <- predict(mod_nb, ok_valid, type = "prob")[, "cats"] pred_nb_class <- ifelse(pred_nb > 0.5, "cats", "dogs") table(true_class, pred_nb_class) ``` ``` ## pred_nb_class ## true_class cats dogs ## cats 398 75 ## dogs 1251 1005 ``` ```r report_accuracy_and_auc(true_class, pred_nb) ``` ``` ## AUC: 0.718 ## ACC: 0.514 ## Cats: Recall: 0.841 ## Precision: 0.241 ## Dogs: Recall: 0.445 ## Precision: 0.931 ``` --- ### Down Sampling Yes, down sampling the majority class, usually to make it the same amount as the minority class (but you can tune this parameter as any other). You'd be surprised. 
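In `tidymodels` it is a recipe step (next chunk); conceptually it is just a grouped sample. A rough sketch, assuming `ok_train` from above (the seed is arbitrary):

```r
# keep all 1,624 cats and an equally sized random sample of dogs
set.seed(42)
ok_train %>%
  group_by(pets) %>%
  slice_sample(n = min(count(ok_train, pets)$n)) %>%
  ungroup() %>%
  count(pets)
```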
```r
rec_gbt <- recipe(pets ~ ., data = ok_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
* themis::step_downsample(pets, under_ratio = 1) %>%
  prep(ok_train)
```

If you want to stay in the `tidymodels` framework you can install the `themis` package for extra recipe steps for dealing with unbalanced data.

---

.warning[
⚠️ Never down-sample the testing set! Look at the `skip` parameter.
]

```r
juice(rec_gbt) %>% count(pets)
```

```
## # A tibble: 2 × 2
##   pets      n
##   <fct> <int>
## 1 cats   1624
## 2 dogs   1624
```

```r
bake(rec_gbt, ok_valid) %>% count(pets)
```

```
## # A tibble: 2 × 2
##   pets      n
##   <fct> <int>
## 1 cats    473
## 2 dogs   2256
```

---

```r
mod_gbt <- mod_gbt_spec %>%
  fit(pets ~ ., data = juice(rec_gbt))

pred_gbt <- mod_gbt %>%
  predict(new_data = bake(rec_gbt, ok_valid), type = "prob") %>%
  pull(.pred_cats)

pred_gbt_class <- ifelse(pred_gbt > 0.5, "cats", "dogs")

table(true_class, pred_gbt_class)
```

```
##           pred_gbt_class
## true_class cats dogs
##       cats  323  150
##       dogs  733 1523
```

```r
report_accuracy_and_auc(true_class, pred_gbt)
```

```
## AUC: 0.734
## ACC: 0.676
## Cats: Recall: 0.683
##       Precision: 0.306
## Dogs: Recall: 0.675
##       Precision: 0.91
```

---

### Up Sampling

The main disadvantage of down sampling is of course the loss of data.

Will replicating (minority class) data do any better?

```r
rec_gbt <- recipe(pets ~ ., data = ok_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
* themis::step_upsample(pets, over_ratio = 1) %>%
  prep(ok_train)

mod_gbt <- mod_gbt_spec %>%
  fit(pets ~ ., data = juice(rec_gbt))

pred_gbt <- mod_gbt %>%
  predict(new_data = bake(rec_gbt, ok_valid), type = "prob") %>%
  pull(.pred_cats)

pred_gbt_class <- ifelse(pred_gbt > 0.5, "cats", "dogs")
```

---

```r
table(true_class, pred_gbt_class)
```

```
##           pred_gbt_class
## true_class cats dogs
##       cats  262  211
##       dogs  497 1759
```

```r
report_accuracy_and_auc(true_class, pred_gbt)
```

```
## AUC: 0.739
## ACC: 0.741
## Cats: Recall: 0.554
##       Precision: 0.345
## Dogs: Recall: 0.78
##       Precision: 0.893
```

---

### SMOTE

[Chawla et al. (2002)](https://arxiv.org/pdf/1106.1813.pdf) developed SMOTE (Synthetic Minority Over-sampling Technique), which is an up sampling technique.

The authors claim that a hybrid combination of SMOTE and regular down sampling works best, which is why SMOTE is sometimes referred to as a "hybrid" sampling algo itself.

But the up sampling does not simply replicate the minority class observations...

---

It synthesizes them!
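The core step (the full algorithm appears in the figure below) is linear interpolation between a minority-class point and one of its nearest minority-class neighbours. A toy sketch with hypothetical values:

```r
# a synthetic sample lies somewhere on the segment between x and its neighbour
x  <- c(1.0, 2.0)   # a minority-class observation (made-up values)
nn <- c(1.4, 2.6)   # one of its k nearest minority-class neighbours
u  <- runif(1)      # uniform gap in [0, 1]
x_new <- x + u * (nn - x)
x_new
```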
<img src="images/smote_algo.png" style="width: 70%" /> --- ```r n <- 100 x1 <- rnorm(n, 0, 1); x2 <- rnorm(n, 0, 1) t <- 2 - 4 * x1 + 3 * x2 y <- rbinom(n, 1, 1 / (1 + exp(-t))) df <- tibble(x1 = x1, x2 = x2, y = factor(y)) df %>% count(y) ``` ``` ## # A tibble: 2 × 2 ## y n ## <fct> <int> ## 1 0 29 ## 2 1 71 ``` ```r df_smoted <- recipe(y ~ ., data = df) %>% themis::step_smote(y, over_ratio = 1) %>% prep(df) %>% juice() ``` --- ```r df_smoted %>% count(y) ``` ``` ## # A tibble: 2 × 2 ## y n ## <fct> <int> ## 1 0 71 ## 2 1 71 ``` <img src="images/SMOTE-simulation-1.png" width="100%" /> --- Let's do a hybrid of down sampling and SMOTE on our data: ```r rec_gbt <- recipe(pets ~ ., data = ok_train) %>% step_dummy(all_nominal(), -all_outcomes()) %>% themis::step_downsample(pets, under_ratio = 1.5) %>% themis::step_smote(pets, over_ratio = 1) %>% prep(ok_train) juice(rec_gbt) %>% count(pets) ``` ``` ## # A tibble: 2 × 2 ## pets n ## <fct> <int> ## 1 cats 2436 ## 2 dogs 2436 ``` ```r mod_gbt <- mod_gbt_spec %>% fit(pets ~ ., data = juice(rec_gbt)) pred_gbt <- mod_gbt %>% predict(new_data = bake(rec_gbt, ok_valid), type = "prob") %>% pull(.pred_cats) pred_gbt_class <- ifelse(pred_gbt > 0.5, "cats", "dogs") ``` --- ```r table(true_class, pred_gbt_class) ``` ``` ## pred_gbt_class ## true_class cats dogs ## cats 282 191 ## dogs 500 1756 ``` ```r report_accuracy_and_auc(true_class, pred_gbt) ``` ``` ## AUC: 0.744 ## ACC: 0.747 ## Cats: Recall: 0.596 ## Precision: 0.361 ## Dogs: Recall: 0.778 ## Precision: 0.902 ``` --- ### Other Up Sampling Methods - ADASYN - Borderline SMOTE - ROSE - Depends on data (e.g. with images it is common to perform image augmentation: flip, crop, rotate, blur it) ### Change of Framework - Anomaly Detection --- ### One final word of wisdom - It's Ok for a model to not know! - The optimal choice would be for a classification model to output a *score*, rather than a class, and have the client's system interpret that score for different applications - However, if a class output is required, consider outputting a `\(k+1\)` class: "I don't know" - In the case of classifying cats vs. dogs people - it makes sense! 
- For a two-class problem you would then have not one cutoff on the score but two: below the lower cutoff classify as "Dogs", above the upper cutoff as "Cats", and in the middle: "I don't know"
- As long as you make a decision regarding at least X% of the data

---

Let's see this on the `ok_valid` test set using our last SMOTEd model (but note that to tune these cutoffs you would need an extra set of untouched data!):

```r
upper <- 0.65
lower <- 0.35
pred_gbt_class <- ifelse(pred_gbt > upper, "cats",
                         ifelse(pred_gbt < lower, "dogs", NA))

table(true_class, pred_gbt_class)
```

```
##           pred_gbt_class
## true_class cats dogs
##       cats  156  103
##       dogs  239 1324
```

---

```r
report_accuracy_and_auc2 <- function(obs, pred, lower = 0.35, upper = 0.65) {
  pred_class <- ifelse(pred > upper, "cats",
                       ifelse(pred < lower, "dogs", NA))
  cm <- table(obs, pred_class)
  recall_cats <- cm[1, 1] / sum(cm[1, ])
  recall_dogs <- cm[2, 2] / sum(cm[2, ])
  prec_cats <- cm[1, 1] / sum(cm[, 1])
  prec_dogs <- cm[2, 2] / sum(cm[, 2])
  acc <- sum(diag(cm)) / sum(cm)
  pred_pct <- sum(cm) / length(obs)
  glue::glue("Predicted: {format(pred_pct, digits = 3)}
  ACC: {format(acc, digits = 3)}
  Cats: Recall: {format(recall_cats, digits = 3)}
    Precision: {format(prec_cats, digits = 3)}
  Dogs: Recall: {format(recall_dogs, digits = 3)}
    Precision: {format(prec_dogs, digits = 3)}")
}

report_accuracy_and_auc2(true_class, pred_gbt)
```

```
## Predicted: 0.668
## ACC: 0.812
## Cats: Recall: 0.602
##       Precision: 0.395
## Dogs: Recall: 0.847
##       Precision: 0.928
```
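---

To make the "at least X% of the data" requirement concrete, one could scan a few cutoff pairs and keep only those that still classify enough of the test set. A hedged sketch, assuming `pred_gbt` from the SMOTEd model above; the grid and the 80% floor are arbitrary choices (and, as noted, real tuning of these cutoffs needs a separate untouched set):

```r
cutoff_grid <- tibble(lower = seq(0.2, 0.45, by = 0.05),
                      upper = 1 - lower)

cutoff_grid %>%
  mutate(pct_decided = map2_dbl(lower, upper,
                                ~ mean(pred_gbt < .x | pred_gbt > .y))) %>%
  filter(pct_decided >= 0.8)
```

Among the pairs that pass the floor, one could then pick the pair with the best Recall(cats)/Precision(cats) trade-off using `report_accuracy_and_auc2()`.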