class: logo-slide --- class: title-slide ## The Trees ### Applications of Data Science - Class 12 ### Giora Simchoni #### `gsimchoni@gmail.com and add #dsapps in subject` ### Stat. and OR Department, TAU ### 2023-02-23 --- layout: true <div class="my-footer"> <span> <a href="https://dsapps-2023.github.io/Class_Slides/" target="_blank">Applications of Data Science </a> </span> </div> --- class: section-slide # The Pros and Cons of Trees --- ## Pros - It's just a set of if-else statements my ~~7~~ ~~8~~ ~~9~~ 10 y/o can get - Highly interpretable (when tree is not large) - Easy to implement - Fast? (to predict) - Little pre-processing of predictors needed - Handle all types of predictors (continuous, categorical) - Handle missing data, *predict* missing data - Assumption free - Feature selection built-in (but when predictors are correlated...) - Low bias, in general --- ## Cons - High variance, in general - Rectangular predictor regions - not always a good thing - Complexity of prediction limited in no. of leaves! (For a simple CART) - Not so fast? (to train) - Greedy - Selection Bias of predictors with more distinct values? - Variable Importance when predictors are correlated --- class: section-slide # Detour: A Regression Problem --- ### OKCupid: Predicting Annual Income It won't be easy: ```r okcupid <- read_csv("~/okcupid.csv.zip") okcupid %>% count(income, sort = TRUE) %>% head(20) ``` ``` ## # A tibble: 13 × 2 ## income n ## <dbl> <int> ## 1 -1 48442 ## 2 20000 2952 ## 3 100000 1621 ## 4 80000 1111 ## 5 30000 1048 ## 6 40000 1005 ## 7 50000 975 ## 8 60000 736 ## 9 70000 707 ## 10 150000 631 ## 11 1000000 521 ## 12 250000 149 ## 13 500000 48 ``` --- We will stick to non-NA (income) observations, and predict `\(\log_{10}(income/100000)\)`: ```r okcupid2 <- okcupid %>% mutate(income = ifelse(income == -1, NA, log10(income/100000))) %>% drop_na(income) ``` In the vector `predictors` .font80percent[(see slides Rmd source)] we have 42 continuous and categorical variables which may or may not be predictive to income: ```r okcupid2 <- okcupid2 %>% select(income, all_of(predictors)) %>% mutate(id = 1:n()) dim(okcupid2) ``` ``` ## [1] 11504 44 ``` --- ```r glimpse(okcupid2) ``` ``` ## Rows: 11,504 ## Columns: 44 ## $ income <dbl> -0.09691001, -0.69897000, -0.39794001, -0.522878… ## $ age <dbl> 35, 23, 28, 30, 29, 40, 31, 22, 35, 31, 21, 34, … ## $ height_cm <dbl> 177.80, 180.34, 182.88, 167.64, 157.48, 180.34, … ## $ sex <fct> m, m, m, f, f, m, f, m, m, f, m, m, f, m, m, m, … ## $ body_type <fct> average, thin, average, skinny, thin, fit, thin,… ## $ body_type_not_perfect <fct> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, F… ## $ diet2 <fct> other, vegetarian, anything, anything, anything,… ## $ drinks <fct> often, socially, socially, socially, socially, s… ## $ drugs <fct> sometimes, NA, never, never, never, NA, sometime… ## $ religion2 <fct> atheist, NA, christian, christian, christian, at… ## $ education2 <fct> other, student1, degree1, high_school, student1,… ## $ education_kind <fct> working, working, graduated, graduated, working,… ## $ education_academic <fct> FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, F… ## $ ethnicity2 <fct> white, white, white, white, other, white, other,… ## $ part_black <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, NA, FA… ## $ part_white <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, NA, FALSE, T… ## $ part_asian <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, NA, TR… ## $ part_hispanic <fct> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, NA, FAL… ## $ job3 <fct> 
travel, student, financial, marketing, other, cr… ## $ orientation <fct> straight, straight, straight, straight, straight… ## $ pets_has_dogs <fct> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, NA, NA,… ## $ pets_has_cats <fct> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, NA, NA,… ## $ pets_likes_cats <fct> TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, NA, NA, TR… ## $ pets_likes_dogs <fct> TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, NA, NA, T… ## $ sign_fun <fct> FALSE, FALSE, FALSE, NA, FALSE, TRUE, NA, FALSE,… ## $ sign_not_matter <fct> FALSE, FALSE, TRUE, NA, FALSE, FALSE, NA, TRUE, … ## $ sign_matters <fct> FALSE, FALSE, FALSE, NA, FALSE, FALSE, NA, FALSE… ## $ sign2 <fct> cancer, pisces, leo, NA, taurus, gemini, NA, vir… ## $ speaks_spanish <fct> TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, F… ## $ speaks_french <fct> TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, F… ## $ speaks_german <fct> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, … ## $ speaks_chinese <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,… ## $ status <fct> single, single, seeing someone, single, single, … ## $ essay0_len <dbl> 6.538163, 3.713962, 7.336296, NA, NA, 6.975429, … ## $ essay1_len <dbl> 3.951551, 3.713962, 6.095861, NA, 5.843591, 5.99… ## $ essay2_len <dbl> 4.564515, 4.605330, 6.109283, NA, 6.428131, 5.12… ## $ essay3_len <dbl> NA, 3.496992, 5.733393, NA, 4.204931, 4.897959, … ## $ essay4_len <dbl> 5.517517, 5.327954, 6.278551, NA, 6.971684, 6.85… ## $ essay5_len <dbl> 5.723637, NA, 6.424895, NA, 4.812314, 5.257579, … ## $ essay6_len <dbl> NA, 3.258712, 5.459654, NA, 4.709674, 5.433792, … ## $ essay7_len <dbl> NA, NA, 4.709674, NA, 4.875319, 5.459654, 4.7275… ## $ essay8_len <dbl> 3.9123430, NA, 4.5645148, NA, 4.5434650, 4.96992… ## $ essay9_len <dbl> NA, 3.045284, 5.669936, NA, 3.784553, 8.133886, … ## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1… ``` --- We split the data into training, validation and test sets: ```r # test_idx <- sample(1:nrow(okcupid2), 2000, replace = FALSE) # train_idx <- okcupid2 %>% filter(!id %in% test_idx) %>% slice_sample(n = 8000) %>% pull(id) # valid_idx <- okcupid2 %>% filter(!id %in% test_idx, !id %in% train_idx) %>% pull(id) okcupid2 <- okcupid2 %>% select(-id) idx <- read_rds("../data/okcupid2_idx.rda") train_idx <- idx$train_idx valid_idx <- idx$valid_idx test_idx <- idx$test_idx okcupid2_train <- okcupid2[train_idx, ] okcupid2_valid <- okcupid2[valid_idx, ] okcupid2_test <- okcupid2[test_idx, ] library(glue) glue("train no. of rows: {nrow(okcupid2_train)} validation no. of rows: {nrow(okcupid2_valid)} test no. of rows: {nrow(okcupid2_test)}") ``` ``` ## train no. of rows: 8000 ## validation no. of rows: 1504 ## test no. of rows: 2000 ``` --- Our transformed income dependent variable behaves "ok": ```r ggplot(okcupid2_train, aes(income)) + geom_histogram(fill = "red", alpha = 0.5) + theme_bw() ``` <img src="images/Income-Hist-1.png" width="100%" /> --- We can quickly see percentage of missing values with [`naniar`](http://naniar.njtierney.com/): ```r library(naniar) vis_miss(okcupid2_train %>% sample_frac(0.2) %>% select(-starts_with("essay"))) ``` <img src="images/Missingness-1.png" width="100%" /> --- Also worth exploring some basic relations between predictors and income. You can use the work of others, e.g. 
[`ggpairs`](https://ggobi.github.io/ggally/): ```r library(GGally) ggpairs(okcupid2_train %>% select(income, age, sex, height_cm, body_type_not_perfect)) ``` <img src="images/GGPairs-1.png" width="100%" /> --- But don't be ashamed of simply exploring on your own: ```r var_vs_income_boxplot <- function(var) { ggplot(okcupid2_train, aes({{var}}, income)) + geom_boxplot() + facet_wrap(~ sex) + theme_bw() + labs(x = "") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) } ``` --- ```r var_vs_income_boxplot(body_type) ``` <img src="images/Body_Type-Income-1.png" width="100%" /> --- ```r var_vs_income_boxplot(sign2) ``` <img src="images/Sign-Income-1.png" width="100%" /> --- ```r var_vs_income_boxplot(job3) ``` <img src="images/Job-Income-1.png" width="100%" /> --- ```r var_vs_income_boxplot(education2) ``` <img src="images/Edu-Income-1.png" width="100%" /> --- ```r var_vs_income_boxplot(religion2) ``` <img src="images/Religion-Income-1.png" width="100%" /> --- ```r var_vs_income_boxplot(diet2) ``` <img src="images/Diet-Income-1.png" width="100%" /> --- ```r ggpairs(okcupid2_train %>% select(essay0_len:essay2_len, income)) ``` <img src="images/Essay-Income-1.png" width="100%" /> --- ### Baseline: Linear Regression R's `lm` function does not take `NA` values. One strategy is to impute these values using a "common" value such as the median for continuous variables and mode for categorical variables. This can easily be achieved with `naniar`: ```r okcupid2_imp <- naniar::impute_median_all(okcupid2) okcupid2_imp_train <- okcupid2_imp[train_idx, ] okcupid2_imp_valid <- okcupid2_imp[valid_idx, ] mod_lm <- lm(income ~ ., data = okcupid2_imp_train) pred_lm <- predict(mod_lm, okcupid2_imp_valid) rmse <- function(obs, pred) sqrt(mean((obs - pred)^2)) report_rmse_and_cor <- function(obs, pred) { RMSE <- rmse(obs, pred) CORR <- cor(obs, pred) glue("RMSE: {format(RMSE, digits = 3)} CORR: {format(CORR, digits = 3)}") } ``` --- ```r report_rmse_and_cor(okcupid2_valid$income, pred_lm) ``` ``` ## RMSE: 0.352 ## CORR: 0.501 ``` ```r tibble(income = okcupid2_valid$income, pred = pred_lm) %>% ggplot(aes(income, pred)) + geom_point() + ylim(range(okcupid2_valid$income)) ``` <img src="images/LM-Fit1-1.png" width="50%" /> --- A more intelligent strategy for imputing missing values would be to *predict* them using whatever data is not missing. This can be done quite seamlessly with the [`mice`](https://stefvanbuuren.name/mice/) package: ```r # library(mice) # mice_obj <- mice(okcupid2, m = 1, maxit = 10, seed = 42) # okcupid2_imp_mice <- complete(mice_obj) okcupid2_imp_mice <- read_rds("../data/okcupid2_imp_mice.rds") okcupid2_imp_mice_train <- okcupid2_imp_mice[train_idx, ] okcupid2_imp_mice_valid <- okcupid2_imp_mice[valid_idx, ] mod_lm_mice <- lm(income ~ ., data = okcupid2_imp_mice_train) pred_lm_mice <- predict(mod_lm_mice, okcupid2_imp_mice_valid) report_rmse_and_cor(okcupid2_valid$income, pred_lm_mice) ``` ``` ## RMSE: 0.351 ## CORR: 0.504 ``` .insight[ 💡 Can you think of other imputation strategies? 
] --- ```r tibble(income = okcupid2_valid$income, pred = pred_lm_mice) %>% ggplot(aes(income, pred)) + geom_point() + ylim(range(okcupid2_valid$income)) ``` <img src="images/LM-Fit2-1.png" width="60%" /> --- ### Baseline: Ridge Regression ```r library(glmnet) okcupid2_imp_mat_train <- model.matrix( ~ ., okcupid2_imp_mice_train[, predictors]) okcupid2_imp_mat_valid <- model.matrix( ~ ., okcupid2_imp_mice_valid[, predictors]) ridge_cv <- cv.glmnet(x = okcupid2_imp_mat_train, y = okcupid2_train$income, alpha = 0) best_lambda <- ridge_cv$lambda.min mod_lm_ridge <- glmnet(x = okcupid2_imp_mat_train, y = okcupid2_train$income, alpha = 0, lambda = best_lambda) pred_lm_ridge <- predict(mod_lm_ridge, okcupid2_imp_mat_valid) report_rmse_and_cor(okcupid2_valid$income, pred_lm_ridge) ``` ``` ## RMSE: 0.351 ## CORR: 0.505 ``` --- ```r tibble(income = okcupid2_valid$income, pred = pred_lm_ridge) %>% ggplot(aes(income, pred)) + geom_point() + ylim(range(okcupid2_valid$income)) ``` <img src="images/LM-Ridge-Fit-1.png" width="60%" /> --- ### Baseline: Lasso Regression ```r lasso_cv <- cv.glmnet(x = okcupid2_imp_mat_train, y = okcupid2_imp_train$income, alpha = 1) best_lambda <- lasso_cv$lambda.min mod_lm_lasso <- glmnet(x = okcupid2_imp_mat_train, y = okcupid2_train$income, alpha = 1, lambda = best_lambda) pred_lm_lasso <- predict(mod_lm_lasso, okcupid2_imp_mat_valid) report_rmse_and_cor(okcupid2_valid$income, pred_lm_lasso) ``` ``` ## RMSE: 0.351 ## CORR: 0.506 ``` --- ```r tibble(income = okcupid2_valid$income, pred = pred_lm_lasso) %>% ggplot(aes(income, pred)) + geom_point() + ylim(range(okcupid2_valid$income)) ``` <img src="images/LM-Lasso-Fit-1.png" width="60%" /> --- class: section-slide # End of Detour --- class: section-slide # The CART (Regression) --- ### The OG CART 1. Find the the predictor to (binary) split on and the value of the split, by `\(SSE\)` criterion 2. For each resulting node if `\(n_{node} > 20\)` go to 1 3. Once full tree has been grown, perform pruning using the *cost-complexity parameter* `\(c_p\)` and the `\(SSE_{c_p}\)` criterion 4. Predict the average value at each terminal node - For each split *surrogate splits* are saved for future `NA`s - An alternative criterion for complexity could be tree maximum depth (sklearn) - One can also reduce all tree's unique paths to a set of rules - Variables can be ranked by "importance" .insight[ 💡 What if the best split isn't binary? ] --- ### Why *Rectangular* Regions? <img src="images/recursive_partitioning.png" style="width: 75%" /> .insight[ 💡 Does this remind you of anything? 
] --- ### The `\(SSE\)` criterion - continuous predictor - `\(y\)` is the continuous dependent variable - a continuous predictor `\(v\)` is nominated for splitting the current node - with splitting value `\(l\)` - such that `\(S_1\)` is the set of observations for which `\(v_i \le l\)` - and `\(S_2\)` is the set of observations for which `\(v_i > l\)` - `\(\overline y_1\)` is the average of `\(y\)` in set `\(S_1\)` `\(SSE = \sum_{i \in S_1}(y_i - \overline y_1)^2 + \sum_{i \in S_2}(y_i - \overline y_2)^2\)` For example if `age` is candidate in splitting `income`: --- ```r library(patchwork) sse <- function(l, df, v) { income_above <- df %>% filter({{v}} >= l) %>% pull(income) income_below <- df %>% filter({{v}} < l) %>% pull(income) sse_above <- sum((income_above - mean(income_above))^2) sse_below <- sum((income_below - mean(income_below))^2) return(sse_above + sse_below) } age <- seq(18,69, 1) sse_age <- map_dbl(age, sse, df = okcupid2_train, v = age) p1 <- okcupid2_train %>% count(age, income) %>% ggplot(aes(age, income)) + geom_point(aes(size = n)) + theme_bw() + labs(x = "") + theme(axis.text.x = element_blank()) p2 <- tibble(age = age, sse = sse_age) %>% ggplot(aes(age, sse)) + geom_line() + geom_point() + theme_bw() p1 / p2 ``` --- <img src="images/Age-SSE-1.png" width="80%" /> --- ### The `\(SSE\)` criterion - categorical predictor .insight[ 💡 What could be an issue with a categorical variable with many levels? ] - a categorical predictor `\(v\)` is nominated for splitting the current node - option 1: use dummy variables, turning each level into a 2-level 0/1 category variable - option 2: order levels by some criterion like `\(\overline{y_j}\)` and treat `\(l\)` as continuous from here on For example if `job3` is candidate in splitting `income`: --- ```r mean_income_vs_job <- okcupid2_train %>% group_by(job3) %>% summarise(mean_income = mean(income)) %>% arrange(mean_income) jobs_levels_sorted <- as.character(mean_income_vs_job$job3) okcupid2_train_job_sorted <- okcupid2_train %>% mutate(job_ordered = fct_relevel(job3, jobs_levels_sorted), job_ordered_n = as.numeric(job_ordered)) job_rank <- seq(1, length(jobs_levels_sorted), 1) sse_job <- map_dbl(job_rank, sse, df = okcupid2_train_job_sorted, v = job_ordered_n) p1 <- okcupid2_train_job_sorted %>% ggplot(aes(job_ordered, income)) + geom_boxplot() + theme_bw() + labs(x = "") + theme(axis.text.x = element_blank()) p2 <- tibble(job = factor(jobs_levels_sorted, levels = jobs_levels_sorted), sse = sse_job) %>% ggplot(aes(job, sse, group = 1)) + geom_line() + geom_point() + theme_bw() + labs(x = "") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) p1 / p2 ``` --- <img src="images/Job-SSE-1.png" width="80%" /> --- ### Pruning: The `\(SSE_{c_p}\)` criterion The "right-sized" tree should not be too deep to avoid overfitting. Once the full tree is grown, we check for each split: `\(SSE_{c_p} = SSE_{tot} + c_p \cdot {\text{#terminal nodes}}\)` Where `\(c_p\)` is a penalty or regularization parameter usually chosen by cross validation. And we choose the smallest pruned tree with the minimum `\(SSE_{c_p}\)`. 
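---

To make the cost-complexity trade-off concrete, here is a toy sketch with made-up `\(SSE\)` values .font80percent[(in practice `rpart` computes this table for you, on a scale relative to the root node; tidyverse assumed loaded as in the rest of these slides)]:

```r
# toy numbers only: deeper candidate subtrees fit the training data better...
candidate_trees <- tibble(
  n_terminal = c(2, 5, 10, 20, 50),
  sse_tot    = c(950, 820, 760, 730, 720)
)

# ...but every terminal node costs c_p, so the penalized criterion
# bottoms out at an intermediate tree size
cp <- 5
candidate_trees %>%
  mutate(sse_cp = sse_tot + cp * n_terminal) %>%
  arrange(sse_cp)
```

A larger `\(c_p\)` pushes the minimum towards smaller trees; `\(c_p = 0\)` keeps the full tree.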
---

### CART with `rpart`

Growing the full tree .font80percent[(for a numeric `\(y\)` `rpart` will guess this is Regression)]:

```r
library(rpart)

mod_tree <- rpart(income ~ ., data = okcupid2_train)
```

Pruning with some `\(c_p\)`:

```r
mod_tree <- prune(mod_tree, cp = 0.05)
```

.warning[
⚠️ There is a `\(c_p\)` parameter you can pass to `rpart` while training like so: `rpart(income ~ ., data = okcupid2_train, control = rpart.control(cp = 0.05))`

But this will only make `rpart` consider this `\(c_p\)` at each split as a minimum criterion *while growing the unpruned tree*. In fact the default of this parameter is 0.01!
]

---

So, in order to train a true unpruned tree you would need to pass `\(c_p = 0\)`:

```r
mod_tree <- rpart(income ~ ., data = okcupid2_train, control = rpart.control(cp = 0))
```

Now, by default `rpart` will also perform 10-fold Cross Validation, reporting the cross-validated error for each candidate `\(c_p\)` in this `cptable`:

```r
head(mod_tree$cptable)
```

```
##            CP nsplit rel error    xerror       xstd
## 1 0.113597498      0 1.0000000 1.0002130 0.02238494
## 2 0.048179174      1 0.8864025 0.8910381 0.02282691
## 3 0.013140594      2 0.8382233 0.8457276 0.02367193
## 4 0.010043056      3 0.8250827 0.8326671 0.02372118
## 5 0.009147817      4 0.8150397 0.8270287 0.02395268
## 6 0.005634308      5 0.8058919 0.8219399 0.02402273
```

---

The nature of the `xerror` is unclear from the [docs](https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf) (SSE/n ?) except that it is relative to the error in the root node.

You can either use it like so:

```r
best_cp <- mod_tree$cptable[which.min(mod_tree$cptable[,"xerror"]),"CP"]
mod_tree <- prune(mod_tree, cp = best_cp)
```

Or you can perform CV on your own, passing at each stage a control parameter so that `rpart` does not perform CV:

```r
mod_tree <- rpart(income ~ ., data = okcupid2_train,
  control = rpart.control(cp = 0, xval = 1))
```

---

Let's tune `\(c_p\)` for our data using a 5-fold (manual) Cross Validation. The criterion to minimize would be RMSE.

```r
n_cv <- 5; cp_seq <- seq(0, 0.02, 0.001)

okcupid2_train_val <- okcupid2_train %>%
  mutate(val = sample(1:n_cv, n(), replace = TRUE))

get_validation_set_rmse <- function(i, .cp) {
  ok_tr <- okcupid2_train_val %>% filter(val != i) %>% select(-val)
  ok_val <- okcupid2_train_val %>% filter(val == i) %>% select(-val)
  mod <- rpart(income ~ ., data = ok_tr,
    control = rpart.control(cp = 0, xval = 1))
  mod <- prune(mod, cp = .cp)
  pred <- predict(mod, ok_val)
  rmse(ok_val$income, pred)
}

get_cv_rmse <- function(.cp) {
  tibble(cp = rep(.cp, n_cv),
    rmse = map_dbl(1:n_cv, get_validation_set_rmse, .cp = .cp))
}
```

---

```r
cv_table <- map_dfr(cp_seq, get_cv_rmse)

cv_table %>%
  mutate(cp = factor(cp)) %>%
  ggplot(aes(cp, rmse)) +
  geom_boxplot() +
  theme_bw() +
  labs(x = "") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```

<img src="images/Cp-CV2-1.png" width="100%" />

---

Training on the entire training set:

```r
mod_tree_na <- rpart(income ~ ., data = okcupid2_train,
  control = rpart.control(cp = 0, xval = 1))
mod_tree_na <- prune(mod_tree_na, cp = 0.003)
```

You can plot the tree using `rpart`, but...
```r plot(mod_tree_na) text(mod_tree_na, pretty = 1, use.n = TRUE) ``` <img src="images/CART-NA1-1.png" width="50%" /> --- The plotting function in the `rpart.plot` package is slightly nicer: ```r library(rpart.plot) prp(mod_tree_na, type = 5, extra = 1) ``` <img src="images/CART-NA2-1.png" width="100%" /> --- You can go fancy and use `partykit` and/or `ggparty`: ```r library(ggparty) party_obj <- as.party(mod_tree_na) ggparty(party_obj) + geom_edge() + # geom_edge_label() + geom_node_label(aes(label = splitvar), ids = "inner") + geom_node_label(aes(label = str_c("n = ", nodesize)), ids = "terminal", nudge_y = 0.02) + geom_node_plot(gglist = list(geom_boxplot(aes(y = income)), theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())), shared_axis_labels=TRUE) ``` --- <img src = "images/CART-ggparty.png" style="width: 100%"> --- Or you can just print the tree and see what is going on: ```r print(party_obj) ``` <img src = "images/CART_print.png" style="width: 100%"> --- ### Variables Importance Summing the reduction in `\(SSE\)` for each split variable, we can get a measure of importance. .font80percent[Unfortunately in `rpart` the "potential" reduction in `\(SSE\)` for surrogate splits is also summed. You can either ignore or retrain with only the variables chosen by the model.] ```r enframe(mod_tree_na$variable.importance) %>% arrange(value) %>% mutate(variable = as_factor(name)) %>% ggplot(aes(variable, value)) + geom_segment(aes(x = variable, xend = variable, y = 0, yend = value)) + geom_point(size = 3) + theme_bw() + coord_flip() + labs(x = "", y = "") ``` --- <img src="images/CART-VarImp-1.png" width="60%" /> .insight[ 💡 How would this profile look for a couple of very correlated predictors? ] --- ### Prediction ```r pred_tree_na <- predict(mod_tree_na, okcupid2_valid) report_rmse_and_cor(okcupid2_valid$income, pred_tree_na) ``` ``` ## RMSE: 0.363 ## CORR: 0.449 ``` As expected, far from impressive. Let's try using the imputed `NA` data: ```r mod_tree_imp <- rpart(income ~ ., data = okcupid2_imp_mice_train, control = rpart.control(cp = 0, xval = 1)) mod_tree_imp <- prune(mod_tree_imp, cp = 0.003) pred_tree_imp <- predict(mod_tree_imp, okcupid2_imp_mice_valid) report_rmse_and_cor(okcupid2_valid$income, pred_tree_imp) ``` ``` ## RMSE: 0.369 ## CORR: 0.424 ``` --- ```r tibble(income = okcupid2_valid$income, pred = pred_tree_na) %>% count(income, pred) %>% ggplot(aes(income, pred)) + geom_point(aes(size = n)) + ylim(range(okcupid2_valid$income)) ``` <img src="images/Tree-Fit-1.png" width="60%" /> --- class: section-slide # The Others --- <img src = "images/CART-ggparty.png" style="width: 80%"> Looking at this one might wonder: - Since when do we just split, how about a statistical test? - Predicting a single value for each terminal node? Is that the best we can do? - Many variables repeat in the same path, can we compact paths into simple rules? --- ### Conditional Inference Trees [Hothorn et. al. (2006)](https://eeecon.uibk.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf) perform a statistical hypothesis test for each variable to decide whether it is related to `\(y\)` (including multiple comparisons adjustment). If no variable passes test - stop. No pruning necessary. If a few variables pass - choose the most related, e.g. using a p-value. 
```r library(partykit) mod_ctree_na <- ctree(income ~ ., data = okcupid2_train, alpha = 0.0001) pred_ctree_na <- predict(mod_ctree_na, okcupid2_valid) report_rmse_and_cor(okcupid2_valid$income, pred_ctree_na) ``` ``` ## RMSE: 0.362 ## CORR: 0.455 ``` --- ```r plot(mod_ctree_na) ``` <img src = "images/ctree.png" style="width: 100%"> --- ### Model Trees [Quinlan (1992)](https://sci2s.ugr.es/keel/pdf/algorithm/congreso/1992-Quinlan-AI.pdf) and later [Wang and Witten (1997)](https://pdfs.semanticscholar.org/3324/4b59fe506331926b3a33e348209ac456c532.pdf) made a few changes to the beloved CART: - instead of single values at terminal nodes, M5 trees have *linear models* - sort of a piecewise linear regression only the "pieces" can come from many variables - can extrapolate to never before seen values (not necessarily good) You can find an implementation in R in the `RWeka` package, function `M5P()`. [Zeileis et. al. (2008)](https://eeecon.uibk.ac.at/~zeileis/papers/Zeileis+Hothorn+Hornik-2008.pdf) approach is slightly simpler to grasp. It allows you to specify variables for splitting and variables for regression. See `lmtree()` and `glmtree()` functions in the `partykit` package. --- ```r mod_lmtree_na <- lmtree(income ~ age | sex + job3 + education2, data = okcupid2_train) pred_lmtree_na <- predict(mod_lmtree_na, okcupid2_valid) report_rmse_and_cor(okcupid2_valid$income, pred_lmtree_na) plot(mod_lmtree_na) ``` ``` ## RMSE: 0.366 ## CORR: 0.437 ``` <img src = "images/lmtree.png" style="width: 100%"> --- ### Rule-Based Trees [Holmes et. a. (1993)](https://perun.pmf.uns.ac.rs/radovanovic/dmsem/cd/install/Weka/doc/pubs/1999/ajc.pdf) general approach: 1. Build a model tree (e.g. M5) 2. For each path compute the coverage (= % of relevant observations to that path) 3. Compact the path with largest coverage into a rule 4. Remove previous relevant observations and go back to 1 until all observations are accounted are. ```r library(RWeka) # limiting predictors just for a nice print! mod_m5rules_na <- M5Rules(income ~ ., data = okcupid2_train[, c("income", "age", "sex", "education2")]) pred_m5rules_na <- predict(mod_m5rules_na, okcupid2_valid) report_rmse_and_cor(okcupid2_valid$income, pred_m5rules_na) ``` ``` ## RMSE: 0.374 ## CORR: 0.392 ``` --- ```r print(mod_m5rules_na) ``` <img src = "images/m5rules.png" style="width: 100%"> --- class: section-slide # Bagged Trees --- ### Bagging: The Gist For `\(j= 1\)` to `\(m\)` do - Generate a bootstrap sample of the original data - Train an unpruned tree model on this sample End. - Predict average through all `\(m\)` trees (or %vote for each class in classification) - What you get: better prediction, *OOB* error estimates - What you lose: interpretability, speed (but bagging can be paralleled) --- Why does Bagging work? <img src="images/Bagging-Trees-1.png" width="80%" /> --- Which models are likely to benefit from bagging? 
```r
train_lm <- function(i, .n) {
  samp <- sample(1:.n, .n, replace = TRUE)
  lm(income ~ ., data = okcupid2_imp_mice_train[samp, ])
}

train_tree <- function(i, .n) {
  samp <- sample(1:.n, .n, replace = TRUE)
  rpart(income ~ ., data = okcupid2_imp_mice_train[samp, ])
}

predict_mod <- function(mod) {
  tibble(pred = predict(mod, okcupid2_imp_mice_valid))
}

rmse_m_models <- function(m, .preds) {
  rmse(okcupid2_imp_mice_valid$income, rowMeans(.preds[, 1:m]))
}

n <- nrow(okcupid2_imp_mice_train)
m_max <- 50

mod_lms <- map(1:m_max, train_lm, .n = n)
mod_trees <- map(1:m_max, train_tree, .n = n)
preds_lm <- map_dfc(mod_lms, predict_mod)
preds_tree <- map_dfc(mod_trees, predict_mod)
```

---

```r
rmse_lm <- map_dbl(1:m_max, rmse_m_models, .preds = preds_lm)
rmse_tree <- map_dbl(1:m_max, rmse_m_models, .preds = preds_tree)

tibble(m = rep(1:m_max, 2),
  model = c(rep("lm", m_max), rep("CART", m_max)),
  RMSE = c(rmse_lm, rmse_tree)) %>%
  ggplot(aes(m, RMSE, color = model)) +
  ylim(c(0.34, 0.38)) +
  geom_line() +
  geom_point() +
  theme_bw()
```

<img src="images/Bagging-LM-CART-1.png" width="100%" />

---

### Bagging with `ipred`

```r
library(ipred)

mod_bag_na <- bagging(income ~ ., data = okcupid2_train, nbagg = 50)
pred_bag_na <- predict(mod_bag_na, okcupid2_valid)
report_rmse_and_cor(okcupid2_valid$income, pred_bag_na)
```

```
## RMSE: 0.366
## CORR: 0.436
```

```r
mod_bag_mice <- bagging(income ~ ., data = okcupid2_imp_mice_train, nbagg = 50)
pred_bag_mice <- predict(mod_bag_mice, okcupid2_imp_mice_valid)
report_rmse_and_cor(okcupid2_valid$income, pred_bag_mice)
```

```
## RMSE: 0.365
## CORR: 0.438
```

---
class: section-slide

# Random Forests

---

### Random Forests: The Gist

For `\(j = 1\)` to `\(m\)` do
- Generate a bootstrap sample of the original data
- Train an unpruned tree model on this sample
- For each split:
  - Randomly select `\(m_{try}\)` predictors
  - Select the best predictor for the split in this subset only

End.

---

What does RF add on top of Bagging?

<img src="images/RF-Trees-1.png" width="80%" />

---

### RF with `randomForest`

For some reason this implementation of RF won't accept missing values:

```r
library(randomForest)

mod_rf_mice <- randomForest(income ~ ., data = okcupid2_imp_mice_train,
  mtry = 10, ntree = 50, importance = TRUE)
pred_rf_mice <- predict(mod_rf_mice, okcupid2_imp_mice_valid)
report_rmse_and_cor(okcupid2_valid$income, pred_rf_mice)
```

```
## RMSE: 0.354
## CORR: 0.491
```

For variable importance check out the `importance` field of the RF object:

---

<img src="images/RF-VarImp-1.png" width="70%" />

.insight[
💡 What is troubling about variable importance?
]

---

### Partial Dependency Plots

1. Select a subset `\(X_s\)` of variables (usually 1-2) whose relation to the dependent variable `\(y\)` you wish to know
2. For each possible value (or combination of values) of `\(X_s\)`, run the model on the original data, plugging in that value instead of the observed values of `\(X_s\)`
3. Average the predictions of `\(y\)`
4. Plot `\(y\)` vs. `\(X_s\)`
---

```r
library(pdp)

partial(mod_rf_mice, pred.var = "age", plot = TRUE, plot.engine = "ggplot2") +
  labs(y = "income") +
  theme_classic()
```

<img src = "images/RF-PDP-Age.png" style="width: 50%">

---

Let's see this for `age` and `height_cm` together (takes some time!):

```r
partial(mod_rf_mice, pred.var = c("age", "height_cm"), plot = TRUE)
```

<img src = "images/RF-PDP-Age-Height.png" style="width: 50%">

---
class: section-slide

# Gradient Boosted Trees

---

### Boosting: The Gist

Compute the average response `\(\overline y\)`, and use this as the initial predicted value for each sample.

For `\(j = 1\)` to `\(m\)` do
- Compute the residual for each observation
- Sample a fraction of the original data
- Fit a regression tree of a certain *"interaction depth"* using the residual as response
- Predict each sample using this tree
- Update the predicted value of each sample by adding the current tree's prediction, multiplied by a *learning rate* or *shrinkage* parameter, to the previous iteration's value

End.

---

What you end up with:

`\(\hat{y_i} = \overline{y} + \sum_{j=1}^m \lambda \cdot tree_j(\mbox{x}_i)\)`

That is why GBT (or GBM) in general is called an additive model.

.insight[
💡 Where does "Gradient" come from?
]

- The random sampling of the data was added later; this version is called *Stochastic* Gradient Boosting, and the default for the parameter often called "bagging fraction" is 0.5
- The shrinkage `\(\lambda\)` can be turned into `\(\lambda_j\)`, a unique weight for each added tree
- Note the **many** tuning parameters here. Do you know how to tune multiple params?
- Not so parallel now, eh?

---

### GBT with `gbm`

```r
library(gbm)

mod_gbm_na <- gbm(income ~ ., data = okcupid2_train,
  distribution = "gaussian", n.trees = 500, shrinkage = 0.1, bag.fraction = 0.5)
pred_gbm_na <- predict(mod_gbm_na, okcupid2_valid, n.trees = 500)
report_rmse_and_cor(okcupid2_valid$income, pred_gbm_na)
```

```
## RMSE: 0.351
## CORR: 0.505
```

```r
mod_gbm_mice <- gbm(income ~ ., data = okcupid2_imp_mice_train,
  distribution = "gaussian", n.trees = 500, shrinkage = 0.1, bag.fraction = 0.5)
pred_gbm_mice <- predict(mod_gbm_mice, okcupid2_imp_mice_valid, n.trees = 500)
report_rmse_and_cor(okcupid2_valid$income, pred_gbm_mice)
```

```
## RMSE: 0.352
## CORR: 0.498
```

For variable importance see the `summary` function:

---

<img src="images/GBM-VarImp-1.png" width="70%" />

.insight[
💡 Do you see the difference in the profile of importance between RF and GBT?
]

---

### GBT with `xgboost`

`xgboost` is faster and has some additional features, e.g. the ability to train with a validation set until it shows no improvement, similar to neural networks.

```r
library(xgboost)

dtrain <- xgb.DMatrix(data = data.matrix(okcupid2_train[, predictors]),
  label = okcupid2_train$income)
dval <- xgb.DMatrix(data = data.matrix(okcupid2_valid[, predictors]),
  label = okcupid2_valid$income)
watchlist <- list(train = dtrain, test = dval)

mod_xgboost_na <- xgb.train(data = dtrain, watchlist = watchlist,
  objective = "reg:squarederror", nrounds = 100, eta = 0.1,
  subsample = 0.5, early_stopping_rounds = 10, verbose = 0)
pred_xgboost_na <- predict(mod_xgboost_na, dval)
report_rmse_and_cor(okcupid2_valid$income, pred_xgboost_na)
```

```
## RMSE: 0.357
## CORR: 0.479
```

.warning[
⚠️ Note that `xgboost` only accepts numeric input, that is why we use `data.matrix()`. Perhaps the order of the factors could be improved before they are turned into integers.
] --- For variables importance see the `xgb.importance` function: <img src="images/XGBST-VarImp-1.png" width="70%" /> --- class: section-slide # The CART (Classification) --- class: section-slide # Detour: A Classification Problem --- ### OKCupid: Predicting Dogs or Cats People It won't be easy: ```r okcupid %>% count(pets) ``` ``` ## # A tibble: 16 × 2 ## pets n ## <chr> <int> ## 1 dislikes cats 122 ## 2 dislikes dogs 44 ## 3 dislikes dogs and dislikes cats 196 ## 4 dislikes dogs and has cats 81 ## 5 dislikes dogs and likes cats 240 ## 6 has cats 1406 ## 7 has dogs 4134 ## 8 has dogs and dislikes cats 552 ## 9 has dogs and has cats 1474 ## 10 has dogs and likes cats 2333 ## 11 likes cats 1063 ## 12 likes dogs 7224 ## 13 likes dogs and dislikes cats 2029 ## 14 likes dogs and has cats 4313 ## 15 likes dogs and likes cats 14814 ## 16 <NA> 19921 ``` --- We will define cats/dogs people in the following way and stick to non-NA observations: ```r cats_categories <- c("has cats", "likes cats", "dislikes dogs and has cats", "dislikes dogs and likes cats") dogs_categories <- c("has dogs", "likes dogs", "has dogs and dislikes cats", "likes dogs and dislikes cats") okcupid3 <- okcupid %>% mutate(pets = case_when( pets %in% cats_categories ~ "cats", pets %in% dogs_categories ~ "dogs", TRUE ~ NA_character_)) %>% drop_na(pets) okcupid3 %>% count(pets) ``` ``` ## # A tibble: 2 × 2 ## pets n ## <chr> <int> ## 1 cats 2790 ## 2 dogs 13939 ``` Notice the classes are very unbalanced, with roughly 5 dogs people to every 1 cats person. --- As before in the vector `predictors` we have 37 continuous and categorical variables which may or may not be predictive to being a cats/dogs person: ```r okcupid3 <- okcupid3 %>% select(pets, all_of(predictors)) %>% mutate(id = 1:n()) dim(okcupid3) ``` ``` ## [1] 16729 39 ``` And as before we split the data into training, test and validation sets, not shown here: ``` ## train no. of rows: 10000 ## validation no. of rows: 2729 ## test no. 
of rows: 4000 ``` --- Worth exploring relations between predictors and being a cats/dogs person: ```r var_vs_pets_stackedbar <- function(var) { ggplot(okcupid3_train, aes({{var}}, fill = pets)) + geom_bar(position = "fill") + theme_bw() + labs(x = "", y = "") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) } var_vs_pets_stackedbar(sex) ``` <img src="images/Sex-Pets-1.png" width="80%" /> --- ```r var_vs_pets_stackedbar(body_type) ``` <img src="images/Body-Type-Pets-1.png" width="100%" /> --- ```r var_vs_pets_stackedbar(sign2) ``` <img src="images/Sign-Pets-1.png" width="100%" /> --- ```r var_vs_pets_stackedbar(job3) ``` <img src="images/Job-Pets-1.png" width="100%" /> --- ```r var_vs_pets_stackedbar(education2) ``` <img src="images/Edu-Pets-1.png" width="100%" /> --- ```r var_vs_pets_stackedbar(religion2) ``` <img src="images/Religion-Pets-1.png" width="100%" /> --- ```r var_vs_pets_stackedbar(diet2) ``` <img src="images/Diet-Pets-1.png" width="100%" /> --- ```r ggpairs(okcupid3_train %>% select(essay0_len:essay2_len, pets)) ``` <img src="images/Essay-Pets-1.png" width="100%" /> --- ### Baseline: Logistic Regression As in `lm`, it is better to impute missing values than to lose these observations: ```r # library(mice) # mice_obj <- mice(okcupid3, m = 1, maxit = 10, seed = 42) # okcupid3_imp_mice <- complete(mice_obj) okcupid3_imp_mice <- read_rds("../data/okcupid3_imp_mice.rds") okcupid3_imp_mice_train <- okcupid3_imp_mice[train_idx, ] okcupid3_imp_mice_valid <- okcupid3_imp_mice[valid_idx, ] mod_glm_mice <- glm(pets ~ ., data = okcupid3_imp_mice_train, family = "binomial") pred_glm_mice <- predict(mod_glm_mice, okcupid3_imp_mice_valid, type = "response") ``` --- Let's plot the ROC curve: ```r library(pROC) roc_obj <- roc(okcupid3_valid$pets, pred_glm_mice) ggroc(roc_obj) + theme_bw() + geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color="darkgrey", linetype="dashed") ``` <img src="images/GLM-ROC-1.png" width="50%" /> --- Choosing a specific cutoff, the proportion of dogs people in the training sample (the "positive" class), 0.84, let's look at the confusion matrix: ```r cutoff <- mean(okcupid3_train$pets == "dogs") pred_class <- ifelse(pred_glm_mice > cutoff, "dogs", "cats") true_class <- okcupid3_valid$pets table(true_class, pred_class) ``` ``` ## pred_class ## true_class cats dogs ## cats 313 160 ## dogs 679 1577 ``` .insight[ 💡 What would be the accuracy? Recall (sensitivity, specificity)? Precision (PPV, NPV)? 
] --- We would like to know the model's AUC, accuracy for a given score cutoff, recall and precision: ```r report_accuracy_and_auc <- function(obs, pred, cutoff = mean(okcupid3_train$pets == "dogs")) { roc_obj <- roc(obs, pred) AUC <- as.numeric(auc(roc_obj)) res <- coords(roc_obj, x = cutoff, ret = c("accuracy", "recall", "precision", "specificity", "npv"), transpose = TRUE) glue("AUC: {format(AUC, digits = 3)} ACC: {format(res['accuracy'], digits = 3)} Dogs: Recall: {format(res['recall'], digits = 3)} Precision: {format(res['precision'], digits = 3)} Cats: Recall: {format(res['specificity'], digits = 3)} Precision: {format(res['npv'], digits = 3)}") } report_accuracy_and_auc(okcupid3_valid$pets, pred_glm_mice) ``` ``` ## AUC: 0.736 ## ACC: 0.693 ## Dogs: Recall: 0.699 ## Precision: 0.908 ## Cats: Recall: 0.662 ## Precision: 0.316 ``` --- #### Baseline: GLMNET: Penalized Logistic Regression ```r okcupid3_imp_mat_train <- model.matrix( ~ ., okcupid3_imp_mice_train[, predictors]) okcupid3_imp_mat_valid <- model.matrix( ~ ., okcupid3_imp_mice_valid[, predictors]) glmnet_cv <- cv.glmnet(x = okcupid3_imp_mat_train, y = okcupid3_train$pets, family = "binomial") best_lambda <- glmnet_cv$lambda.min mod_glm_glmnet <- glmnet(x = okcupid3_imp_mat_train, y = okcupid3_train$pets, family = "binomial", lambda = best_lambda) pred_glm_glmnet <- predict(mod_glm_glmnet, okcupid3_imp_mat_valid, type = "response") report_accuracy_and_auc(okcupid3_valid$pets, pred_glm_glmnet[, 1]) ``` ``` ## AUC: 0.738 ## ACC: 0.685 ## Dogs: Recall: 0.688 ## Precision: 0.909 ## Cats: Recall: 0.67 ## Precision: 0.31 ``` --- class: section-slide # End of Detour --- ### The OG CART 1. Find the the predictor to (binary) split on and the value of the split, by `\(Gini\)` (a.k.a impurity) criterion 2. For each resulting node if `\(n_{node} > 20\)` go to 1 3. Once full tree has been grown, perform pruning using the *cost-complexity parameter* `\(c_p\)` and the `\(Gini_{c_p}\)` criterion 4. Predict the proportion of each class at each terminal node, or most common class --- ### The `\(Gini\)` criterion - `\(y\)` is the discrete dependent variable with `\(J\)` classes - The `\(Gini_{parent}\)` at a parent node is: `\(\sum_{j=1}^JP(y=j)[1-P(y=j)]=\sum_j p_j(1-p_j)\)` - a continuous predictor `\(v\)` is nominated for splitting the current node with splitting value `\(l\)` - such that `\(S_1\)` is the set of observations for which `\(v_i \le l\)` - and `\(S_2\)` is the set of observations for which `\(v_i > l\)` - `\(p_{1_j}\)` is the (observed) probability of `\(y\)` equals class `\(j\)` in set `\(S_1\)` `\(Gini_{split} = \frac{\text{#obs node1}}{\text{#obs parent}}Gini_{node1} + \frac{\text{#obs node2}}{\text{#obs parent}}Gini_{node2}=\)` `\(\frac{\text{#obs node1}}{\text{#obs parent}}\sum_j p_{1_j}(1-p_{1_j}) + \frac{\text{#obs node2}}{\text{#obs parent}}\sum_j p_{2_j}(1-p_{2_j})\)` And we find `\(l\)` for which `\(\Delta G = Gini_{parent} - Gini_{split}\)` is greatest. --- Why Gini? 
For `\(J=2\)` classes this means: `\(p_1(1-p_1)+p_2(1-p_2)=2p_1(1-p_1)\)` The more "impure" the node, the higher this metric: ```r p <- seq(0, 1, 0.01) plot(p, 2 * p * (1 - p), type = "l") ``` <img src="images/Gini-2-Classes-1.png" width="45%" /> --- For example if `age` is candidate in splitting `pets`: ```r gini <- function(l, df, v) { pets_above <- df %>% filter({{v}} >= l) %>% pull(pets) pets_below <- df %>% filter({{v}} < l) %>% pull(pets) p_dogs_above <- mean(pets_above == "dogs") gini_above <- 2 * p_dogs_above * (1 - p_dogs_above) p_dogs_below <- mean(pets_below == "dogs") gini_below <- 2 * p_dogs_below * (1 - p_dogs_below) gini_weighted <- (length(pets_above) / nrow(df)) * gini_above + (length(pets_below) / nrow(df)) * gini_below return(gini_weighted) } age <- seq(18,69, 1) gini_age <- map_dbl(age, gini, df = okcupid3_train, v = age) p1 <- okcupid3_train %>% group_by(age) %>% summarise(p_dogs = mean(pets == "dogs"), n = n()) %>% ggplot(aes(age, p_dogs)) + geom_point(aes(size = n)) + theme_bw() + labs(x = "") + theme(axis.text.x = element_blank()) p2 <- tibble(age = age, gini = gini_age) %>% ggplot(aes(age, gini)) + geom_line() + geom_point() + theme_bw() p1 / p2 ``` --- <img src="images/Religion-Gini-1.png" width="80%" /> --- ### CART with `rpart` Let's tune `\(c_p\)` for our data using a 5-fold (manual) Cross Validation. The criterion to maximize would be AUC. ```r n_cv <- 5; cp_seq <- seq(0, 0.01, 0.0005) okcupid3_train_val <- okcupid3_train %>% mutate(val = sample(1:n_cv, n(), replace = TRUE)) get_validation_set_auc <- function(i, .cp) { ok_tr <- okcupid3_train_val %>% filter(val != i) %>% select(-val) ok_val <- okcupid3_train_val %>% filter(val == i) %>% select(-val) mod <- rpart(pets ~ ., data = ok_tr, control = rpart.control(cp = 0, xval = 1)) mod <- prune(mod, cp = .cp) pred <- predict(mod, ok_val)[, 2] roc_obj <- roc(ok_val$pets, pred) as.numeric(auc(roc_obj)) } get_cv_auc <- function(.cp) { tibble(cp = rep(.cp, n_cv), auc = map_dbl(1:n_cv, get_validation_set_auc, .cp = .cp)) } ``` --- ```r cv_table <- map_dfr(cp_seq, get_cv_auc) cv_table %>% mutate(cp = factor(cp)) %>% ggplot(aes(cp, auc)) + geom_boxplot() + theme_bw() + labs(x = "") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) ``` <img src="images/Cp-CV-B2-1.png" width="100%" /> --- Training on the entire training set: ```r mod_tree_na_class <- rpart(pets ~ ., data = okcupid3_train, control = rpart.control(cp = 0, xval = 1)) mod_tree_na_class <- prune(mod_tree_na_class, cp = 0.0015) ``` There's no way to plot this tree nicely... ```r plot(mod_tree_na_class) ``` <img src="images/CART-NA2-Class-1.png" width="50%" /> --- Printing the tree it seems like after splitting observations by religion, those who are Christian, Jewish or Hindu have a 88% of being dogs people, and the rest are then split by Status, Body Type, the length of their 9th Essay, etc... <img src = "images/CART_class_print.png" style="width: 100%"> --- ### Variables Importance <img src="images/CART-Class-VarImp-1.png" width="60%" /> --- ### Prediction ```r pred_tree_na_class <- predict(mod_tree_na_class, okcupid3_valid) report_accuracy_and_auc(okcupid3_valid$pets, pred_tree_na_class[, 2]) ``` ``` ## AUC: 0.598 ## ACC: 0.751 ## Dogs: Recall: 0.842 ## Precision: 0.855 ## Cats: Recall: 0.317 ## Precision: 0.296 ``` As expected, far from impressive. 
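---

For a closer look, here is a sketch of the confusion matrix at the same cutoff we used for the logistic regression baseline .font80percent[(`pred_class_tree` is just an illustrative name)]:

```r
# classify as "dogs" when the predicted P(dogs) exceeds the training
# proportion of dogs people, as we did for the GLM baseline
cutoff <- mean(okcupid3_train$pets == "dogs")
pred_class_tree <- ifelse(pred_tree_na_class[, "dogs"] > cutoff, "dogs", "cats")
table(true = okcupid3_valid$pets, pred = pred_class_tree)
```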
--- Let's try using the imputed `NA` data: ```r mod_tree_imp_class <- rpart(pets ~ ., data = okcupid3_imp_mice_train, control = rpart.control(cp = 0, xval = 1)) mod_tree_imp_class <- prune(mod_tree_imp_class, cp = 0.0015) pred_tree_imp_class <- predict(mod_tree_imp_class, okcupid3_imp_mice_valid) report_accuracy_and_auc(okcupid3_valid$pets, pred_tree_imp_class[, 2]) ``` ``` ## AUC: 0.635 ## ACC: 0.75 ## Dogs: Recall: 0.828 ## Precision: 0.864 ## Cats: Recall: 0.381 ## Precision: 0.317 ``` --- # Are you impressed? Me neither. To be continued.