Feature Engineering

---

## Feature Engineering

### Applications of Data Science - Class 15

### Giora Simchoni

#### `gsimchoni@gmail.com and add #dsapps in subject`

### Stat. and OR Department, TAU
### 2023-05-28

---

<div class="my-footer">
  <span>
    <a href="https://dsapps-2023.github.io/Class_Slides/" target="_blank">Applications of Data Science
    </a>
  </span>
</div>

---

# What can you possibly teach us about Feature Engineering?

---

Our two previous modeling attempts, with [Modeling in the Tidyverse](https://dsapps-2023.github.io/Class_Slides/u1_d05-modeling/u1_d05-modeling.html) and in [The Trees](https://dsapps-2023.github.io/Class_Slides/u3_d01-trees/u3_d01-trees.html) were problematic.

- Using a "sinked" split, not using `set.seed()` - bit wasteful
- "Lazy" feature engineering, not using the entire data, not creative
- No interactions in linear models
- Not thinking about Multicollinearity in linear models
- Test once, on the same validation set, for everything
- Data leakage from training set to validation set
- Sometimes tuning params, sometimes not
- No pre-processing of features, if at all by a hunch
- Almost no treatment of class imbalance in classification
- Chaotic feature selection

---

We had one job:

Predict if an OKCupid user is a Cats or Dogs person, where:

```r
cats_categories <- c("has cats", "likes cats", "dislikes dogs and has cats",
                     "dislikes dogs and likes cats")
dogs_categories <- c("has dogs", "likes dogs", "has dogs and dislikes cats",
                     "likes dogs and dislikes cats")
```

And 80% of users are Dogs people, so we have class imbalance.

Model     | AUC | Accuracy | Recall Dogs | Precision Dogs | Recall Cats | Precision Cats
----------|-----|----------|-------------|----------------|-------------|----------
GLM       | 0.74|  0.68    |  0.68       |  0.92          |  0.66       |  0.26
GLM-EN    | 0.74|  0.67    |  0.67       |  0.92          |  0.67       |  0.26
CART      | 0.67|  0.75    |  0.79       |  0.9           |  0.49       |  0.28
GBT+Cutoff| 0.74|  0.70    |  0.71       |  0.9           |  0.65       |  0.32

And we were not impressed.

---

# Resampling/Cross-Validation

---

### Resampling done right

The main goal of resampling: No Surprises.

(By generating multiple versions of our data, and watching how our model behaves on unseen fresh data again. And again. And again.)

- Resampling is a key principle when developing a ML solution
- All ML practitioners are aware of the menace of overfitting
- But resampling should also be made for:
  - Feature engineering and transformation .font80percent[(should I use the Box-Cox or Yeo-Johnson transformation?)]
  - Feature selection .font80percent[(Should I use 2nd-order interactions here?)]
  - Choosing imputation strategy .font80percent[(Simply use the median or regress?)]
  - Tuning parameters .font80percent[(100 or 500 trees?)]
  - Model selection .font80percent[(LM or MARS?)]
  
---

- A good resampling scheme will:
  - Have more than one train/valid/test samples!*
  - Be stratified (usually by dependent variable distribution)
  - Use a (paired/repeated/within) statistical test on resamples to reach a decision, or at least a proper plot
  - Avoid data leakage and not include the test set!
  - Encompass all steps in model building
  - Ideally (when data is Big) perform each step on a different part *of the training data*

`-` The higher the number the lower the variance of performance metric

`-` The higher the number the less data used in each resample, the higher the bias of performance metric
]

---

### Ideally (yes, I know)

---

### And don't forget

Only once we decide on the final pre-processing, features, model and parameters, will we fit the model two more times:
  1. On the entire training data, then test on the testing data and get a realistic metric of performance
  2. On the *entire* data, right before deployment to production
  
---

### Many approaches to resampling

But two most common:

1. V-fold Cross Validation (+ optionally repeating n times)
2. Bootstrap Samples

---

#### V-fold Cross Validation

.font80percent[
[Applied Machine Learning Workshop at Rstudio-conf 2019 / Max Kuhn](https://github.com/topepo/rstudio-conf-2019)
]

---

#### Bootstrap Samples

---

# Let's Start from the Very Beginning

---

#### OkCupid from scratch

```r
okcupid <- read_csv("~/okcupid.csv.zip")

cats_categories <- c("has cats", "likes cats", "dislikes dogs and has cats",
                     "dislikes dogs and likes cats")
dogs_categories <- c("has dogs", "likes dogs", "has dogs and dislikes cats",
                     "likes dogs and dislikes cats")

okcupid <- okcupid %>%
  mutate(pets = case_when(
           pets %in% cats_categories ~ "cats",
           pets %in% dogs_categories ~ "dogs",
           TRUE ~ NA_character_)) %>%
  drop_na(pets) %>%
  mutate(income = na_if(income, -1)) %>%
  select(-last_online)

okcupid[which(is.na(okcupid$height)), "height"] <- 70 # you don't wanna know
```

---

Using `rsample::initial_split(okcupid, strata = pets)` would have been better, but if we do it now we won't have the same test data to compare.

```r
idx <- read_rds("../data/okcupid3_idx.rda")
train_idx <- idx$train_idx
valid_idx <- idx$valid_idx
test_idx <- idx$test_idx

okcupid_train <- okcupid[train_idx, ]
okcupid_valid <- okcupid[valid_idx, ]
okcupid_test <- okcupid[test_idx, ]
```

And let us bind `okcupid_train` and `okcupid_valid` together because we are going to use resampling anyway:

```r
okcupid_train <- bind_rows(okcupid_train, okcupid_valid)

glue("train no. of rows: {nrow(okcupid_train)}
     test no. of rows: {nrow(okcupid_test)}")
```

```
## train no. of rows: 12729
## test no. of rows: 4000
```

---

See that the class balance is more or less the same:

```r
okcupid_train %>% count(pets) %>% mutate(pct = n / sum(n))
```

```
## # A tibble: 2 × 3
##   pets      n   pct
##   <chr> <int> <dbl>
## 1 cats   2097 0.165
## 2 dogs  10632 0.835
```

```r
okcupid_test %>% count(pets) %>% mutate(pct = n / sum(n))
```

```
## # A tibble: 2 × 3
##   pets      n   pct
##   <chr> <int> <dbl>
## 1 cats    693 0.173
## 2 dogs   3307 0.827
```

Good enough .font80percent[(we could also drop a few cat people at random if they're too much in the test group)].

We stash our test data aside, kiss it good bye, we shall not see it again before the final assessment of our model.

---

#### That manager again.

Manager mentions casually: oh, and by the way, this model thingy of yours. Yeah, we need it to run in, like 10 milliseconds per user. Marketing team needs that, don't ask me why, and we can't have the users wait.

You: penalized logistic regression it is 😏

No one mentioned cats/dogs recall/precision to be of any particular importance, so we decide to maximize AUC and take care of cutoff later.

---

We explore the training data .font80percent[(skipping here at least a couple of hours in which we actually explore the data, I'm assuming we already know quite a bit!)]:

```r
glimpse(okcupid_train)
```

```
## Rows: 12,729
## Columns: 30
## $ age         <dbl> 39, 19, 27, 24, 29, 24, 25, 25, 21, 25, 28, 25, 21, 19, 26…
## $ body_type   <chr> "fit", "curvy", NA, "fit", "fit", "fit", "athletic", "aver…
## $ diet        <chr> "mostly anything", "mostly vegetarian", "mostly anything",…
## $ drinks      <chr> "socially", "rarely", NA, "socially", "socially", "sociall…
## $ drugs       <chr> "never", "sometimes", NA, NA, "never", NA, "never", "never…
## $ education   <chr> "graduated from college/university", NA, NA, "graduated fr…
## $ essay0      <chr> "i am a very positive person who loves to laugh and enjoy …
## $ essay1      <chr> "i work in the health and wellness industry and i try to a…
## $ essay2      <chr> "making people smile or laugh.", "ummmm everything? :d<br …
## $ essay3      <chr> "my eyes", "my boobs. definitely.<br />\nand the fact that…
## $ essay4      <chr> "i enjoy malcolm gladwell, blink, tipping point, outliers;…
## $ essay5      <chr> "water,steak,chocolate,comfy jeans,transportation,friends"…
## $ essay6      <chr> "food, naps, women, and health", "caffeine, my baby chihua…
## $ essay7      <chr> "doing whatever has come up or has been planned. dinner, c…
## $ essay8      <chr> "i am totally deaf in my right ear", "i love animals more …
## $ essay9      <chr> "you enjoy having fun, love to laugh, have a positive atti…
## $ ethnicity   <chr> "white", "white", NA, "white", "asian, pacific islander", …
## $ height      <dbl> 70, 60, 72, 65, 69, 72, 70, 67, 66, 67, 63, 63, 62, 67, 66…
## $ income      <dbl> 8e+04, NA, NA, NA, 3e+04, 5e+04, 1e+05, NA, 2e+04, NA, NA,…
## $ job         <chr> "medicine / health", "sales / marketing / biz dev", "medic…
## $ location    <chr> "mountain view, california", "pleasant hill, california", …
## $ offspring   <chr> "doesn&rsquo;t have kids, but might want them", NA, "doesn…
## $ orientation <chr> "straight", "straight", "straight", "straight", "straight"…
## $ pets        <chr> "dogs", "dogs", "dogs", "dogs", "dogs", "dogs", "dogs", "d…
## $ religion    <chr> "atheism", "agnosticism and laughing about it", "agnostici…
## $ sex         <chr> "m", "f", "m", "f", "m", "m", "m", "m", "f", "m", "f", "f"…
## $ sign        <chr> "capricorn", "libra and it matters a lot", NA, NA, "libra …
## $ smokes      <chr> "no", "yes", NA, "no", "no", NA, "no", "sometimes", "no", …
## $ speaks      <chr> "english", "english (fluently)", "english", "english (flue…
## $ status      <chr> "single", "seeing someone", "single", "single", "single", …
```

---

```r
library(naniar)

vis_miss(okcupid_train %>%
           sample_frac(0.2) %>%
           select(-starts_with("essay")))
```

---

#### Q & A

Q: What type of variables do we have?

A: Numeric: age, height income; Text: essays; Categorical: all others

Q: Can our model(s) handle missing values?

A: Probably gonna use `glmnet` so: NO.

Q: Are missing values a serious issue here?

A: Oh yes. `income` 80+% missing, `offspring` 50+% missing.

---

#### Q & A

Q: Can our model(s) handle categorical values?

A: Probably gonna use `glmnet` so: NO.

Q: Categorical variables with many levels?

A: Oh yes. `speaks` has over 2K. And there are those `essay`s.

Q: Numeric variables skewed?

A: Income, but all you need is log.

Q: Classes imbalance scenario?

A: Pretty imbalanced, 5 dogs people for every 1 cat person. Gonna have to take care of it.

---

#### You have to have a strategy

Strategy: Divide training data into 6 samples for 6 main decision/steps/tunings:

1. Missing data imputation
2. Class imbalance treatment
3. Feature Engineering for `essay` data
4. Looking into interactions
4. Model Tuning
5. Cutoff choice

Does it have to be and what can we do if we suspect not? (Remember we are Statisticians!)
]

---

Looking at our strategy, we decide the 12.7K observations need not split evenly between steps.

- Cutoff choice is "peanuts" compared to the other decisions, let's decide in advance that at this stage we would fit the chosen model on ~12K previously seen observations and choose a cutoff by looking at the remaining ~700 observations
- Missing data imputation, class imbalance and interaction steps will each be given 2K observations
- Model tuning 3K observations
- Text feature engineering and selection 3K observations
- Total: 12,729 training observations

---

We wish to make sure our samples are stratified by `pets`, so we'll use `initial_split()` in stages:

```r
library(tidymodels)
tidymodels_prefer()

set.seed(42)
ok_cutoff_split <- initial_split(okcupid_train,
                                 strata = pets, prop = (729 + 1)/12729)
ok_tr_cutoff <- training(ok_cutoff_split)
ok_train <- testing(ok_cutoff_split)

set.seed(42)
ok_missing_split <- initial_split(ok_train,
                                  strata = pets, prop = (2000 + 1)/12000)
ok_tr_missing <- training(ok_missing_split)
ok_train <- testing(ok_missing_split)

set.seed(42)
ok_imbalance_split <- initial_split(ok_train,
                                  strata = pets, prop = (2000 + 1)/10000)
ok_tr_imbalance <- training(ok_imbalance_split)
ok_train <- testing(ok_imbalance_split)
```

---

```r
set.seed(42)
ok_interaction_split <- initial_split(ok_train,
                                      strata = pets, prop = (2000 + 1)/8000)
ok_tr_interaction <- training(ok_interaction_split)
ok_train <- testing(ok_interaction_split)

set.seed(42)
ok_tuning_split <- initial_split(ok_train,
                                  strata = pets, prop = 3000/6000)
ok_tr_tuning <- training(ok_tuning_split)
ok_tr_eng <- testing(ok_tuning_split)
```

```
## # A tibble: 7 × 4
##   name        df                    n_rows cats_rate
##   <chr>       <list>                 <dbl>     <dbl>
## 1 missing     <tibble [2,000 × 30]>   2000     0.164
## 2 imbalance   <tibble [2,000 × 30]>   2000     0.164
## 3 engineering <tibble [3,000 × 30]>   3000     0.165
## 4 interaction <tibble [2,000 × 30]>   2000     0.164
## 5 tuning      <tibble [3,000 × 30]>   3000     0.165
## 6 cutoff      <tibble [729 × 30]>      729     0.165
## 7 test        <tibble [4,000 × 30]>   4000     0.173
```

---

# Dealing with Missing Data

---

### Missing Data

- What causes a missing datum?
  - Survey data
  - Merging data sources
  - Failure of measurement

---

- What is the missing data mechanism?
  - See [Little & Rubin (2019)](https://www.amazon.com/Statistical-Analysis-Missing-Probability-Statistics-ebook/dp/B07Q25CNSD/ref=sr_1_1?crid=2P6ERH22YRKUL&dchild=1&keywords=statistical+analysis+with+missing+data&qid=1587103544&s=books&sprefix=statistical+ana%2Cstripbooks-intl-ship%2C286&sr=1-1) seminal work
  - MCAR: "on some days the weighing scale batteries drained and we couldn't take the mice weight"
  - MAR: "on the 2024 elections poll a massive Facebook campaign called women to refuse to cooperate with the Seker-Sheker company due to allegations of sexual harassment by the company's head Dani Marom"
  - MNAR: "some of the drug addicts under study missed the final few meetings in the treatment group and their data were discarded"
  
.insight[
💡 How would you asses the missing data mechanism in your dataset?
]

---

Exploring percent missingness

```r
gg_miss_var(ok_tr_missing %>%
              select(-starts_with("essay")),
            show_pct = TRUE)
```

---

Exploring percent missingness by a predictor

```r
gg_miss_var(ok_tr_missing %>%
              select(-starts_with("essay")),
            facet = sex, show_pct = TRUE)
```

---

Exploring missingness between a predictor and the response (categorical) variable

```r
ok_tr_missing %>%
    mutate(income_missing = is.na(income)) %>%
    count(income_missing, pets) %>%
  pivot_wider(id_cols = pets,
              names_from = income_missing,
              values_from = n) %>%
  mutate(pct_missing = `TRUE` / (`TRUE` + `FALSE`))
```

```
## # A tibble: 2 × 4
##   pets  `FALSE` `TRUE` pct_missing
##   <chr>   <int>  <int>       <dbl>
## 1 cats       73    256       0.778
## 2 dogs      299   1372       0.821
```

---

But we can do better

---

Exploring missingness between two (categorical) predictors

```r
library(ggmosaic)

ok_tr_missing %>%
  mutate(body_type_missing = is.na(body_type)) %>%
  ggplot() +
  geom_mosaic(aes(x = product(body_type_missing, status),
                  fill = body_type_missing)) +
  labs(x = "", y = "") + theme_light() + theme(axis.text.y=element_blank(), axis.ticks.y=element_blank(), axis.text.x=element_text(angle = 45, hjust = 1))
```

---

You don't always need a plot...

```r
ok_tr_missing %>%
    mutate(body_type_missing = is.na(body_type)) %>%
    count(body_type_missing, status) %>%
  pivot_wider(id_cols = status,
              names_from = body_type_missing,
              values_from = n) %>%
  mutate(pct_missing = `TRUE` / (`TRUE` + `FALSE`))
```

```
## # A tibble: 4 × 4
##   status         `FALSE` `TRUE` pct_missing
##   <chr>            <int>  <int>       <dbl>
## 1 available           49      3      0.0577
## 2 married              8      2      0.2   
## 3 seeing someone      48     12      0.2   
## 4 single            1751    127      0.0676
```

---

The same with `body_type` missingness by `drugs`:

```r
ok_tr_missing %>%
    mutate(body_type_missing = is.na(body_type),
           drugs_missing = is.na(drugs)) %>%
    count(body_type_missing, drugs_missing) %>%
  pivot_wider(id_cols = drugs_missing,
              names_from = body_type_missing,
              values_from = n) %>%
  mutate(pct_missing = `TRUE` / (`TRUE` + `FALSE`))
```

```
## # A tibble: 2 × 4
##   drugs_missing `FALSE` `TRUE` pct_missing
##   <lgl>           <int>  <int>       <dbl>
## 1 FALSE            1423     93      0.0613
## 2 TRUE              433     51      0.105
```

---

But we can do better with a co-occurrence plot or a heatmap

```r
ok_tr_missing %>%
  count(drugs, body_type) %>%
  ggplot(aes(drugs, body_type)) +
  geom_point(aes(size = n)) +
  theme_light()
```

---

Exploring missingness between numerical and categorical predictors

```r
ok_tr_missing %>%
  mutate(bodytype_missing = is.na(body_type)) %>%
  ggplot(aes(bodytype_missing, age)) +
  geom_violin() +
  facet_wrap(. ~ sex) +
  theme_light()
```

---

Exploring missingness between two numerical predictors (borrowing `airquality` data)

```r
ggplot(airquality, aes(Ozone, Solar.R)) +
  naniar::geom_miss_point() +
  theme_light()
```

---

When the data is large you could perform PCA on the binary missing/non-missing matrix:

```r
binary_mat <- ok_tr_missing %>%
  select(-pets) %>%
  mutate(across(everything(), is.na))

prcomp(binary_mat) %>%
  pluck("rotation") %>%
  as_tibble() %>%
  select(PC1, PC2) %>%
  bind_cols(
    tibble(
      predictor = colnames(binary_mat),
      pct_missing = binary_mat %>% colSums() / nrow(ok_tr_missing) 
    )
  ) %>%
  mutate(label = ifelse(PC1 < -0.3 | PC2 > 0.3, predictor, "")) %>%
  ggplot(aes(PC1, PC2)) +
  geom_point(aes(size = pct_missing), color = "lightblue", alpha = 0.5) +
  ggrepel::geom_text_repel(aes(label = label)) +
  theme_light()
```

---

---

#### Definitely not in MCAR land

- Keep the data missing, drop it all or impute values?
  - Very much depends on the mechanism of missing data
  - "Drop it all" only if you're "data rich" and assume MCAR
  - Keep or impute: can your model handle missing values?
  - "It's not a bug, it's a feature!": a new category or an additional variable
  - Use resampling to test the best strategy!

---

- How to impute?
  - Very much depends on the mechanism of missing data
  - Mean/median value if MCAR
  - kNN imputation
  - Iterative Regression/Classification algos (EM algorithm)
  - Use resampling to test the best strategy!

- What do you do with missing values in the response variable?
  - Semi-supervised Learning
  - Active Learning
  
---

Let's choose between 2 x 3 x 3 = 18 strategies:

1. Missing categorical features
  - adding an "unknown" category
  - imputing the most common category
2. The `income` feature:
  - Dropping it (it is 80% missing!!)
  - Log transformation, imputing the mean
  - Log transformation, kNN Imputation
3. The `essay` features:
  - Dropping  them
  - Adding `essay` length where a missing `essay` has length 0
  - Adding `essay` length where a missing `essay` has length 0 and adding "is_missing" feature to each `essay`

---

Use `step_mutate_*()` or add what you need beforehand.

```r
ok_tr_missing2 <- ok_tr_missing %>%
  mutate(across(essay0:essay9,
                list(
                  "len" = ~ifelse(is.na(.x), 0, str_length(.x)),
                  "isna" = ~ifelse(is.na(.x), 1, 0)
                  ))) %>%
  rename_with(~paste("len", gsub("_len", "", .), sep = "_"),
              ends_with("_len")) %>%
  rename_with(~paste("isna", gsub("_is_na", "", .), sep = "_"),
              ends_with("_isna")) %>%
  select(-starts_with("essay")) %>%
  mutate(across(where(is.character), as.factor))
```

---

Defining recipes .font80percent[(without `prep()`!)], let's start with 2 to simplify.

```r
rec_miss_unk_none_none <- recipe(pets ~ ., data = ok_tr_missing2) %>%
  step_rm(income, contains("essay")) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), other = "other1") %>%
  step_novel(all_nominal_predictors()) %>%
* step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE)

rec_miss_mode_none_none <- recipe(pets ~ ., data = ok_tr_missing2) %>%
  step_rm(income, contains("essay")) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), other = "other1") %>%
  step_novel(all_nominal_predictors()) %>%
* step_impute_mode(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE)
```

---

Defining resamples

```r
set.seed(42)
cv_splits_missing <- vfold_cv(ok_tr_missing2, v = 10, strata = pets)

cv_splits_missing
```

```
## #  10-fold cross-validation using stratification 
## # A tibble: 10 × 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [1799/201]> Fold01
##  2 <split [1800/200]> Fold02
##  3 <split [1800/200]> Fold03
##  4 <split [1800/200]> Fold04
##  5 <split [1800/200]> Fold05
##  6 <split [1800/200]> Fold06
##  7 <split [1800/200]> Fold07
##  8 <split [1800/200]> Fold08
##  9 <split [1800/200]> Fold09
## 10 <split [1801/199]> Fold10
```

---

```r
cv_splits_missing$rec_miss_unk_none_none <- map(
  cv_splits_missing$splits, prepper, recipe = rec_miss_unk_none_none)
cv_splits_missing$rec_miss_mode_none_none <- map(
  cv_splits_missing$splits, prepper, recipe = rec_miss_mode_none_none)

map_dbl(cv_splits_missing$rec_miss_unk_none_none, ~nrow(.$term_info))
```

```
##  [1] 92 91 93 93 92 91 90 91 92 91
```

```r
cv_splits_missing$rec_miss_unk_none_none[[1]]$term_info %>%
  filter(str_detect(variable, "speaks"))
```

```
## # A tibble: 4 × 4
##   variable                  type      role      source 
##   <chr>                     <list>    <chr>     <chr>  
## 1 speaks_english..fluently. <chr [2]> predictor derived
## 2 speaks_other1             <chr [2]> predictor derived
## 3 speaks_new                <chr [2]> predictor derived
## 4 speaks_unknown            <chr [2]> predictor derived
```

Notice the `role` and `source` columns.

---

Defining the model specification

```r
mod_lr <- logistic_reg(penalty = 0.01) %>%
  set_engine("glmnet")
```

Use `workflow_set()` to fit the two recipes on all folds:

```r
rec_list <- list(unk = rec_miss_unk_none_none,
                 mode = rec_miss_mode_none_none)
mod_list <- list(lr = mod_lr)
*wset <- workflow_set(rec_list, mod_list)
# also check out the cross arg

wset
```

```
## # A workflow set/tibble: 2 × 4
##   wflow_id info             option    result    
##   <chr>    <list>           <list>    <list>    
## 1 unk_lr   <tibble [1 × 4]> <opts[0]> <list [0]>
## 2 mode_lr  <tibble [1 × 4]> <opts[0]> <list [0]>
```

---

```r
wset <- wset %>%
* workflow_map("fit_resamples",
               resamples = cv_splits_missing,
               metrics = metric_set(roc_auc))

wset
```

```
## # A workflow set/tibble: 2 × 4
##   wflow_id info             option    result   
##   <chr>    <list>           <list>    <list>   
## 1 unk_lr   <tibble [1 × 4]> <opts[2]> <rsmp[+]>
## 2 mode_lr  <tibble [1 × 4]> <opts[2]> <rsmp[+]>
```

---

Can use built-in `autoplot()` to see the results, I dislike it:

```r
library(ggrepel)

autoplot(wset, metric = "roc_auc") +
  geom_text_repel(aes(label = wflow_id), nudge_x = 1/8, nudge_y = 1/100) +
  theme_light() +
  theme(legend.position = "none")
```

---

A type of parallel coordinates plot showing all folds is better IMO:

```r
res_df <- wset %>%
* unnest(result) %>%
  mutate(recipe = wflow_id, auc = map_dbl(.metrics, ~.x$.estimate))

res_df %>%
  ggplot(aes(recipe, auc, group = id, color = id)) +
  geom_line() +
  geom_point() +
  guides(color = "none") + theme_light()
```

---

Comparing AUCs with a statistical test (paired T test)

```r
t.test(with(res_df, auc[recipe == "unk_lr"]),
       with(res_df, auc[recipe == "mode_lr"]), paired = TRUE)
```

```
## 
## 	Paired t-test
## 
## data:  with(res_df, auc[recipe == "unk_lr"]) and with(res_df, auc[recipe == "mode_lr"])
## t = -1.5003, df = 9, p-value = 0.1678
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.013950063  0.002824533
## sample estimates:
## mean difference 
##    -0.005562765
```

---

Let's move on to 2 x 2 recipes, adding 2 more recipes for treating the `income` variable with mean imputation, with `step_unknown()` and `step_impute_mode()`.

```r
rec_miss_unk_mean_none <- recipe(pets ~ ., data = ok_tr_missing2) %>%
  step_rm(contains("essay")) %>%
* step_log(income, offset = 1) %>%
* step_impute_mean(income) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), other = "other1") %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE)

rec_miss_mode_mean_none <- recipe(pets ~ ., data = ok_tr_missing2) %>%
  step_rm(contains("essay")) %>%
* step_log(income, offset = 1) %>%
* step_impute_mean(income) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), other = "other1") %>%
  step_novel(all_nominal_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE)
```

---

And again, using `workflow_set()`, more concise:

```r
rec_list <- list("unknown_none" = rec_miss_unk_none_none,
                 "mode_none" = rec_miss_mode_none_none,
                 "unknown_mean" = rec_miss_unk_mean_none,
                 "mode_mean" = rec_miss_mode_mean_none)

wset <- workflow_set(rec_list, mod_list) %>%
  workflow_map("fit_resamples",
               resamples = cv_splits_missing,
               metrics = metric_set(roc_auc))

cv_res_missing <- wset %>%
  unnest(result) %>%
  mutate(recipe = wflow_id, auc = map_dbl(.metrics, ~.x$.estimate))

cv_res_missing %>% select(wflow_id, id, auc)
```

```
## # A tibble: 40 × 3
##    wflow_id        id       auc
##    <chr>           <chr>  <dbl>
##  1 unknown_none_lr Fold01 0.707
##  2 unknown_none_lr Fold02 0.694
##  3 unknown_none_lr Fold03 0.735
##  4 unknown_none_lr Fold04 0.686
##  5 unknown_none_lr Fold05 0.667
##  6 unknown_none_lr Fold06 0.599
##  7 unknown_none_lr Fold07 0.666
##  8 unknown_none_lr Fold08 0.676
##  9 unknown_none_lr Fold09 0.648
## 10 unknown_none_lr Fold10 0.729
## # … with 30 more rows
```

---

Comparing AUCs with a plot:

```r
cv_res_missing %>%
  separate_wider_delim(wflow_id, "_", names = c("categorical", "income", "mod")) %>%
  ggplot(aes(income, auc, group = id, color = id)) +
  geom_line() + geom_line(data = cv_res_missing_meds, color = "black", alpha = 0.2, linewidth = 2) +
  geom_point() + geom_point(data = cv_res_missing_meds, color = "black", alpha = 0.2, size = 5, pch = 17) +
    guides(color = "none") + facet_wrap(. ~ categorical) +
    theme_light()
```

---

Comparing AUCs with a statistical test (2x2 Repeated measures ANOVA):

```r
cv_res_missing <- cv_res_missing %>%
  separate_wider_delim(wflow_id, "_", names = c("categorical", "income", "mod"))

aov_missing <- aov(
  auc ~ factor(categorical) * factor(income) + Error(factor(id)),
  data = cv_res_missing)
print(summary(aov_missing))
```

```
## 
## Error: factor(id)
##           Df  Sum Sq  Mean Sq F value Pr(>F)
## Residuals  9 0.05781 0.006424               
## 
## Error: Within
##                                    Df    Sum Sq   Mean Sq F value Pr(>F)  
## factor(categorical)                 1 0.0002579 2.579e-04   5.333 0.0288 *
## factor(income)                      1 0.0000310 3.103e-05   0.642 0.4301  
## factor(categorical):factor(income)  1 0.0000023 2.340e-06   0.048 0.8275  
## Residuals                          27 0.0013059 4.836e-05                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

```r
with(cv_res_missing, interaction.plot(categorical, income, auc))
```

---

Ok, now let's go back to our original idea of 18 recipes... see code in slides Rmd file.

```r
rec_list <- list("unknown_none_none" = rec_miss_unk_none_none,
                 "mode_none_none" = rec_miss_mode_none_none,
                 "unknown_mean_none" = rec_miss_unk_mean_none,
                 "mode_mean_none" = rec_miss_mode_mean_none,
                 "unknown_knn_none" = rec_miss_unk_knn_none,
                 "mode_knn_none" = rec_miss_mode_knn_none,
                 "unknown_none_len" = rec_miss_unk_none_len,
                 "mode_none_len" = rec_miss_mode_none_len,
                 "unknown_mean_len" = rec_miss_unk_mean_len,
                 "mode_mean_len" = rec_miss_mode_mean_len,
                 "unknown_knn_len" = rec_miss_unk_knn_len,
                 "mode_knn_len" = rec_miss_mode_knn_len,
                 "unknown_none_isna" = rec_miss_unk_none_isna,
                 "mode_none_isna" = rec_miss_mode_none_isna,
                 "unknown_mean_isna" = rec_miss_unk_mean_isna,
                 "mode_mean_isna" = rec_miss_mode_mean_isna,
                 "unknown_knn_isna" = rec_miss_unk_knn_isna,
                 "mode_knn_isna" = rec_miss_mode_knn_isna
                 )
```

BTW, I just saw each of these is ~700KB before prepping, ridic.
---

```r
set.seed(42)
cv_splits_missing <- vfold_cv(ok_tr_missing2, v = 10, strata = pets)

wset <- workflow_set(rec_list, mod_list) %>%
  workflow_map("fit_resamples",
               resamples = cv_splits_missing,
               metrics = metric_set(roc_auc))

cv_res_missing <- wset %>%
  unnest(result) %>%
  mutate(recipe = wflow_id, auc = map_dbl(.metrics, ~.x$.estimate)) %>%
  select(wflow_id, id, auc) %>%
  separate_wider_delim(wflow_id, "_", names = c("categorical", "income", "essay", "mod"))
```

---

Again, see the plot .font80percent[(how would you plot a 2x3x3 experiment results?)].

```r
cv_res_missing %>%
  group_by(id, income, essay) %>%
  summarise(auc = mean(auc)) %>%
  ggplot(aes(income, auc, group = id, color = id)) +
  geom_line() + geom_line(data = cv_res_missing_meds, color = "black", alpha = 0.2, linewidth = 2) +
  geom_point() + geom_point(data = cv_res_missing_meds, color = "black", alpha = 0.2, size = 5, pch = 17) +
  guides(color = "none") + facet_wrap(. ~ essay) +
  theme_light()
```

---

Repeated measures ANOVA

```r
aov_missing <- aov(
  auc ~ factor(categorical) * factor(income) + factor(essay) + Error(factor(id)),
  data = cv_res_missing)
summary(aov_missing)
```

```
## 
## Error: factor(id)
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals  9 0.2215 0.02461               
## 
## Error: Within
##                                     Df   Sum Sq   Mean Sq F value   Pr(>F)    
## factor(categorical)                  1 0.001649 0.0016491  15.295 0.000135 ***
## factor(income)                       2 0.000239 0.0001193   1.107 0.333046    
## factor(essay)                        2 0.000129 0.0000644   0.597 0.551681    
## factor(categorical):factor(income)   2 0.000013 0.0000065   0.060 0.941812    
## Residuals                          163 0.017574 0.0001078                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

Or you can just look at the mean numbers...

```r
cv_res_missing %>%
  group_by(categorical, income, essay) %>%
  summarise(auc = mean(auc)) %>%
  arrange(-auc) %>%
  head(10)
```

```
## # A tibble: 10 × 4
## # Groups:   categorical, income [4]
##    categorical income essay   auc
##    <chr>       <chr>  <chr> <dbl>
##  1 mode        mean   none  0.688
##  2 mode        mean   len   0.687
##  3 mode        mean   isna  0.687
##  4 mode        none   none  0.686
##  5 mode        none   len   0.685
##  6 mode        none   isna  0.685
##  7 mode        knn    none  0.685
##  8 mode        knn    len   0.684
##  9 mode        knn    isna  0.683
## 10 unknown     mean   none  0.683
```

---

# Class Imbalance

---

### Class Imbalance

.font80percent[(We already gave a decent review in [last class](https://dsapps-2023.github.io/Class_Slides/u3_d02-classtopics/u3_d02-classtopics.html))]

We'll choose between 3 strategies:
- Upsample
- Downsample
- SMOTE and Downsample

---

Remember how we got here.

```r
handle_essays <- function(ok_df) {
  ok_df %>%
    mutate(across(essay0:essay9,
                  list("len" = ~ifelse(is.na(.x), 0, str_length(.x))))) %>%
    rename_with(~paste("len", gsub("_len", "", .), sep = "_"),
                ends_with("_len"))
}

ok_tr_imbalance2 <- handle_essays(ok_tr_imbalance) %>%
  select(-starts_with("essay")) %>%
  mutate(across(where(is.character), as.factor))
```

---

```r
library(themis)

rec_imb_base <- recipe(pets ~ ., data = ok_tr_imbalance2) %>%
  step_log(income, starts_with("len"), offset = 1) %>%
  step_impute_mean(income) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), other = "other1") %>%
  step_novel(all_nominal_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE)

rec_imb_upsample <- rec_imb_base %>%
* step_upsample(pets, over_ratio = 1, seed = 42)

rec_imb_downsample <- rec_imb_base %>%
* step_downsample(pets, under_ratio = 1, seed = 42)

rec_imb_smote <- rec_imb_base %>%
* step_downsample(pets, under_ratio = 1.5) %>%
* step_smote(pets, over_ratio = 1, seed = 42)
```

---

CV splits, `workflow_set()`.

```r
set.seed(42)
cv_splits_imbalance <- vfold_cv(ok_tr_imbalance2, v = 10, strata = pets)

rec_list <- list("upsample" = rec_imb_upsample,
                 "downsample" = rec_imb_downsample,
                 "smote" = rec_imb_smote)

wset <- workflow_set(rec_list, mod_list) %>%
  workflow_map("fit_resamples",
               resamples = cv_splits_imbalance,
               metrics = metric_set(roc_auc))

cv_res_imbalance <- wset %>%
  unnest(result) %>%
  mutate(recipe = wflow_id, auc = map_dbl(.metrics, ~.x$.estimate)) %>%
  select(recipe, id, auc)
```

---

Repeated measures ANOVA:

```r
aov_imbalance <- aov(
  auc ~ factor(recipe) + Error(factor(id)),
  data = cv_res_imbalance)
print(summary(aov_imbalance))
```

```
## 
## Error: factor(id)
##           Df  Sum Sq  Mean Sq F value Pr(>F)
## Residuals  9 0.02568 0.002854               
## 
## Error: Within
##                Df   Sum Sq   Mean Sq F value Pr(>F)
## factor(recipe)  2 0.001151 0.0005756   2.587  0.103
## Residuals      18 0.004005 0.0002225
```

---

Plotting...

```r
cv_res_imbalance %>%
  ggplot(aes(recipe, auc, group = id, color = id)) +
    geom_line() + geom_line(data = cv_res_imbalance_meds, color = "black", alpha = 0.2, linewidth = 2) +
  geom_point() + geom_point(data = cv_res_imbalance_meds, color = "black", alpha = 0.2, size = 5, pch = 17) +
    guides(color = "none") +
    theme_light()
```

---

```r
print(cv_res_imbalance %>%
        group_by(recipe) %>%
        summarise(mean_auc = mean(auc), .groups="drop"))
```

```
## # A tibble: 3 × 2
##   recipe        mean_auc
##   <chr>            <dbl>
## 1 downsample_lr    0.690
## 2 smote_lr         0.702
## 3 upsample_lr      0.704
```

---

# Feat. Engineering and Selection

---

### If anything, remember:

1. Don't be lazy.
2. Be creative.
3. Use resampling!

Let's divide to:
1. Categorical Features
2. Continuous Features
3. Interactions

---

# Categorical Variables

---

### Making dummy variables

.font80percent[(In the case your model can accept categorical levels, skip to the next section - Dealing with categorical variables with many levels)]

- One hot encoding: `$k$` dummy vars for `$k$` levels
- Non one hot encoding: `$k - 1$` dummy vars for `$k$` levels
- Different Contrasts schemes .font80percent[(e.g. for ordered categorical variables ("low", "medium", "high"))]
- Feature Hashing

---

```r
set.seed(42)
df <- tibble(
  x1 = factor(c(rep("a", 4), rep("b", 4))),
  y = runif(8)
)

one_hot <- recipe(y ~ x1, data = df) %>%
* step_dummy(all_nominal(), one_hot = TRUE) %>%
  prep(df) %>%
  bake(new_data = NULL)

non_one_hot <- recipe(y ~ x1, data = df) %>%
* step_dummy(all_nominal(), one_hot = FALSE) %>%
  prep(df) %>%
  bake(new_data = NULL)
```

---

```r
head(one_hot)
```

```
## # A tibble: 6 × 3
##       y  x1_a  x1_b
##   <dbl> <dbl> <dbl>
## 1 0.915     1     0
## 2 0.937     1     0
## 3 0.286     1     0
## 4 0.830     1     0
## 5 0.642     0     1
## 6 0.519     0     1
```

```r
head(non_one_hot)
```

```
## # A tibble: 6 × 2
##       y  x1_b
##   <dbl> <dbl>
## 1 0.915     0
## 2 0.937     0
## 3 0.286     0
## 4 0.830     0
## 5 0.642     1
## 6 0.519     1
```

---

When dealing with linear regression, you should be very careful what you input into the `lm` engine of your choice and how you interpret results!

```r
summary(lm(y ~ ., data = non_one_hot))
```

```
## 
## Call:
## lm(formula = y ~ ., data = non_one_hot)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45598 -0.08504  0.11103  0.17826  0.22856 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.7421     0.1434   5.177  0.00206 **
## x1_b         -0.2341     0.2027  -1.155  0.29212   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2867 on 6 degrees of freedom
## Multiple R-squared:  0.1818,	Adjusted R-squared:  0.04546 
## F-statistic: 1.333 on 1 and 6 DF,  p-value: 0.2921
```

```r
print(mean(df$y) + (mean(df$y[df$x1 == "a"]) - mean(df$y)))
```

```
## [1] 0.7421172
```

```r
print(mean(df$y[df$x1 == "a"]))
```

```
## [1] 0.7421172
```

```r
print(mean(df$y[df$x1 == "b"]) - mean(df$y[df$x1 == "a"]))
```

```
## [1] -0.2340931
```

---

This is like:

```r
summary(lm(y ~ ., data = df))
```

```
## 
## Call:
## lm(formula = y ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45598 -0.08504  0.11103  0.17826  0.22856 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.7421     0.1434   5.177  0.00206 **
## x1b          -0.2341     0.2027  -1.155  0.29212   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2867 on 6 degrees of freedom
## Multiple R-squared:  0.1818,	Adjusted R-squared:  0.04546 
## F-statistic: 1.333 on 1 and 6 DF,  p-value: 0.2921
```

---

But:

```r
summary(lm(y ~ ., data = one_hot))
```

```
## 
## Call:
## lm(formula = y ~ ., data = one_hot)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45598 -0.08504  0.11103  0.17826  0.22856 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.5080     0.1434   3.544   0.0122 *
## x1_a          0.2341     0.2027   1.155   0.2921  
## x1_b              NA         NA      NA       NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2867 on 6 degrees of freedom
## Multiple R-squared:  0.1818,	Adjusted R-squared:  0.04546 
## F-statistic: 1.333 on 1 and 6 DF,  p-value: 0.2921
```

---

```r
summary(lm(y ~ . - 1, data = one_hot))
```

```
## 
## Call:
## lm(formula = y ~ . - 1, data = one_hot)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45598 -0.08504  0.11103  0.17826  0.22856 
## 
## Coefficients:
##      Estimate Std. Error t value Pr(>|t|)   
## x1_a   0.7421     0.1434   5.177  0.00206 **
## x1_b   0.5080     0.1434   3.544  0.01216 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2867 on 6 degrees of freedom
## Multiple R-squared:  0.8677,	Adjusted R-squared:  0.8236 
## F-statistic: 19.68 on 2 and 6 DF,  p-value: 0.002314
```

---

This is like:

```r
summary(lm(y ~ . - 1, data = df))
```

```
## 
## Call:
## lm(formula = y ~ . - 1, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45598 -0.08504  0.11103  0.17826  0.22856 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)   
## x1a   0.7421     0.1434   5.177  0.00206 **
## x1b   0.5080     0.1434   3.544  0.01216 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2867 on 6 degrees of freedom
## Multiple R-squared:  0.8677,	Adjusted R-squared:  0.8236 
## F-statistic: 19.68 on 2 and 6 DF,  p-value: 0.002314
```

---

Plot thickens (check your `options("contrasts")`!):

```r
summary(lm(y ~ ., data = df, contrasts = list(x1 = "contr.sum")))
```

```
## 
## Call:
## lm(formula = y ~ ., data = df, contrasts = list(x1 = "contr.sum"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45598 -0.08504  0.11103  0.17826  0.22856 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.6251     0.1014   6.167 0.000835 ***
## x11           0.1170     0.1014   1.155 0.292118    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2867 on 6 degrees of freedom
## Multiple R-squared:  0.1818,	Adjusted R-squared:  0.04546 
## F-statistic: 1.333 on 1 and 6 DF,  p-value: 0.2921
```

---

Two variables...

```r
set.seed(42)
df <- tibble(
  x1 = factor(c(rep("a", 6), rep("b", 6))),
  x2 = factor(rep(c(rep("c", 2), rep("d", 2), rep("e", 2)), 2)),
  y = runif(12)
)

one_hot <- recipe(y ~ x1 + x2, data = df) %>%
  step_dummy(all_nominal(), one_hot = TRUE) %>%
  prep(df) %>%
  bake(new_data = NULL)

non_one_hot <- recipe(y ~ x1 + x2, data = df) %>%
  step_dummy(all_nominal(), one_hot = FALSE) %>%
  prep(df) %>%
  bake(new_data = NULL)
```

---

```r
head(one_hot)
```

```
## # A tibble: 6 × 6
##       y  x1_a  x1_b  x2_c  x2_d  x2_e
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.915     1     0     1     0     0
## 2 0.937     1     0     1     0     0
## 3 0.286     1     0     0     1     0
## 4 0.830     1     0     0     1     0
## 5 0.642     1     0     0     0     1
## 6 0.519     1     0     0     0     1
```

```r
head(non_one_hot)
```

```
## # A tibble: 6 × 4
##       y  x1_b  x2_d  x2_e
##   <dbl> <dbl> <dbl> <dbl>
## 1 0.915     0     0     0
## 2 0.937     0     0     0
## 3 0.286     0     1     0
## 4 0.830     0     1     0
## 5 0.642     0     0     1
## 6 0.519     0     0     1
```

---

```r
summary(lm(y ~ ., data = non_one_hot))
```

```
## 
## Call:
## lm(formula = y ~ ., data = non_one_hot)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48619 -0.08138  0.10650  0.15667  0.19636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.74071    0.15610   4.745  0.00145 **
## x1_b        -0.11986    0.15610  -0.768  0.46465   
## x2_d        -0.06112    0.19118  -0.320  0.75738   
## x2_e        -0.09636    0.19118  -0.504  0.62783   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2704 on 8 degrees of freedom
## Multiple R-squared:  0.09602,	Adjusted R-squared:  -0.243 
## F-statistic: 0.2832 on 3 and 8 DF,  p-value: 0.8362
```

---

Same as:

```r
summary(lm(y ~ ., data = df))
```

```
## 
## Call:
## lm(formula = y ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48619 -0.08138  0.10650  0.15667  0.19636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.74071    0.15610   4.745  0.00145 **
## x1b         -0.11986    0.15610  -0.768  0.46465   
## x2d         -0.06112    0.19118  -0.320  0.75738   
## x2e         -0.09636    0.19118  -0.504  0.62783   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2704 on 8 degrees of freedom
## Multiple R-squared:  0.09602,	Adjusted R-squared:  -0.243 
## F-statistic: 0.2832 on 3 and 8 DF,  p-value: 0.8362
```

---

```r
print(mean(df$y) +
        (mean(df$y[df$x1 == "a"]) - mean(df$y)) +
        (mean(df$y[df$x2 == "c"]) - mean(df$y)))
```

```
## [1] 0.7407128
```

```r
print(mean(df$y[df$x1 == "a"]) + mean(df$y[df$x2 == "c"]) - mean(df$y))
```

```
## [1] 0.7407128
```

```r
print(mean(df$y[df$x1 == "b"]) - mean(df$y[df$x1 == "a"]))
```

```
## [1] -0.1198573
```

```r
print(mean(df$y[df$x2 == "d"]) - mean(df$y[df$x2 == "c"]))
```

```
## [1] -0.06112303
```

```r
print(mean(df$y[df$x2 == "e"]) - mean(df$y[df$x2 == "c"]))
```

```
## [1] -0.09636022
```

---

```r
summary(lm(y ~ ., data = one_hot))
```

```
## 
## Call:
## lm(formula = y ~ ., data = one_hot)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48619 -0.08138  0.10650  0.15667  0.19636 
## 
## Coefficients: (2 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.52450    0.15610   3.360  0.00993 **
## x1_a         0.11986    0.15610   0.768  0.46465   
## x1_b              NA         NA      NA       NA   
## x2_c         0.09636    0.19118   0.504  0.62783   
## x2_d         0.03524    0.19118   0.184  0.85836   
## x2_e              NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2704 on 8 degrees of freedom
## Multiple R-squared:  0.09602,	Adjusted R-squared:  -0.243 
## F-statistic: 0.2832 on 3 and 8 DF,  p-value: 0.8362
```

---

```r
print(mean(df$y[df$x1 == "b"]) + mean(df$y[df$x2 == "e"]) - mean(df$y))
```

```
## [1] 0.5244952
```

```r
print(mean(df$y[df$x1 == "a"]) - mean(df$y[df$x1 == "b"]))
```

```
## [1] 0.1198573
```

```r
print(mean(df$y[df$x2 == "c"]) - mean(df$y[df$x2 == "e"]))
```

```
## [1] 0.09636022
```

```r
print(mean(df$y[df$x2 == "d"]) - mean(df$y[df$x2 == "e"]))
```

```
## [1] 0.03523718
```

---

```r
summary(lm(y ~ . - 1, data = one_hot))
```

```
## 
## Call:
## lm(formula = y ~ . - 1, data = one_hot)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48619 -0.08138  0.10650  0.15667  0.19636 
## 
## Coefficients: (1 not defined because of singularities)
##      Estimate Std. Error t value Pr(>|t|)   
## x1_a  0.64435    0.15610   4.128  0.00331 **
## x1_b  0.52450    0.15610   3.360  0.00993 **
## x2_c  0.09636    0.19118   0.504  0.62783   
## x2_d  0.03524    0.19118   0.184  0.85836   
## x2_e       NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2704 on 8 degrees of freedom
## Multiple R-squared:  0.8914,	Adjusted R-squared:  0.8371 
## F-statistic: 16.41 on 4 and 8 DF,  p-value: 0.0006356
```

---

```r
print(mean(df$y[df$x1 == "a"]) + mean(df$y[df$x2 == "e"]) - mean(df$y))
```

```
## [1] 0.6443525
```

```r
print(mean(df$y[df$x1 == "b"]) + mean(df$y[df$x2 == "e"]) - mean(df$y))
```

```
## [1] 0.5244952
```

```r
print(mean(df$y[df$x2 == "c"]) - mean(df$y[df$x2 == "e"]))
```

```
## [1] 0.09636022
```

```r
print(mean(df$y[df$x2 == "d"]) - mean(df$y[df$x2 == "e"]))
```

```
## [1] 0.03523718
```

---

### Categorical variables with many levels

```r
ok_tr_eng %>% count(location)
```

```
## # A tibble: 67 × 2
##    location                          n
##    <chr>                         <int>
##  1 alameda, california              49
##  2 albany, california               10
##  3 atherton, california              3
##  4 belmont, california              12
##  5 belvedere tiburon, california     2
##  6 benicia, california              13
##  7 berkeley, california            185
##  8 bolinas, california               1
##  9 brisbane, california              4
## 10 burlingame, california           13
## # … with 57 more rows
```

---

#### Many dangers lurk
- OVERFITTING!
- Unseen levels
- Zero variance variables
- Size of `$p$`

#### We'll talk about

- The "Other" and "New" Strategy
- Clustering
- Feature Hashing
- Supervised Numerical Encoding
- Dimensionality Reduction (Embeddings)
- Text Features
- E.g. Simchoni and Rosset ([2021](https://papers.nips.cc/paper/2021/hash/d35b05a832e2bb91f110d54e34e2da79-Abstract.html), [2023](https://jmlr.org/papers/v24/22-0501.html))

---

#### The "Other" and "New" Strategy

- "Other": in many scenarios in which a categorical feature has many levels, some of these are so scarce it is doubtful if they have any predictive power - consider grouping them into one "Other" level
- "New": therefore there might be many levels not seen during training, only during testing - consider adding a (zero variance) "New" level

---

What % threshold should you use for "Other"?
- `step_other()` default is 5% (i.e. any level with less than 5% occurrence is grouped into one super-level "Other")
- or tune it!

```r
recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>%
* step_other(all_nominal_predictors(), threshold = 0.05) %>%
* step_novel(all_nominal_predictors()) %>%
  prep(ok_tr_eng) %>%
  bake(new_data = NULL) %>%
  count(location)
```

```
## # A tibble: 4 × 2
##   location                      n
##   <fct>                     <int>
## 1 berkeley, california        185
## 2 oakland, california         343
## 3 san francisco, california  1615
## 4 other                       857
```

---

You can also insert some creativity (or plain common sense) in your "Other" strategy:

```r
education2 = case_when(
           education == "graduated from high school" ~ "high_school",
           education == "graduated from two-year college" ~ "college",
           education == "graduated from college/university" ~ "degree1",
           education == "graduated from masters program" ~ "degree2",
           education == "graduated from ph.d program" ~ "degree3",
           education == "working on two-year college" ~ "student0",
           education == "working on college/university" ~ "student1",
           education == "working on masters program" ~ "student2",
           education == "working on ph.d program" ~ "student3",
           is.na(education) ~ NA_character_,
           TRUE ~ "other"
)
```

---

#### Clustering

- Manual clustering, using expert knowledge and common sense
- Unsupervised clustering, where it makes sense (e.g. Topic Modeling with [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) on text data)

The `sign` feature is also a great example:

```r
ok_tr_eng %>% count(sign)
```

```
## # A tibble: 47 × 2
##    sign                                           n
##    <chr>                                      <int>
##  1 aquarius                                      58
##  2 aquarius and it matters a lot                  6
##  3 aquarius and it&rsquo;s fun to think about    70
##  4 aquarius but it doesn&rsquo;t matter          74
##  5 aries                                         56
##  6 aries and it matters a lot                     1
##  7 aries and it&rsquo;s fun to think about       77
##  8 aries but it doesn&rsquo;t matter             78
##  9 cancer                                        56
## 10 cancer and it&rsquo;s fun to think about      66
## # … with 37 more rows
```

---

```r
signs <- c("leo", "libra", "cancer", "virgo", "scorpio", "gemini",
           "taurus", "aries", "pisces", "aquarius", "sagittarius", "capricorn")

recipe(pets ~ ., data = ok_tr_eng %>% select(pets, sign)) %>%
* step_mutate(sign = str_extract(sign, str_c(signs, collapse = "|"))) %>%
* step_unknown(sign) %>%
  prep(ok_tr_eng) %>%
  bake(new_data = NULL) %>%
  count(sign)
```

```
## # A tibble: 13 × 2
##    sign            n
##    <fct>       <int>
##  1 aries         212
##  2 virgo         232
##  3 leo           223
##  4 taurus        197
##  5 cancer        196
##  6 scorpio       196
##  7 libra         215
##  8 aquarius      208
##  9 gemini        239
## 10 pisces        221
## 11 capricorn     194
## 12 sagittarius   222
## 13 unknown       445
```

---

#### Feature Hashing

Suppose I told you "God, just group the locations by the number of times the letter 'l' appears in them!"

```r
recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>%
* step_mutate(location = factor(str_count(location, "l"))) %>%
  prep(ok_tr_eng) %>%
  bake(new_data = NULL) %>%
  count(location)
```

```
## # A tibble: 5 × 2
##   location     n
##   <fct>    <int>
## 1 1         1935
## 2 2          875
## 3 3          156
## 4 4           17
## 5 5           17
```

---

```r
recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>%
* step_mutate(locationL = factor(str_count(location, "l"))) %>%
* step_dummy(locationL) %>%
  prep(ok_tr_eng) %>%
  bake(new_data = NULL) %>%
  select(starts_with("location"))
```

```
## # A tibble: 3,000 × 5
##    location                  locationL_X2 locationL_X3 locationL_X4 locationL_X5
##    <fct>                            <dbl>        <dbl>        <dbl>        <dbl>
##  1 palo alto, california                0            1            0            0
##  2 san leandro, california              1            0            0            0
##  3 hayward, california                  0            0            0            0
##  4 vallejo, california                  0            1            0            0
##  5 san francisco, california            0            0            0            0
##  6 oakland, california                  1            0            0            0
##  7 berkeley, california                 1            0            0            0
##  8 alameda, california                  1            0            0            0
##  9 berkeley, california                 1            0            0            0
## 10 san francisco, california            0            0            0            0
## # … with 2,990 more rows
```

---

It's not a good idea because
- Differenet `location`s map to the same bin ("collision"), masking possible effect
- The distribution isn't even

But there are advantages too!
- I *did* reduce the dimensionality from 70 to 6 dummy variables
- Fast
- Generalizes well: any new `location` can be mapped this way
- Scales: I can do this for millions of `location`s, no dictionary needed to store in memory

---

- Feature Hashing (or "hashing trick") does something similar more intelligently
- It lets you choose the hashing algo and no. of features you'd like (the more - the less chance of collision)

```r
library(textrecipes)

recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>%
* step_tokenize(location) %>%
* step_texthash(location, signed = FALSE, num_terms = 2^4) %>%
  prep(ok_tr_eng) %>%
  bake(new_data = NULL)
```

```
## # A tibble: 3,000 × 17
##    texthash_location_01 texthash_location_02 texthash_location_03
##                   <int>                <int>                <int>
##  1                    0                    0                    1
##  2                    0                    0                    1
##  3                    0                    0                    1
##  4                    0                    0                    1
##  5                    0                    0                    1
##  6                    0                    1                    1
##  7                    0                    0                    1
##  8                    0                    0                    1
##  9                    0                    0                    1
## 10                    0                    0                    1
## # … with 2,990 more rows, and 14 more variables: texthash_location_04 <int>,
## #   texthash_location_05 <int>, texthash_location_06 <int>,
## #   texthash_location_07 <int>, texthash_location_08 <int>,
## #   texthash_location_09 <int>, texthash_location_10 <int>,
## #   texthash_location_11 <int>, texthash_location_12 <int>,
## #   texthash_location_13 <int>, texthash_location_14 <int>,
## #   texthash_location_15 <int>, texthash_location_16 <int>, pets <fct>
```

---

- Particularly useful for fast and furious feature engineering of text (see later)
- Beware of columns with zero variance, consider binning them to a "New" category
- Powerful methods like `xgboost` can handle 2^20+ no. of terms easily

What is still a major disadvantage? Hint: 🎁.
]

---

.font80percent[
[Bill Turque, 2012, Washington Post](https://www.washingtonpost.com/local/education/creative--motivating-and-fired/2012/02/04/gIQAwzZpvR_story.html)
]

---

#### Supervised Numerical Encoding

Here's another crazy idea for you:

- For regression, replace a category with `$y$`'s mean in this category
- For classification, replace a category with the log odds of `$y$` in this category

---

Thus, changing a categorical variable with many levels into a single numerical variable (or a few for a few classes).

```r
library(embed)

recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>%
  step_mutate(location2 = location) %>%
* step_lencode_glm(location, outcome = vars(pets)) %>%
  prep(ok_tr_eng) %>%
  bake(new_data = NULL)
```

```
## # A tibble: 3,000 × 3
##    location pets  location2                
##       <dbl> <fct> <fct>                    
##  1    -1.73 dogs  palo alto, california    
##  2    -1.58 dogs  san leandro, california  
##  3    -1.86 cats  hayward, california      
##  4    -2.48 dogs  vallejo, california      
##  5    -1.74 dogs  san francisco, california
##  6    -1.47 dogs  oakland, california      
##  7    -1.08 dogs  berkeley, california     
##  8    -1.49 dogs  alameda, california      
##  9    -1.08 cats  berkeley, california     
## 10    -1.74 cats  san francisco, california
## # … with 2,990 more rows
```

---

Make sure you get what just happened

```r
mv <- ok_tr_eng$location == "palo alto, california"
p <- mean(ok_tr_eng$pets[mv] == "cats")
log(p / (1 - p))
```

```
## [1] -1.727221
```

---

But...
- What if a specific location has just 1-2 samples in training set?
- "self-fulfilling prophecy"

The first issue can be somewhat alleviated using a Bayesian approach, *shrinking* somewhat coefficients towards their prior when n is small:

```r
recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>%
  step_mutate(location2 = location) %>%
* step_lencode_bayes(location, outcome = vars(pets)) %>%
  prep(ok_tr_eng %>% head(100)) %>%
  bake(new_data = NULL)
```

```
## # A tibble: 100 × 3
##    location pets  location2                
##       <dbl> <fct> <fct>                    
##  1    -1.41 dogs  palo alto, california    
##  2    -1.39 dogs  san leandro, california  
##  3    -1.36 cats  hayward, california      
##  4    -1.43 dogs  vallejo, california      
##  5    -1.49 dogs  san francisco, california
##  6    -1.42 dogs  oakland, california      
##  7    -1.44 dogs  berkeley, california     
##  8    -1.41 dogs  alameda, california      
##  9    -1.44 cats  berkeley, california     
## 10    -1.49 cats  san francisco, california
## # … with 90 more rows
```

---

#### Dimensionality Reduction (Embeddings)

See our `word2vec` detour when we learned about [Networks Community Detection](https://dsapps-2023.github.io/Class_Slides/u2_d04-community/u2_d04-community.html).

If you have `tensorflow` and `keras` installed, you could train your own embeddings to turn each `location` into a `num_terms` long vector of features:

```r
recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>%
  step_mutate(location2 = location) %>%
* step_embed(location, outcome = vars(pets), num_terms = 10,
*            options = embed_control(epochs = 10)) %>%
  prep(ok_tr_eng) %>%
  bake(new_data = NULL)
```

---

Alternatively you could use pre-trained embeddings for text variables such as `essay`s using `step_word_embedding()` from the `textrecipes` package (where `my_embeddings` are your pre-trained embeddings:

```r
recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>%
  step_tokenize(location) %>%
* step_word_embeddings(location, embeddings = my_embeddings) %>%
  prep(ok_tr_eng) %>%
  bake(new_data = NULL)
```

---

Or, perform PCA on the dummy variables matrix, keeping only a few of the first PCs as features:

```r
recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>%
  step_mutate(old_location = location) %>%
  step_dummy(location) %>%
  step_normalize(all_numeric()) %>%
* step_pca(starts_with("location"), num_comp = 10) %>%
  prep(ok_tr_eng) %>%
  bake(new_data = NULL)
```

```
## # A tibble: 3,000 × 12
##    pets  old_location   PC01    PC02    PC03     PC04     PC05     PC06     PC07
##    <fct> <fct>         <dbl>   <dbl>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 dogs  palo alto, … -1.28  -0.823   1.33   -1.85    -6.80     1.23    -0.524  
##  2 dogs  san leandro… -1.28  -0.803   1.25   -1.16     2.65     6.85    -1.17   
##  3 cats  hayward, ca… -1.27  -0.773   1.14   -0.720    0.802   -1.04     1.12   
##  4 dogs  vallejo, ca… -1.26  -0.741   1.04   -0.508    0.455   -0.460    0.358  
##  5 dogs  san francis…  1.15   0.0647 -0.0406  0.00565 -0.00343  0.00269 -0.00171
##  6 dogs  oakland, ca… -1.61   2.51   -0.588   0.0601  -0.0353   0.0272  -0.0171 
##  7 dogs  berkeley, c… -1.40  -1.92   -3.33    0.146   -0.0817   0.0615  -0.0382 
##  8 dogs  alameda, ca… -0.445 -0.0625  0.0430 -0.00633  0.00388 -0.00304  0.00194
##  9 cats  berkeley, c… -1.40  -1.92   -3.33    0.146   -0.0817   0.0615  -0.0382 
## 10 cats  san francis…  1.15   0.0647 -0.0406  0.00565 -0.00343  0.00269 -0.00171
## # … with 2,990 more rows, and 3 more variables: PC08 <dbl>, PC09 <dbl>,
## #   PC10 <dbl>
```

---

#### Text Features

Extracting textual features could be the subject of an entire class or even a course.

Just to demonstrate how worthwhile being creative can get here let us just add two textual features here, `n_cat` and `n_dog` in all `essay`s:

```r
ok_tr_eng2 <- handle_essays(ok_tr_eng)

str_count_na <- function(string, pattern = "") {
  n <- str_count(string, pattern)
  ifelse(is.na(n), 0, n)
}
```

---

```r
rec_eng_notext <- recipe(pets ~ ., data = ok_tr_eng2) %>%
* step_rm(starts_with("essay")) %>%
  step_log(income, starts_with("len"), offset = 1) %>%
  step_impute_mean(income) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), other = "other1") %>%
  step_novel(all_nominal_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE) %>%
  step_upsample(pets, over_ratio = 1, seed = 42)
```

---

```r
rec_eng_catdog <- recipe(pets ~ ., data = ok_tr_eng2) %>%
* step_mutate(essays = pmap_chr(
*   list(essay0, essay1, essay2, essay3, essay4, essay5,
*        essay6, essay7, essay8, essay9), str_c),
*             n_cat = str_count_na(essays, "cat"),
*             n_dog = str_count_na(essays, "dog")) %>%
  step_rm(starts_with("essay")) %>%
  step_log(income, starts_with("len"), offset = 1) %>%
  step_impute_mean(income) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), other = "other1") %>%
  step_novel(all_nominal_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE) %>%
  step_upsample(pets, over_ratio = 1, seed = 42)
```

---

---

#### Text Features: some ideas for you

- not just words: do cat people tend to use more the letter "z"? no. of commas, no. of dot-dot-dots, no. of links, no. of hashtags, no. of extra spaces, no. of digits
- to stem or not to stem?
  - "plays", "playing", ["playa"](https://www.urbandictionary.com/define.php?term=Playa)
- sentiment features
  - not just no. of "hate" vs. "love"
  - also no. of "you" and "I"
- stop words or not to stop words?
- TF-IDF or not to TF-IDF
- Feature Hashing
- n-grams: "I love my cats"
- topics

---

#### Using `textrecipes` + `textfeatures`

- The `textrecipes` package lets you calculate a few features with a single `step_textfeature()`
- The `extract_functions` parameter accepts a named list of functions you wish to define on your text data
- By default it will accept `count_functions` list from the [`textfeatures`](https://textfeatures.mikewk.com/reference/textfeatures.html) package by Mike Kearney, which gets you basic counts like `n_words`, `n_digits` etc.

---

But `textfeatures` also has via the `textfeatures()` function:
  - basic sentiment features like `sent_bing`
  - first word2vec 2-200 dimensions!

```r
print(textfeatures:::sentiment_bing("I hate cats!"))
```

```
## [1] -1
```

```r
print(textfeatures:::sentiment_bing("I love cats! Hate dogs though."))
```

```
## [1] 0
```

```r
print(textfeatures::word_dims(
  "Why would Kim Jong-un insult me by calling me \"old,\""))
```

```
## INFO  [09:33:49.248] early stopping at 20 iteration
## INFO  [09:33:49.290] early stopping at 20 iteration
##     w1   w2   w3  w4 w5   w6 w7 w8 w9 w10
## 1 0.05 0.05 0.05 0.4  0 0.35  0  0  0 0.1
```

---

```r
ok_tr_eng2 <- ok_tr_eng2 %>%
  mutate(essays = pmap_chr(
    list(essay0, essay1, essay2, essay3, essay4, essay5,
         essay6, essay7, essay8, essay9), str_c))

my_text_funs <- c(textfeatures::count_functions,
                      "sent_bing" = textfeatures:::sentiment_bing)
```

---

```r
recipe(pets ~ ., data = ok_tr_eng2 %>% select(pets, essays)) %>%
  update_role(essays, new_role = "discarded") %>%
* step_textfeature(essays, prefix = "t",
*                  extract_functions = my_text_funs) %>%
* step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>%
  prep(ok_tr_eng2 %>% head(10)) %>%
  bake(new_data = NULL) %>%
  select(t_essays_n_uq_words, t_essays_n_hashtags, t_essays_sent_bing)
```

```
## # A tibble: 10 × 3
##    t_essays_n_uq_words t_essays_n_hashtags t_essays_sent_bing
##                  <dbl>               <dbl>              <dbl>
##  1                   0                   0                  0
##  2                 283                   0                  9
##  3                 345                   0                 16
##  4                   0                   0                  0
##  5                   0                   0                  0
##  6                   0                   0                  0
##  7                   0                   0                  0
##  8                 267                   0                 22
##  9                   0                   0                  0
## 10                 193                   0                 -2
```

---

#### Text Features: but the words are most important

We'll take [Kuhn & Johnson (2019)](http://www.feat.engineering/index.html) approach, and:

1. Filter 1-gram tokens (words) with at least X occurrences
2. Calculate the odds ratio `$\frac{P(w|cats)}{1- P(w|cats)} / \frac{P(w|dogs)}{1- P(w|dogs)}$` where `$P(w|cats)$` is the % of cats people who used the word `$w$` at least once in all essays
3. Calculate a p-value with a FDR correction
4. Choose words with ~~OR of at least |2| and~~ FDR value of maximum `$10^{-4}$`

Then use resampling to see if these have any effect on AUC.

---

```r
library(tidytext)

tidy_essays <- ok_tr_eng2 %>%
  mutate(id = 1:n()) %>%
  select(id, pets, essays) %>%
  unnest_tokens(word, essays) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

tidy_essays %>% head()
```

```
## # A tibble: 6 × 3
##      id pets  word      
##   <int> <chr> <chr>     
## 1     2 dogs  bay       
## 2     2 dogs  raised    
## 3     2 dogs  university
## 4     2 dogs  oregon    
## 5     2 dogs  schooled  
## 6     2 dogs  hip
```

---

```r
min_n_words <- tidy_essays %>%
  count(id, word) %>%
  count(word) %>%
  filter(n > 50) %>%
  pull(word)

print(c("cat", "dog", "cats", "dogs") %in% min_n_words)
```

```
## [1] TRUE TRUE TRUE TRUE
```

```r
n_cats_people <- sum(ok_tr_eng2$pets == "cats")
n_dogs_people <- sum(ok_tr_eng2$pets == "dogs")

fisher_test <- function(cats, dogs) {
  cont_tab <- matrix(c(cats, n_cats_people - cats,
                       dogs, n_dogs_people - dogs),
                     ncol = 2, nrow = 2)
  ft <- fisher.test(cont_tab)
  list(OR = ft$estimate, p_val = ft$p.value)
}
```

---

```r
or_res <- tidy_essays %>%
  filter(word %in% min_n_words) %>%
  count(id, word, pets) %>%
  count(word, pets) %>%
  pivot_wider(id_cols = word, names_from = pets,
              values_from = n, values_fill = list(cats = 0, dogs = 0)) %>%
  mutate(fisher_obj = map2(cats, dogs, fisher_test))

or_res <- or_res %>%
  bind_cols(
    bind_rows(or_res$fisher_obj)
  ) %>%
* mutate(fdr = p.adjust(p_val, method = "fdr"))

or_res
```

```
## # A tibble: 818 × 7
##    word        cats  dogs fisher_obj          OR  p_val   fdr
##    <chr>      <int> <int> <list>           <dbl>  <dbl> <dbl>
##  1 ability       17    47 <named list [2]> 1.86  0.0392 0.163
##  2 absolutely    11    75 <named list [2]> 0.736 0.460  0.683
##  3 accent        12    39 <named list [2]> 1.57  0.182  0.398
##  4 act           12    44 <named list [2]> 1.39  0.361  0.603
##  5 action        19   105 <named list [2]> 0.912 0.805  0.928
##  6 active        21   126 <named list [2]> 0.837 0.568  0.784
##  7 activities    15    87 <named list [2]> 0.869 0.686  0.853
##  8 add           15    55 <named list [2]> 1.39  0.255  0.486
##  9 adele          8    47 <named list [2]> 0.859 0.855  0.958
## 10 admit         26   105 <named list [2]> 1.27  0.280  0.513
## # … with 808 more rows
```

---

```r
cd_words <- or_res %>%
  arrange(fdr) %>%
  filter(fdr < 10e-4)

cd_words
```

```
## # A tibble: 10 × 7
##    word     cats  dogs fisher_obj          OR    p_val      fdr
##    <chr>   <int> <int> <list>           <dbl>    <dbl>    <dbl>
##  1 cat        44    33 <named list [2]> 7.30  2.11e-16 1.73e-13
##  2 dog         7   248 <named list [2]> 0.131 7.65e-13 3.13e-10
##  3 cats       34    34 <named list [2]> 5.36  1.01e-10 2.74e- 8
##  4 ilink      58   100 <named list [2]> 3.19  3.12e-10 6.38e- 8
##  5 class      69   137 <named list [2]> 2.80  5.85e-10 9.58e- 8
##  6 href       61   114 <named list [2]> 2.95  8.41e-10 1.15e- 7
##  7 science    47    84 <named list [2]> 3.02  4.14e- 8 4.84e- 6
##  8 eternal    26    40 <named list [2]> 3.41  5.85e- 6 5.98e- 4
##  9 fantasy    24    35 <named list [2]> 3.59  6.99e- 6 6.35e- 4
## 10 drawing    22    30 <named list [2]> 3.83  7.95e- 6 6.50e- 4
```

---

So these will be our four options for handling text:

1. Not
2. Text features (zero for `NA`)
3. Text features + our flagged words counts (zero for `NA`)
4. Feature Hashing

---

```r
count_words <- function(essays) {
  counts <- map(cd_words$word, ~str_count_na(essays, .x))
* names(counts) = janitor::make_clean_names(str_c("w_", cd_words$word))
  counts
}

ok_tr_eng3 <- ok_tr_eng2 %>%
  bind_cols(
*   map_dfr(ok_tr_eng2$essays, count_words)
  ) %>%
  select(-starts_with("essay"), -starts_with("len_essay")) %>%
  bind_cols(ok_tr_eng2 %>% select(essays)) %>%
  mutate(across(where(is.character), as.factor))
```

---

```r
rec_eng_notext <- recipe(pets ~ ., data = ok_tr_eng3) %>%
  step_rm(starts_with("w_")) %>%
  update_role(essays, new_role = "discarded") %>%
  step_log(income, offset = 1) %>%
  step_impute_mean(income) %>%
  step_other(all_nominal_predictors(), -has_role("discarded"), other = "other1") %>%
  step_novel(all_nominal_predictors(), -has_role("discarded")) %>%
  step_impute_mode(all_nominal_predictors(), -has_role("discarded")) %>%
  step_dummy(all_nominal_predictors(), -has_role("discarded"), one_hot = FALSE) %>%
  step_upsample(pets, over_ratio = 1, seed = 42)

rec_eng_textfeat <- recipe(pets ~ ., data = ok_tr_eng3) %>%
  step_rm(starts_with("w_")) %>%
  update_role(essays, new_role = "discarded") %>%
  step_textfeature(essays, prefix = "t",
                   extract_functions = my_text_funs) %>%
  step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>%
  step_log(income, starts_with("t_"), -t_essays_sent_bing, offset = 1) %>%
  step_impute_mean(income) %>%
  step_zv(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), -has_role("discarded"), other = "other1") %>%
  step_novel(all_nominal_predictors(), -has_role("discarded")) %>%
  step_impute_mode(all_nominal_predictors(), -has_role("discarded")) %>%
  step_dummy(all_nominal_predictors(), -has_role("discarded"), one_hot = FALSE) %>%
  step_upsample(pets, over_ratio = 1, seed = 42)
```

---

```r
rec_eng_fulltext <- recipe(pets ~ ., data = ok_tr_eng3) %>%
  update_role(essays, new_role = "discarded") %>%
  step_textfeature(essays, prefix = "t",
                   extract_functions = my_text_funs) %>%
  step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>%
  step_log(income, starts_with("t_"), -t_essays_sent_bing, offset = 1) %>%
  step_impute_mean(income) %>%
  step_zv(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), -has_role("discarded"), other = "other1") %>%
  step_novel(all_nominal_predictors(), -has_role("discarded")) %>%
  step_impute_mode(all_nominal_predictors(), -has_role("discarded")) %>%
  step_dummy(all_nominal_predictors(), -has_role("discarded"), one_hot = FALSE) %>%
  step_upsample(pets, over_ratio = 1, seed = 42)

rec_eng_hashtext <- recipe(pets ~ ., data = ok_tr_eng3) %>%
  step_rm(starts_with("w_")) %>%
  update_role(essays, new_role = "discarded") %>%
  step_tokenize(essays) %>%
  step_tokenfilter(essays, min_times = 10) %>%
  step_texthash(essays, num_terms = 2^11) %>%
  step_log(income, offset = 1) %>%
  step_impute_mean(income) %>%
  step_zv(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), -has_role("discarded"), other = "other1") %>%
  step_novel(all_nominal_predictors(), -has_role("discarded")) %>%
  step_impute_mode(all_nominal_predictors(), -has_role("discarded")) %>%
  step_dummy(all_nominal_predictors(), -has_role("discarded"), one_hot = FALSE) %>%
  step_upsample(pets, over_ratio = 1, seed = 42)
```

---

Using resampling, `workflow_set()`:

```r
set.seed(42)
cv_splits_eng <- vfold_cv(ok_tr_eng3, v = 10, strata = pets)

rec_list <- list("notext" = rec_eng_notext,
                 "textfeat" = rec_eng_textfeat,
                 "fulltext" = rec_eng_fulltext,
                 "hashtext" = rec_eng_hashtext)

wset <- workflow_set(rec_list, mod_list) %>%
  workflow_map("fit_resamples",
               resamples = cv_splits_eng,
               metrics = metric_set(roc_auc))

cv_res_eng <- wset %>%
  unnest(result) %>%
  mutate(recipe = wflow_id, auc = map_dbl(.metrics, ~.x$.estimate)) %>%
  select(recipe, id, auc) %>%
  separate_wider_delim(recipe, "_", names = c("recipe", "mod"))
```

---

#### Haven't forgot about our other categorical variables

Let's use some common sense (see code!).

```r
cd_words <- cd_words %>%
  bind_rows(
    tibble(word =
             c("my cat", "my dog", "love cat", "love dog", "kitt", "pupp",
               "have cat", "have dog", "have a cat", "have a dog"),
           cats = NULL, dogs = NULL, fisher_obj = NULL, OR = 0,
           p_val = 0, fdr = 0))
```

```r
*ok_tr_eng4 <- full_features(ok_tr_eng)

colnames(ok_tr_eng4)
```

```
##  [1] "age"                  "body_type"            "diet"                
##  [4] "drinks"               "drugs"                "education"           
##  [7] "ethnicity"            "height"               "income"              
## [10] "job"                  "location"             "orientation"         
## [13] "pets"                 "religion"             "sex"                 
## [16] "smokes"               "status"               "len_essay0"          
## [19] "len_essay1"           "len_essay2"           "len_essay3"          
## [22] "len_essay4"           "len_essay5"           "len_essay6"          
## [25] "len_essay7"           "len_essay8"           "len_essay9"          
## [28] "essays"               "offspring_no_kids"    "offspring_one_kid"   
## [31] "offspring_many_kids"  "offspring_wants_kids" "religion_not2serious"
## [34] "religion_laughing"    "religion_somewhat"    "body_type_perfect"   
## [37] "sign_fun"             "sign_not_matter"      "sign_matters"        
## [40] "sign_na"              "speaks_spanish"       "speaks_french"       
## [43] "speaks_german"        "speaks_chinese"       "speaks_n_fluently"   
## [46] "speaks_n_okay"        "speaks_n_poorly"      "education_kind"      
## [49] "w_cat"                "w_dog"                "w_cats"              
## [52] "w_ilink"              "w_class"              "w_href"              
## [55] "w_science"            "w_eternal"            "w_fantasy"           
## [58] "w_drawing"            "w_my_cat"             "w_my_dog"            
## [61] "w_love_cat"           "w_love_dog"           "w_kitt"              
## [64] "w_pupp"               "w_have_cat"           "w_have_dog"          
## [67] "w_have_a_cat"         "w_have_a_dog"
```

---

```r
rec_eng_commonsense <- recipe(pets ~ ., data = ok_tr_eng4) %>%
  step_textfeature(essays, prefix = "t",
                   extract_functions = my_text_funs) %>%
  update_role(essays, new_role = "discarded") %>%
  step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>%
  step_log(income, starts_with("len_"), starts_with("t_"),
           -t_essays_sent_bing, offset = 1) %>%
  step_impute_mean(income) %>%
  step_zv(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), -has_role("discarded"), other = "all_else") %>%
  step_novel(all_nominal_predictors(), -has_role("discarded")) %>%
  step_impute_mode(all_nominal_predictors(), -has_role("discarded")) %>%
  step_dummy(all_nominal_predictors(), -has_role("discarded"), one_hot = FALSE) %>%
  step_upsample(pets, over_ratio = 1, seed = 42)
```

---

```r
set.seed(42)
cv_splits_eng2 <- vfold_cv(ok_tr_eng4, v = 10, strata = pets)

rec_list <- list("commonsense" = rec_eng_commonsense)

wset <- workflow_set(rec_list, mod_list) %>%
  workflow_map("fit_resamples",
               resamples = cv_splits_eng2,
               metrics = metric_set(roc_auc))

cv_res_commonsense <- wset %>%
  unnest(result) %>%
  mutate(recipe = wflow_id, auc = map_dbl(.metrics, ~.x$.estimate)) %>%
  select(recipe, id, auc) %>%
  separate_wider_delim(recipe, "_", names = c("recipe", "mod"))

cv_res_cs_meds <- cv_res_commonsense %>%
  group_by(recipe) %>%
  summarise(auc = mean(auc), id = "All")
```

---

```r
cv_res_eng %>%
  bind_rows(cv_res_commonsense) %>%
  mutate(recipe = fct_relevel(recipe,
                              c("notext", "hashtext", "textfeat", "fulltext", "commonsense"))) %>%
  ggplot(aes(recipe, auc, group = id, color = id)) +
  geom_line() + geom_line(data = cv_res_eng_meds %>% bind_rows(cv_res_cs_meds), color = "black", alpha = 0.2, linewidth = 2) +
  geom_point() + geom_point(data = cv_res_eng_meds %>% bind_rows(cv_res_cs_meds), color = "black", alpha = 0.2, size = 5, pch = 17) +
  guides(color = "none") + theme_light()
```

---

# Numerical Variables

---

### What could the problem be?

The problem almost always depends on the combination between models and variables:

- Varied scale (and kNN, SVM) --> `step_center()`, `step_scale()`, `step_normalize()`, `step_range()`
- Bounded range of outcome (e.g. percentage and, well, almost any model) --> `step_logit()`
- Skewed data and outliers (and linear regression, NN, kNN, SVM, but less for Trees)
- Non-linearity (and LR)
- Co-linearity (and LR, NN)

---

### Skewed Data: Common Transformations

#### Box-Cox

In its simplified form:

`$$x_i^{(\lambda)} = \begin{cases}
   \frac{x_i^{\lambda} - 1}{\lambda} & \mbox{if }\lambda \ne 0\\
   \ln{x_i} & \mbox{if }\lambda = 0
 \end{cases}$$`

---

Originally the Box-Cox transformation was developed to transform a dependent `$y$` variable by maximizing the log-likelihood of seeing `$y^{(\lambda)}$` in a `lm` against some `$X$`.

But `$\lambda$` could be estimated in an unsupervised way as well:

```r
MASS::boxcox(ok_tr_eng$age ~ 1)
```

---

```r
tibble(age = ok_tr_eng$age,
       age_bc = -1/age) %>%
  pivot_longer(starts_with("age"), names_to = "age") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30, fill = "red", alpha = 0.5) +
  facet_wrap(~ age, scales = "free") + theme_light()
```

---

#### Yeo-Johnson

Revised Box-Cox to allow for zero and negative values:

`$$x_i^{(\lambda)} = \begin{cases}
   \frac{(x_i + 1)^{\lambda} - 1}{\lambda} & \mbox{if }\lambda \ne 0,y \ge 0\\
   \ln{(x_i + 1)} & \mbox{if }\lambda = 0, y \ge 0 \\
   \frac{-[(-x_i + 1)^{2-\lambda} - 1]}{2 - \lambda} & \mbox{if }\lambda \ne 0,y < 0\\
   -\ln{(-x_i + 1)} & \mbox{if }\lambda = 0, y < 0 \\
 \end{cases}$$`

Both transformations have `step`s in `recipes`:

```r
recipe(pets ~ ., data = ok_tr_eng) %>%
* step_BoxCox(income)
```

```r
recipe(pets ~ ., data = ok_tr_eng) %>%
* step_YeoJohnson(income)
```

---

### Non-Linearity: Basis Expansions and Splines

Let's borrow a continuous `$y$` variable like `log(income)` to simplify:

```r
ggplot(ok_tr_eng, aes(age, log(income))) +
  geom_point() +
  theme_light()
```

Remember MARS?

Perhaps you'd like to *expand* age to two variables: `max(45 - age, 0)` and `max(age - 45, 0)`?

---

You could do this manually in `lm()` like so

```r
mod_lm <- lm(log(income) ~ I(pmax(0, 45 - age)) + I(pmax(0, age - 45)),
             data = ok_tr_eng)

summary(mod_lm)
```

```
## 
## Call:
## lm(formula = log(income) ~ I(pmax(0, 45 - age)) + I(pmax(0, age - 
##     45)), data = ok_tr_eng)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4298 -0.6799 -0.0352  0.3498  3.1630 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          11.361687   0.083615 135.880  < 2e-16 ***
## I(pmax(0, 45 - age)) -0.028369   0.005317  -5.335 1.36e-07 ***
## I(pmax(0, age - 45)) -0.007601   0.013507  -0.563    0.574    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8699 on 594 degrees of freedom
##   (2403 observations deleted due to missingness)
## Multiple R-squared:  0.0531,	Adjusted R-squared:  0.04991 
## F-statistic: 16.66 on 2 and 594 DF,  p-value: 9.168e-08
```

---

Or use `splines::bs()` with `degree = 1` and `knots = 45`.

```r
library(splines)
mod_lm_bs <- lm(log(income) ~ bs(age, degree = 1, knots = 45),
             data = ok_tr_eng)

summary(mod_lm_bs)
```

```
## 
## Call:
## lm(formula = log(income) ~ bs(age, degree = 1, knots = 45), data = ok_tr_eng)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4298 -0.6799 -0.0352  0.3498  3.1630 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      10.59574    0.07782 136.155  < 2e-16 ***
## bs(age, degree = 1, knots = 45)1  0.76595    0.14356   5.335 1.36e-07 ***
## bs(age, degree = 1, knots = 45)2  0.58353    0.28632   2.038    0.042 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8699 on 594 degrees of freedom
##   (2403 observations deleted due to missingness)
## Multiple R-squared:  0.0531,	Adjusted R-squared:  0.04991 
## F-statistic: 16.66 on 2 and 594 DF,  p-value: 9.168e-08
```

---

How will the variables look like?

```r
X <- model.matrix(mod_lm)
age_no_na <- ok_tr_eng$age[!is.na(ok_tr_eng$income)]
head(cbind(age_no_na, X), 4)
```

```
##    age_no_na (Intercept) I(pmax(0, 45 - age)) I(pmax(0, age - 45))
## 3         26           1                   19                    0
## 5         24           1                   21                    0
## 10        21           1                   24                    0
## 15        25           1                   20                    0
```

```r
X <- model.matrix(mod_lm_bs)
colnames(X)[2:3] <- c("age1", "age2")
head(cbind(age_no_na, X), 4)
```

```
##    age_no_na (Intercept)      age1 age2
## 3         26           1 0.2962963    0
## 5         24           1 0.2222222    0
## 10        21           1 0.1111111    0
## 15        25           1 0.2592593    0
```

`splines::bs()` re-parametrizes the variables but it is the same model!

---

How will the fit look like?

```r
ggplot(ok_tr_eng[!is.na(ok_tr_eng$income), ]) +
  geom_point(aes(age, log(income))) +
  geom_line(aes(age, mod_lm_bs$fit), color = "red", linewidth = 2, alpha = 0.4) +
  theme_light()
```

---

Or with a simple cubic spline with one knot in `bs()`

---

Manually with `step_relu()`

```r
recipe(income ~ age, ok_tr_eng) %>%
  step_naomit(income) %>%
* step_relu(age, shift = 45, reverse = TRUE) %>%
* step_relu(age, shift = 45) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  head()
```

```
## # A tibble: 6 × 4
##     age income left_relu_age right_relu_age
##   <dbl>  <dbl>         <dbl>          <dbl>
## 1    26  20000            19              0
## 2    24  60000            21              0
## 3    21  20000            24              0
## 4    25  20000            20              0
## 5    35  80000            10              0
## 6    69  80000             0             24
```

---

For splines and polynomial regression use `step_bs()`, `step_ns()` and `step_poly()`

```r
recipe(income ~ age, ok_tr_eng) %>%
  step_naomit(income) %>%
* step_bs(age, degree = 3) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  head()
```

```
## # A tibble: 6 × 4
##   income age_bs_1 age_bs_2 age_bs_3
##    <dbl>    <dbl>    <dbl>    <dbl>
## 1  20000    0.335  0.0622  0.00386 
## 2  60000    0.275  0.0366  0.00163 
## 3  20000    0.156  0.00977 0.000204
## 4  20000    0.306  0.0488  0.00259 
## 5  80000    0.444  0.222   0.0370  
## 6  80000    0      0       1
```

---

### Co-linearity: Consider reducing dimensionality

> A conventional thought process is that the more predictors that are included, the better chance the model will have to find the signal. But this is not the case; including irrelevant predictors can have detrimental effects on the final model. The negative impacts include:
- increasing the computational time required for training a model
- decreasing the predictive performance
- complicating the predictor importance calculation.

> By including any potential predictor, we end up with a sub-optimal model.

[Kuhn & Johnson (2019)](http://www.feat.engineering/index.html)

---

Common *dimensionality reduction* methods (some of which we've talked about!):
- PCA, Kernel PCA, Sparse PCA --> `step_pca()`, `step_kpca()`
- NMF --> `step_nnmf()`
- PLS --> `step_pls()`

All of these will find some projection matrix `$B_{pxk}$` where `$k$` is the number of columns (e.g. PCs) to choose and would typically be much lower than `$p$`, the original dimension.

You would then project your `$X$` matrix like so: `$X^* = XB$` (in most methods `$X^*$` columns would also be linearly independent, or "close" to it), and you would fit your data on this transformed `$X^*$`

---

# Interaction Effects

---

### What is Interaction?

---

### But seriously

```r
lm_int <- lm(log(income) ~ sex + height + sex:height,
             data = ok_tr_interaction %>% filter(between(height, 60, 80)))
summary(lm_int)
```

```
## 
## Call:
## lm(formula = log(income) ~ sex + height + sex:height, data = ok_tr_interaction %>% 
##     filter(between(height, 60, 80)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18582 -0.65885 -0.04721  0.50363  2.90622 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.075285   2.011322   5.009 8.38e-07 ***
## sexm        -0.506173   2.411062  -0.210    0.834    
## height       0.013731   0.030795   0.446    0.656    
## sexm:height  0.006272   0.036081   0.174    0.862    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.907 on 381 degrees of freedom
##   (1602 observations deleted due to missingness)
## Multiple R-squared:  0.003511,	Adjusted R-squared:  -0.004336 
## F-statistic: 0.4474 on 3 and 381 DF,  p-value: 0.7193
```

---

```r
ok_tr_interaction %>% filter(between(height, 60, 80)) %>%
  ggplot(aes(height, log(income), color = sex)) +
  geom_point() +
  geom_smooth(method = "lm")
```

.insight[
💡 How would you visualize an interaction between two continuous features? Two categorical? Three?...
]

---

#### You get an interaction! You get an interaction!

With the last `rec_eng_commonsense` recipe we reached over 130 features.

So all two-way interactions would mean adding... over 8K features. All three-way interactions would mean adding... over 350K features.

Which models surface interactions naturally? What about our model?

Should you include interaction terms before or after pre-processing?
]

---

Obviously, we need some control for this exploding no. of features:
- Expert knowledge .font80percent[(e.g. when it comes to income most variables probably interact with...?)]
- Heredity principle
- Resampling, Penalize
- When `$p$` is huge: Two-Stage Modeling combined with your favorite Feature Selection approach
- Whatever you choose, visualize!

---

So these will be our four options for handling interactions:

1. Not
2. Adding only interactions between `sex` and everything
3. Two Stage: Screening all "main effects" features selected by a single `glmnet` stage then all two-way interactions of those features
4. Two Stage: Screening all two-way interactions by multiple comparisons of a `glm` main effects model vs. main effects + interaction, combined with FDR adjustment for the p-value
5. All two-way interactions 😎 .font80percent[(with a `step_nzv()` though)]

---

For stages 3 and 4 let's use previously available data.

```r
ok_tr_combined <- bind_rows(
  ok_tr_missing,
  ok_tr_imbalance,
  ok_tr_eng
  )
ok_tr_combined <- full_features(ok_tr_combined)
```

---

For stage 3 need to screen out "main effects" features which passed `glmnet`:

```r
main_data <- rec_eng_commonsense %>%
  prep(ok_tr_combined) %>%
  bake(new_data = NULL)

mod_lr_res <- mod_lr %>%
  fit(pets ~ ., data = main_data) %>%
  pluck("fit") %>%
  tidy()

mod_lr_res
```

```
## # A tibble: 5,450 × 5
##    term         step  estimate lambda dev.ratio
##    <chr>       <dbl>     <dbl>  <dbl>     <dbl>
##  1 (Intercept)     1  5.14e-14 0.0909  8.81e-14
##  2 (Intercept)     2 -4.70e- 3 0.0828  9.49e- 3
##  3 (Intercept)     3 -2.94e- 3 0.0754  2.23e- 2
##  4 (Intercept)     4 -3.97e- 4 0.0687  3.48e- 2
##  5 (Intercept)     5  1.66e- 3 0.0626  4.57e- 2
##  6 (Intercept)     6  3.34e- 3 0.0571  5.87e- 2
##  7 (Intercept)     7  4.84e- 2 0.0520  7.24e- 2
##  8 (Intercept)     8  8.18e- 2 0.0474  8.59e- 2
##  9 (Intercept)     9  1.16e- 1 0.0432  9.84e- 2
## 10 (Intercept)    10  1.45e- 1 0.0393  1.10e- 1
## # … with 5,440 more rows
```

---

We can plot no. of features selected by `glmnet` by lambda

```r
mod_lr_res %>%
  count(lambda) %>%
  ggplot(aes(lambda, n)) +
  geom_line(linewidth = 2, color = "red", alpha = 0.5) +
  labs(y = "N features selected by glmnet") +
  theme_light()
```

---

We see a plateau around 12-15 features, let's choose those.

```r
mod_lr_res %>%
  filter(step == 6) %>%
  select(term, estimate) %>%
  arrange(-abs(estimate))
```

```
## # A tibble: 12 × 2
##    term                estimate
##    <chr>                  <dbl>
##  1 w_my_cat           -0.458   
##  2 status_all_else    -0.343   
##  3 religion_christian  0.304   
##  4 w_my_dog            0.227   
##  5 w_dog               0.114   
##  6 job_marketing       0.0623  
##  7 w_cat              -0.0551  
##  8 drugs_sometimes    -0.0368  
##  9 drinks_socially     0.0269  
## 10 w_cats             -0.0147  
## 11 (Intercept)         0.00334 
## 12 age                -0.000539
```

---

These make sense, let's choose them as our main effects from which we pick all two-way interactions.

```r
main_effects <- mod_lr_res %>%
  filter(step == 6, term != "(Intercept)") %>%
  pull(term)
```

For stage 4 we need to iterate over all possible two-way interactions, each time adding one into a full main effects `glm` model, getting the p-value, and applying FDR adjustment.

Start by defining the base model:

```r
mod_glm_base <- glm(pets ~ .,
                    data = main_data %>% select(-ends_with("_new")),
                    family = "binomial")
```

---

Then go over all interactions:

```r
extract_interaction_p_val <- function(var1, var2) {
  pb$tick()$print()
  int_term <- str_c(var1, ":", var2)
  int_term2 <- str_c(var2, ":", var1)
  mod_glm <- quietly(stats::update)(
    mod_glm_base, as.formula(str_c(".~.+", int_term))
  )
  if(length(mod_glm$warnings) > 1) {
    return(NA_real_)
  }
  p_val <- tidy(mod_glm$result) %>%
    filter(term %in% c(int_term, int_term2)) %>%
    pull(p.value)
  if (length(p_val) == 0) {
    return(NA_real_)
  }
  p_val
}

all_vars <- colnames(main_data %>% select(-ends_with("_new")))
all_ints <- t(combn(all_vars[-which(all_vars == "pets")], 2))
colnames(all_ints) <- c("var1", "var2")
all_ints <- as_tibble(all_ints)
```

---

```r
*pb <- progress_estimated(nrow(all_ints))
all_ints <- all_ints %>%
  mutate(p_val = map2_dbl(var1, var2, extract_interaction_p_val),
         fdr = p.adjust(p_val, method = "fdr"))
```

Let's filter only the *most* convincing interactions.

```r
pairwise_ints <- all_ints %>%
  arrange(fdr) %>%
  filter(fdr < 10e-7)

pairwise_ints
```

```
## # A tibble: 18 × 4
##    var1                   var2                      p_val      fdr
##    <chr>                  <chr>                     <dbl>    <dbl>
##  1 offspring_no_kids      offspring_wants_kids   3.60e-19 2.35e-15
##  2 ethnicity_white        job_hardware           3.08e-12 1.01e- 8
##  3 t_essays_n_words       education_kind_working 1.07e-10 1.40e- 7
##  4 t_essays_n_charS       education_kind_working 9.15e-11 1.40e- 7
##  5 t_essays_n_lowers      education_kind_working 8.64e-11 1.40e- 7
##  6 t_essays_n_uq_words    education_kind_working 1.40e-10 1.52e- 7
##  7 t_essays_n_uq_charS    education_kind_working 1.69e-10 1.58e- 7
##  8 speaks_chinese         t_essays_sent_bing     3.20e-10 2.61e- 7
##  9 t_essays_n_extraspaces education_kind_working 3.90e-10 2.83e- 7
## 10 age                    t_essays_n_uq_charS    5.63e-10 3.41e- 7
## 11 body_type_perfect      ethnicity_white        5.73e-10 3.41e- 7
## 12 speaks_spanish         speaks_french          8.40e-10 4.58e- 7
## 13 t_essays_n_periods     education_kind_working 1.28e- 9 6.42e- 7
## 14 age                    t_essays_n_charS       1.66e- 9 7.25e- 7
## 15 age                    t_essays_n_lowers      1.65e- 9 7.25e- 7
## 16 religion_laughing      diet_other             1.96e- 9 8.00e- 7
## 17 age                    t_essays_n_uq_words    2.50e- 9 9.16e- 7
## 18 t_essays_n_puncts      education_kind_working 2.52e- 9 9.16e- 7
```

---

First make sure you understand the difference between `other`, `all_else` etc.

```r
ok_tr_combined %>% count(ethnicity, sort = TRUE)
```

```
## # A tibble: 6 × 2
##   ethnicity     n
##   <fct>     <int>
## 1 white      3802
## 2 other      1675
## 3 asian       752
## 4 hispanic    401
## 5 black       244
## 6 indian      126
```

```r
main_data %>% select(starts_with("ethnicity")) %>% head()
```

```
## # A tibble: 6 × 5
##   ethnicity_hispanic ethnicity_other ethnicity_white ethnicity_all_else
##                <dbl>           <dbl>           <dbl>              <dbl>
## 1                  0               0               1                  0
## 2                  0               0               1                  0
## 3                  0               0               1                  0
## 4                  0               1               0                  0
## 5                  0               0               1                  0
## 6                  0               0               1                  0
## # … with 1 more variable: ethnicity_new <dbl>
```

---

What are those people with all `ethnicity` columns with zero?

```r
main_data %>% select(starts_with("ethnicity")) %>%
  filter(if_all(everything(), ~.x == 0))
```

```
## # A tibble: 1,039 × 5
##    ethnicity_hispanic ethnicity_other ethnicity_white ethnicity_all_else
##                 <dbl>           <dbl>           <dbl>              <dbl>
##  1                  0               0               0                  0
##  2                  0               0               0                  0
##  3                  0               0               0                  0
##  4                  0               0               0                  0
##  5                  0               0               0                  0
##  6                  0               0               0                  0
##  7                  0               0               0                  0
##  8                  0               0               0                  0
##  9                  0               0               0                  0
## 10                  0               0               0                  0
## # … with 1,029 more rows, and 1 more variable: ethnicity_new <dbl>
```

---

Same with `location`.

```r
ok_tr_combined %>% count(location, sort = TRUE)
```

```
## # A tibble: 4 × 2
##   location     n
##   <fct>    <int>
## 1 sf        3739
## 2 other     2045
## 3 oakland    776
## 4 berkeley   440
```

```r
main_data %>% select(starts_with("location")) %>% head()
```

```
## # A tibble: 6 × 4
##   location_oakland location_other location_sf location_new
##              <dbl>          <dbl>       <dbl>        <dbl>
## 1                0              1           0            0
## 2                0              1           0            0
## 3                0              0           1            0
## 4                0              0           1            0
## 5                1              0           0            0
## 6                1              0           0            0
```

---

```r
main_data %>% group_by(offspring_no_kids, offspring_wants_kids) %>%
    summarise(pct_cats = mean(pets == "cats"), n = n()) %>% ungroup() %>%
    mutate(across(c("offspring_no_kids", "offspring_wants_kids"), as.factor)) %>%
    ggplot(aes(offspring_no_kids, pct_cats, color = offspring_wants_kids)) +
    geom_line(aes(group = offspring_wants_kids)) + geom_point(aes(size = n)) +
    scale_y_continuous(labels = percent_format()) +
    theme_light()
```

---

```r
main_data %>% group_by(ethnicity_white, job_hardware) %>%
    summarise(pct_cats = mean(pets == "cats"), n = n()) %>% ungroup() %>%
    mutate(across(c("ethnicity_white", "job_hardware"), as.factor)) %>%
    ggplot(aes(ethnicity_white, pct_cats, color = job_hardware)) +
    geom_line(aes(group = job_hardware)) + geom_point(aes(size = n)) +
    scale_y_continuous(labels = percent_format()) +
    theme_light()
```

---

---

```r
ok_tr_interaction2 <- full_features(ok_tr_interaction)

rec_int_noint <- rec_eng_commonsense

rec_int_sex <- recipe(pets ~ ., data = ok_tr_interaction2) %>%
  step_textfeature(essays, prefix = "t",
                   extract_functions = my_text_funs) %>%
  update_role(essays, new_role = "discarded") %>%
  step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>%
  step_log(income, starts_with("len_"), starts_with("t_"),
           -t_essays_sent_bing, offset = 1) %>%
  step_impute_mean(income) %>%
  step_other(all_nominal_predictors(), -has_role("discarded"), other = "all_else") %>%
  step_novel(all_nominal_predictors(), -has_role("discarded")) %>%
  step_impute_mode(all_nominal_predictors(), -has_role("discarded")) %>%
  step_dummy(all_nominal_predictors(), -has_role("discarded"), one_hot = FALSE) %>%
* step_interact(~starts_with("sex"):all_numeric_predictors()) %>%
* step_nzv(all_numeric_predictors(), freq_cut = 99/1) %>%
  step_upsample(pets, over_ratio = 1, seed = 42)
```

---

```r
# pairwise interactions string code from Max Kuhn's workshop:
# https://github.com/topepo/FES/blob/master/07_Detecting_Interaction_Effects/7_04_The_Brute-Force_Approach_to_Identifying_Predictive_Interactions/ames_glmnet.R

topmain_ints <- t(combn(main_effects, 2))
colnames(topmain_ints) <- c("var1", "var2")
topmain_ints <- topmain_ints %>% 
  as_tibble() %>%
  mutate(
    term = str_c(var1, ":", var2)
  ) %>% 
  pull(term) %>% 
  paste(collapse = "+")

topmain_ints <- str_c("~", topmain_ints)
topmain_ints <- as.formula(topmain_ints)

topmain_ints
```

```
## ~age:w_cat + age:w_dog + age:w_cats + age:w_my_cat + age:w_my_dog + 
##     age:drinks_socially + age:drugs_sometimes + age:job_marketing + 
##     age:religion_christian + age:status_all_else + w_cat:w_dog + 
##     w_cat:w_cats + w_cat:w_my_cat + w_cat:w_my_dog + w_cat:drinks_socially + 
##     w_cat:drugs_sometimes + w_cat:job_marketing + w_cat:religion_christian + 
##     w_cat:status_all_else + w_dog:w_cats + w_dog:w_my_cat + w_dog:w_my_dog + 
##     w_dog:drinks_socially + w_dog:drugs_sometimes + w_dog:job_marketing + 
##     w_dog:religion_christian + w_dog:status_all_else + w_cats:w_my_cat + 
##     w_cats:w_my_dog + w_cats:drinks_socially + w_cats:drugs_sometimes + 
##     w_cats:job_marketing + w_cats:religion_christian + w_cats:status_all_else + 
##     w_my_cat:w_my_dog + w_my_cat:drinks_socially + w_my_cat:drugs_sometimes + 
##     w_my_cat:job_marketing + w_my_cat:religion_christian + w_my_cat:status_all_else + 
##     w_my_dog:drinks_socially + w_my_dog:drugs_sometimes + w_my_dog:job_marketing + 
##     w_my_dog:religion_christian + w_my_dog:status_all_else + 
##     drinks_socially:drugs_sometimes + drinks_socially:job_marketing + 
##     drinks_socially:religion_christian + drinks_socially:status_all_else + 
##     drugs_sometimes:job_marketing + drugs_sometimes:religion_christian + 
##     drugs_sometimes:status_all_else + job_marketing:religion_christian + 
##     job_marketing:status_all_else + religion_christian:status_all_else
```

---

```r
rec_int_topmain <- recipe(pets ~ ., data = ok_tr_interaction2) %>%
  step_textfeature(essays, prefix = "t",
                   extract_functions = my_text_funs) %>%
  update_role(essays, new_role = "discarded") %>%
  step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>%
  step_log(income, starts_with("len_"), starts_with("t_"),
           -t_essays_sent_bing, offset = 1) %>%
  step_impute_mean(income) %>%
  step_other(all_nominal_predictors(), -has_role("discarded"), other = "all_else") %>%
  step_novel(all_nominal_predictors(), -has_role("discarded")) %>%
  step_impute_mode(all_nominal_predictors(), -has_role("discarded")) %>%
  step_dummy(all_nominal_predictors(), -has_role("discarded"), one_hot = FALSE) %>%
* step_interact(topmain_ints) %>%
  step_nzv(all_numeric_predictors(), freq_cut = 99/1) %>%
  step_upsample(pets, over_ratio = 1, seed = 42)
```

---

```r
rec_int_topints <- recipe(pets ~ ., data = ok_tr_interaction2) %>%
  step_textfeature(essays, prefix = "t",
                   extract_functions = my_text_funs) %>%
  update_role(essays, new_role = "discarded") %>%
  step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>%
  step_log(income, starts_with("len_"), starts_with("t_"),
           -t_essays_sent_bing, offset = 1) %>%
  step_impute_mean(income) %>%
  step_other(all_nominal_predictors(), -has_role("discarded"), other = "all_else") %>%
  step_novel(all_nominal_predictors(), -has_role("discarded")) %>%
  step_impute_mode(all_nominal_predictors(), -has_role("discarded")) %>%
  step_dummy(all_nominal_predictors(), -has_role("discarded"), one_hot = FALSE) %>%
* step_interact(topint_ints) %>%
  step_nzv(all_numeric_predictors(), freq_cut = 99/1) %>%
  step_upsample(pets, over_ratio = 1, seed = 42)
```

---

```r
rec_int_all <- recipe(pets ~ ., data = ok_tr_interaction2) %>%
  step_textfeature(essays, prefix = "t",
                   extract_functions = my_text_funs) %>%
  update_role(essays, new_role = "discarded") %>%
  step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>%
  step_log(income, starts_with("len_"), starts_with("t_"),
           -t_essays_sent_bing, offset = 1) %>%
  step_impute_mean(income) %>%
  step_other(all_nominal_predictors(), -has_role("discarded"), other = "all_else") %>%
  step_novel(all_nominal_predictors(), -has_role("discarded")) %>%
  step_impute_mode(all_nominal_predictors(), -has_role("discarded")) %>%
  step_dummy(all_nominal_predictors(), -has_role("discarded"), one_hot = FALSE) %>%
* step_interact(~all_numeric_predictors():all_numeric_predictors()) %>%
  step_nzv(all_numeric_predictors(), freq_cut = 95/5) %>%
  step_upsample(pets, over_ratio = 1, seed = 42)
```

---

```r
set.seed(42)
cv_splits_int <- vfold_cv(ok_tr_interaction2, v = 10, strata = pets)

rec_list <- list("noint" = rec_int_noint,
                 "sex" = rec_int_sex,
                 "topmain" = rec_int_topmain,
                 "topints" = rec_int_topints,
                 "all" = rec_int_all)

wset <- workflow_set(rec_list, mod_list) %>%
  workflow_map("fit_resamples",
               resamples = cv_splits_int,
               metrics = metric_set(roc_auc))

cv_res_int <- wset %>%
  unnest(result) %>%
  mutate(recipe = wflow_id, auc = map_dbl(.metrics, ~.x$.estimate)) %>%
  select(recipe, id, auc) %>%
  separate_wider_delim(recipe, "_", names = c("recipe", "mod"))

cv_res_int_meds <- cv_res_int %>%
  group_by(recipe) %>%
  summarise(auc = mean(auc), id = "All")
```

---

```r
cv_res_int %>%
  mutate(recipe = fct_relevel(recipe, c("noint", "sex", "topmain", "topints", "all"))) %>%
  ggplot(aes(recipe, auc, group = id, color = id)) +
  geom_line() + geom_line(data = cv_res_int_meds, color = "black", alpha = 0.2, linewidth = 2) +
  geom_point() + geom_point(data = cv_res_int_meds, color = "black", alpha = 0.2, size = 5, pch = 17) +
  guides(color = "none") + theme_light()
```

---

# Tuning Params

---

### Everything is Tunable

But let us simply vary Elastic Net `$\alpha$` and `$\lambda$`:

`$\min_{\beta_0, \beta}\frac{1}{N}\sum_{i = 1}^N w_i l(y_i,\beta_0 + \beta^Tx_i) + \lambda[(1-\alpha)||\beta||^2_2/2 + \alpha||\beta||_1]$`

```r
mod_en_tune <- logistic_reg(penalty = tune(),
                            mixture = tune()) %>%
  set_engine("glmnet")

en_grid <- grid_regular(penalty(range(-6, 0)),
                        mixture(range(0, 1)), levels = c(7, 11))
```

---

```r
ok_tr_tuning2 <- full_features(ok_tr_tuning)

set.seed(42)
cv_splits_tune <- vfold_cv(ok_tr_tuning2, v = 10, strata = pets)
```

```r
tune_res <- tune_grid(object = mod_en_tune,
                      preprocessor = rec_eng_commonsense,
                      resamples = cv_splits_tune,
                      grid = en_grid,
                      metrics = metric_set(roc_auc),
                      control = control_grid(verbose = TRUE))

collect_metrics_tune_res <- collect_metrics(tune_res)

collect_metrics_tune_res
```

```
## # A tibble: 77 × 8
##     penalty mixture .metric .estimator  mean     n std_err .config              
##       <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1 0.000001     0   roc_auc binary     0.785    10  0.0120 Preprocessor1_Model01
##  2 0.00001      0   roc_auc binary     0.785    10  0.0120 Preprocessor1_Model02
##  3 0.0001       0   roc_auc binary     0.785    10  0.0120 Preprocessor1_Model03
##  4 0.001        0   roc_auc binary     0.785    10  0.0120 Preprocessor1_Model04
##  5 0.01         0   roc_auc binary     0.785    10  0.0120 Preprocessor1_Model05
##  6 0.1          0   roc_auc binary     0.780    10  0.0125 Preprocessor1_Model06
##  7 1            0   roc_auc binary     0.758    10  0.0135 Preprocessor1_Model07
##  8 0.000001     0.1 roc_auc binary     0.787    10  0.0113 Preprocessor1_Model08
##  9 0.00001      0.1 roc_auc binary     0.787    10  0.0113 Preprocessor1_Model09
## 10 0.0001       0.1 roc_auc binary     0.787    10  0.0113 Preprocessor1_Model10
## # … with 67 more rows
```

---

```r
collect_metrics_tune_res %>%
  mutate(auc = mean, mixture = factor(mixture)) %>%
  ggplot(aes(penalty, auc, color = mixture)) +
  geom_line() + geom_point() +
  theme_light()
```

---

```r
show_best(tune_res, n = 10, metric = "roc_auc")
```

```
## # A tibble: 10 × 8
##    penalty mixture .metric .estimator  mean     n std_err .config              
##      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1   0.01      0.4 roc_auc binary     0.789    10  0.0108 Preprocessor1_Model33
##  2   0.01      0.5 roc_auc binary     0.789    10  0.0110 Preprocessor1_Model40
##  3   0.01      0.3 roc_auc binary     0.789    10  0.0110 Preprocessor1_Model26
##  4   0.01      0.6 roc_auc binary     0.788    10  0.0111 Preprocessor1_Model47
##  5   0.01      0.2 roc_auc binary     0.788    10  0.0112 Preprocessor1_Model19
##  6   0.001     1   roc_auc binary     0.788    10  0.0117 Preprocessor1_Model74
##  7   0.01      0.7 roc_auc binary     0.787    10  0.0111 Preprocessor1_Model54
##  8   0.001     0.8 roc_auc binary     0.787    10  0.0117 Preprocessor1_Model60
##  9   0.01      0.8 roc_auc binary     0.787    10  0.0111 Preprocessor1_Model61
## 10   0.001     0.9 roc_auc binary     0.787    10  0.0117 Preprocessor1_Model67
```

---

# Cutoff choice

---

### The model is there.

Remember it is best to predict a score, let the model's "clients" choose what to do with that score.

Also remember the "I don't know" strategy!

But if we must, let us provide:
- a score to maximize precision of `cats` .font80percent[(while maintaing acceptable levels of recall)]
- a score to maximize recall of `cats` .font80percent[(while maintaing acceptable levels of precision)]
- a middle score

---

```r
ok_tr_combined <- full_features(
  bind_rows(
    ok_tr_missing,
    ok_tr_imbalance,
    ok_tr_eng,
    ok_tr_interaction,
    ok_tr_tuning))

ok_tr_cutoff2 <- full_features(ok_tr_cutoff)

rec_cutoff <- rec_eng_commonsense %>%
  prep(ok_tr_combined)

ok_tr_for_cutoff <- rec_cutoff %>%
  bake(new_data = NULL)

ok_te_for_cutoff <- rec_cutoff %>%
  bake(ok_tr_cutoff2)

mod_lr_cutoff <- logistic_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet")

mod_lr_cutoff <- mod_lr_cutoff %>%
  fit(pets ~ ., data = ok_tr_for_cutoff)

pred_cutoff <-  mod_lr_cutoff %>%
  predict(new_data = ok_te_for_cutoff, penalty = 0.01, type = "prob") %>%
  mutate(truth = ok_te_for_cutoff$pets)
```

---

```r
library(pROC)

cutoffs <- seq(0.1, 0.9, 0.1)

roc_obj <- roc(pred_cutoff$truth, pred_cutoff$.pred_cats,
               levels = c("dogs", "cats"))
coords(roc_obj, x = cutoffs,
                ret = c("accuracy", "recall", "precision",
                        "specificity", "npv"),
                transpose = TRUE)
```

```
##                   [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
## accuracy    0.21262003 0.2949246 0.4663923 0.6159122 0.7503429 0.8052126
## recall      1.00000000 0.9916667 0.9250000 0.8416667 0.6833333 0.4666667
## precision   0.17291066 0.1882911 0.2260692 0.2790055 0.3628319 0.4179104
## specificity 0.05747126 0.1576355 0.3760263 0.5714286 0.7635468 0.8719212
## npv         1.00000000 0.9896907 0.9621849 0.9482289 0.9244533 0.8924370
##                  [,7]      [,8]      [,9]
## accuracy    0.8381344 0.8600823 0.8545953
## recall      0.3166667 0.2333333 0.1333333
## precision   0.5135135 0.7368421 0.8888889
## specificity 0.9408867 0.9835796 0.9967159
## npv         0.8748092 0.8668596 0.8537271
```

---

# Predict on Test set

---

### Long time no see!

```r
set.seed(42)
random_cats_to_get_rid_of <- okcupid_test %>%
  mutate(id = 1:n()) %>%
  filter(pets == "cats") %>%
  sample_n(40) %>%
  pull(id)

okcupid_test2 <- okcupid_test %>%
  mutate(id = 1:n()) %>%
  filter(!id %in% random_cats_to_get_rid_of) %>%
  select(-id)
*okcupid_test2 <- full_features(okcupid_test2)

okcupid_test2 %>% count(pets) %>% mutate(pct = n / sum(n))
```

```
## # A tibble: 2 × 3
##   pets      n   pct
##   <fct> <int> <dbl>
## 1 cats    653 0.165
## 2 dogs   3307 0.835
```

---

```r
okcupid_test_baked <- rec_cutoff %>%
  bake(okcupid_test2)

mod_lr_test <- logistic_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet")

mod_lr_test <- mod_lr_test %>%
  fit(pets ~ ., data = ok_tr_for_cutoff)

pred_test <-  mod_lr_test %>%
  predict(new_data = okcupid_test_baked, penalty = 0.01, type = "prob") %>%
  mutate(truth = okcupid_test_baked$pets)
```

```r
roc_auc(pred_test, truth, .pred_cats)
```

```
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.792
```

---

```r
roc_lr <- roc_curve(
  tibble(truth = pred_test$truth, cats = pred_test$.pred_cats),
  truth, cats)

autoplot(roc_lr) +
  labs(x = "1 - Recall(Dogs)", y = "Recall(Cats)")
```

---

```r
roc_obj <- roc(pred_test$truth, pred_test$.pred_cats,
               levels = c("dogs", "cats"))

cutoffs <- c(0.3, 0.5, 0.9)

coords(roc_obj, x = cutoffs,
                ret = c("accuracy", "recall", "precision",
                        "specificity", "npv"),
                transpose = TRUE)
```

```
##                  [,1]      [,2]      [,3]
## accuracy    0.4608586 0.7361111 0.8520202
## recall      0.9356815 0.6875957 0.1056662
## precision   0.2259615 0.3480620 0.9718310
## specificity 0.3671001 0.7456910 0.9993952
## npv         0.9665605 0.9235955 0.8498329
```

---

### What *is* the model?

```r
*tidy(mod_lr_test) %>%
  filter(near(penalty, 0.01, tol = 0.0005)) %>%
  arrange(-abs(estimate)) %>%
  select(term, estimate, penalty) %>%
  head(20)
```

```
## # A tibble: 20 × 3
##    term               estimate penalty
##    <chr>                 <dbl>   <dbl>
##  1 w_my_cat             -2.13     0.01
##  2 w_my_dog              1.57     0.01
##  3 w_kitt               -1.25     0.01
##  4 w_love_dog            0.840    0.01
##  5 w_cats               -0.772    0.01
##  6 religion_christian    0.764    0.01
##  7 (Intercept)          -0.652    0.01
##  8 w_dog                 0.648    0.01
##  9 religion_jewish       0.613    0.01
## 10 body_type_thin       -0.587    0.01
## 11 diet_all_else        -0.529    0.01
## 12 drugs_sometimes      -0.509    0.01
## 13 status_all_else      -0.508    0.01
## 14 diet_vegetarian      -0.485    0.01
## 15 job_marketing         0.480    0.01
## 16 job_hardware         -0.452    0.01
## 17 drinks_socially       0.438    0.01
## 18 ethnicity_white      -0.429    0.01
## 19 drugs_all_else       -0.388    0.01
## 20 body_type_all_else   -0.384    0.01
```

---

### What did we achieve?

- ✔️ Reproducible model
- ✔️ Dependable model
- ✔️ Explainable model
- ✔️ Small and Fast as hell model .font80percent[(put in on a nanochip!)]
- ❌ Impressive model

### Is that the end?

No.

---

### Few Excellent Books

]

]