class: logo-slide --- class: title-slide ## Modeling in the Tidyverse ### Applications of Data Science - Class 5 ### Giora Simchoni #### `gsimchoni@gmail.com and add #dsapps in subject` ### Stat. and OR Department, TAU ### 2023-03-26 --- layout: true <div class="my-footer"> <span> <a href="https://dsapps-2023.github.io/Class_Slides/" target="_blank">Applications of Data Science </a> </span> </div> --- class: section-slide # The Problem --- ### Inconsistency, Inextensibility ```r n <- 10000 x1 <- runif(n) x2 <- runif(n) t <- 1 + 2 * x1 + 3 * x2 y <- rbinom(n, 1, 1 / (1 + exp(-t))) ``` ```r glm(y ~ x1 + x2, family = "binomial") ``` ```r glmnet(as.matrix(cbind(x1, x2)), as.factor(y), family = "binomial") ``` ```r randomForest(as.factor(y) ~ x1 + x2) ``` ```r gbm(y ~ x1 + x2, data = data.frame(x1 = x1, x2 = x2, y = y)) ``` 😱 --- ### Compare this with `sklearn` ```python from sklearn.linear_model import LogisticRegression from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier LogisticRegression(penalty='none').fit(X, y) LogisticRegression(penalty='l2', C=0.001).fit(X, y) RandomForestClassifier(n_estimators=100).fit(X, y) GradientBoostingClassifier(n_estimators=100).fit(X, y) ``` --- class: section-slide # Detour: A Regression Problem --- ### Hungarian Blogs: Predicting Feedback - Dataset was published in the [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/BlogFeedback) - Comes from [Buza 2014](http://www.cs.bme.hu/~buza/pdfs/gfkl2012_blogs.pdf) - 280 heavily engineered numeric features on blogs and on posts published in the last 72 hours - Can we predict no. of comments in the next 24 hours? <img src="images/hungarian_blog.png" style="width: 70%" /> --- The raw data has over 50K rows: for each blog post it records features such as total comments until base time, weekday of publication, number of words, etc.
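If you want to run the code yourself, you first need the raw CSVs locally. A minimal sketch, assuming the UCI download URL below is still current and that you extract the zip into `~/BlogFeedback` (both are assumptions, adjust to your setup):

```r
# hypothetical download step: URL and destination folder are assumptions
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00304/BlogFeedback.zip"
zip_path <- tempfile(fileext = ".zip")
download.file(url, zip_path, mode = "wb")
unzip(zip_path, exdir = path.expand("~/BlogFeedback"))
```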
We will be predicting `log(fb)` based on all features, no missing values: ```r blogs_fb <- read_csv("~/BlogFeedback/blogData_train.csv", col_names = FALSE) blogs_fb <- blogs_fb %>% rename(fb = X281, blog_len = X62, sunday = X276) %>% mutate(fb = log(fb + 1)) dim(blogs_fb) ``` ``` ## [1] 52397 281 ``` --- ```r glimpse(blogs_fb) ``` ``` ## Rows: 52,397 ## Columns: 281 ## $ X1 <dbl> 40.30467, 40.30467, 40.30467, 40.30467, 40.30467, 40.30467, 4… ## $ X2 <dbl> 53.84566, 53.84566, 53.84566, 53.84566, 53.84566, 53.84566, 5… ## $ X3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X4 <dbl> 401, 401, 401, 401, 401, 401, 401, 401, 401, 401, 401, 401, 4… ## $ X5 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1… ## $ X6 <dbl> 15.52416, 15.52416, 15.52416, 15.52416, 15.52416, 15.52416, 1… ## $ X7 <dbl> 32.44188, 32.44188, 32.44188, 32.44188, 32.44188, 32.44188, 3… ## $ X8 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X9 <dbl> 377, 377, 377, 377, 377, 377, 377, 377, 377, 377, 377, 377, 3… ## $ X10 <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3… ## $ X11 <dbl> 14.04423, 14.04423, 14.04423, 14.04423, 14.04423, 14.04423, 1… ## $ X12 <dbl> 32.61542, 32.61542, 32.61542, 32.61542, 32.61542, 32.61542, 3… ## $ X13 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X14 <dbl> 377, 377, 377, 377, 377, 377, 377, 377, 377, 377, 377, 377, 3… ## $ X15 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2… ## $ X16 <dbl> 34.56757, 34.56757, 34.56757, 34.56757, 34.56757, 34.56757, 3… ## $ X17 <dbl> 48.47518, 48.47518, 48.47518, 48.47518, 48.47518, 48.47518, 4… ## $ X18 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X19 <dbl> 378, 378, 378, 378, 378, 378, 378, 378, 378, 378, 378, 378, 3… ## $ X20 <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1… ## $ X21 <dbl> 1.479934, 1.479934, 1.479934, 1.479934, 1.479934, 1.479934, 1… ## $ X22 <dbl> 46.18691, 46.18691, 46.18691, 46.18691, 46.18691, 46.18691, 4… ## $ X23 <dbl> -356, -356, -356, -356, -356, -356, -356, -356, -356, -356, -… ## $ X24 <dbl> 377, 377, 377, 377, 377, 377, 377, 377, 377, 377, 377, 377, 3… ## $ X25 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X26 <dbl> 1.076167, 1.076167, 1.076167, 1.076167, 1.076167, 1.076167, 1… ## $ X27 <dbl> 1.795416, 1.795416, 1.795416, 1.795416, 1.795416, 1.795416, 1… ## $ X28 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X29 <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1… ## $ X30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X31 <dbl> 0.4004914, 0.4004914, 0.4004914, 0.4004914, 0.4004914, 0.4004… ## $ X32 <dbl> 1.078097, 1.078097, 1.078097, 1.078097, 1.078097, 1.078097, 1… ## $ X33 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X34 <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9… ## $ X35 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X36 <dbl> 0.3775594, 0.3775594, 0.3775594, 0.3775594, 0.3775594, 0.3775… ## $ X37 <dbl> 1.07421, 1.07421, 1.07421, 1.07421, 1.07421, 1.07421, 1.07421… ## $ X38 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X39 <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9… ## $ X40 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X41 <dbl> 0.972973, 0.972973, 0.972973, 0.972973, 0.972973, 0.972973, 
0… ## $ X42 <dbl> 1.704671, 1.704671, 1.704671, 1.704671, 1.704671, 1.704671, 1… ## $ X43 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X44 <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1… ## $ X45 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X46 <dbl> 0.02293202, 0.02293202, 0.02293202, 0.02293202, 0.02293202, 0… ## $ X47 <dbl> 1.521174, 1.521174, 1.521174, 1.521174, 1.521174, 1.521174, 1… ## $ X48 <dbl> -8, -8, -8, -8, -8, -8, -8, -8, -8, -8, -8, -8, -8, -8, -8, -… ## $ X49 <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9… ## $ X50 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X51 <dbl> 2, 6, 6, 2, 3, 6, 6, 3, 30, 30, 0, 0, 30, 0, 51, 10, 32, 10, … ## $ X52 <dbl> 2, 2, 2, 2, 1, 0, 0, 1, 27, 27, 0, 0, 30, 0, 51, 10, 2, 10, 5… ## $ X53 <dbl> 0, 4, 4, 0, 2, 2, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 30, 0, 0, 51,… ## $ X54 <dbl> 2, 5, 5, 2, 2, 5, 5, 2, 2, 2, 0, 0, 30, 0, 51, 10, 32, 10, 51… ## $ X55 <dbl> 2, -2, -2, 2, -1, -2, -2, -1, 26, 26, 0, 0, 30, 0, 51, 10, -2… ## $ X56 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 2, 3, 0, 2, 0, 3, 4, 0… ## $ X57 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 3, 0, 1, 0, 3, 1, 0… ## $ X58 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 3, 0… ## $ X59 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 2, 3, 0, 2, 0, 3, 4, 0… ## $ X60 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, -2, 1, 0, 3, 0, 0, 0, 3, -2,… ## $ X61 <dbl> 10, 35, 35, 10, 34, 59, 59, 34, 58, 58, 11, 35, 5, 59, 8, 14,… ## $ blog_len <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X63 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X64 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X65 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X66 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X67 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X68 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X69 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X70 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X71 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X72 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X73 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X74 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X75 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X76 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X77 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X78 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X79 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X80 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X81 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X82 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X83 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X84 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X85 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X86 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X87 <dbl> 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X88 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X89 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X90 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X91 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X92 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X93 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X94 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X95 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X96 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X97 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X98 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X99 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X100 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X101 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X102 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X103 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X104 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X105 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X106 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X107 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X108 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X109 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X110 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X111 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X112 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X113 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X114 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X115 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X116 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X117 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X118 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X119 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X120 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X121 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X122 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X123 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X124 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X125 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X126 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X127 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X128 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X129 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X130 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X131 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X132 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X133 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X134 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X135 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X136 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X137 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X138 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X139 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X140 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X141 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X142 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X143 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X144 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X145 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X146 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X147 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X148 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X149 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X150 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X151 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X152 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X153 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X154 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X155 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X156 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X157 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X158 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X159 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X160 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X161 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X162 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X163 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X164 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X165 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X166 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X167 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X168 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X169 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X170 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X171 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X172 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X173 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X174 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X175 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X176 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X177 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X178 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X179 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X180 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X181 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X182 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X183 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X184 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X185 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X186 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X187 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X188 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X189 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X190 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X191 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X192 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X193 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X194 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X195 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X196 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X197 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X198 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X199 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X200 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X201 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X202 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X203 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X204 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X205 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X206 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X207 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X208 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X209 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X210 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X211 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X212 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X213 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X214 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X215 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X216 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X217 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X218 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X219 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X220 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X221 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X222 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X223 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X224 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X225 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X226 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X227 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X228 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X229 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X230 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X231 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X232 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X233 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X234 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X235 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X236 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X237 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X238 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X239 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X240 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X241 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X242 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X243 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X244 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X245 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X246 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X247 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X248 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X249 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X250 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X251 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X252 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X253 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X254 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X255 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X256 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X257 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X258 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X259 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X260 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X261 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X262 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X263 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X264 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0… ## $ X265 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0… ## $ X266 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1… ## $ X267 <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X268 <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X269 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X270 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0… ## $ X271 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1… ## $ X272 <dbl> 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X273 <dbl> 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X274 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X275 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0… ## $ sunday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X277 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X278 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X279 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ X280 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ fb <dbl> 0.6931472, 0.0000000, 0.0000000, 0.6931472, 3.3322045, 0.0000… ``` --- See the dependent variable distribution: ```r ggplot(blogs_fb, aes(fb)) + geom_histogram(fill = "red", alpha = 0.5, binwidth = 0.5) + theme_bw() ``` <img src="images/Bench-Hist-1.png" width="100%" /> --- See it vs. say "length of post": ```r ggplot(blogs_fb, aes(log(blog_len), fb)) + geom_point(color = "red", alpha = 0.5) + theme_bw() ``` <img src="images/Bench-Age-Equipment-1.png" width="100%" /> --- class: section-slide # End of Detour --- # WARNING .warning[ 💡 What you're about to see is not a good modeling/prediction flow. This is just an intro to tidy modeling. Some of the issues with how things are done here will be raised, some will have to wait till later in the course. ] --- class: section-slide # The ~~Present~~ Past Solution: `caret` --- ### Split Data ```r library(caret) train_idx <- createDataPartition(blogs_fb$fb, p = 0.6, list = FALSE) blogs_tr <- blogs_fb[train_idx, ] blogs_te <- blogs_fb[-train_idx, ] library(glue) glue("train no. of rows: {nrow(blogs_tr)} test no. of rows: {nrow(blogs_te)}") ``` ``` ## train no. of rows: 31439 ## test no. of rows: 20958 ``` Here you might consider some preprocessing. `caret` has some nice documentation [here](https://topepo.github.io/caret/index.html). --- ### Tuning and Modeling Define general methodology, e.g. 
5-fold Cross-Validation: ```r fit_control <- trainControl(method = "cv", number = 5) ridge_grid <- expand.grid(alpha=0, lambda = 10^seq(-3, 1, length = 50)) lasso_grid <- expand.grid(alpha=1, lambda = 10^seq(-3, 1, length = 50)) rf_grid <- expand.grid(splitrule = "variance", min.node.size = seq(10, 30, 10), mtry = seq(10, 50, 20)) mod_ridge <- train(fb ~ ., data = blogs_tr, method = "glmnet", trControl = fit_control, tuneGrid = ridge_grid, metric = "RMSE") mod_lasso <- train(fb ~ ., data = blogs_tr, method = "glmnet", trControl = fit_control, tuneGrid = lasso_grid, metric = "RMSE") mod_rf <- train(fb ~ ., data = blogs_tr, method = "ranger", trControl = fit_control, tuneGrid = rf_grid, num.trees = 50, metric = "RMSE") ``` --- ### Evaluating Models ```r mod_ridge ``` ``` ## glmnet ## ## 31439 samples ## 280 predictor ## ## No pre-processing ## Resampling: Cross-Validated (5 fold) ## Summary of sample sizes: 25151, 25150, 25152, 25151, 25152 ## Resampling results across tuning parameters: ## ## lambda RMSE Rsquared MAE ## 0.001000000 0.8465427 0.4432868 0.5902339 ## 0.001206793 0.8465427 0.4432868 0.5902339 ## 0.001456348 0.8465427 0.4432868 0.5902339 ## 0.001757511 0.8465427 0.4432868 0.5902339 ## 0.002120951 0.8465427 0.4432868 0.5902339 ## 0.002559548 0.8465427 0.4432868 0.5902339 ## 0.003088844 0.8465427 0.4432868 0.5902339 ## 0.003727594 0.8465427 0.4432868 0.5902339 ## 0.004498433 0.8465427 0.4432868 0.5902339 ## 0.005428675 0.8465427 0.4432868 0.5902339 ## 0.006551286 0.8465427 0.4432868 0.5902339 ## 0.007906043 0.8465427 0.4432868 0.5902339 ## 0.009540955 0.8465427 0.4432868 0.5902339 ## 0.011513954 0.8465427 0.4432868 0.5902339 ## 0.013894955 0.8465427 0.4432868 0.5902339 ## 0.016768329 0.8465427 0.4432868 0.5902339 ## 0.020235896 0.8465427 0.4432868 0.5902339 ## 0.024420531 0.8465427 0.4432868 0.5902339 ## 0.029470517 0.8465427 0.4432868 0.5902339 ## 0.035564803 0.8465427 0.4432868 0.5902339 ## 0.042919343 0.8465427 0.4432868 0.5902339 ## 0.051794747 0.8465427 0.4432868 0.5902339 ## 0.062505519 0.8467117 0.4430782 0.5903658 ## 0.075431201 0.8479974 0.4414519 0.5913602 ## 0.091029818 0.8493544 0.4397435 0.5923965 ## 0.109854114 0.8507596 0.4379886 0.5934634 ## 0.132571137 0.8522240 0.4361799 0.5945637 ## 0.159985872 0.8537427 0.4343320 0.5956963 ## 0.193069773 0.8553299 0.4324376 0.5968782 ## 0.232995181 0.8570035 0.4304862 0.5981506 ## 0.281176870 0.8588006 0.4284420 0.5995849 ## 0.339322177 0.8607655 0.4262615 0.6012393 ## 0.409491506 0.8629367 0.4239091 0.6031958 ## 0.494171336 0.8653867 0.4212996 0.6055008 ## 0.596362332 0.8681789 0.4183521 0.6082562 ## 0.719685673 0.8713638 0.4150067 0.6115633 ## 0.868511374 0.8749957 0.4111848 0.6154977 ## 1.048113134 0.8791111 0.4068194 0.6201124 ## 1.264855217 0.8837361 0.4018633 0.6253678 ## 1.526417967 0.8888746 0.3962879 0.6311635 ## 1.842069969 0.8945103 0.3900955 0.6373592 ## 2.222996483 0.9006079 0.3833192 0.6438697 ## 2.682695795 0.9071214 0.3760195 0.6506244 ## 3.237457543 0.9139965 0.3682851 0.6575705 ## 3.906939937 0.9211772 0.3602281 0.6647188 ## 4.714866363 0.9286122 0.3519779 0.6719947 ## 5.689866029 0.9362606 0.3436749 0.6793779 ## 6.866488450 0.9440955 0.3354611 0.6868500 ## 8.286427729 0.9521037 0.3274654 0.6943627 ## 10.000000000 0.9602856 0.3197914 0.7019162 ## ## Tuning parameter 'alpha' was held constant at a value of 0 ## RMSE was used to select the optimal model using the smallest value. ## The final values used for the model were alpha = 0 and lambda = 0.05179475. 
``` --- ```r mod_lasso ``` ``` ## glmnet ## ## 31439 samples ## 280 predictor ## ## No pre-processing ## Resampling: Cross-Validated (5 fold) ## Summary of sample sizes: 25152, 25151, 25150, 25151, 25152 ## Resampling results across tuning parameters: ## ## lambda RMSE Rsquared MAE ## 0.001000000 0.8363520 0.4557473 0.5808941 ## 0.001206793 0.8367221 0.4552621 0.5812621 ## 0.001456348 0.8373006 0.4545144 0.5818254 ## 0.001757511 0.8378683 0.4537838 0.5823994 ## 0.002120951 0.8384855 0.4529970 0.5830077 ## 0.002559548 0.8393089 0.4519476 0.5837321 ## 0.003088844 0.8405302 0.4503812 0.5847438 ## 0.003727594 0.8422647 0.4481496 0.5861356 ## 0.004498433 0.8441916 0.4456619 0.5876737 ## 0.005428675 0.8461175 0.4431796 0.5890034 ## 0.006551286 0.8481192 0.4405938 0.5903564 ## 0.007906043 0.8502745 0.4378111 0.5918695 ## 0.009540955 0.8527639 0.4345716 0.5936910 ## 0.011513954 0.8548075 0.4319568 0.5952619 ## 0.013894955 0.8572912 0.4287521 0.5971848 ## 0.016768329 0.8595876 0.4258242 0.5989684 ## 0.020235896 0.8612785 0.4237738 0.6005154 ## 0.024420531 0.8624864 0.4224979 0.6017381 ## 0.029470517 0.8639204 0.4210409 0.6031626 ## 0.035564803 0.8657003 0.4192847 0.6051093 ## 0.042919343 0.8678262 0.4173131 0.6076622 ## 0.051794747 0.8703560 0.4151486 0.6109852 ## 0.062505519 0.8735505 0.4125224 0.6154668 ## 0.075431201 0.8773779 0.4097051 0.6210429 ## 0.091029818 0.8811636 0.4081723 0.6271661 ## 0.109854114 0.8859656 0.4069365 0.6348774 ## 0.132571137 0.8928400 0.4049931 0.6445753 ## 0.159985872 0.9027495 0.4016010 0.6567481 ## 0.193069773 0.9169912 0.3953602 0.6720417 ## 0.232995181 0.9373524 0.3830465 0.6913175 ## 0.281176870 0.9656148 0.3583346 0.7152676 ## 0.339322177 0.9925805 0.3498912 0.7358105 ## 0.409491506 1.0281981 0.3387926 0.7604651 ## 0.494171336 1.0755751 0.2972423 0.7912722 ## 0.596362332 1.1271288 0.2887711 0.8202405 ## 0.719685673 1.1336226 NaN 0.8238043 ## 0.868511374 1.1336226 NaN 0.8238043 ## 1.048113134 1.1336226 NaN 0.8238043 ## 1.264855217 1.1336226 NaN 0.8238043 ## 1.526417967 1.1336226 NaN 0.8238043 ## 1.842069969 1.1336226 NaN 0.8238043 ## 2.222996483 1.1336226 NaN 0.8238043 ## 2.682695795 1.1336226 NaN 0.8238043 ## 3.237457543 1.1336226 NaN 0.8238043 ## 3.906939937 1.1336226 NaN 0.8238043 ## 4.714866363 1.1336226 NaN 0.8238043 ## 5.689866029 1.1336226 NaN 0.8238043 ## 6.866488450 1.1336226 NaN 0.8238043 ## 8.286427729 1.1336226 NaN 0.8238043 ## 10.000000000 1.1336226 NaN 0.8238043 ## ## Tuning parameter 'alpha' was held constant at a value of 1 ## RMSE was used to select the optimal model using the smallest value. ## The final values used for the model were alpha = 1 and lambda = 0.001. ``` --- ```r mod_rf ``` ``` ## Random Forest ## ## 31439 samples ## 280 predictor ## ## No pre-processing ## Resampling: Cross-Validated (5 fold) ## Summary of sample sizes: 25151, 25151, 25151, 25151, 25152 ## Resampling results across tuning parameters: ## ## min.node.size mtry RMSE Rsquared MAE ## 10 10 0.6971449 0.6287011 0.4482854 ## 10 30 0.6548617 0.6675777 0.4092333 ## 10 50 0.6483229 0.6734382 0.4020972 ## 20 10 0.7028311 0.6237031 0.4534763 ## 20 30 0.6603183 0.6624895 0.4153412 ## 20 50 0.6513949 0.6706486 0.4069590 ## 30 10 0.7077821 0.6187225 0.4578572 ## 30 30 0.6643416 0.6589521 0.4199686 ## 30 50 0.6520327 0.6703291 0.4093365 ## ## Tuning parameter 'splitrule' was held constant at a value of variance ## RMSE was used to select the optimal model using the smallest value. ## The final values used for the model were mtry = 50, splitrule = variance ## and min.node.size = 10. 
``` --- ```r plot(mod_ridge) ``` <img src="images/Ridge-CV-1.png" width="80%" /> --- ```r plot(mod_lasso) ``` <img src="images/Lasso-CV-1.png" width="80%" /> --- ```r plot(mod_rf) ``` <img src="images/RF-CV-1.png" width="80%" /> --- ### Comparing Models ```r resamps <- resamples(list(Ridge = mod_ridge, Lasso = mod_lasso, RF = mod_rf)) summary(resamps) ``` ``` ## ## Call: ## summary.resamples(object = resamps) ## ## Models: Ridge, Lasso, RF ## Number of resamples: 5 ## ## MAE ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## Ridge 0.5750361 0.5893380 0.5917179 0.5902339 0.5968475 0.5982301 0 ## Lasso 0.5681938 0.5797615 0.5820657 0.5808941 0.5844702 0.5899792 0 ## RF 0.3943710 0.3977928 0.4030565 0.4020972 0.4055509 0.4097149 0 ## ## RMSE ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## Ridge 0.8332790 0.8425411 0.8447478 0.8465427 0.8465174 0.8656284 0 ## Lasso 0.8084833 0.8358103 0.8385322 0.8363520 0.8402470 0.8586870 0 ## RF 0.6364601 0.6371085 0.6489900 0.6483229 0.6562275 0.6628281 0 ## ## Rsquared ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## Ridge 0.4221219 0.4237722 0.4483357 0.4432868 0.4501113 0.4720927 0 ## Lasso 0.4396015 0.4483588 0.4517731 0.4557473 0.4693071 0.4696961 0 ## RF 0.6579006 0.6708609 0.6718423 0.6734382 0.6809223 0.6856649 0 ``` --- ```r dotplot(resamps, metric = "RMSE") ``` <img src="images/Caret-RMSE-Comp-1.png" width="100%" /> --- ### Predicting ```r pred_ridge <- predict(mod_ridge, newdata = blogs_te) pred_lasso <- predict(mod_lasso, newdata = blogs_te) pred_rf <- predict(mod_rf, newdata = blogs_te) rmse_ridge <- RMSE(pred_ridge, blogs_te$fb) rmse_lasso <- RMSE(pred_lasso, blogs_te$fb) rmse_rf <- RMSE(pred_rf, blogs_te$fb) glue("Test RMSE Ridge: {format(rmse_ridge, digits = 3)} Test RMSE Lasso: {format(rmse_lasso, digits = 3)} Test RMSE RF: {format(rmse_rf, digits = 3)}") ``` ``` ## Test RMSE Ridge: 0.846 ## Test RMSE Lasso: 0.834 ## Test RMSE RF: 0.648 ``` --- ```r bind_rows( tibble(method = "Ridge", pred = pred_ridge, true = blogs_te$fb), tibble(method = "Lasso", pred = pred_lasso, true = blogs_te$fb), tibble(method = "RF", pred = pred_rf, true = blogs_te$fb)) %>% ggplot(aes(true, pred)) + geom_point(color = "red", alpha = 0.5) + geom_abline(slope = 1, intercept = 0) + facet_wrap(~ factor(method, levels = c("Ridge", "Lasso", "RF"))) + theme_bw() ``` <img src="images/Caret-Pred-vs-True-1.png" width="100%" /> --- class: section-slide # The ~~Future~~ Present Solution: `tidymodels` #### Inspired by [Julia Silge](https://juliasilge.com/blog/intro-tidymodels/) --- ### Packages under tidymodels - `parsnip`: tidy `caret` - `dials` and `tune`: specifying and tuning model parameters - `rsample`: sampling, data partitioning - `recipes`, `embed`, `themis`: preprocessing and creating model matrices - `infer`: tidy statistics - `yardstick`: measuring model performance - `broom`: converting model output into tidy tibbles And [more](https://www.tidymodels.org/). .warning[ ⚠️ All `tidymodels` packages are under development! ] --- ### Split Data The `initial_split()` function is from the `rsample` package: ```r library(tidymodels) blogs_split_obj <- blogs_fb %>% initial_split(prop = 0.6) print(blogs_split_obj) ``` ``` ## <Training/Testing/Total> ## <31438/20959/52397> ``` ```r blogs_tr <- training(blogs_split_obj) blogs_te <- testing(blogs_split_obj) glue("train no. of rows: {nrow(blogs_tr)} test no. of rows: {nrow(blogs_te)}") ``` ``` ## train no. of rows: 31438 ## test no. of rows: 20959 ``` --- ### Preprocess The `recipe()` function is from the `recipes` package.
It allows you to specify a preprocessing pipeline you can later apply to any dataset, including multiple steps: ```r blogs_rec <- recipe(fb ~ ., data = blogs_tr) blogs_rec ``` ``` ## ``` ``` ## ── Recipe ────────────────────────────────────────────────────────────────────── ``` ``` ## ``` ``` ## ── Inputs ``` ``` ## Number of variables by role ``` ``` ## outcome: 1 ## predictor: 280 ``` --- The `recipes` package contains more preprocessing [`step_`s](https://tidymodels.github.io/recipes/reference/index.html) than you imagine: ```r blogs_rec <- blogs_rec %>% step_zv(all_numeric_predictors()) %>% step_normalize(all_numeric_predictors()) ``` --- After you have your `recipe` you need to `prep()` materials... ```r blogs_rec <- blogs_rec %>% prep(blogs_tr) blogs_rec ``` ``` ## ``` ``` ## ── Recipe ────────────────────────────────────────────────────────────────────── ``` ``` ## ``` ``` ## ── Inputs ``` ``` ## Number of variables by role ``` ``` ## outcome: 1 ## predictor: 280 ``` ``` ## ``` ``` ## ── Training information ``` ``` ## Training data contained 31438 data points and no incomplete rows. ``` ``` ## ``` ``` ## ── Operations ``` ``` ## • Zero variance filter removed: X13, X33, X38, X278 | Trained ``` ``` ## • Centering and scaling for: X1, X2, X3, X4, X5, X6, X7, X8, X9, ... | Trained ``` --- At this point our `recipe` has all necessary `sd` and `mean`s for numeric variables. ```r blogs_rec$var_info ``` ``` ## # A tibble: 281 × 4 ## variable type role source ## <chr> <list> <chr> <chr> ## 1 X1 <chr [2]> predictor original ## 2 X2 <chr [2]> predictor original ## 3 X3 <chr [2]> predictor original ## 4 X4 <chr [2]> predictor original ## 5 X5 <chr [2]> predictor original ## 6 X6 <chr [2]> predictor original ## 7 X7 <chr [2]> predictor original ## 8 X8 <chr [2]> predictor original ## 9 X9 <chr [2]> predictor original ## 10 X10 <chr [2]> predictor original ## # … with 271 more rows ``` ```r blogs_rec$var_info$type[[1]] ``` ``` ## [1] "double" "numeric" ``` --- ```r blogs_rec$steps[[2]]$means |> head() ``` ``` ## X1 X2 X3 X4 X5 X6 ## 39.6974280 47.0002493 0.3098798 341.4913162 24.9267288 15.3097213 ``` ```r blogs_rec$steps[[2]]$sds |> head() ``` ``` ## X1 X2 X3 X4 X5 X6 ## 79.095448 62.569848 3.820566 443.854810 69.354476 32.256680 ``` --- And then we `bake()` (or [`juice()`](https://tidymodels.github.io/recipes/reference/juice.html)): ```r blogs_tr2 <- blogs_rec %>% bake(blogs_tr) blogs_te2 <- blogs_rec %>% bake(blogs_te) glue("mean of comments in orig training: {format(mean(blogs_tr$X51), digits = 3)}, sd: {format(sd(blogs_tr$X51), digits = 3)} mean of comments in baked training: {format(mean(blogs_tr2$X51), digits = 1)}, sd: {format(sd(blogs_tr2$X51), digits = 3)}") ``` ``` ## mean of comments in orig training: 39.8, sd: 112 ## mean of comments in baked training: 2e-17, sd: 1 ``` ```r glue("mean of comments in orig testing: {format(mean(blogs_te$X51), digits = 3)}, sd: {format(sd(blogs_te$X51), digits = 3)} mean of comments in baked testing: {format(mean(blogs_te2$X51), digits = 1)}, sd: {format(sd(blogs_te2$X51), digits = 3)}") ``` ``` ## mean of comments in orig testing: 38.9, sd: 110 ## mean of comments in baked testing: -0.008, sd: 0.977 ``` --- Or you can do it all in a single pipe: ```r blogs_rec <- recipe(fb ~ ., data = blogs_tr) %>% step_zv(all_numeric_predictors()) %>% step_normalize(all_numeric_predictors()) %>% prep(blogs_tr) blogs_tr2 <- blogs_rec %>% bake(blogs_tr) blogs_te2 <- blogs_rec %>% bake(blogs_te) glue("mean of comments in orig training: {format(mean(blogs_tr$X51), 
digits = 3)}, sd: {format(sd(blogs_tr$X51), digits = 3)} mean of comments in baked training: {format(mean(blogs_tr2$X51), digits = 1)}, sd: {format(sd(blogs_tr2$X51), digits = 3)}") ``` ``` ## mean of comments in orig training: 39.8, sd: 112 ## mean of comments in baked training: 2e-17, sd: 1 ``` ```r glue("mean of comments in orig testing: {format(mean(blogs_te$X51), digits = 3)}, sd: {format(sd(blogs_te$X51), digits = 3)} mean of comments in baked testing: {format(mean(blogs_te2$X51), digits = 1)}, sd: {format(sd(blogs_te2$X51), digits = 3)}") ``` ``` ## mean of comments in orig testing: 38.9, sd: 110 ## mean of comments in baked testing: -0.008, sd: 0.977 ``` --- Can also `tidy()` a recipe: ```r tidy(blogs_rec) ``` ``` ## # A tibble: 2 × 6 ## number operation type trained skip id ## <int> <chr> <chr> <lgl> <lgl> <chr> ## 1 1 step zv TRUE FALSE zv_BRxUF ## 2 2 step normalize TRUE FALSE normalize_aPFe9 ``` --- #### Fast Forward 10 weeks from now... ```r rec_int_topints <- recipe(pets ~ ., data = ok_tr_interaction2) %>% step_textfeature(essays, prefix = "t", extract_functions = my_text_funs) %>% update_role(essays, new_role = "discarded") %>% step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>% step_log(income, starts_with("len_"), starts_with("t_"), -t_essays_sent_bing, offset = 1) %>% step_impute_mean(income) %>% step_other(all_nominal_predictors(), -has_role("discarded"), other = "all_else") %>% step_novel(all_nominal_predictors(), -has_role("discarded")) %>% step_impute_mode(all_nominal_predictors(), -has_role("discarded")) %>% step_dummy(all_nominal_predictors(), -has_role("discarded"), one_hot = FALSE) %>% step_interact(topint_ints) %>% step_nzv(all_numeric_predictors(), freq_cut = 99/1) %>% step_upsample(pets, over_ratio = 1, seed = 42) ``` --- ### Modeling For now let us use the original `blogs_tr` data. 
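Not sure which engines are available for a given model type? A quick check (assuming a reasonably recent `parsnip`):

```r
# list the engines parsnip knows about for linear regression models
show_engines("linear_reg")
```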
Functions `linear_reg()` and `set_engine()` are from the `parsnip` package: ```r mod_ridge_spec <- linear_reg(mixture = 0, penalty = 0.001) %>% set_engine(engine = "glmnet") mod_ridge_spec ``` ``` ## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = 0.001 ## mixture = 0 ## ## Computational engine: glmnet ``` --- ```r mod_ridge <- mod_ridge_spec %>% fit(fb ~ ., data = blogs_tr) mod_ridge ``` ``` ## parsnip model object ## ## ## Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = "gaussian", alpha = ~0) ## ## Df %Dev Lambda ## 1 276 0.00 614.20 ## 2 276 2.62 559.60 ## 3 276 2.86 509.90 ## 4 276 3.11 464.60 ## 5 276 3.38 423.30 ## 6 276 3.68 385.70 ## 7 276 4.00 351.40 ## 8 276 4.34 320.20 ## 9 276 4.70 291.80 ## 10 276 5.09 265.90 ## 11 276 5.51 242.20 ## 12 276 5.95 220.70 ## 13 276 6.42 201.10 ## 14 276 6.92 183.20 ## 15 276 7.44 167.00 ## 16 276 7.99 152.10 ## 17 276 8.58 138.60 ## 18 276 9.19 126.30 ## 19 276 9.82 115.10 ## 20 276 10.48 104.90 ## 21 276 11.16 95.54 ## 22 276 11.87 87.05 ## 23 276 12.59 79.32 ## 24 276 13.33 72.27 ## 25 276 14.09 65.85 ## 26 276 14.86 60.00 ## 27 276 15.63 54.67 ## 28 276 16.41 49.82 ## 29 276 17.20 45.39 ## 30 276 17.98 41.36 ## 31 276 18.76 37.68 ## 32 276 19.54 34.34 ## 33 276 20.31 31.29 ## 34 276 21.07 28.51 ## 35 276 21.82 25.97 ## 36 276 22.56 23.67 ## 37 276 23.28 21.56 ## 38 276 24.00 19.65 ## 39 276 24.69 17.90 ## 40 276 25.38 16.31 ## 41 276 26.05 14.86 ## 42 276 26.71 13.54 ## 43 276 27.36 12.34 ## 44 276 28.00 11.24 ## 45 276 28.63 10.24 ## 46 276 29.24 9.34 ## 47 276 29.85 8.51 ## 48 276 30.45 7.75 ## 49 276 31.04 7.06 ## 50 276 31.63 6.43 ## 51 276 32.20 5.86 ## 52 276 32.77 5.34 ## 53 276 33.33 4.87 ## 54 276 33.88 4.43 ## 55 276 34.42 4.04 ## 56 276 34.95 3.68 ## 57 276 35.47 3.36 ## 58 276 35.98 3.06 ## 59 276 36.48 2.79 ## 60 276 36.96 2.54 ## 61 276 37.44 2.31 ## 62 276 37.89 2.11 ## 63 276 38.34 1.92 ## 64 276 38.76 1.75 ## 65 276 39.17 1.59 ## 66 276 39.56 1.45 ## 67 276 39.94 1.32 ## 68 276 40.29 1.21 ## 69 276 40.64 1.10 ## 70 276 40.96 1.00 ## 71 276 41.26 0.91 ## 72 276 41.55 0.83 ## 73 276 41.82 0.76 ## 74 276 42.08 0.69 ## 75 276 42.32 0.63 ## 76 276 42.54 0.57 ## 77 276 42.76 0.52 ## 78 276 42.96 0.48 ## 79 276 43.14 0.43 ## 80 276 43.32 0.39 ## 81 276 43.49 0.36 ## 82 276 43.65 0.33 ## 83 276 43.80 0.30 ## 84 276 43.94 0.27 ## 85 276 44.08 0.25 ## 86 276 44.22 0.23 ## 87 276 44.34 0.21 ## 88 276 44.47 0.19 ## 89 276 44.59 0.17 ## 90 276 44.71 0.16 ## 91 276 44.82 0.14 ## 92 276 44.93 0.13 ## 93 276 45.04 0.12 ## 94 276 45.15 0.11 ## 95 276 45.26 0.10 ## 96 276 45.36 0.09 ## 97 276 45.46 0.08 ## 98 276 45.56 0.07 ## 99 276 45.65 0.07 ## 100 276 45.75 0.06 ``` --- In a single pipe: ```r mod_lasso <- linear_reg(mixture = 1, penalty = 0.001) %>% set_engine(engine = "glmnet") %>% fit(fb ~ ., data = blogs_tr) mod_lasso ``` ``` ## parsnip model object ## ## ## Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = "gaussian", alpha = ~1) ## ## Df %Dev Lambda ## 1 0 0.00 0.61420 ## 2 1 4.97 0.55960 ## 3 1 9.09 0.50990 ## 4 3 13.12 0.46460 ## 5 3 16.98 0.42330 ## 6 3 20.19 0.38570 ## 7 3 22.85 0.35140 ## 8 3 25.06 0.32020 ## 9 3 26.89 0.29180 ## 10 4 29.23 0.26590 ## 11 4 31.29 0.24220 ## 12 4 33.00 0.22070 ## 13 4 34.42 0.20110 ## 14 4 35.60 0.18320 ## 15 4 36.58 0.16700 ## 16 4 37.40 0.15210 ## 17 4 38.07 0.13860 ## 18 4 38.63 0.12630 ## 19 5 39.10 0.11510 ## 20 5 39.49 0.10490 ## 21 5 39.81 0.09554 ## 22 6 40.11 0.08705 ## 23 6 40.42 0.07932 ## 24 7 40.71 0.07227 ## 25 8 40.97 0.06585 ## 
26 8 41.21 0.06000 ## 27 8 41.41 0.05467 ## 28 11 41.58 0.04982 ## 29 12 41.76 0.04539 ## 30 11 41.91 0.04136 ## 31 13 42.06 0.03768 ## 32 16 42.19 0.03434 ## 33 18 42.31 0.03129 ## 34 20 42.45 0.02851 ## 35 20 42.57 0.02597 ## 36 22 42.68 0.02367 ## 37 25 42.79 0.02156 ## 38 29 42.88 0.01965 ## 39 32 42.98 0.01790 ## 40 39 43.12 0.01631 ## 41 42 43.31 0.01486 ## 42 47 43.51 0.01354 ## 43 55 43.70 0.01234 ## 44 65 43.92 0.01124 ## 45 69 44.10 0.01024 ## 46 76 44.34 0.00934 ## 47 85 44.55 0.00850 ## 48 89 44.73 0.00775 ## 49 97 44.90 0.00706 ## 50 103 45.09 0.00643 ## 51 111 45.25 0.00586 ## 52 116 45.45 0.00534 ## 53 125 45.59 0.00487 ## 54 133 45.74 0.00443 ## 55 144 45.95 0.00404 ## 56 149 46.14 0.00368 ## 57 160 46.32 0.00335 ## 58 158 46.47 0.00306 ## 59 164 46.60 0.00279 ## 60 171 46.70 0.00254 ## 61 172 46.79 0.00231 ## 62 174 46.86 0.00211 ## 63 183 46.94 0.00192 ## 64 185 47.03 0.00175 ## 65 190 47.11 0.00159 ## 66 196 47.18 0.00145 ## 67 204 47.24 0.00132 ## 68 202 47.30 0.00121 ## 69 200 47.35 0.00110 ## 70 206 47.39 0.00100 ## 71 215 47.43 0.00091 ## 72 220 47.46 0.00083 ## 73 222 47.49 0.00076 ## 74 224 47.52 0.00069 ## 75 229 47.55 0.00063 ## 76 239 47.58 0.00057 ## 77 244 47.61 0.00052 ## 78 242 47.65 0.00048 ## 79 241 47.69 0.00043 ## 80 240 47.73 0.00039 ## 81 241 47.77 0.00036 ## 82 243 47.80 0.00033 ## 83 243 47.83 0.00030 ## 84 243 47.86 0.00027 ## 85 246 47.88 0.00025 ## 86 251 47.89 0.00023 ## 87 252 47.91 0.00021 ## 88 252 47.92 0.00019 ## 89 253 47.93 0.00017 ## 90 256 47.94 0.00016 ## 91 258 47.95 0.00014 ## 92 258 47.96 0.00013 ## 93 260 47.97 0.00012 ## 94 259 47.98 0.00011 ## 95 261 47.98 0.00010 ## 96 265 47.99 0.00009 ## 97 260 47.99 0.00008 ## 98 260 47.99 0.00007 ## 99 264 47.99 0.00007 ``` --- Can also use `fit_xy()` a-la `sklearn`: ```r mod_rf <- rand_forest(mode = "regression", mtry = 50, trees = 50, min_n = 10) %>% set_engine("ranger") %>% fit_xy(x = blogs_tr[, -281], y = blogs_tr$fb) mod_rf ``` ``` ## parsnip model object ## ## Ranger result ## ## Call: ## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~50, x), num.trees = ~50, min.node.size = min_rows(~10, x), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) ## ## Type: Regression ## Number of trees: 50 ## Sample size: 31438 ## Number of independent variables: 280 ## Mtry: 50 ## Target node size: 10 ## Variable importance mode: none ## Splitrule: variance ## OOB prediction error (MSE): 0.4272363 ## R squared (OOB): 0.6687403 ``` --- Notice how easy it is to get the model's results in a tidy way using the `tidy()` function: ```r tidy(mod_ridge) ``` ``` ## # A tibble: 281 × 3 ## term estimate penalty ## <chr> <dbl> <dbl> ## 1 (Intercept) 0.686 0.001 ## 2 X1 0.00118 0.001 ## 3 X2 0.00135 0.001 ## 4 X3 0.000687 0.001 ## 5 X4 0.000333 0.001 ## 6 X5 0.0000880 0.001 ## 7 X6 0.00209 0.001 ## 8 X7 -0.00115 0.001 ## 9 X8 0.277 0.001 ## 10 X9 -0.000139 0.001 ## # … with 271 more rows ``` --- ### Predicting ```r results_test <- mod_ridge %>% predict(new_data = blogs_te, penalty = 0.001) %>% mutate( truth = blogs_te$fb, method = "Ridge" ) %>% bind_rows(mod_lasso %>% predict(new_data = blogs_te) %>% mutate( truth = blogs_te$fb, method = "Lasso" )) %>% bind_rows(mod_rf %>% predict(new_data = blogs_te) %>% mutate( truth = blogs_te$fb, method = "RF" )) dim(results_test) ``` ``` ## [1] 62877 3 ``` ```r head(results_test) ``` ``` ## # A tibble: 6 × 3 ## .pred truth method ## <dbl> <dbl> <chr> ## 1 0.186 0 Ridge ## 2 0.473 2.30 Ridge ## 3 0.663 1.39 Ridge ## 4 1.34 2.64 Ridge ## 5 0.322 0.693 
Ridge ## 6 0.698 0.693 Ridge ``` --- ### Comparing Models The package `yardstick` has tons of performance [metrics](https://tidymodels.github.io/yardstick/articles/metric-types.html): ```r results_test %>% group_by(method) %>% rmse(truth = truth, estimate = .pred) ``` ``` ## # A tibble: 3 × 4 ## method .metric .estimator .estimate ## <chr> <chr> <chr> <dbl> ## 1 Lasso rmse standard 0.836 ## 2 RF rmse standard 0.647 ## 3 Ridge rmse standard 0.852 ``` --- ```r results_test %>% ggplot(aes(truth, .pred)) + geom_point(color = "red", alpha = 0.5) + geom_abline(slope = 1, intercept = 0) + facet_wrap(~ factor(method, levels = c("Ridge", "Lasso", "RF"))) + theme_bw() ``` <img src="images/Tidymodels-Pred-vs-True-1.png" width="100%" /> --- ### Tuning Define your model spec, using `tune()` from the `tune` package for a parameter you wish to tune: ```r mod_rf_spec <- rand_forest(mode = "regression", mtry = tune(), min_n = tune(), trees = 100) %>% set_engine("ranger") ``` --- Define the `grid` on which you train your params, with the `dials` package: ```r rf_grid <- grid_regular(mtry(range(10, 70)), min_n(range(10, 30)), levels = c(4, 3)) rf_grid ``` ``` ## # A tibble: 12 × 2 ## mtry min_n ## <int> <int> ## 1 10 10 ## 2 30 10 ## 3 50 10 ## 4 70 10 ## 5 10 20 ## 6 30 20 ## 7 50 20 ## 8 70 20 ## 9 10 30 ## 10 30 30 ## 11 50 30 ## 12 70 30 ``` --- Split your data into a few folds for Cross Validation with `vfold_cv()` from the `rsample` package: ```r cv_splits <- vfold_cv(blogs_tr, v = 5) cv_splits ``` ``` ## # 5-fold cross-validation ## # A tibble: 5 × 2 ## splits id ## <list> <chr> ## 1 <split [25150/6288]> Fold1 ## 2 <split [25150/6288]> Fold2 ## 3 <split [25150/6288]> Fold3 ## 4 <split [25151/6287]> Fold4 ## 5 <split [25151/6287]> Fold5 ``` --- Now perform cross validation with `tune_grid()` from the `tune` package: ```r tune_res <- tune_grid(mod_rf_spec, recipe(fb ~ ., data = blogs_tr), resamples = cv_splits, grid = rf_grid, metrics = metric_set(rmse)) tune_res ``` ``` ## # Tuning results ## # 5-fold cross-validation ## # A tibble: 5 × 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [25150/6288]> Fold1 <tibble [12 × 6]> <tibble [0 × 1]> ## 2 <split [25150/6288]> Fold2 <tibble [12 × 6]> <tibble [0 × 1]> ## 3 <split [25150/6288]> Fold3 <tibble [12 × 6]> <tibble [0 × 1]> ## 4 <split [25151/6287]> Fold4 <tibble [12 × 6]> <tibble [0 × 1]> ## 5 <split [25151/6287]> Fold5 <tibble [12 × 6]> <tibble [0 × 1]> ``` --- ```r tune_res$.metrics[[1]] ``` ``` ## # A tibble: 12 × 6 ## mtry min_n .metric .estimator .estimate .config ## <int> <int> <chr> <chr> <dbl> <chr> ## 1 10 10 rmse standard 0.687 Preprocessor1_Model01 ## 2 30 10 rmse standard 0.649 Preprocessor1_Model02 ## 3 50 10 rmse standard 0.644 Preprocessor1_Model03 ## 4 70 10 rmse standard 0.642 Preprocessor1_Model04 ## 5 10 20 rmse standard 0.697 Preprocessor1_Model05 ## 6 30 20 rmse standard 0.655 Preprocessor1_Model06 ## 7 50 20 rmse standard 0.646 Preprocessor1_Model07 ## 8 70 20 rmse standard 0.643 Preprocessor1_Model08 ## 9 10 30 rmse standard 0.701 Preprocessor1_Model09 ## 10 30 30 rmse standard 0.657 Preprocessor1_Model10 ## 11 50 30 rmse standard 0.651 Preprocessor1_Model11 ## 12 70 30 rmse standard 0.643 Preprocessor1_Model12 ``` --- Collect the mean metric across folds: ```r estimates <- collect_metrics(tune_res) estimates ``` ``` ## # A tibble: 12 × 8 ## mtry min_n .metric .estimator mean n std_err .config ## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 10 10 rmse standard 0.698 5 0.00647 Preprocessor1_Model01 ## 
2 30 10 rmse standard 0.656 5 0.00673 Preprocessor1_Model02 ## 3 50 10 rmse standard 0.648 5 0.00569 Preprocessor1_Model03 ## 4 70 10 rmse standard 0.646 5 0.00572 Preprocessor1_Model04 ## 5 10 20 rmse standard 0.704 5 0.00668 Preprocessor1_Model05 ## 6 30 20 rmse standard 0.662 5 0.00638 Preprocessor1_Model06 ## 7 50 20 rmse standard 0.650 5 0.00527 Preprocessor1_Model07 ## 8 70 20 rmse standard 0.647 5 0.00576 Preprocessor1_Model08 ## 9 10 30 rmse standard 0.710 5 0.00740 Preprocessor1_Model09 ## 10 30 30 rmse standard 0.664 5 0.00590 Preprocessor1_Model10 ## 11 50 30 rmse standard 0.654 5 0.00526 Preprocessor1_Model11 ## 12 70 30 rmse standard 0.649 5 0.00567 Preprocessor1_Model12 ``` --- Choose best parameter: ```r estimates %>% mutate(min_n = factor(min_n)) %>% ggplot(aes(x = mtry, y = mean, color = min_n)) + geom_point() + geom_line() + labs(y = "Mean RMSE") + theme_bw() ``` <img src="images/Tidymodels-RMSE-Comp-1.png" width="100%" /> --- There are of course also methods for helping us choose best params and final model. ```r best_rmse <- tune_res %>% select_best(metric = "rmse") best_rmse ``` ``` ## # A tibble: 1 × 3 ## mtry min_n .config ## <int> <int> <chr> ## 1 70 10 Preprocessor1_Model04 ``` See also `?select_by_one_std_err`. ```r mod_rf_final <- finalize_model(mod_rf_spec, best_rmse) mod_rf_final ``` ``` ## Random Forest Model Specification (regression) ## ## Main Arguments: ## mtry = 70 ## trees = 100 ## min_n = 10 ## ## Computational engine: ranger ``` --- ```r mod_rf_final %>% fit(fb ~ ., data = blogs_tr) %>% predict(new_data = blogs_te) %>% mutate(truth = blogs_te$fb) %>% head(10) ``` ``` ## # A tibble: 10 × 2 ## .pred truth ## <dbl> <dbl> ## 1 0.584 0.693 ## 2 0.584 0.693 ## 3 0.238 0 ## 4 1.79 2.30 ## 5 1.79 2.30 ## 6 0.145 0 ## 7 3.33 1.10 ## 8 0.719 1.39 ## 9 2.40 3.09 ## 10 1.14 0.693 ``` --- ### Workflow As we shall see, this manual approach won't scale, is prone to bugs and will not play nicely with other modeling components: ```r results_test <- mod_ridge %>% predict(new_data = blogs_te, penalty = 0.001) %>% mutate( truth = blogs_te$fb, method = "Ridge" ) %>% bind_rows(mod_lasso %>% predict(new_data = blogs_te) %>% mutate( truth = blogs_te$fb, method = "Lasso" )) %>% bind_rows(mod_rf %>% predict(new_data = blogs_te) %>% mutate( truth = blogs_te$fb, method = "RF" )) ``` --- Similar to sklearn's `Pipeline()` class, we need a `workflow()`: to bundle together your pre-processing, modeling, and post-processing requests. 
```r mod_rf <- rand_forest(mode = "regression", mtry = 70, trees = 100, min_n = 10) %>% set_engine("ranger") blogs_rec <- recipe(fb ~ ., data = blogs_tr) %>% step_zv(all_numeric_predictors()) %>% step_normalize(all_numeric_predictors()) wflow_rf <- workflow() %>% add_recipe(blogs_rec) %>% add_model(mod_rf) ``` --- ```r wflow_rf ``` ``` ## ══ Workflow ════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: rand_forest() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 2 Recipe Steps ## ## • step_zv() ## • step_normalize() ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## Random Forest Model Specification (regression) ## ## Main Arguments: ## mtry = 70 ## trees = 100 ## min_n = 10 ## ## Computational engine: ranger ``` --- Calling `fit()` will `prep()` the recipe **and** `fit()` the model: ```r fit_rf <- fit(wflow_rf, blogs_tr) ``` It is still a `workflow()` object: ```r fit_rf ``` ``` ## ══ Workflow [trained] ══════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: rand_forest() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 2 Recipe Steps ## ## • step_zv() ## • step_normalize() ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## Ranger result ## ## Call: ## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~70, x), num.trees = ~100, min.node.size = min_rows(~10, x), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) ## ## Type: Regression ## Number of trees: 100 ## Sample size: 31438 ## Number of independent variables: 276 ## Mtry: 70 ## Target node size: 10 ## Variable importance mode: none ## Splitrule: variance ## OOB prediction error (MSE): 0.4176772 ## R squared (OOB): 0.677417 ``` --- Can extract the prepped recipe: ```r fit_rf %>% extract_recipe() ``` ``` ## ``` ``` ## ── Recipe ────────────────────────────────────────────────────────────────────── ``` ``` ## ``` ``` ## ── Inputs ``` ``` ## Number of variables by role ``` ``` ## outcome: 1 ## predictor: 280 ``` ``` ## ``` ``` ## ── Training information ``` ``` ## Training data contained 31438 data points and no incomplete rows. ``` ``` ## ``` ``` ## ── Operations ``` ``` ## • Zero variance filter removed: X13, X33, X38, X278 | Trained ``` ``` ## • Centering and scaling for: X1, X2, X3, X4, X5, X6, X7, X8, X9, ... 
| Trained ``` --- Can extract the actual parsnip model: ```r fit_rf %>% extract_fit_parsnip() ``` ``` ## parsnip model object ## ## Ranger result ## ## Call: ## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~70, x), num.trees = ~100, min.node.size = min_rows(~10, x), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) ## ## Type: Regression ## Number of trees: 100 ## Sample size: 31438 ## Number of independent variables: 276 ## Mtry: 70 ## Target node size: 10 ## Variable importance mode: none ## Splitrule: variance ## OOB prediction error (MSE): 0.4176772 ## R squared (OOB): 0.677417 ``` --- Calling `predict()` with a new dataset will `bake()` the new dataset using the prepped recipe, and `predict()` the trained model: ```r res_rf <- predict(fit_rf, blogs_te) res_rf %>% head(10) ``` ``` ## # A tibble: 10 × 1 ## .pred ## <dbl> ## 1 0.0928 ## 2 2.16 ## 3 0.840 ## 4 2.48 ## 5 0.702 ## 6 1.09 ## 7 1.38 ## 8 1.38 ## 9 0.899 ## 10 0.466 ``` --- Calling `fit_resamples()` or `tune_grid()` works with a `vfold_cv()` splits object: ```r cv_splits <- vfold_cv(blogs_tr, v = 5) fit_rf_cv <- fit_resamples(wflow_rf, cv_splits, metrics = metric_set(rmse)) fit_rf_cv ``` ``` ## # Resampling results ## # 5-fold cross-validation ## # A tibble: 5 × 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [25150/6288]> Fold1 <tibble [1 × 4]> <tibble [0 × 3]> ## 2 <split [25150/6288]> Fold2 <tibble [1 × 4]> <tibble [0 × 3]> ## 3 <split [25150/6288]> Fold3 <tibble [1 × 4]> <tibble [0 × 3]> ## 4 <split [25151/6287]> Fold4 <tibble [1 × 4]> <tibble [0 × 3]> ## 5 <split [25151/6287]> Fold5 <tibble [1 × 4]> <tibble [0 × 3]> ``` Can use `collect_metrics()` etc. --- But the real advantage of working with `workflow()`s is when comparing different recipes `\(\times\)` different models, via `workflow_set()`: ```r mod_ridge <- linear_reg(mixture = 0, penalty = 0.001) %>% set_engine(engine = "glmnet") mod_lasso <- linear_reg(mixture = 1, penalty = 0.001) %>% set_engine(engine = "glmnet") mod_list <- list(RF = mod_rf, Ridge = mod_ridge, Lasso = mod_lasso) rec_list <- list(basic = blogs_rec) # can add more.. wset <- workflow_set(rec_list, mod_list) # checkout the cross arg wset ``` ``` ## # A workflow set/tibble: 3 × 4 ## wflow_id info option result ## <chr> <list> <list> <list> ## 1 basic_RF <tibble [1 × 4]> <opts[0]> <list [0]> ## 2 basic_Ridge <tibble [1 × 4]> <opts[0]> <list [0]> ## 3 basic_Lasso <tibble [1 × 4]> <opts[0]> <list [0]> ``` --- Now use a `purrr`-like mapping function to fit all on all resamples: ```r wset <- wset %>% workflow_map("fit_resamples", resamples = cv_splits) ``` ``` ## # A workflow set/tibble: 3 × 4 ## wflow_id info option result ## <chr> <list> <list> <list> ## 1 basic_RF <tibble [1 × 4]> <opts[1]> <rsmp[+]> ## 2 basic_Ridge <tibble [1 × 4]> <opts[1]> <rsmp[+]> ## 3 basic_Lasso <tibble [1 × 4]> <opts[1]> <rsmp[+]> ``` And use e.g. `collect_metrics()` to select the best recipe/model combination. --- ### Book (WIP?) <img src="images/tmwr.png" style="width: 45%" /> https://www.tmwr.org/ --- class: section-slide # `infer`: Tidy Statistics --- ### Statistical Q1 Is there a relation between posts published on Sundays and blogger hand dominance, where hand dominance is totally made up? 
😬 ```r pub_vs_hand <- blogs_fb %>% select(sunday, hand) %>% table() pub_vs_hand ``` ``` ## hand ## sunday left right ## NS 4780 42958 ## S 493 4166 ``` ```r prop.table(pub_vs_hand, margin = 1) ``` ``` ## hand ## sunday left right ## NS 0.1001299 0.8998701 ## S 0.1058167 0.8941833 ``` --- ### Statistical Q2 Is there a difference in feedback between posts published on Sundays and posts published on another day? ```r blogs_fb %>% group_by(sunday) %>% summarise(avg = mean(fb), sd = sd(fb), n = n()) ``` ``` ## # A tibble: 2 × 4 ## sunday avg sd n ## <fct> <dbl> <dbl> <int> ## 1 NS 0.645 1.13 47738 ## 2 S 0.605 1.12 4659 ``` --- ### Same Problem! Varied interface, varied output. ```r prop.test(pub_vs_hand[,1], rowSums(pub_vs_hand)) ``` ``` ## ## 2-sample test for equality of proportions with continuity correction ## ## data: pub_vs_hand[, 1] out of rowSums(pub_vs_hand) ## X-squared = 1.4545, df = 1, p-value = 0.2278 ## alternative hypothesis: two.sided ## 95 percent confidence interval: ## -0.015038615 0.003664968 ## sample estimates: ## prop 1 prop 2 ## 0.1001299 0.1058167 ``` --- ```r t.test(fb ~ sunday, data = blogs_fb) ``` ``` ## ## Welch Two Sample t-test ## ## data: fb by sunday ## t = 2.3606, df = 5638.9, p-value = 0.01828 ## alternative hypothesis: true difference in means between group NS and group S is not equal to 0 ## 95 percent confidence interval: ## 0.006863705 0.074103874 ## sample estimates: ## mean in group NS mean in group S ## 0.6454439 0.6049601 ``` --- ### The `generics::tidy()` Approach (Also available when you load several other packages, like `broom` and `yardstick`) ```r tidy(prop.test(pub_vs_hand[,1], rowSums(pub_vs_hand))) ``` ``` ## # A tibble: 1 × 9 ## estimate1 estimate2 statistic p.value parameter conf.low conf.high method ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 0.100 0.106 1.45 0.228 1 -0.0150 0.00366 2-sample t… ## # … with 1 more variable: alternative <chr> ``` ```r tidy(t.test(fb ~ sunday, data = blogs_fb)) ``` ``` ## # A tibble: 1 × 10 ## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.0405 0.645 0.605 2.36 0.0183 5639. 
0.00686 0.0741 ## # … with 2 more variables: method <chr>, alternative <chr> ``` --- ### The `infer` Approach > infer implements an expressive grammar to perform statistical inference that coheres with the tidyverse design framework Four main verbs for a typical flow: * `specify()` - dependent/independent variables, formula * `hypothesize()` - declare the null hypothesis * `generate()` - generate data reflecting the null hypothesis (the permutation/bootstrap approach) * `calculate()` - calculate a distribution of statistics from the generated data, from which you can draw conclusions (based on a p-value, for example) --- ### `infer` Diff in Proportions Test Get the observed statistic (manually here, so as not to confuse you; there *is* a way via `infer`): ```r pub_vs_hand ``` ``` ## hand ## sunday left right ## NS 4780 42958 ## S 493 4166 ``` ```r p_NS <- pub_vs_hand[1, 1] / (sum(pub_vs_hand[1, ])) p_S <- pub_vs_hand[2, 1] / (sum(pub_vs_hand[2, ])) obs_diff <- p_NS - p_S obs_diff ``` ``` ## [1] -0.005686823 ``` --- Get the distribution of the difference in proportions under the null hypothesis: ```r diff_null_perm <- blogs_fb %>% specify(hand ~ sunday, success = "left") %>% hypothesize(null = "independence") %>% generate(reps = 200, type = "permute") %>% calculate(stat = "diff in props", order = c("NS", "S")) diff_null_perm ``` ``` ## Response: hand (factor) ## Explanatory: sunday (factor) ## Null Hypothesis: independence ## # A tibble: 200 × 2 ## replicate stat ## <int> <dbl> ## 1 1 -0.00451 ## 2 2 -0.00569 ## 3 3 -0.00498 ## 4 4 0.00751 ## 5 5 0.00185 ## 6 6 -0.0000328 ## 7 7 0.00963 ## 8 8 -0.00427 ## 9 9 0.000203 ## 10 10 -0.00592 ## # … with 190 more rows ``` --- Visualize the permuted null distribution of the difference and the p-value: ```r visualize(diff_null_perm) + shade_p_value(obs_stat = obs_diff, direction = "two_sided") ``` <img src="images/Diff-in-Props-Null-1.png" width="50%" /> --- Get the actual p-value: ```r diff_null_perm %>% get_p_value(obs_stat = obs_diff, direction = "two_sided") ``` ``` ## # A tibble: 1 × 1 ## p_value ## <dbl> ## 1 0.3 ``` --- ### `infer` t Test (independent samples) Get the observed statistic (here via `infer`): ```r obs_t <- blogs_fb %>% specify(fb ~ sunday) %>% calculate(stat = "t", order = c("NS", "S")) obs_t ``` ``` ## Response: fb (numeric) ## Explanatory: sunday (factor) ## # A tibble: 1 × 1 ## stat ## <dbl> ## 1 2.36 ``` --- Get the distribution of the t statistic under the null hypothesis: ```r t_null_perm <- blogs_fb %>% specify(fb ~ sunday) %>% hypothesize(null = "independence") %>% generate(reps = 100, type = "permute") %>% calculate(stat = "t", order = c("NS", "S")) t_null_perm ``` ``` ## Response: fb (numeric) ## Explanatory: sunday (factor) ## Null Hypothesis: independence ## # A tibble: 100 × 2 ## replicate stat ## <int> <dbl> ## 1 1 1.25 ## 2 2 -0.838 ## 3 3 1.01 ## 4 4 -0.435 ## 5 5 0.895 ## 6 6 -0.293 ## 7 7 0.241 ## 8 8 -0.356 ## 9 9 1.50 ## 10 10 0.295 ## # … with 90 more rows ``` --- Visualize the permuted null distribution of the t statistic and the two-sided p-value: ```r visualize(t_null_perm) + shade_p_value(obs_stat = obs_t, direction = "two_sided") ``` <img src="images/T-Null-1.png" width="50%" /> --- Get the actual p-value: ```r t_null_perm %>% get_p_value(obs_stat = obs_t, direction = "two_sided") ``` <pre style="color: red;"><code>## Warning: Please be cautious in reporting a p-value of 0. This result is an ## approximation based on the number of `reps` chosen in the `generate()` step. ## See `?get_p_value()` for more information.
</code></pre> ``` ## # A tibble: 1 × 1 ## p_value ## <dbl> ## 1 0 ```
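---

The warning is telling us that with `reps = 100` permutations the finest non-zero p-value we can resolve is 1/100, so an observed statistic more extreme than every permuted statistic gets reported as 0. A minimal sketch of one remedy, not part of the original flow (the object name `t_null_perm_1000` and `reps = 1000` are arbitrary choices here), is simply to generate more permutations, and report e.g. p < 0.001 if the estimate is still 0:

```r
# Same permutation flow as above, only with more reps:
# more reps give finer p-value resolution, at the cost of runtime.
t_null_perm_1000 <- blogs_fb %>%
  specify(fb ~ sunday) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("NS", "S"))

t_null_perm_1000 %>%
  get_p_value(obs_stat = obs_t, direction = "two_sided")
```

`infer` can also compare the observed statistic against a theoretical null distribution rather than a permuted one; see `?get_p_value()` and the package documentation.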