Time Series Intro

Applications of Data Science - Class 20

Giora Simchoni

`gsimchoni@gmail.com` and add `#dsapps` in subject

Stat. and OR Department, TAU

2023-04-07

What is a Time Series?

A time series is a sequence of observations taken sequentially in time.

Code

library(tidyverse)
library(tsibble)
library(tsibbledata)
library(feasts)

aus_livestock |>
  filter(Animal == "Calves") |>
  index_by() |>
  summarise(Count = sum(Count)) |>
  autoplot(log(Count)) +
  labs(x = "", y = "log no. of calves slaughtered in Australia") +
  scale_x_yearmonth(date_breaks = "5 year", date_labels = "%Y") +
  theme_light()

TSA Focuses on:

Discrete measurements
Equally-spaced
Usually no missing values
Not that short (say > 20? 50?)
Time is intrinsic, it is not just another feature
Could involve time-varying/non-time-varying explaining features

E.g. this is more of a longitudinal/growth-curve/cohort dataset:

TSA Goals

Forecasting (prediction)
Monitoring: outlier detection, alert systems
Describing: the TS dynamics, input features contribution
Planning: control schemes, intervention analyses, sensitivity analyses
Clustering: with multivariate TS

Detour: Dates in R

The `Date` class

date_obj <- as.Date("1915-6-16")

class(date_obj)

[1] "Date"

Internally, Date objects are stored as the number of days since January 1, 1970, using negative numbers for earlier dates.

typeof(date_obj)

[1] "double"

as.numeric(date_obj)

[1] -19923

So this naturally works:

date_obj + 10

[1] "1915-06-26"

range(date_obj + 0:10)

[1] "1915-06-16" "1915-06-26"

Can also accept different formats and has a few built-in functions:

date_obj <- as.Date("1/15/2001",format="%m/%d/%Y")
months(date_obj)

[1] "January"

The `POSIX` classes

POSIX is a slightly more evolved class from UNIX, holding number of seconds since January 1, 1970, and a time zone may be specified:

now <- Sys.time()

now

[1] "2023-04-07 14:04:13 IDT"

class(now)

[1] "POSIXct" "POSIXt"

as.numeric(now)

[1] 1680865453

The POSIXlt class will store time in a list with useful elements:

now <- as.POSIXlt(now, tz = "GMT")
now$hour

[1] 11

The `lubridate` package

Parsing dates:

today()

[1] "2023-04-07"

ymd("2023-01-31")

[1] "2023-01-31"

dmy("31-Jan-2023")

[1] "2023-01-31"

ymd(20170131)

[1] "2017-01-31"

Making dates:

make_date(year = 2013, month = 2, day = 12)

[1] "2013-02-12"

Converting between formats:

as_datetime(today())

[1] "2023-04-07 UTC"

Getting components:

datetime <- ymd_hms("2016-07-08 12:34:56")

year(datetime)

[1] 2016

mday(datetime)

[1] 8

yday(datetime)

[1] 190

wday(datetime)

[1] 6

Duration, time maths:

today() - ymd("2020-01-01") #base R difftime object

Time difference of 1192 days

as.duration(today() - ymd("2020-01-01"))

[1] "102988800s (~3.26 years)"

today() - dyears(1)

[1] "2022-04-06 18:00:00 UTC"

today() + ddays(10)

[1] "2023-04-17"

The Tidyverts

A suite of packages for TSA, the tidy way, led by Rob J. Hyndman:

tsibble: The ~~data.frame~~ tibble re-imagined for temporal data
tsibbledata: TS datasets
feasts: Feature extraction for TS + Some useful gg plots
fable and fabletools: The modeling and forecasting package
more and more to come.

See tidyverts.org.

`tsibble`

Need a time index:

library(tsibble)

tsibble(
  date = as.Date("2017-01-01") + 0:9,
  value = rnorm(10)
)

# A tsibble: 10 x 2 [1D]
   date        value
   <date>      <dbl>
 1 2017-01-01  1.60 
 2 2017-01-02  1.07 
 3 2017-01-03 -0.329
 4 2017-01-04 -0.753
 5 2017-01-05 -1.28 
 6 2017-01-06 -0.285
 7 2017-01-07  0.289
 8 2017-01-08 -0.342
 9 2017-01-09 -0.720
10 2017-01-10  0.696

Here, as date is the only Date column, tsibble gets this.

[1D] is the tsibble’s interval.

Grouping variable(s) are specified with key:

tsibble(
  qtr = rep(yearquarter("2010 Q1") + 0:9, 3),
  group = rep(c("x", "y", "z"), each = 10),
  value = rnorm(30),
  key = group
)

# A tsibble: 30 x 3 [1Q]
# Key:       group [3]
       qtr group   value
     <qtr> <chr>   <dbl>
 1 2010 Q1 x     -2.19  
 2 2010 Q2 x      0.266 
 3 2010 Q3 x      0.880 
 4 2010 Q4 x     -0.0491
 5 2011 Q1 x      1.55  
 6 2011 Q2 x      0.390 
 7 2011 Q3 x     -0.331 
 8 2011 Q4 x     -1.01  
 9 2012 Q1 x      0.345 
10 2012 Q2 x      0.119 
# … with 20 more rows

Here there are 54 TS combinations:

aus_livestock

# A tsibble: 29,364 x 4 [1M]
# Key:       Animal, State [54]
      Month Animal                     State                        Count
      <mth> <fct>                      <fct>                        <dbl>
 1 1976 Jul Bulls, bullocks and steers Australian Capital Territory  2300
 2 1976 Aug Bulls, bullocks and steers Australian Capital Territory  2100
 3 1976 Sep Bulls, bullocks and steers Australian Capital Territory  2100
 4 1976 Oct Bulls, bullocks and steers Australian Capital Territory  1900
 5 1976 Nov Bulls, bullocks and steers Australian Capital Territory  2100
 6 1976 Dec Bulls, bullocks and steers Australian Capital Territory  1800
 7 1977 Jan Bulls, bullocks and steers Australian Capital Territory  1800
 8 1977 Feb Bulls, bullocks and steers Australian Capital Territory  1900
 9 1977 Mar Bulls, bullocks and steers Australian Capital Territory  2700
10 1977 Apr Bulls, bullocks and steers Australian Capital Territory  2300
# … with 29,354 more rows

Wrangling similar to the Tidyverse, only group_by() + summarise() on a time index would be index_by() + summarise():

aus_calves <- aus_livestock |>
  filter(Animal == "Calves") |>
  index_by() |>
  summarise(count = sum(Count), log_count = log(count))

aus_calves

# A tsibble: 558 x 3 [1M]
      Month  count log_count
      <mth>  <dbl>     <dbl>
 1 1972 Jul 196200      12.2
 2 1972 Aug 250600      12.4
 3 1972 Sep 170100      12.0
 4 1972 Oct 128500      11.8
 5 1972 Nov 103400      11.5
 6 1972 Dec  69200      11.1
 7 1973 Jan  87800      11.4
 8 1973 Feb  71300      11.2
 9 1973 Mar  78200      11.3
10 1973 Apr  62600      11.0
# … with 548 more rows

`feasts`

Seamless integration with ggplot2 and friends, use autoplot() from feasts:

aus_calves |>
  autoplot(log_count) +
  labs(x = "", y = "log no. of calves slaughtered in Australia") +
  scale_x_yearmonth(date_breaks = "5 year", date_labels = "%Y") +
  theme_light()

Some more useful plots:

library(feasts)

aus_calves |>
  gg_season(log_count) +
  labs(x = "", y = "log no. of calves slaughtered in Australia") +
  theme_light()

aus_calves |>
  gg_subseries(log_count) +
  labs(x = "", y = "log no. of calves slaughtered in Australia") +
  scale_x_yearmonth(labels = NULL) +
  theme_light()

aus_calves |>
  gg_lag(log_count, geom = "point") +
  labs(x = "lag(log(count))", y = "log no. of calves slaughtered in Australia") +
  scale_x_yearmonth(labels = NULL) +
  theme_light()

Some useful features for multiple series:

aus_livestock |>
  features(log(Count), quantile)

# A tibble: 54 × 7
   Animal                     State            `0%`   `25%`   `50%` `75%` `100%`
   <fct>                      <fct>           <dbl>   <dbl>   <dbl> <dbl>  <dbl>
 1 Bulls, bullocks and steers Australian C… -Inf    -Inf    -Inf     7.60   8.32
 2 Bulls, bullocks and steers New South Wa…   10.8    11.1    11.3  11.4   11.9 
 3 Bulls, bullocks and steers Northern Ter… -Inf    -Inf       4.61  7.85   9.74
 4 Bulls, bullocks and steers Queensland      10.5    11.7    11.9  12.1   12.4 
 5 Bulls, bullocks and steers South Austra…    8.59    9.59    9.80  9.96  10.4 
 6 Bulls, bullocks and steers Tasmania         8.16    8.75    8.95  9.15   9.67
 7 Bulls, bullocks and steers Victoria        10.4    10.8    11.0  11.1   11.8 
 8 Bulls, bullocks and steers Western Aust…    8.97    9.62    9.83 10.1   10.7 
 9 Calves                     Australian C… -Inf    -Inf    -Inf     4.61   7.17
10 Calves                     New South Wa…    7.60    9.62    9.83 10.0   11.0 
# … with 44 more rows

aus_livestock |>
  group_by(Animal) |>
  summarise(total_count = sum(Count)) |>
  features(log(total_count), feat_stl) |>
  select(Animal, trend_strength, spikiness, linearity)

# A tibble: 7 × 4
  Animal                     trend_strength spikiness linearity
  <fct>                               <dbl>     <dbl>     <dbl>
1 Bulls, bullocks and steers          0.727  8.61e-11    0.0747
2 Calves                              0.945  6.94e-10   -8.02  
3 Cattle (excl. calves)               0.818  9.96e-11    0.656 
4 Cows and heifers                    0.881  1.56e-10    0.955 
5 Lambs                               0.834  1.34e-10    2.35  
6 Pigs                                0.930  1.42e-11    2.14  
7 Sheep                               0.931  6.26e-10   -4.38

TS Decomposition

Transformations

Why did we need log? What would occur without it?

What other transformations are worth considering? Hint: “total monthly sales”

What is the 1st conclusion about no. of calves slaughtered in AUS? What is the 2nd?

Classical Decomposition

Let \(Y_1, \dots, Y_n\) be our TS, then: \[ Y_t = S_t + T_t + e_t \]

where \(S_t\) is the seasonal component, \(T_t\) is the trend, \(e_t\) is the remainder, all at time \(t\).

Necessary but not sufficient condition: Make \(e_t\) “white noise”, i.e. \(e_t \stackrel{iid}\sim \mathcal{N}(0, \sigma^2)\)
Decomposition can also be multiplicative:

\[ Y_t = S_t \times T_t \times e_t \]

but then we’d take \(\log(Y_t)\).

To make a long story short:

Moving Averages

\(m\)-MA suitable for an odd seasonality period \(m = 2k + 1\) (e.g. a week):

\[ \hat{T}_t = \frac{1}{m}\sum_{j = -k}^{j = k} Y_{t + j} \]

aus_calves |>
  mutate(
    `7-MA` = slider::slide_dbl(log_count, mean,
                .before = 3, .after = 3, .complete = TRUE)
  )

# A tsibble: 558 x 4 [1M]
      Month  count log_count `7-MA`
      <mth>  <dbl>     <dbl>  <dbl>
 1 1972 Jul 196200      12.2   NA  
 2 1972 Aug 250600      12.4   NA  
 3 1972 Sep 170100      12.0   NA  
 4 1972 Oct 128500      11.8   11.8
 5 1972 Nov 103400      11.5   11.6
 6 1972 Dec  69200      11.1   11.5
 7 1973 Jan  87800      11.4   11.3
 8 1973 Feb  71300      11.2   11.3
 9 1973 Mar  78200      11.3   11.3
10 1973 Apr  62600      11.0   11.5
# … with 548 more rows

Moving Averages of Moving Averages

\(2 \times m\)-MA suitable for an even seasonality period \(m = 2k\) (e.g. quarter):

\[\begin{split} Q_t = \frac{1}{m}\sum_{j = -(k-1)}^{j = k} Y_{t + j} \\ \hat{T}_t = \frac{1}{2} Q_{t-1} + \frac{1}{2} Q_{t} \end{split}\]

E.g. for \(m = 4\) (quarter):

\[\begin{split} \hat{T}_t &= \frac{1}{2} \left[\frac{1}{4}(Y_{t-2} + Y_{t-1} + Y_{t} + Y_{t+1}) + \frac{1}{4}(Y_{t-1} + Y_{t} + Y_{t + 1} + Y_{t+2})\right] \\ &= \frac{1}{8}Y_{t_2} + \frac{1}{4}Y_{t_1} + \frac{1}{4}Y_{t} + \frac{1}{4}Y_{t+1} + \frac{1}{8}Y_{t+2} \end{split}\]

Which makes this…

aus_calves_with_12ma <- aus_calves |>
mutate(
    `12-MA` = slider::slide_dbl(log_count, mean,
                .before = 6, .after = 5, .complete = TRUE),
    `2x12-MA` = slider::slide_dbl(`12-MA`, mean,
                .before = 1, .after = 0, .complete = TRUE)
  )

aus_calves_with_12ma

# A tsibble: 558 x 5 [1M]
      Month  count log_count `12-MA` `2x12-MA`
      <mth>  <dbl>     <dbl>   <dbl>     <dbl>
 1 1972 Jul 196200      12.2    NA        NA  
 2 1972 Aug 250600      12.4    NA        NA  
 3 1972 Sep 170100      12.0    NA        NA  
 4 1972 Oct 128500      11.8    NA        NA  
 5 1972 Nov 103400      11.5    NA        NA  
 6 1972 Dec  69200      11.1    NA        NA  
 7 1973 Jan  87800      11.4    11.6      NA  
 8 1973 Feb  71300      11.2    11.6      11.6
 9 1973 Mar  78200      11.3    11.6      11.6
10 1973 Apr  62600      11.0    11.6      11.6
# … with 548 more rows

(mean(aus_calves$log_count[1:12]) + mean(aus_calves$log_count[2:13])) / 2

[1] 11.61403

aus_calves_with_12ma$`2x12-MA`[8]

[1] 11.61403

aus_calves_with_12ma |>
  autoplot(log_count, colour = "gray") +
  geom_line(aes(y = `2x12-MA`), colour = "#D55E00") +
  labs(x = "", y = "log no. of calves slaughtered in Australia") +
  scale_x_yearmonth(date_breaks = "5 year", date_labels = "%Y") +
  theme_light()

Back to Classical Decomposition

From Hyndman & Athanasopoulos, 2021:

If \(m\) is an even number, compute the trend component \(\hat{T}_t\) using a \(2 \times m\)-MA. If \(m\) is an odd number, compute the trend component \(\hat{T}_t\) using a \(m\)-MA.
Calculate the detrended series: \(Y_t - \hat{T}_t\).
To estimate the seasonal component for each season, simply average the detrended values for that season. These seasonal component values are then adjusted to ensure that they add to zero. The seasonal component is obtained by stringing together these monthly values, and then replicating the sequence for each year of data. This gives \(\hat{S}_t\).
The remainder component is calculated by subtracting the estimated seasonal and trend-cycle components: \(\hat{e}_t = Y_t - \hat{T}_t - \hat{S}_t\)

Criticize this process!

In fable:

aus_calves |>
  model(classical_decomposition(log_count)) |>
  components() |>
  autoplot()

STL Decomposition

STL: Seasonal and Trend decomposition using LOESS, from Clevelnad et al., 1990.

aus_calves |>
  model(STL(log_count)) |>
  components() |>
  autoplot() +
  labs(x = "", y = "log no. of calves slaughtered in Australia", title = "") +
  scale_x_yearmonth(date_breaks = "5 year", date_labels = "%Y") +
  theme_light()

ACF

(Sample) auto-correlation function at lag \(k\) is a huge deal in TSA!

\[ r_k = \frac{\sum_{t = k+1}^{n}(Y_t - \bar{Y})(Y_{t-k} - \bar{Y})}{\sum_{t = 1}^{n} (Y_t - \bar{Y})^2} \]

aus_calves |>
  ACF(log_count, lag_max = 10)

# A tsibble: 10 x 2 [1M]
        lag      acf
   <cf_lag>    <dbl>
 1       1M  0.844  
 2       2M  0.558  
 3       3M  0.257  
 4       4M  0.0172 
 5       5M -0.110  
 6       6M -0.160  
 7       7M -0.121  
 8       8M  0.00978
 9       9M  0.240  
10      10M  0.529

aus_calves |>
  ACF(log_count, lag_max = 40) |>
  autoplot() +
  theme_light()

Where does the CI come from?

Residuals

What do we like to see in residuals?

Uncorrelated
Zero mean

Good to have:

Homoscedasticity
Gaussian

Actually we already wrote this: \(e_t \stackrel{iid}\sim \mathcal{N}(0, \sigma^2)\)

All your knowledge from linear regression comes into play! E.g. what if the mean isn’t zero?

Residuals Diagnostics Viz

Feasts has gg_tsresiduals() for a quick look given a mable:

aus_calves |>
  model(STL(log_count)) |>
  gg_tsresiduals()

What do you expect to see in each of these?

“Stadardized” Residuals Diagnostics Viz

Getting residuals from a fable model (a.k.a mable) with augment():

aus_calves_aug <- aus_calves |>
  model(STL(log_count)) |>
  augment()

aus_calves_aug

# A tsibble: 558 x 6 [1M]
# Key:       .model [1]
   .model            Month log_count .fitted  .resid  .innov
   <chr>             <mth>     <dbl>   <dbl>   <dbl>   <dbl>
 1 STL(log_count) 1972 Jul      12.2    12.3 -0.149  -0.149 
 2 STL(log_count) 1972 Aug      12.4    12.5 -0.0248 -0.0248
 3 STL(log_count) 1972 Sep      12.0    12.1 -0.0772 -0.0772
 4 STL(log_count) 1972 Oct      11.8    11.8 -0.0236 -0.0236
 5 STL(log_count) 1972 Nov      11.5    11.4  0.116   0.116 
 6 STL(log_count) 1972 Dec      11.1    11.1  0.0814  0.0814
 7 STL(log_count) 1973 Jan      11.4    11.2  0.226   0.226 
 8 STL(log_count) 1973 Feb      11.2    11.1  0.0646  0.0646
 9 STL(log_count) 1973 Mar      11.3    11.2  0.0247  0.0247
10 STL(log_count) 1973 Apr      11.0    11.3 -0.301  -0.301 
# … with 548 more rows

White noise standardized residuals should behave like \(\mathcal{N}(0, 1)\), so:

resid <- aus_calves_aug$.innov
n <- dim(aus_calves)[1]
sig <- sqrt(sum((resid - mean(resid))**2)/n) # very rough!

aus_calves_aug$st_resid <- resid / sig

aus_calves_aug |>
  ggplot(aes(Month, st_resid)) +
  geom_point() +
  geom_line() +
  labs(x = "", y = "Standardized Residuals") +
  scale_x_yearmonth(date_breaks = "5 year", date_labels = "%Y") +
  theme_bw()

aus_calves_aug |>
  ggplot(aes(x = st_resid)) +
  geom_histogram() +
  labs(x = "Standardized Residuals", y = "") +
  theme_bw()

aus_calves_aug |>
  ggplot(aes(sample = st_resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical Quantiles", y = "Residuals Quantiles") +
  theme_bw()

aus_calves_aug |>
  ggplot(aes(.fitted, st_resid)) +
  geom_point() +
  labs(x = "Fitted", y = "Standardized Residuals") +
  theme_bw()

acf_obj <- acf(aus_calves_aug$st_resid, plot = FALSE)

acf_df <- with(acf_obj, tibble(lag = lag[,,1], acf = acf[,,1])) |>
  filter(lag > 0)

acf_df |>
  ggplot(aes(lag, acf)) +
  geom_hline(aes(yintercept = 0)) +
  geom_hline(aes(yintercept = -1/sqrt(n)), color = "blue", linetype = 2) +
  geom_hline(aes(yintercept = 1/sqrt(n)), color = "blue", linetype = 2) +
  geom_segment(aes(xend = lag, yend = 0)) +
  theme_bw()

Residuals Statistical Tests

Normality - Shapiro-Wilk Test:

shapiro.test(aus_calves_aug$st_resid)


    Shapiro-Wilk normality test

data:  aus_calves_aug$st_resid
W = 0.99429, p-value = 0.03469

TS independence - Runs Test:

TSA::runs(aus_calves_aug$st_resid)

$pvalue
[1] 0.223

$observed.runs
[1] 265

$expected.runs
[1] 279.871

$n1
[1] 273

$n2
[1] 285

$k
[1] 0

ACF - Ljung-Box Test:

\[ Q^* = n(n+2)\sum_{k = 1}^l (n - k)^{-1}r^2_k \sim \chi^2_l \]

Box.test(aus_calves_aug$st_resid, lag = 10, type = "Ljung")


    Box-Ljung test

data:  aus_calves_aug$st_resid
X-squared = 149.71, df = 10, p-value < 2.2e-16

A bit more tidy with fable:

aus_calves_aug |> features(.innov, ljung_box, lag = 10)

# A tibble: 1 × 3
  .model         lb_stat lb_pvalue
  <chr>            <dbl>     <dbl>
1 STL(log_count)    150.         0

Baseline Forecasts

In general, \(\hat{Y}_t(l) = E(Y_{t+l} | Y_1, \dots, Y_t)\) is the minimum MSE forecast, \(l\) time units into the future from time \(t\).

What if \(Y_t = \mu + e_t\)?

What if \(Y_t = \mu_t + e_t = \beta_0 + \beta_1\cdot t + e_t\)?

Mean: \(\hat{Y}_t(l) = \bar{Y}\)
Naive: \(\hat{Y}_t(l) = Y_t\)
Seasonal Naive: \(\hat{Y}_t(l) = Y_{t + l - m(k-1)}\), where \(m\) is the seasonality and \(k\) is the integer part of \((l-1)/m\) (i.e. the forecast for all future February values is equal to the last observed February value.)
Linear regression: \(\hat{Y}_t(l) = \hat{\beta}_0 + \hat{\beta}_1(t + l)\)

# Set training data until 2016
calves_tr <- aus_calves |>
  filter(year(Month) < 2016)
calves_te <- aus_calves |>
  filter(year(Month) >= 2016)

# Fit the models
models_fit <- calves_tr |>
  model(
    Mean = fable::MEAN(log_count),
    Naive = fable::NAIVE(log_count),
    Seasonal_naive = fable::SNAIVE(log_count),
    LM = fable::TSLM(log_count ~ trend())
  )

# Generate forecasts for 3 years
models_fc <- models_fit |> forecast(calves_te)

# Can also do forecast(h = 36)

# Plot forecasts against actual values
models_fc |>
  autoplot(calves_tr, level = NULL) +
  autolayer(
    filter_index(aus_calves, "2016-01" ~ .),
    log_count,
    colour = "black") +
  guides(colour = guide_legend(title = "Forecast")) +
  labs(x = "", y = "log no. of calves slaughtered in Australia") +
  scale_x_yearmonth(date_breaks = "5 year", date_labels = "%Y") +
  theme_bw()

Evaluating forecast point accuracy

E.g. RMSE, MAE:

models_fc |>
  accuracy(
    data = calves_te,
    measures = list("RMSE" = RMSE, "MAE" = MAE)
  )

# A tibble: 4 × 4
  .model         .type  RMSE   MAE
  <chr>          <chr> <dbl> <dbl>
1 LM             Test  0.718 0.575
2 Mean           Test  1.08  0.904
3 Naive          Test  1.21  1.05 
4 Seasonal_naive Test  0.465 0.393