Deep learning for time series

class: center, middle, inverse, title-slide

.title[
# Deep learning for time series
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-03-17
]

---

## Time series

* **Time series data** involves measurements taken at regular intervals, exhibiting characteristics like periodic cycles, trends, and sudden spikes.

* Common tasks involving time series data include forecasting, classification, event detection, and anomaly detection.

---
## Time series

*   **Forecasting:** Predicting future values in a time series. For example, predicting the temperature 24 hours in the future based on hourly measurements of various weather variables over the past five days.

*   **Classification:** Assigning categorical labels to a time series. For example, determining whether the activity pattern of a website visitor corresponds to a human user or a bot.

---
## Time series

*   **Event detection:** Identifying the occurrence of specific, predefined events within a continuous data stream.  
  * Example: "hotword detection", where a model monitors an audio stream for specific phrases like "OK, Google" or "Hey, Alexa".

*   **Anomaly detection:** Identifying unusual or unexpected patterns within a continuous data stream. This task often relies on unsupervised learning, as the specific nature of anomalies might be unknown in advance. 
  * Examples: detecting unusual network activity that could signal an attack or identifying unusual sensor readings on a manufacturing line that might require human intervention.

---
## Weather forecasting

* We will use a **weather dataset** recorded at the Max Planck Institute for Biogeochemistry in Jena, Germany.
* It includes 14 quantities measured every 10 minutes over several years.
.small[
"Date Time" - Timestamp of the measurement
1. "p (mbar)" - Atmospheric pressure in millibars
2. "T (degC)" - Temperature in degrees Celsius
3. "Tpot (K)" - Potential temperature in Kelvin
4. "Tdew (degC)" - Dew point temperature in degrees Celsius
5. "rh (%)" - Relative humidity percentage
6. "VPmax (mbar)" - Maximum vapor pressure in millibars
7. "VPact (mbar)" - Actual vapor pressure in millibars
8. "VPdef (mbar)" - Vapor pressure deficit in millibars
9. "sh (g/kg)" - Specific humidity in grams per kilogram
10. "H2OC (mmol/mol)" - Water vapor concentration in millimoles per mole
11. "rho (g/m**3)" - Air density in grams per cubic meter
12. "wv (m/s)" - Wind velocity in meters per second
13. "max. wv (m/s)" - Maximum wind velocity in meters per second
14. "wd (deg)" - Wind direction in degrees]
.small[ https://maps.app.goo.gl/geuAWQzfUtkmnKVW8 https://www.bgc-jena.mpg.de/wetter/ ]

---
## Weather forecasting data preparation

* **Normalizing** each measurement independently to ensure they have similar scales, as different variables may have vastly different ranges.

* **Splitting the data** into training, validation, and testing sets, with the validation and test sets containing more recent data than the training set to simulate real-world forecasting scenarios.

* **Creating a TF Dataset object** that generates batches of data. This involves subsampling the 10-minute records to hourly intervals, using an input sequence of 120 hours (5 days), and setting a delay of 24 hours so that the target is the temperature one day after the end of the sequence.
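
The three steps above can be sketched in plain NumPy. This is a minimal illustration, not the course code: it uses synthetic data in place of the Jena file and a hand-written generator in place of a TF Dataset. The 50/25/25 split, the use of training-set statistics for normalization, and the helper name `make_windows` are assumptions.

```python
import numpy as np

# Synthetic stand-in for the Jena data: 14 features recorded every 10 minutes.
rng = np.random.default_rng(0)
raw = rng.normal(size=(20_000, 14))
temperature = raw[:, 1].copy()        # column 1 plays the role of "T (degC)"

# Chronological split (fractions are assumptions): train, then validation, then test.
n_train = int(0.5 * len(raw))
n_val = int(0.25 * len(raw))

# Normalize each feature independently, using training-set statistics only.
mean = raw[:n_train].mean(axis=0)
std = raw[:n_train].std(axis=0)
data = (raw - mean) / std

sampling_rate = 6                     # 6 x 10 minutes = one sample per hour
sequence_length = 120                 # 120 hourly samples = 5 days
# Offset from the window start to the target, 24 hours after the window ends.
delay = sampling_rate * (sequence_length + 24 - 1)

def make_windows(data, targets, start, end, batch_size=256):
    """Yield (inputs, targets) batches of hourly windows taken from rows [start, end)."""
    starts = np.arange(start, end - delay)
    for i in range(0, len(starts), batch_size):
        batch = starts[i:i + batch_size]
        idx = batch[:, None] + np.arange(sequence_length) * sampling_rate
        yield data[idx], targets[batch + delay]

# First training batch: inputs are (batch, time, features), targets are raw temperatures.
x, y = next(make_windows(data, temperature, 0, n_train))
```

Validation and test batches would come from the same generator with `start` and `end` set to the later row ranges, so no window crosses a split boundary.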

---
## Common sense baseline

* Before applying complex models, it's essential to **establish a common-sense baseline** to evaluate the effectiveness of machine learning approaches.

* For the temperature forecasting task, a simple baseline is to predict that the temperature 24 hours from now will be the same as the current temperature, taking advantage of the continuous and periodic nature of temperature data.

* We will use **mean absolute error (MAE)** to evaluate the baseline and our methods.
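
As a rough sketch of this baseline, the snippet below computes the MAE of the "same as now" prediction on a synthetic hourly temperature series. The data and the function name `naive_baseline_mae` are hypothetical and only illustrate the calculation.

```python
import numpy as np

def naive_baseline_mae(temperature, steps_ahead):
    """MAE of predicting that the value `steps_ahead` samples later equals the current value."""
    current = temperature[:-steps_ahead]
    future = temperature[steps_ahead:]
    return np.mean(np.abs(future - current))

# Hypothetical hourly temperatures: a perfect 24-hour cycle plus a slow upward drift.
hours = np.arange(24 * 30)
temperature = 10 + 5 * np.sin(2 * np.pi * hours / 24) + 0.01 * hours

# The daily cycle cancels out, so only the drift (0.01 per hour x 24 hours) remains.
mae = naive_baseline_mae(temperature, steps_ahead=24)
```

Any learned model should reach a lower validation MAE than this value to justify its complexity.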

---
## First deep learning models

* **Densely connected networks**. 
  * Disadvantage - they flatten the time series and lose the inherent temporal ordering.

* **1D convolutional models**. 
  * Advantages - can exploit patterns like daily cycles. 
  * Disadvantages - they do not account well for the order of observations, and weather data isn't strictly translation invariant (patterns may change depending on the time of day).

- Fully connected networks and CNNs do not have memory - each input is processed independently.

.small[https://machinelearningmastery.com/when-to-use-mlp-cnn-and-rnn-neural-networks/]

---
## Recurrent neural networks (RNN)

- RNNs are specifically designed to model sequential (temporal) structure: they process a sequence one step at a time and carry information forward from previous time steps.

- Used in time series forecasting, natural language processing (NLP), speech recognition, and more.

.center[<img src="img/rnn_example1.png" height=250>]

.small[ https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/ ]

---
## Recurrent neural networks (RNN)

- The model maintains a hidden state across time steps, allowing it to capture dependencies in sequential data.

- The recurrence is unfolded over time to show how the same network unit is applied repeatedly at each step.

.center[<img src="img/rnn_example1.png" height=250>]

.small[ https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/ ]

---
## Folded RNN representation

- Input `$x$`, Hidden state `$h$`, Output `$o$`.

- The hidden state `$h$` has a **recurrent connection**, meaning it passes information from one time step to the next.

- The weight matrices are labeled as: `$U$` for input-to-hidden connections; `$V$` for hidden-to-hidden connections; `$W$` for hidden-to-output connections.

.center[<img src="img/rnn_example1.png" height=250>]
.small[https://www.analyticsvidhya.com/blog/2020/04/comprehensive-popular-deep-learning-interview-questions-answers/]

---
## Unfolded RNN representation over time

The **unfolding process** illustrates how the recurrent unit processes sequential data across multiple time steps (`$t-1, t, t+1, \dots$`).
- At each time step:
  - The hidden state `$h_t$` is computed from the previous hidden state `$h_{t-1}$` and the current input `$x_t$`.
  - The output `$o_t$` is generated from `$h_t$`.

- The recurrence is modeled using:
  - `$h_t = f(U x_t + V h_{t-1})$`, where `$f$` is an activation function (often tanh or ReLU).
  - `$o_t = g(W h_t)$`, where `$g$` is usually a softmax or linear function.
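
The recurrence can be written as a short loop. The sketch below assumes `$f$` = tanh and a linear `$g$`, omits bias terms as the formulas above do, and uses arbitrary layer sizes and random weights for illustration.

```python
import numpy as np

def rnn_forward(x_seq, U, V, W, h0):
    """Run a simple RNN over x_seq (shape: time x input_dim); return hidden states and outputs."""
    h = h0
    hidden_states, outputs = [], []
    for x_t in x_seq:
        h = np.tanh(U @ x_t + V @ h)   # h_t = f(U x_t + V h_{t-1}), with f = tanh
        o = W @ h                      # o_t = g(W h_t), with g = identity
        hidden_states.append(h)
        outputs.append(o)
    return np.stack(hidden_states), np.stack(outputs)

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, timesteps = 3, 4, 2, 5   # arbitrary sizes
U = rng.normal(scale=0.5, size=(hidden_dim, input_dim))     # input-to-hidden
V = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))    # hidden-to-hidden
W = rng.normal(scale=0.5, size=(output_dim, hidden_dim))    # hidden-to-output
x_seq = rng.normal(size=(timesteps, input_dim))

hs, outs = rnn_forward(x_seq, U, V, W, h0=np.zeros(hidden_dim))
```

The same `U`, `V`, and `W` are reused at every step, which is what the unfolded diagram depicts.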

<!--
## RNN mathematical definition

- The predicted output at time `$t$`, `$\hat{y}^{(t)}$`, is a nonlinear function of `$h^{(t)}$` and bias `$b_1$`, where `$V$` is a weight matrix

`$$\hat{y}^{(t)} = g(V h^{(t)} + b_1 )$$`
- The current hidden layer `$h^{(t)}$` is a nonlinear function of the previous layer `$h^{(t - 1)}$` of the current input `$(x)$` and of bias, `$b_0$`
`$$h^{(t)} = f(W h^{(t - 1)} + Ux^{(t)} + b_0)$$`

- `$W$` and `$U$` are weight matrices to be estimated. If `$x$` represents a sequence-like dataset, `$x^{(t)}$` refers to the value of `$x$` at time `$t$` 
-->

---
## Word-level RNN language model

.center[<img src="img/0_Text.svg.png" height=450>]

.small[https://d2l.ai/chapter_recurrent-neural-networks/rnn.html]

---
## Backpropagation through time

- **Expansion of the RNN**: Backpropagation through time involves unrolling the recurrent neural network over the entire sequence length. 
- This unrolling process allows us to visualize the dependencies between model variables and parameters at each timestep.
<br><br>
- **Application of the Chain Rule**: Once the network is unrolled, we apply the chain rule to compute gradients.
- This involves calculating the gradients at each timestep and then propagating them backward through the network to update the weights.

.small[https://d2l.ai/chapter_recurrent-neural-networks/bptt.html]
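
A minimal NumPy sketch of these two steps, for a simple RNN with a tanh hidden layer and a linear output. The squared-error loss summed over all time steps, the layer sizes, and the function names are assumptions made for illustration.

```python
import numpy as np

def forward(x_seq, y_seq, U, V, W):
    """Unrolled forward pass; returns the loss and the hidden states needed for BPTT."""
    hs = [np.zeros(V.shape[0])]                    # h_0
    loss = 0.0
    for x_t, y_t in zip(x_seq, y_seq):
        hs.append(np.tanh(U @ x_t + V @ hs[-1]))
        loss += 0.5 * np.sum((W @ hs[-1] - y_t) ** 2)
    return loss, hs

def bptt(x_seq, y_seq, U, V, W):
    """Gradients of the summed squared error with respect to U, V, and W."""
    _, hs = forward(x_seq, y_seq, U, V, W)
    dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    dh_next = np.zeros(V.shape[0])                 # gradient arriving from step t+1
    for t in reversed(range(len(x_seq))):
        h_t, h_prev = hs[t + 1], hs[t]
        err = W @ h_t - y_seq[t]                   # dL_t / do_t
        dW += np.outer(err, h_t)
        dh = W.T @ err + dh_next                   # from this step's loss and all later steps
        da = dh * (1.0 - h_t ** 2)                 # chain rule through tanh
        dU += np.outer(da, x_seq[t])
        dV += np.outer(da, h_prev)
        dh_next = V.T @ da                         # pass the gradient back to h_{t-1}
    return dU, dV, dW

rng = np.random.default_rng(0)
U = rng.normal(scale=0.5, size=(4, 3))
V = rng.normal(scale=0.5, size=(4, 4))
W = rng.normal(scale=0.5, size=(2, 4))
x_seq = rng.normal(size=(6, 3))
y_seq = rng.normal(size=(6, 2))

dU, dV, dW = bptt(x_seq, y_seq, U, V, W)
```

Because the weights are shared across time, each gradient is a sum of per-step contributions, which is why `dU`, `dV`, and `dW` are accumulated inside the backward loop.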

<!---
## Backpropagation through time

**Handling Long Sequences**: Since sequences can be long, the dependencies can become extensive, leading to issues like vanishing or exploding gradients. To address these issues, several methods have been proposed:

- **Long Short-Term Memory (LSTM)**: Introduced by Hochreiter & Schmidhuber in 1997, LSTMs are designed to capture long-term dependencies by using gates to control the flow of information.

- **Gated Recurrent Unit (GRU)**: Proposed by Cho et al. in 2014, GRUs are a streamlined variant of LSTMs that often offer comparable performance while being faster to compute.

.small[https://d2l.ai/chapter_recurrent-neural-networks/bptt.html]
-->

---
## RNN limitations

* The **vanishing gradient problem** in RNNs occurs when gradients become very small during backpropagation through time.

* This happens because backpropagation multiplies one factor per time step, so the product can shrink exponentially if weights are small or activation function derivatives are less than 1.

* As a result, the RNN struggles to learn long-term dependencies because early time steps have little influence on weight updates.
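
The effect can be seen numerically in a scalar RNN, `$h_t = \tanh(v h_{t-1} + u x_t)$`, where the gradient of the final state with respect to the initial state is a product of one factor per step. The weight values and the random inputs below are arbitrary choices for illustration.

```python
import numpy as np

def gradient_wrt_initial_state(v, u, x_seq, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(v * h_{t-1} + u * x_t)."""
    h, grad = h0, 1.0
    for x_t in x_seq:
        h = np.tanh(v * h + u * x_t)
        grad *= v * (1.0 - h ** 2)     # chain rule: one factor per time step
    return grad

rng = np.random.default_rng(0)
x_seq = rng.normal(size=100)

# Gradient magnitude after 1, 10, 50, and 100 time steps.
grads = {steps: abs(gradient_wrt_initial_state(0.9, 1.0, x_seq[:steps]))
         for steps in (1, 10, 50, 100)}
for steps, g in grads.items():
    print(steps, g)
```

With `v = 0.9`, each factor is smaller than 0.9 in absolute value, so the gradient after `T` steps is bounded by `0.9 ** T`.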

---
## RNN improvements

- Sequence elements are not created equal - some may be more important than others. E.g., introductory words may be highly predictive of a future story. We need a _memory cell_ to store such information.

- Some elements may not be important at all, e.g., HTML formatting tags around the actual text. We need a _skipping mechanism_ to forget such elements.

- Some parts of the sequence may be disjoint, e.g., book chapters. We need a _reset_ mechanism to refresh internal state representations.

---
## Long Short-Term Memory (LSTM)

- **Long Short-Term Memory (LSTM)**: Introduced by Hochreiter & Schmidhuber in 1997, LSTMs are designed to capture long-term dependencies by using gates to control the flow of information.

* LSTMs can effectively retain information from earlier time steps using a "conveyor belt" mechanism that selectively carries information across time.
.center[<img src="img/LSTM3-chain.png" height=250>]
.small[https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
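
A minimal sketch of one LSTM time step, following the standard gate equations described in the post linked above. The stacked parameter layout (`Wx`, `Wh`, `b` with gates ordered forget, input, output, candidate), the layer sizes, and the function names are assumptions for illustration and do not correspond to any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM step. Wx: (4H, D), Wh: (4H, H), b: (4H,), gates stacked as [f, i, o, g]."""
    H = h_prev.shape[0]
    z = Wx @ x_t + Wh @ h_prev + b
    f = sigmoid(z[0:H])             # forget gate: what to erase from the cell state
    i = sigmoid(z[H:2 * H])         # input gate: how much new information to write
    o = sigmoid(z[2 * H:3 * H])     # output gate: what to expose as the hidden state
    g = np.tanh(z[3 * H:4 * H])     # candidate values to write
    c_t = f * c_prev + i * g        # the "conveyor belt": additive cell-state update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 3, 4                         # arbitrary input and hidden sizes
Wx = rng.normal(scale=0.5, size=(4 * H, D))
Wh = rng.normal(scale=0.5, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):
    h, c = lstm_step(x_t, h, c, Wx, Wh, b)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state passes through a step almost unchanged, which is how information from early time steps can be retained.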