class: center, middle, inverse, title-slide

.title[
# Deep Learning with R
]

.subtitle[
## Terminology and Best Practices
]

.author[
### Mikhail Dozmorov
]

.institute[
### Virginia Commonwealth University
]

.date[
### 2025-04-30
]

---
## Hyperparameter tuning

* **Hyperparameters** are architecture-level parameters that you must decide on when building a deep learning model, e.g., the number of layers, units per layer, activation functions, and dropout rate.

* Experienced machine learning engineers develop intuition for hyperparameter choices, but initial decisions are often suboptimal.

---
## Hyperparameter tuning

* **Hyperparameter optimization** is the process of systematically searching the hyperparameter space to find the best-performing model architecture empirically.

* Updating hyperparameters is challenging because the hyperparameter space is typically discrete and not differentiable, requiring gradient-free optimization techniques. Computing feedback (model performance) for each hyperparameter set is also expensive, since it requires training a new model from scratch.

---
## Hyperparameter tuning

* **KerasTuner** is a tool that simplifies hyperparameter tuning in Keras. It lets you define a search space by replacing hardcoded hyperparameter values with a range of possible choices.

* **A model-building function** takes a hyperparameter object (`hp`) from which you can sample hyperparameter ranges. KerasTuner offers different kinds of hyperparameters, such as `Int`, `Float`, `Boolean`, and `Choice` (see the sketch on the next slide).

* **A tuner** (e.g., `RandomSearch`, `BayesianOptimization`, `Hyperband`) repeatedly picks hyperparameter values, builds and trains the model, and records metrics.

* **Designing the right search space is an art**; it is too computationally expensive to make everything a hyperparameter. Leverage your knowledge of model architecture best practices to define a search space with the potential to yield good results.
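---
## KerasTuner: model-building function

A minimal sketch of a model-building function, assuming `library(keras)` and the `keras_tuner` Python package imported via reticulate; the architecture, hyperparameter ranges, and input size are illustrative placeholders, not a prescription.

```r
library(keras)
kt <- reticulate::import("keras_tuner")  # assumes keras-tuner is installed

build_model <- function(hp) {
  # Sample an integer hyperparameter: units in the hidden layer
  units <- hp$Int("units", min_value = 16L, max_value = 64L, step = 16L)
  model <- keras_model_sequential() %>%
    layer_dense(units = units, activation = "relu",
                input_shape = c(784)) %>%  # placeholder input size
    layer_dense(units = 10, activation = "softmax")
  # Sample a categorical hyperparameter: which optimizer to use
  optimizer <- hp$Choice("optimizer", c("rmsprop", "adam"))
  model %>% compile(optimizer = optimizer,
                    loss = "sparse_categorical_crossentropy",
                    metrics = "accuracy")
  model
}
```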
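---
## KerasTuner: running a search

A hedged sketch of launching a tuner over the function above; `x_train`/`y_train` are placeholder arrays, and the trial count, epochs, and directory name are assumptions chosen for illustration.

```r
tuner <- kt$RandomSearch(
  build_model,                 # the model-building function above
  objective = "val_accuracy",  # metric the tuner optimizes
  max_trials = 20L,            # hyperparameter combinations to try
  executions_per_trial = 1L,   # trainings per combination
  directory = "hp_search"      # where trial records are saved
)

tuner$search(x_train, y_train, epochs = 10L, validation_split = 0.2)

# Inspect the best hyperparameter values found
best_hp <- tuner$get_best_hyperparameters(1L)[[1]]
best_hp$get("units")
```

Every trial trains a fresh model, which is why the feedback signal in hyperparameter search is so expensive.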
---
## Tuning best practices

* **Avoid making everything a hyperparameter**; the search space grows combinatorially, making the search too expensive.

* **Design the search space intelligently** by focusing on experiment configurations with the potential for good performance.

* **Higher-level architecture decisions (e.g., using residual connections) tend to generalize better** across different tasks and datasets.

* **KerasTuner offers premade search spaces** tailored to broad problem categories like image classification. Explore **tunable versions of Keras Applications models** such as `kt$applications$HyperXception` and `kt$applications$HyperResNet`.

---
## Automated machine learning

* The future of automation extends beyond hyperparameter tuning to **automatically generating model architectures from scratch**.

* Techniques like **reinforcement learning or genetic algorithms** may be used for this purpose.

* The ultimate goal is **automated machine learning (AutoML)**, where entire end-to-end machine learning pipelines are generated automatically.

* Libraries like **AutoKeras** already exist for solving basic machine learning problems with minimal user involvement.

.small[ https://github.com/keras-team/autokeras ]

---
## Model ensembling

* **Model ensembling** involves **pooling the predictions of a set of different models** to produce better overall predictions.

* **Winners of machine learning competitions** often use **very large ensembles** that outperform any single model.

* Ensembling works because **different well-performing models trained independently are likely to be good for different reasons**, each capturing slightly different aspects of the data.

* The easiest way to ensemble classifiers is to **average their predictions at inference time**.

---
## Model ensembling

* A **smarter approach** is a **weighted average**, where the weights are **learned on the validation data** so that better models receive higher weights (see the appendix sketch at the end of the deck).

* Ensemble models that are **as good as possible while being as different as possible**, e.g., models with **very different architectures or even different brands of machine learning approaches**.

* An effective strategy can be to ensemble **tree-based methods and deep neural networks**.

* Even a model with a **worse individual score** can significantly improve an ensemble if it is **sufficiently different** and provides **unique information**.

---
## Scaling up

There are three main ways to train models faster:

* **Mixed-precision training:** Speed up training (up to 3x on GPUs) by using lower-precision (16-bit) computations where possible, while maintaining numerical stability with 32-bit precision in sensitive parts.

* **Training on multiple GPUs:** Distribute the training workload across several GPUs to achieve significant speedups.

* **Training on TPUs:** Use Google's Tensor Processing Units, hardware specialized for deep learning, which can offer substantial speed advantages.

---
## Mixed-precision training details

* Leverages **float16 computations for speed and memory efficiency** on modern GPUs and TPUs.

* Maintains **float32 for weights and precision-sensitive operations** (such as softmax and cross-entropy) to ensure numerical stability.

* Can be enabled globally in Keras with `keras::keras$mixed_precision$set_global_policy("mixed_float16")`.

* Be mindful of **data type defaults**, especially when converting R arrays to TensorFlow tensors (explicitly set `dtype = "float32"` if needed).

---
## Multi-GPU training (data parallelism)

* Replicates a single model across multiple devices; each replica processes a different batch of data, and the results are then merged.

* **Single-host, multidevice synchronous training** uses `tf$distribute$MirroredStrategy()`: the model is built once, and each GPU gets a copy (replica).

* A global batch of data is split into local batches, processed independently by each replica, and the resulting weight updates are merged.

* Speedup is generally sublinear in the number of GPUs because of communication overhead.

---
## TPU training

* Offers **significant speedups** compared to GPUs.

* Requires **connecting to the TPU cluster** with `tf$distribute$cluster_resolver$TPUClusterResolver$connect()`.

* Uses `tf$distribute$TPUStrategy()`, which follows the same distribution template as `MirroredStrategy()` (sketched in the appendix).

* TPU training data must be either in memory or in a Google Cloud Storage (GCS) bucket.

* Be aware of **I/O bottlenecks** when reading data from GCS; consider caching smaller datasets in memory or using the TFRecord format for larger datasets.

---
## FINAL

.center[<img src="img/cs_4620_intelligent_systems.jpg" height=450>]

.small[https://www.aprogrammerlife.com/top-rated/cs-4620-intelligent-systems-738]
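---
## Appendix: ensembling sketch

A minimal base-R sketch of averaging classifier predictions, as described on the model ensembling slides; `model_a`/`model_b`/`model_c`, `x_test`, and the weights are hypothetical placeholders (in practice, the weights would be learned on validation data).

```r
# Predicted class-probability matrices from three fitted classifiers
preds_a <- predict(model_a, x_test)
preds_b <- predict(model_b, x_test)
preds_c <- predict(model_c, x_test)

# Easiest ensemble: uniform average of the predictions
ensemble_uniform <- (preds_a + preds_b + preds_c) / 3

# Smarter ensemble: weighted average, better models get higher weights
w <- c(0.5, 0.3, 0.2)  # placeholder weights
ensemble_weighted <- w[1] * preds_a + w[2] * preds_b + w[3] * preds_c

# Final predicted class: column with the highest averaged probability
pred_class <- max.col(ensemble_weighted)
```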
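---
## Appendix: data-parallel training sketch

A minimal sketch of the `MirroredStrategy()` template from the multi-GPU slide, assuming `library(tensorflow)` and `library(keras)`; the model, data, and batch size are placeholders.

```r
library(tensorflow)
library(keras)

# The strategy discovers the GPUs available on this host
strategy <- tf$distribute$MirroredStrategy()
strategy$num_replicas_in_sync  # number of model replicas

# Variable-creating code must run inside the strategy scope so each
# GPU receives a mirrored copy of the model's weights
with(strategy$scope(), {
  model <- keras_model_sequential() %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dense(units = 10, activation = "softmax")
  model %>% compile(optimizer = "rmsprop",
                    loss = "sparse_categorical_crossentropy",
                    metrics = "accuracy")
})

# fit() splits each global batch across the replicas and merges the
# resulting weight updates (x_train / y_train are placeholders)
model %>% fit(x_train, y_train, epochs = 5, batch_size = 256)
```

Swapping in `TPUStrategy()` after `TPUClusterResolver$connect()` follows the same template.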