class: center, middle, inverse, title-slide

.title[
# Deep Learning with R
]

.subtitle[
## Terminology and Best Practices
]

.author[
### Mikhail Dozmorov
]

.institute[
### Virginia Commonwealth University
]

.date[
### 2025-04-30
]

---
## Hyperparameter tuning

* **Hyperparameters** are architecture-level parameters that you must decide on when building a deep learning model, e.g., the number of layers, units per layer, activation functions, and dropout rate.

* Experienced machine learning engineers develop intuition for hyperparameter choices, but initial decisions are often suboptimal.

---
## Hyperparameter tuning

* **Hyperparameter optimization** is the process of systematically searching the hyperparameter space to find the best-performing model architecture empirically.

* Updating hyperparameters is challenging because the hyperparameter space is typically discrete and not differentiable, requiring gradient-free optimization techniques. Computing feedback (model performance) for each hyperparameter set is also expensive, since it requires training a new model from scratch.

---
## Hyperparameter tuning

* **KerasTuner** is a tool that simplifies hyperparameter tuning in Keras. It lets you define a search space by replacing hardcoded hyperparameter values with a range of possible choices.

* **A model-building function** takes a hyperparameter object (`hp`) from which you can sample hyperparameter ranges. KerasTuner offers different kinds of hyperparameters, such as `Int`, `Float`, `Boolean`, and `Choice` (see the sketch on the next slide).

* **A tuner** (e.g., `RandomSearch`, `BayesianOptimization`, `Hyperband`) repeatedly picks hyperparameter values, builds and trains the model, and records metrics.

* **Designing the right search space is an art**; it is too computationally expensive to make everything a hyperparameter. Leverage your knowledge of model architecture best practices to define a search space with the potential to yield good results.
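---
## KerasTuner: model-building function

A minimal sketch of a model-building function, assuming `library(keras)` and the `keras_tuner` Python package imported via reticulate; the architecture, hyperparameter ranges, and input size are illustrative placeholders, not a prescription.

```r
library(keras)
kt <- reticulate::import("keras_tuner")  # assumes keras-tuner is installed

build_model <- function(hp) {
  # Sample an integer hyperparameter: units in the hidden layer
  units <- hp$Int("units", min_value = 16L, max_value = 64L, step = 16L)
  model <- keras_model_sequential() %>%
    layer_dense(units = units, activation = "relu",
                input_shape = c(784)) %>%  # placeholder input size
    layer_dense(units = 10, activation = "softmax")
  # Sample a categorical hyperparameter: which optimizer to use
  optimizer <- hp$Choice("optimizer", c("rmsprop", "adam"))
  model %>% compile(optimizer = optimizer,
                    loss = "sparse_categorical_crossentropy",
                    metrics = "accuracy")
  model
}
```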
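---
## KerasTuner: running a search

A hedged sketch of launching a tuner over the function above; `x_train`/`y_train` are placeholder arrays, and the trial count, epochs, and directory name are assumptions chosen for illustration.

```r
tuner <- kt$RandomSearch(
  build_model,                 # the model-building function above
  objective = "val_accuracy",  # metric the tuner optimizes
  max_trials = 20L,            # hyperparameter combinations to try
  executions_per_trial = 1L,   # trainings per combination
  directory = "hp_search"      # where trial records are saved
)

tuner$search(x_train, y_train, epochs = 10L, validation_split = 0.2)

# Inspect the best hyperparameter values found
best_hp <- tuner$get_best_hyperparameters(1L)[[1]]
best_hp$get("units")
```

Every trial trains a fresh model, which is why the feedback signal in hyperparameter search is so expensive.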
---
## Tuning best practices

* **Avoid making everything a hyperparameter**; the search space grows combinatorially, making the search too expensive.

* **Design the search space intelligently** by focusing on experiment configurations with the potential for good performance.

* **Higher-level architecture decisions (e.g., using residual connections) tend to generalize better** across different tasks and datasets.

* **KerasTuner offers premade search spaces** tailored to broad problem categories like image classification. Explore **tunable versions of Keras Applications models** such as `kt$applications$HyperXception` and `kt$applications$HyperResNet`.

---
## Automated machine learning

* The future of automation extends beyond hyperparameter tuning to **automatically generating model architectures from scratch**.

* Techniques like **reinforcement learning or genetic algorithms** may be used for this purpose.

* The ultimate goal is **automated machine learning (AutoML)**, where entire end-to-end machine learning pipelines are generated automatically.

* Libraries like **AutoKeras** already exist for solving basic machine learning problems with minimal user involvement.

.small[ https://github.com/keras-team/autokeras ]

---
## Model ensembling

* **Model ensembling** involves **pooling the predictions of a set of different models** to produce better overall predictions.

* **Winners of machine learning competitions** often use **very large ensembles** that outperform any single model.

* Ensembling works because **different well-performing models trained independently are likely to be good for different reasons**, each capturing slightly different aspects of the data.

* The easiest way to ensemble classifiers is to **average their predictions at inference time**.

---
## Model ensembling

* A **smarter approach** is a **weighted average**, where the weights are **learned on the validation data** so that better models receive higher weights (see the appendix sketch at the end of the deck).

* Ensemble models that are **as good as possible while being as different as possible**, e.g., models with **very different architectures or even different brands of machine learning approaches**.

* An effective strategy can be to ensemble **tree-based methods and deep neural networks**.

* Even a model with a **worse individual score** can significantly improve an ensemble if it is **sufficiently different** and provides **unique information**.

---
## Scaling up

There are three main ways to train models faster:

* **Mixed-precision training:** Speed up training (up to 3x on GPUs) by using lower-precision (16-bit) computations where possible, while maintaining numerical stability with 32-bit precision in sensitive parts.

* **Training on multiple GPUs:** Distribute the training workload across several GPUs to achieve significant speedups.

* **Training on TPUs:** Use Google's Tensor Processing Units, hardware specialized for deep learning, which can offer substantial speed advantages.

---
## Mixed-precision training details

* Leverages **float16 computations for speed and memory efficiency** on modern GPUs and TPUs.

* Maintains **float32 for weights and precision-sensitive operations** (such as softmax and cross-entropy) to ensure numerical stability.

* Can be enabled globally in Keras with `keras::keras$mixed_precision$set_global_policy("mixed_float16")`.

* Be mindful of **data type defaults**, especially when converting R arrays to TensorFlow tensors (explicitly set `dtype = "float32"` if needed).

---
## Multi-GPU training (data parallelism)

* Replicates a single model across multiple devices; each replica processes a different batch of data, and the results are then merged.

* **Single-host, multidevice synchronous training** uses `tf$distribute$MirroredStrategy()`: the model is built once, and each GPU gets a copy (replica).

* A global batch of data is split into local batches, processed independently by each replica, and the resulting weight updates are merged.

* Speedup is generally sublinear in the number of GPUs because of communication overhead.

---
## TPU training

* Offers **significant speedups** compared to GPUs.

* Requires **connecting to the TPU cluster** with `tf$distribute$cluster_resolver$TPUClusterResolver$connect()`.

* Uses `tf$distribute$TPUStrategy()`, which follows the same distribution template as `MirroredStrategy()` (sketched in the appendix).

* TPU training data must be either in memory or in a Google Cloud Storage (GCS) bucket.

* Be aware of **I/O bottlenecks** when reading data from GCS; consider caching smaller datasets in memory or using the TFRecord format for larger datasets.

---
## FINAL

.center[<img src="img/cs_4620_intelligent_systems.jpg" height=450>]

.small[https://www.aprogrammerlife.com/top-rated/cs-4620-intelligent-systems-738]
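---
## Appendix: ensembling sketch

A minimal base-R sketch of averaging classifier predictions, as described on the model ensembling slides; `model_a`/`model_b`/`model_c`, `x_test`, and the weights are hypothetical placeholders (in practice, the weights would be learned on validation data).

```r
# Predicted class-probability matrices from three fitted classifiers
preds_a <- predict(model_a, x_test)
preds_b <- predict(model_b, x_test)
preds_c <- predict(model_c, x_test)

# Easiest ensemble: uniform average of the predictions
ensemble_uniform <- (preds_a + preds_b + preds_c) / 3

# Smarter ensemble: weighted average, better models get higher weights
w <- c(0.5, 0.3, 0.2)  # placeholder weights
ensemble_weighted <- w[1] * preds_a + w[2] * preds_b + w[3] * preds_c

# Final predicted class: column with the highest averaged probability
pred_class <- max.col(ensemble_weighted)
```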
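---
## Appendix: data-parallel training sketch

A minimal sketch of the `MirroredStrategy()` template from the multi-GPU slide, assuming `library(tensorflow)` and `library(keras)`; the model, data, and batch size are placeholders.

```r
library(tensorflow)
library(keras)

# The strategy discovers the GPUs available on this host
strategy <- tf$distribute$MirroredStrategy()
strategy$num_replicas_in_sync  # number of model replicas

# Variable-creating code must run inside the strategy scope so each
# GPU receives a mirrored copy of the model's weights
with(strategy$scope(), {
  model <- keras_model_sequential() %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dense(units = 10, activation = "softmax")
  model %>% compile(optimizer = "rmsprop",
                    loss = "sparse_categorical_crossentropy",
                    metrics = "accuracy")
})

# fit() splits each global batch across the replicas and merges the
# resulting weight updates (x_train / y_train are placeholders)
model %>% fit(x_train, y_train, epochs = 5, batch_size = 256)
```

Swapping in `TPUStrategy()` after `TPUClusterResolver$connect()` follows the same template.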