class: center, middle, inverse, title-slide

.title[
# Deep Learning with R
]
.subtitle[
## Convolutional Neural Networks, images
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-02-12
]

---

## Introduction to Convolutional Neural Networks for Computer Vision

* **Computer vision is a field where deep learning has seen significant success.** Deep vision models are used in many applications, such as Google Photos, Google image search, YouTube, camera app filters, OCR software, and more. They are also being used in cutting-edge research areas such as autonomous driving, robotics, AI-assisted medical diagnosis, autonomous retail checkout systems, and autonomous farming.

* **Convolutional neural networks (CNNs or ConvNets) are a type of deep learning model that's almost universally used in computer vision applications.** We will focus on applying ConvNets to image-classification problems, particularly for small datasets, a common situation when working outside large tech companies.

---

## Why ConvNets Excel in Computer Vision

* **ConvNets learn local patterns and are translation-invariant:** Unlike densely connected networks, which learn global patterns involving all input features, ConvNets learn local patterns within small 2D windows of the input, like 3x3 or 5x5 patches in images. This property makes ConvNets **data-efficient** because they require fewer training samples to generalize well.

* **Translation-invariance** means that a pattern learned in one part of an image can be recognized anywhere else in the image, allowing convnets to efficiently learn representations that generalize well.

---

## Why ConvNets Excel in Computer Vision

* **ConvNets can learn spatial hierarchies of patterns:** This hierarchical learning allows them to effectively learn increasingly complex and abstract visual concepts, capturing the spatial hierarchy present in the visual world.
* For instance, lower layers might learn simple features like edges, while higher layers learn more complex features like shapes or objects, which are composed of these simpler features.

---

## Convolutional Neural Networks (CNNs)

- The convolutional neural network (CNN) is a specialized feedforward neural network designed to process multi-dimensional data, e.g., images

- A CNN architecture is typically composed of **convolutional layers**, **pooling** (subsampling) layers, and fully-connected layers (LeNet-5)

.center[<img src="img/LeNet5.png" height=250>]

.small[LeCun Y, Bottou L, Bengio Y, Haffner P. [Gradient-based learning applied to document recognition](https://ieeexplore.ieee.org/document/726791). Proc IEEE. 1998]

---

## Classical Image Classification Datasets

- **MNIST** (Modified National Institute of Standards and Technology) - handwritten digits, 0-9. 70,000 28x28-pixel grayscale images (60,000 training, 10,000 test)

- **CIFAR-10** (Canadian Institute For Advanced Research) - 60,000 32x32-pixel color images, 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)

- **ImageNet** - over 14 million images, more than 20,000 categories (animals, plants, sports, etc.)

.small[https://www.analyticsvidhya.com/blog/2020/02/learn-image-classification-cnn-convolutional-neural-networks-3-datasets/

http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html]

---

## Architecture and Components of ConvNets

* Convolutions operate on **feature maps**, which are rank 3 tensors with two spatial dimensions (height and width) and a **depth axis (channels)**. For instance, RGB images have a depth of 3 (red, green, blue).

* **The convolution operation extracts patches from the input feature map, applies a transformation to them, and produces an output feature map, which is also a rank 3 tensor.** This transformation is applied uniformly to all patches, ensuring consistency.
The output depth can vary, as it is a parameter of the layer, and the different channels represent filters rather than specific colors.

* **Filters encode specific aspects of the input data.** For instance, a filter might encode the concept of "presence of a face in the input". Each output channel in the feature map contains a grid of values, forming a response map indicating the filter's response at different input locations.

---

## Architecture and Components of ConvNets

* **ConvNets process inputs as tensors of shape (image\_height, image\_width, image\_channels), excluding the batch dimension.**

* MNIST images are processed as (28, 28, 1) tensors.

* A `layer_rescaling()` layer is used to rescale pixel values from the [0, 255] range to the [0, 1] range.

---

## Architecture and Components of ConvNets

**ConvNets typically consist of a stack of `layer_conv_2d()` and `layer_max_pooling_2d()` layers.**

* The output of each `layer_conv_2d()` and `layer_max_pooling_2d()` is a 3D tensor with shape (height, width, channels).

* The width and height of these tensors typically shrink as you go deeper into the network.

* The number of channels is determined by the `filters` argument in the `layer_conv_2d()` layer.

* After the last convolutional layer, the output is flattened using `layer_flatten()` to prepare it for input to a densely connected classifier, typically a stack of `layer_dense()` layers.

---

## Data Preprocessing for ConvNets

* Images stored as JPEG files need to be preprocessed into floating-point tensors before being fed into the model.

* Additional steps include resizing the images, normalizing pixel intensities to a common scale, and (optionally) data augmentation.

* Keras provides the `image_dataset_from_directory()` utility to read image files, shuffle them, decode them to tensors, resize them, and pack them into batches.

---

## Understanding TF Datasets

* The `tfdatasets` package creates efficient input pipelines using TensorFlow Datasets (TF Datasets).
* TF Datasets are iterable objects that yield batches of data and labels.

* They offer benefits like asynchronous data prefetching, improving training efficiency.

* The `tfdatasets` package has a functional API for modifying datasets, including functions like `dataset_batch()`, `dataset_shuffle()`, `dataset_prefetch()`, and `dataset_map()`.

.small[ https://tensorflow.rstudio.com/reference/tfdatasets/ ]

---

## Convolution

.pull-left[**Convolution** is a mathematical operation that computes the integral of the product of two functions (signals), with one of the signals flipped and shifted. Here, we convolve two signals `\(f(t)\)` and `\(g(t)\)`.

.small[ https://leonardoaraujosantos.gitbook.io/artificial-inteligence/machine_learning/deep_learning/convolution ]
]

.pull-right[
.center[<img src="img/convolution.png" height=500>]
]

---

## Convolution

* The convolution operation is the key difference between a densely connected layer and a convolution layer. **Densely connected layers process global patterns, while convolution layers process local patterns.** This characteristic gives convnets (convolutional neural networks) several important properties.

- In each convolutional layer, a convolution operation with a predefined window width and stride is slid along the input. The set of weights used by each such operation is called a "**kernel**" or a "**filter**" and is somewhat equivalent to a "neuron" in an MLP. An activation function is applied after each convolution to produce the output

---

## Convolution

* **A convolution slides a window over the 3D input feature map, extracting a 3D patch of features at each location.** Each patch is transformed into a 1D vector (with one value per output channel) by multiplying it with a learned weight matrix (the convolution kernel). These vectors are then reassembled into the 3D output feature map.
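
The sliding-window operation described above can be sketched in base R. This is a minimal illustration, not how Keras implements `layer_conv_2d()`: it assumes a single input channel, a single filter, a stride of 1, and no padding. The function name `cross_correlate_2d()` and the toy input and kernel values are hypothetical.

```r
# 2D cross-correlation of one channel with one filter (stride 1, no padding)
cross_correlate_2d <- function(input, kernel) {
  out_h <- nrow(input) - nrow(kernel) + 1
  out_w <- ncol(input) - ncol(kernel) + 1
  output <- matrix(0, out_h, out_w)
  for (i in seq_len(out_h)) {
    for (j in seq_len(out_w)) {
      # Extract the patch under the window and take its weighted sum
      patch <- input[i:(i + nrow(kernel) - 1), j:(j + ncol(kernel) - 1)]
      output[i, j] <- sum(patch * kernel)
    }
  }
  output
}

input  <- matrix(0:8, nrow = 3, byrow = TRUE)  # hypothetical 3 x 3 input
kernel <- matrix(0:3, nrow = 2, byrow = TRUE)  # hypothetical 2 x 2 filter

result <- cross_correlate_2d(input, kernel)
result
#>      [,1] [,2]
#> [1,]   19   25
#> [2,]   37   43
```

Deep learning libraries compute this cross-correlation (the kernel is not flipped) and still call the operation a convolution.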
.center[<img src="img/convolution.gif" height=300>]

.small[ https://leonardoaraujosantos.gitbook.io/artificial-inteligence/machine_learning/deep_learning/convolutional_neural_networks ]

---

## Convolution operation

- A single unit of a convolutional layer is only connected to a small receptive field of its input, where the weights of its connections define a **filter**.

- The convolution operation is used to slide the filter bank across the input, producing activations at each receptive field that combine to form a **feature map**.

.center[<img src="img/cnn_filter.png" height=250>]

.small[https://d2l.ai/chapter_convolutional-neural-networks/conv-layer.html]

---

## Convolution operation

.center[<img src="img/cnn_filter.png" height=250>]

Two-dimensional cross-correlation operation. The shaded portions are the first output element and the input and kernel array elements used in its computation: `\(0\times0 + 1\times1 + 3\times2 + 4\times3=19\)`

.small[https://d2l.ai/chapter_convolutional-neural-networks/channels.html]

---

## Parameters defining convolution

Convolutions are defined by two key parameters:

- **Size of the patches (filter windows)** extracted from the inputs. These are typically `\(3\times3\)` (most frequently used), or `\(5\times5\)`.

- **Depth of the output feature map** - The number of filters computed by the convolution.

A convolution works by sliding these windows over the 3D input feature map, stopping at every possible location, and extracting the 3D patch of surrounding features.

---

## Padding and Strides

* **Strides, the distance between successive windows, can downsample the feature map.** A stride of 2 reduces the width and height by a factor of about 2. Strided convolutions are rarely used in classification models, but they are useful for other types of models.

* **Padding can be used to maintain the output feature map's spatial dimensions.** It involves adding rows and columns to the input to allow convolution windows to be centered around every input pixel, preventing the output from shrinking.
* The `padding` argument in `layer_conv_2d()` can be set to `"valid"` (the default) for no padding or `"same"` to maintain the input dimensions in the output.

---

## Strides

.pull-left[
- The stride is the distance between two successive windows. It is a parameter of the convolution and defaults to 1.

- It is possible to have strided convolutions: convolutions with a stride higher than 1.
]

.pull-right[
.center[<img src="img/strides_1.gif" height=200>]
.center[<img src="img/strides_2.gif" height=200>]]

.small[ https://leonardoaraujosantos.gitbook.io/artificial-inteligence/machine_learning/deep_learning/convolution#stride ]

---

## Padding

- Without padding, the output of a convolution with a kernel larger than `\(1\times1\)` is always smaller than the input. To avoid this behaviour we need to use padding.

- **Padding** - adding an appropriate number of rows and columns on each side of the input feature map so that a convolution window can be centered on every input tile.

.center[<img src="img/padding.gif" height=250>]

.small[ https://leonardoaraujosantos.gitbook.io/artificial-inteligence/machine_learning/deep_learning/convolutional_neural_networks#output-size ]

---

## Padding

`$$H_{out} = 1 + \left\lfloor \frac{H_{in} + 2 \cdot pad - K_{height}}{S} \right\rfloor$$`

- `\(pad\)` is the padding (number of rows added on each side of the input)
- `\(S\)` is the stride (step size of the convolution)
- `\(H_{in}\)` and `\(H_{out}\)` are the input and output heights
- `\(K_{height}\)` is the kernel (filter) height
- Example: a `\(28\times28\)` input with a `\(3\times3\)` kernel, `\(pad = 0\)`, and `\(S = 1\)` gives `\(H_{out} = 1 + (28 - 3)/1 = 26\)`

.center[<img src="img/padding.gif" height=250>]

.small[ https://leonardoaraujosantos.gitbook.io/artificial-inteligence/machine_learning/deep_learning/convolutional_neural_networks#output-size ]

---

## Max-Pooling

* **The Max-Pooling Operation:** Max pooling downsamples feature maps, similar to strided convolutions, by extracting windows and outputting the maximum value of each channel.
    * Typically uses 2x2 windows with a stride of 2 to downsample by a factor of 2.
    * Max pooling is generally preferred over average pooling because it better preserves feature-presence information.
* **The Importance of Downsampling:**
    * Downsampling reduces the number of feature-map coefficients, improving computational efficiency.
    * It also helps create spatial-filter hierarchies, allowing higher layers to learn patterns covering a larger portion of the input.

---

## Max-Pooling

- The reason to use downsampling via max-pooling is to reduce the number of feature-map coefficients to process and to induce spatial-filter hierarchies by making successive convolution layers look at increasingly large windows.

- **Max** is used because it’s more informative to look at the maximal presence of different features.

.center[<img src="img/max_pooling.png" height=200>]

.small[https://deeplizard.com/learn/video/ZjM_XQa5s6s, https://d2l.ai/chapter_convolutional-neural-networks/pooling.html]

---

## Convolutional Neural Networks (CNNs)

- By combining multiple filters in a single convolutional layer, the layer can learn to detect multiple features in the input. The resulting feature maps become the input of the next layer.

- A first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layers, and so on.

.center[<img src="img/LeNet5.png" height=250>]

.small[LeCun Y, Bengio Y, Hinton G. [Deep learning](https://www.nature.com/articles/nature14539). Nature. 2015

A friendly introduction to Convolutional Neural Networks and Image Recognition, 32m https://youtu.be/2-Ol7ZB0MmU]

---

## Convolutional Neural Networks (CNNs)

- Pooling layers are added after one or more convolutional layers to merge semantically similar features and reduce the dimensionality.

- After the convolutional and pooling layers, the multi-dimensional output is flattened and fed to fully-connected layers for classification.

.center[<img src="img/LeNet5.png" height=250>]

.small[LeCun Y, Bengio Y, Hinton G. [Deep learning](https://www.nature.com/articles/nature14539). Nature.
2015

A friendly introduction to Convolutional Neural Networks and Image Recognition, 32m https://youtu.be/2-Ol7ZB0MmU]

---

## Batch normalization

**Batch normalization (often shortened to "batch norm") is a technique used in convolutional neural networks (CNNs) to improve training speed and stability.** It's a type of layer (`layer_batch_normalization()` in Keras).

* **Normalization:** Batch normalization normalizes the output of a layer by adjusting the activations to have zero mean and unit variance. This is similar to the data normalization you might do before feeding data into a model, but batch norm does it within the network, after each layer.

---

## Batch normalization

* **Adaptive Normalization:** The key feature of batch normalization is that it **adaptively normalizes data even as the mean and variance change over time during training**.

* **During training**, it uses the mean and variance of the **current batch of data** to normalize samples.

* **During inference** (when you're using the model to make predictions and might not have a large batch of data), it uses an **exponential moving average of the batch-wise mean and variance** calculated during training.

---

## Batch normalization

* **Improved Gradient Propagation:** Batch normalization helps gradients flow more smoothly through the network, which is especially important in deep CNNs. This allows you to train deeper networks more effectively.

* **Deeper Networks:** The improved gradient propagation allows for the training of deeper networks that might otherwise be difficult to train due to vanishing gradients.

* **Smoother Training:** Batch normalization often leads to smoother, more stable training with faster convergence.

You can use `layer_batch_normalization()` after a variety of layers, including `layer_dense()` and `layer_conv_2d()`.
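
---

## Batch normalization

The training-time and inference-time behavior of batch normalization can be sketched in base R. This is a simplified illustration rather than the Keras implementation: it omits the learned scale and offset parameters, the toy batch `x` and the sample `new_sample` are hypothetical, and `momentum = 0.99` and `eps = 1e-3` are chosen here to mirror the Keras defaults.

```r
# Hypothetical toy batch: 4 samples (rows) x 2 features (columns)
x <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40), ncol = 2)

eps <- 1e-3       # small constant that avoids division by zero
momentum <- 0.99  # weight given to the previous moving statistics

# Training: normalize with the statistics of the current batch
batch_mean <- colMeans(x)
batch_var <- colMeans(sweep(x, 2, batch_mean)^2)  # per-feature population variance
x_train <- sweep(sweep(x, 2, batch_mean), 2, sqrt(batch_var + eps), "/")

# Update the moving statistics (initialized at mean 0 and variance 1)
moving_mean <- momentum * c(0, 0) + (1 - momentum) * batch_mean
moving_var <- momentum * c(1, 1) + (1 - momentum) * batch_var

# Inference: normalize a new sample with the moving statistics instead
new_sample <- matrix(c(2, 25), nrow = 1)
x_infer <- sweep(sweep(new_sample, 2, moving_mean), 2, sqrt(moving_var + eps), "/")
```

After many training batches the moving statistics approach the statistics of the whole training set, which is why they can stand in for batch statistics at inference time.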
---

## Batch normalization

**Important Considerations:**

* **Bias Vector:** Since batch normalization centers the output of a layer on zero, the bias vector in the preceding layer becomes redundant. You can often create these layers without a bias vector using `use_bias = FALSE`.

* **Activation Placement:** While there's some debate, it's generally recommended to place the activation function *after* the batch normalization layer. This helps maximize the utilization of activation functions like ReLU.

* **Fine-tuning:** When fine-tuning a pre-trained model that includes batch normalization layers, it's often advisable to **freeze these layers** (i.e., make them non-trainable) to prevent them from updating their mean and variance, which could interfere with the fine-tuning process.

---

## Depth-wise separable convolution layers

.pull-left[
- Implemented as `layer_separable_conv_2d()`.

- Performs a spatial convolution on each channel of its input, independently, before mixing output channels via a pointwise convolution (a `\(1 \times 1\)` convolution).

- It requires significantly fewer parameters and involves fewer computations, thus resulting in smaller, speedier models.
]

.pull-right[<img src="img/layer_separable_conv.png" height=350>]

.small[https://towardsdatascience.com/review-xception-with-depthwise-separable-convolution-better-than-inception-v3-image-dc967dd42568]

---

## Data augmentation

- Given infinite data, your model would be exposed to every possible aspect of the data distribution at hand: you would never overfit.

- Data augmentation takes the approach of generating more training data from existing training samples, by augmenting the samples via a number of random transformations that yield believable-looking images.

- The goal is that at training time, your model will never see the exact same picture twice. This helps expose the model to more aspects of the data and generalize better.
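
---

## Data augmentation

The idea of random, believable transformations can be illustrated in base R on a toy image. This sketch is not the Keras augmentation API: the helper names `flip_horizontal()`, `shift_right()`, and `augment()` and the 3 x 4 matrix are hypothetical, and only a horizontal flip and a small horizontal shift are shown.

```r
# Toy 3 x 4 grayscale "image" (hypothetical pixel values)
img <- matrix(1:12, nrow = 3)

# Mirror the image left-to-right by reversing the column order
flip_horizontal <- function(m) m[, ncol(m):1, drop = FALSE]

# Shift the image k columns to the right, filling new pixels with `fill`
shift_right <- function(m, k, fill = 0) {
  if (k == 0) return(m)
  cbind(matrix(fill, nrow(m), k), m[, 1:(ncol(m) - k), drop = FALSE])
}

# Apply each transformation at random, so repeated calls yield different variants
augment <- function(m, max_shift = 1) {
  if (runif(1) < 0.5) m <- flip_horizontal(m)
  shift_right(m, sample(0:max_shift, 1))
}

set.seed(42)
augmented <- augment(img)  # same shape as img, randomly transformed
```

Calling `augment(img)` anew for every training epoch is what keeps the model from seeing exactly the same picture twice.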
---

## Data augmentation

Arguments of `image_data_generator()` (legacy `keras` R package) that control the random transformations:

- `rotation_range` is a value in degrees (0–180), a range within which to randomly rotate pictures.

- `width_shift_range` and `height_shift_range` are fractions of total width or height within which to randomly translate pictures horizontally or vertically.

- `shear_range` is for randomly applying shearing transformations.

- `zoom_range` is for randomly zooming inside pictures.

- `horizontal_flip` is for randomly flipping half the images horizontally.

- `fill_mode` is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift.

---

## Visualizing what a CNN learns

- Visualizing intermediate convnet outputs (intermediate activations)
    - Useful for understanding how successive convnet layers transform their input, and for getting a first idea of the meaning of individual convnet filters.

- Visualizing convnet filters
    - Useful for understanding precisely what visual pattern or concept each filter in a convnet is receptive to.

- Visualizing heatmaps of class activation in an image
    - Useful for understanding which parts of an image were identified as belonging to a given class, thus allowing you to localize objects in images.

---

## Evolution of CNNs

- [LeNet5](http://yann.lecun.com/exdb/lenet/) - handwritten digit recognition, developed by Yann LeCun in the 1990s. Two convolutional layers, two fully-connected hidden layers, and one fully-connected output layer, sigmoid activation function.

.center[<img src="img/LeNet5.jpg" height=200>]

- [AlexNet](https://en.wikipedia.org/wiki/AlexNet) - ImageNet classification winner, developed by [Alex Krizhevsky in 2012](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks). Five convolutional layers, two fully-connected hidden layers, and one fully-connected output layer, ReLU activation function.
.small[http://www.image-net.org/index

https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/]

---

## Evolution of CNNs

- **VGG networks** developed by the **Visual Geometry Group**.

- Introduced the block concept - a sequence of the following layers: (i) a convolutional layer (with padding to maintain the resolution), (ii) a nonlinearity such as a ReLU, (iii) a pooling layer such as a max-pooling layer.

- **VGG-11** - 8 convolutional layers (wrapped into 5 convolutional blocks) and 3 fully-connected layers.

- **VGG-16**, **VGG-19** - deeper architectures.

.small[http://www.robots.ox.ac.uk/~vgg/]

---

## GoogLeNet

- [GoogLeNet](https://paperswithcode.com/method/googlenet) - ImageNet 2014 classification winner, developed by [Christian Szegedy](https://arxiv.org/abs/1409.4842) and colleagues. Introduced the concept of the **inception block**.

.center[<img src="img/inception.png" height=250>]

.small[https://d2l.ai/chapter_convolutional-modern/googlenet.html

https://datascience.stackexchange.com/questions/14984/what-is-an-inception-layer]

---

## Inception block

- The inception layer covers a bigger area, but also keeps a fine resolution for small details in the images. It consists of four parallel paths:
    - The first three paths use convolutional layers with window sizes of `\(1\times1\)`, `\(3\times3\)`, and `\(5\times5\)` to extract information from different spatial sizes.
    - The middle two paths first perform a `\(1\times1\)` convolution on the input to reduce the number of input channels, reducing the model’s complexity.
    - The fourth path uses a `\(3\times3\)` maximum pooling layer, followed by a `\(1\times1\)` convolutional layer to change the number of channels. The four paths all use appropriate padding to give the input and output the same height and width.

- Finally, the outputs along each path are concatenated along the channel dimension and comprise the block’s output.
.small[https://d2l.ai/chapter_convolutional-modern/googlenet.html]

---

## Evolution of CNNs

- ResNet - ImageNet 2015 classification winner, developed by Kaiming He and colleagues. A 152-layer convolutional neural network built from repeated "residual" blocks.

- DenseNet - densely connected convolutional network, developed by Gao Huang and colleagues (CVPR 2017 best paper).

.center[<img src="img/cnn_evolution.png" height=250>]

.small[https://devopedia.org/imagenet | https://arxiv.org/abs/1512.03385 | https://arxiv.org/abs/1608.06993

https://towardsdatascience.com/review-senet-squeeze-and-excitation-network-winner-of-ilsvrc-2017-image-classification-a887b98b2883]

---

## Using pre-trained network

- CNNs comprise two parts: they start with a series of pooling and convolution layers, and they end with a densely connected classifier.
    - The first part is called the convolutional base.
    - The second part is the densely connected layers.

- The representations learned by the convolutional base are likely to be more generic and therefore more reusable: the feature maps of a convnet are presence maps of generic concepts over a picture.

- The densely connected layers utilize those representations to learn specific properties of the new input.

---

## Using pre-trained network

* Pretrained models are models previously trained on large datasets, often on large-scale image classification tasks like ImageNet.

* Using a pretrained model can significantly improve performance on small image datasets by leveraging the features learned from the larger dataset.

---

## Using pre-trained network

**Feature Extraction:** This approach uses the convolutional base of a pretrained model to extract features from new samples, which are then used to train a new classifier.

* The convolutional base, consisting of the initial pooling and convolution layers, learns generic representations of the visual world, making its features reusable across different problems.
* The densely connected classifier of the pretrained model is typically not reused because it's specific to the original problem's classes.

* The level of generality and reusability of features from convolutional layers depends on their depth within the model.

* Earlier layers learn more generic features like edges and textures, while higher layers learn more abstract concepts.

<!--
- A pre-trained network is a saved network previously trained on a large dataset, typically on a large-scale image-classification task
- If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pre-trained network can effectively act as a generic model of the visual world
- Its features can prove useful for many different computer-vision problems, even though these new problems may involve completely different classes than those of the original task.
-->

---

## Using pre-trained network

Extend the model you have (`conv_base`) by adding dense layers on top and running the whole thing end-to-end on the input data.

- Computationally expensive, so practical only on a GPU. It allows data augmentation, which makes the model less prone to overfitting.

- Need to freeze the `conv_base` (preventing its weights from being updated during training). Freezing ensures that the learned representations from the pretrained model are preserved.

- Unfreezing some weights (typically, the last layers of `conv_base`) after the initial training with frozen weights allows fine-tuning the performance (slight adjustment of representations in `conv_base`).

---

## Fine-tuning a Pretrained Model

* This technique involves unfreezing some of the top layers of the frozen convolutional base and training both the added classifier and these unfrozen layers jointly.

* Fine-tuning is done after the added classifier has been trained on the extracted features.

* The unfrozen layers are typically those higher up in the convolutional base, as they encode more specialized features that benefit from fine-tuning.
* Fine-tuning too many layers can lead to overfitting.

* A low learning rate is used during fine-tuning to avoid damaging the pretrained representations.

---

## Using pre-trained network

The list of image-classification models (all pre-trained on the ImageNet dataset) that are available as part of Keras:

| **Model**       | **Keras3 Function in R**     |
|-----------------|------------------------------|
| **Xception**    | `application_xception()`     |
| **InceptionV3** | `application_inception_v3()` |
| **ResNet50**    | `application_resnet50()`     |
| **VGG16**       | `application_vgg16()`        |
| **VGG19**       | `application_vgg19()`        |
| **MobileNet**   | `application_mobilenet()`    |

---

## Convolutional Neural Network in genomics

.pull-left[
A simple scheme of a one-dimensional (1D) convolutional operation (a). Full representation of a 1D convolutional neural network for an SNP-matrix (b). The convolution outputs are represented in yellow. Pooling layers after convolutional operations combining the output of the previous layer at certain locations into a single neuron are represented in green. The final output is a standard MLP
]

.pull-right[
<img src="img/cnn_example.png" height=400>
]

.small[Pérez-Enciso and Zingaretti, “[A Guide for Using Deep Learning for Complex Trait Genomic Prediction](https://www.mdpi.com/2073-4425/10/7/553).”

Convolution layer animation and math, https://www.analyticsvidhya.com/blog/2020/02/mathematics-behind-convolutional-neural-network/]
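
---

## Convolutional Neural Network in genomics

The 1D convolution and pooling steps from the scheme can be sketched in base R. This is a toy illustration under stated assumptions, not the model from the cited paper: the genotype vector `snps` (coded as 0/1/2 allele counts), the filter weights, and the helper names `conv_1d()` and `max_pool_1d()` are all hypothetical, and in a real network the filter weights would be learned.

```r
# Hypothetical SNP genotypes for one individual, coded as 0/1/2 allele counts
snps <- c(0, 1, 2, 2, 0, 1, 0, 2)

# Hypothetical 1D filter of width 3 (learned during training in a real network)
filter <- c(1, 0, -1)

# Slide the filter along the sequence and take a weighted sum at each position
conv_1d <- function(x, w, stride = 1) {
  starts <- seq(1, length(x) - length(w) + 1, by = stride)
  sapply(starts, function(s) sum(x[s:(s + length(w) - 1)] * w))
}

# Keep the maximum of each non-overlapping window of `size` values
max_pool_1d <- function(x, size = 2) {
  starts <- seq(1, length(x) - size + 1, by = size)
  sapply(starts, function(s) max(x[s:(s + size - 1)]))
}

feature_map <- conv_1d(snps, filter)   # -2 -1  2  1  0 -1
pooled <- max_pool_1d(feature_map)     # -1  2  0
```

The pooled values would then be flattened and passed to the fully-connected (MLP) part of the network.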