class: center, middle, inverse, title-slide .title[ # Generative Deep Learning ] .subtitle[ ## Image generation ] .author[ ### Mikhail Dozmorov ] .institute[ ### Virginia Commonwealth University ] .date[ ### 2025-04-23 ] --- ## Image generation * **Image generation** with deep learning involves learning **latent spaces of images** and sampling from them to create new images. **Generative Adversarial Networks** and **Variational Autoencoders (VAEs)** are key techniques for this. - The module capable of realizing this mapping, taking as input a latent point and outputting an image (a grid of pixels), is called a _generator_ (in the case of GANs) or a _decoder_ (in the case of VAEs). - Once such a latent space has been developed, you can sample points from it, either deliberately or at random, and, by mapping them to image space, generate images that have never been seen before. --- ## Generative adversarial networks (GANs) - Unsupervised learning models that aim to generate data points that are indistinguishable from the observed ones. - Aim to learn the data-generating process. - GANs were proposed as a radically different approach to generative modeling that involves two neural networks, a **discriminator** and a **generator** network. They are trained jointly, whereby the generator aims to generate realistic data points, and the discriminator classifies whether a given sample is real or generated by the generator. - You won’t have to design a loss function. It might take a while, but the GAN will figure out its own evaluation rules. .small[Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 
“[Generative Adversarial Networks](http://arxiv.org/abs/1406.2661)” ArXiv, 2014 ] --- ## Generative adversarial networks (GANs) .center[<img src="img/gan.png" height=450>] .small[https://www.analyticsvidhya.com/blog/2020/01/generative-models-gans-computer-vision/] <!-- ## Generative adversarial networks (GANs) .pull-left[<img src="img/gan_generator.png" height=300>] .pull-right[<img src="img/gan_discriminator.png" height=300>] <br> We train the model, calculate the loss function at the end of the discriminator network and backpropagate the loss into both discriminator and generator models. .small[https://www.analyticsvidhya.com/blog/2020/01/generative-models-gans-computer-vision/] --> --- ## Applications of GANs - **Music**: Create new melodies, generate accompaniments, or compose entire pieces in various genres. https://openai.com/index/musenet/ - **Text**: Generate coherent and contextually relevant paragraphs, poems, or even entire articles. https://chat.openai.com/ - **Speech**: Synthesize realistic human speech from text inputs, mimicking different voices, accents, and emotions. WaveNet by Google DeepMind --- ## Applications of GANs **Image**: Generate images, music, or other data forms based on textual descriptions. https://openart.ai - **Image Super-Resolution**: Enhance the resolution of low-quality images by generating missing details. - **Inpainting**: Fill in missing or damaged parts of an image seamlessly. - **Denoising**: Remove noise from images while preserving important details. - **Artistic Style Transfer**: Apply the artistic style of one image to another, creating unique and visually appealing results. --- ## Applications of GANs **Security** - **Data Augmentation**: Generate synthetic data to improve the training of machine learning models, enhancing their robustness and generalization. - **Adversarial Attacks and Defense**: Develop and defend against adversarial examples that can fool machine learning models. 
- **Privacy Preservation**: Generate synthetic yet realistic data to protect sensitive information while maintaining data utility. .small[https://www.analyticsvidhya.com/blog/2019/04/top-5-interesting-applications-gans-deep-learning/ https://adeshpande3.github.io/Deep-Learning-Research-Review-Week-1-Generative-Adversarial-Nets ] --- ## DeepDream **DeepDream is an artistic image-modification technique** that leverages the representations learned by convolutional neural networks (convnets). .center[ <img src="img/deepdream_ibis.png" height=300> ] .small[ https://github.com/google/deepdream https://deepdreamgenerator.com/] --- ## DeepDream * The algorithm works by essentially **running a convnet in reverse**. * It is closely related to the **convnet filter-visualization technique**, which involves performing gradient ascent on the input of a convnet to maximize the activation of a specific filter in an upper layer. --- ## DeepDream DeepDream differs from filter visualization in a few key ways: * Instead of maximizing the activation of a **specific filter**, DeepDream aims to **maximize the activation of entire layers**. This leads to a mixture of visualizations from numerous features simultaneously. * DeepDream starts with an **existing image** as the input, rather than a blank or noisy image. This causes the resulting effects to attach to the pre-existing visual patterns and distort them in an artistic manner. * The input images are processed at **different scales, known as octaves**, which enhances the quality of the visualizations. --- ## DeepDream * DeepDream starts with an **existing image** as the input, rather than a blank or noisy image. This causes the resulting effects to attach to the pre-existing visual patterns and distort them in an artistic manner. * The input images are processed at **different scales, known as octaves**, which enhances the quality of the visualizations. 
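A minimal sketch of how the octave scales could be derived, assuming three octaves and a scale factor of 1.4 between them. Both values and the helper name `octave_shapes` are illustrative assumptions, not settings taken from these slides. It lists the image sizes from the smallest octave up to the original resolution:

```python
def octave_shapes(height, width, num_octaves=3, octave_scale=1.4):
    """Return (height, width) pairs, smallest octave first, original size last.

    num_octaves and octave_scale are assumed example values.
    """
    shapes = [(height, width)]
    for i in range(1, num_octaves):
        # Each additional octave shrinks the original by another factor of octave_scale.
        shapes.append((int(height / octave_scale ** i), int(width / octave_scale ** i)))
    return shapes[::-1]


shapes = octave_shapes(400, 600)
for shape in shapes:
    print(shape)
```

Processing would start at the first (smallest) shape and upscale the image toward the last one.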
.center[ <img src="img/deepdream_octaves.png" height=300> ] --- ## DeepDream * The core process involves performing **gradient ascent on the input image to maximize the activation of chosen layers** in a pretrained convnet. * The **DeepDream loss** is calculated as a **weighted mean of the L2 norm of the activations of a set of high-level layers**. The specific layers chosen and their weights significantly influence the visual outcome. .center[ <img src="img/deepdream_octaves.png" height=300> ] --- ## DeepDream * The algorithm processes the image over **octaves**. For each successive octave, the image is upscaled, and gradient ascent is performed. * To prevent the loss of image detail, **detail reinjection** is used: the difference between the original image and a lower-quality upscaled version of the original image is added back into the dream image. .center[ <img src="img/deepdream_octaves.png" height=300> ] --- ## DeepDream * Lower layers tend to produce **geometric patterns**, while higher layers can lead to recognizable **visual patterns related to objects the network was trained on** (e.g., dog eyes, bird feathers, if trained on ImageNet). * The results are often described as **trippy and full of pareidolia artifacts**, and they can be somewhat **similar to the visual artifacts experienced by humans due to the disruption of the visual cortex via psychedelics**. .center[ <img src="img/deepdream_example.jpeg" height=250> ] .small[ https://github.com/mftnakrsu/DeepDream ] --- ## GAN applications StyleGAN2 is a state-of-the-art network for generating realistic images. It was also explicitly trained to have disentangled directions in latent space, which allows efficient image manipulation by varying latent factors. .center[<img src="https://github.com/EvgenyKashin/stylegan2-distillation/raw/master/imgs/title.jpg" height=250>] .small[Viazovetskyi Y. 
et al., 2020, "[StyleGAN2 Distillation for Feed-forward Image Manipulation](https://arxiv.org/abs/2003.03581)", arXiv:2003.03581, https://github.com/EvgenyKashin/stylegan2-distillation Fake celebrity faces, https://medium.com/datadriveninvestor/artificial-intelligence-gans-can-create-fake-celebrity-faces-44fe80d419f7] --- ## Style transfer - Style transfer consists of creating a new image that preserves the contents of a target image while also capturing the style of a reference image. - **Content** can be captured by **the high-level activations of a convnet**. - **Style** can be captured by **the internal correlations of the activations of different layers** of a convnet. --- ## Neural style transfer * **Neural style transfer** is a deep-learning-driven image modification technique that applies the style of a reference image to a target image while conserving the content of the target image. * **Style** essentially refers to textures, colors, and visual patterns in the image at various spatial scales, while **content** is the higher-level macrostructure of the image. * The key notion is to **define a loss function** that specifies the goal of conserving content and adopting style, and then to **minimize this loss**. The loss function is generally represented as: `loss <- distance(style(reference_image) - style(combination_image)) +` ` distance(content(original_image) - content(combination_image))` --- ## Content Loss - The intuition behind content loss stems from how convolutional neural networks (convnets) learn to represent images. - **Earlier layers** in a convnet detect basic visual features like edges and corners, capturing **local information**. - As we go **deeper into the network**, the layers learn to recognize increasingly **complex and abstract features**, representing the **global content** and high-level structure of the image. 
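The loss expression above can be made concrete with a small sketch in plain Python. It assumes layer activations are already available as lists of numbers, uses a sum of squared differences as the `distance`, and uses channel-by-channel inner products (a Gram matrix) as one possible summary of style. The function names, weights, and toy activations are hypothetical, and normalization constants are omitted.

```python
def content_distance(base_features, combination_features):
    """Sum of squared differences between two flattened activation vectors."""
    return sum((b - c) ** 2 for b, c in zip(base_features, combination_features))


def gram_matrix(feature_maps):
    """feature_maps: one flat list of activations per channel.

    Entry [i][j] is the inner product of channel i with channel j,
    i.e. how strongly the two features activate together.
    """
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in feature_maps]
            for fi in feature_maps]


def style_distance(style_maps, combination_maps):
    """Sum of squared differences between the two Gram matrices."""
    gram_style = gram_matrix(style_maps)
    gram_comb = gram_matrix(combination_maps)
    return sum((s - c) ** 2
               for row_s, row_c in zip(gram_style, gram_comb)
               for s, c in zip(row_s, row_c))


# Toy activations (made up): two channels with four spatial positions each.
style_maps = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
combination_maps = [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]]

# Toy high-level activations (made up) for the content term.
base_content = [1.0, 2.0, 3.0]
combination_content = [1.0, 2.0, 5.0]

style_weight = 1.0    # assumed value
content_weight = 1.0  # assumed value
total = (style_weight * style_distance(style_maps, combination_maps)
         + content_weight * content_distance(base_content, combination_content))
print(total)
```

In an actual implementation the activations would come from a pretrained convnet, and the combination image's pixels would be updated to reduce `total`.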
--- ## Content Loss - To preserve the content of an original image in the generated (combination) image, the algorithm focuses on matching the **activations of a higher-level layer** of a pretrained convnet (like VGG19) for both images. - The idea is that these upper-layer activations represent what the network "sees" as the **high-level content** of the image. - By minimizing the difference (using a norm like L2) between the feature maps of the content image and the combination image at this chosen layer, we encourage the combination image to retain the same **objects and overall scene structure** as the content image. --- ## Content Loss - **Content loss** is typically the L2 norm between the activations of an **upper layer** in a pretrained convnet (like the `block5_conv2` layer of VGG19) computed over the target image and the generated (combination) image. - Essentially, we are telling the network: "Make sure the generated image still contains the same 'stuff' as the original content image, according to how a deep network understands 'stuff' at a high level." --- ## Style loss - Style is understood as the **textures, colors, and visual patterns** present in an image at **various spatial scales**. - To capture and transfer the style of a reference image, the style loss utilizes the activations from **multiple layers** of the convnet, spanning both low-level and high-level layers. - The key insight here is that the **style** of an image can be represented by the **correlations between the features** learned by these different layers. For example, a painter's characteristic brushstrokes might manifest as specific co-occurrences of certain low-level edge and color features across the image. --- ## Style loss - To mathematically represent these feature correlations within a given layer, the algorithm uses the **Gram matrix**. 
- The Gram matrix computes the inner product of the feature maps of a layer, providing a measure of how much different features tend to activate together. - These correlations capture the **statistical properties of the textures and patterns** at the spatial scale represented by that layer. --- ## Style loss - The style loss is then calculated by minimizing the difference (again, using a norm like L2) between the **Gram matrices** of the style-reference image and the combination image for the chosen set of layers. - By doing this across multiple layers, we ensure that the combination image adopts **similar feature correlations** to the style image at different levels of abstraction. This forces the generated image to have **textures and visual patterns** that are statistically similar to those of the style image. - In essence, we are telling the network: "Make sure the generated image has the same kind of 'feel' and 'look' as the style image, by matching the statistical relationships between the learned features across different levels of visual representation." --- ## Final loss function * A **total variation loss** is often added, operating on the pixels of the generated combination image to encourage spatial continuity and avoid overly pixelated results. It acts as a regularization loss. * The final **loss function** that is minimized is a **weighted average** of the content loss, the style loss (summed over multiple style layers), and the total variation loss. - The weights (`content_weight`, `style_weight`, `total_variation_weight`) can be tuned to achieve different results. --- ## Style transfer implementation * The style transfer can be implemented using a pretrained convnet like **VGG19**. 
- The process involves setting up a network to compute layer activations for the style, content, and combination images, defining the loss based on these activations, and then using **gradient descent** (or other optimization algorithms like L-BFGS, though SGD is used in the example) to minimize this loss by updating the pixels of the combination image. - A **learning rate schedule** can be used to gradually decrease the learning rate during optimization. --- ## Style transfer implementation * Neural style transfer is essentially a form of **image retexturing or texture transfer**. - It works best with strongly textured style images and content targets that don't require very fine details. * The original algorithm can be **slow to run**, but the transformation can be learned by a faster feed-forward convnet if appropriate training data (input-output pairs generated by the slower method) is available, leading to **fast style transfer**. --- ## CycleGAN: domain transformation CycleGAN learns transformation across domains with unpaired data .center[<img src="https://junyanz.github.io/CycleGAN/images/teaser_high_res.jpg" height=400>] .small[https://junyanz.github.io/CycleGAN/] --- ## CycleGAN: domain transformation .center[<iframe width="672" height="378" src="https://www.youtube.com/embed/9reHvktowLY" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>] .small[https://junyanz.github.io/CycleGAN/ https://interestingengineering.com/elon-musks-deepfake-video-of-singing-soviet-space-song-breaks-the-internet] --- ## Autoencoders - Autoencoder (**Auto**matically **encoding** data) is an unsupervised neural network trained to reconstruct the input. - One or more bottleneck layers have lower dimensionality than the input, which leads to compression of data and forces the autoencoder to extract useful features and omit unimportant features in the reconstruction. 
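As a rough illustration of the bottleneck idea, the sketch below trains a tiny linear autoencoder in plain Python: 2-D inputs are squeezed through a 1-D code and then reconstructed. The toy data, initial weights, learning rate, and number of steps are all assumptions chosen for the example; a real model would use a deep learning framework and nonlinear layers.

```python
# Toy data: 2-D points on the line x2 = 2 * x1, so one number can describe each point.
data = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

enc = [0.5, 0.5]      # encoder weights: code h = enc[0]*x1 + enc[1]*x2
dec = [0.5, 0.5]      # decoder weights: reconstruction = (dec[0]*h, dec[1]*h)
learning_rate = 0.05  # assumed value


def reconstruction_loss(enc, dec, data):
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        h = enc[0] * x[0] + enc[1] * x[1]
        total += sum((x[j] - dec[j] * h) ** 2 for j in range(2))
    return total / len(data)


losses = [reconstruction_loss(enc, dec, data)]
for step in range(200):
    grad_enc = [0.0, 0.0]
    grad_dec = [0.0, 0.0]
    for x in data:
        h = enc[0] * x[0] + enc[1] * x[1]                # encode
        residual = [dec[j] * h - x[j] for j in range(2)]  # decode and compare
        grad_h = 2.0 * sum(residual[j] * dec[j] for j in range(2))
        for j in range(2):
            grad_dec[j] += 2.0 * residual[j] * h / len(data)
            grad_enc[j] += grad_h * x[j] / len(data)
    # Plain full-batch gradient descent on both sets of weights.
    enc = [enc[j] - learning_rate * grad_enc[j] for j in range(2)]
    dec = [dec[j] - learning_rate * grad_dec[j] for j in range(2)]
    losses.append(reconstruction_loss(enc, dec, data))

print(losses[0], losses[-1])
```

Because the toy data is effectively one-dimensional, a single bottleneck unit is enough to reconstruct it; data with more independent directions than bottleneck units could only be reconstructed approximately.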
.center[<img src="img/keras_autoencoders_applications.png" height=200>] .small[https://www.pyimagesearch.com/2020/02/17/autoencoders-with-keras-tensorflow-and-deep-learning/] --- ## Autoencoders - Autoencoders learn a **compressed representation** of the input data by reconstructing it at the output of the network. - Goal: capture the structure of the data `\(x\)` (i.e., intrinsic relationships between the data variables) in a low-dimensional latent space `\(z\)`, which allows for more accurate downstream analyses. - Applications: Dimensionality reduction; Data denoising; Compression and data generation. .center[<img src="img/autoencoder1.png" height=200>] --- ## Basic autoencoder network .center[<img src="img/autoencoder2.png" height=200>] - This network is trained in such a way that the features ( `\(z\)` ) can be used to reconstruct the original input data ( `\(x\)` ). - If the output ( `\(\hat{x}\)` ) is different from the input ( `\(x\)` ), the loss penalizes the difference, which pushes the network toward reconstructing the input data. --- ## How an autoencoder learns - Image denoising problem - removing noise from images. .center[<img src="img/autoencoder3.png" height=400>] .small[https://www.analyticsvidhya.com/blog/2020/02/what-is-autoencoder-enhance-image-resolution/] --- ## Autoencoder calculations - The model contains an encoder function `\(f(\cdot)\)` parameterised by `\(\theta\)` and a decoder function `\(g(\cdot)\)` parameterised by `\(\phi\)`. - The lower dimensional embedding learned for an input `\(x\)` in the bottleneck layer is `\(h = f_{\theta}(x)\)` and the reconstructed input is `\(x' = g_{\phi}(f_{\theta}(x))\)`. 
- The parameters `\(\theta,\phi\)` are learned together to output a reconstructed data sample that is ideally the same as the original input: `\(x \approx g_{\phi}(f_{\theta}(x))\)`. - There are various metrics used to quantify the error between the input and output such as cross-entropy (CE) or simpler metrics such as mean squared error: `\(L_{AE}(\theta,\phi) = \frac{1}{n}\sum_{i=1}^n(x_i - g_{\phi}(f_{\theta}(x_i)))^2\)` --- ## Autoencoder variants - The main challenge when designing an autoencoder is its sensitivity to the input data. - While an autoencoder should learn a representation that embeds the key data traits as accurately as possible, it should also be able to encode traits which generalize beyond the original training set and capture similar characteristics in other data sets. - However, the embedding spaces learned by autoencoders are not continuous and generalize poorly. --- ## Autoencoder variants - Several variants have been proposed since autoencoders were first introduced. - These variants mainly aim to improve generalization and disentanglement, and to adapt autoencoders to sequence inputs. - Some significant examples include the **Denoising Autoencoder** (DAE), **Sparse Autoencoder** (SAE), and more recently the **Variational Autoencoder** (VAE). .small[ [Vincent et al., 2008, Extracting and Composing Robust Features with Denoising Autoencoders](https://www.cs.toronto.edu/~larocheh/publications/icml-2008-denoising-autoencoders.pdf) [Makhzani and Frey, 2014, k-Sparse Autoencoders](https://arxiv.org/pdf/1312.5663.pdf) [Kingma and Welling, 2014, Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114) ] --- ## Variational Autoencoder * VAEs learn **well-structured, continuous latent spaces** where specific directions can encode meaningful axes of variation in the data. This makes them suitable for **image editing via concept vectors**. * VAEs blend deep learning with Bayesian inference. 
Unlike classical autoencoders that map an image to a fixed latent vector, a VAE maps an image to the **parameters of a statistical distribution** (a mean and a variance) in the latent space. .center[<img src="img/vae.png" height=200>] .small[https://www.analyticsvidhya.com/blog/2020/01/generative-models-gans-computer-vision/] --- ## Variational Autoencoder - The assumption is that the input image has been generated by a statistical process, and that the randomness of this process should be taken into account during encoding and decoding. .center[<img src="img/vae.png" height=200>] .small[https://www.analyticsvidhya.com/blog/2020/01/generative-models-gans-computer-vision/] --- ## Variational Autoencoder * The VAE process involves three main parts: * **Encoder:** Takes an input image and maps it to the **mean (`z_mean`) and log-variance (`z_log_var`)** of a probability distribution in the latent space. * **Sampling Layer:** Takes `z_mean` and `z_log_var` and **randomly samples a point `z`** from the latent normal distribution. The formula used is `z = z_mean + exp(0.5 * z_log_var) * epsilon`, where `epsilon` is a random tensor drawn from a standard normal distribution and the factor 0.5 converts the log-variance into a standard deviation. This stochasticity forces the latent space to be continuously meaningful. * **Decoder:** Takes the sampled latent point `z` and **maps it back to an image** with the same dimensions as the original input. .center[<img src="img/vae.png" height=200>] .small[https://www.analyticsvidhya.com/blog/2020/01/generative-models-gans-computer-vision/] --- ## Variational Autoencoder - The VAE then uses the mean and variance parameters to randomly sample one element of the distribution and decodes that element back to the original input. * VAEs are trained using **two loss functions**: * **Reconstruction Loss:** Forces the decoded samples to **match the initial input images**. Binary cross-entropy is used as the reconstruction loss in the MNIST example. 
* **Regularization Loss (KL Divergence):** Helps learn **well-rounded latent distributions** and reduces overfitting by nudging the distribution of the encoder output towards a standard normal distribution centered around 0. --- ## VAE applications * The **continuity and structure** of the learned latent space in VAEs enable interesting applications like: * **Concept vectors for image editing:** Identifying directions in the latent space that correspond to meaningful changes in the image (e.g., adding a smile). * Generating **continuous spaces of images** where morphing between different images is smooth and natural. * Creating **latent-space-based animations**. * While **GANs** can generate highly realistic images, the latent spaces they learn may lack the same level of structure and continuity as those learned by VAEs. Practical image applications often favor VAEs, although GANs are popular in academic research. --- ## Deep Belief Networks as Generative Networks - **Layered Probabilistic Model:** - DBNs consist of multiple layers of **Restricted Boltzmann Machines (RBMs)** stacked on top of each other. - Each layer learns a probabilistic representation of the input data. - **Generative Capability:** - DBNs **learn the probability distribution** of training data. - Can generate new samples by sampling from the top-level RBM and propagating downward. - **Applications as a Generative Model:** - **Image Generation:** By learning feature hierarchies. - **Speech & Audio Modeling:** Generate realistic audio waveforms. - **Feature Learning:** Pretraining for deep networks before fine-tuning. --- ## Deep Belief Networks as Generative Networks Comparison with Other Generative Models: - DBNs **explicitly model data distributions**, while GANs use an adversarial approach. - DBNs are less effective at generating high-resolution images compared to GANs. - VAEs use **latent space sampling**, while DBNs use **RBMs for hierarchical learning**. 
- VAEs are more common in modern generative modeling. .small[http://www.scholarpedia.org/article/Deep_belief_networks]
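To illustrate how sampling from an RBM could work, here is a minimal sketch of alternating (Gibbs) sampling in a single binary RBM. The weights, biases, starting vector, and number of steps are made-up values for illustration, not a trained model; in a DBN, a sample from the top-level RBM would then be propagated down through the lower layers.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hidden(visible, weights, hidden_bias, rng):
    """Sample binary hidden units given the visible units."""
    hidden = []
    for j in range(len(hidden_bias)):
        activation = hidden_bias[j] + sum(
            visible[i] * weights[i][j] for i in range(len(visible)))
        hidden.append(1 if rng.random() < sigmoid(activation) else 0)
    return hidden


def sample_visible(hidden, weights, visible_bias, rng):
    """Sample binary visible units given the hidden units."""
    visible = []
    for i in range(len(visible_bias)):
        activation = visible_bias[i] + sum(
            weights[i][j] * hidden[j] for j in range(len(hidden)))
        visible.append(1 if rng.random() < sigmoid(activation) else 0)
    return visible


rng = random.Random(0)
# Hypothetical parameters: 3 visible units, 2 hidden units.
weights = [[2.0, -1.0], [2.0, -1.0], [-1.0, 2.0]]
visible_bias = [0.0, 0.0, 0.0]
hidden_bias = [0.0, 0.0]

visible = [1, 0, 1]  # arbitrary starting state
for _ in range(10):
    hidden = sample_hidden(visible, weights, hidden_bias, rng)
    visible = sample_visible(hidden, weights, visible_bias, rng)

print(visible)
```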