class: center, middle, inverse, title-slide

.title[
# Generative Deep Learning
]
.subtitle[
## Text generation
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-04-23
]

---
## Generative adversarial networks (GANs)

> The most important [recent development], in my opinion, is adversarial training (also called GAN for Generative Adversarial Networks). This is an idea that was originally proposed by Ian Goodfellow when he was a student with Yoshua Bengio at the University of Montreal (he since moved to Google Brain and recently to OpenAI).

> This, and the variations that are now being proposed, is the most interesting idea in the last 10 years in ML, in my opinion.

.right[Yann LeCun]

.small[https://danieltakeshi.github.io/2017/03/05/understanding-generative-adversarial-networks/]

---
## Sequence data generation

* The universal method for generating sequence data in deep learning involves **training a model** (typically a **Transformer or an RNN**) to **predict the next token or the next few tokens in a sequence**.

* The model uses the **previous tokens as input** for this prediction. For example, given "the cat is on the," the model is trained to predict "mat".

* Tokens are usually **words or characters**, especially when dealing with text data.

---
## Sequence data generation

* A network capable of modeling the probability of the next token given the preceding ones is called a **language model**.

* A language model **captures the latent space of language** and its **statistical structure**.

* Once a language model is trained, you can **sample from it to generate new sequences**.

* This involves **feeding the model an initial string of text** (known as **conditioning data**).

---
## Sequence data generation

* The model is then asked to **generate the next character or word** (or even several tokens at once).

* The **generated output is added back to the input data**, and this process is **repeated multiple times**.

* This iterative loop allows for the generation of sequences of **arbitrary length** that reflect the **structure of the training data**, often resembling human-written sentences.

.center[<img src="img/generative_text_12_1.png" height=220>]

---
## Sequence data generation

The **sampling strategy** for choosing the next token is crucial.

* A naive approach is **greedy sampling**, where the most likely next character is always chosen, but this often leads to repetitive and predictable strings.

* A more interesting method is **stochastic sampling**, which introduces randomness by sampling from the probability distribution for the next character.

---
## Sequence Data Generation: Controlling Randomness with Temperature

A parameter called the **softmax temperature** can be used to control the amount of stochasticity (randomness) in the sampling process.

Let `\(z_i\)` be the logit (unnormalized score) for token `\(i\)`. The temperature-scaled softmax is computed as:

`$$P(i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$`

where `\(T\)` is the **temperature** parameter. `\(T = 1\)` gives the standard softmax, `\(T < 1\)` sharpens the distribution (less randomness), `\(T > 1\)` flattens the distribution (more randomness). A minimal sampling sketch follows on the next slide.
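---
## Sampling with Temperature: A Minimal Sketch

A minimal NumPy sketch of the reweighting above (an illustration, not code from the course materials; `sample_next_token` is a made-up helper name): divide the log-probabilities by `T`, renormalize with a softmax, and sample the next token from the reweighted distribution.

```python
import numpy as np

def sample_next_token(probs, temperature=1.0):
    """Reweight next-token probabilities by `temperature` and sample an index."""
    probs = np.asarray(probs, dtype="float64")
    logits = np.log(probs + 1e-10) / temperature   # back to (scaled) logits
    exp_logits = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    reweighted = exp_logits / np.sum(exp_logits)   # temperature-scaled softmax
    return np.random.choice(len(reweighted), p=reweighted)

p = [0.5, 0.3, 0.15, 0.05]             # toy 4-token vocabulary
sample_next_token(p, temperature=0.2)  # almost always token 0 (near-greedy)
sample_next_token(p, temperature=2.0)  # much more random choices
```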
---
## Sequence Data Generation: Controlling Randomness with Temperature

* **Higher temperatures** lead to more surprising and unstructured generated data due to sampling from distributions with higher entropy.

  * **Makes the model more likely to sample less probable words.** This leads to more surprising, diverse, and potentially creative output.

  * However, **very high temperatures can lead to generated text that loses coherence and appears largely random**.

* **Lower temperatures** result in less randomness and more predictable generated data from distributions with lower entropy.

  * **Makes the model more likely to stick to the most probable next words.** This can lead to more repetitive and sometimes boring text.

  * **Very low temperatures can lead to the model getting stuck in loops**.

---
## LSTMs as Generative Networks

- LSTMs trained on collections of text can be run to generate text: predict the next token(s) given previous tokens.

- LSTMs are better for structured, sequential tasks, e.g., text; GANs excel in image synthesis.

- **Text/Code Generation:** Story writing, chatbot responses, AI-assisted programming.

- **Music Generation:** Composing melodies, generating polyphonic music.

- **Image Captioning:** Generating textual descriptions from images.

A minimal model sketch follows on the next slide.
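---
## LSTMs as Generative Networks: A Minimal Sketch

One way to set up a character-level LSTM language model in Keras. This is an illustrative sketch, not the course code; the vocabulary size and window length are assumed values. Generation then repeatedly predicts on the current window and samples the next character using the temperature trick shown earlier.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 80    # number of distinct characters (assumed)
seq_length = 60    # input window length in characters (assumed)

# Character-level language model: embed, run an LSTM, predict the next character.
model = keras.Sequential([
    layers.Input(shape=(seq_length,), dtype="int32"),
    layers.Embedding(input_dim=vocab_size, output_dim=128),
    layers.LSTM(256),
    layers.Dense(vocab_size, activation="softmax"),  # P(next char | previous chars)
])
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```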
---
## A brief history of generative deep learning for sequence generation

* The **LSTM algorithm**, which enabled successful sequence data generation with recurrent networks, was developed in **1997**.

* Early on, the LSTM algorithm was used to **generate text character by character**.

* In **2002**, **Douglas Eck** applied **LSTM to music generation** for the first time, showing promising results.

---
## A brief history of generative deep learning for sequence generation

* In the late 2000s and early 2010s, **Alex Graves** did important pioneering work using recurrent networks for sequence data generation, notably his **2013 work on generating human-like handwriting** using recurrent mixture density networks.

* Between **2015 and 2017**, recurrent neural networks were successfully used for various generative tasks, including **text and dialogue generation, music generation, and speech synthesis**.

---
## A brief history of generative deep learning for sequence generation

* Around **2017–2018**, the **Transformer architecture** began to replace recurrent neural networks for generative sequence models, particularly for **language modeling (word-level text generation)**.

* A well-known example of a generative Transformer is **GPT-3**, a large language model trained by OpenAI, which gained attention in **2020** for its ability to generate plausible-sounding text on almost any topic.

* **GPT-4 (2023, OpenAI)** – An improved version of GPT-3, demonstrating stronger reasoning, factual accuracy, and multimodal capabilities (accepting text and image inputs).

---
## Latest Advances in Large Language Models (LLMs)

* **Gemini 1.5 (2024, Google DeepMind)** – A multimodal LLM with **longer context memory** (up to 1 million tokens), significantly improving code and document understanding.

* **Claude 3 (2024, Anthropic)** – Focused on safety and interpretability, Claude 3 exhibits near-GPT-4 performance while being more efficient.

* **Mistral & Mixtral (2023, Mistral AI)** – Open-weight LLMs with **efficient mixture-of-experts (MoE)** architectures, balancing accuracy and inference speed.

* **Llama 3 (2024, Meta AI)** – The next generation of Meta’s **open-source** language models, designed for improved efficiency and multilingual support.

---
## A Transformer-based sequence-to-sequence model

* We will train a model to **predict a probability distribution over the next word in a sentence**, given a number of initial words.

* The model takes as **input a sequence of N words** (indexed from 1 to N).

* The model aims to **predict the sequence offset by one** (from 2 to N+1).

* We employ **causal masking** to ensure that when predicting the word at position `i + 1`, the model only uses words from position 1 to `i`.

* This allows the model to be trained to solve **N mostly overlapping but different problems**: predicting the next word given a sequence of 1 to N prior words.

* This also enables the model to **start predicting with fewer than N words** at generation time.

A small sketch of the shifted targets and the causal mask follows on the next slide.
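---
## Shifted Targets and Causal Masking: A Minimal Sketch

A small NumPy illustration of the setup above (the token IDs are made up, not from the course dataset): the targets are the input sequence offset by one position, and a lower-triangular mask ensures that the prediction at position `i + 1` only sees positions 1 to `i`.

```python
import numpy as np

tokens = np.array([12, 7, 95, 3, 41, 8])  # a toy encoded sentence (hypothetical IDs)
N = 5                                     # model input length

# Inputs are tokens 1..N; targets are the same sequence offset by one (2..N+1).
x = tokens[:N]        # [12,  7, 95,  3, 41]
y = tokens[1:N + 1]   # [ 7, 95,  3, 41,  8]

# Causal mask: entry (i, j) == 1 means position j is visible when predicting
# from position i, so only j <= i (no peeking at future tokens).
causal_mask = np.tril(np.ones((N, N), dtype="int32"))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```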