class: center, middle, inverse, title-slide

.title[
# Mathematical Foundations
]
.subtitle[
## Data representations for neural networks
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2025-01-27
]

---
## Data representation in R

Scalars (0D), Vectors (1D), Matrices (2D), Higher-Dimensional Arrays


``` r
pi
```

```
## [1] 3.141593
```

``` r
month.abb
```

```
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
```

``` r
head(mtcars)
```

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

---
## Data Representations for Neural Networks

- Neural networks process and learn from data in the form of tensors - the fundamental data structures in deep learning, analogous to multidimensional arrays in R.
- Tensors have dimensions: Scalars (0D), Vectors (1D), Matrices (2D), Higher-Dimensional Tensors.
- Key Attributes of Tensors
  - **Rank (Order)** - Number of axes or dimensions (e.g., scalar = 0D, vector = 1D, matrix = 2D).
  - **Data Type (dtype)** - Type of data contained, e.g., float32, int32, character.

---
## Vector Data (2D tensor)

- Definition: Each sample (data point) in vector data is represented as a 1D tensor (vector).
- Shape: (samples, features), where samples are the rows and features are the columns.
- Example: A dataset of patient records where each patient's features (age, height, weight, blood pressure, etc.) form a vector (patients x features).
- Another example: a collection of news articles represented by the counts of words from a dictionary of 20,000 words (news articles x 20,000 words).
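A minimal base R sketch of such a 2D tensor, using hypothetical patient values:


``` r
# 4 samples (patients) x 3 features (age, weight, systolic blood pressure)
patients <- matrix(c(34, 70, 120,
                     51, 82, 135,
                     42, 65, 118,
                     60, 90, 142),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(NULL, c("age", "weight", "sbp")))
dim(patients) # shape: (samples, features) = (4, 3)
```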
---
## Timeseries Data or Sequence Data (3D tensor)

- Definition: Timeseries or sequential data is represented as a 3D tensor.
- Shape: (samples, timesteps, features), where timesteps is the length of the sequence. By convention, the timestep axis is the second axis.
- Example: Financial stock prices over time, ECG readings, or language sequences (e.g., word embeddings).

.center[<img src="img/timeseries_tensor.jpg" height=230>]

.small[ http://dx.doi.org/10.1016/j.mlwa.2020.100013 ]

---
## Image Data (4D tensor)

- Definition: A single image is a 3D tensor; a batch of images is represented as a 4D tensor.
- Shape: For a batch of 32 color images, the shape would be (32, height, width, 3).
- Example: A grayscale image is represented as (height, width, 1) and a color image as (height, width, 3).

.center[<img src="img/image_tensor.png" height=230>]

.small[ https://livebook.manning.com/concept/deep-learning/shape ]

---
## Video Data (5D tensor)

- Definition: Video data adds another dimension for time (frames) and is represented as a 5D tensor.
- Shape: (samples, frames, height, width, channels). For a batch of 10 videos, each with 60 frames, the shape is (10, 60, 720, 1280, 3).
- Example: A video with 60 frames of 720p resolution (720, 1280) in RGB color contributes a (60, 720, 1280, 3) entry along the samples axis.

---
## Data Batches

- Large amounts of training data raise the question of how to process them efficiently.
- Neural networks process data in batches for efficiency and optimization during training.
- Instead of processing 1 image at a time, a model might process 128 images simultaneously, creating a 4D tensor of shape (128, 28, 28, 3).
- Advantages
  - Reduces memory usage and speeds up training through mini-batch gradient descent.

---
## Key Concepts of Neural Networks

* **Tensor Manipulation:**
  * Tensors, including those storing network state (variables).
  * Tensor operations like addition, `relu`, and `matmul`.
  * Backpropagation for gradient computation, handled by the `GradientTape` object.

* **High-Level Concepts:**
  * Layers combined into a model.
  * Loss function to define the feedback signal.
  * Optimizer to determine the learning process.
  * Metrics to evaluate model performance (e.g., accuracy).
  * A training loop for mini-batch stochastic gradient descent.

---
## Manipulating Tensors in R

You can manipulate tensors using base R arrays and libraries like TensorFlow for R.


``` r
# Create a 3x4x2 tensor
tensor_example <- array(1:24, dim = c(3, 4, 2))
# Print the tensor to visualize its structure
print(tensor_example)
```

```
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24
```
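The key tensor attributes map directly onto base R array functions; a quick sketch using the array above:


``` r
# Rank (order): the number of axes
length(dim(tensor_example)) # 3

# Shape: the size of each axis
dim(tensor_example) # 3 4 2

# Data type, analogous to a tensor's dtype
typeof(tensor_example) # "integer"
```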
---
## Manipulating Tensors in R

Reshaping a tensor column-wise


``` r
# order - The order in which elements of x should be read during the rearrangement.
# "C" means elements should be read in row-major order;
# "F" means elements should be read in column-major order.
tensorflow::array_reshape(tensor_example, dim = c(3, 8), order = "F")
```

```
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    4    7   10   13   16   19   22
## [2,]    2    5    8   11   14   17   20   23
## [3,]    3    6    9   12   15   18   21   24
```

``` r
tensorflow::array_reshape(tensor_example, dim = c(3, 8), order = "C")
```

```
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1   13    4   16    7   19   10   22
## [2,]    2   14    5   17    8   20   11   23
## [3,]    3   15    6   18    9   21   12   24
```

---
## Element-wise Operations

- Operations applied independently to each element of a tensor, performed element-by-element between tensors of the same shape.

* **Element-wise Addition:** The `+` operator or `tf$math$add()` performs element-wise addition `\(A + B = [a_{ij} + b_{ij}]\)`.
* **Element-wise Subtraction:** The `-` operator or `tf$math$subtract()` performs element-wise subtraction `\(A - B = [a_{ij} - b_{ij}]\)`.
* **Element-wise Multiplication (Hadamard Product):** The `*` operator or `tf$math$multiply()` performs element-wise multiplication `\(A \circ B = [a_{ij} \cdot b_{ij}]\)`.
* **Element-wise Division:** The `/` operator or `tf$math$divide()` performs element-wise division: `\(A / B = [a_{ij} \div b_{ij}]\)`.

- These operations require the tensors to have the **same shape** or be **broadcastable** (following broadcasting rules in TensorFlow).

---
## Element-wise Multiplication in base R


``` r
# Create two matrices (2x2 for simplicity)
tensor1 <- matrix(c(1, 2, 3, 4), nrow=2, ncol=2)
tensor2 <- matrix(c(5, 6, 7, 8), nrow=2, ncol=2)

# Initialize a result matrix with the same dimensions
result <- matrix(0, nrow=2, ncol=2)

# Perform element-wise multiplication using a for loop
for (i in 1:nrow(tensor1)) {
  for (j in 1:ncol(tensor1)) {
    result[i, j] <- tensor1[i, j] * tensor2[i, j]
  }
}

# Print the result
print(result)
```

```
##      [,1] [,2]
## [1,]    5   21
## [2,]   12   32
```

---
## Element-wise Multiplication in TensorFlow


``` r
# Create two tensors (2x2 matrix for simplicity)
(tensor1 <- tf$constant(matrix(c(1, 2, 3, 4), nrow=2, ncol=2), dtype="float32"))
```

```
## tf.Tensor(
## [[1. 3.]
##  [2. 4.]], shape=(2, 2), dtype=float32)
```

``` r
(tensor2 <- tf$constant(matrix(c(5, 6, 7, 8), nrow=2, ncol=2), dtype="float32"))
```

```
## tf.Tensor(
## [[5. 7.]
##  [6. 8.]], shape=(2, 2), dtype=float32)
```

``` r
# Perform element-wise multiplication
(result <- tf$math$multiply(tensor1, tensor2))
```

```
## tf.Tensor(
## [[ 5. 21.]
##  [12. 32.]], shape=(2, 2), dtype=float32)
```

---
## Dot-product operation

- The `%*%` operator or `tf$matmul()` computes the dot product (matrix multiplication), which involves summing the products of corresponding elements from rows of the first matrix and columns of the second matrix (`\(x \cdot y\)` in mathematical notation).

- Dimension requirement: For matrix multiplication, the number of columns in the first matrix <!--(or last dimension of the first tensor)--> must equal the number of rows in the second matrix<!-- (or second-to-last dimension of the second tensor)-->.

.center[<img src="img/dotproduct.png" height=250>]

---
## Dot-product operation

- Higher-dimensional tensors: `tf$matmul()` can perform batched matrix multiplication for tensors with more than two dimensions, applying the dot product across corresponding submatrices in the batch.

- For matrices and higher-dimensional tensors, the dot product is not commutative, meaning that the order of the operands matters (`A %*% B` or `tf$matmul(A, B)` is generally not equal to `B %*% A` or `tf$matmul(B, A)`); a quick numeric check is shown below.

.center[<img src="img/dotproduct.png" height=250>]
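A quick base R check of this non-commutativity with two 2x2 matrices:


``` r
A <- matrix(c(1, 2, 3, 4), nrow = 2) # [1 3; 2 4]
B <- matrix(c(5, 6, 7, 8), nrow = 2) # [5 7; 6 8]

A %*% B # [23 31; 34 46]
B %*% A # [19 43; 22 50] - a different result
```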
---
## Dot-product operation


``` r
# Create two tensors (2x2 matrices for simplicity)
(tensor1 <- tf$constant(matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2), dtype = "float32"))
```

```
## tf.Tensor(
## [[1. 3.]
##  [2. 4.]], shape=(2, 2), dtype=float32)
```

``` r
(tensor2 <- tf$constant(matrix(c(5, 6, 7, 8), nrow = 2, ncol = 2), dtype = "float32"))
```

```
## tf.Tensor(
## [[5. 7.]
##  [6. 8.]], shape=(2, 2), dtype=float32)
```

``` r
# Compute the dot product
# The dot product of two 2x2 matrices results in another 2x2 matrix
(result <- tf$matmul(tensor1, tensor2))
```

```
## tf.Tensor(
## [[23. 31.]
##  [34. 46.]], shape=(2, 2), dtype=float32)
```

---
## Tensor Reshaping

- Changing the shape (dimensions) of tensors without modifying the data.
- Needed to prepare inputs for layers in neural networks. Reshape between different operations (e.g., convnets to dense layers).


``` r
tensor <- matrix(1:8, nrow=2)
# Flatten the tensor into a 1D tensor (as.vector reads column-wise)
reshaped <- as.vector(tensor)
# array_reshape defaults to row-major ("C") ordering
array_reshape(tensor, dim = c(1, 8))
```

```
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    3    5    7    2    4    6    8
```

``` r
# Reshaping into a higher-dimensional tensor
array_reshape(tensor, dim = c(2, 2, 2))
```

```
## , , 1
## 
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    3    7
## [2,]    4    8
```

---
## Tensor Reshaping

- Reshaping: The `as_tensor()` function can reshape an R array during the conversion process. You can specify the desired shape using the `shape` argument.

- Row-major vs. Column-major Ordering: A key difference arises when reshaping arrays for use in TensorFlow. While R uses column-major ordering, TensorFlow utilizes row-major ordering. Therefore, it's crucial to employ the `array_reshape()` function in R with the `order = "C"` argument to ensure compatibility with TensorFlow's row-major semantics.

- Inferring Dimensions: Both `array_reshape()` and `as_tensor()` allow you to leave the size of one axis unspecified by using -1 or NA. The missing dimension will be automatically inferred based on the total size of the array and the sizes of the specified dimensions.

---
## Tensor Reshaping


``` r
# 1. Reshaping during conversion with `as_tensor()`
r_array <- array(1:6, dim = c(2, 3)) # An R array, filled column-major by default
tensor_reshaped <- as_tensor(r_array, shape = c(3, 2)) # Reshape to (3x2) using as_tensor()

# 2. Row-Major vs. Column-Major Ordering
array_row_major <- array_reshape(r_array, c(3, 2), order = "C") # Row-major ("C"), TensorFlow's convention
(array_col_major <- array_reshape(r_array, c(3, 2), order = "F")) # Column-major ("F"), R's convention
```

```
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
```

``` r
# 3. Inferring Dimensions with -1 or NA
# Reshape the array to (2x-1) where one dimension is inferred
tensor_inferred <- as_tensor(array_col_major, shape = c(2, -1))
# Another example using NA for inference
(tensor_inferred_na <- as_tensor(array_col_major, shape = c(NA, 2)))
```

```
## tf.Tensor(
## [[1 4]
##  [2 5]
##  [3 6]], shape=(3, 2), dtype=int32)
```

---
## Tensor Slicing

- Slicing: You can subset TensorFlow tensors similarly to R arrays using slicing. However, TensorFlow slicing offers some convenient features not available in R. For instance, you can use NA within a slice range to represent "the rest of the tensor in that direction".

- Negative Indices: TensorFlow uses negative indices differently than R. Instead of dropping elements, negative indices in TensorFlow indicate positions relative to the end of the axis (see the sketch after this list). Importantly, this behavior differs from R and might trigger a warning message the first time it's encountered.

- Capturing Remaining Dimensions: TensorFlow provides the `all_dims()` object for capturing all remaining dimensions without explicitly providing commas in the slicing operation. This simplifies code and improves readability, especially when working with tensors of different ranks.
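A small sketch contrasting the two negative-index conventions (assuming the tensorflow R package is loaded; the TensorFlow behavior may emit a one-time warning):


``` r
v <- c(10, 20, 30, 40)
v[-1] # Base R drops the first element: 20 30 40

tv <- tf$constant(v)
tv[-1] # TensorFlow selects the last element: 40
```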
---
## Tensor Slicing


``` r
# Create a 3D Tensor (2x3x4)
tensor <- tf$constant(array(1:24, dim = c(2, 3, 4)), dtype = tf$int32)
cat("Original Tensor (Shape 2x3x4):\n")
print(tensor)

# 1. Slicing: Using NA within a slice range to represent "the rest of the tensor in that direction"
slice_2 <- tensor[, 1:2, 2:NA] # All rows, first two columns, and from the 2nd element to the end of the last axis

# 2. Negative Indices: Indicate positions relative to the end of the axis
slice_3 <- tensor[, -1, ] # TensorFlow treats -1 as the last element
slice_4 <- tensor[, , -2:-1] # Select the last two elements of the last dimension for all rows and columns

# 3. Capturing Remaining Dimensions with all_dims()
slice_5 <- tensor[2, all_dims()] # Slice the second element along the first axis and keep all remaining dimensions
slice_6 <- tensor[1:2, all_dims()] # Slice specific rows and capture remaining dimensions

cat("\nSlice 6 (First two rows, capturing all remaining dimensions):\n")
print(slice_6)
```

---
## Geometric Interpretation of Tensor Operations

- Tensors as Geometric Objects
  - Vectors, matrices, and higher-dimensional tensors have geometric meanings.
  - Element-wise operations - scaling/stretching.
  - Dot products: projection or similarity measures.
  - Tensor reshaping: dimensional transformations.

.center[<img src="img/geometric_vector_addition.png" height=320>]

.small[ https://www.cuemath.com/geometry/addition-of-vectors/ ]

---
## Geometric Interpretation of the Dot Product

- The dot product of two vectors results in a scalar.
- It's calculated as the sum of the products of corresponding entries.
- Geometrically, it relates to the angle between the vectors:
  - If the vectors are orthogonal (perpendicular), their dot product is 0.
  - A larger dot product indicates a smaller angle between the vectors.

---
## Geometric Interpretation of Deep Learning

- Neural Networks as Geometric Transformations. Each layer transforms data geometrically.
- Non-linear activations distort the geometry to enhance learning capacity.
- Optimization (e.g., SGD) guides the geometric shape toward a solution.

.center[<img src="img/manifold.png" height=320>]

.small[ Manifold-based approach for neural network robustness analysis https://doi.org/10.1038/s44172-024-00263-8 ]

---
## Tensor operations in Layers

- A layer can be represented as a function that takes a tensor as input and returns another tensor, a new representation of the input tensor.
- Functions transforming the data are nonlinear, e.g., the rectified linear unit (ReLU) `\(\text{ReLU}(x) = \max(0, x)\)` and Softmax `\(\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}\)`.
- Adding weights and biases, we have `\(output = ReLU(dot(W, input) + b)\)`:
  - A dot product between the input tensor and a tensor of weights.
  - An addition between the resulting tensor and a tensor of biases.
  - A ReLU operation.

---
## Introduction to Gradient-Based Optimization

- Each layer in a neural network transforms its input. For example, a dense (fully connected) layer computes `\(output = ReLU(dot(W, input) + b)\)`, where `\(W\)` (weights) and `\(b\)` (biases) are the trainable parameters of the layer.
- These parameters, `\(W\)` and `\(b\)`, are initialized randomly, which is why the initial predictions often don't reflect any useful data patterns.
- Goal of Optimization: Adjust these parameters gradually to improve model predictions.
  - Forward Pass: Calculate the predicted output, `\(y_{\text{pred}}\)`, given the current weights.
  - Loss Calculation: Measure how far `\(y_{\text{pred}}\)` is from the actual target `\(y\)`.
  - Gradient-Based Optimization: Adjust weights to minimize this loss iteratively (a base R sketch of the first two steps follows).
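A minimal base R sketch of the forward pass and loss calculation, using hypothetical sizes (2 input features, 3 units) and randomly initialized parameters:


``` r
relu <- function(x) pmax(0, x) # element-wise max(0, x)

set.seed(1)
input <- matrix(rnorm(2), nrow = 1) # one sample with 2 features
W <- matrix(rnorm(2 * 3), nrow = 2) # randomly initialized weights (2 inputs -> 3 units)
b <- rep(0, 3)                      # biases, initialized to zero

# Forward pass: output = ReLU(dot(input, W) + b)
# (for a single sample, the length-3 bias recycles correctly across the 1x3 result)
y_pred <- relu(input %*% W + b)

# Loss calculation: mean squared error against a hypothetical target
y <- c(0.5, 1, 0)
(loss <- mean((y_pred - y)^2))
```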
---
## Differentiability in Neural Networks

- Every operation in a neural network is designed to be differentiable, enabling us to calculate how small changes in each weight affect the loss function.
- Gradient - Indicates the rate and direction to adjust each parameter to reduce the loss.
- Key Concept: By moving the weights in the direction opposite to the gradient, we can iteratively decrease the loss.
- Derivative: For a function `\(f(x)\)`, its derivative `\(f'(x)\)` gives the rate of change with respect to `\(x\)`.

---
## Differentiability in Neural Networks

- Geometric Interpretation: Derivatives represent the slope of the tangent line at any point, guiding adjustments.

.center[<img src="img/derivative.svg" height=300>]

- Example: For `\(f(x) = x^2\)`, `\(f'(x) = 2x\)`, meaning the slope and adjustment direction depend on `\(x\)`.

---
## Derivative of a Tensor Operation: The Gradient

- Gradient: Generalization of a derivative to multi-variable functions. In a neural network, it measures sensitivity to changes in each weight.
- For a function with multiple variables, the gradient vector `\(\nabla \mathcal{L}(\mathbf{w})\)` points in the direction of steepest ascent. In training, we move along `\(-\nabla \mathcal{L}(\mathbf{w})\)` to decrease the loss.
- Example: If `\(\mathbf{w} = (w_1, w_2)\)`, then for a loss function `\(\mathcal{L}(\mathbf{w})\)`:

`$$\nabla_{\mathbf{w}} \mathcal{L} = \left( \frac{\partial \mathcal{L}}{\partial w_1}, \frac{\partial \mathcal{L}}{\partial w_2} \right)$$`

---
## Stochastic Gradient Descent (SGD)

- Stochastic Gradient Descent (SGD): A variant of gradient descent where we update weights based on small batches (mini-batches) instead of the entire dataset.
- Advantages:
  - Computational Efficiency: Processes smaller data portions at a time, accelerating training.
  - Regularization Effect: Introduces noise, which can help the model escape local minima and explore more of the loss surface.
- SGD Update Rule: `\(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \cdot \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w})\)`, where `\(\eta\)` is the learning rate.

---
## Learning Rate and Its Impact

- Learning Rate (`\(\eta\)`): A hyperparameter that controls the step size in each iteration of gradient descent.
  - Too High: May overshoot the minimum, leading to divergent training.
  - Too Low: Training takes much longer to converge and risks getting stuck in local minima.
- Finding an optimal learning rate is key for efficient and effective training. Techniques such as learning rate schedules and adaptive optimizers (e.g., Adam) can help.

---
## Chaining Derivatives: The Backpropagation Algorithm

- Backpropagation: Core algorithm for calculating gradients across all network parameters, allowing efficient updates during training.
- Chain Rule: Allows us to calculate the gradient of composite functions. For functions `\(f(g(x))\)`, the derivative is:

`$$\frac{d}{dx} [f(g(x))] = f'(g(x)) \cdot g'(x)$$`

- Backpropagation uses this rule to propagate errors from output layers back through each layer (a small worked example follows).
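A tiny worked example of the chain rule for a one-parameter model, checked numerically in base R (hypothetical values):


``` r
# Composite function: loss L(w) = (w * x - y)^2, i.e., f(g(w)) with
# g(w) = w * x - y and f(u) = u^2
x <- 2; y <- 10; w <- 1.5
loss <- function(w) (w * x - y)^2

# Chain rule: dL/dw = f'(g(w)) * g'(w) = 2 * (w * x - y) * x
analytic <- 2 * (w * x - y) * x

# Numerical check with a central finite difference
eps <- 1e-6
numeric <- (loss(w + eps) - loss(w - eps)) / (2 * eps)

c(analytic = analytic, numeric = numeric) # both approximately -28
```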
---
## Backpropagation in Action

- Forward Pass:
  - Feed the input through the network to obtain the output prediction.
  - Store intermediate activations, as these are required for calculating gradients in the backward pass.
- Backward Pass:
  - Starting from the loss at the output, compute gradients with respect to each layer's weights using stored activations.
  - Update weights across all layers based on these gradients.

---
## Example of Backpropagation

Two-Layer Network:

- Forward Pass: Input `\(x \rightarrow\)` Hidden Layer `\(h = W_1 x + b_1 \rightarrow\)` Output `\(o = W_2 h + b_2\)`
- Backward Pass:
  - Calculate gradients `\(\nabla L\)` for weights `\(W_1\)`, `\(W_2\)` and biases `\(b_1\)`, `\(b_2\)`.
  - Update weights to reduce the loss, layer by layer, using the chain rule.

---
## Visualizing Gradient Descent

- Loss Surface: Imagine the loss function as a landscape, with peaks, valleys, and slopes.
- Gradient Descent Steps: Moves in the direction that reduces the loss, analogous to descending a hill.
- Challenges in Optimization:
  - Local Minima: Points where the gradient is zero but not the lowest possible loss.
  - Saddle Points: Flat regions where the gradient is near zero, slowing down progress.
  - Plateaus: Regions where the gradient is very small, making it difficult to move toward the minimum.

---
## GradientTape in TensorFlow

* **GradientTape()** is TensorFlow's API for automatic differentiation. This powerful tool enables calculating the gradients of complex combinations of differentiable tensor operations.
* `GradientTape()` records operations performed on Variable objects to enable automatic differentiation and gradient calculation.
* `GradientTape()` creates a **computation graph**, or "tape". This graph facilitates retrieving the gradient of any output with respect to any variable or set of variables.
* **Automatic Differentiation** employs computation graphs to determine the gradients of differentiable tensor operations. Modern frameworks, including TensorFlow, have this capability, making manual backpropagation implementation unnecessary.

---
## GradientTape operates with tensor operations


``` r
# Create a 2x2 matrix of zeros as a TensorFlow Variable
x <- tf$Variable(array(0, dim = c(2, 2)))

with(tf$GradientTape() %as% tape, {
  y <- 2 * x + 3 # Linear operation: y = 2x + 3
})

# Compute gradient dy/dx and convert to R array
(grad_of_y_wrt_x <- as.array(tape$gradient(y, x)))
```

```
##      [,1] [,2]
## [1,]    2    2
## [2,]    2    2
```

- Result is a 2x2 matrix of 2's.
- This is because ∂y/∂x = 2 for each element (derivative of 2x + 3).

---
## GradientTape operates with tensor operations

* A variable with a shape (2, 2) and an initial value of zeros is instantiated.
* A GradientTape scope is opened.
* Tensor operations are applied to the variable within the scope.
* The scope is exited.
* The tape is used to retrieve the gradient of the output *y* with respect to our variable *x*.
* `grad_of_y_wrt_x` is a tensor of shape (2, 2) holding the gradient of `y = 2 * x + 3` evaluated at `x = array(0, dim = c(2, 2))`; because the operation is linear, the gradient is 2 everywhere.
* `tape$gradient()` returns a TensorFlow Tensor that is converted to an R array using `as.array()`.
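A sketch extending the same pattern to a nonlinear operation, where the gradient depends on the current value of the variable:


``` r
# For y = x^2, the gradient dy/dx = 2x; at x = 3, expect 6
x <- tf$Variable(3)

with(tf$GradientTape() %as% tape, {
  y <- x^2
})

as.array(tape$gradient(y, x)) # 6
```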
---
## GradientTape operates with tensor operations

* GradientTape() can also handle lists of variables:


``` r
# Small helper (not part of TensorFlow): an array of random uniform values
random_array <- function(dim) array(runif(prod(dim)), dim = dim)

# Create a 2x2 random weight matrix as a TensorFlow Variable
W <- tf$Variable(random_array(c(2, 2)))
# Create a 2-element zero bias vector as a TensorFlow Variable
b <- tf$Variable(array(0, dim = c(2)))
# Create a 2x2 random input matrix (not a Variable)
x <- random_array(c(2, 2))

with(tf$GradientTape() %as% tape, {
  y <- tf$matmul(x, W) + b # Linear layer: y = xW + b
})

grad_of_y_wrt_W_and_b <- tape$gradient(y, list(W, b))
```

* `matmul` represents the **dot product** (matrix multiplication) in TensorFlow.
* `grad_of_y_wrt_W_and_b` is a list of two tensors: dy/dW, the gradient of *y* with respect to the weights *W*, and dy/db, the gradient of *y* with respect to the biases *b*, each matching the shape of *W* and *b*, respectively.

---
## Practical Implementation in R with Keras/TensorFlow

- Define Model: Use `keras_model_sequential()` to stack layers. Each layer has weights and biases that will be optimized.
- Compile Model:
  - Specify the optimizer (e.g., SGD, Adam).
  - Define the loss function (e.g., mean squared error for regression).
  - Add metrics for monitoring during training.
- Train Model:
  - Use `fit()` to start training, where Keras automatically performs forward and backward passes.
  - Parameters are updated according to the chosen optimization strategy (e.g., SGD, Adam).
- Visualize Training Process:
  - Plot training/validation loss over epochs to observe convergence.
  - Experiment with different learning rates and optimizers to understand their effects on convergence.
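A minimal sketch of this workflow with the keras package (hypothetical layer sizes and simulated data; a sketch, not a tuned model):


``` r
library(keras)

# Simulated data: 100 samples with 4 features and a continuous target
set.seed(1)
x_train <- matrix(rnorm(100 * 4), ncol = 4)
y_train <- rnorm(100)

# Define: stack layers with keras_model_sequential()
model <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu", input_shape = 4) %>%
  layer_dense(units = 1)

# Compile: optimizer, loss, and monitoring metrics
model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.01),
  loss = "mse",
  metrics = "mae"
)

# Train: fit() performs the forward and backward passes and parameter updates
history <- model %>% fit(x_train, y_train,
                         epochs = 10, batch_size = 16,
                         validation_split = 0.2)

# Visualize training/validation loss over epochs
plot(history)
```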