Encoder, Decoder

Encoder: For each input word, we encode its position (in the input text). Since the Transformer is not an RNN, the sequence aspect can get lost, so here we encode the position explicitly. Evaluate 'self-attention' - what other OUTPUT words might matter. Evaluate attention to the encoded representations of the INPUT words --> this is the extra step in the decoder (vs. the encoder). Put it through a...
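
To make the positional-encoding idea concrete, here is a minimal numpy sketch of the sinusoidal scheme from the original Transformer paper (my own illustration, not code from the post):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                   # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dims: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dims: cosine
    return pe

# Added to the word embeddings so the model sees token order explicitly.
pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)
```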

RNN

A lot of data in the world has a sequence, i.e. a temporal aspect, e.g. music or spoken language. By treating each data point independently (in a feed-forward-only mode), we lose the ability to model that sequential aspect. In a Recurrent Neural Network, the computation at each neuron is a function of both the current input and the memory of previous time steps. So it has a constant relation back...
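
A minimal sketch of a single vanilla-RNN step (my illustration; the dimensions and random weights are arbitrary), showing how the new state mixes the current input with the memory of previous steps:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN step: new state depends on current input AND past memory."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
W_x = rng.normal(size=(input_dim, hidden_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                     # initial memory
for x_t in rng.normal(size=(5, input_dim)):  # a sequence of 5 time steps
    h = rnn_step(x_t, h, W_x, W_h, b)        # memory carries forward each step
```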

Transformer

A transformer is the major breakthrough of recent times: a neural network architecture that uses attention to process all tokens in parallel instead of step-by-step like an RNN. It processes sequences in parallel, is built from stacked blocks of attention + feedforward layers, and scales very well with data and compute (better for long texts and GPUs/TPUs). Why are they better? RNNs have sequential...
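
As a rough illustration of "attention processing all tokens in parallel", here is a minimal scaled dot-product self-attention in numpy (my sketch, not the post's code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core Transformer operation: every token attends to every other token at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # weighted mix of values

rng = np.random.default_rng(0)
tokens, d = 6, 8
Q = K = V = rng.normal(size=(tokens, d))              # same source = self-attention
out = scaled_dot_product_attention(Q, K, V)           # all 6 tokens processed at once
```

Using the same matrix for Q, K, and V is what makes this self-attention; the whole sequence is handled in one matrix multiply rather than one step per token.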

Gradient Descent

Gradient descent is an iterative algorithm for minimizing a loss function by moving the model parameters in the direction that most rapidly decreases the loss. Process: Start with random weights for all inputs. Compute the loss. Calculate the gradient of the loss with respect to that initial setting of weights. In one dimension, the gradient is basically the derivative; in multiple dimensions,...
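
A minimal sketch of that loop on a toy one-dimensional loss (the function and learning rate here are illustrative assumptions):

```python
def gradient_descent(grad_fn, w, lr=0.1, steps=100):
    """Repeatedly step opposite the gradient, the direction of steepest descent."""
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

# Toy loss L(w) = (w - 3)^2, whose gradient is 2*(w - 3); the minimum is at w = 3.
w_final = gradient_descent(lambda w: 2 * (w - 3), w=0.0)
print(w_final)  # ~3.0
```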

Backpropagation

Backpropagation is the main algorithm used for training neural networks with hidden layers. The main idea is that you can estimate how much each weight in the earlier layers contributed to the error at the output. It does so by: Starting with the error in the output layer, calculate the gradient for the weights of the previous layer. Propagate the error back...
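
A minimal numpy sketch of one forward/backward pass through a single hidden layer (the layer sizes, tanh activation, and squared-error loss are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))        # one input sample
y = np.array([[1.0]])              # target
W1 = rng.normal(size=(3, 4))       # input -> hidden weights
W2 = rng.normal(size=(4, 1))       # hidden -> output weights

# Forward pass
h = np.tanh(x @ W1)
y_hat = h @ W2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: start from the output error, propagate it back layer by layer
d_yhat = y_hat - y                  # dLoss/dy_hat
dW2 = h.T @ d_yhat                  # gradient for the last layer's weights
d_h = d_yhat @ W2.T                 # error attributed back to the hidden layer
dW1 = x.T @ (d_h * (1 - h ** 2))    # chain through tanh's derivative

# One gradient-descent update using the propagated gradients
lr = 0.1
W1 -= lr * dW1
W2 -= lr * dW2
```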

Loss

Given an input and its correct (target) output, a loss function compares the model’s prediction to the target and returns a single number measuring how wrong the prediction is. That number is called the loss. A larger loss means a worse prediction; zero loss means a perfect prediction. Training a neural network means changing the parameters (weights and biases) to make this loss as small as...
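
For instance, here is a sketch of one common loss function, mean squared error (the numbers are made up for illustration):

```python
def mse_loss(predictions, targets):
    """Mean squared error: one number measuring how wrong the predictions are."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

print(mse_loss([2.5, 0.0], [3.0, 0.0]))  # 0.125 - small but nonzero error
print(mse_loss([3.0, 0.0], [3.0, 0.0]))  # 0.0   - perfect prediction
```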

Model Training Techniques

E.g. Dropout: Randomly and temporarily remove/shut down some of the interstitial neurons, so that the weights only flow through a subset. This builds new pathways and forces the neural network not to depend on any particular neurons. Do it repeatedly, each time removing a different set of neurons. E.g. Early Stopping: Stop when the test loss has plateaued, just before it starts increasing again....
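
A minimal sketch of (inverted) dropout as described above; the drop probability and activations are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5):
    """Inverted dropout: randomly and temporarily zero out a subset of neurons."""
    mask = rng.random(activations.shape) >= p_drop  # keep each neuron with prob 1 - p_drop
    return activations * mask / (1.0 - p_drop)      # rescale so the expected value is unchanged

h = np.ones((2, 6))      # pretend hidden-layer activations
print(dropout(h))        # a different subset of neurons is silenced on each call
```

At inference time dropout is turned off; the 1/(1 - p_drop) rescaling during training keeps the expected activations consistent between the two modes.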

Think about local minima in thousands of dimensions

When I first learned about Gradient Descent about two years ago, I pictured it in the most obvious 3D way, where one imagines two input variables (as the x and y axes of a 2D plane) and the loss as the third (z) axis. In terms of 'local minima', I imagined the model getting stuck in a "false bottom" of this bowl-shaped landscape, unable to reach the true minimum, the lowest point. But this...

Chain Rule of Calculus

Chain Rule = how you take derivatives when a value depends on another value, which itself depends on another value (i.e. a composition of functions). Intuition: If A affects B and B affects C, then A affects C through B. The chain rule just says: total sensitivity of the chain = (how sensitive C is to B) × (how sensitive B is to A). A neural network is a long chain of computations...
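
A quick numerical sanity check of the chain rule on a toy composition (the function f(x) = sin(x**2) is my example, not from the post):

```python
import numpy as np

# f(x) = sin(x**2) is a composition: A = x -> B = x**2 -> C = sin(B).
# Chain rule: dC/dA = (dC/dB) * (dB/dA) = cos(x**2) * 2x
x = 1.5
analytic = np.cos(x ** 2) * 2 * x

# Numerical check with a small finite difference
eps = 1e-6
numeric = (np.sin((x + eps) ** 2) - np.sin((x - eps) ** 2)) / (2 * eps)

print(analytic, numeric)  # the two values agree to several decimal places
```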

Embedding

“Embeddings” emphasizes the notion of representing data in a meaningful and structured way, while “[[Vectors]]” refers to the numerical representation itself. ‘Vector embeddings’ are a way to represent different data types (like words, sentences, articles, etc.) as points in a multidimensional space. OpenAI’s vector embedding model is called text-embedding-ada-002 (read their Dec 2022 post announcing it). There...
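
As an illustration of "points in a multidimensional space", here is a toy cosine-similarity check between made-up 4-dimensional embeddings (real embedding models output hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Nearby points in embedding space (high similarity) mean related data."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings, invented purely for illustration
cat = np.array([0.8, 0.1, 0.3, 0.0])
kitten = np.array([0.7, 0.2, 0.4, 0.1])
car = np.array([0.0, 0.9, 0.0, 0.8])

print(cosine_similarity(cat, kitten))  # high: related concepts sit close together
print(cosine_similarity(cat, car))     # lower: unrelated concepts sit farther apart
```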
