Recently, I took some time to understand why Transformers have become central to the progress of large language models (LLMs). Here I explain how they work from the top down, the view that helped me make sense of why they work.
What do they solve
Transformers help us capture the surrounding context of a word more effectively. Without this context, predicting the next word is far more error-prone. Older models such as LSTMs and GRUs also tried to solve this problem, but they struggled to retain dependencies and context over longer sequences of text.
How do they work
At a high level, a simplified LLM consists of two major components:
- A Transformer that generates a context-rich representation for each word. 
- A linear neural network that uses this representation to predict the next word. 
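To make the second component concrete, here is a minimal sketch of the prediction step. Everything in it is an assumption for illustration: the three-word vocabulary, the hand-picked `WEIGHTS` and `BIAS`, and the `context_vector` standing in for the Transformer's output. The softmax at the end is the usual way to turn the linear layer's scores into probabilities; a real model learns its weights and has a vocabulary of many thousands of tokens.

```python
import math

# Hypothetical toy setup: a 3-word vocabulary and hand-picked weights.
VOCAB = ["cat", "sat", "mat"]
WEIGHTS = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # one row of weights per vocabulary word
BIAS = [0.0, 0.0, 0.0]


def softmax(scores):
    """Turn raw scores into positive probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_next_word(context_vector):
    """Linear layer followed by softmax: pick the most probable next word."""
    # Linear layer: one score (logit) per vocabulary word.
    logits = [
        sum(w * x for w, x in zip(row, context_vector)) + b
        for row, b in zip(WEIGHTS, BIAS)
    ]
    probs = softmax(logits)
    best = max(range(len(VOCAB)), key=lambda i: probs[i])
    return VOCAB[best], probs


# Stand-in for the context-rich representation produced by the Transformer.
context_vector = [0.1, 0.9]
word, probs = predict_next_word(context_vector)
print(word, [round(p, 3) for p in probs])
```

The point of the split is that the linear layer itself is simple. All of the hard work goes into producing a good `context_vector`.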
Let's take a look at the Transformer's structure. Each attention layer in a Transformer is composed of multiple heads. Each head is responsible for learning one kind of relationship between a word and its surroundings. Hypothetically, one head may keep track of syntax and grammar, another may track semantic relationships, another may capture the style of writing, and so on. The outputs of all these heads are combined to produce the final representation.
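Here is a rough sketch of how several heads can be run side by side and their outputs combined. It is heavily simplified and rests on assumptions: each head just takes its own slice of every word vector, and the helper names `attend` and `multi_head` are made up for this post. Real Transformers give each head its own learned projection matrices and apply one more learned projection after concatenating. The attention step inside `attend` is the mechanism described in the rest of this post.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors):
    """Toy self-attention: each vector acts as its own query, key, and value."""
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        )
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        )
    return outputs


def multi_head(vectors, num_heads):
    """Run one toy attention per head, then concatenate the results per word."""
    size = len(vectors[0]) // num_heads
    per_head = []
    for h in range(num_heads):
        # Assumption: each head sees only its own slice of every word vector.
        chunk = [vec[h * size:(h + 1) * size] for vec in vectors]
        per_head.append(attend(chunk))
    # Concatenate the heads' outputs back into one vector per word.
    return [[x for head in per_head for x in head[i]] for i in range(len(vectors))]


# Three hypothetical 4-dimensional word vectors, split across 2 heads.
words = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
combined = multi_head(words, num_heads=2)
print(combined)
```

Because every head works on different numbers, each one can end up attending to different words, which is what lets the heads specialize.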
Each head implements an attention mechanism. It uses three representations of each word, together with position embeddings, to find the most important words in the surrounding context. For each word we define three vectors: Query, Key, and Value. In very simple terms,
- Query: The Query for a word represents what it wants to know from its surroundings. If the word is a pronoun, it most likely wants to know which noun it refers to. If it is an adverb, it wants to find the relevant verb. Hence the name Query: the word asks the surrounding words, "what do you have that is relevant to me?"
- Key: The Key for a word is what it offers to the rest of the sentence. It describes what the word is, and depending on the query it may or may not be relevant. For an adjective, the Key might signal that it is an adjective and which word it modifies. A query from the related noun would match this key and conclude that the adjective is worth focusing on.
The attention mechanism uses the query and key vectors to compute attention scores, which can be thought of as how much one word should focus on others.
- Value: The Value is the actual information the word contributes to the context. The Query and Key determine which parts of the sentence are relevant, and the Value vectors of those parts are then combined, weighted by their attention scores, to produce the context-aware representation of the word.
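The three roles above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the query, key, and value vectors are hand-picked toy numbers (a real model computes them from word embeddings with learned weights), and the function name `attention` is mine. Dividing the scores by the square root of the key length follows the standard scaled dot-product formulation rather than anything specific to this post.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Return (outputs, weights) for scaled dot-product attention."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Query meets every Key: how relevant is each word to this one?
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Blend the Value vectors according to those relevance weights.
        blended = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy vectors for a two-word sentence.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

outputs, weights = attention(queries, keys, values)
print(weights)
print(outputs)
```

In this toy setup the first word's query lines up with its own key, so it gets the larger weight and the first word's output leans toward the first value vector.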
To recap:
Query: What do I need?
Key: What do I offer?
Attention score (Q • K): How relevant are you to me?
Value: The actual information being passed along.
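For readers who prefer a formula, the recap can be compressed into one line. The softmax and the division by the square root of d_k (the length of each key vector) come from the standard scaled dot-product formulation, not from the recap above. Q, K, and V stand for the matrices holding every word's query, key, and value vectors as rows.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```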
So there you have it: a very simplified view of how Transformers work and why they work. I hope it helps people like me see a little of what is behind the magic that makes Transformers so game-changing. This is my first blog post here, so critiques and suggestions are very welcome. Feel free to ask any questions, and I will do my best to answer them.
 
                                                 
                                                