LLM (Large Language Model)
Transformer Architecture

Encoder
Purpose: To understand the meaning of the input text and create a mathematical "summary" or "map" of it.
Step 1: Tokenization (Chopping)
The computer cannot read words. It needs numbers.
Action: The text is chopped into small chunks called "Tokens" (words or parts of words).
Input: "I love AI."
Output:
[ "I", "love", "AI" ]→[ 45, 2098, 11 ]
Step 2: Embedding (Vectorization)
This is where the concepts we discussed earlier (Weights/Vectors) come in.
Action: Each number is converted into a massive list of numbers (a vector) that represents its meaning.
The Logic: Words with similar meanings have similar numbers.
King is mathematically close to Queen.
Apple is mathematically far from Car.
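In code, the idea looks roughly like the following toy sketch. The vectors below are invented values, not real model weights: each token indexes a row in an embedding table, and similar meanings end up with similar vectors, which we can check with cosine similarity.

```python
import numpy as np

# Toy embedding table: token -> 4-dimensional vector.
# The values are made up for illustration; real models learn
# vectors with hundreds or thousands of dimensions during training.
embeddings = {
    "king":  np.array([0.90, 0.80, 0.10, 0.00]),
    "queen": np.array([0.88, 0.82, 0.15, 0.05]),
    "apple": np.array([0.10, 0.00, 0.95, 0.10]),
    "car":   np.array([0.00, 0.10, 0.10, 0.95]),
}

def cosine(a, b):
    """Similarity of direction: 1.0 means identical meaning, near 0 means unrelated."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1.0 (similar)
print(cosine(embeddings["apple"], embeddings["car"]))   # much lower (different)
```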
Step 3: Positional Encoding (The Order)
Because the Transformer looks at the whole sentence at once (Parallel Processing), it doesn't naturally know that "Man bites Dog" is different from "Dog bites Man."
Action: The architecture adds a mathematical "timestamp" to each word so the model knows: "This word is 1st, this word is 2nd."
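One common recipe for this "timestamp" is the sinusoidal encoding from the original Transformer paper (many newer models use learned or rotary position embeddings instead). A minimal sketch in NumPy:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of positional 'timestamps'."""
    positions = np.arange(seq_len)[:, None]           # 0, 1, 2, ...
    dims = np.arange(d_model)[None, :]
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions use cosine
    return pe

# The encoding is simply added to the word embeddings, so the same word
# at position 1 and at position 5 produces different input vectors.
print(sinusoidal_positional_encoding(seq_len=4, d_model=8).round(2))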
Step 4: The Attention Blocks (The Processing)
This is the Deep Learning part (The Hidden Layers).
Multi-Head Attention: The model looks at the sentence from multiple "perspectives" (Heads) at the same time.
Head 1: Focuses on grammar (Subject-Verb).
Head 2: Focuses on relationships (Who is "it"?).
Head 3: Focuses on tone (Is this angry?).
Feed-Forward Network: The model passes this information through its weights (the logic learned during training) to refine the understanding.
Finally, the Encoder outputs a context vector (the meaning of the input text).
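The core computation inside each attention head is scaled dot-product attention; the Feed-Forward network then refines the result. Below is a minimal single-head sketch in NumPy (random toy numbers, not real trained weights), ending with one context vector per token.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each token's query scores every token's key; the scores weight the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # how much each word attends to each other word
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # weighted mix of value vectors

# Toy example: 3 tokens ("I", "love", "AI"), 4-dimensional vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                # token embeddings + positional encodings
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))

context = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(context.shape)                       # (3, 4): one context vector per token
```

Real models run many such heads in parallel (Multi-Head Attention) and stack many attention + feed-forward layers on top of each other.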
Decoder
Purpose: To take the "map" from the Encoder and generate the output text, step-by-step.
Step 1: Input Processing (The Setup)
The Decoder takes the list of words it has generated so far: ["<Start>", "The", "Giant"].
Embedding: It turns "Giant" into a vector (a list of numbers representing the meaning of Giant).
Positional Encoding: It adds a "timestamp" to the vector so the model knows "Giant" is the 3rd word in the sentence.
Step 2: Masked Self-Attention (The Internal Check)
Action: The Decoder looks at the input ["The", "Giant"].
The Mask: It deliberately blocks out any future positions (so it can't see what hasn't been written yet).
The Logic: It calculates how "The" and "Giant" relate to each other.
It determines that "Giant" is an adjective following the determiner "The."
It establishes the expectation: "I have an Adjective. I need a Noun next."
Result: The vector for "Giant" is updated to include this grammatical context.
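The mask itself is just a matrix of minus-infinity values added to the attention scores above the diagonal, so that after softmax the "future" positions get zero weight. A minimal sketch with toy scores (the numbers are invented for illustration):

```python
import numpy as np

def causal_mask(seq_len):
    """0 on and below the diagonal, -inf above it (the 'future' positions)."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Toy attention scores for ["<Start>", "The", "Giant"] attending to each other.
scores = np.array([[1.0, 2.0, 0.5],
                   [0.3, 1.5, 2.2],
                   [0.7, 0.9, 1.1]])

weights = softmax(scores + causal_mask(3))
print(weights.round(2))
# Row 0 only attends to position 0; row 1 to positions 0-1;
# row 2 ("Giant") attends to everything written so far, but nothing in the future.
```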
Step 3: Cross-Attention (The Fact Retrieval)
Action: The Decoder takes this updated "Giant" vector and looks at the Encoder’s Context Vector (the map of the original source sentence).
The Query: "I am at the word 'Giant'. What concept in the original map corresponds to this?"
The Match: The Attention mechanism finds a high match with the concept "Apple" in the Encoder's map.
Result: It pulls the "Apple" information into the Decoder. The vector now contains the meaning of the word it wants to say.
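Cross-attention reuses the same attention math, but the queries come from the Decoder while the keys and values come from the Encoder's context vectors. A minimal sketch with toy shapes and random numbers:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_states, encoder_states, W_q, W_k, W_v):
    """Decoder provides the queries; the Encoder's output provides keys and values."""
    Q = decoder_states @ W_q               # "what am I looking for?"
    K = encoder_states @ W_k               # "what does the source contain?"
    V = encoder_states @ W_v               # "what information can I pull over?"
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V             # source information mixed into each decoder position

rng = np.random.default_rng(0)
decoder_states = rng.normal(size=(3, 4))   # e.g. "<Start>", "The", "Giant"
encoder_states = rng.normal(size=(5, 4))   # e.g. the 5 tokens of the source sentence
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))

out = cross_attention(decoder_states, encoder_states, W_q, W_k, W_v)
print(out.shape)                           # (3, 4): one enriched vector per decoder position
```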
Step 4: Feed-Forward Network (The Processing)
Action: The "Apple" vector passes through the Feed-Forward neural network.
The Logic: This is where the model applies its "Deep Learning" logic (weights).
It refines the vector.
It resolves specific details (e.g., "Should it be 'Apple' or 'Apples'? Well, the source was singular, so keep it singular.")
Result: A highly polished vector that represents the perfect next concept.
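The Feed-Forward block is typically just two linear layers with a non-linearity in between, applied to each position independently. A toy sketch with random weights:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward block: expand, apply a non-linearity, project back."""
    hidden = np.maximum(0, x @ W1 + b1)    # ReLU activation
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16                      # real models use sizes like 4096 and 16384
x = rng.normal(size=(3, d_model))          # one vector per decoder position
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

print(feed_forward(x, W1, b1, W2, b2).shape)   # (3, 4): refined vectors, same shape as the input
```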
Step 5: Linear Layer (The Vocabulary Check)
Action: The Decoder takes that polished vector and compares it against its entire Dictionary (e.g., 50,000 possible words).
The Math: It performs a massive multiplication.
How much does this vector look like "Aardvark"? (Score: 0.01)
How much does this vector look like "Banana"? (Score: 3.5)
How much does this vector look like "Apple"? (Score: 15.2)
Result: A list of raw scores (called Logits) for every word in the dictionary.
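The "massive multiplication" is a single matrix multiply between the polished vector and a weight matrix with one row per dictionary word. A toy sketch with a 5-word dictionary and random numbers:

```python
import numpy as np

vocab = ["aardvark", "apple", "banana", "car", "pear"]

rng = np.random.default_rng(0)
d_model = 4
hidden = rng.normal(size=(d_model,))              # the polished vector for the next word
W_vocab = rng.normal(size=(len(vocab), d_model))  # one row of weights per dictionary word

logits = W_vocab @ hidden                         # one raw score (logit) per word
for word, score in zip(vocab, logits):
    print(f"{word:10s} {score:6.2f}")
```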
Step 6: Softmax Function (The Probability)
Action: It turns those raw scores into percentages.
Result:
Apple: 94%
Pear: 4%
Car: 0.001%
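Softmax exponentiates each logit and divides by the sum, so the raw scores become positive numbers that add up to 1 (i.e., percentages). A minimal sketch using the example scores from the Linear Layer step (the exact percentages depend entirely on the logits):

```python
import numpy as np

def softmax(logits):
    logits = logits - np.max(logits)      # subtract the max for numerical stability
    exps = np.exp(logits)
    return exps / exps.sum()

# Raw scores (logits) from the Linear Layer step.
logits = np.array([15.2, 3.5, 0.01])      # "Apple", "Banana", "Aardvark"
probs = softmax(logits)

for word, p in zip(["Apple", "Banana", "Aardvark"], probs):
    print(f"{word}: {p:.4%}")
```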
Step 7: Selection (Decoding Strategy)
Action: The computer picks the word.
Greedy Search: Picks the highest number (94% -> "Apple").
Temperature Sampling: Sometimes picks a slightly less likely word to be "creative" (e.g., might pick "Pear").
Final Output: The word "Apple" is printed on the screen.
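A minimal sketch of both strategies: greedy search picks the argmax, while temperature sampling rescales the logits before sampling, so higher temperatures give more "creative" picks (toy values below).

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def greedy(logits, vocab):
    """Always pick the single most likely word."""
    return vocab[int(np.argmax(logits))]

def temperature_sample(logits, vocab, temperature=1.0, rng=None):
    """Divide the logits by the temperature, then sample from the resulting distribution."""
    rng = rng or np.random.default_rng()
    probs = softmax(np.asarray(logits) / temperature)
    return vocab[rng.choice(len(vocab), p=probs)]

vocab = ["Apple", "Pear", "Car"]
logits = [5.0, 1.9, -6.0]                                  # toy scores

print(greedy(logits, vocab))                               # always "Apple"
print(temperature_sample(logits, vocab, temperature=1.5))  # usually "Apple", sometimes "Pear"
```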
Relationship between Transformer and LLM
An LLM is trained using the Transformer architecture of deep learning.
With the Transformer approach, every token can be processed in parallel, which makes it practical to learn from very large datasets.
Furthermore, because the input is processed all at once instead of through sequential calls as in an RNN, the model can keep track of the whole context and truly understand the meaning.
Old Way (RNN): Like one teacher grading 1,000 tests one by one. (Slow, cannot be rushed.)
Transformer Way: Like hiring 1,000 teachers to grade 1,000 tests simultaneously. (Done in seconds.) The trade-off is that more GPUs are required, which is more expensive.
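A rough sketch just to make the contrast concrete (toy NumPy code, not a real RNN or Transformer):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 1000, 64
tokens = rng.normal(size=(seq_len, d))      # 1,000 token vectors
W = rng.normal(size=(d, d))

# RNN-style: each step depends on the previous one, so it cannot be parallelized.
state = np.zeros(d)
for t in range(seq_len):
    state = np.tanh(tokens[t] @ W + state)  # must wait for step t-1 to finish

# Transformer-style: every token is transformed in one batched operation,
# which a GPU can execute for all positions at the same time.
out = np.tanh(tokens @ W)
```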
Token & Tokenizer
A Token is the smallest unit of text that a model can process. Depending on the tokenizer, a token may be a word, part of a word, or a single character.
The Tokenizer is the standalone software/tool that sits in front of the Transformer.
Its only job is to translate human text into Token IDs for consumption by the embedding layer.
Each tokenizer therefore corresponds to a specific embedding model, so that the token IDs are interpreted correctly.

When translating tokens to token IDs, the tokenizer also inserts some special tokens that help the model understand the structure of the sentence.

Common examples of special tokens include a start-of-sequence marker (e.g., <s> or [CLS]), an end-of-sequence marker (</s> or [SEP]), a padding token ([PAD]), and an unknown-word token ([UNK]).
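A quick way to see special tokens being added, assuming the Hugging Face transformers library (BERT-style tokenizers add [CLS] and [SEP]; GPT-style tokenizers use different markers, and the exact sub-word splits may vary):

```python
# Assumes the Hugging Face `transformers` package; which special tokens
# appear depends entirely on the tokenizer/model family.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

ids = tokenizer.encode("I love AI")            # special tokens are added automatically
print(tokenizer.convert_ids_to_tokens(ids))    # e.g. ['[CLS]', 'i', 'love', 'ai', '[SEP]']
```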