LLM (Large Language Model)

Transformer Architecture

Encoder

  • Purpose: To understand the meaning of the input text and turn it into a mathematical "summary" or "map" of that text.

Step 1: Tokenization (Chopping)

The computer cannot read words. It needs numbers.

  • Action: The text is chopped into small chunks called "Tokens" (words or parts of words).

  • Input: "I love AI."

  • Output: Tokens [ "I", "love", "AI" ] → Token IDs [ 45, 2098, 11 ]
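A minimal sketch of this step, using a Hugging Face tokenizer as a stand-in for whatever tokenizer a given model actually ships with; the IDs above are illustrative, and the real IDs depend on the model's vocabulary:

```python
# Tokenization sketch using the Hugging Face `transformers` library (assumed
# to be installed). The exact tokens and IDs depend on the model's vocabulary,
# so the numbers in the notes above are illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "I love AI."
tokens = tokenizer.tokenize(text)     # subword strings (words or parts of words)
token_ids = tokenizer.encode(text)    # the integer IDs the model actually sees

print(tokens)
print(token_ids)
```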

Step 2: Embedding (Vectorization)

This is where the concepts we discussed earlier (Weights/Vectors) come in.

  • Action: Each number is converted into a massive list of numbers (a vector) that represents its meaning.

  • The Logic: Words with similar meanings get vectors that are close to each other.

    • King is mathematically close to Queen.

    • Apple is mathematically far from Car.
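A toy sketch of the idea, with made-up 4-dimensional vectors (real models learn vectors with hundreds or thousands of dimensions during training):

```python
# Toy sketch of embeddings: each token indexes a row in an embedding matrix,
# and "closeness" of meaning is measured by similarity between vectors.
# These 4-dimensional vectors are invented for illustration only.
import numpy as np

embedding_matrix = {
    "king":  np.array([0.8, 0.9, 0.1, 0.7]),
    "queen": np.array([0.8, 0.9, 0.2, 0.9]),
    "apple": np.array([0.1, 0.2, 0.9, 0.1]),
    "car":   np.array([0.9, 0.1, 0.1, 0.2]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embedding_matrix["king"], embedding_matrix["queen"]))  # high
print(cosine_similarity(embedding_matrix["apple"], embedding_matrix["car"]))   # low
```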

Step 3: Positional Encoding (The Order)

Because the Transformer looks at the whole sentence at once (Parallel Processing), it doesn't naturally know that "Man bites Dog" is different from "Dog bites Man."

  • Action: The architecture adds a mathematical "timestamp" to each word so the model knows: "This word is 1st, this word is 2nd."
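A sketch of the sinusoidal positional encoding from the original Transformer paper; some models learn the position vectors instead, but the idea of adding a position-dependent "timestamp" to each embedding is the same:

```python
# Sinusoidal positional encoding: each position gets a unique pattern of
# sine/cosine values that is added to the token's embedding vector.
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions
    return pe

# "Man bites Dog" vs "Dog bites Man": same word embeddings, but the encodings
# added at positions 0, 1, 2 make the two sequences distinguishable.
print(positional_encoding(seq_len=3, d_model=8).round(3))
```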

Step 4: The Attention Blocks (The Processing)

This is the Deep Learning part (The Hidden Layers).

  • Multi-Head Attention: The model looks at the sentence from multiple "perspectives" (Heads) at the same time.

    • Head 1: Focuses on grammar (Subject-Verb).

    • Head 2: Focuses on relationships (Who is "it"?).

    • Head 3: Focuses on tone (Is this angry?).

  • Feed-Forward Network: The model passes this information through its weights (the logic learned during training) to refine the understanding.

  • Result: The Encoder finally produces context vectors that capture the meaning of the input text (the "map" handed to the Decoder).
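A minimal sketch of the core operation inside each head, scaled dot-product self-attention; the projection weights here are random stand-ins for the weights a real model learns, and multi-head attention simply runs several of these heads in parallel and combines the results:

```python
# Scaled dot-product self-attention for a single head. Shapes and random
# weights are illustrative, not a trained model.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # how much each token attends to the others
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                        # weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8                       # e.g. the 3 tokens of "I love AI"
x = rng.normal(size=(seq_len, d_model))       # embeddings + positional encoding

# One attention "head": project x into queries, keys and values.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)                              # (3, 8): one context-aware vector per token
```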

Decoder

  • Purpose: To take the "map" from the Encoder and generate the output text, step by step.

Step 1: Input Processing (The Setup)

The Decoder takes the list of words it has generated so far: ["<Start>", "The", "Giant"].

  1. Embedding: It turns "Giant" into a vector (a list of numbers representing the meaning of Giant).

  2. Positional Encoding: It adds a "timestamp" to the vector so the model knows "Giant" is the 3rd word in the sentence.

Step 2: Masked Self-Attention (The Internal Check)

  • Action: The Decoder looks at the input ["The", "Giant"].

  • The Mask: It deliberately blocks out any future positions (so it can't see what hasn't been written yet).

  • The Logic: It calculates how "The" and "Giant" relate to each other.

    • It determines that "Giant" is an adjective that follows the article "The."

    • It establishes the expectation: "I have an Adjective. I need a Noun next."

  • Result: The vector for "Giant" is updated to include this grammatical context.
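A sketch of the causal mask at work: attention scores for future positions are set to negative infinity before the softmax, so their weights become exactly zero (random scores stand in for the real ones):

```python
# Causal ("look-ahead") mask used in masked self-attention: position i may
# only attend to positions <= i, so the Decoder cannot peek at words it has
# not generated yet.
import numpy as np

seq_len = 3                                   # ["<Start>", "The", "Giant"]
scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))

mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # True above the diagonal
scores = np.where(mask, -np.inf, scores)      # block future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)        # row-wise softmax
print(weights.round(2))                       # upper triangle is exactly 0
```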

Step 3: Cross-Attention (The Fact Retrieval)

  • Action: The Decoder takes this updated "Giant" vector and looks at the Encoder’s Context Vector (the map of the original source sentence).

  • The Query: "I am at the word 'Giant'. What concept in the original map corresponds to this?"

  • The Match: The Attention mechanism finds a high match with the concept "Apple" in the Encoder's map.

  • Result: It pulls the "Apple" information into the Decoder. The vector now contains the meaning of the word it wants to say.
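A sketch of cross-attention under the same toy setup: the queries come from the Decoder's own vectors, while the keys and values come from the Encoder's context vectors; dimensions and values are illustrative only:

```python
# Cross-attention sketch: Decoder queries against Encoder keys/values.
import numpy as np

rng = np.random.default_rng(2)
d_model = 8
decoder_states = rng.normal(size=(3, d_model))   # vectors for ["<Start>", "The", "Giant"]
encoder_output = rng.normal(size=(5, d_model))   # context vectors of the source sentence

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q = decoder_states @ W_q                          # "what am I looking for?"
K = encoder_output @ W_k                          # "what does the source offer?"
V = encoder_output @ W_v                          # the actual source content

scores = Q @ K.T / np.sqrt(d_model)               # (3, 5): decoder positions x source positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
context = weights @ V                             # source information pulled into the Decoder
print(context.shape)                              # (3, 8)
```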

Step 4: Feed-Forward Network (The Processing)

  • Action: The "Apple" vector passes through the Feed-Forward neural network.

  • The Logic: This is where the model applies its "Deep Learning" logic (weights).

    • It refines the vector.

    • It resolves specific details (e.g., "Should it be 'Apple' or 'Apples'? Well, the source was singular, so keep it singular.")

  • Result: A highly polished vector that represents the perfect next concept.
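A sketch of the position-wise feed-forward network: two linear layers with a non-linearity in between, applied to each position's vector independently; the random weights stand in for the logic learned during training:

```python
# Feed-forward network sketch: expand, apply a non-linearity, project back.
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)    # ReLU non-linearity
    return hidden @ W2 + b2                # project back to the model dimension

rng = np.random.default_rng(3)
d_model, d_ff = 8, 32                      # the inner layer is usually wider
x = rng.normal(size=(1, d_model))          # e.g. the "Apple" vector from cross-attention

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)   # still (1, 8), but refined
```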

Step 5: Linear Layer (The Vocabulary Check)

  • Action: The Decoder takes that polished vector and compares it against its entire Dictionary (e.g., 50,000 possible words).

  • The Math: It performs a massive matrix multiplication.

    • How much does this vector look like "Aardvark"? (Score: 0.01)

    • How much does this vector look like "Banana"? (Score: 3.5)

    • How much does this vector look like "Apple"? (Score: 15.2)

  • Result: A list of raw scores (called Logits) for every word in the dictionary.
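A sketch of the vocabulary projection: one matrix multiplication turns the polished vector into a raw score (logit) per dictionary word; the 50,000-word vocabulary follows the example above, and the weights are random stand-ins:

```python
# Final linear layer sketch: decoder vector -> one logit per vocabulary word.
import numpy as np

rng = np.random.default_rng(4)
d_model, vocab_size = 8, 50_000

decoder_vector = rng.normal(size=(d_model,))          # output of the feed-forward step
W_vocab = rng.normal(size=(d_model, vocab_size))      # (d_model x vocab_size) weight matrix

logits = decoder_vector @ W_vocab                     # one raw score per dictionary word
print(logits.shape)                                   # (50000,)
```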

Step 6: Softmax Function (The Probability)

  • Action: It turns those raw scores into percentages.

  • Result:

    • Apple: 94%

    • Pear: 4%

    • Car: 0.001%
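A sketch of the softmax step, reusing the illustrative logits from Step 5 (with only three words in the toy dictionary, the percentages will not match the 94% example exactly):

```python
# Softmax turns raw logits into a probability distribution that sums to 1.
import numpy as np

logits = {"Aardvark": 0.01, "Banana": 3.5, "Apple": 15.2}
values = np.array(list(logits.values()))

probs = np.exp(values - values.max())   # subtract the max for numerical stability
probs = probs / probs.sum()

for word, p in zip(logits, probs):
    print(f"{word}: {p:.4%}")           # "Apple" ends up with essentially all of the probability
```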

Step 7: Selection (Decoding Strategy)

  • Action: The computer picks the word.

    • Greedy Search: Picks the highest number (94% -> "Apple").

    • Temperature Sampling: Sometimes picks a slightly less likely word to be "creative" (e.g., might pick "Pear").

  • Final Output: The word "Apple" is printed on the screen.
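A sketch of the two decoding strategies over a toy three-word distribution; the logits are made up for illustration:

```python
# Greedy search always takes the most likely word; temperature sampling
# re-scales the logits and samples, which occasionally picks a less likely word.
import numpy as np

rng = np.random.default_rng(5)
words = ["Apple", "Pear", "Car"]
logits = np.array([5.0, 2.0, -4.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Greedy search: deterministic, always the top word.
greedy_choice = words[int(np.argmax(logits))]

# Temperature sampling: higher temperature flattens the distribution,
# making "creative" picks like "Pear" more likely.
temperature = 1.5
probs = softmax(logits / temperature)
sampled_choice = rng.choice(words, p=probs)

print("Greedy:", greedy_choice)
print("Sampled:", sampled_choice)
```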

Relationship between Transformer and LLM

  • An LLM is a deep learning model built on and trained with the Transformer architecture.

  • With the Transformer approach, tokens are processed in parallel, which makes it practical to learn from very large datasets.

  • Furthermore, because the input is processed all at once instead of through sequential calls as in an RNN, the model can keep track of the full context and better capture the meaning.

  • Old Way: Like one teacher grading 1,000 tests one by one. (Slow, cannot be rushed).

  • Transformer Way: Like hiring 1,000 teachers to grade 1,000 tests simultaneously (done in seconds). The trade-off is that more GPUs are required, which makes it more expensive.

Token & Tokenizer

  • A Token is the smallest unit of text that a model can process. A token can be a whole word, part of a word, or a single character.

  • The Tokenizer is the standalone software/tool that sits in front of the Transformer.

    Its only job is to translate human text into Token IDs that the embedding layer can consume.

  • A specific tokenizer therefore belongs to a specific embedding model; the token IDs are only meaningful to the model that was trained with that tokenizer's vocabulary.

  • When translating text into token IDs, the tokenizer also inserts special tokens that help the model understand the structure of the sentence.

  • Common special tokens include [CLS], [SEP], [PAD], [UNK] and [MASK] (used by BERT-style models), and markers such as <s>, </s> or <|endoftext|> in other model families (see the sketch below).
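A minimal sketch of special tokens in practice, assuming the Hugging Face transformers library and BERT's tokenizer; other model families use different special tokens:

```python
# BERT's tokenizer wraps every sequence in [CLS] ... [SEP]; other models use
# their own special tokens (e.g. GPT-2's <|endoftext|>).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("I love AI.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# something like ['[CLS]', 'i', 'love', 'ai', '.', '[SEP]'] (exact split depends on the vocabulary)

print(tokenizer.all_special_tokens)
# e.g. ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
```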
