LLM (Large Language Model)
Transformer Architecture

Encoder
Purpose: To understand the meaning of the input text and create a mathematical "summary" or "map" of it.
Step 1: Tokenization (Chopping)
The computer cannot read words. It needs numbers.
Action: The text is chopped into small chunks called "Tokens" (words or parts of words).
Input: "I love AI."
Output:
[ "I", "love", "AI" ]→[ 45, 2098, 11 ]
Step 2: Embedding (Vectorization)
This is where the concepts we discussed earlier (Weights/Vectors) come in.
Action: Each number is converted into a massive list of numbers (a vector) that represents its meaning.
The Logic: Words with similar meanings have similar numbers.
King is mathematically close to Queen.
Apple is mathematically far from Car.
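A toy sketch of this idea; the 4-dimensional vectors below are invented for illustration (real models learn vectors with hundreds or thousands of dimensions):

```python
# Toy embeddings: similar meanings -> similar vectors (measured by cosine similarity).
# These vectors are made up for illustration; real embeddings are learned during training.
import numpy as np

embeddings = {
    "king":  np.array([0.90, 0.80, 0.10, 0.20]),
    "queen": np.array([0.88, 0.82, 0.15, 0.25]),
    "apple": np.array([0.10, 0.20, 0.90, 0.10]),
    "car":   np.array([0.05, 0.10, 0.10, 0.95]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: close in meaning
print(cosine_similarity(embeddings["apple"], embeddings["car"]))   # much lower: far apart
```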
Step 3: Positional Encoding (The Order)
Because the Transformer looks at the whole sentence at once (Parallel Processing), it doesn't naturally know that "Man bites Dog" is different from "Dog bites Man."
Action: The architecture adds a mathematical "timestamp" to each word so the model knows: "This word is 1st, this word is 2nd."
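A rough sketch of the classic sinusoidal positional encoding from the original Transformer paper (sizes shrunk for readability):

```python
# Sinusoidal positional encoding sketch (as in "Attention Is All You Need").
# Each position gets a unique pattern of sine/cosine values that is simply
# added to the word's embedding, so the model can tell 1st from 2nd from 3rd.
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                        # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                        # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=3, d_model=8)                   # 3 words, 8-dim embeddings
# embedded_tokens + pe  <- the "timestamp" is added to each word vector
print(pe.round(3))
```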
Step 4: The Attention Blocks (The Processing)
This is the Deep Learning part (The Hidden Layers).
Multi-Head Attention: The model looks at the sentence from multiple "perspectives" (Heads) at the same time.
Head 1: Focuses on grammar (Subject-Verb).
Head 2: Focuses on relationships (Who is "it"?).
Head 3: Focuses on tone (Is this angry?).
Feed-Forward Network: The model passes this information through its weights (the logic learned during training) to refine the understanding.
The final result of the Encoder is a context vector that represents the meaning of the input text.
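A compact sketch of the core operation inside one attention head (scaled dot-product attention); multi-head attention simply runs several of these in parallel on different learned projections:

```python
# Scaled dot-product attention, the core of every attention head.
# Q, K, V are learned projections of the token vectors; each head has its own.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # how much each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over each row
    return weights @ V                                  # weighted mix of the value vectors

# 3 tokens ("I", "love", "AI") with 4-dimensional vectors (toy sizes, random numbers)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)      # (3, 4): one refined vector per token
```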
Decoder
Purpose: To take the "map" from the Encoder and generate the output text, step-by-step.
Step 1: Input Processing (The Setup)
The Decoder takes the list of words it has generated so far: ["<Start>", "The", "Giant"].
Embedding: It turns "Giant" into a vector (a list of numbers representing the meaning of Giant).
Positional Encoding: It adds a "timestamp" to the vector so the model knows "Giant" is the 3rd word in the sentence.
Step 2: Masked Self-Attention (The Internal Check)
Action: The Decoder looks at the input ["The", "Giant"].
The Mask: It deliberately blocks out any future positions (so it can't see what hasn't been written yet).
The Logic: It calculates how "The" and "Giant" relate to each other.
It determines that "Giant" is an adjective following the article "The."
It establishes the expectation: "I have an Adjective. I need a Noun next."
Result: The vector for "Giant" is updated to include this grammatical context.
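A short sketch of the mask itself: future positions are set to minus infinity before the softmax, so they receive zero attention (toy scores, random numbers):

```python
# Causal (look-ahead) mask sketch: position i may only attend to positions <= i.
import numpy as np

seq_len = 3                                        # e.g. ["<Start>", "The", "Giant"]
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))
scores[mask] = -np.inf                             # block the future before softmax
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))                            # upper triangle is 0: no peeking ahead
```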
Step 3: Cross-Attention (The Fact Retrieval)
Action: The Decoder takes this updated "Giant" vector and looks at the Encoder’s Context Vector (the map of the original source sentence).
The Query: "I am at the word 'Giant'. What concept in the original map corresponds to this?"
The Match: The Attention mechanism finds a high match with the concept "Apple" in the Encoder's map.
Result: It pulls the "Apple" information into the Decoder. The vector now contains the meaning of the word it wants to say.
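The only structural difference from self-attention is where the vectors come from: the queries come from the Decoder, while the keys and values come from the Encoder's context vectors. A toy sketch (random numbers, illustrative shapes):

```python
# Cross-attention sketch: decoder queries attend over the encoder's outputs.
import numpy as np

rng = np.random.default_rng(2)
decoder_states = rng.normal(size=(3, 4))   # e.g. vectors for ["<Start>", "The", "Giant"]
encoder_states = rng.normal(size=(5, 4))   # context vectors of the source sentence

Q = decoder_states                          # "what am I looking for?"
K = V = encoder_states                      # "what does the source sentence contain?"

scores = Q @ K.T / np.sqrt(Q.shape[-1])
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
pulled = weights @ V                        # source information pulled into the decoder
print(pulled.shape)                         # (3, 4)
```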
Step 4: Feed-Forward Network (The Processing)
Action: The "Apple" vector passes through the Feed-Forward neural network.
The Logic: This is where the model applies its "Deep Learning" logic (weights).
It refines the vector.
It resolves specific details (e.g., "Should it be 'Apple' or 'Apples'? Well, the source was singular, so keep it singular.")
Result: A highly polished vector that represents the perfect next concept.
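A minimal sketch of this block: two linear layers with a non-linearity in between (the sizes here are toy values):

```python
# Position-wise feed-forward network sketch: expand, apply a non-linearity, compress.
import numpy as np

rng = np.random.default_rng(3)
d_model, d_ff = 4, 16                       # toy sizes; real models use thousands of dimensions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    hidden = np.maximum(0, x @ W1 + b1)     # ReLU non-linearity
    return hidden @ W2 + b2                 # back to the model dimension

x = rng.normal(size=(1, d_model))           # the updated vector from the previous step
print(feed_forward(x).shape)                # (1, 4): a refined vector of the same size
```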
Step 5: Linear Layer (The Vocabulary Check)
Action: The Decoder takes that polished vector and compares it against its entire Dictionary (e.g., 50,000 possible words).
The Math: It performs a massive multiplication.
How much does this vector look like "Aardvark"? (Score: 0.01)
How much does this vector look like "Banana"? (Score: 3.5)
How much does this vector look like "Apple"? (Score: 15.2)
Result: A list of raw scores (called Logits) for every word in the dictionary.
Step 6: Softmax Function (The Probability)
Action: It turns those raw scores into percentages.
Result:
Apple: 94%
Pear: 4%
Car: 0.001%
Step 7: Selection (Decoding Strategy)
Action: The computer picks the word.
Greedy Search: Picks the highest number (94% -> "Apple").
Temperature Sampling: Sometimes picks a slightly less likely word to be "creative" (e.g., might pick "Pear").
Final Output: The word "Apple" is printed on the screen.
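A sketch that ties steps 5 to 7 together: project onto the vocabulary, softmax the logits, then pick a word. The vocabulary and scores below are invented to match the example:

```python
# Steps 5-7 in miniature: logits -> softmax -> pick the next word.
import numpy as np

vocab = ["Aardvark", "Banana", "Apple", "Pear", "Car"]           # tiny toy vocabulary
logits = np.array([0.01, 3.5, 15.2, 12.0, 1.0])                  # raw scores from the linear layer

probs = np.exp(logits - logits.max())
probs /= probs.sum()                                             # softmax -> probabilities

greedy_choice = vocab[int(np.argmax(probs))]                     # greedy search: always the top word
sampled_choice = np.random.default_rng(0).choice(vocab, p=probs) # sampling: usually "Apple", occasionally not

print(dict(zip(vocab, probs.round(4))))
print("greedy:", greedy_choice, "| sampled:", sampled_choice)
```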
Relationship between Transformer and LLM
An LLM is a deep learning model trained on the Transformer architecture.
With the Transformer approach, every token can be processed in parallel, which makes it practical to learn from very large datasets.
Furthermore, because the input is processed all at once instead of through sequential calls (as with an RNN), the model can keep track of the full context and better capture the meaning.
Old Way: Like one teacher grading 1,000 tests one by one. (Slow, cannot be rushed.)
Transformer Way: Like hiring 1,000 teachers to grade 1,000 tests simultaneously. (Done in seconds.) The trade-off is that more GPUs are required, which is more expensive.
Token & Tokenizer
A Token is the smallest unit of text that a model can process. Depending on the tokenizer, it can be a word, part of a word, or a single character.
The Tokenizer is the standalone software/tool that sits in front of the Transformer.
Its only job is to translate human text into Token IDs for consumption by the embedding layer.
Therefore, a specific tokenizer corresponds to a specific embedding model so that the token IDs are interpreted correctly.

When translating tokens to token IDs, some special tokens are added to help the model better understand the sentence, for example [CLS] and [SEP] (sentence boundaries in BERT-style models), <pad> (padding), <unk> (unknown word), and <|endoftext|> (end of text in GPT-style models).
Fine Tuning
Parameter
Top-p

Top-p, also known as Nucleus Sampling, is a setting used to control the randomness and creativity of the text generated by Large Language Models (LLMs).
Imagine the AI is trying to finish the sentence: "The car drove down the..."
The AI ranks the possible next words by probability:
Road (50%)
Street (30%)
Highway (15%)
River (1%)
Banana (0.001%)
If you set Top-p to 0.9 (90%): The AI starts adding up the probabilities from the top down until it hits 90%.
Road (50%) → Total: 50% (Keep going)
Street (30%) → Total: 80% (Keep going)
Highway (15%) → Total: 95% (Stop! We crossed 90%)
The model then samples the next word only from this kept set (Road, Street, Highway) and ignores everything below the cutoff.
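A small sketch of that cutoff using the numbers above:

```python
# Top-p (nucleus sampling) cutoff sketch, using the example probabilities above.
import numpy as np

words = ["Road", "Street", "Highway", "River", "Banana"]
probs = np.array([0.50, 0.30, 0.15, 0.01, 0.001])
top_p = 0.9

order = np.argsort(probs)[::-1]                        # most likely first
cumulative = np.cumsum(probs[order])
cutoff = int(np.searchsorted(cumulative, top_p)) + 1   # keep words until we cross 90%

kept = [words[i] for i in order[:cutoff]]
kept_probs = probs[order[:cutoff]] / probs[order[:cutoff]].sum()  # renormalize what is left

print(kept)         # ['Road', 'Street', 'Highway']
print(kept_probs)   # the next word is sampled only from these
```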
Temperature

Temperature is a setting that acts as the "randomness dial" for an LLM. It controls how "confident" or "wild" the model allows itself to be when selecting the next word.
Low Temperature: The AI exaggerates the differences between probabilities. It makes the most likely word even more likely (close to 100%) and makes the less likely words almost impossible to pick.
High Temperature: The AI "flattens" the probabilities. It makes the most likely word less dominant, and gives the unlikely words a better fighting chance.
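A sketch of how temperature reshapes the probabilities before sampling (the logits are illustrative):

```python
# Temperature sketch: divide the logits by T before the softmax.
# Low T sharpens the distribution; high T flattens it.
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = np.array([4.0, 2.0, 1.0])            # e.g. "Road", "Street", "River"

print(softmax_with_temperature(logits, 0.2))  # low T: the top word approaches 100%
print(softmax_with_temperature(logits, 1.0))  # neutral
print(softmax_with_temperature(logits, 2.0))  # high T: probabilities flatten out
```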
Frequency penalty & Presence penalty
Frequency_penalty: This parameter discourages the model from repeating the same words or phrases too frequently within the generated text. A higher frequency_penalty value results in fewer repeated words.
Presence_penalty: This parameter encourages the model to include a diverse range of tokens in the generated text, so the result covers different topics or content. A higher presence_penalty value makes the model more likely to generate tokens that have not yet appeared in the text, steering it toward new topics.
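A rough sketch of how these penalties act on the raw scores, based on the adjustment described in OpenAI's documentation (the tokens, counts, and logits are invented):

```python
# Penalty sketch: logits are reduced for tokens that already appeared.
# Roughly: logit -= count * frequency_penalty + (count > 0) * presence_penalty
import numpy as np

logits = np.array([2.0, 1.5, 1.0])   # candidate tokens: "cat", "dog", "bird"
counts = np.array([3, 1, 0])         # how often each token already appeared in the output

frequency_penalty, presence_penalty = 0.8, 0.6
adjusted = logits - counts * frequency_penalty - (counts > 0) * presence_penalty

print(adjusted)   # "cat" is punished most (appeared 3 times), "bird" is untouched
```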
Max Token
The maximum number of tokens allowed for the generated answer. By default, the number of tokens the model can return is (4096 - prompt tokens) for a model with a 4,096-token context window.
Tools
For function calling, here is an example of the tool object that is provided to the model.
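A hedged sketch of such a tool definition in the OpenAI Chat Completions tools format; the function name, description, and parameters are invented for illustration:

```python
# Illustrative tool/function definition passed to the model via the `tools` parameter.
# The function name, description, and parameters below are made-up examples.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name, e.g. Hong Kong"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]
```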
Chat Completion
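A hedged example of a chat completion request that also shows where the parameters above (temperature, top_p, penalties, max tokens) are set, assuming the official OpenAI Python SDK; the model name and values are illustrative:

```python
# Illustrative Chat Completions call with the sampling parameters discussed above.
# Assumes the official OpenAI Python SDK (pip install openai) and an API key in the
# OPENAI_API_KEY environment variable; the model name and values are examples.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about cars."},
    ],
    temperature=0.7,          # randomness dial
    top_p=0.9,                # nucleus sampling cutoff
    frequency_penalty=0.5,    # discourage repeating the same words
    presence_penalty=0.5,     # encourage new topics/tokens
    max_tokens=200,           # cap on the generated answer length
)

print(response.choices[0].message.content)
```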
Embedding
Overview
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
Commonly used for:
Search (where results are ranked by relevance to a query string)
Clustering (where text strings are grouped by similarity)
Recommendations (where items with related text strings are recommended)
Anomaly detection (where outliers with little relatedness are identified)
Diversity measurement (where similarity distributions are analyzed)
Classification (where text strings are classified by their most similar label)
Example
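A hedged example of generating and comparing embeddings with the OpenAI Python SDK; the model name and input texts are illustrative:

```python
# Illustrative embeddings call: turn texts into vectors and compare their relatedness.
# Assumes the official OpenAI Python SDK and an OPENAI_API_KEY; the model name is an example.
import numpy as np
from openai import OpenAI

client = OpenAI()

texts = [
    "The cat sat on the mat.",
    "A kitten rested on the rug.",
    "Quarterly revenue grew 10%.",
]
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = [np.array(item.embedding) for item in response.data]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # higher: similar meaning
print(cosine(vectors[0], vectors[2]))  # lower: unrelated topics
```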
Function Call
Overview

You describe the function name, its description, and the required parameters, and provide them to GPT.
GPT decides which function should be called and returns the arguments based on the user's question.
Then, you use those returned arguments to call your own function and send the result back to GPT.
Finally, GPT outputs the answer; the whole flow involves multiple completion calls.
Example
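A hedged sketch of the full flow described above, assuming the OpenAI Python SDK; the tool definition, weather function, and data are all invented for illustration:

```python
# Illustrative function-calling flow: the model picks the function and arguments,
# we run our own function, send the result back, and the model writes the answer.
# Assumes the official OpenAI Python SDK and an OPENAI_API_KEY; everything else is made up.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a given city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_current_weather(city: str) -> str:
    return json.dumps({"city": city, "temperature": 31, "unit": "celsius"})  # fake data

messages = [{"role": "user", "content": "What's the weather in Hong Kong?"}]

# 1) The model decides which function to call and returns the arguments.
first = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# 2) We call our own function with those arguments.
result = get_current_weather(**args)

# 3) We send the result back; the model writes the final answer.
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
print(final.choices[0].message.content)
```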
Langchain
LangChain is a framework that facilitates the creation of applications using language models.
It provides different components (e.g., LLM models and embeddings) that allow non-AI experts to integrate existing language models into their applications.
It makes it easy for developers to build complex chains from these components.
Chains are the fundamental principle that holds various AI components in LangChain to provide context-aware responses. A chain is a series of automated actions from the user's query to the model's output. For example, developers can use a chain for:
Connecting to different data sources.
Generating unique content.
Translating multiple languages.
Answering user queries.
Chains are made of links. Each action that developers string together to form a chained sequence is called a link. With links, developers can divide complex tasks into multiple, smaller tasks. Examples of links include:
Formatting user input.
Sending a query to an LLM.
Retrieving data from cloud storage.
Translating from one language to another.
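A hedged example of a simple chain built from a prompt template and an LLM; the exact imports vary by LangChain version (this uses the classic LLMChain-style API) and the prompt is illustrative:

```python
# Illustrative LangChain chain: prompt template -> LLM, wrapped in a single chain.
# Assumes the classic LangChain API (pip install langchain openai) and an OPENAI_API_KEY;
# import paths differ between LangChain versions.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

llm = OpenAI(temperature=0.7)

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Write one friendly sentence explaining {topic} to a beginner.",
)

chain = LLMChain(llm=llm, prompt=prompt)   # link 1: format the input; link 2: query the LLM
print(chain.run("vector embeddings"))
```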
Customization Methodology

Fine-tuning
Fine-tuning entails techniques to further train a model whose weights have already been updated through prior training. Using the base model’s previous knowledge as a starting point, fine-tuning tailors the model by training it on a smaller, task-specific dataset.
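A hedged sketch of what this looks like with OpenAI's fine-tuning API: upload a small task-specific dataset, then start a fine-tuning job on top of a base model. The file name and base model are illustrative:

```python
# Illustrative fine-tuning flow with the OpenAI Python SDK.
# Assumes an OPENAI_API_KEY and a JSONL file of chat-formatted training examples,
# e.g. {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]} per line.
from openai import OpenAI

client = OpenAI()

# 1) Upload the task-specific dataset.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),   # illustrative file name
    purpose="fine-tune",
)

# 2) Further train the base model on it.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",                    # illustrative base model
)
print(job.id, job.status)
```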
Prompt engineering
Retrieval-Augmented Generation (RAG)



Retrieval-augmented generation (RAG) is an AI framework for improving the quality of LLM-generated responses by grounding the model on external sources of knowledge to supplement the LLM’s internal representation of information.
First, fetch the related content, e.g., via a function call or a similarity search based on embeddings (the "R": Retrieval).
After that, add the retrieved result to the prompt (the "A": Augmentation).
Finally, the model generates the answer from the augmented prompt (the "G": Generation).
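A hedged end-to-end sketch of those three steps, using embeddings for the retrieval part; the documents, question, and model names are invented for illustration:

```python
# Minimal RAG sketch: Retrieve by embedding similarity, Augment the prompt, Generate.
# Assumes the official OpenAI Python SDK and an OPENAI_API_KEY; all data is made up.
import numpy as np
from openai import OpenAI

client = OpenAI()

documents = [
    "Our office is open Monday to Friday, 9am to 6pm.",
    "The VPN can be reset from the internal IT portal.",
    "Annual leave requests must be submitted two weeks in advance.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [np.array(d.embedding) for d in resp.data]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_vectors = embed(documents)
question = "How do I reset the VPN?"
q_vector = embed([question])[0]

# (R) Retrieval: pick the most related document by embedding similarity.
best_doc = documents[int(np.argmax([cosine(q_vector, d) for d in doc_vectors]))]

# (A) Augmentation: add the retrieved content to the prompt.
prompt = f"Answer the question using only this context:\n{best_doc}\n\nQuestion: {question}"

# (G) Generation: the model answers from the augmented prompt.
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)
```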