LLM (Large Language Model)

Transformer Architecture

Encoder

  • Purpose: To understand the meaning of the input text and build a mathematical "summary" or "map" of it.

Step 1: Tokenization (Chopping)

The computer cannot read words. It needs numbers.

  • Action: The text is chopped into small chunks called "Tokens" (words or parts of words).

  • Input: "I love AI."

  • Output (tokens): [ "I", "love", "AI" ]

  • Output (token IDs): [ 45, 2098, 11 ]
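The chopping step above can be sketched as a toy tokenizer. The vocabulary and the IDs below are invented for illustration; real tokenizers (BPE, WordPiece) split into subwords and have vocabularies of tens of thousands of entries.

```python
# Toy tokenizer sketch; vocabulary and IDs are made up for illustration.
VOCAB = {"I": 45, "love": 2098, "AI": 11}

def tokenize(text: str) -> list[str]:
    # Real tokenizers split into subwords; here we split on spaces and
    # strip trailing punctuation to keep the example tiny.
    return [w.strip(".") for w in text.split()]

def encode(tokens: list[str]) -> list[int]:
    # Look each token up in the vocabulary to get its ID.
    return [VOCAB[t] for t in tokens]

tokens = tokenize("I love AI.")
print(tokens)          # ['I', 'love', 'AI']
print(encode(tokens))  # [45, 2098, 11]
```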

Step 2: Embedding (Vectorization)

This is where the concepts we discussed earlier (Weights/Vectors) come in.

  • Action: Each number is converted into a massive list of numbers (a vector) that represents its meaning.

  • The Logic: Words with similar meanings have similar numbers.

    • King is mathematically close to Queen.

    • Apple is mathematically far from Car.
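"Mathematically close" is usually measured with cosine similarity between the vectors. The 3-dimensional embeddings below are invented for illustration; real models use hundreds or thousands of dimensions.

```python
import math

# Toy embeddings, invented for illustration only.
EMB = {
    "King":  [0.90, 0.80, 0.10],
    "Queen": [0.85, 0.82, 0.12],
    "Apple": [0.10, 0.20, 0.90],
    "Car":   [0.80, 0.10, 0.30],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction (very related),
    # values near 0 mean unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(cosine(EMB["King"], EMB["Queen"]))  # high: close meanings
print(cosine(EMB["Apple"], EMB["Car"]))   # lower: distant meanings
```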

Step 3: Positional Encoding (The Order)

Because the Transformer looks at the whole sentence at once (Parallel Processing), it doesn't naturally know that "Man bites Dog" is different from "Dog bites Man."

  • Action: The architecture adds a mathematical "timestamp" to each word so the model knows: "This word is 1st, this word is 2nd."
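One common "timestamp" is the sinusoidal positional encoding from the original Transformer paper: each position gets a unique pattern of sine and cosine values, which is added to the word's embedding. A minimal sketch:

```python
import math

def positional_encoding(pos: int, d_model: int) -> list[float]:
    # Sinusoidal positional encoding: even dimensions use sin, odd
    # dimensions use cos, at frequencies that vary with the dimension.
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Every position produces a distinct pattern, which is how the model
# can tell "Man bites Dog" from "Dog bites Man".
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))
```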

Step 4: The Attention Blocks (The Processing)

This is the Deep Learning part (The Hidden Layers).

  • Multi-Head Attention: The model looks at the sentence from multiple "perspectives" (Heads) at the same time.

    • Head 1: Focuses on grammar (Subject-Verb).

    • Head 2: Focuses on relationships (Who is "it"?).

    • Head 3: Focuses on tone (Is this angry?).

  • Feed-Forward Network: The model passes this information through its weights (the logic learned during training) to refine the understanding.

  • Result: a Context Vector that captures the meaning of the input text.
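The core of each attention head is scaled dot-product attention: softmax(QK^T / sqrt(d)) applied to the values V. The sketch below uses plain Python lists and toy 2-dimensional vectors (invented for illustration) rather than a real tensor library, so the mechanics are easy to follow.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V,
    # written with plain lists for readability rather than speed.
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two toy token vectors attending to each other.
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
result = attention(Q, K, V)
print(result)
```

Multi-head attention simply runs several copies of this with different learned projections of Q, K, and V, then concatenates the results.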

Decoder

Purpose: To take the "map" from the Encoder and generate the output text, step-by-step.

Step 1: Input Processing (The Setup)

The Decoder takes the list of words it has generated so far: ["<Start>", "The", "Giant"].

  1. Embedding: It turns "Giant" into a vector (a list of numbers representing the meaning of Giant).

  2. Positional Encoding: It adds a "timestamp" to the vector so the model knows "Giant" is the 3rd word in the sentence.

Step 2: Masked Self-Attention (The Internal Check)

  • Action: The Decoder looks at the input ["The", "Giant"].

  • The Mask: It deliberately blocks out any future positions (so it can't see what hasn't been written yet).

  • The Logic: It calculates how "The" and "Giant" relate to each other.

    • It determines that "Giant" is an adjective following the article "The."

    • It establishes the expectation: "I have an Adjective. I need a Noun next."

  • Result: The vector for "Giant" is updated to include this grammatical context.
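The "mask" in masked self-attention is just a lower-triangular matrix: position i may attend to positions 0..i, and every future position is blocked. A minimal sketch:

```python
def causal_mask(n: int) -> list[list[int]]:
    # Position i may attend only to positions 0..i (1 = allowed, 0 = blocked).
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# For ["<Start>", "The", "Giant"]: "Giant" sees everything written so far,
# but "<Start>" cannot peek ahead at words not yet generated.
mask = causal_mask(3)
for row in mask:
    print(row)
```

In practice the blocked positions get their attention scores set to negative infinity before the softmax, so they receive zero weight.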

Step 3: Cross-Attention (The Fact Retrieval)

  • Action: The Decoder takes this updated "Giant" vector and looks at the Encoder’s Context Vector (the map of the original source sentence).

  • The Query: "I am at the word 'Giant'. What concept in the original map corresponds to this?"

  • The Match: The Attention mechanism finds a high match with the concept "Apple" in the Encoder's map.

  • Result: It pulls the "Apple" information into the Decoder. The vector now contains the meaning of the word it wants to say.

Step 4: Feed-Forward Network (The Processing)

  • Action: The "Apple" vector passes through the Feed-Forward neural network.

  • The Logic: This is where the model applies its "Deep Learning" logic (weights).

    • It refines the vector.

    • It resolves specific details (e.g., "Should it be 'Apple' or 'Apples'? Well, the source was singular, so keep it singular.")

  • Result: A highly polished vector that represents the perfect next concept.

Step 5: Linear Layer (The Vocabulary Check)

  • Action: The Decoder takes that polished vector and compares it against its entire Dictionary (e.g., 50,000 possible words).

  • The Math: It performs a massive multiplication.

    • How much does this vector look like "Aardvark"? (Score: 0.01)

    • How much does this vector look like "Banana"? (Score: 3.5)

    • How much does this vector look like "Apple"? (Score: 15.2)

  • Result: A list of raw scores (called Logits) for every word in the dictionary.

Step 6: Softmax Function (The Probability)

  • Action: It turns those raw scores into percentages.

  • Result:

    • Apple: 94%

    • Pear: 4%

    • Car: 0.001%
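The logit-to-percentage conversion is the softmax function: exponentiate each score, then divide by the sum. Feeding it the example logits from Step 5 shows how a modest lead in raw score becomes a dominant probability.

```python
import math

def softmax(logits):
    # Subtract the max first for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Raw logits from Step 5: Aardvark, Banana, Apple.
logits = [0.01, 3.5, 15.2]
probs = softmax(logits)
print([round(p, 6) for p in probs])  # "Apple" dominates
```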

Step 7: Selection (Decoding Strategy)

  • Action: The computer picks the word.

    • Greedy Search: Picks the highest number (94% -> "Apple").

    • Temperature Sampling: Sometimes picks a slightly less likely word to be "creative" (e.g., might pick "Pear").

  • Final Output: The word "Apple" is printed on the screen.
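The two selection strategies can be sketched in a few lines; the probabilities below are the ones from Step 6.

```python
import random

def greedy(probs: dict) -> str:
    # Greedy search: always pick the highest-probability word.
    return max(probs, key=probs.get)

def sample(probs: dict, rng: random.Random) -> str:
    # Sampling: occasionally pick a less likely word, in proportion
    # to its probability, which makes the output more "creative".
    words, weights = zip(*probs.items())
    return rng.choices(words, weights=weights, k=1)[0]

probs = {"Apple": 0.94, "Pear": 0.04, "Car": 0.001}
print(greedy(probs))                    # Apple
print(sample(probs, random.Random(0)))  # usually Apple, sometimes Pear
```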

Relationship between Transformer and LLM

  • An LLM is trained using the Transformer architecture from deep learning.

  • With the Transformer, every token in the input can be processed in parallel, which makes training on very large datasets practical.

  • Because the whole input is processed at once, rather than token-by-token as in an RNN, the model can keep track of the full context and better capture the meaning.

  • Old Way (RNN): Like one teacher grading 1,000 tests one by one. (Slow, cannot be rushed.)

  • Transformer Way: Like hiring 1,000 teachers to grade 1,000 tests simultaneously. (Done in seconds.) The trade-off: more GPUs are required, which makes it more expensive.

Token & Tokenizer

  • A Token is the smallest unit of text that a model can process. It can be a word, part of a word, or a single character.

  • The Tokenizer is the standalone software/tool that sits in front of the Transformer.

    Its only job is to translate human text into Token IDs for consumption by the embedding layer.

  • A tokenizer is therefore tied to a specific embedding model: only the matching model can interpret its token IDs correctly.

  • When translating tokens to token IDs, some special tokens are added to help the model understand the sentence.

  • Common examples of special tokens include markers for the start and end of a sequence, padding, and unknown words (e.g. [CLS], [SEP], <pad>, <unk>).

Fine Tuning

Parameter

Top-p

  • Top-p, also known as Nucleus Sampling, is a setting used to control the randomness and creativity of the text generated by Large Language Models (LLMs)

  • Imagine the AI is trying to finish the sentence: "The car drove down the..."

    The AI ranks the possible next words by probability:

    1. Road (50%)

    2. Street (30%)

    3. Highway (15%)

    4. River (1%)

    5. Banana (0.001%)

    If you set Top-p to 0.9 (90%): The AI starts adding up the probabilities from the top down until it hits 90%.

    • Road (50%) → Total: 50% (Keep going)

    • Street (30%) → Total: 80% (Keep going)

    • Highway (15%) → Total: 95% (Stop! We crossed 90%)

    The remaining words (River, Banana) are discarded, and the model samples only from the kept set.
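The cumulative cut-off above can be sketched as a filter over the probability table; in a real sampler the kept probabilities would then be renormalized before sampling.

```python
def top_p_filter(probs: dict, p: float) -> dict:
    # Keep adding words, highest probability first, until the running
    # total reaches p; everything after that point is dropped.
    kept, total = {}, 0.0
    for word, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[word] = prob
        total += prob
        if total >= p:
            break
    return kept

probs = {"Road": 0.50, "Street": 0.30, "Highway": 0.15,
         "River": 0.01, "Banana": 0.00001}
print(top_p_filter(probs, 0.9))  # Road, Street, Highway survive
```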

Temperature

  • Temperature is a setting that acts as the "randomness dial" for an LLM. It controls how "confident" or "wild" the model allows itself to be when selecting the next word.

  • Low Temperature: The AI exaggerates the differences between probabilities. It makes the most likely word even more likely (close to 100%) and makes the less likely words almost impossible to pick.

  • High Temperature: The AI "flattens" the probabilities. It makes the most likely word less dominant, and gives the unlikely words a better fighting chance.
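Mechanically, temperature divides the logits before the softmax: a low temperature sharpens the distribution, a high one flattens it. The logits below are invented for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Dividing logits by the temperature before softmax sharpens
    # (T < 1) or flattens (T > 1) the resulting distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]
low = softmax_with_temperature(logits, 0.5)   # confident: top word dominates
high = softmax_with_temperature(logits, 2.0)  # flat: others get a chance
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```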

Frequency penalty & Presence penalty

  • Frequency_penalty: Discourages the model from repeating the same words or phrases too often within the generated text. A higher frequency_penalty value results in fewer repeated keywords.

  • Presence_penalty: Encourages the model to include a diverse range of tokens, steering the output toward new topics or content. A higher presence_penalty value makes the model more likely to generate tokens that have not yet appeared in the generated text.
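A sketch of how such penalties are commonly applied (modeled on the adjustment described in OpenAI's API documentation): the frequency penalty scales with how often a token has already appeared, while the presence penalty is a flat deduction for appearing at all. The tokens and values are invented for illustration.

```python
def apply_penalties(logits: dict, counts: dict,
                    frequency_penalty: float, presence_penalty: float) -> dict:
    # Lower a token's logit based on its appearance count so far.
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        adjusted[token] = (logit
                           - count * frequency_penalty
                           - (1.0 if count > 0 else 0.0) * presence_penalty)
    return adjusted

logits = {"cat": 5.0, "dog": 4.8}
counts = {"cat": 3}  # "cat" was already generated three times
print(apply_penalties(logits, counts, 0.5, 0.6))
# "cat" drops below "dog", so repetition becomes less likely
```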

Max Token

  • The maximum number of tokens allowed for the generated answer. By default, the number of tokens the model can return is the context window minus the prompt tokens (e.g. 4096 - prompt tokens).

Tools

  • For a function call, each function is described to the model as a structured object (name, description, parameters).
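A hedged sketch of such an object in the OpenAI-style "tools" format; the weather function and its parameters are invented for illustration.

```python
# Hypothetical tool definition; the function name and parameters are
# invented for illustration.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a given city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string",
                         "description": "City name, e.g. Hong Kong"},
                "unit": {"type": "string",
                         "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
print(get_weather_tool["function"]["name"])
```

The `parameters` field is a JSON Schema object, which is how the model learns what arguments the function expects.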

Chat Completion

Embedding

Overview

  • An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.

  • Commonly used for:

    • Search (where results are ranked by relevance to a query string)

    • Clustering (where text strings are grouped by similarity)

    • Recommendations (where items with related text strings are recommended)

    • Anomaly detection (where outliers with little relatedness are identified)

    • Diversity measurement (where similarity distributions are analyzed)

    • Classification (where text strings are classified by their most similar label)

Example
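A minimal search example using cosine similarity: documents are ranked by how close their embedding is to the query embedding. The embeddings below are pretend values invented for illustration; real ones would come from an embedding model API.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Pretend document embeddings, invented for illustration.
DOCS = {
    "cat food review":   [0.90, 0.10, 0.00],
    "dog training tips": [0.70, 0.30, 0.10],
    "tax filing guide":  [0.00, 0.10, 0.90],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "best food for cats"

# Rank documents by relatedness to the query (search use case).
ranked = sorted(DOCS, key=lambda d: cosine(query, DOCS[d]), reverse=True)
print(ranked)  # most related document first
```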

Function Call

Overview

  • You describe each function's name, parameters, and description, and provide this to GPT.

  • GPT decides which function should be called and returns the arguments, based on the user's question.

  • You then call your own function with those arguments and return the result to GPT.

  • Finally, GPT outputs the answer; the whole flow involves multiple completion calls.

Example
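A sketch of the application side of that loop, with the model's reply simulated locally so the example runs without network access. The tool-call shape is modeled on the OpenAI-style function-calling response; the weather function and its values are invented.

```python
import json

def get_current_weather(city: str) -> str:
    # Hypothetical implementation; a real app would query a weather API.
    return json.dumps({"city": city, "temp_c": 28})

AVAILABLE = {"get_current_weather": get_current_weather}

# Pretend the model replied with this tool call (values invented).
model_tool_call = {
    "name": "get_current_weather",
    "arguments": json.dumps({"city": "Hong Kong"}),
}

# The application executes the chosen function with the returned arguments...
fn = AVAILABLE[model_tool_call["name"]]
result = fn(**json.loads(model_tool_call["arguments"]))
print(result)
# ...and would then send `result` back to the model in a second completion
# call so it can phrase the final answer.
```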

Langchain

  • LangChain is a framework that facilitates the creation of applications using language models.

  • It provides components (e.g. LLM models and embeddings) that allow non-AI experts to integrate existing AI language models into their applications.

  • It makes it easy for developers to build complex chains from these components.

  • Chains are the fundamental principle that holds various AI components in LangChain to provide context-aware responses. A chain is a series of automated actions from the user's query to the model's output. For example, developers can use a chain for:

    • Connecting to different data sources.

    • Generating unique content.

    • Translating multiple languages.

    • Answering user queries.

  • Chains are made of links. Each action that developers string together to form a chained sequence is called a link. With links, developers can divide complex tasks into multiple, smaller tasks. Examples of links include:

    • Formatting user input.

    • Sending a query to an LLM.

    • Retrieving data from cloud storage.

    • Translating from one language to another.
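The chain-of-links idea can be illustrated in plain Python (this is not the actual LangChain API): each link is a small function, and a chain just runs them in order, passing each output to the next link. The `fake_llm` stand-in is invented for illustration.

```python
def format_input(text: str) -> str:
    # Link 1: format the user input.
    return text.strip().lower()

def fake_llm(prompt: str) -> str:
    # Link 2: stand-in for a real LLM call (invented for illustration).
    return f"answer to: {prompt}"

def chain(*links):
    # Compose links into a single runnable chain.
    def run(value):
        for link in links:
            value = link(value)
        return value
    return run

qa_chain = chain(format_input, fake_llm)
print(qa_chain("  What is LangChain?  "))  # answer to: what is langchain?
```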

Customization Methodology

Fine-tuning

  • Fine-tuning entails techniques to further train a model whose weights have already been updated through prior training. Using the base model’s previous knowledge as a starting point, fine-tuning tailors the model by training it on a smaller, task-specific dataset.

Prompt engineering

Retrieval-Augmented Generation (RAG)

  • Retrieval-augmented generation (RAG) is an AI framework for improving the quality of LLM-generated responses by grounding the model on external sources of knowledge to supplement the LLM’s internal representation of information.

  • First, fetch the related content, e.g. via function calls or a similarity search over embeddings (Retrieval).

  • Then, add the retrieved results to the prompt (Augmentation).

  • Finally, the model returns a response grounded in that content (Generation).
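The retrieve-then-augment steps can be sketched end to end. The knowledge base and its 2-dimensional embeddings are invented for illustration; real embeddings would come from an embedding model, and the final prompt would be sent to the LLM.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy knowledge base with invented embeddings.
KB = [
    ("The office opens at 9am.", [0.9, 0.1]),
    ("Lunch is served at noon.", [0.1, 0.9]),
]

def rag_prompt(question: str, q_emb, top_k: int = 1) -> str:
    # (R) Retrieve: rank stored chunks by similarity to the question embedding.
    ranked = sorted(KB, key=lambda item: cosine(q_emb, item[1]), reverse=True)
    context = "\n".join(text for text, _ in ranked[:top_k])
    # (A) Augment: put the retrieved context into the prompt.
    # (G) Generate: the prompt would then be sent to the LLM.
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = rag_prompt("When does the office open?", [0.8, 0.2])
print(prompt)
```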
