LLM (Large Language Model)
Transformer Architecture

Encoder
Purpose: To understand the meaning of the input text and create a mathematical "summary" or "map" of it.
Step 1: Tokenization (Chopping)
The computer cannot read words. It needs numbers.
Action: The text is chopped into small chunks called "Tokens" (words or parts of words).
Input: "I love AI."
Output:
[ "I", "love", "AI" ]→[ 45, 2098, 11 ]
Step 2: Embedding (Vectorization)
This is where the concepts we discussed earlier (Weights/Vectors) come in.
Action: Each number is converted into a massive list of numbers (a vector) that represents its meaning.
The Logic: Words with similar meanings have similar numbers.
King is mathematically close to Queen.
Apple is mathematically far from Car.
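A toy sketch of this idea; the 4-dimensional vectors below are invented for illustration (real models learn vectors with hundreds or thousands of dimensions):

```python
# Toy embeddings: similar meanings -> similar vectors (measured by cosine similarity).
# These vectors are made up for illustration; real embeddings are learned during training.
import numpy as np

embeddings = {
    "king":  np.array([0.90, 0.80, 0.10, 0.20]),
    "queen": np.array([0.88, 0.82, 0.15, 0.25]),
    "apple": np.array([0.10, 0.20, 0.90, 0.10]),
    "car":   np.array([0.05, 0.10, 0.10, 0.95]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: close in meaning
print(cosine_similarity(embeddings["apple"], embeddings["car"]))   # much lower: far apart
```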
Step 3: Positional Encoding (The Order)
Because the Transformer looks at the whole sentence at once (Parallel Processing), it doesn't naturally know that "Man bites Dog" is different from "Dog bites Man."
Action: The architecture adds a mathematical "timestamp" to each word so the model knows: "This word is 1st, this word is 2nd."
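A rough sketch of the classic sinusoidal positional encoding from the original Transformer paper (sizes shrunk for readability):

```python
# Sinusoidal positional encoding sketch (as in "Attention Is All You Need").
# Each position gets a unique pattern of sine/cosine values that is simply
# added to the word's embedding, so the model can tell 1st from 2nd from 3rd.
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                        # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                        # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=3, d_model=8)                   # 3 words, 8-dim embeddings
# embedded_tokens + pe  <- the "timestamp" is added to each word vector
print(pe.round(3))
```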
Step 4: The Attention Blocks (The Processing)
This is the Deep Learning part (The Hidden Layers).
Multi-Head Attention: The model looks at the sentence from multiple "perspectives" (Heads) at the same time.
Head 1: Focuses on grammar (Subject-Verb).
Head 2: Focuses on relationships (Who is "it"?).
Head 3: Focuses on tone (Is this angry?).
Feed-Forward Network: The model passes this information through its weights (the logic learned during training) to refine the understanding.
The final result of the Encoder is a context vector that represents the meaning of the input text.
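A compact sketch of the core operation inside one attention head (scaled dot-product attention); multi-head attention simply runs several of these in parallel on different learned projections:

```python
# Scaled dot-product attention, the core of every attention head.
# Q, K, V are learned projections of the token vectors; each head has its own.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # how much each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over each row
    return weights @ V                                  # weighted mix of the value vectors

# 3 tokens ("I", "love", "AI") with 4-dimensional vectors (toy sizes, random numbers)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)      # (3, 4): one refined vector per token
```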
Decoder
Purpose: To take the "map" from the Encoder and generate the output text, step-by-step.
Step 1: Input Processing (The Setup)
The Decoder takes the list of words it has generated so far: ["<Start>", "The", "Giant"].
Embedding: It turns "Giant" into a vector (a list of numbers representing the meaning of Giant).
Positional Encoding: It adds a "timestamp" to the vector so the model knows "Giant" is the 3rd word in the sentence.
Step 2: Masked Self-Attention (The Internal Check)
Action: The Decoder looks at the input ["The", "Giant"].
The Mask: It deliberately blocks out any future positions (so it can't see what hasn't been written yet).
The Logic: It calculates how "The" and "Giant" relate to each other.
It determines that "Giant" is an adjective following the article "The."
It establishes the expectation: "I have an Adjective. I need a Noun next."
Result: The vector for "Giant" is updated to include this grammatical context.
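A short sketch of the mask itself: future positions are set to minus infinity before the softmax, so they receive zero attention (toy scores, random numbers):

```python
# Causal (look-ahead) mask sketch: position i may only attend to positions <= i.
import numpy as np

seq_len = 3                                        # e.g. ["<Start>", "The", "Giant"]
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))
scores[mask] = -np.inf                             # block the future before softmax
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))                            # upper triangle is 0: no peeking ahead
```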
Step 3: Cross-Attention (The Fact Retrieval)
Action: The Decoder takes this updated "Giant" vector and looks at the Encoder’s Context Vector (the map of the original source sentence).
The Query: "I am at the word 'Giant'. What concept in the original map corresponds to this?"
The Match: The Attention mechanism finds a high match with the concept "Apple" in the Encoder's map.
Result: It pulls the "Apple" information into the Decoder. The vector now contains the meaning of the word it wants to say.
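The only structural difference from self-attention is where the vectors come from: the queries come from the Decoder, while the keys and values come from the Encoder's context vectors. A toy sketch (random numbers, illustrative shapes):

```python
# Cross-attention sketch: decoder queries attend over the encoder's outputs.
import numpy as np

rng = np.random.default_rng(2)
decoder_states = rng.normal(size=(3, 4))   # e.g. vectors for ["<Start>", "The", "Giant"]
encoder_states = rng.normal(size=(5, 4))   # context vectors of the source sentence

Q = decoder_states                          # "what am I looking for?"
K = V = encoder_states                      # "what does the source sentence contain?"

scores = Q @ K.T / np.sqrt(Q.shape[-1])
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
pulled = weights @ V                        # source information pulled into the decoder
print(pulled.shape)                         # (3, 4)
```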
Step 4: Feed-Forward Network (The Processing)
Action: The "Apple" vector passes through the Feed-Forward neural network.
The Logic: This is where the model applies its "Deep Learning" logic (weights).
It refines the vector.
It resolves specific details (e.g., "Should it be 'Apple' or 'Apples'? Well, the source was singular, so keep it singular.")
Result: A highly polished vector that represents the perfect next concept.
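A minimal sketch of this block: two linear layers with a non-linearity in between (the sizes here are toy values):

```python
# Position-wise feed-forward network sketch: expand, apply a non-linearity, compress.
import numpy as np

rng = np.random.default_rng(3)
d_model, d_ff = 4, 16                       # toy sizes; real models use thousands of dimensions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    hidden = np.maximum(0, x @ W1 + b1)     # ReLU non-linearity
    return hidden @ W2 + b2                 # back to the model dimension

x = rng.normal(size=(1, d_model))           # the updated vector from the previous step
print(feed_forward(x).shape)                # (1, 4): a refined vector of the same size
```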
Step 5: Linear Layer (The Vocabulary Check)
Action: The Decoder takes that polished vector and compares it against its entire Dictionary (e.g., 50,000 possible words).
The Math: It performs a massive multiplication.
How much does this vector look like "Aardvark"? (Score: 0.01)
How much does this vector look like "Banana"? (Score: 3.5)
How much does this vector look like "Apple"? (Score: 15.2)
Result: A list of raw scores (called Logits) for every word in the dictionary.
Step 6: Softmax Function (The Probability)
Action: It turns those raw scores into percentages.
Result:
Apple: 94%
Pear: 4%
Car: 0.001%
Step 7: Selection (Decoding Strategy)
Action: The computer picks the word.
Greedy Search: Picks the highest number (94% -> "Apple").
Temperature Sampling: Sometimes picks a slightly less likely word to be "creative" (e.g., might pick "Pear").
Final Output: The word "Apple" is printed on the screen.
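A sketch that ties steps 5 to 7 together: project onto the vocabulary, softmax the logits, then pick a word. The vocabulary and scores below are invented to match the example:

```python
# Steps 5-7 in miniature: logits -> softmax -> pick the next word.
import numpy as np

vocab = ["Aardvark", "Banana", "Apple", "Pear", "Car"]           # tiny toy vocabulary
logits = np.array([0.01, 3.5, 15.2, 12.0, 1.0])                  # raw scores from the linear layer

probs = np.exp(logits - logits.max())
probs /= probs.sum()                                             # softmax -> probabilities

greedy_choice = vocab[int(np.argmax(probs))]                     # greedy search: always the top word
sampled_choice = np.random.default_rng(0).choice(vocab, p=probs) # sampling: usually "Apple", occasionally not

print(dict(zip(vocab, probs.round(4))))
print("greedy:", greedy_choice, "| sampled:", sampled_choice)
```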
Relationship between Transformer and LLM
An LLM is a deep learning model trained on the Transformer architecture.
With the Transformer approach, every token can be processed in parallel, which makes it practical to learn from very large datasets.
Furthermore, because the input is processed all at once instead of through sequential calls (as with an RNN), the model can keep track of the full context and better capture the meaning.
Old Way: Like one teacher grading 1,000 tests one by one. (Slow, cannot be rushed.)
Transformer Way: Like hiring 1,000 teachers to grade 1,000 tests simultaneously. (Done in seconds.) The trade-off is that more GPUs are required, which is more expensive.
Token & Tokenizer
A Token is the smallest unit of text that a model can process. Depending on the tokenizer, it can be a word, part of a word, or a single character.
The Tokenizer is the standalone software/tool that sits in front of the Transformer.
Its only job is to translate human text into Token IDs for consumption by the embedding layer.
Therefore, a specific tokenizer corresponds to a specific embedding model so that the token IDs are interpreted correctly.

When translating tokens to token IDs, some special tokens are added to help the model better understand the sentence, for example [CLS] and [SEP] (sentence boundaries in BERT-style models), <pad> (padding), <unk> (unknown word), and <|endoftext|> (end of text in GPT-style models).
Fine Tuning
Parameter
Top-p

Top-p, also known as Nucleus Sampling, is a setting used to control the randomness and creativity of the text generated by Large Language Models (LLMs).
Imagine the AI is trying to finish the sentence: "The car drove down the..."
The AI ranks the possible next words by probability:
Road (50%)
Street (30%)
Highway (15%)
River (1%)
Banana (0.001%)
If you set Top-p to 0.9 (90%): The AI starts adding up the probabilities from the top down until it hits 90%.
Road (50%) → Total: 50% (Keep going)
Street (30%) → Total: 80% (Keep going)
Highway (15%) → Total: 95% (Stop! We crossed 90%)
The model then samples the next word only from this kept set (Road, Street, Highway) and ignores everything below the cutoff.
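A small sketch of that cutoff using the numbers above:

```python
# Top-p (nucleus sampling) cutoff sketch, using the example probabilities above.
import numpy as np

words = ["Road", "Street", "Highway", "River", "Banana"]
probs = np.array([0.50, 0.30, 0.15, 0.01, 0.001])
top_p = 0.9

order = np.argsort(probs)[::-1]                        # most likely first
cumulative = np.cumsum(probs[order])
cutoff = int(np.searchsorted(cumulative, top_p)) + 1   # keep words until we cross 90%

kept = [words[i] for i in order[:cutoff]]
kept_probs = probs[order[:cutoff]] / probs[order[:cutoff]].sum()  # renormalize what is left

print(kept)         # ['Road', 'Street', 'Highway']
print(kept_probs)   # the next word is sampled only from these
```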
Temperature

Temperature is a setting that acts as the "randomness dial" for an LLM. It controls how "confident" or "wild" the model allows itself to be when selecting the next word.
Low Temperature: The AI exaggerates the differences between probabilities. It makes the most likely word even more likely (close to 100%) and makes the less likely words almost impossible to pick.
High Temperature: The AI "flattens" the probabilities. It makes the most likely word less dominant, and gives the unlikely words a better fighting chance.
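A sketch of how temperature reshapes the probabilities before sampling (the logits are illustrative):

```python
# Temperature sketch: divide the logits by T before the softmax.
# Low T sharpens the distribution; high T flattens it.
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = np.array([4.0, 2.0, 1.0])            # e.g. "Road", "Street", "River"

print(softmax_with_temperature(logits, 0.2))  # low T: the top word approaches 100%
print(softmax_with_temperature(logits, 1.0))  # neutral
print(softmax_with_temperature(logits, 2.0))  # high T: probabilities flatten out
```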
Frequency penalty & Presence penalty
Frequency_penalty: This parameter discourages the model from repeating the same words or phrases too frequently within the generated text. A higher frequency_penalty value results in fewer repeated words.
Presence_penalty: This parameter encourages the model to include a diverse range of tokens in the generated text, so the result covers different topics or content. A higher presence_penalty value makes the model more likely to generate tokens that have not yet appeared in the text, steering it toward new topics.
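A rough sketch of how these penalties act on the raw scores, based on the adjustment described in OpenAI's documentation (the tokens, counts, and logits are invented):

```python
# Penalty sketch: logits are reduced for tokens that already appeared.
# Roughly: logit -= count * frequency_penalty + (count > 0) * presence_penalty
import numpy as np

logits = np.array([2.0, 1.5, 1.0])   # candidate tokens: "cat", "dog", "bird"
counts = np.array([3, 1, 0])         # how often each token already appeared in the output

frequency_penalty, presence_penalty = 0.8, 0.6
adjusted = logits - counts * frequency_penalty - (counts > 0) * presence_penalty

print(adjusted)   # "cat" is punished most (appeared 3 times), "bird" is untouched
```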
Max Token
The maximum number of tokens allowed for the generated answer. By default, the number of tokens the model can return is (4096 - prompt tokens) for a model with a 4,096-token context window.
Tools
For function calling, here is an example of the tool object that is provided to the model.
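A hedged sketch of such a tool definition in the OpenAI Chat Completions tools format; the function name, description, and parameters are invented for illustration:

```python
# Illustrative tool/function definition passed to the model via the `tools` parameter.
# The function name, description, and parameters below are made-up examples.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name, e.g. Hong Kong"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]
```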
Chat Completion
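A hedged example of a chat completion request that also shows where the parameters above (temperature, top_p, penalties, max tokens) are set, assuming the official OpenAI Python SDK; the model name and values are illustrative:

```python
# Illustrative Chat Completions call with the sampling parameters discussed above.
# Assumes the official OpenAI Python SDK (pip install openai) and an API key in the
# OPENAI_API_KEY environment variable; the model name and values are examples.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about cars."},
    ],
    temperature=0.7,          # randomness dial
    top_p=0.9,                # nucleus sampling cutoff
    frequency_penalty=0.5,    # discourage repeating the same words
    presence_penalty=0.5,     # encourage new topics/tokens
    max_tokens=200,           # cap on the generated answer length
)

print(response.choices[0].message.content)
```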
Embedding
Overview
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
Commonly used for:
Search (where results are ranked by relevance to a query string)
Clustering (where text strings are grouped by similarity)
Recommendations (where items with related text strings are recommended)
Anomaly detection (where outliers with little relatedness are identified)
Diversity measurement (where similarity distributions are analyzed)
Classification (where text strings are classified by their most similar label)
Example
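A hedged example of generating and comparing embeddings with the OpenAI Python SDK; the model name and input texts are illustrative:

```python
# Illustrative embeddings call: turn texts into vectors and compare their relatedness.
# Assumes the official OpenAI Python SDK and an OPENAI_API_KEY; the model name is an example.
import numpy as np
from openai import OpenAI

client = OpenAI()

texts = [
    "The cat sat on the mat.",
    "A kitten rested on the rug.",
    "Quarterly revenue grew 10%.",
]
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = [np.array(item.embedding) for item in response.data]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # higher: similar meaning
print(cosine(vectors[0], vectors[2]))  # lower: unrelated topics
```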
Function Call
Overview

You describe the function name, its description, and the required parameters, and provide them to GPT.
GPT decides which function should be called and returns the arguments based on the user's question.
Then, you use those returned arguments to call your own function and send the result back to GPT.
Finally, GPT outputs the answer; the whole flow involves multiple completion calls.
Example
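A hedged sketch of the full flow described above, assuming the OpenAI Python SDK; the tool definition, weather function, and data are all invented for illustration:

```python
# Illustrative function-calling flow: the model picks the function and arguments,
# we run our own function, send the result back, and the model writes the answer.
# Assumes the official OpenAI Python SDK and an OPENAI_API_KEY; everything else is made up.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a given city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_current_weather(city: str) -> str:
    return json.dumps({"city": city, "temperature": 31, "unit": "celsius"})  # fake data

messages = [{"role": "user", "content": "What's the weather in Hong Kong?"}]

# 1) The model decides which function to call and returns the arguments.
first = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# 2) We call our own function with those arguments.
result = get_current_weather(**args)

# 3) We send the result back; the model writes the final answer.
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
print(final.choices[0].message.content)
```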
Langchain
LangChain is a framework that facilitates the creation of applications using language models.
It provides different components (e.g., LLM models and embeddings) that allow non-AI experts to integrate existing language models into their applications.
It makes it easy for developers to build complex chains from these components.
Chains are the fundamental principle that holds various AI components in LangChain to provide context-aware responses. A chain is a series of automated actions from the user's query to the model's output. For example, developers can use a chain for:
Connecting to different data sources.
Generating unique content.
Translating multiple languages.
Answering user queries.
Chains are made of links. Each action that developers string together to form a chained sequence is called a link. With links, developers can divide complex tasks into multiple, smaller tasks. Examples of links include:
Formatting user input.
Sending a query to an LLM.
Retrieving data from cloud storage.
Translating from one language to another.
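A hedged example of a simple chain built from a prompt template and an LLM; the exact imports vary by LangChain version (this uses the classic LLMChain-style API) and the prompt is illustrative:

```python
# Illustrative LangChain chain: prompt template -> LLM, wrapped in a single chain.
# Assumes the classic LangChain API (pip install langchain openai) and an OPENAI_API_KEY;
# import paths differ between LangChain versions.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

llm = OpenAI(temperature=0.7)

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Write one friendly sentence explaining {topic} to a beginner.",
)

chain = LLMChain(llm=llm, prompt=prompt)   # link 1: format the input; link 2: query the LLM
print(chain.run("vector embeddings"))
```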
Customization Methodology

Fine-tuning
Fine-tuning entails techniques to further train a model whose weights have already been updated through prior training. Using the base model’s previous knowledge as a starting point, fine-tuning tailors the model by training it on a smaller, task-specific dataset.
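A hedged sketch of what this looks like with OpenAI's fine-tuning API: upload a small task-specific dataset, then start a fine-tuning job on top of a base model. The file name and base model are illustrative:

```python
# Illustrative fine-tuning flow with the OpenAI Python SDK.
# Assumes an OPENAI_API_KEY and a JSONL file of chat-formatted training examples,
# e.g. {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]} per line.
from openai import OpenAI

client = OpenAI()

# 1) Upload the task-specific dataset.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),   # illustrative file name
    purpose="fine-tune",
)

# 2) Further train the base model on it.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",                    # illustrative base model
)
print(job.id, job.status)
```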
Prompt engineering
Retrieval-Augmented Generation (RAG)



Retrieval-augmented generation (RAG) is an AI framework for improving the quality of LLM-generated responses by grounding the model on external sources of knowledge to supplement the LLM’s internal representation of information.
First, fetch the related content, e.g., via a function call or a similarity search based on embeddings (the "R": Retrieval).
After that, add the retrieved result to the prompt (the "A": Augmentation).
Finally, the model generates the answer from the augmented prompt (the "G": Generation).
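A hedged end-to-end sketch of those three steps, using embeddings for the retrieval part; the documents, question, and model names are invented for illustration:

```python
# Minimal RAG sketch: Retrieve by embedding similarity, Augment the prompt, Generate.
# Assumes the official OpenAI Python SDK and an OPENAI_API_KEY; all data is made up.
import numpy as np
from openai import OpenAI

client = OpenAI()

documents = [
    "Our office is open Monday to Friday, 9am to 6pm.",
    "The VPN can be reset from the internal IT portal.",
    "Annual leave requests must be submitted two weeks in advance.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [np.array(d.embedding) for d in resp.data]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_vectors = embed(documents)
question = "How do I reset the VPN?"
q_vector = embed([question])[0]

# (R) Retrieval: pick the most related document by embedding similarity.
best_doc = documents[int(np.argmax([cosine(q_vector, d) for d in doc_vectors]))]

# (A) Augmentation: add the retrieved content to the prompt.
prompt = f"Answer the question using only this context:\n{best_doc}\n\nQuestion: {question}"

# (G) Generation: the model answers from the augmented prompt.
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)
```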