curl https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/chat/completions?api-version=2023-05-15 \
-H "Content-Type: application/json" \
-H "api-key: YOUR_API_KEY" \
-d '{"messages":[{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Does Azure OpenAI support customer managed keys?"},{"role": "assistant", "content": "Yes, customer managed keys are supported by Azure OpenAI."},{"role": "user", "content": "Do other Azure AI services support this too?"}]}'
{
  "id": "chatcmpl-6v7mkQj980V1yBec6ETrKPRqFjNw9",
  "object": "chat.completion",
  "created": 1679072642,
  "model": "gpt-35-turbo",
  "usage": {
    "prompt_tokens": 58,
    "completion_tokens": 68,
    "total_tokens": 126
  },
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Yes, other Azure AI services also support customer managed keys. Azure AI services offer multiple options for customers to manage keys, such as using Azure Key Vault, customer-managed keys in Azure Key Vault or customer-managed keys through Azure Storage service. This helps customers ensure that their data is secure and access to their services is controlled."
      },
      "finish_reason": "stop",
      "index": 0
    }
  ]
}
Parameters
Messages
Each message includes a role and content.
Role: Indicates who is sending the current message. Can be system, user, assistant, tool, or function.
Content: The text of the message, e.g. the user's question or the assistant's answer.
Temperature
What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
Frequency penalty & Presence penalty
Frequency_penalty: Discourages the model from repeating the same words or phrases too frequently within the generated text. A higher frequency_penalty value results in less repetition.
Presence_penalty: Encourages the model to include a diverse range of tokens in the generated text, so the output is more likely to move on to different topics or content. A higher presence_penalty value makes the model more likely to generate tokens that have not yet appeared in the generated text.
Max tokens
The maximum number of tokens allowed for the generated answer. By default, the number of tokens the model can return is (4096 - prompt tokens).
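As a minimal sketch of how these parameters fit together (assuming the openai Python package, with the endpoint, key, and deployment name below as placeholders for your own values), a chat completion call can set temperature, the penalties, and max_tokens in one request:
from openai import AzureOpenAI

# Placeholder values: replace with your own resource endpoint, key, and deployment name
client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com",
    api_key="YOUR_API_KEY",
    api_version="2023-05-15",
)

response = client.chat.completions.create(
    model="YOUR_DEPLOYMENT_NAME",  # on Azure, the deployment name is used as the model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Does Azure OpenAI support customer managed keys?"},
    ],
    temperature=0.2,        # lower = more focused and deterministic
    frequency_penalty=0.5,  # discourage repeating the same tokens
    presence_penalty=0.5,   # encourage new tokens / new topics
    max_tokens=256,         # cap on tokens in the generated answer
)
print(response.choices[0].message.content)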
Tools
For function calling, here is an example tools object:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]
You describe the function name, its description, and its required parameters, and provide this schema to GPT.
GPT decides which function should be called and returns the arguments, based on the user's question.
You then use the returned arguments to call your own function, and pass the function's result back to GPT.
Finally, GPT outputs the answer; this involves multiple completion calls.
Embedding
Overview
Commonly used for:
Search (where results are ranked by relevance to a query string)
Clustering (where text strings are grouped by similarity)
Recommendations (where items with related text strings are recommended)
Anomaly detection (where outliers with little relatedness are identified)
Diversity measurement (where similarity distributions are analyzed)
Classification (where text strings are classified by their most similar label)
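A minimal sketch of generating embeddings and comparing them with cosine similarity, assuming the openai Python package; the text-embedding-ada-002 model name and the sample strings are placeholders, swap in your own deployment or model name:
from openai import OpenAI
import math

client = OpenAI()

def embed(text, model="text-embedding-ada-002"):
    # model name is an assumption; use your own embedding model or deployment
    return client.embeddings.create(model=model, input=text).data[0].embedding

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = embed("How do I reset my password?")
doc = embed("Steps to change or recover your account password")
print(cosine_similarity(query, doc))  # closer to 1 means more related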
Function calling example
from openai import OpenAI
import json

client = OpenAI()

# Example dummy function hard coded to return the same weather
# In production, this could be your backend API or an external API
def get_current_weather(location, unit="fahrenheit"):
    """Get the current weather in a given location"""
    if "tokyo" in location.lower():
        return json.dumps({"location": "Tokyo", "temperature": "10", "unit": unit})
    elif "san francisco" in location.lower():
        return json.dumps({"location": "San Francisco", "temperature": "72", "unit": unit})
    elif "paris" in location.lower():
        return json.dumps({"location": "Paris", "temperature": "22", "unit": unit})
    else:
        return json.dumps({"location": location, "temperature": "unknown"})

def run_conversation():
    # Step 1: send the conversation and available functions to the model
    messages = [{"role": "user", "content": "What's the weather like in San Francisco, Tokyo, and Paris?"}]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["location"],
                },
            },
        }
    ]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
        tools=tools,
        tool_choice="auto",  # auto is default, but we'll be explicit
    )
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    # Step 2: check if the model wanted to call a function
    if tool_calls:
        # Step 3: call the function
        # Note: the JSON response may not always be valid; be sure to handle errors
        available_functions = {
            "get_current_weather": get_current_weather,
        }  # only one function in this example, but you can have multiple
        messages.append(response_message)  # extend conversation with assistant's reply
        # Step 4: send the info for each function call and function response to the model
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_to_call = available_functions[function_name]
            function_args = json.loads(tool_call.function.arguments)
            function_response = function_to_call(
                location=function_args.get("location"),
                unit=function_args.get("unit"),
            )
            messages.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": function_response,
                }
            )  # extend conversation with function response
        second_response = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=messages,
        )  # get a new response from the model where it can see the function response
        return second_response

print(run_conversation())
// user asks a question
{"role": "user", "content": "What's the weather like today"}
// GPT asks for clarification
{
  "role": "assistant",
  "content": "Sure, I can help you with that. Could you please tell me the city and state you are in or the location you want to know the weather for?"
}
// user answers with the location
{"role": "user", "content": "I'm in Glasgow, Scotland."}
// GPT detects which function should be called and returns the arguments based on the question
{
  "role": "assistant",
  "content": null,
  "tool_calls": [{"id": "call_o7uyztQLeVIoRdjcDkDJY3ni",
    "type": "function",
    "function": {"name": "get_current_weather",
      "arguments": "{\n  \"location\": \"Glasgow, Scotland\",\n  \"unit\": \"celsius\"\n}"}}]
}
// call the weather API, which returns a temperature of 22
// after calling the third-party data source, call GPT again with the third-party answer
{
  "tool_call_id": "call_o7uyztQLeVIoRdjcDkDJY3ni",
  "role": "tool",
  "name": "get_current_weather",
  "content": "{\"temperature\": \"22\", \"unit\": \"celsius\", \"description\": \"Sunny\"}"
}
// GPT returns the final answer
{
  "role": "assistant",
  "content": "The weather in Glasgow is currently sunny with a temperature of 22 degrees Celsius."
}
Langchain
LangChain is a framework that facilitates the creation of applications using language models.
It provides components (e.g. LLM models and embeddings) that allow non-AI experts to integrate existing AI language models into their applications.
It makes it easy for developers to build complex chains from these components (see the sketch after the list of link examples below).
Chains are the fundamental principle that holds various AI components in LangChain to provide context-aware responses. A chain is a series of automated actions from the user's query to the model's output. For example, developers can use a chain for:
Connecting to different data sources.
Generating unique content.
Translating multiple languages.
Answering user queries.
Chains are made of links. Each action that developers string together to form a chained sequence is called a link. With links, developers can divide complex tasks into multiple, smaller tasks. Examples of links include:
Formatting user input.
Sending a query to an LLM.
Retrieving data from cloud storage.
Translating from one language to another.
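A minimal sketch of a chain whose links are a prompt template, an LLM call, and an output parser, assuming the langchain-openai and langchain-core packages and an OPENAI_API_KEY in the environment; the model name and the translation task are placeholders:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Link 1: format the user input into a prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that translates text to French."),
    ("user", "{text}"),
])

# Link 2: send the formatted prompt to an LLM (model name is an assumption)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Link 3: parse the model's message into a plain string
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"text": "Good morning, how are you?"}))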
Customization Methodology
Fine-tuning
Fine-tuning entails techniques to further train a model whose weights have already been updated through prior training. Using the base model’s previous knowledge as a starting point, fine-tuning tailors the model by training it on a smaller, task-specific dataset.
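A minimal sketch of starting a fine-tuning job with the OpenAI Python client; the training file path and base model name are placeholders, not values from this document:
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of example conversations (path is a placeholder)
training_file = client.files.create(
    file=open("my_training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on top of a base model (model name is an assumption)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)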
Prompt engineering
Prompt engineering adapts the model's behavior by carefully crafting the instructions, examples, and context supplied in the prompt itself, without changing the model's weights.
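A minimal sketch of a few-shot prompt, where the system message and the in-context examples (all hypothetical, as is the model name) steer the model's output format:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # model name is an assumption
    messages=[
        {"role": "system", "content": "You classify customer feedback as positive, neutral, or negative. Reply with one word."},
        # Few-shot examples showing the expected format
        {"role": "user", "content": "The checkout process was quick and painless."},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "The package arrived two weeks late and damaged."},
        {"role": "assistant", "content": "negative"},
        # The actual input to classify
        {"role": "user", "content": "Delivery was on time, nothing special."},
    ],
    temperature=0,
)
print(response.choices[0].message.content)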
Retrieval-Augmented Generation (RAG)
Firstly, fetch the related content, e.g. via a function call or a similarity search based on embeddings (the retrieval step)
After that, add the retrieved result to the prompt (the augmentation step)
Finally, the answer is returned from the model (the generation step)
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
Retrieval-augmented generation (RAG) is an AI framework for improving the quality of responses by grounding the model on external sources of knowledge to supplement the LLM’s internal representation of information.
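Putting the pieces together, a minimal RAG sketch: embed a small document set, retrieve the most related chunk with cosine similarity, and add it to the prompt before calling the model. The documents, model names, and question below are made up for illustration:
from openai import OpenAI
import math

client = OpenAI()

documents = [
    "Our office is open Monday to Friday, 9am to 6pm.",
    "Refunds are processed within 5 business days of the return being received.",
    "Support can be reached at support@example.com or via live chat.",
]

def embed(text):
    # embedding model name is an assumption
    return client.embeddings.create(model="text-embedding-ada-002", input=text).data[0].embedding

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

question = "How long do refunds take?"
question_vec = embed(question)

# Retrieval: pick the document most related to the question
doc_vecs = [embed(d) for d in documents]
best_doc = max(zip(documents, doc_vecs), key=lambda pair: cosine_similarity(question_vec, pair[1]))[0]

# Augmentation: ground the prompt on the retrieved content
messages = [
    {"role": "system", "content": "Answer using only the provided context."},
    {"role": "user", "content": f"Context: {best_doc}\n\nQuestion: {question}"},
]

# Generation: the model answers based on the retrieved context
response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)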