REF: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_stream_completions.ipynb
How to stream completions
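By default, when you request a completion from OpenAI, the entire completion is generated before being sent back in a single response.
If you're generating long completions, waiting for the response can take many seconds.
To get responses sooner, you can "stream" the completion as it's being generated. This lets you start printing or processing the beginning of the completion before the full completion is finished.
To stream completions, set stream=True when calling the chat completions or completions endpoints. This returns an object that streams back the response as data-only server-sent events. Extract chunks from the delta field rather than the message field.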
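To make "data-only server-sent events" more concrete, here is a minimal sketch of how such a stream could be parsed by hand. The sample payload is hand-written for illustration rather than captured from the API, parse_sse_lines is a hypothetical helper, and the sketch assumes each event is a single `data:` line with the stream terminated by `data: [DONE]`. In practice the openai Python library does this parsing for you when you iterate over the response.

```python
import json

# Hand-written sample of a data-only SSE stream (illustrative, not real API output).
SAMPLE_STREAM = (
    'data: {"choices": [{"delta": {"role": "assistant"}, "index": 0, "finish_reason": null}]}\n'
    '\n'
    'data: {"choices": [{"delta": {"content": "2"}, "index": 0, "finish_reason": null}]}\n'
    '\n'
    'data: {"choices": [{"delta": {}, "index": 0, "finish_reason": "stop"}]}\n'
    '\n'
    'data: [DONE]\n'
    '\n'
)


def parse_sse_lines(raw):
    """Yield one decoded JSON chunk per `data:` line (hypothetical helper)."""
    for line in raw.splitlines():
        if not line.startswith("data:"):
            continue  # skip blank separator lines and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        yield json.loads(payload)


chunks = list(parse_sse_lines(SAMPLE_STREAM))
deltas = [chunk["choices"][0]["delta"] for chunk in chunks]
print(deltas)
```

Each event carries only a `data:` field, which is why the stream is described as data-only.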
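Downsides
Note that using stream=True in a production application makes it harder to moderate the content of completions, because partial completions can be harder to evaluate. This has implications for approved usage.
Another small drawback of streaming responses is that the response no longer includes the usage field telling you how many tokens were consumed. After receiving and combining all of the responses, you can calculate this yourself using tiktoken.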
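As a rough sketch of that idea, the snippet below joins the streamed content fragments and counts tokens with an encoder. count_completion_tokens and the sample chunks are hypothetical and hand-written for illustration. It assumes tiktoken is installed and can load the encoding for gpt-3.5-turbo. If not, it falls back to whitespace splitting, which is only a crude stand-in. The count covers the content text only, so it may not exactly match the completion_tokens value a non-streaming response would report.

```python
def count_completion_tokens(chunks, encode):
    """Join streamed content fragments and count tokens with `encode`."""
    text = "".join(
        chunk["choices"][0]["delta"].get("content", "") for chunk in chunks
    )
    return len(encode(text))


# Hand-written chunks for illustration, trimmed to the delta field.
chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "1, 2,"}}]},
    {"choices": [{"delta": {"content": " 3"}}]},
    {"choices": [{"delta": {}}]},
]

try:
    import tiktoken  # third-party: pip install tiktoken
    encode = tiktoken.encoding_for_model("gpt-3.5-turbo").encode
except Exception:
    # Fallback when tiktoken or its encoding files are unavailable:
    # whitespace splitting is only a crude stand-in, not a real token count.
    encode = str.split

print(count_completion_tokens(chunks, encode))
```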
Example code
Below, this notebook shows:
- What a typical chat completion response looks like
- What a streaming chat completion response looks like
- How much time is saved by streaming a chat completion
- How to stream non-chat completions (used by older models like text-davinci-003)
# imports
import openai # for OpenAI API calls
import time # for measuring time duration of API calls
1. What a typical chat completion response looks like
With a typical ChatCompletions API call, the response is first computed and then returned all at once.
# Example of an OpenAI ChatCompletion request
# https://platform.openai.com/docs/guides/chat
# record the time before the request is sent
start_time = time.time()
# send a ChatCompletion request to count to 100
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    temperature=0,
)
# calculate the time it took to receive the response
response_time = time.time() - start_time
# print the time delay and text received
print(f"Full response received {response_time:.2f} seconds after request")
print(f"Full response received:\n{response}")
Full response received 3.03 seconds after request
Full response received:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "\n\n1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100.",
"role": "assistant"
}
}
],
"created": 1677825456,
"id": "chatcmpl-6ptKqrhgRoVchm58Bby0UvJzq2ZuQ",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion",
"usage": {
"completion_tokens": 301,
"prompt_tokens": 36,
"total_tokens": 337
}
}
The reply can be extracted with response['choices'][0]['message'].
The content of the reply can be extracted with response['choices'][0]['message']['content'].
reply = response['choices'][0]['message']
print(f"Extracted reply: \n{reply}")
reply_content = response['choices'][0]['message']['content']
print(f"Extracted content: \n{reply_content}")
Extracted reply:
{
"content": "\n\n1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100.",
"role": "assistant"
}
Extracted content:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100.
2. How to stream a chat completion
With a streaming API call, the response is sent back incrementally in chunks via an event stream. In Python, you can iterate over these events with a for loop.
Let's see what it looks like:
# Example of an OpenAI ChatCompletion request with stream=True
# https://platform.openai.com/docs/guides/chat
# a ChatCompletion request
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': "What's 1+1? Answer in one word."}
    ],
    temperature=0,
    stream=True  # this time, we set stream=True
)
# iterate over the stream of events, printing each chunk as it arrives
for chunk in response:
    print(chunk)
{
"choices": [
{
"delta": {
"role": "assistant"
},
"finish_reason": null,
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
{
"choices": [
{
"delta": {
"content": "\n\n"
},
"finish_reason": null,
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
{
"choices": [
{
"delta": {
"content": "2"
},
"finish_reason": null,
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
{
"choices": [
{
"delta": {},
"finish_reason": "stop",
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
As you can see above, streaming responses have a delta field rather than a message field. delta can hold things like:
- a role token (e.g., {"role": "assistant"})
- a content token (e.g., {"content": "\n\n"})
- nothing (e.g., {}), when the stream is over
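To illustrate how these delta fragments fit together, here is a minimal sketch that folds them back into a single message. The chunks are plain dicts mirroring the streamed output shown above, trimmed to the fields used here, and merge_deltas is a hypothetical helper rather than part of the openai library.

```python
# Chunks trimmed to the fields used here, mirroring the streamed output shown above.
chunks = [
    {"choices": [{"delta": {"role": "assistant"}, "finish_reason": None, "index": 0}]},
    {"choices": [{"delta": {"content": "\n\n"}, "finish_reason": None, "index": 0}]},
    {"choices": [{"delta": {"content": "2"}, "finish_reason": None, "index": 0}]},
    {"choices": [{"delta": {}, "finish_reason": "stop", "index": 0}]},
]


def merge_deltas(chunks):
    """Fold delta fragments into a single message dict (hypothetical helper)."""
    message = {"role": None, "content": ""}
    finish_reason = None
    for chunk in chunks:
        choice = chunk["choices"][0]
        delta = choice["delta"]
        if "role" in delta:
            message["role"] = delta["role"]  # role arrives once, in the first chunk
        message["content"] += delta.get("content", "")  # append any content fragment
        if choice["finish_reason"] is not None:
            finish_reason = choice["finish_reason"]  # set on the final, empty delta
    return message, finish_reason


message, finish_reason = merge_deltas(chunks)
print(message)
print(finish_reason)
```

The merged result has the same role and content keys as the message field of a non-streaming response, so downstream code that expects a full message can consume it unchanged.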