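How LLMs stream responses

Published: January 21, 2025

A streamed LLM response consists of data emitted incrementally and continuously. Streamed data looks different on the server and on the client.

From the server

To understand what a streamed response looks like, I used the command line tool curl to prompt Gemini to tell me a long joke. Consider the following call to the Gemini API. If you try it, be sure to replace {GOOGLE_API_KEY} in the URL with your Gemini API key.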

$ curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:streamGenerateContent?alt=sse&key={GOOGLE_API_KEY}" \
      -H 'Content-Type: application/json' \
      --no-buffer \
      -d '{ "contents":[{"parts":[{"text": "Tell me a long T-rex joke, please."}]}]}'

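This request logs the following (truncated) output in event stream format. Each line begins with data: followed by the message payload. The exact format doesn't really matter; what matters are the chunks of text.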

//
data: {"candidates":[{"content": {"parts": [{"text": "A T-Rex"}],"role": "model"},
  "finishReason": "STOP","index": 0,"safetyRatings": [{"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT","probability": "NEGLIGIBLE"},{"category": "HARM_CATEGORY_HATE_SPEECH","probability": "NEGLIGIBLE"},{"category": "HARM_CATEGORY_HARASSMENT","probability": "NEGLIGIBLE"},{"category": "HARM_CATEGORY_DANGEROUS_CONTENT","probability": "NEGLIGIBLE"}]}],
  "usageMetadata": {"promptTokenCount": 11,"candidatesTokenCount": 4,"totalTokenCount": 15}}

data: {"candidates": [{"content": {"parts": [{ "text": " walks into a bar and orders a drink. As he sits there, he notices a" }], "role": "model"},
  "finishReason": "STOP","index": 0,"safetyRatings": [{"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT","probability": "NEGLIGIBLE"},{"category": "HARM_CATEGORY_HATE_SPEECH","probability": "NEGLIGIBLE"},{"category": "HARM_CATEGORY_HARASSMENT","probability": "NEGLIGIBLE"},{"category": "HARM_CATEGORY_DANGEROUS_CONTENT","probability": "NEGLIGIBLE"}]}],
  "usageMetadata": {"promptTokenCount": 11,"candidatesTokenCount": 21,"totalTokenCount": 32}}

After you execute the command, the result chunks stream in.
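To make the event stream format more concrete, here is a minimal sketch of how the raw text could be turned into payload objects. The function name parseSsePayloads and the shortened sample events are assumptions for illustration, and the sketch assumes the complete text is already available; a real client would also have to handle events that are split across network reads.

```javascript
// Split raw event-stream text into events and parse each `data:` payload as JSON.
// parseSsePayloads is a hypothetical helper, not part of any Gemini SDK.
function parseSsePayloads(rawText) {
  return rawText
    .split(/\r?\n\r?\n/) // events are separated by a blank line
    .map((event) =>
      event
        .split(/\r?\n/)
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice("data:".length).trimStart())
        .join("\n")
    )
    .filter((data) => data.length > 0) // drop empty trailing events
    .map((data) => JSON.parse(data));
}

// Shortened sample events modeled on the output above (metadata omitted).
const sample =
  'data: {"candidates":[{"content":{"parts":[{"text":"A T-Rex"}],"role":"model"}}]}\n\n' +
  'data: {"candidates":[{"content":{"parts":[{"text":" walks into a bar"}],"role":"model"}}]}\n\n';

const payloads = parseSsePayloads(sample);
console.log(payloads.length); // 2
console.log(payloads[0].candidates[0].content.parts[0].text); // A T-Rex
```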

The first payload is JSON. Take a closer look at candidates[0].content.parts[0].text:

{
  "candidates": [
    {
      "content": {
        "parts": [
          {
            "text": "A T-Rex"
          }
        ],
        "role": "model"
      },
      "finishReason": "STOP",
      "index": 0,
      "safetyRatings": [
        {
          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HATE_SPEECH",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HARASSMENT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
          "probability": "NEGLIGIBLE"
        }
      ]
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 11,
    "candidatesTokenCount": 4,
    "totalTokenCount": 15
  }
}

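That first text entry is the beginning of Gemini's response. When you extract more text entries, the response is newline-delimited.

The following snippet shows multiple text entries, which together make up the model's final response.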

"A T-Rex"

" was walking through the prehistoric jungle when he came across a group of Triceratops. "

"\n\n\"Hey, Triceratops!\" the T-Rex roared. \"What are"

" you guys doing?\"\n\nThe Triceratops, a bit nervous, mumbled,
\"Just... just hanging out, you know? Relaxing.\"\n\n\"Well, you"

" guys look pretty relaxed,\" the T-Rex said, eyeing them with a sly grin.
\"Maybe you could give me a hand with something.\"\n\n\"A hand?\""

...
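The extraction step described above can be sketched as follows. This is only an illustration: extractText and samplePayloads are hypothetical names, and the payloads are trimmed to the fields that matter here.

```javascript
// Read the text of the first part of the first candidate, or "" if it is missing.
// extractText is a hypothetical helper for illustration.
function extractText(payload) {
  return payload?.candidates?.[0]?.content?.parts?.[0]?.text ?? "";
}

// Trimmed payloads modeled on the streamed response above.
const samplePayloads = [
  { candidates: [{ content: { parts: [{ text: "A T-Rex" }], role: "model" } }] },
  { candidates: [{ content: { parts: [{ text: " was walking through the prehistoric jungle" }], role: "model" } }] },
  { usageMetadata: { totalTokenCount: 32 } }, // hypothetical payload without any text
];

// Concatenating the entries in arrival order rebuilds the response so far.
const fullText = samplePayloads.map(extractText).join("");
console.log(fullText); // A T-Rex was walking through the prehistoric jungle
```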

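But what happens if, instead of asking for T-rex jokes, you ask the model for something slightly more complex? For example, ask Gemini to write a JavaScript function that determines whether a number is even or odd. The text chunks look slightly different.

The output now contains Markdown, starting with a JavaScript code block. The following sample went through the same preprocessing steps as before.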

"```javascript\nfunction"

" isEven(number) {\n  // Check if the number is an integer.\n"

"  if (Number.isInteger(number)) {\n  // Use the modulo operator"

" (%) to check if the remainder after dividing by 2 is 0.\n  return number % 2 === 0; \n  } else {\n  "

"// Return false if the number is not an integer.\n    return false;\n }\n}\n\n// Example usage:\nconsole.log(isEven("

"4)); // Output: true\nconsole.log(isEven(7)); // Output: false\nconsole.log(isEven(3.5)); // Output: false\n```\n\n**Explanation:**\n\n1. **`isEven("

"number)` function:**\n   - Takes a single argument `number` representing the number to be checked.\n   - Checks if the `number` is an integer using `Number.isInteger()`.\n   - If it's an"

...

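To make matters more challenging, some marked-up items begin in one chunk and end in another, and some of the markup is nested. In the preceding example, the function name in the explanation is split across two chunks: one ends with **`isEven( and the next begins with number)` function:**. Combined, the output is **`isEven(number)` function:**. This means that if you want to output formatted Markdown, you can't process each chunk individually with a Markdown parser.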
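A small sketch can illustrate why per-chunk parsing breaks. It uses a deliberately naive heuristic (counting ** and backtick markers) rather than a real Markdown parser, and both hasBalancedMarkers and createAccumulator are hypothetical helpers. One possible approach, shown here under those assumptions, is to accumulate the chunks and hand the full text received so far to the renderer.

```javascript
// Naive heuristic for illustration only (not a Markdown parser):
// text counts as "balanced" when it has an even number of ** and ` markers.
function hasBalancedMarkers(text) {
  const boldCount = text.split("**").length - 1;
  const tickCount = text.split("`").length - 1;
  return boldCount % 2 === 0 && tickCount % 2 === 0;
}

// Hypothetical helper: keeps every chunk seen so far and passes the
// accumulated text (not the single chunk) to the render callback.
function createAccumulator(render) {
  let buffer = "";
  return (chunk) => {
    buffer += chunk;
    render(buffer);
    return buffer;
  };
}

// The two chunk fragments from the example above.
const markdownChunks = ["1. **`isEven(", "number)` function:**"];
const rendered = [];
const push = createAccumulator((text) => rendered.push(text));
for (const chunk of markdownChunks) push(chunk);

console.log(markdownChunks.map(hasBalancedMarkers)); // [ false, false ]
console.log(hasBalancedMarkers(rendered[rendered.length - 1])); // true
```

Each chunk on its own has unclosed markup, while the accumulated text is complete.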

From the client

If you run models like Gemma on the client with a framework like MediaPipe LLM, streaming data comes through a callback function.

For example:

llmInference.generateResponse(
  inputPrompt,
  // Called once per partial result; `done` is true for the final chunk.
  (chunk, done) => {
    console.log(chunk);
  }
);
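To show how the callback shape can be used to build up the full response, here is a minimal sketch. The fakeLlmInference object is a hypothetical stand-in that only mimics the (chunk, done) callback signature from the snippet above; it is not the MediaPipe API, and collectResponse is an illustrative helper name.

```javascript
// Hypothetical stub that mimics the (chunk, done) callback shape.
// It is NOT the MediaPipe implementation; it just emits fixed chunks.
const fakeLlmInference = {
  generateResponse(prompt, onChunk) {
    const parts = ["A T-Rex", " walks into", " a bar."];
    parts.forEach((part, i) => onChunk(part, i === parts.length - 1));
  },
};

// Accumulate partial results and resolve once the callback reports `done`.
function collectResponse(llm, prompt) {
  return new Promise((resolve) => {
    let fullText = "";
    llm.generateResponse(prompt, (chunk, done) => {
      fullText += chunk;
      if (done) resolve(fullText);
    });
  });
}

collectResponse(fakeLlmInference, "Tell me a joke").then((text) => {
  console.log(text); // A T-Rex walks into a bar.
});
```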

With the Prompt API, you get streamed data in chunks by iterating over a ReadableStream.

const languageModel = await self.ai.languageModel.create();
const stream = languageModel.promptStreaming(inputPrompt);
for await (const chunk of stream) {
  console.log(chunk);
}
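The same iteration pattern can be wrapped in a small helper that keeps the text received so far. In this sketch, fakePromptStream is a hypothetical async generator standing in for the stream returned by promptStreaming(), and collectStream is an illustrative name. It assumes that each chunk carries only newly generated text; if an implementation delivers cumulative text instead, the helper would replace rather than append.

```javascript
// Hypothetical stand-in for the stream returned by promptStreaming().
async function* fakePromptStream() {
  yield "A T-Rex";
  yield " walks into";
  yield " a bar.";
}

// Concatenate every chunk from an async-iterable stream and report
// the accumulated text after each chunk arrives.
async function collectStream(stream, onUpdate = () => {}) {
  let fullText = "";
  for await (const chunk of stream) {
    fullText += chunk; // assumption: chunks are deltas, not cumulative text
    onUpdate(fullText);
  }
  return fullText;
}

collectStream(fakePromptStream(), (text) => console.log(text));
```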

Next steps

Are you now wondering how to render streamed data performantly and securely? Read the best practices to render LLM responses.