Published: January 21, 2025
A streamed LLM response consists of data emitted incrementally and continuously. Streamed data looks different on the server than it does on the client.
From the server
To understand what a streamed response looks like, I prompted Gemini to tell me a long joke using the command-line tool curl. Consider the following call to the Gemini API. If you try it, be sure to replace {GOOGLE_API_KEY} in the URL with your Gemini API key.
$ curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:streamGenerateContent?alt=sse&key={GOOGLE_API_KEY}" \
-H 'Content-Type: application/json' \
--no-buffer \
-d '{ "contents":[{"parts":[{"text": "Tell me a long T-rex joke, please."}]}]}'
This request logs the following (truncated) output in event stream format. Each line begins with data: followed by the message payload. The concrete format isn't actually important; what matters are the chunks of text.
data: {"candidates":[{"content": {"parts": [{"text": "A T-Rex"}],"role": "model"},
"finishReason": "STOP","index": 0,"safetyRatings": [{"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT","probability": "NEGLIGIBLE"},{"category": "HARM_CATEGORY_HATE_SPEECH","probability": "NEGLIGIBLE"},{"category": "HARM_CATEGORY_HARASSMENT","probability": "NEGLIGIBLE"},{"category": "HARM_CATEGORY_DANGEROUS_CONTENT","probability": "NEGLIGIBLE"}]}],
"usageMetadata": {"promptTokenCount": 11,"candidatesTokenCount": 4,"totalTokenCount": 15}}
data: {"candidates": [{"content": {"parts": [{ "text": " walks into a bar and orders a drink. As he sits there, he notices a" }], "role": "model"},
"finishReason": "STOP","index": 0,"safetyRatings": [{"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT","probability": "NEGLIGIBLE"},{"category": "HARM_CATEGORY_HATE_SPEECH","probability": "NEGLIGIBLE"},{"category": "HARM_CATEGORY_HARASSMENT","probability": "NEGLIGIBLE"},{"category": "HARM_CATEGORY_DANGEROUS_CONTENT","probability": "NEGLIGIBLE"}]}],
"usageMetadata": {"promptTokenCount": 11,"candidatesTokenCount": 21,"totalTokenCount": 32}}
The first payload is JSON. Take a closer look at the highlighted candidates[0].content.parts[0].text:
{
  "candidates": [
    {
      "content": {
        "parts": [
          {
            "text": "A T-Rex"
          }
        ],
        "role": "model"
      },
      "finishReason": "STOP",
      "index": 0,
      "safetyRatings": [
        {
          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HATE_SPEECH",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HARASSMENT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
          "probability": "NEGLIGIBLE"
        }
      ]
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 11,
    "candidatesTokenCount": 4,
    "totalTokenCount": 15
  }
}
That first text entry is the beginning of Gemini's response. When you extract more text entries, the response is newline-delimited. The following snippet shows multiple text entries, which together make up the final response from the model.
"A T-Rex"
" was walking through the prehistoric jungle when he came across a group of Triceratops. "
"\n\n\"Hey, Triceratops!\" the T-Rex roared. \"What are"
" you guys doing?\"\n\nThe Triceratops, a bit nervous, mumbled,
\"Just... just hanging out, you know? Relaxing.\"\n\n\"Well, you"
" guys look pretty relaxed,\" the T-Rex said, eyeing them with a sly grin.
\"Maybe you could give me a hand with something.\"\n\n\"A hand?\""
...
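For illustration, here's a minimal sketch of how you could do this extraction yourself in JavaScript with fetch(). It assumes the same endpoint and request body as the curl call above, a GOOGLE_API_KEY variable holding your key, and it does only rudimentary server-sent event parsing (splitting on blank lines and data: prefixes), so treat it as a starting point rather than a robust SSE client.
// Sketch: call the same streaming endpoint and log each chunk's text.
// GOOGLE_API_KEY is assumed to hold your Gemini API key.
const response = await fetch(
  `https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:streamGenerateContent?alt=sse&key=${GOOGLE_API_KEY}`,
  {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      contents: [{ parts: [{ text: 'Tell me a long T-rex joke, please.' }] }],
    }),
  },
);

let buffer = '';
// Decode the byte stream to text and iterate over it as it arrives.
// Note: async iteration over a ReadableStream requires a recent browser or Node.js.
for await (const textChunk of response.body.pipeThrough(new TextDecoderStream())) {
  buffer += textChunk;
  // Server-sent events are separated by a blank line; keep any incomplete tail in the buffer.
  const events = buffer.split('\n\n');
  buffer = events.pop();
  for (const event of events) {
    if (!event.startsWith('data: ')) continue;
    const payload = JSON.parse(event.slice('data: '.length));
    console.log(payload.candidates[0].content.parts[0].text);
  }
}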
But what happens if, instead of asking for T-rex jokes, you ask the model for something slightly more complex? For example, ask Gemini to come up with a JavaScript function to determine if a number is even or odd. This time, the text chunks look slightly different.
The output now contains Markdown, starting with a JavaScript code block. The following sample went through the same pre-processing steps as before, so only the text entries are shown.
"```javascript\nfunction"
" isEven(number) {\n // Check if the number is an integer.\n"
" if (Number.isInteger(number)) {\n // Use the modulo operator"
" (%) to check if the remainder after dividing by 2 is 0.\n return number % 2 === 0; \n } else {\n "
"// Return false if the number is not an integer.\n return false;\n }\n}\n\n// Example usage:\nconsole.log(isEven("
"4)); // Output: true\nconsole.log(isEven(7)); // Output: false\nconsole.log(isEven(3.5)); // Output: false\n```\n\n**Explanation:**\n\n1. **`isEven("
"number)` function:**\n - Takes a single argument `number` representing the number to be checked.\n - Checks if the `number` is an integer using `Number.isInteger()`.\n - If it's an"
...
To make matters more challenging, some of the marked-up items begin in one chunk and end in another, and some of the markup is nested. In the following example, the highlighted function name is split across two chunks: the first chunk ends with **`isEven( and the next begins with number)` function:**. Combined, the output is **`isEven(number)` function:**. This means that if you want to output formatted Markdown, you can't just process each chunk individually with a Markdown parser.
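A common way around this, sketched below, is to append each incoming chunk to a buffer and re-render the accumulated Markdown on every update instead of parsing chunks in isolation. The renderMarkdown() function and the #output element here are hypothetical stand-ins: use whatever sanitizing Markdown renderer and output element your app already has.
// Sketch: accumulate chunks and re-parse the whole buffer on each update,
// so markup that spans chunks (like **`isEven(` + `number)` function:**)
// reaches the parser as a single, complete token.
// renderMarkdown() is a hypothetical stand-in for a sanitizing Markdown renderer,
// and an element with id="output" is assumed to exist in the page.
let markdownSoFar = '';
const output = document.querySelector('#output');

function onChunk(textChunk) {
  markdownSoFar += textChunk;
  output.innerHTML = renderMarkdown(markdownSoFar);
}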
From the client
If you run models like Gemma on the client with a framework like MediaPipe LLM, streaming data comes through a callback function.
For example:
llmInference.generateResponse(
  inputPrompt,
  (chunk, done) => {
    console.log(chunk);
  });
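If you also need the complete response, you can concatenate the chunks yourself; the done flag signals the final callback. The following is a sketch based only on the callback signature shown above.
// Sketch: build up the full response from the streamed chunks.
let fullResponse = '';
llmInference.generateResponse(inputPrompt, (chunk, done) => {
  fullResponse += chunk;
  if (done) {
    // The model has finished; fullResponse now holds the entire answer.
    console.log(fullResponse);
  }
});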
With the Prompt API, you get streaming data as chunks by iterating over a ReadableStream.
const languageModel = await self.ai.languageModel.create();
const stream = languageModel.promptStreaming(inputPrompt);
for await (const chunk of stream) {
  console.log(chunk);
}
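If your environment doesn't support asynchronous iteration over a ReadableStream, you can consume the same stream with a reader instead. This sketch uses only the standard Streams API and nothing Prompt API-specific.
// Sketch: consume the stream with a reader instead of for await...of.
const reader = stream.getReader();
while (true) {
  const { value: chunk, done } = await reader.read();
  if (done) break;
  console.log(chunk);
}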
Next steps
Are you wondering how to performantly and securely render streamed data? Read our best practices for rendering LLM responses.