Scale client-side summarization in small context windows

Published: March 12, 2025


The Summarizer API helps you generate summaries of information in various lengths and formats. Use it with Gemini Nano in Chrome to perform client-side inference, and concisely explain complicated or long texts.

When inference is performed client-side, you can work with data locally, which keeps sensitive data on the device and can offer availability at scale. However, the context window is much smaller than with server-side models, which means very large documents can be challenging to summarize. To solve this problem, you can use the summary of summaries technique.

What is summary of summaries?

To use the summary of summaries technique, split the input content at key points, then summarize each part independently. You can concatenate the outputs from each part, then summarize this concatenated text into one final summary.

For example, if a document is split into three parts, each part is summarized independently. Those three summaries are then concatenated and summarized again to produce the final result.
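If it helps to picture the flow in code, here's a conceptual sketch. The splitAtKeyPoints() and summarize() helpers are hypothetical placeholders for the splitting and summarization steps covered in the rest of this post.

```js
// Conceptual outline of summary of summaries (hypothetical helpers).
const parts = splitAtKeyPoints(fullText); // for example, three sections

// 1. Summarize each part independently.
const partSummaries = [];
for (const part of parts) {
  partSummaries.push(await summarize(part));
}

// 2. Concatenate the partial summaries, then summarize them into one final summary.
const finalSummary = await summarize(partSummaries.join('\n'));
```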

Thoughtfully split your content

It's important to consider how you'll split up a large piece of text, as splitting in different locations can lead to radically different outputs from Gemini Nano or other LLMs. Ideally, texts should be split when there's a change of topic, such as a new section of an article or a new paragraph. Avoid splitting the text in the middle of a word or sentence, which means you can't use a character count as your only split guideline.

There are many ways to do this without manual effort. In the following example, we use the RecursiveCharacterTextSplitter from LangChain.js, which balances performance and output quality. This should work for most workloads.

When creating a new instance, there are two key parameters:

  • chunkSize is the maximum number of characters allowed in each split.
  • chunkOverlap is the number of characters to overlap between two consecutive splits. This ensures that each chunk has some of the context from the previous chunk.

Split the text with splitText(), which returns an array of strings, one for each chunk.
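The setup could look something like the following sketch. The chunkOverlap value and the fullText variable are illustrative, and the import path assumes the @langchain/textsplitters package (older LangChain.js versions export the splitter from langchain/text_splitter).

```js
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

// Prefer splitting on paragraphs, then lines, then words, so chunks don't
// break in the middle of a sentence or word.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 3000,   // maximum characters per split
  chunkOverlap: 200, // illustrative overlap between consecutive splits
});

// splitText() returns an array of strings, one for each chunk.
const splits = await splitter.splitText(fullText);
```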

Most LLMs have their context window expressed as a number of tokens, rather than a number of characters. On average, a token contains 4 characters, so you can estimate the number of tokens used by an input by dividing the number of characters by 4.

In our example, the chunkSize is 3000 characters, which is approximately 750 tokens.
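If you want a quick, rough check in code (this is an approximation, not a real tokenizer):

```js
// Rough token estimate: about 4 characters per token on average.
const estimateTokens = (text) => Math.ceil(text.length / 4);

estimateTokens('x'.repeat(3000)); // 750, roughly the token budget of one chunk
```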

Generate summaries for each split

Once you've set up how the content is split, you can generate summaries for each part with the Summarizer API.

Create an instance of the summarizer with the create() function. To keep as much context as possible, we've set the format parameter to plain-text, the type to tl;dr, and the length to long.
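As a minimal sketch, assuming the summarizer is exposed on the global Summarizer interface (origin trial builds may expose it under self.ai.summarizer instead, and the exact spelling of the tl;dr type string can vary by Chrome version):

```js
// Create a summarizer that keeps as much context as possible.
const summarizer = await Summarizer.create({
  format: 'plain-text',
  type: 'tl;dr',
  length: 'long',
});
```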

Then, generate the summary for each split created by the RecursiveCharacterTextSplitter and concatenate the results into a new string. We separate each summary with a new line to clearly identify the summary for each part.
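Here's a sketch of that loop, reusing the splits array and summarizer instance from the previous snippets:

```js
// Summarize each chunk with the Summarizer API.
const summaries = [];
for (const split of splits) {
  summaries.push(await summarizer.summarize(split));
}

// Join the partial summaries with a new line, then summarize the result.
const concatenatedSummaries = summaries.join('\n');
const finalSummary = await summarizer.summarize(concatenatedSummaries);
```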

While this new line doesn't matter when executing this loop just once, it's useful for determining how much each summary adds to the token count of the final summarization input. In most cases, this solution should work for medium and long content.

Recursive summary of summaries

When you've got exceedingly long text, the concatenated summaries may be longer than the available context window, causing the final summarization to fail. To address this, you can recursively summarize the summaries.

If your summary of summaries is still too long, repeat the process. In theory, you could repeat it as many times as needed until the summary reaches an appropriate length.

We still collect the initial splits generated by the RecursiveCharacterTextSplitter. Then, in the recursiveSummarizer() function, we loop the summarization process based on the character length of the concatenated summaries. If the combined character length of the summaries exceeds 3000 characters, they're concatenated into fullSummaries. If the limit isn't reached yet, each summary is saved in partialSummaries.

Once all of the summaries are generated, the final partial summaries are added to the full summary. If there's just one summary in fullSummaries, no additional recursion is needed, and the function returns the final summary. If there's more than one summary, the function repeats and continues summarizing the partial summaries.
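Here's a sketch of what that logic could look like. It isn't the demo's exact source; the 3000 character threshold mirrors the chunkSize used for the splits.

```js
const CHARACTER_LIMIT = 3000;

async function recursiveSummarizer(summarizer, splits) {
  const fullSummaries = [];
  let partialSummaries = [];

  for (const split of splits) {
    const summary = await summarizer.summarize(split);

    // Once the batch of partial summaries would exceed the limit, flush it.
    if (partialSummaries.join('\n').length + summary.length > CHARACTER_LIMIT) {
      fullSummaries.push(partialSummaries.join('\n'));
      partialSummaries = [];
    }
    partialSummaries.push(summary);
  }

  // Add the final batch of partial summaries to the full summaries.
  if (partialSummaries.length > 0) {
    fullSummaries.push(partialSummaries.join('\n'));
  }

  // A single batch fits in the context window: summarize it once more and
  // return the final summary.
  if (fullSummaries.length === 1) {
    return summarizer.summarize(fullSummaries[0]);
  }

  // Otherwise, recurse and treat the batches as the new input splits.
  return recursiveSummarizer(summarizer, fullSummaries);
}
```

Call it with the splits from the RecursiveCharacterTextSplitter and the summarizer instance to get the final summary.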

We tested this solution with the Internet Relay Chat (IRC) RFC, which has a whopping 110,030 characters across 17,560 words. The Summarizer API provided the following summary:

Internet Relay Chat (IRC) is a way to communicate online in real-time using text messages. You can chat in channels or send private messages, and you can use commands to control the chat and interact with the server. It's like a chat room on the internet where you can type and see others' messages instantly.

That's pretty effective! And, it's only 309 characters.

Limitations

The summary of summaries technique helps you operate within a client-side model's context window. While there are many benefits to client-side AI, you may encounter the following:

  • Less accurate summaries: With recursion, the summarization process can repeat many times, and each pass moves the summary farther from the original text. This means the model may generate a final summary that is too shallow to be useful.
  • Slower performance: Each summary takes time to generate. With potentially many rounds of summarization for larger texts, this approach may take several minutes to finish.

We have a summarizer demo available, and you can view the full source code.

Share your feedback

Try to use the summary of summaries technique with different lengths of input text, different split sizes, and different overlap lengths, to determine what works best for your use cases.

Start testing the Summarizer API now by joining the origin trial and share your feedback. Your input can directly impact how we build and implement future versions of this API, and all built-in AI APIs.