The Token Toll of Reasoning: How Context Window Limits Impact LLMs

Large Language Models (LLMs) have transformed how we process and interact with information. However, their capabilities are bounded by hard constraints, one of the most significant being the context window: the amount of text the model can consider at once. While this limitation is usually discussed in terms of input length, the complexity of the task itself plays a critical role in how effectively the model uses its context window.

Initially, LLMs like GPT impressed us with their ability to summarize large sets of complex documents. However, as we came to rely on them more, our needs shifted toward much deeper reasoning and analysis of that same data.

In this blog, we’ll explore how reasoning tasks, such as comparing and contrasting multiple documents, can stretch the limits of even a large context window. We’ll also illustrate why a 1-million-token context window might suffice for one scenario but fall short in another. Along the way, we’ll provide detailed examples, delve into the “lost in the middle” problem, and estimate how task complexity impacts token usage.

Understanding Context Windows and Task Complexity

The context window of an LLM determines how much text it can “see” at once. For tasks like summarizing a single long document, the model processes one continuous block of text. However, when the task involves reasoning—such as comparing and contrasting multiple documents—the demands on the context window increase significantly.

Document Sizes in Megabytes and Tokens

To provide a tangible sense of scale, let’s consider the relationship between document size in megabytes (MB) and tokens. On average:

  • 1 MB of plain UTF-8 English text contains roughly 1,000,000 characters (about one byte per character).
  • Assuming an average word length of 5 characters plus a space, this translates to roughly 165,000 words.
  • Since LLMs tokenize text into sub-word units (tokens), 1 MB of text typically corresponds to 200,000–250,000 tokens, depending on the language and content; a common heuristic is about 4 characters per token.

With this in mind:

  • A 5 MB document would contain roughly 1–1.25 million tokens.
  • Ten 500 KB documents (totaling 5 MB) would likewise contain roughly 1–1.25 million tokens.
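
This arithmetic is easy to sanity-check in code. Below is a minimal sketch in Python; the 4-characters-per-token ratio is a rough heuristic rather than an exact figure, and for real counts you would run an actual tokenizer (the tiktoken call in the trailing comment is one example).

    # Back-of-the-envelope token estimate from document size.
    # Assumes plain UTF-8 English text (~1 byte per character) and the
    # common heuristic of ~4 characters per token; real ratios vary by
    # tokenizer and content.

    CHARS_PER_MB = 1_000_000
    CHARS_PER_TOKEN = 4

    def estimate_tokens(size_mb: float) -> int:
        """Estimate the token count of a plain-text document from its size in MB."""
        return int(size_mb * CHARS_PER_MB / CHARS_PER_TOKEN)

    print(f"{estimate_tokens(5.0):,}")       # one 5 MB document    -> 1,250,000
    print(f"{10 * estimate_tokens(0.5):,}")  # ten 500 KB documents -> 1,250,000

    # For exact counts, use a real tokenizer, e.g. OpenAI's tiktoken:
    #   import tiktoken
    #   n_tokens = len(tiktoken.get_encoding("cl100k_base").encode(text))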

Comparing Two Scenarios: Summarization vs. Compare and Contrast

Scenario 1: Summarizing a 5 MB Document

Imagine a single document that is 5 MB in size, containing on the order of 1 million tokens. The task is to summarize it. Here’s what happens:

  • The model processes one continuous block of text.
  • Summarization involves identifying and condensing key information, a task that primarily requires abstraction and extraction.
  • While the “lost in the middle” problem might affect the quality of the summary (if key information is buried in the middle), the task is relatively straightforward.

A 1-million-token context window is likely sufficient for this task, as the model can process the entire document in one pass.
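
Structurally, this is a single prompt wrapped around the whole document. Here’s a minimal sketch of that shape; call_llm is a placeholder stub for whatever long-context model and SDK you actually use, not a real library function.

    def call_llm(prompt: str) -> str:
        """Placeholder for a long-context model call via your provider's SDK."""
        raise NotImplementedError("wire this up to your LLM provider")

    def summarize_document(document_text: str) -> str:
        """Single-pass summarization: the entire document goes into one prompt."""
        prompt = (
            "Summarize the following document, preserving its key points:\n\n"
            + document_text
        )
        return call_llm(prompt)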

Scenario 2: Comparing and Contrasting Ten 500 KB Documents

Now, consider ten documents, each 500 KB in size, with a total length of 5 MB (again on the order of 1 million tokens). The task is to compare and contrast these documents. Here’s what happens:

  • The model must process multiple smaller documents within the same context window.
  • The task involves several higher-order reasoning steps (sketched in code below):
    1. Retrieve and Understand: Read and comprehend each document individually.
    2. Relate and Map: Identify corresponding points, themes, or data across documents.
    3. Analyze: Determine similarities and differences between these points.
    4. Synthesize: Combine findings into a coherent and structured output.

Even though the total input length is the same as in Scenario 1, the reasoning complexity is much higher. The model needs to “juggle” information across multiple documents, which can exacerbate the “lost in the middle” problem.
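
One way to make those four steps concrete is a staged pipeline that produces intermediate artifacts (per-document summaries, a theme map, an analysis) before the final write-up. The sketch below reuses the call_llm placeholder from the summarization example; it is one plausible decomposition, not a definitive implementation, but it shows where the extra tokens come from.

    def compare_documents(documents: list[str]) -> str:
        """Compare-and-contrast pipeline mirroring the four reasoning steps above.

        Every stage emits intermediate text, so total token usage grows well
        beyond the raw input size.
        """
        # 1. Retrieve and Understand: condense each document individually.
        summaries = [
            call_llm("Summarize the key claims of this document:\n\n" + doc)
            for doc in documents
        ]

        # 2. Relate and Map: find themes that recur across the summaries.
        joined = "\n\n---\n\n".join(
            f"Document {i + 1}:\n{s}" for i, s in enumerate(summaries)
        )
        themes = call_llm("List the themes these documents share:\n\n" + joined)

        # 3. Analyze: similarities and differences, theme by theme.
        analysis = call_llm(
            "For each theme below, describe how the documents agree and differ.\n\n"
            + f"Themes:\n{themes}\n\nSummaries:\n{joined}"
        )

        # 4. Synthesize: one coherent, structured comparison.
        return call_llm(
            "Write a structured compare-and-contrast report from this analysis:\n\n"
            + analysis
        )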

Why the “Lost in the Middle” Problem Is Worse in Scenario 2

The “lost in the middle” problem refers to the tendency of LLMs to pay less attention to information located in the middle of a long context window. This issue is particularly problematic in tasks requiring reasoning, such as compare and contrast.

In Summarization (Scenario 1):

  • The model processes one continuous block of text.
  • If key information is buried in the middle, the model might overlook it, leading to a subpar summary.
  • However, the task is linear: the model moves through the text sequentially, and the output is a condensed version of the input.

In Compare and Contrast (Scenario 2):

  • The model must process multiple documents within the same context window.
  • Key comparative details might be:
    • Located in the middle sections of individual documents.
    • Spread across the middle of the combined context (e.g., Document 3 and Document 7).
  • The model must actively recall and relate information from different parts of the context window. This requires “mental juggling,” which is harder when middle sections receive less attention.
  • The segmented nature of the input (multiple documents) further complicates the task, as the model must switch between contexts while maintaining coherence.
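
To make the layout concrete, consider where each document physically lands when ten of them are concatenated into one prompt. A quick sketch, pure arithmetic with no model call, assuming roughly 100,000 tokens per document:

    # Where does each document sit inside a combined ~1M-token context?
    TOKENS_PER_DOC = 100_000   # ten 500 KB documents at ~200k tokens per MB
    NUM_DOCS = 10
    total = TOKENS_PER_DOC * NUM_DOCS

    for i in range(NUM_DOCS):
        start = i * TOKENS_PER_DOC
        center = (start + TOKENS_PER_DOC / 2) / total  # relative position, 0-1
        print(f"Document {i + 1}: tokens {start:,}-{start + TOKENS_PER_DOC:,} "
              f"(center at {center:.0%} of the context)")

Documents 3 through 7 all land in the middle half of the window, precisely the region where attention tends to be weakest, so comparative details drawn from them are the most likely to be dropped.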

Estimating Token Usage in Scenario 2

In Scenario 2, the increased reasoning complexity leads to higher token usage. Here’s why:

  • The model generates intermediate “thoughts” or steps during processing, which consume additional tokens.
  • For example, when comparing and contrasting, the model might internally generate:
    • Summaries of each document.
    • Lists of similarities and differences.
    • Structured outlines for the final output.

Research suggests that tasks requiring detailed reasoning can increase token usage by 20–50% beyond the input size. For a 1-million-token input, this means:

  • 1.2–1.5 million tokens might be required to complete the task.

This exceeds the 1-million-token context window, highlighting the limitations of even large context windows for complex reasoning tasks.
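
The budget check itself is simple arithmetic. A minimal sketch, treating the 20–50% overhead as an assumed range rather than a measured constant:

    # Does the task fit? Compare input-plus-overhead against the window.
    CONTEXT_WINDOW = 1_000_000
    INPUT_TOKENS = 1_000_000

    for overhead in (0.20, 0.50):  # assumed reasoning overhead, per the estimate above
        required = int(INPUT_TOKENS * (1 + overhead))
        verdict = "fits" if required <= CONTEXT_WINDOW else "exceeds the window"
        print(f"{overhead:.0%} overhead -> {required:,} tokens ({verdict})")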

Conclusion: Beyond Token Limits

The size of the context window is not just about how much text an LLM can process—it’s also about how the model uses that text to perform complex reasoning tasks. As we’ve seen, tasks like comparing and contrasting multiple documents can stretch the limits of even a large context window, highlighting the importance of task complexity in determining token usage.

Key Takeaways:

  1. Document Size Matters: A 5 MB document (roughly 1 million tokens) can be summarized within a 1-million-token context window, but comparing and contrasting ten 500 KB documents (the same total size) might exceed this limit.
  2. Task Complexity Drives Token Usage: Reasoning tasks require more intermediate steps, increasing token usage by 20–50%.
  3. The “Lost in the Middle” Problem Is Task-Dependent: It’s more pronounced in tasks requiring the model to recall and relate information across multiple documents.

At Awarity, we realized early on that understanding these nuances is crucial to designing effective applications of LLMs. By tailoring tasks to the strengths and limitations of existing LLMs, we have demonstrated the ability to accurately reason over private document corpora exceeding 400 million tokens. We believe our Elastic Context Window effectively solves the “lost in the middle” problem.

For more about how we can help your team more accurately reason over large sets of private data, contact us at awarity.ai.
