Speculative Tool Calling: Making LLM Agents Faster
This post summarizes work from my project in CSE-291A: Systems for LLMs and AI Agents at UC San Diego, under the guidance of Professor Yiying Zhang.
The Problem
LLM agents work in a loop: reason about what to do, call an external tool (web search, file reader, calculator, etc.), observe the result, repeat. The bottleneck isn't where you'd expect. Tool calls finish in milliseconds, but the model spends seconds at each step just deciding which tool to call. That thinking time dominates the entire pipeline.
What if a small, fast model could predict the next tool call while the big model is still thinking? If it guesses right, we skip the wait. If it guesses wrong, we discard the result and carry on normally.
This is the same idea behind speculative decoding, where a small draft model generates candidate tokens that a larger model verifies in a single pass. Correct guesses are free; wrong ones get regenerated. My project applies this one level up: instead of speculating on the next token, I speculate on the next tool call.
It works beautifully in some cases and falls apart in others. This post walks through the full picture.
Profiling the Bottleneck
First, I needed to confirm that thinking time really was the bottleneck. I profiled Co-Sight (a top GAIA leaderboard agent) and Open Deep Research (ODR, a LangGraph-based research agent) on the GAIA benchmark, which tests agents on real-world multi-step reasoning tasks across three difficulty levels (Level 1: quick search + math, Level 2: multi-tool chains, Level 3: deep open-ended research).
Co-Sight results (Gemini-2.5-Pro):
| GAIA Level | Time Thinking | Time Doing |
|---|---|---|
| Level 1 | 97.2% | 2.8% |
| Level 2 | 91.3% | 8.7% |
| Level 3 | 95.3% | 4.7% |
20x more time thinking than doing, even on the hardest tasks. ODR showed the same pattern (85-98% LLM time), with tool-only idle periods reaching 15% of runtime in some cases. That's pure latency that speculation could overlap.
Can a Smaller Model Predict Tool Calls?
I compared Gemini-Pro and Gemini-Flash-Lite on the same tasks, measuring both speed and a competency score (how close the model's tool-call path was to the ideal solution from GAIA's human annotations, minus penalties for redundant calls).
- Level 1: Flash-Lite thought 2.1x faster, finished 37% faster overall. Pro scored higher on competency (0.60 vs 0.47).
- Level 3: Flash-Lite thought 3.7x faster, finished 7.4x faster overall. Flash-Lite actually scored better on competency (0.33 vs 0.01) because Pro fell into repetitive search loops.
A smaller model won't always pick the ideal tool, but it's often close enough, and it's much faster.
How the System Works
Three components:
- Actor (or "verifier/target") Model: Strong model (e.g., GPT-5, Gemini-2.5-Pro) that reasons and makes the authoritative tool call.
- Speculator (or "draft") Model: Fast, cheap model (e.g., GPT-5-nano, Gemini-Flash-Lite) that predicts the next tool call from the same context.
- Tool Cache: Stores the speculator's pre-executed results.
(I'll use these terms interchangeably throughout the post.)
Both models receive the conversation history in parallel. The speculator finishes first, runs its predicted tool call, and caches the result. When the actor finishes, the system compares the two. Cache hit: reuse the stored result, skip the wait. Cache miss: run the actor's tool call normally, discard the speculator's result.
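In pseudocode, one step of this loop looks roughly like the sketch below. The callables (`call_actor`, `call_speculator`, `run_tool`, `verify`) are placeholders for whatever the surrounding framework provides, not the repo's actual API.

```python
import concurrent.futures

def speculative_step(context, call_actor, call_speculator, run_tool, verify):
    """One agent step: pre-execute the speculator's guess while the actor thinks.

    call_actor / call_speculator : context -> (tool_name, args)
    run_tool                     : (tool_name, args) -> result
    verify                       : (actor_call, speculator_call) -> bool
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        actor_future = pool.submit(call_actor, context)  # slow, authoritative
        spec_call = call_speculator(context)             # fast guess
        cached_result = run_tool(*spec_call)             # pre-execute while the actor thinks
        actor_call = actor_future.result()               # wait for the actor to finish

    if verify(actor_call, spec_call):
        return actor_call, cached_result                 # cache hit: tool latency hidden
    return actor_call, run_tool(*actor_call)             # cache miss: pay the full cost
```

On a hit the tool result is already waiting when the actor finishes; on a miss the speculator's result is simply dropped.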
Formal Setup
At each step $t$, let $C_t$ denote the current conversation context. The actor $M_a$ and speculator $M_s$ both receive $C_t$ and each produce a tool call (tool name $f$, arguments $a$):

$$(f_a, a_a) = M_a(C_t), \qquad (f_s, a_s) = M_s(C_t)$$

Since $M_s$ is smaller and faster, it finishes first and immediately pre-executes its predicted call, caching the result $r_s = \mathrm{exec}(f_s, a_s)$. When $M_a$ completes, a verification function $V\big((f_a, a_a), (f_s, a_s)\big) \in \{0, 1\}$ determines whether to reuse the cached result:

$$\text{result} = \begin{cases} r_s & \text{if } V = 1 \ \text{(cache hit)} \\ \mathrm{exec}(f_a, a_a) & \text{if } V = 0 \ \text{(cache miss)} \end{cases}$$

The latency at each step depends on whether we get a hit or a miss. Let $T_a$ be the actor's inference time, $T_s$ the speculator's inference time, and $T_{\mathrm{tool}}$ the tool execution time:

$$T_{\mathrm{step}} = \begin{cases} \max(T_a,\; T_s + T_{\mathrm{tool}}) & \text{hit} \\ T_a + T_{\mathrm{tool}} + \epsilon & \text{miss} \end{cases}$$

On a hit, tool execution is fully overlapped with the actor's thinking time, so the step completes in roughly $T_a$ alone (when $T_s + T_{\mathrm{tool}} \le T_a$). On a miss, we pay the normal sequential cost $T_a + T_{\mathrm{tool}}$ plus some overhead $\epsilon$ from running the speculator for nothing. Given a cache hit rate $h$ across a task, speculation provides a net speedup when:

$$h \cdot T_{\mathrm{tool}} > (1 - h) \cdot \epsilon$$
In other words, the expected latency saved from cache hits must exceed the expected overhead from cache misses. This is why speculation only helps when hit rates are high enough and tool execution is slow enough to offset the cost of wrong guesses.
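To get a feel for that trade-off, here's a back-of-the-envelope helper. The tool time and per-miss overhead below are made-up illustrative numbers, not measurements from my experiments.

```python
def expected_saving_per_step(hit_rate, tool_time, miss_overhead):
    """Expected latency saved per step: hits hide the tool time, misses add overhead."""
    return hit_rate * tool_time - (1 - hit_rate) * miss_overhead

# Example: a 2s web search call with 0.5s of per-miss overhead.
for h in (0.1, 0.3, 0.8):
    print(f"hit rate {h:.0%}: {expected_saving_per_step(h, 2.0, 0.5):+.2f}s per step")
# At a 10% hit rate the overhead wins (-0.25s); at 80% most of the tool time is reclaimed (+1.50s).
```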
Verification Strategies
The key design choice is how to implement the verification function $V$. I tried strict exact match (tool name + all arguments identical), tool-name-only match, and similarity match (embed descriptions and arguments, accept if cosine similarity exceeds a threshold, so search("climate change effects") and search("impact of climate change") count as equivalent).
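In code, the strategies look roughly like this (a minimal sketch over (tool_name, args) pairs). For the similarity variant I show the Jaccard token-overlap version used in the next section's experiment rather than the embedding-based one; the stopword list is a placeholder, and requiring matching tool names before comparing arguments is my assumption.

```python
def exact_match(actor_call, spec_call):
    """Strict: tool name and every argument must be identical."""
    return actor_call == spec_call

def tool_name_match(actor_call, spec_call):
    """Loose: accept the cached result whenever the tool names agree."""
    return actor_call[0] == spec_call[0]

STOPWORDS = {"the", "a", "an", "of", "in", "on", "for", "to", "and"}  # placeholder list

def _arg_tokens(args):
    """Normalize argument values to a set of lowercase, stopword-free tokens."""
    text = " ".join(str(v) for v in args.values()).lower()
    return {tok for tok in text.split() if tok not in STOPWORDS}

def jaccard_match(actor_call, spec_call, threshold=0.7):
    """Similarity: token overlap of the normalized arguments must clear a threshold."""
    if actor_call[0] != spec_call[0]:
        return False
    a, b = _arg_tokens(actor_call[1]), _arg_tokens(spec_call[1])
    if not a and not b:
        return True
    return len(a & b) / len(a | b) >= threshold
```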
The Naive Approach: Mostly Disappointing
I tested on GAIA Levels 1 and 2 with GPT-5 as actor and GPT-5-nano as speculator, using Jaccard similarity verification (token overlap after normalizing and removing stopwords, threshold 0.7).
| Metric | L1 Baseline | L1 Speculative | L2 Baseline | L2 Speculative |
|---|---|---|---|---|
| Accuracy | 60.4% | 58.5% (-1.9%) | 53.5% | 48.8% (-4.7%) |
| Avg Latency | 173.8s | 184.1s (+5.9%) | 198.5s | 264.3s (+33.2%) |
| Cache Hit Rate | -- | 18.2% | -- | 10.5% |
Cache hit rates of 10-18% meant the speculator's work was wasted on most steps, and tools got executed twice. The failure modes were varied: empty response loops (the actor producing blank outputs repeatedly), unproductive loops (retrying the same failing tool call), premature termination, system crashes from dual-model instability, and pure overhead from zero cache hits.
The Silver Lining
Filtering to just the examples where speculation helped:
- Level 1 (30% of examples): 63% speedup, no accuracy loss
- Level 2 (35% of examples): 37% speedup, no accuracy loss
The winning pattern: slow baselines (many steps, long runtime) relying on inherently slow tools like web search. Fast tools like calculate (<1s) offer no savings since the model's own inference time already exceeds the tool's. The challenge is knowing ahead of time which cases will benefit.
Tool Preferences: Models Have Taste
GAIA's diverse toolset (search, code, vision, file I/O) made it hard to isolate why cache hit rates were so low. Was it bad tool name prediction, bad argument prediction, or both? To study tool preferences in isolation, I switched to HotPotQA (a multi-hop QA dataset where questions require combining info from multiple sources, e.g., "Were the directors of Jaws and Casino Royale born in the same country?"). HotPotQA tasks mostly just need web search, so I could focus on how models choose between functionally similar tools. I gave both models access to 9 different search APIs and ran 100 queries with cosine similarity verification (threshold 0.8).
Different models strongly prefer different tools, even with the same options. The draft model favored Brave Search; the target preferred DuckDuckGo and Google Serper. When the draft picked DuckDuckGo, the target picked a different tool 57% of the time.
But it barely mattered for answer quality. Accuracy with matching tool names: 74.1%. With different tool names: 76.9%. Most search tools are functionally interchangeable. This suggests draft models could use entirely different (cheaper) tool sets and still produce reusable results.
The system hit 81% cache rate and 76% accuracy on HotPotQA, a big improvement over GAIA, because the toolset was more homogeneous (mostly search) compared to GAIA's diverse mix of search, code execution, file I/O, and vision tools.
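The cosine-similarity verification used here can be sketched with an off-the-shelf sentence-embedding model. The specific model, and the choice to embed the tool name together with the arguments, are assumptions on my part rather than the exact setup from the experiment.

```python
from sentence_transformers import SentenceTransformer, util

# Any small sentence-embedding model would do; MiniLM is just a convenient default.
_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_match(actor_call, spec_call, threshold=0.8):
    """Accept the cached result if the two tool calls are semantically close enough."""
    texts = [f"{name} {args}" for name, args in (actor_call, spec_call)]
    emb = _embedder.encode(texts, convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold
```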
Integrating Into Real Frameworks
I also integrated the pattern directly into Co-Sight and ODR.
Co-Sight: Gemini-Flash-Lite generated 3 candidate predictions in parallel (with a small temperature for diversity). If all 3 agreed on the same tool call and arguments, execute immediately without consulting the expensive verifier (Gemini-Pro). If they disagreed, invoke the verifier to make the final decision. Result: 33% speedup (169s to 113s) on a representative task.
ODR: Latency increased from 128s to 162s. The draft model proposed broader search queries, triggering extra tool calls and giving the LLM more content to synthesize. With baseline tool-idle time already below 5%, there wasn't enough latency to hide, and the pipeline overhead dominated.
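The Co-Sight integration is essentially self-consistency used as a cheap verifier: only pay for the strong model when the cheap drafts disagree. Here's a minimal sketch with placeholder function names (the real integration samples the candidates in parallel; they're drawn sequentially here for brevity):

```python
def speculate_with_agreement(context, draft_predict, verifier_predict, n=3, temperature=0.3):
    """Sample n draft tool calls; fall back to the expensive verifier only on disagreement."""
    candidates = [draft_predict(context, temperature=temperature) for _ in range(n)]

    if all(c == candidates[0] for c in candidates):
        # Unanimous drafts: execute immediately, skipping the verifier entirely.
        return candidates[0], "draft-unanimous"

    # Drafts disagree: let the strong model make the final call.
    return verifier_predict(context), "verifier"
```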
Fine-Tuning: Training a Better Predictor
The root cause of poor naive performance was clear: 10-18% cache hit rates. Could fine-tuning close the gap?
I fine-tuned Qwen-3-4B using LoRA (Low-Rank Adaptation, which updates only small adapter matrices to keep inference fast) on expert tool-use trajectories. The training data was synthetic: each GAIA example comes with human-annotated metadata describing which tools were used and why (e.g., "use a search engine to find X, then open a spreadsheet to filter Y"). I fed this metadata to Gemini 2.5 Pro and had it produce structured thought → tool call → observation traces in strict JSON format, essentially a blueprint of what an ideal agent trajectory should look like. No real tools were executed during generation. 90 training examples, 28 validation, 27 test.
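For concreteness, one synthetic trajectory might look roughly like this. The field names and the example question are invented for illustration; this is not the project's actual schema.

```python
# Hypothetical shape of one synthetic trajectory distilled from GAIA metadata.
example = {
    "question": "What was the 2019 population of the city where the painter of <work> was born?",
    "trajectory": [
        {
            "thought": "First I need to identify the painter of <work>.",
            "tool_call": {"name": "search", "args": {"query": "painter of <work>"}},
            "observation": "<work> was painted by <artist>.",
        },
        {
            "thought": "Now find the artist's birthplace and its 2019 population.",
            "tool_call": {"name": "wikipedia", "args": {"title": "<artist>"}},
            "observation": "<artist> was born in <city>, population <population> (2019).",
        },
        {
            "thought": "I have everything I need to answer.",
            "tool_call": {"name": "final_answer", "args": {"answer": "<population>"}},
            "observation": None,
        },
    ],
}
```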
I evaluated with two metrics: Tool Name Score (exact match of predicted tool names against expert traces, 0 to 1) and LLM-as-judge (Gemini 2.5 Pro comparing base vs. fine-tuned outputs on tool selection, formatting, and reasoning quality, returning "base wins," "fine-tuned wins," or "tie").
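The Tool Name Score boils down to a per-step exact-match rate. A minimal sketch, assuming the predicted and expert traces are aligned step by step (how misaligned traces are handled is an assumption):

```python
def tool_name_score(predicted: list[str], expert: list[str]) -> float:
    """Fraction of expert steps whose tool name the model reproduced exactly."""
    if not expert:
        return 0.0
    hits = sum(p == e for p, e in zip(predicted, expert))
    return hits / len(expert)
```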
Small Toolset (4 tools): Impressive Gains
With search, wikipedia, calculator, final_answer: tool name accuracy jumped from 1.9% to 51.9%. The fine-tuned model won 37% of head-to-head comparisons, with 63% ties and 0% losses.
Large Toolset (17 tools): Things Fall Apart
With 17 tools including near-duplicates (search, web_search, search_web, search_with_content): accuracy dropped to 31.5%. Win rate collapsed to 11.1%; the base model won 33.3%.
New failure modes emerged:
- Catastrophic looping: `final_answer` called 329 times across 27 examples (expert: 21 times).
- Tool hallucination: `vision_ocr` predicted 118 times when the expert used `file_reader` twice.
- Semantic confusion: Abandoned the correct `search` tool (5 calls vs. expert's 38), confused by near-duplicates.
Fine-tuning can teach a small model what tools to use, but not reliably when to stop or how to disambiguate similar options. It went from "doesn't act" to "acts too aggressively."
Why This Is Harder Than Speculative Decoding
In speculative decoding, verification is a simple exact token match. With tool calls, you're comparing structured arguments that can be semantically equivalent but syntactically different, which is why verification strategies are so important and imperfect. Prediction is also fundamentally harder: the next token is heavily constrained by grammar and local context, but the next tool call requires reasoning about the full task state, what's been tried, what information is missing, and which tool best fills the gap. Failures are more costly too. Rejecting a wrong token is nearly free, but a wrong tool call wastes real compute on an API call, and in the worst case, its result can pollute the agent's context and derail subsequent reasoning. Finally, there's an asymmetry in context: in token-level speculation the draft generates tokens the target hasn't seen yet, but in tool-level speculation both models see the exact same conversation history, which you'd expect to make prediction easier. The fact that it's still so hard underscores how much of tool selection depends on deep reasoning rather than surface-level pattern matching.
When Does It Actually Help?
Speculation works when tasks follow predictable tool patterns, the baseline takes many steps, and tool latency is high relative to model latency.
Speculation hurts when tool selection requires complex reasoning, the baseline is already efficient, or bad predictions lead the actor astray.
What's Next
Speculative tool calling has real potential (37-63% speedups on favorable tasks), but it's far from a drop-in optimization. Promising directions:
- Adaptive speculation: Learn when to speculate based on task context and model uncertainty, rather than every step.
- Specialized draft models: Separate predictors for different tool families to reduce semantic confusion.
- Tool vocabulary alignment: Match the draft model's training environment exactly to inference-time tools.
- Multi-step speculation: Predict short trajectories and verify jointly.
- Hybrid approaches: Combine tool-level and token-level speculative decoding.
The fundamental insight holds: agents waste most of their time deciding what to do, not doing it. If we can predict those decisions better, there's a lot of latency to reclaim.
Code and experiment artifacts: https://github.com/Yogesh914/speculative_tool
References
- Hua, W. et al. Interactive Speculative Planning. arXiv:2410.00079 (2024).
- Incident.io Engineering Team. Speculative Tool-Calling Architecture. incident.io/blog (2024).
- Ye, N. et al. Speculative Actions. arXiv:2509.06871 (2025).
- Guan, Y. et al. Dynamic Speculative Agent Planning. arXiv:2509.01920 (2025).
- Mialon, G. et al. GAIA: A General AI Assistant Benchmark. arXiv:2311.12983 (2023).
- Yao, S. et al. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 (2022).
- LangGraph Team. langchain-ai.github.io/langgraph (2024).