What is tool use (a.k.a. function calling)?
A model that only emits text somehow ends up booking your flight. The trick isn't in the weights — it's in the contract between model, harness, and your code.
Why it exists
Ask ChatGPT or Claude “what’s the weather in Tokyo right now?” and you’ll often get back an actual current temperature. Pause on that for a second. The model’s weights were frozen months ago. It has no clock, no thermometer, and no internet connection wired into the matrix multiplications. All it does — at the lowest level — is map a sequence of tokens to a probability distribution over the next token. So how did the right number end up on your screen?
The short version: the model didn’t fetch the weather. It emitted some text of a particular shape — something like {"name": "get_weather", "args": {"city": "Tokyo"}} — and a different program, sitting between you and the model, recognized that shape, called the real weather API, and pasted the result back into the conversation before asking the model to continue.
That dance is “tool use” (when the docs are talking about agents) or “function calling” (when they’re talking about an API). It exists because a raw LLM can think about the world but can’t touch it. Tool use is the seam where text-prediction meets the rest of your software. Without it, every assistant is a parlor trick that ends at the edge of its training data.
Why it matters now
Almost every useful AI product in 2026 is a tool-using one. Coding agents that read your files and run your tests, chatbots that search the web, assistants that book travel or file tickets, anything wired into MCP — under the hood, all of them are doing the same loop: model emits a structured call, host runs it, result goes back into context.
This is also where most agent failures land. The model picks the wrong tool, hallucinates an argument that looks plausible but doesn’t exist, calls the same tool ten times in a row, or misreads the result and confidently lies about it. None of those are “the model isn’t smart enough” — they’re specific failure modes of the tool-use contract. You can’t reason about why your agent broke without a clear picture of what tool use actually is.
It’s also the layer where the API providers compete most directly. OpenAI, Anthropic, Google, and the open-model serving stacks each ship slightly different tool-use surfaces, but they all converged on the same shape: a JSON schema for each tool, a way for the model to emit a call against that schema, and a server-side guarantee that the call will parse.
The short answer
tool use = the model emits a structured call + the host executes it + the result is fed back into the context
The model never runs anything. It just produces text in a format the host agrees to interpret as “please run this function.” The host runs it, appends the output to the conversation, and asks the model to continue. Loop until the model decides it’s done.
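That loop can be sketched in a few lines of Python. Everything here is an illustrative stand-in — fake_model is a scripted stub where a real harness would call a provider API, and run_tool is a canned dispatch — but the control flow is the actual contract:

```python
def fake_model(messages, tools):
    # Stand-in for a real API call; scripted to mimic a tool-using model.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "name": "get_weather",
                "args": {"city": "Tokyo", "units": "celsius"}}
    result = next(m["content"] for m in messages if m["role"] == "tool")
    return {"type": "text", "text": f"It's {result} in Tokyo right now."}

def run_tool(name, args):
    # In a real harness this dispatches to real code; here it's canned.
    assert name == "get_weather"
    return "14°C, light rain"

def agent_loop(messages, tools, call_model):
    while True:
        reply = call_model(messages, tools)   # model emits text or a tool call
        if reply["type"] != "tool_call":
            return reply["text"]              # model decided it's done
        result = run_tool(reply["name"], reply["args"])   # host executes
        messages.append({"role": "tool", "name": reply["name"],
                         "content": result})  # result goes back into context

answer = agent_loop([{"role": "user", "content": "weather in Tokyo?"}], [], fake_model)
print(answer)  # → It's 14°C, light rain in Tokyo right now.
```

Note that the model function is called twice: once to get the tool call, once to get the final answer — that two-call shape is the minimum for any tool-using turn.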
How it works
There are four moving parts. Once you see them separately, “function calling” stops feeling magical.
1. The tool list is in the prompt
When you start a conversation that supports tools, the host sends the model a list of available tools. Each tool has a name, a one-line description, and a JSON schema for its arguments — exactly the kind of metadata you’d write into an OpenAPI spec. In a typical API call:
{
  "tools": [
    {
      "name": "get_weather",
      "description": "Get the current weather for a city.",
      "input_schema": {
        "type": "object",
        "properties": {
          "city": {"type": "string"},
          "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["city"]
      }
    }
  ]
}
The provider’s SDK formats this into whatever the model was trained to read — often a section of the system prompt, sometimes a dedicated channel in a chat template. The model sees the tools the same way it sees any other text: as tokens.
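To make "the model sees the tools as tokens" concrete, here is one way a tool list could be flattened into prompt text. The format below is invented for illustration — real providers use their own chat templates, which are usually not this simple — but the point stands: by the time the model sees it, the tool list is just more text:

```python
import json

def render_tools(tools):
    # Invented prompt format -- real chat templates differ by provider.
    lines = ["You have access to these tools:"]
    for t in tools:
        lines.append(f"- {t['name']}: {t['description']}")
        lines.append(f"  arguments schema: {json.dumps(t['input_schema'])}")
    return "\n".join(lines)

tools = [{"name": "get_weather",
          "description": "Get the current weather for a city.",
          "input_schema": {"type": "object",
                           "properties": {"city": {"type": "string"}},
                           "required": ["city"]}}]
print(render_tools(tools))
```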
So when people say “the model decided to call get_weather,” that’s shorthand. What actually happened is: the model conditioned on (your question + the description of get_weather) and the next-token distribution put high probability on a sequence of tokens that, decoded, spells out a tool call.
2. Post-training taught it the contract
A base model trained only on internet text would not, on its own, reliably emit {"name": "get_weather", ...} when asked about the weather. It might write a tutorial about weather APIs instead.
What makes function calling work is post-training — supervised fine-tuning and RLHF — on examples of the form “given this tool list and this user question, the right next thing to emit is this tool call.” After enough examples, the model has internalized: “when a request needs information I don’t have, and a relevant tool is in the list, the right move is to emit a call against that tool’s schema, not to guess.”
This is the part that’s easy to miss. Tool use is not a property of the model architecture; it’s a behavior the model was trained to perform when the prompt has the right shape. The model can still ignore tools, call the wrong one, or invent a tool that doesn’t exist. Better post-training narrows those failure modes; it doesn’t eliminate them.
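Schematically, one of those post-training examples has the shape below. The exact record format is provider-internal and this structure is invented for illustration — the point is only that the training target is a call against a listed tool, not a guessed answer:

```python
# Invented, schematic shape of a function-calling fine-tuning example.
sft_example = {
    "tools": [{"name": "get_weather",
               "description": "Get the current weather for a city."}],
    "messages": [{"role": "user", "content": "Is it raining in Tokyo?"}],
    # The "right next thing to emit": a call against a listed tool, not a guess.
    "target": {"type": "tool_call", "name": "get_weather",
               "args": {"city": "Tokyo"}},
}

# The target call references a tool that actually appears in the tool list.
assert sft_example["target"]["name"] in {t["name"] for t in sft_example["tools"]}
```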
3. Constrained decoding makes the call parse
Even with great post-training, the model is still sampling tokens, and a single bad token (a stray comma, a missing brace) would break the parse. The cleanest way around this is constrained decoding: once the model has decided it’s emitting a tool call, the inference server tracks which tokens can still keep the JSON valid given the prefix so far, sets the probability of every other token to zero, renormalizes, and samples from what’s left. The model still chooses the city name and the units, weighted by its own distribution — it just can’t go off the rails of the grammar.
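A toy version of that masking step, over a made-up three-token vocabulary (a real implementation tracks a full JSON grammar over the tokenizer's vocabulary; the log-probabilities here are invented):

```python
import math, random

def constrained_sample(logits, allowed):
    # Mask out every token the grammar forbids, renormalize what's left,
    # and sample -- the model still "chooses" among the legal tokens.
    masked = {tok: lp for tok, lp in logits.items() if tok in allowed}
    total = sum(math.exp(lp) for lp in masked.values())
    probs = {tok: math.exp(lp) / total for tok, lp in masked.items()}
    r, acc = random.random(), 0.0
    for tok, p in probs.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # guard against floating-point round-off

# After emitting '{"city": "Tokyo"', a JSON grammar only allows ',' or '}'.
logits = {"}": 1.0, ",": 0.5, "banana": 3.0}  # model prefers "banana"; grammar says no
tok = constrained_sample(logits, allowed={"}", ","})
print(tok)
```

The model's own distribution still decides between the legal continuations; the mask only guarantees that whatever comes out keeps the JSON parseable.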
OpenAI has documented this explicitly: their “Structured Outputs” mode (strict: true) uses constrained decoding to guarantee that the call matches the supplied JSON Schema. Other providers ship similar schema-conformance guarantees for tool use, though they don’t always spell out the mechanism — it can be constrained decoding, heavy post-training, internal retries, or a mix. The contract you can rely on is that the call will parse against the schema; how the provider gets there is sometimes opaque.
(For the longer version of this idea, see Why is structured output so hard? — function calling is structured output with a per-tool schema and a “when to call” prior baked in.)
4. The host runs the tool and feeds the result back
The model’s part is over the moment the tool call is emitted. From here on, the agent harness takes over:
user: "what's the weather in Tokyo?"
→ model emits: tool_call(get_weather, {city: "Tokyo", units: "celsius"})
→ harness validates the args against the schema
→ harness asks the user to approve (if the tool is sensitive)
→ harness calls the real weather API
→ harness appends to context: tool_result(get_weather, "14°C, light rain")
→ model emits: "It's 14°C and lightly raining in Tokyo right now."
→ harness exits the loop, returns the answer
The model never touched the network. It saw a question, emitted a structured call, then — on the next turn, with the result now in its context — produced the friendly answer. From the user’s perspective it felt like one continuous reply, but it was at least two model calls and a real HTTP request stitched together by the host.
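The “harness validates the args against the schema” step in the trace above can be as simple as checking required keys and enum values against the tool’s input_schema. This is a minimal hand-rolled check covering just those two rules — a real harness would typically use a full JSON Schema validator:

```python
def validate_args(schema, args):
    # Minimal subset of JSON Schema validation: required keys and enums.
    errors = []
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    for key, val in args.items():
        prop = schema["properties"].get(key)
        if prop is None:
            errors.append(f"unknown argument: {key}")
        elif "enum" in prop and val not in prop["enum"]:
            errors.append(f"{key} must be one of {prop['enum']}, got {val!r}")
    return errors

schema = {"type": "object",
          "properties": {"city": {"type": "string"},
                         "units": {"type": "string",
                                   "enum": ["celsius", "fahrenheit"]}},
          "required": ["city"]}

assert validate_args(schema, {"city": "Tokyo", "units": "celsius"}) == []
assert validate_args(schema, {"units": "kelvin"}) == [
    "missing required argument: city",
    "units must be one of ['celsius', 'fahrenheit'], got 'kelvin'",
]
```

The error strings matter: they go back into the model’s context, so they should read like instructions the model can act on, not stack traces.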
Where it gets subtle
- The model picks; the harness enforces. The model decides which tool and what arguments. Whether the call is allowed to actually run, who approves it, what happens on error — none of that is in the model. It’s all harness policy. A delete_everything tool will be called if the model thinks it should be; the only thing standing between that thought and the action is your permission layer.
- The model can hallucinate tools. If the prompt is confusing or the tool list is long, the model can emit a call to a function that doesn’t exist, or pass an argument the schema doesn’t allow. Constrained decoding prevents the parse error, not the semantic error. Good harnesses validate, return a clear “no such tool” message, and let the model recover.
- Parallel and serial calls. Modern APIs let the model emit several tool calls in one turn (run them in parallel) or chain them across turns (call A, see the result, then decide whether to call B). The wire format differs by provider; the underlying loop is the same.
- Function calling vs. tool use vs. MCP. “Function calling” was the original name OpenAI shipped in 2023; the API has since moved toward tools/tool_calls, and Anthropic has used “tool use” from the start. In practice the two terms point at the same idea — the model emits a structured call against a schema. MCP sits a level above: it’s a standard for how the host discovers tools in the first place, so the same tool can be reused across hosts. The model-side mechanism is the same in all three.
- It’s not really a model feature. A useful gut-check: tool use is a contract between the host’s prompt format, the model’s post-training, and the inference server’s decoder. Take any of those three away and the feature collapses. That’s why “does this model support function calling?” is a question about the provider’s stack, not just the weights.
The thing to walk away with: a function-calling model isn’t doing anything fundamentally different from a regular model. It’s still emitting the most likely next tokens. The trick is that the prompt, the training, and the decoder have all been arranged so that — when the situation calls for it — the most likely next tokens spell out a structured call into your code.
Famous related terms
- Agent harness — agent = model + harness. The loop that actually runs the tool and feeds results back. Tool use is the thing the model emits into a harness.
- MCP (Model Context Protocol) — MCP = JSON-RPC + tools/resources/prompts vocabulary + client-server split. The standard for plugging tools into a host without writing a new adapter each time.
- Structured output / JSON mode — structured output = next-token sampling + a hard parse constraint. Function calling is structured output with a per-tool argument schema.
- Constrained decoding — constrained decoding = mask out illegal tokens at each step + sample from what's left. The reason “the call will parse” is something a provider can actually promise.
- ReAct — ReAct ≈ reason + act, interleaved. An early prompting pattern where the model alternated free-text “thoughts” with tool calls; modern tool use is the productized descendant.
- Tool hallucination — tool hallucination = model emits a syntactically valid call to a tool/argument that doesn't exist. The failure mode constrained decoding doesn’t catch.
Going deeper
- Anthropic’s tool use documentation — for the precise wire format of tool definitions, tool calls, and tool results, and how parallel tool use is handled.
- OpenAI’s function-calling guide — the same shape from the other major provider; reading both side by side makes the cross-vendor pattern obvious.
- ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022) — for where the “model alternates between thinking and calling tools” idea was first written down cleanly.
Notes on what I’m sure of: that the model itself doesn’t execute anything (the host does), that post-training is what teaches a model to emit calls in the right shape, and that OpenAI’s strict Structured Outputs mode is documented to use constrained decoding — these are in provider docs and published work. What’s less publicly specified is the exact mechanism each provider uses for strict tool use; some are clearly constrained decoding, others may be a mix of constrained decoding, heavy fine-tuning, and internal retry. The exact per-provider wire format and which specific models support which tool-use features changes often; treat any specific claim about “model X supports parallel tool calls as of date Y” as something to verify against current docs rather than memorize.