Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

What is tool use (a.k.a. function calling)?

A model that only emits text somehow ends up booking your flight. The trick isn't in the weights — it's in the contract between model, harness, and your code.

AI & ML intro May 7, 2026

Why it exists

Ask ChatGPT or Claude “what’s the weather in Tokyo right now?” and you’ll often get back an actual current temperature. Pause on that for a second. The model’s weights were frozen months ago. It has no clock, no thermometer, and no internet connection wired into the matrix multiplications. All it does — at the lowest level — is map a sequence of tokens to a probability distribution over the next token. So how did the right number end up on your screen?

The short version: the model didn’t fetch the weather. It emitted some text of a particular shape — something like {"name": "get_weather", "args": {"city": "Tokyo"}} — and a different program, sitting between you and the model, recognized that shape, called the real weather API, and pasted the result back into the conversation before asking the model to continue.

That dance is “tool use” (when the docs are talking about agents) or “function calling” (when they’re talking about an API). It exists because a raw LLM can think about the world but can’t touch it. Tool use is the seam where text-prediction meets the rest of your software. Without it, every assistant is a parlor trick that ends at the edge of its training data.

Why it matters now

Almost every useful AI product in 2026 is a tool-using one. Coding agents that read your files and run your tests, chatbots that search the web, assistants that book travel or file tickets, anything wired into MCP — under the hood, all of them are doing the same loop: model emits a structured call, host runs it, result goes back into context.

This is also where most agent failures land. The model picks the wrong tool, hallucinates an argument that looks plausible but doesn’t exist, calls the same tool ten times in a row, or misreads the result and confidently lies about it. None of those are “the model isn’t smart enough” — they’re specific failure modes of the tool-use contract. You can’t reason about why your agent broke without a clear picture of what tool use actually is.

It’s also the layer where the API providers compete most directly. OpenAI, Anthropic, Google, and the open-model serving stacks each ship slightly different tool-use surfaces, but they all converged on the same shape: a JSON schema for each tool, a way for the model to emit a call against that schema, and a server-side guarantee that the call will parse.

The short answer

tool use = the model emits a structured call + the host executes it + the result is fed back into the context

The model never runs anything. It just produces text in a format the host agrees to interpret as “please run this function.” The host runs it, appends the output to the conversation, and asks the model to continue. Loop until the model decides it’s done.
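That's the whole contract. Here is a minimal sketch of the loop in Python, where client.generate and run_tool are stand-ins for whatever SDK and dispatch code your harness actually uses (they are not a real API):

# Minimal agent loop (hypothetical client and tool runner, not a real SDK)
def agent_loop(client, messages, tools):
    while True:
        reply = client.generate(messages=messages, tools=tools)  # one model call
        if reply.type != "tool_call":
            return reply.text                                    # model decided it's done
        result = run_tool(reply.name, reply.args)                # the host, not the model, executes
        messages.append({"role": "tool",
                         "name": reply.name,
                         "content": result})                     # result goes back into context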

How it works

There are four moving parts. Once you see them separately, “function calling” stops feeling magical.

1. The tool list is in the prompt

When you start a conversation that supports tools, the host sends the model a list of available tools. Each tool has a name, a one-line description, and a JSON schema for its arguments — exactly the kind of metadata you’d write into an OpenAPI spec. In a typical API call:

{
  "tools": [
    {
      "name": "get_weather",
      "description": "Get the current weather for a city.",
      "input_schema": {
        "type": "object",
        "properties": {
          "city":  {"type": "string"},
          "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["city"]
      }
    }
  ]
}

The provider’s SDK formats this into whatever the model was trained to read — often a section of the system prompt, sometimes a dedicated channel in a chat template. The model sees the tools the same way it sees any other text: as tokens.
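You never see that rendered form through the SDK, but a toy version makes the point that it's just text. The format below is invented for illustration; it is not any provider's real chat template:

# Toy rendering of a tool list into prompt text (invented format, for illustration only)
import json

def render_tools(tools):
    lines = ["You can call the following tools:"]
    for tool in tools:
        lines.append(f"- {tool['name']}: {tool['description']}")
        lines.append(f"  arguments schema: {json.dumps(tool['input_schema'])}")
    lines.append('To call one, emit JSON like {"name": ..., "args": {...}} and nothing else.')
    return "\n".join(lines)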

So when people say “the model decided to call get_weather,” that’s shorthand. What actually happened is: the model conditioned on (your question + the description of get_weather) and the next-token distribution put high probability on a sequence of tokens that, decoded, spells out a tool call.

2. Post-training taught it the contract

A base model trained only on internet text would not, on its own, reliably emit {"name": "get_weather", ...} when asked about the weather. It might write a tutorial about weather APIs instead.

What makes function calling work is post-training — supervised fine-tuning and RLHF — on examples of the form “given this tool list and this user question, the right next thing to emit is this tool call.” After enough examples, the model has internalized: “when a request needs information I don’t have, and a relevant tool is in the list, the right move is to emit a call against that tool’s schema, not to guess.”
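What one of those examples might look like, in a made-up format (real post-training data is provider-internal, so this is shape, not substance):

# Illustrative shape of a single fine-tuning example (invented format; real training data is not public)
example = {
    "tools": [{"name": "get_weather",
               "description": "Get the current weather for a city.",
               "input_schema": {"type": "object",
                                "properties": {"city": {"type": "string"}},
                                "required": ["city"]}}],
    "messages": [{"role": "user", "content": "Is it raining in Tokyo right now?"}],
    "target": {"type": "tool_call", "name": "get_weather", "args": {"city": "Tokyo"}},
}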

This is the part that’s easy to miss. Tool use is not a property of the model architecture; it’s a behavior the model was trained to perform when the prompt has the right shape. The model can still ignore tools, call the wrong one, or invent a tool that doesn’t exist. Better post-training narrows those failure modes; it doesn’t eliminate them.

3. Constrained decoding makes the call parse

Even with great post-training, the model is still sampling tokens, and a single bad token (a stray comma, a missing brace) would break the parse. The cleanest way around this is constrained decoding: once the model has decided it’s emitting a tool call, the inference server tracks which tokens can still keep the JSON valid given the prefix so far, sets the probability of every other token to zero, renormalizes, and samples from what’s left. The model still chooses the city name and the units, weighted by its own distribution — it just can’t go off the rails of the grammar.
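A toy version of that masking step, assuming a grammar.allowed_tokens(prefix) oracle that knows which next tokens keep the output valid (real implementations compile the schema into a state machine over the tokenizer's vocabulary):

# Toy constrained-decoding step (the `grammar` oracle is assumed, not a real library)
import numpy as np

def constrained_sample(logits, prefix, grammar):
    mask = np.full(logits.shape, -np.inf)
    for token_id in grammar.allowed_tokens(prefix):    # tokens that keep the JSON valid so far
        mask[token_id] = 0.0
    masked = logits + mask                              # everything else gets probability zero
    masked -= masked.max()                              # stable softmax
    probs = np.exp(masked)
    probs /= probs.sum()                                # renormalize over the allowed set
    return int(np.random.choice(len(probs), p=probs))  # still sampled from the model's own distribution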

OpenAI has documented this explicitly: their “Structured Outputs” mode (strict: true) uses constrained decoding to guarantee that the call matches the supplied JSON Schema. Other providers ship similar schema-conformance guarantees for tool use, though they don’t always spell out the mechanism — it can be constrained decoding, heavy post-training, internal retries, or a mix. The contract you can rely on is that the call will parse against the schema; how the provider gets there is sometimes opaque.
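For concreteness, here's roughly what a strict tool definition looks like against OpenAI's Chat Completions API at the time of writing; field names and model support shift, so verify against current docs before copying:

# Hedged example: OpenAI-style strict function calling as documented at the time of writing
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "strict": True,                      # ask for schema-guaranteed output
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                # strict mode requires every property in `required` and no extra keys
                "required": ["city", "units"],
                "additionalProperties": False,
            },
        },
    }],
)
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)   # arguments arrive as a JSON string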

(For the longer version of this idea, see Why is structured output so hard? — function calling is structured output with a per-tool schema and a “when to call” prior baked in.)

4. The host runs the tool and feeds the result back

The model’s part is over the moment the tool call is emitted. From here on, the agent harness takes over:

user:     "what's the weather in Tokyo?"

→ model emits: tool_call(get_weather, {city: "Tokyo", units: "celsius"})
→ harness validates the args against the schema
→ harness asks the user to approve (if the tool is sensitive)
→ harness calls the real weather API
→ harness appends to context: tool_result(get_weather, "14°C, light rain")
→ model emits: "It's 14°C and lightly raining in Tokyo right now."
→ harness exits the loop, returns the answer

The model never touched the network. It saw a question, emitted a structured call, then — on the next turn, with the result now in its context — produced the friendly answer. From the user’s perspective it felt like one continuous reply, but it was at least two model calls and a real HTTP request stitched together by the host.
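A rough sketch of the host side of one call, with hypothetical names throughout; the argument validation and the approval gate are the pieces people tend to skip and later regret:

# Sketch of the host side of one tool call (all names hypothetical; error strings go back to the model as tool results)
import jsonschema  # third-party: pip install jsonschema

def fetch_weather(city, units="celsius"):
    return f"14°{'C' if units == 'celsius' else 'F'}, light rain"   # stand-in for the real API call

WEATHER_SCHEMA = {
    "type": "object",
    "properties": {"city": {"type": "string"},
                   "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}},
    "required": ["city"],
}

TOOLS = {"get_weather": {"handler": fetch_weather, "schema": WEATHER_SCHEMA, "sensitive": False}}

def handle_tool_call(name, args):
    spec = TOOLS.get(name)
    if spec is None:
        return f"error: unknown tool '{name}'"             # models do invent tools; say so instead of crashing
    try:
        jsonschema.validate(args, spec["schema"])           # never trust the arguments blindly
    except jsonschema.ValidationError as err:
        return f"error: invalid arguments: {err.message}"
    if spec["sensitive"]:                                   # approval gate for destructive tools
        if input(f"approve {name}({args})? [y/N] ").strip().lower() != "y":
            return "error: user declined the call"
    return spec["handler"](**args)                          # the only line that touches the real world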

Where it gets subtle

The thing to walk away with: a function-calling model isn’t doing anything fundamentally different from a regular model. It’s still emitting the most likely next tokens. The trick is that the prompt, the training, and the decoder have all been arranged so that — when the situation calls for it — the most likely next tokens spell out a structured call into your code.

Going deeper

What I’m sure of: the model itself doesn’t execute anything (the host does), post-training is what teaches a model to emit calls in the right shape, and OpenAI’s strict Structured Outputs mode is documented to use constrained decoding; all of that is in provider docs and published work. What’s less publicly specified is the exact mechanism each provider uses for strict tool use: some are clearly constrained decoding, others may be a mix of constrained decoding, heavy fine-tuning, and internal retries. The per-provider wire format and which specific models support which tool-use features change often; treat any specific claim like “model X supports parallel tool calls as of date Y” as something to verify against current docs rather than memorize.