How LLM Tool Calling Works (and How to Use It Well)

A language model is brilliant at language and hopeless at facts that aren’t in its head. Ask it today’s weather and it will either shrug or, worse, confidently invent a sunny afternoon in a city that’s currently underwater. Tool calling (also called function calling) is how we fix that: instead of making the model pretend it can check the weather, we give it a phone number it can dial and let your code answer the call.

The crucial thing to understand up front is that the model never runs your code. It only asks. It looks at the tools you’ve described, decides one is relevant, and hands back a structured request: “please call get_weather with city = Lisbon.” You execute that function, feed the result back, and the model writes the final answer. The LLM is the brain; your code is the hands.
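
To make that division of labour concrete, here is a minimal sketch of "the model asks, your code executes" expressed as plain data. The dictionary shape is modeled on OpenAI-style tool calls; the get_weather stub, its fixed return value, and the TOOL_REGISTRY name are assumptions for illustration, not part of any SDK.

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical stand-in for a real weather lookup (returns fixed data)."""
    return {"city": city, "unit": unit, "temp": 21}

# Map the tool names the model may request to the functions that implement them.
TOOL_REGISTRY = {"get_weather": get_weather}

# What the model hands back: a tool name plus arguments encoded as a JSON string.
tool_call = {
    "id": "call_1",
    "function": {"name": "get_weather", "arguments": '{"city": "Lisbon"}'},
}

def dispatch(call: dict) -> dict:
    """Run the function the model asked for; the model itself never runs it."""
    func = TOOL_REGISTRY[call["function"]["name"]]
    args = json.loads(call["function"]["arguments"])
    return func(**args)

result = dispatch(tool_call)
```

The model’s contribution is only the tool_call dictionary. Everything that actually touches the outside world happens inside dispatch, which is code you own and can inspect.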

Step one: describe your tools

You hand the model a JSON Schema for each tool. This schema is the entire contract, so write it like documentation for a slightly literal-minded intern.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city. "
                           "Use only when the user asks about weather.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'Lisbon'",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit",
                    },
                },
                # OpenAI's strict mode expects every property to be listed here
                "required": ["city", "unit"],
                "additionalProperties": False,  # Python booleans, not JSON's false/true
            },
            "strict": True,
        },
    }
]

That description field does more work than anything else here. It’s not a comment for you; it’s a prompt the model reads when deciding whether to call the tool at all. A description that says what the tool does and when to use it, paired with a clear enum and an explicit required list, beats a vague one-liner every time. Turn on strict mode where your provider supports it: it constrains the generated arguments to match the schema, which kills a whole category of malformed-JSON bugs.

Step two: the call loop

The flow is a tidy little cycle: request → tool_call → execute → result → final answer. The model returns a tool call (with its arguments as a JSON string, so parse them), you run the real function, append the result to the conversation, and ask again.

import json
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "What's the weather in Lisbon?"}]

resp = client.chat.completions.create(
    model="gpt-5.4", messages=messages, tools=tools
)
msg = resp.choices[0].message

if msg.tool_calls:
    messages.append(msg)  # keep the assistant's request in history
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = run_weather_lookup(**args)  # your actual code
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,   # must echo the id back
            "content": json.dumps(result),
        })
    # second round-trip: model turns raw data into prose
    final = client.chat.completions.create(
        model="gpt-5.4", messages=messages, tools=tools
    )
    print(final.choices[0].message.content)

Two details people miss: append both the assistant’s tool-call message and the tool result before the second call, and match each result to its tool_call_id. Skip either and the model loses the thread. Anthropic’s Claude and Google’s Gemini follow the same loop with different names and message shapes (Claude, for example, returns tool_use blocks instead of tool_calls), so once you’ve built it for one provider, porting is mostly a matter of translating field names rather than re-learning the dance.
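
The snippet above handles exactly one round of tool calls, but a model may ask for another tool after seeing the first result. Below is a minimal sketch of the same loop generalized to several rounds with a hard cap. It uses plain dictionaries and a scripted fake_model in place of a real client so it runs offline; run_tool_loop, fake_model, and max_rounds are names invented for this example.

```python
import json
from typing import Callable

def run_tool_loop(call_model: Callable[[list], dict], tools: dict,
                  messages: list, max_rounds: int = 5) -> str:
    """Call the model until it answers in prose or max_rounds is exhausted."""
    for _ in range(max_rounds):
        msg = call_model(messages)
        messages.append(msg)  # keep the assistant turn in history
        calls = msg.get("tool_calls") or []
        if not calls:
            return msg["content"]  # no tool requested: this is the final answer
        for call in calls:
            args = json.loads(call["function"]["arguments"])
            result = tools[call["function"]["name"]](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],  # pair the result with its request
                "content": json.dumps(result),
            })
    raise RuntimeError("model kept requesting tools; giving up")

def fake_model(messages: list) -> dict:
    """Scripted stand-in for a real model: asks for a tool once, then answers."""
    if messages[-1]["role"] == "tool":
        data = json.loads(messages[-1]["content"])
        text = f"It is {data['temp']} degrees in {data['city']}."
        return {"role": "assistant", "content": text}
    call = {"id": "call_1",
            "function": {"name": "get_weather", "arguments": '{"city": "Lisbon"}'}}
    return {"role": "assistant", "content": None, "tool_calls": [call]}

history = [{"role": "user", "content": "What's the weather in Lisbon?"}]
weather_tools = {"get_weather": lambda city: {"city": city, "temp": 21}}
answer = run_tool_loop(fake_model, weather_tools, history)
```

The cap on rounds is a design choice rather than a requirement of any API: it keeps a confused model from looping forever on your token budget.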

The pitfalls (where it all goes sideways)

Hallucinated arguments. The model loves filling blanks. Ask it to book a meeting and it’ll happily invent a time you never gave it. Rule of thumb: if a missing field touches money, deletion, or sending something to a human, the schema should make it required and your code should make the model ask, not guess.
Too many tools. Dump 50 tools into one request and two things happen: your input-token bill balloons, and the model’s aim gets worse because it’s choosing from a crowd. Keep the active set small, or load tools on demand.
No error handling. Your function will fail. Don’t throw a stack trace into the void; return a structured error like {"error": "city not found", "hint": "check spelling"}. The model can read that and usually recovers gracefully, often by asking the user to clarify.
Trusting the output. Well-formed JSON is not correct JSON. Validate and coerce arguments before you execute anything. Treat the model’s output like any other untrusted user input, because that’s exactly what it is.
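
The last two pitfalls can be handled in one small wrapper. This is a sketch under stated assumptions, not a definitive implementation: validate_weather_args and safe_execute are hypothetical helpers, the checks are hand-written for the get_weather schema from earlier, and the lookup function is whatever real code you pass in.

```python
import json

ALLOWED_UNITS = {"celsius", "fahrenheit"}

def validate_weather_args(raw: str) -> dict:
    """Parse and check model-supplied arguments; raise ValueError if anything is off."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc.msg}") from exc
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    unknown = set(args) - {"city", "unit"}
    if unknown:
        raise ValueError(f"unexpected arguments: {sorted(unknown)}")
    city = args.get("city")
    if not isinstance(city, str) or not city.strip():
        raise ValueError("city is required and must be a non-empty string")
    unit = args.get("unit", "celsius")
    if not isinstance(unit, str) or unit not in ALLOWED_UNITS:
        raise ValueError(f"unit must be one of {sorted(ALLOWED_UNITS)}")
    return {"city": city.strip(), "unit": unit}

def safe_execute(raw_args: str, lookup) -> str:
    """Return the tool-message content: data on success, a readable error otherwise."""
    try:
        return json.dumps(lookup(**validate_weather_args(raw_args)))
    except ValueError as exc:
        # Validation problems are safe to show the model so it can correct itself.
        return json.dumps({"error": str(exc), "hint": "ask the user to clarify"})
    except Exception:
        # Keep stack traces in your logs, not in the model's context.
        return json.dumps({"error": "weather lookup failed", "hint": "try again later"})

def demo_lookup(city: str, unit: str) -> dict:
    """Hypothetical lookup used only to exercise the wrapper."""
    return {"city": city, "unit": unit, "temp": 21}

ok = safe_execute('{"city": "Lisbon"}', demo_lookup)
bad = safe_execute('{"city": ""}', demo_lookup)
```

Either way the function returns a JSON string, so the result can go straight into the content field of the tool message and the model sees either data or an error it can act on.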

The takeaway

Tool calling is less “AI magic” and more “a very well-documented API call where the model picks the parameters.” Get three things right and you’re 90% there: write descriptions that explain when to call, mark anything dangerous as required so the model asks instead of guesses, and validate every argument before executing. Build the request→execute→result loop once, keep your tool list lean, and return errors the model can actually read. Do that, and you’ve turned a confident know-it-all into something that genuinely knows when to look things up.
