The most surprising thing about vLLM’s tool calling support is that it’s less a separate model capability than a serving-layer feature: the OpenAI-compatible server describes your tools to the model inside the prompt and parses the model’s structured output back into tool-call objects. Structure the request correctly and the engine’s ordinary inference does the rest.

Let’s see it in action. Imagine you have a Python function that can fetch the current weather:

import json

def get_weather(city: str, unit: str = "celsius") -> str:
    """Fetches the current weather for a given city."""
    # In a real scenario, this would call a weather API
    weather_data = {
        "london": {"temperature": "15", "unit": "celsius"},
        "tokyo": {"temperature": "25", "unit": "celsius"},
        "new york": {"temperature": "70", "unit": "fahrenheit"}
    }
    city_data = weather_data.get(city.lower())
    if city_data:
        return json.dumps({
            "city": city,
            "temperature": city_data["temperature"],
            "unit": city_data["unit"]
        })
    else:
        return json.dumps({"error": f"Weather data not available for {city}"})

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Fetches the current weather for a given city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "The city to get the weather for."},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit of temperature."}
                },
                "required": ["city"]
            }
        }
    }
]

Now, let’s use vLLM to prompt the model to use this get_weather tool. We’ll set up a vLLM server and then make a request.

First, start the vLLM OpenAI-compatible server with a model that handles tool calling well. Note that hosted models such as gpt-3.5-turbo or claude-3-haiku cannot be served by vLLM; you need an open-weights model, ideally one fine-tuned for function calling (the Mistral and Mixtral Instruct families and Hermes fine-tunes are common choices). Recent vLLM versions also require tool-calling flags on the server; check your version’s docs for the parser that matches your model.

python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --served-model-name mixtral-instruct \
    --enable-auto-tool-choice \
    --tool-call-parser mistral

Then, in your Python client, you’d make a request like this:

import requests
import json

# Assuming your vLLM server is running on localhost:8000
VLLM_API_URL = "http://localhost:8000/v1/chat/completions"

def call_vllm_tool_calling(prompt: str, tools: list):
    """Send a chat completion request with tool definitions attached."""
    payload = {
        "model": "mixtral-instruct",  # Use the served-model-name from your server
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "tools": tools,
        "tool_choice": "auto"
    }
    response = requests.post(VLLM_API_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()

user_prompt = "What's the weather like in London?"
result = call_vllm_tool_calling(user_prompt, tools)

print(json.dumps(result, indent=2))

The expected output from vLLM, if it decides to use the tool, will look something like this:

{
  "id": "cmpl-...",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "mixtral-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_...",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"city\": \"London\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 150,
    "completion_tokens": 20,
    "total_tokens": 170
  }
}

Notice how the message.tool_calls array contains a function object with the name and arguments for the get_weather tool. The model has correctly identified that it needs to call get_weather and has parsed the necessary arguments.
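One detail that trips people up: the arguments field is a JSON-encoded string, not a JSON object, so you must decode it before passing the values to your function. A minimal illustration:

```python
import json

# The `arguments` field of a tool call arrives as a JSON-encoded string,
# not a dict, so decode it before use.
raw_arguments = "{\"city\": \"London\"}"
args = json.loads(raw_arguments)
print(args["city"])  # → London
```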

To complete the interaction, you would then take these arguments, call your Python function get_weather(city="London"), get the result, and send it back to the LLM for a final answer.

# ... (previous code)

if result["choices"][0]["finish_reason"] == "tool_calls":
    tool_call = result["choices"][0]["message"]["tool_calls"][0]
    function_name = tool_call["function"]["name"]
    function_args = json.loads(tool_call["function"]["arguments"])

    if function_name == "get_weather":
        function_response = get_weather(**function_args)
        print(f"Tool response: {function_response}")

        # Now, send the tool response back to the model
        payload_follow_up = {
            "model": "mixtral-instruct",
            "messages": [
                {"role": "user", "content": user_prompt},
                {"role": "assistant", "content": None, "tool_calls": [tool_call]},
                {"role": "tool", "tool_call_id": tool_call["id"], "content": function_response}
            ],
            "tools": tools, # You might still need tools here for multi-turn
            "tool_choice": "auto"
        }
        response_final = requests.post(VLLM_API_URL, json=payload_follow_up)
        print("\nFinal response from LLM:")
        print(json.dumps(response_final.json(), indent=2))

The final response from the LLM might look like:

{
  "id": "cmpl-...",
  "object": "chat.completion",
  "created": 1677652289,
  "model": "mixtral-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The weather in London is currently 15 degrees Celsius.",
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 200,
    "completion_tokens": 30,
    "total_tokens": 230
  }
}

The core problem vLLM’s tool calling solves is efficiently generating structured JSON or other formats that represent function calls, without requiring separate fine-tuning steps for every possible tool. It leverages the model’s inherent understanding of instructions and data formats to perform this. The tools parameter in the API request is crucial; it provides the LLM with the schema and descriptions of available functions. The tool_choice parameter (set to "auto" here) allows the model to decide whether to call a tool or respond directly to the user.
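Beyond "auto", the OpenAI-style API also accepts "none" (never call a tool) or a named function to force a specific call; exact support depends on your vLLM version. A sketch of a payload forcing get_weather, using a minimal stand-in schema with the same shape as the one defined earlier:

```python
# Minimal tool schema (same shape as the get_weather schema above).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Fetches the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Forcing a specific function instead of letting the model decide:
forced_payload = {
    "model": "mixtral-instruct",  # matches --served-model-name
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": tools,
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}
```

With a named tool_choice, a compliant server constrains the model to emit a call to that function rather than free-form text.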

The mental model here is that you’re not teaching the model to call functions, you’re describing the functions to it and letting it use its existing capabilities to map user intent to those functions. The model’s output is a structured representation of its decision, which your application then interprets and acts upon. This is a powerful abstraction that decouples the LLM’s reasoning from the execution of actual code.

What most people don’t realize is that the exact format of the tools definition, especially the description fields within the function parameters, has a disproportionately large impact on the model’s ability to correctly infer arguments. Vague descriptions lead to vague or incorrect argument parsing, while precise, unambiguous descriptions allow the model to accurately extract the specific values needed for your functions.
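To make that concrete, compare two hypothetical descriptions for the same parameter of an imagined book_flight tool; the precise one tells the model exactly what format to emit:

```python
# Hypothetical parameter schemas illustrating description quality.
vague = {
    "date": {"type": "string", "description": "The date."}
}
precise = {
    "date": {
        "type": "string",
        "description": "Departure date in ISO 8601 format (YYYY-MM-DD), "
                       "e.g. 2024-07-01. Resolve relative dates such as "
                       "'next Friday' to a concrete date before filling this in.",
    }
}
```

Given "book me a flight next Friday", the vague schema invites outputs like "next Friday", while the precise one steers the model toward a parseable date.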

The next concept you’ll naturally run into is handling multiple tool calls within a single turn, or implementing more complex chaining of tool usage based on tool outputs.
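As a sketch of the multi-call case, assuming the same response shape shown above, you can dispatch every entry in tool_calls through a registry of local functions and build the tool messages in one pass (dispatch_tool_calls and the stub registry are illustrative names, not vLLM API):

```python
import json

def dispatch_tool_calls(tool_calls, registry):
    """Run each requested tool and build the `tool` messages to send back."""
    messages = []
    for call in tool_calls:
        fn = registry[call["function"]["name"]]          # look up local function
        args = json.loads(call["function"]["arguments"])  # decode argument string
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": fn(**args),
        })
    return messages

# Example with a stub tool standing in for the real get_weather:
registry = {"get_weather": lambda city, unit="celsius": '{"temperature": "15"}'}
calls = [{"id": "call_1", "type": "function",
          "function": {"name": "get_weather",
                       "arguments": '{"city": "London"}'}}]
print(dispatch_tool_calls(calls, registry))
```

The returned list drops straight into the follow-up messages array, one tool message per call, keyed by tool_call_id so the model can match results to requests.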
