AgentBench

Name: AgentBench
Author: THUDM

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

👋 Join our Slack for Q & A or collaboration on next version of AgentBench!

🔥 [2025.10.10] Introducing AgentBench FC (Function Calling) based on AgentRL

The current repository contains the function-calling version of AgentBench, integrated with AgentRL, an end-to-end multitask and multiturn LLM agent RL framework. If you wish to use an older version, you can revert to v0.1 or v0.2.

Compared to the original AgentBench, this version uses a function-calling style prompt and adds fully containerized deployment support for the following tasks:

alfworld (AF)
dbbench (DB)
knowledgegraph (KG)
os_interaction (OS)
webshop (WS)

Quick Start

We support a quick one-command setup for all the above tasks using Docker Compose.

Before starting, please download or build the following Docker images required by the tasks:

```shell
# dbbench
docker pull mysql:8

# os_interaction
docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles
```
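
Before bringing up the stack, it can help to confirm that the images from the commands above are actually present. The following is a minimal sketch, not part of AgentBench: the helper name `missing_images` is hypothetical, and it assumes the untagged `docker build -t NAME` commands produced `NAME:latest`, which is Docker's default tag.

```shell
# Images named in the pull/build commands above.
# Assumption: builds without an explicit tag are stored as NAME:latest.
REQUIRED_IMAGES="mysql:8 local-os/default:latest local-os/packages:latest local-os/ubuntu:latest"

# missing_images (hypothetical helper): reads "repository:tag" lines on stdin
# and prints every required image that does not appear in that list.
missing_images() {
  available="$(cat)"
  for image in $REQUIRED_IMAGES; do
    if ! printf '%s\n' "$available" | grep -Fxq -- "$image"; then
      printf '%s\n' "$image"
    fi
  done
}

# Usage (requires a running Docker daemon):
#   docker image ls --format '{{.Repository}}:{{.Tag}}' | missing_images
```

Empty output means all four images were found. Any printed name is an image that still needs to be pulled or built.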

To run the KG freebase server, you will also need a copy of the data found here. Download, extract and place the data at ./virtuoso_db/virtuoso.db (or modify extra/docker-compose.yml and set the mount point to your data location).
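
A quick check that the data landed at the default path can catch a misplaced extraction before the stack starts. This is a rough sketch: the helper name `virtuoso_db_present` is hypothetical, and it assumes you run it from the repository root with the default mount point unchanged.

```shell
# virtuoso_db_present (hypothetical helper): succeeds when
# ROOT/virtuoso_db/virtuoso.db exists. ROOT defaults to the current
# directory, assumed here to be the repository root.
virtuoso_db_present() {
  [ -e "${1:-.}/virtuoso_db/virtuoso.db" ]
}

# Usage, from the repository root:
#   virtuoso_db_present || echo "virtuoso.db not found at ./virtuoso_db/virtuoso.db" >&2
```

If you changed the mount point in extra/docker-compose.yml, check that location instead.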

Then, you can bring up the stack with:

```shell
docker compose -f extra/docker-compose.yml up
```

This command will download or build the necessary Docker images and start the following services in Docker:

AgentRL Controller
alfworld task worker (x1, increase as needed)
dbbench task worker (x1, increase as needed)
knowledgegraph task worker (x1, increase as needed)
os_interaction task worker (x1, increase as needed)
webshop task worker (x1, increase as needed)
freebase server (for knowledgegraph task)
Redis server (for container allocation)
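
One possible way to raise the worker counts listed above is Docker Compose's `--scale SERVICE=NUM` flag. The sketch below only prints the command it would run. The helper name `scale_command` and the service name `webshop` in the usage line are assumptions: list the real service names with `docker compose -f extra/docker-compose.yml config --services`, and note that whether this compose file supports scaling this way is not confirmed here.

```shell
COMPOSE_FILE="extra/docker-compose.yml"

# scale_command (hypothetical helper): prints, without running it, a
# `docker compose up` command with one --scale flag per "service=count" argument.
scale_command() {
  cmd="docker compose -f $COMPOSE_FILE up"
  for pair in "$@"; do
    cmd="$cmd --scale $pair"
  done
  printf '%s\n' "$cmd"
}

# Usage ("webshop" is an assumed service name, check the compose file):
#   scale_command webshop=2
```

Printing the command first makes it easy to review the flags before starting any containers.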

If your machine already has Redis (version 7+) running, you can omit the Redis service from the docker-compose.yml.
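
To see whether an existing Redis meets the version 7+ requirement, you can inspect the `redis_version` field that `redis-cli INFO server` reports. The helper below is a sketch with a hypothetical name, `redis_is_7_plus`. It parses that output from stdin, so it makes no assumption about host or port.

```shell
# redis_is_7_plus (hypothetical helper): reads `redis-cli INFO server` output
# on stdin and succeeds only when the reported major version is 7 or higher.
redis_is_7_plus() {
  # INFO lines end in CRLF, so strip carriage returns before matching.
  version="$(tr -d '\r' | sed -n 's/^redis_version://p' | head -n 1)"
  major="${version%%.*}"
  case "$major" in
    ''|*[!0-9]*) return 1 ;;  # missing or non-numeric version
  esac
  [ "$major" -ge 7 ]
}

# Usage (requires redis-cli and a reachable server):
#   redis-cli INFO server | redis_is_7_plus && echo "Redis is new enough"
```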

Warning: The webshop environment requires ~16GB of RAM to start, and the current implementation of alfworld leaks memory and disk space until the task worker is restarted. Make sure your machine has sufficient resources before running.
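
On Linux, total memory can be read from `/proc/meminfo` before starting the webshop worker. The sketch below uses a hypothetical helper name, `mem_total_gib`, and assumes a Linux host. It reports whole GiB rounded down, so a machine sold as 16GB may print 15.

```shell
# mem_total_gib (hypothetical helper): prints MemTotal from a
# /proc/meminfo-style file in whole GiB, rounded down.
# The file argument defaults to /proc/meminfo (Linux only).
mem_total_gib() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Usage:
#   [ "$(mem_total_gib)" -ge 15 ] || echo "webshop may not start: low RAM" >&2
```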

Benchmarking Results

We report the results of various models on the test set of AgentBench FC.
