AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

🌐 Leaderboard (new) | 🐦 Twitter | ✉️ Google Group | 📃 Paper

👋 Join our Slack for Q & A or collaboration on the next version of AgentBench!

🔥 [2025.10.10] Introducing **AgentBench FC (Function Calling)** based on AgentRL

The current repository contains the function-calling version of AgentBench, integrated with AgentRL, an end-to-end multitask and multi-turn LLM agent RL framework. If you wish to use an older version, you can revert to v0.1 or v0.2.

Compared to the original AgentBench, this version uses a function-calling style prompt and adds fully containerized deployment support for the following tasks:

  • alfworld (AF)
  • dbbench (DB)
  • knowledgegraph (KG)
  • os_interaction (OS)
  • webshop (WS)

Quick Start

We support a quick one-command setup for all the above tasks using Docker Compose.

Before starting, please download or build the following Docker images required by the tasks:

shell
# dbbench
docker pull mysql:8

# os_interaction
docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles
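
Before bringing up the stack, it may help to confirm that the four images above are actually present locally. The following is a minimal sketch, not part of the repository: the helper name `missing_images` is hypothetical, and it relies only on `docker image inspect`, which exits non-zero when an image is not found.

```shell
# Images named in the instructions above.
REQUIRED_IMAGES="mysql:8 local-os/default local-os/packages local-os/ubuntu"

# Hypothetical helper: print every image from the argument list that is
# not available locally. `docker image inspect` fails for missing images.
missing_images() {
    for image in "$@"; do
        if ! docker image inspect "$image" >/dev/null 2>&1; then
            echo "$image"
        fi
    done
}
```

Running `missing_images $REQUIRED_IMAGES` should print nothing once every image has been pulled or built; any name it prints still needs the corresponding `docker pull` or `docker build` step.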

To run the KG Freebase server, you will also need a copy of the data found here. Download and extract the data, then place it at ./virtuoso_db/virtuoso.db (or modify extra/docker-compose.yml and set the mount point to your data location).
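
A quick pre-flight check can catch a missing or empty data path before the Freebase service starts. This is a sketch under stated assumptions: `kg_data_ready` and the `KG_DB_PATH` variable are hypothetical names, and the default path is simply the one given above.

```shell
# Default location from the instructions above; set KG_DB_PATH yourself
# if you changed the mount point in extra/docker-compose.yml.
KG_DB_PATH="${KG_DB_PATH:-./virtuoso_db/virtuoso.db}"

# Hypothetical helper: succeed only if the given path exists and is non-empty.
kg_data_ready() {
    [ -s "$1" ]
}
```

For example, `kg_data_ready "$KG_DB_PATH" || echo "KG data not found at $KG_DB_PATH"` reports the problem up front rather than leaving it to the knowledgegraph worker to fail later.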

Then, you can bring up the stack with:

shell
docker compose -f extra/docker-compose.yml up

This command will download or build the necessary Docker images and start the following services in Docker:

  • AgentRL Controller
  • alfworld task worker (x1, increase as needed)
  • dbbench task worker (x1, increase as needed)
  • knowledgegraph task worker (x1, increase as needed)
  • os_interaction task worker (x1, increase as needed)
  • webshop task worker (x1, increase as needed)
  • freebase server (for knowledgegraph task)
  • Redis server (for container allocation)
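
Each task worker starts with one instance. One generic way to run more with Docker Compose is the `--scale SERVICE=NUM` flag of `docker compose up`. The sketch below only prints such a command so it can be reviewed first. The helper name `scaled_up_cmd` is hypothetical, the service name must match whatever extra/docker-compose.yml actually defines, and whether that file supports scaling this way (for example, no fixed container names or host ports on the workers) is an assumption.

```shell
# Hypothetical helper: print (do not run) a compose command that starts
# the stack with N replicas of one service.
#   $1 = service name as defined in extra/docker-compose.yml (not verified here)
#   $2 = number of replicas
scaled_up_cmd() {
    service="$1"
    replicas="$2"
    echo "docker compose -f extra/docker-compose.yml up --scale ${service}=${replicas}"
}
```

Calling `scaled_up_cmd webshop 2` prints a command requesting two replicas of a service named `webshop`; that name is a placeholder, so check the compose file for the real one.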

If your machine already has Redis (version 7+) running, you can omit the Redis service from the docker-compose.yml.
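
To decide whether an existing Redis instance meets the version 7+ requirement, something like the following could be used. It is a sketch: both function names are hypothetical, and it assumes `redis-cli` is installed and can reach the local server. `redis-cli INFO server` reports a line of the form `redis_version:7.2.4`.

```shell
# Hypothetical helper: succeed if a version string such as "7.2.4"
# has a major version of 7 or higher. Non-numeric input fails.
redis_version_ok() {
    major="${1%%.*}"
    [ "$major" -ge 7 ] 2>/dev/null
}

# Hypothetical helper: read the version from a running server.
# INFO output uses CRLF line endings, so tr strips the carriage returns.
local_redis_version() {
    redis-cli INFO server | tr -d '\r' | awk -F: '$1 == "redis_version" {print $2}'
}
```

With both defined, `redis_version_ok "$(local_redis_version)"` succeeds only when the running server is new enough; if it fails, keep the Redis service in the compose file.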

[!WARNING] Please note that the webshop environment requires ~16GB of RAM to start, and the current implementation of alfworld leaks memory and disk space until the task worker is restarted. Make sure your machine has sufficient resources before running.
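
The memory requirement can be checked before starting the stack. The sketch below is Linux-only and makes assumptions: the function names are hypothetical, the ~16GB figure is interpreted as 16 GiB, and `MemAvailable` from /proc/meminfo is used as the estimate of memory that can still be claimed.

```shell
# Hypothetical helper: succeed if the given amount of memory, in kB as
# reported by /proc/meminfo, is at least 16 GiB (an assumed reading of
# the ~16GB figure in the warning above).
enough_ram_for_webshop() {
    required_kb=$((16 * 1024 * 1024))
    [ "$1" -ge "$required_kb" ]
}

# Hypothetical helper: memory currently available, in kB (Linux only).
available_ram_kb() {
    awk '/^MemAvailable:/ {print $2}' /proc/meminfo
}
```

For example, `enough_ram_for_webshop "$(available_ram_kb)" || echo "webshop may fail to start"` gives an early warning. Because the alfworld worker's usage grows until it is restarted, a check that passes at startup does not guarantee enough headroom later.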

Benchmarking Results

We report the results of various models on the test set of AgentBench FC.
