AgentBench
Leaderboard (new) | Twitter | Google Group | Paper
Join our Slack for Q&A or collaboration on the next version of AgentBench!
[2025.10.10] Introducing **AgentBench FC (Function Calling)** based on AgentRL.
The current repository contains the function-calling version of AgentBench, integrated with AgentRL, an end-to-end multitask and multi-turn LLM agent RL framework. If you wish to use an older version, you can revert to v0.1 or v0.2.
Compared to the original AgentBench, this version uses a function-calling style prompt and adds fully containerized deployment support for the following tasks:
- alfworld (AF)
- dbbench (DB)
- knowledgegraph (KG)
- os_interaction (OS)
- webshop (WS)
Quick Start
We support a quick one-command setup for all the above tasks using Docker Compose.
Before starting, please download or build the following Docker images required by the tasks:
# dbbench
docker pull mysql:8
# os_interaction
docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles
To run the KG freebase server, you will also need a copy of the data found here. Download, extract, and place the data at ./virtuoso_db/virtuoso.db (or modify extra/docker-compose.yml and set the mount point to your data location).
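Before bringing up the stack, it may help to confirm that the KG data is where the compose file expects it. The following is a minimal sketch, not part of AgentBench: the helper name `kg_data_ready` is hypothetical, and the default path is simply the one mentioned above.

```shell
# Hypothetical helper (not part of AgentBench): succeed when the KG data exists.
# The default path is the location named in this README; pass another path if
# you changed the mount point in extra/docker-compose.yml.
kg_data_ready() {
  db_path="${1:-./virtuoso_db/virtuoso.db}"
  [ -e "$db_path" ]
}

if kg_data_ready; then
  echo "Found KG data at ./virtuoso_db/virtuoso.db"
else
  echo "KG data not found: place it at ./virtuoso_db/virtuoso.db or update the mount point"
fi
```

This only checks that something exists at the path; it does not validate the contents of the database.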
Then, you can bring up the stack with:
docker compose -f extra/docker-compose.yml up
This command will download or build the necessary Docker images and start the following services in Docker:
- AgentRL Controller
- alfworld task worker (x1, increase as needed)
- dbbench task worker (x1, increase as needed)
- knowledgegraph task worker (x1, increase as needed)
- os_interaction task worker (x1, increase as needed)
- webshop task worker (x1, increase as needed)
- freebase server (for knowledgegraph task)
- Redis server (for container allocation)
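One possible way to increase the number of task workers is the `--scale SERVICE=NUM` option of `docker compose up`. The sketch below only builds and prints such a command so it can be reviewed first. The helper name `compose_scale_cmd` is hypothetical, and the service names used here (`alfworld`, `dbbench`) are assumptions: check extra/docker-compose.yml for the names actually defined there, and whether the services support being scaled this way.

```shell
# Hypothetical helper: build a `docker compose up` command with scaled workers.
# ASSUMPTION: the compose service names match the task names; verify them in
# extra/docker-compose.yml before running the printed command.
compose_scale_cmd() {
  compose_file="$1"; shift
  cmd="docker compose -f ${compose_file} up"
  for pair in "$@"; do            # each remaining argument is SERVICE=COUNT
    cmd="${cmd} --scale ${pair}"
  done
  printf '%s\n' "$cmd"
}

# Print the command for review instead of executing it.
compose_scale_cmd extra/docker-compose.yml alfworld=2 dbbench=4
```

Printing the command first keeps the sketch side-effect free; once the service names are confirmed, the printed line can be run directly.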
If your machine already has Redis (version 7+) running, you can omit the Redis service from the docker-compose.yml.
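To decide whether the bundled Redis service is needed, a quick version check along these lines may be useful. This is a sketch under assumptions: the helper name `redis_version_ok` is hypothetical, and the probe assumes Redis listens on the default local host and port that `redis-cli` connects to.

```shell
# Hypothetical helper: succeed when a Redis version string has major version >= 7.
redis_version_ok() {
  major="${1%%.*}"                # text before the first dot, e.g. "7" from "7.2.4"
  case "$major" in
    ''|*[!0-9]*) return 1 ;;      # reject empty or non-numeric input
  esac
  [ "$major" -ge 7 ]
}

# ASSUMPTION: Redis, if present, is reachable with redis-cli's default settings.
if command -v redis-cli >/dev/null 2>&1; then
  version="$(redis-cli INFO server 2>/dev/null | tr -d '\r' | sed -n 's/^redis_version://p')"
  if redis_version_ok "$version"; then
    echo "Redis ${version} detected: the Redis service in the compose file can be omitted."
  else
    echo "No Redis 7+ detected: keep the Redis service in the compose file."
  fi
fi
```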
[!WARNING] Please note that the webshop environment requires ~16GB of RAM to start, and the current implementation of alfworld leaks memory and disk space until the task worker is restarted. Make sure your machine has sufficient resources before running.
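A rough preflight check for the memory requirement could look like the following. It is a Linux-only sketch that reads `MemAvailable` from /proc/meminfo. The helper name `enough_ram_kb` is hypothetical, and treating "~16GB" as 16 GiB is an assumption, so adjust the threshold if it proves too strict for your machine.

```shell
# Hypothetical helper: succeed when mem_kb is at least required_gib GiB.
# ASSUMPTION: "~16GB" is interpreted as 16 GiB (16 * 1024 * 1024 kB).
enough_ram_kb() {
  mem_kb="$1"
  required_gib="${2:-16}"
  [ -n "$mem_kb" ] && [ "$mem_kb" -ge $((required_gib * 1024 * 1024)) ]
}

# Linux only: /proc/meminfo reports MemAvailable in kB.
if [ -r /proc/meminfo ]; then
  avail_kb="$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)"
  if enough_ram_kb "$avail_kb"; then
    echo "Memory check passed (${avail_kb} kB available)."
  else
    echo "Warning: under 16 GiB of memory available; the webshop worker may fail to start."
  fi
fi
```

Because alfworld's usage grows until its task worker is restarted, a one-time check like this says nothing about long runs; periodic restarts of that worker are still needed.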
Benchmarking Results
We report the results of various models on the test set of AgentBench FC.
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24).
Tips for getting started
Tap "Get" above, pick your AI app, and follow the steps. Most installs take under 30 seconds.
Heads up: this needs an API key to work. You'll get one from the service's website (usually free). The setup guide tells you exactly where.
What's New
Imported from GitHub
Ratings & Reviews
0.0
out of 5
0 ratings
No reviews yet. Be the first to share your experience.