Alibaba Releases Qwen-AgentWorld to Simulate AI Agent Environments

Alibaba’s Qwen team released Qwen-AgentWorld on June 25, a language model trained not to take actions inside agent environments but to predict what those environments return. The researchers say that modeling the environment rather than the agent makes agent training cheaper and more controllable.

What Was Released

Qwen-AgentWorld covers seven domain types in a single architecture: MCP, Search, Terminal, Software Engineering, Android, Web, and OS. Instead of optimizing action selection, the model learns to simulate the environment itself: given an action, it predicts the resulting system state. This means AI agents can be trained and tested in synthetic environments without needing access to live systems.
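To make the idea concrete, here is a minimal sketch of an agent loop in which the environment is replaced by a simulator. Everything in it is hypothetical and for illustration only: the `WorldModel` interface, the `ToyTerminalWorldModel` stub with canned responses, and the `rollout` helper are assumptions, not Qwen-AgentWorld's actual API, and a real world model would generate its predictions rather than look them up.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol, Tuple

# One step of interaction: (action taken by the agent, observation returned).
Step = Tuple[str, str]


class WorldModel(Protocol):
    """Hypothetical interface: predict what the environment would return."""

    def predict(self, history: List[Step], action: str) -> str: ...


@dataclass
class ToyTerminalWorldModel:
    """Stand-in for a learned simulator, using canned terminal outputs."""

    responses: Dict[str, str] = field(
        default_factory=lambda: {"pwd": "/home/agent", "ls": "README.md  src"}
    )

    def predict(self, history: List[Step], action: str) -> str:
        # A learned model would condition on the full history; this stub
        # only looks at the current action.
        return self.responses.get(action, f"sh: {action}: command not found")


def rollout(
    policy: Callable[[List[Step]], str],
    world_model: WorldModel,
    max_steps: int,
) -> List[Step]:
    """Run an agent policy against a simulated environment, no live system."""
    history: List[Step] = []
    for _ in range(max_steps):
        action = policy(history)
        observation = world_model.predict(history, action)
        history.append((action, observation))
    return history


def scripted_policy(history: List[Step]) -> str:
    """Toy agent that replays a fixed list of commands."""
    script = ["pwd", "ls", "frobnicate"]
    return script[len(history) % len(script)]


if __name__ == "__main__":
    for action, observation in rollout(scripted_policy, ToyTerminalWorldModel(), 3):
        print(f"$ {action}\n{observation}")
```

The point of the design is that the agent only ever sees `predict`, so the same loop could in principle run against a live terminal or a simulated one without changes to the agent.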

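Two model scales were published: a 35-billion-parameter sparse model (35B-A3B) and a 397-billion-parameter variant (397B-A17B). In Qwen's naming convention, the "A" suffix gives the number of parameters active per token, roughly 3 billion and 17 billion respectively. Weights for the 35B model are available on Hugging Face and ModelScope under the Apache 2.0 license. The 397B model is not publicly released.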

How It Was Trained

The team used a three-stage pipeline. Continual pre-training builds domain knowledge from more than 10 million real agent-environment interaction trajectories. Supervised fine-tuning activates step-by-step reasoning using thinking blocks. Reinforcement learning then optimizes prediction quality against recorded environment responses.
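The article does not say how prediction quality is scored in the reinforcement learning stage. As an illustration only, the sketch below scores a predicted environment response against a recorded one using plain string similarity from Python's standard library. The `prediction_reward` function and the choice of metric are assumptions, not the team's actual reward.

```python
from difflib import SequenceMatcher


def prediction_reward(predicted: str, recorded: str) -> float:
    """Illustrative reward in [0, 1]: how closely a predicted environment
    response matches the response recorded from the real environment.

    Assumption: character-level similarity is used here only as a simple
    stand-in; the actual training signal is not described in the article.
    """
    if predicted == recorded:
        return 1.0
    return SequenceMatcher(None, predicted, recorded).ratio()


if __name__ == "__main__":
    recorded = "README.md  src"
    for guess in ["README.md  src", "README.md src", "permission denied"]:
        print(f"{guess!r}: {prediction_reward(guess, recorded):.2f}")
```

A graded score like this gives partial credit for near misses, whereas an exact-match reward would treat a one-character difference the same as a completely wrong prediction.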

Benchmark Results

On AgentWorldBench, a new benchmark released alongside the model for evaluating simulation quality, the 397B variant scored 58.71, outperforming GPT-5.4 (58.25), Claude Opus 4.8, and Gemini 3.1 Pro, according to Alibaba. The 35B model scored 56.39, above Claude Sonnet 4.6.

Why It Matters

Training AI agents today typically requires running them inside live environments or expensive high-fidelity simulators, which limits scale and introduces real-world risk during development. A world model that can reliably simulate environment responses would allow agents to be trained on synthetic rollouts and tested in controlled conditions before any deployment. The seven-domain coverage of a single model also reduces the overhead of maintaining separate simulation infrastructure for each task type.