Kuaishou Klear Team × HIT-SCIR-LACG

LiveCVEBench

A Contamination-Free Agentic Benchmark for Evaluating Code Agents on Real-World CVE Vulnerability Fixing

About LiveCVEBench

LiveCVEBench is a continuously updated agentic benchmark for evaluating Code Agents on real-world CVE (Common Vulnerabilities and Exposures) vulnerability fixing tasks.

Agents are deployed in real development environments, where they must autonomously explore the codebase, understand the context of the vulnerability, and implement a proper fix, much as a human developer would.

Unlike static benchmarks, which may suffer from data contamination, LiveCVEBench draws its tasks from CVEs published after model training cutoff dates, so that agents are evaluated fairly on vulnerabilities they are unlikely to have seen during training.
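The cutoff-based selection described above can be illustrated with a minimal sketch. It assumes each CVE record carries a publication date; the record layout, the `published` key, and the function name `select_post_cutoff` are hypothetical and are not taken from the LiveCVEBench codebase.

```python
from datetime import date
from typing import Dict, List


def select_post_cutoff(cves: List[Dict], training_cutoff: date) -> List[Dict]:
    """Keep only CVEs published strictly after a model's training cutoff.

    Assumption: each record is a dict with a `published` key holding a
    datetime.date. A CVE published on the cutoff date itself is excluded.
    """
    return [cve for cve in cves if cve["published"] > training_cutoff]
```

Under this sketch, a model with a later cutoff is evaluated on a smaller, more recent subset, which is one way a continuously updated pool keeps the comparison meaningful as new models appear.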

Continuously Updated

New CVEs added regularly to prevent data contamination

Real-World Vulnerabilities

Based on actual CVEs from production software

Agentic Evaluation

Autonomous exploration and fixing in real environments

Leaderboard

Columns: #, Model, Agent, Tested, Accuracy, Success (avg) Turns, Success (avg) Tokens, Failed (avg) Turns, Failed (avg) Tokens
Fully compatible with the Terminal Bench evaluation framework.

Note: The official Terminal Bench statistics script has some issues. The terminus-2 results are statistics we verified ourselves; for other agents, turn and token data are not computed by Terminal Bench and are shown as "-".

Ranking: Entries with the same accuracy share the same rank. Ties are ordered by average Success Tokens (lower is better).
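The ranking rule above could be sketched as follows. This is an illustration only, not the leaderboard's actual code: the `Entry` type and `rank_entries` function are hypothetical, ranks are assumed to be dense (1, 1, 2 rather than 1, 1, 3), and entries whose token data is shown as "-" are assumed to sort last within a tie.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    model: str
    agent: str
    accuracy: float
    # Average tokens over successful runs; None when the page shows "-".
    success_tokens: Optional[float] = None


def rank_entries(entries: List[Entry]) -> List[Tuple[int, Entry]]:
    """Return (rank, entry) pairs in display order.

    Entries are ordered by accuracy (higher first), then by average
    Success Tokens (lower first). Equal accuracy means equal rank.
    """
    ordered = sorted(
        entries,
        key=lambda e: (
            -e.accuracy,               # higher accuracy first
            e.success_tokens is None,  # assumption: missing data sorts last
            e.success_tokens or 0.0,   # fewer tokens first
        ),
    )
    ranked: List[Tuple[int, Entry]] = []
    rank = 0
    previous_accuracy: Optional[float] = None
    for entry in ordered:
        if entry.accuracy != previous_accuracy:
            rank += 1  # assumption: dense ranking
            previous_accuracy = entry.accuracy
        ranked.append((rank, entry))
    return ranked
```

Token count affects only the order in which tied entries are listed; it never changes the rank number itself.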

Submit Your Results

Want to submit your Code Agent's results to the leaderboard? Check out our submission guidelines.

Acknowledgments

We greatly appreciate the following projects:

  • Terminal Bench — Provides a unified evaluation framework for code agents.
  • PatchEval — We converted some of their CVEs to Terminal Bench format and included them in our leaderboard.