HIT-SCIR-LACG × Kuaishou Technology

LiveCVEBench

Contamination-Free Agentic Benchmark

Evaluating Code Agents on Real-World CVE Vulnerability Fixing

Top Performers

Loading...
View Full Leaderboard →

About LiveCVEBench

LiveCVEBench is a continuously updated agentic benchmark for evaluating Code Agents on real-world CVE (Common Vulnerabilities and Exposures) vulnerability fixing tasks.

Agents are deployed in real development environments where they must autonomously explore codebases, understand vulnerability context, and implement proper fixes — just like human developers would. Unlike static benchmarks that may suffer from data contamination, LiveCVEBench sources new CVEs after model training cutoff dates, ensuring a fair and unbiased evaluation.

Continuously Updated

New CVEs added regularly to prevent data contamination

Real-World Vulnerabilities

Based on actual CVEs from production software

Agentic Evaluation

Autonomous exploration and fixing in real environments

CVE-Factory

CVE-Factory is a Multi-Agent system for fully automated, end-to-end CVE reproduction. Given CVE records, the system automatically researches details, generates test cases, builds Docker environments, and validates that each vulnerability can be both exploited and patched. The pipeline transforms CVE metadata into reproducible, testable vulnerability environments without manual intervention.

CVE-Factory Pipeline

Sample CVE Tasks

Real-world vulnerability fixing challenges from the benchmark

Loading sample tasks...

Leaderboard

Timeline
2025-01-01
2025-12-31
# Model Agent Tested
/ 0
Accuracy Success (avg) Failed (avg)
Turns Tokens Turns Tokens
Fully compatible with Terminal Bench evaluation framework

Note: The official Terminal Bench statistics script has some issues. The terminus-2 results are our own verified statistics, while other agents' turn/token data are not calculated by Terminal Bench (shown as "-").

Ranking: Entries with the same accuracy share the same rank. Ties are ordered by average Success Tokens (lower is better).

Submit Your Results

Want to submit your Code Agent's results to the leaderboard? Check out our submission guidelines.

Acknowledgments

We greatly appreciate the following projects:

  • Terminal Bench — Provides a unified evaluation framework for code agents.
  • PatchEval — We converted some of their CVEs to Terminal Bench format and included them in our leaderboard.

Citation

If you find our work useful, please cite our paper:

@misc{luo2026cvefactoryscalingexpertlevelagentic,
      title={CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability},
      author={Xianzhen Luo and Jingyuan Zhang and Shiqi Zhou and Rain Huang and Chuan Xiao and Qingfu Zhu and Zhiyuan Ma and Xing Yue and Yang Yue and Wencong Zeng and Wanxiang Che},
      year={2026},
      eprint={2602.03012},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2602.03012},
}