
πŸ‘‹ Hi, everyone!
We are the ByteDance Seed team.

You can get to know us better through our official channels.

Multi-SWE-bench Experiments

This repository contains the predictions, execution logs, trajectories, and results for model inference + evaluation runs on the Multi-SWE-bench task.

πŸ… Leaderboard Submission

If you are interested in submitting your model to the Multi-SWE-bench Leaderboard, please do the following:

  1. Fork this repository.

  2. Clone the repository. Due to this repository's large diff history, consider using git clone --depth 1 if cloning takes too long.

  3. Under the split that you evaluate on (e.g. evaluation/java/verified/), create a new folder with the submission date and the model name (e.g. 20250329_Agentless_Claude-3.7-Sonnet).

  4. Within the folder (evaluation/<split>/<date + model>), please include the following required assets:

    • all_preds.jsonl: Model predictions
    • results/: Multi-SWE-bench evaluation artifacts dump, containing:
      • results.json: Summary of evaluation outcomes
    • logs/: Multi-SWE-bench evaluation artifacts dump, which stores the contents of the language folder generated in the workdir after the evaluation. The folder structure is as follows:
      logs/
         β”œβ”€β”€ [org]/[repo]/            # A given repository
         β”‚  β”œβ”€β”€ evals/                # Files related to the evaluation process
         β”‚  β”‚  β”œβ”€β”€ pr-[id]/           # Files for one instance's evaluation
         β”‚  β”‚  β”‚  β”œβ”€β”€ fix.patch      # The model's generated prediction
         β”‚  β”‚  β”‚  β”œβ”€β”€ fix-patch-run.log   # A log of the evaluation steps
         β”‚  β”‚  β”‚  └── report.json    # Summary of evaluation outcomes for this instance
         β”‚  β”‚  └── ...               # Other instances' evaluation files
         β”‚  └── images/               # (Optional) Files related to the image build process
         └── ...                      # Other repositories

    • metadata.yaml: Metadata for how result is shown on website. Please include the following fields:
      • name: The name of your leaderboard entry
      • orgIcon (optional): URL/link to an icon representing your organization
      • oss: true if your system is open-source
      • site: URL/link to more information about your system
      • verified: false (See below for results verification)
    • trajs/: Reasoning trace reflecting how your system solved the problem
      • Submit one reasoning trace per task instance. The reasoning trace should show all of the steps your system took while solving the task. If your system outputs thoughts or comments during operation, they should be included as well.
      • The reasoning trace can be represented with any text based file format (e.g. md, json, yaml)
      • Ensure the task instance ID is in the name of the corresponding reasoning trace file.
      • For an example, see Agentless + Claude-3.7-Sonnet
  5. Create a pull request to this repository with the new folder, and the leaderboard will automatically update once the PR is merged.
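
Put together, steps 1–4 look roughly like the shell sketch below. The fork URL and the username are placeholders, and the split and entry name reuse the examples from the steps above; the folder layout mirrors the required assets just listed.

    # Shallow clone of your fork (placeholder URL) to avoid the large diff history
    git clone --depth 1 https://github.com/<your-username>/experiments.git
    cd experiments

    # Create the submission folder: evaluation/<split>/<date + model>
    mkdir -p evaluation/java/verified/20250329_Agentless_Claude-3.7-Sonnet
    cd evaluation/java/verified/20250329_Agentless_Claude-3.7-Sonnet

    # Lay out the required assets described above
    touch all_preds.jsonl metadata.yaml
    mkdir -p results logs trajs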
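
Each line of all_preds.jsonl is one standalone JSON object per task instance. The field names below (org, repo, number, fix_patch) are an assumption inferred from the [org]/[repo]/pr-[id] layout shown above, and the values are placeholders; please consult the Multi-SWE-bench tutorial for the authoritative schema.

    {"org": "<org>", "repo": "<repo>", "number": 123, "fix_patch": "diff --git a/src/Foo.java b/src/Foo.java\n..."}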
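
A minimal metadata.yaml sketch using the fields listed above; the values shown are illustrative only.

    name: Agentless + Claude-3.7-Sonnet
    orgIcon: https://example.com/icon.png
    oss: true
    site: https://example.com/agentless
    verified: false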

You can refer to this tutorial for a quick overview of how to evaluate your model on Multi-SWE-bench.

βœ… Results Verification

The Verified check βœ“ indicates that we (the Multi-SWE-bench team) received access to the model and were able to reproduce the patch generations.

If you are interested in receiving the "verified" checkmark βœ“ on your submission, please do the following:

  1. Create an issue.
  2. In the issue, provide instructions on how to run your model on Multi-SWE-bench.
  3. We will run your model on a random subset of Multi-SWE-bench and verify the results.

πŸ” Viewing Trajectories and Logs

We host all model trajectories and execution logs on Hugging Face at Multi-SWE-bench_trajs.
You can download and inspect them locally for detailed analysis.

πŸ™ Acknowledgements

We express our deepest gratitude to the creators of the SWE-bench dataset. This project is an adapted version of their original experiments repository.

πŸ“„ Citation

If you find Multi-SWE-bench helpful for your work, please cite it as follows:

@misc{zan2025multiswebench,
      title={Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving}, 
      author={Daoguang Zan and Zhirong Huang and Wei Liu and Hanwu Chen and Linhao Zhang and Shulin Xin and Lu Chen and Qi Liu and Xiaojian Zhong and Aoyan Li and Siyao Liu and Yongsheng Xiao and Liangqiang Chen and Yuyu Zhang and Jing Su and Tianyu Liu and Rui Long and Kai Shen and Liang Xiang},
      year={2025},
      eprint={2504.02605},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2504.02605},
}

πŸ“œ License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

🏒 About ByteDance Seed Team

Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.
