yingying123321 / SimXNS

SimXNS, a research project for information retrieval, containing official implementations, by MSRA NLC IR team.

Home Page: https://aka.ms/simxns

SimXNS

✨Updates | 📜Citation | 🤘Furthermore | ❤️Contributing | 📚Trademarks

SimXNS is a research project for information retrieval by MSRA NLC IR team. Some of the techniques are actively used in Microsoft Bing. This repo provides the official code implementations.

Currently, this repo contains SimANS, MASTER and PROD, all of which are designed for information retrieval. The brief descriptions below summarize each work:

  • SimANS is a simple, general and flexible ambiguous negatives sampling method for dense text retrieval. It can be easily applied to various dense retrieval methods such as AR2. We tested it on MS MARCO, Natural Questions and TriviaQA, where it outperformed the previous state-of-the-art methods. It is also used in the Bing search engine, where it has proven effective. The whole method rests on a single sampling formula. [Figure: SimANS]
  • MASTER is a multi-task pre-trained model that unifies multiple pre-training tasks with different learning objectives under the bottlenecked masked autoencoder architecture. We tested it on MS MARCO, Natural Questions and BEIR, where it outperformed the previous state-of-the-art methods. [Figure: MASTER]
  • PROD is a distillation framework for dense retrieval that combines teacher progressive distillation with data progressive distillation to gradually improve the student. We tested it on MS MARCO, Natural Questions and the TREC 2019 Deep Learning Track. With 6-layer students, it exceeds almost all existing methods.
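To make the idea of sampling "ambiguous" negatives more concrete, here is a minimal sketch based on our reading of the SimANS paper, not the code in this repo. It assumes each candidate negative d_i is weighted by exp(-a * (s(q, d_i) - s(q, d+) - b)^2), so negatives scored close to the positive are sampled most often, while very high-scoring candidates (possible false negatives) and very low-scoring ones (too easy) are down-weighted. The function names, the toy scores, the values of `a` and `b`, and the use of sampling with replacement are all illustrative assumptions.

```python
import math
import random


def ambiguous_negative_probs(neg_scores, pos_score, a=1.0, b=0.0):
    """Return sampling probabilities that peak for negatives scored near the positive.

    The weight for negative i is exp(-a * (s_i - s_pos - b) ** 2), then normalized.
    `a` (sharpness) and `b` (peak offset) are assumed hyperparameter names.
    """
    weights = [math.exp(-a * (s - pos_score - b) ** 2) for s in neg_scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_negatives(neg_ids, neg_scores, pos_score, k, a=1.0, b=0.0, seed=None):
    """Draw k negatives (with replacement, for simplicity) using the probabilities above."""
    rng = random.Random(seed)
    probs = ambiguous_negative_probs(neg_scores, pos_score, a, b)
    return rng.choices(neg_ids, weights=probs, k=k)


# Toy example: hypothetical retriever scores for four candidate negatives of one query.
neg_ids = ["d1", "d2", "d3", "d4"]
neg_scores = [0.95, 0.70, 0.40, 0.10]  # d1 may be a false negative, d4 is too easy
pos_score = 0.80

probs = ambiguous_negative_probs(neg_scores, pos_score, a=10.0, b=-0.1)
sampled = sample_negatives(neg_ids, neg_scores, pos_score, k=2, a=10.0, b=-0.1, seed=0)

for doc_id, p in zip(neg_ids, probs):
    print(f"{doc_id}: {p:.3f}")
print("sampled:", sampled)
```

With these toy settings the distribution peaks at d2, whose score sits just below the positive, and assigns the least mass to d4. In practice the scores would come from the current retriever over its top-ranked candidates, and `a` and `b` would be tuned per dataset.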

Updates

  • 2023/02/16: refine the resources of SimANS by uploading the files separately and providing a file list.
  • 2023/02/02: release the official code of PROD.
  • 2022/12/16: release the official code of MASTER.
  • 2022/11/17: release the official code of SimANS.

Citation

If you extend or use this work, please cite the paper in which it was introduced:

  • SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval. Kun Zhou, Yeyun Gong, Xiao Liu, Wayne Xin Zhao, Yelong Shen, Anlei Dong, Jingwen Lu, Rangan Majumder, Ji-Rong Wen, Nan Duan, Weizhu Chen. EMNLP 2022. Code, Paper.
  • MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers. Kun Zhou, Xiao Liu, Yeyun Gong, Wayne Xin Zhao, Daxin Jiang, Nan Duan, Ji-Rong Wen. arXiv. Code, Paper.
  • PROD: Progressive Distillation for Dense Retrieval. Zhenghao Lin, Yeyun Gong, Xiao Liu, Hang Zhang, Chen Lin, Anlei Dong, Jian Jiao, Jingwen Lu, Daxin Jiang, Rangan Majumder, Nan Duan. WWW 2023. Code, Paper.
@inproceedings{zhou2022simans,
   title={SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval},
   author={Kun Zhou and Yeyun Gong and Xiao Liu and Wayne Xin Zhao and Yelong Shen and Anlei Dong and Jingwen Lu and Rangan Majumder and Ji-Rong Wen and Nan Duan and Weizhu Chen},
   booktitle = {{EMNLP}},
   year={2022}
}
@article{zhou2022master,
   title={MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers},
   author={Kun Zhou and Xiao Liu and Yeyun Gong and Wayne Xin Zhao and Daxin Jiang and Nan Duan and Ji-Rong Wen},
   journal = {{arXiv}},
   year={2022}
}
@inproceedings{lin2023prod,
   title={PROD: Progressive Distillation for Dense Retrieval},
   author={Zhenghao Lin and Yeyun Gong and Xiao Liu and Hang Zhang and Chen Lin and Anlei Dong and Jian Jiao and Jingwen Lu and Daxin Jiang and Rangan Majumder and Nan Duan},
   booktitle = {{WWW}},
   year={2023}
}

Furthermore

This repo is still under development. Feel free to report bugs and we will fix them.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

About

License: MIT License


Languages

Python 99.2%, Shell 0.8%