quantms is a Nextflow pipeline for the analysis of quantitative proteomics data. The pipeline is based on the OpenMS framework and DIA-NN, and it is designed to analyze large-scale experiments. The main outputs of the quantms workflow are the following:
- mzTab files with the identification and quantification information.
- MSstats input file with the peptide quantification values needed for the MSstats analysis.
- MSstats output file with the differential expression values for each protein.
- The input SDRF of the pipeline if available.
While all the previous formats are well-known standards and popular in the proteomics community, they are difficult to use in big data analysis projects. In addition, these file formats are difficult to extend and do not readily provide multiple views of the underlying data. For example, for big datasets it is extremely hard to retrieve from mzTab the identified peptides and features and their corresponding intensities. It is equally difficult to get the protein quantification values for a given sample.
Here, we aim to formalize and develop a more standardized format that enables a better representation of the identification and quantification results and also enables new use cases for proteomics data analysis. The main use cases for the format are:
- Fast and easy visualization of the identification and quantification results.
- Easy integration with other omics data.
- Easy integration with sample metadata.
- AI/ML model development based on identification and quantification results.
- Easy data retrieval for big datasets and large-scale collections of proteomics data.
Note: We are not trying to replace the mzTab format, but to provide a new format that enables AI-related use cases. Most of the features of the mzTab format will be included in the new format.
quantms.io can be seen as a multiple-view representation of the results of a proteomics data analysis. Each view of the format can be serialized in different formats depending on the use case. The data model of quantms.io defines two main things: the view and how the view is serialized.
- The data model view defines the structure, that is, the fields and properties that will be included in a view for each peptide, PSM, feature or protein, for example.
- The data serialization defines the format in which the view will be serialized and what features of serialization will be supported, for example compression, indexing or slicing.
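
The separation between a view and its serialization can be sketched in Python as follows. This is a minimal illustration and not part of the specification: the `View` class, the `to_tsv` and `to_json` helpers, and the field names (`sequence`, `intensity`) are hypothetical and chosen only for the example.

```python
import csv
import io
import json
from dataclasses import dataclass, field


@dataclass
class View:
    """A view: a name plus the fields every record must provide."""
    name: str
    fields: list[str]
    records: list[dict] = field(default_factory=list)

    def add(self, record: dict) -> None:
        # Reject records that do not match the declared structure.
        if set(record) != set(self.fields):
            raise ValueError(
                f"record fields {sorted(record)} != {sorted(self.fields)}"
            )
        self.records.append(record)


def to_tsv(view: View) -> str:
    """Serialize the view as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=view.fields, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(view.records)
    return buffer.getvalue()


def to_json(view: View) -> str:
    """Serialize the same view as a JSON document."""
    return json.dumps({"view": view.name, "records": view.records})


# Hypothetical field names, for illustration only.
demo = View(name="feature", fields=["sequence", "intensity"])
demo.add({"sequence": "PEPTIDEK", "intensity": 1234.5})
tsv_text = to_tsv(demo)
json_text = to_json(demo)
```

The view definition (`View`) stays the same while the serializer changes, which is the property the two bullet points describe.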
| view | file class | serialization format | definition | example |
|---|---|---|---|---|
| psm | psm_file | parquet | psm | psm example |
| feature | feature_file | parquet | feature | feature example |
| absolute | absolute_file | tsv | absolute | absolute example |
| differential | differential_file | tsv | differential | differential example |
| sdrf | sdrf_file | tsv | metadata | sdrf example |
| project | - | json | project | -- |
Note: Views can be extended and new views can be added to the format.
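
The table of views can be mirrored as a small lookup structure, which also shows how a new view might be registered. This is a sketch under assumptions: the `ViewSpec` type, the `VIEWS` registry, and the `register_view` and `views_by_format` helpers are hypothetical names and are not defined by the format.

```python
from typing import NamedTuple, Optional


class ViewSpec(NamedTuple):
    file_class: Optional[str]  # None where the table lists "-"
    serialization: str


# Values copied from the table of views.
VIEWS = {
    "psm": ViewSpec("psm_file", "parquet"),
    "feature": ViewSpec("feature_file", "parquet"),
    "absolute": ViewSpec("absolute_file", "tsv"),
    "differential": ViewSpec("differential_file", "tsv"),
    "sdrf": ViewSpec("sdrf_file", "tsv"),
    "project": ViewSpec(None, "json"),
}


def register_view(name: str, file_class: Optional[str], serialization: str) -> None:
    """Add a new view to the registry, refusing to overwrite an existing one."""
    if name in VIEWS:
        raise ValueError(f"view {name!r} is already registered")
    VIEWS[name] = ViewSpec(file_class, serialization)


def views_by_format(serialization: str) -> list[str]:
    """Return the names of all views stored in the given serialization format."""
    return sorted(
        name for name, spec in VIEWS.items() if spec.serialization == serialization
    )
```

For example, `views_by_format("parquet")` returns `["feature", "psm"]` for the registry as defined above.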
A quantms.io file is a collection of views aggregated into a `.qms` folder. Inside that folder, a file called `project.json` MUST be present. Please read about the project view for more information.
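
The requirement that `project.json` be present can be checked with a few lines of Python. This is a minimal sketch, not a reference validator: the `load_project` function is a hypothetical helper, and it assumes that the folder name ends in `.qms` and that `project.json` sits directly inside it. The contents of `project.json` are not inspected here.

```python
import json
from pathlib import Path


def load_project(folder):
    """Return the parsed project.json of a .qms folder, or raise if it is missing."""
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"{folder} is not a folder")
    # Assumption: the folder is recognized by a name ending in ".qms".
    if not folder.name.endswith(".qms"):
        raise ValueError(f"{folder} does not end in .qms")
    project_file = folder / "project.json"
    if not project_file.is_file():
        raise FileNotFoundError(f"{project_file} MUST be present")
    with project_file.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `load_project("my_experiment.qms")` (a hypothetical path) either returns the parsed JSON content or raises an exception that names the missing piece.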
The introduction to the format, its concepts and more detailed topics about serialization can be read here.
External contributors, researchers and the proteomics community are more than welcome to contribute to this project.
Contribute to the specification: you can contribute ideas or refinements by opening an issue in the issue tracker or by submitting a pull request (PR).
The project is run by different groups:
- Yasset Perez-Riverol (PRIDE Team, European Bioinformatics Institute - EMBL-EBI, U.K.)
IMPORTANT: If you contribute to this specification, please make sure to add your name to the list of contributors.
As part of our efforts toward delivering open and inclusive science, we follow the Contributor Covenant Code of Conduct for Open Source Projects.
This information is free; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This information is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this work; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.