
PXD030990_human-tear_re-analysis

How well do single shot runs work for proteome profiling?


Phil Wilmarth, PSR Core, OHSU

November 22, 2023

Re-analysis of human tear data from PXD030990. The publication is here:

Jones, G., Lee, T.J., Glass, J., Rountree, G., Ulrich, L., Estes, A., Sezer, M., Zhi, W., Sharma, S. and Sharma, A., 2022. Comparison of different mass spectrometry workflows for the proteomic analysis of tear fluid. International journal of molecular sciences, 23(4), p.2307.

Publication summary

Human tear fluid was collected from 11 healthy subjects using Schirmer's strips. The wetted strips were cut in half and two protein digestion strategies were compared. In one method, proteins were extracted from the strips and then digested with trypsin. In the other method, trypsin was added directly to the strips for in-strip digestion, followed by peptide extraction. Each sample was run single shot on a Thermo Tribrid Fusion instrument in maximum peptide ID mode using a 120-minute LC separation. The 22 samples (11 subjects times two digestion methods) were each run twice: once with CID dissociation and once with HCD dissociation. The dataset was 44 RAW files of around 1.2 GB per file.

The data were processed with Proteome Discoverer v2.2 (PD). SEQUEST was used with a human Swiss-Prot FASTA file (about 20.3K sequences), precursor mass tolerance of 10 ppm, and fragment ion tolerance of 0.6 Da. Variable oxidized Met and static alkylated Cys modifications were specified. Percolator (a support vector machine learning classifier) was used for PSM validation. Q-value cutoffs were not listed. Protein inference was described as parsimonious. Use of decoy sequences for FDR control was not mentioned.

PD has a default minimum peptide length of 6 amino acids. Assuming some form of target/decoy analysis was used, PD can set PSM (peptide-spectrum match) cutoffs of medium (q less than 0.05) or high (q less than 0.01); it is not clear which was used. Protein reporting in PD uses protein ranking functions and allows single peptide per protein IDs by default. I think decoy proteins may be used in some form of protein FDR control. Protein inference details were missing.

The paper is nicely written and the interpretation of results seems sensible. The claim of over 3,000 identified proteins is questionable, however, and is more likely a consequence of poor protein ID error control in PD. This reflects more on the PD developers than on the authors, although the burden of testing analysis tools lies with the users of those tools.

Re-analysis details

The RAW files were processed with the PAW pipeline as follows:

  • RAW files converted to MS2 format files with MSConvert
  • Comet searches for peptide sequence assignments
    • human canonical UniProt reference FASTA file (about 20.5K sequences)
    • semi-tryptic cleavage option with up to 3 missed cleavages (bio fluids need semi-tryptic)
    • parent ion mass tolerance of 1.25 Da (wide tolerance searches work better)
    • fragment ion tolerance of 1.0005 Da (ion trap analyzer used for MS2)
    • variable oxidation of methionine
    • static alkylation of cysteine
  • target/decoy method using accurate mass conditioned score histograms
    • linear discriminant function of Comet scores
    • minimum peptide length of 7 amino acids
    • charge states 2+, 3+, and 4+
    • delta mass windows set at 0-Da, 1-Da, and any other delta mass
    • FDR analysis done separately on 2+, 3+, and 4+ peptides
    • FDR analysis done separately for semi-tryptic and fully tryptic peptides
    • FDR analysis done separately for unmodified and modified peptides
  • basic and extended parsimony protein inference
    • basic parsimony logic groups identical peptide sets
    • basic parsimony logic removes peptide sets that are subsets
    • extended parsimony logic groups proteins with insufficient ID evidence
    • protein ID uses the two-peptide rule
      • two distinct, full or semi-tryptic, unmodified peptides per protein per sample

This is a basic, best-practices pipeline. FASTA files are chosen carefully for completeness with minimal peptide redundancy, search settings are kept simple, wide tolerance searches are used so that accurate mass has classifier power (i.e., distinguishes correct from incorrect matches), proteins are grouped so that shared peptides do not adversely affect quantitation, and the two-peptide rule controls random protein false discoveries. Some blog posts describe my pipeline in more detail: what makes it different and some concepts for bottom-up quantitative proteomics.
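To make the accurate-mass-conditioned target/decoy step more concrete, here is a minimal Python sketch of the idea. This is not the PAW code; the function names, the delta-mass window widths, and the subclass breakdown are illustrative assumptions.

```python
import numpy as np

def delta_mass_window(delta_da):
    """Bin a precursor delta mass (Da) into a 0-Da, 1-Da, or 'other' window.

    The 0.05 Da half-width is an illustrative assumption."""
    if abs(delta_da) < 0.05:
        return "0-Da"
    if abs(delta_da - 1.00335) < 0.05:   # one 13C isotope peak
        return "1-Da"
    return "other"

def score_threshold_at_fdr(scores, is_decoy, fdr=0.01):
    """Find a score cutoff where decoy/target counts imply the given FDR.

    scores and is_decoy are parallel numpy arrays for one PSM subclass."""
    order = np.argsort(scores)[::-1]                  # best scores first
    decoys = np.cumsum(is_decoy[order])
    targets = np.cumsum(~is_decoy[order])
    passing = decoys / np.maximum(targets, 1) <= fdr
    if not passing.any():
        return None                                   # nothing passes in this subclass
    return scores[order][np.where(passing)[0].max()]

# PSMs are split into subclasses (charge state, delta-mass window, tryptic
# status, modification state) and a separate threshold is found for each.
```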

Two Excel files are present in the repository. One has the dataset summary statistics. The other is the protein summary table from the PAW pipeline.

Dataset statistics

Here is the breakdown for the 44 RAW files: how many MS2 scans were acquired, how many scans had matches that passed the 1% target/decoy FDR cutoff, and the scan-level PSM ID rates.

Dataset statistics (a comma is used as the thousands separator and a period as the decimal point):

| Dissociation | Digestion Method | LC Run Name | MS2 Scans | Filtered Scans | ID Rate |
| --- | --- | --- | ---: | ---: | ---: |
| CID | After extraction | T017_Post-extraction_digestion_CID | 65,566 | 9,766 | 14.9% |
| CID | After extraction | T018_Post-extraction_digestion_CID | 65,032 | 11,285 | 17.4% |
| CID | After extraction | T019_Post-extraction_digestion_CID | 70,269 | 14,529 | 20.7% |
| CID | After extraction | T020_Post-extraction_digestion_CID | 71,621 | 12,119 | 16.9% |
| CID | After extraction | T021_Post-extraction_digestion_CID | 71,991 | 12,955 | 18.0% |
| CID | After extraction | T022_Post-extraction_digestion_CID | 68,927 | 10,558 | 15.3% |
| CID | After extraction | T023_Post-extraction_digestion_CID | 67,883 | 9,876 | 14.5% |
| CID | After extraction | T024_Post-extraction_digestion_CID | 70,993 | 13,396 | 18.9% |
| CID | After extraction | T025_Post-extraction_digestion_CID | 66,343 | 12,859 | 19.4% |
| CID | After extraction | T026_Post-extraction_digestion_CID | 64,524 | 12,790 | 19.8% |
| CID | After extraction | T027_Post-extraction_digestion_CID | 66,562 | 9,078 | 13.6% |
| CID | In strip | T017_in-strip_digestion_CID | 70,096 | 12,766 | 18.2% |
| CID | In strip | T018_in-strip_digestion_CID | 68,521 | 16,833 | 24.6% |
| CID | In strip | T019_in-strip_digestion_CID | 70,724 | 14,543 | 20.6% |
| CID | In strip | T020_in-strip_digestion_CID | 65,246 | 13,307 | 20.4% |
| CID | In strip | T021_in-strip_digestion_CID | 72,039 | 13,571 | 18.8% |
| CID | In strip | T022_in-strip_digestion_CID | 73,199 | 11,595 | 15.8% |
| CID | In strip | T023_in-strip_digestion_CID | 70,586 | 12,621 | 17.9% |
| CID | In strip | T024_in-strip_digestion_CID | 69,449 | 14,509 | 20.9% |
| CID | In strip | T025_in-strip_digestion_CID | 67,950 | 13,044 | 19.2% |
| CID | In strip | T026_in-strip_digestion_CID | 69,062 | 12,737 | 18.4% |
| CID | In strip | T027_in-strip_digestion_CID | 68,453 | 13,928 | 20.3% |
| HCD | After extraction | T017_Post-extraction_digestion_HCD | 62,031 | 9,359 | 15.1% |
| HCD | After extraction | T018_Post-extraction_digestion_HCD | 62,159 | 10,749 | 17.3% |
| HCD | After extraction | T019_Post-extraction_digestion_HCD | 72,196 | 12,579 | 17.4% |
| HCD | After extraction | T020_Post-extraction_digestion_HCD | 68,588 | 11,003 | 16.0% |
| HCD | After extraction | T021_Post-extraction_digestion_HCD | 68,699 | 10,978 | 16.0% |
| HCD | After extraction | T022_Post-extraction_digestion_HCD | 64,249 | 9,363 | 14.6% |
| HCD | After extraction | T023_Post-extraction_digestion_HCD | 62,922 | 9,881 | 15.7% |
| HCD | After extraction | T024_Post-extraction_digestion_HCD | 68,707 | 12,013 | 17.5% |
| HCD | After extraction | T025_Post-extraction_digestion_HCD | 65,603 | 10,147 | 15.5% |
| HCD | After extraction | T026_Post-extraction_digestion_HCD | 69,264 | 11,923 | 17.2% |
| HCD | After extraction | T027_Post-extraction_digestion_HCD | 62,595 | 8,844 | 14.1% |
| HCD | In strip | T017_in-strip_digestion_HCD | 71,427 | 12,482 | 17.5% |
| HCD | In strip | T018_in-strip_digestion_HCD | 69,743 | 15,447 | 22.1% |
| HCD | In strip | T019_in-strip_digestion_HCD | 67,189 | 13,275 | 19.8% |
| HCD | In strip | T020_in-strip_digestion_HCD | 65,861 | 12,615 | 19.2% |
| HCD | In strip | T021_in-strip_digestion_HCD | 68,263 | 12,504 | 18.3% |
| HCD | In strip | T022_in-strip_digestion_HCD | 67,482 | 9,603 | 14.2% |
| HCD | In strip | T023_in-strip_digestion_HCD | 67,049 | 11,102 | 16.6% |
| HCD | In strip | T024_in-strip_digestion_HCD | 68,384 | 13,118 | 19.2% |
| HCD | In strip | T025_in-strip_digestion_HCD | 66,769 | 10,815 | 16.2% |
| HCD | In strip | T026_in-strip_digestion_HCD | 72,934 | 11,593 | 15.9% |
| HCD | In strip | T027_in-strip_digestion_HCD | 68,021 | 13,766 | 20.2% |
| Totals | | | 2,995,171 | 531,824 | 17.8% |

We can summarize things by the four methods (digestion method and dissociation method) to get a better overview of the data.

Summary by digestion/dissociation methods (averages)

| Dissociation | Digestion Method | MS2 Scans | Filtered Scans | ID Rate |
| --- | --- | ---: | ---: | ---: |
| CID | After extraction | 68,156 | 11,746 | 17.2% |
| HCD | After extraction | 66,092 | 10,622 | 16.1% |
| CID | In-strip | 69,575 | 13,587 | 19.5% |
| HCD | In-strip | 68,466 | 12,393 | 18.1% |
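The per-method averages are just groupby means over the per-run counts. A sketch, assuming the per-run numbers were exported to a hypothetical run_stats.csv file (one row per RAW file):

```python
import pandas as pd

# hypothetical CSV with columns: dissociation, digestion, ms2_scans, filtered_scans
runs = pd.read_csv("run_stats.csv")
runs["id_rate"] = runs["filtered_scans"] / runs["ms2_scans"]

# average scan counts and ID rates for each of the four methods
summary = (runs.groupby(["dissociation", "digestion"])
               [["ms2_scans", "filtered_scans", "id_rate"]]
               .mean())
print(summary)
```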

There were nearly 3 million MS2 scans acquired in the 44 runs, and over half a million scans passed the 1% FDR cutoff. More PSMs were identified with the in-strip digestion method. Fewer PSMs were identified from HCD than from CID; despite the faster HCD dissociation, the HCD runs also averaged slightly fewer MS2 scans.

Note that the number of PSMs at 1% FDR implies (as many as) 5,318 incorrect peptide sequence assignments among the accepted scans (1% of 531,824). Given the filtered scan numbers above, that is 88 to 168 incorrect PSMs per run (about 120 on average). This means we could have 100 or so incorrect protein IDs per sample (assuming no protein FDR control), or as many as 5,000 or so experiment-wide. NOTE: the overall FDR was less than 1% because there were a very small number of decoy matches for accurate mass (0-Da delta mass), fully tryptic, unmodified peptides (the peptide class with the most target matches). Instead of 5K decoy PSMs, the filtered data had closer to 3K decoy matches.
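The back-of-the-envelope numbers work out like this:

```python
accepted = 531_824                    # filtered PSMs experiment-wide
print(round(0.01 * accepted))         # 5318 incorrect PSMs expected at exactly 1% FDR

low, high = 8_844, 16_833             # smallest and largest per-run filtered counts
print(round(0.01 * low), round(0.01 * high))   # 88 168 incorrect PSMs per run
```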

Some other notes about the analysis: singly charged (1+) precursors were not excluded during acquisition (charge states 1 to 6 were present). The mass calibration was very good across all samples. The data were processed by dissociation method (22 samples per chunk) in case the Comet score distributions differed for CID versus HCD data (they actually looked similar). Protein inference was done using the confident (1% FDR) PSMs from all 44 samples to produce a master list of putative protein IDs experiment-wide. That list was then culled by requiring the two-peptide rule to be satisfied in at least one sample (one LC run per sample here).

Two distinct peptides across the whole experiment is not a sufficient criterion for a protein ID. The proteins are (or are not) present in individual samples, so the two-peptide rule needs to be applied to individual samples.

The PAW pipeline does not tally numbers of peptides or unique peptides

Some tools and many papers quote and compare numbers of unique peptides as some sort of figure of merit. There is seldom any description of how the counting is done, and never any mention of how FASTA file choice can change the numbers of peptides and proteins from the same data. I even see calculations of peptide false discovery rates. The only reliable metrics for comparing proteomics sample processing methods, instrument methods, or data analysis choices in DDA experiments are PSM counts. Even those need some clarification: the metric should be instrument scan counts in DDA experiments, not sequence counts. Single spectra can have more than one assigned peptide sequence, so the number of peptide sequences associated with a set of PSMs can be greater than the number of MS2 scans associated with those PSMs.

Peptide counts depend on many factors: how are different charge states counted? How are unmodified and modified versions of the same peptide sequence counted? What is the context (FASTA file or inferred protein list) for categorizing a peptide as shared or unique? The number of inferred proteins also depends strongly on the FASTA file choice and on the protein ID criteria (minimum number of peptides per protein, inclusion or exclusion of multiple forms of the same peptide base sequence, choice of protein ranking function, and protein FDR methods). The sketch below shows how a few of these counting choices change the numbers for the same set of PSMs.
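A minimal example with toy data (not from this analysis):

```python
# each PSM: (scan, base sequence, modification string, charge) - toy data
psms = [
    ("scan1", "ELVISLIVESK", "", 2),
    ("scan2", "ELVISLIVESK", "", 3),          # same sequence, different charge
    ("scan3", "AMDELVISK", "", 2),
    ("scan4", "AMDELVISK", "M2+15.995", 2),   # oxidized form of the same sequence
]

print(len(psms))                                # 4 PSMs (the scan count)
print(len({(s, m, z) for _, s, m, z in psms})) # 4 peptides if charge and mods distinguish
print(len({(s, m) for _, s, m, _ in psms}))    # 3 peptides if charge states are collapsed
print(len({s for _, s, _, _ in psms}))         # 2 peptides if only base sequences count
```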

Counting peptides and/or proteins is always problematic. Counting scans associated with PSMs is easier and more likely to be unbiased. My pipeline does not count peptides. There are peptide reports, but those are the peptides conditioned on the reported protein list. Most pipelines have conditioned peptide summaries; however, those do not give accurate peptide counts (especially for decoy peptides).

How are PAW protein results summarized?

Some clarification of the PAW protein summary table is necessary. The two-peptide rule can cause some confusion because it can be applied at different levels in larger experiments. There is also the fear of losing protein IDs when using the two-peptide rule. That may have been an important concern in the Q-STAR days (a less sensitive, slow scanning, early Q-TOF instrument from ABI) when a "large proteome" was a couple hundred proteins. Very large datasets can be easily generated these days, and the handful of very low abundance proteins that would be legitimate single peptide per protein IDs are still more trouble than they are worth (the nickname "one-hit wonders" is still on the nose). They will almost always be below the limits of quantitation in any experiment.

A feature of the PAW pipeline is to be as generous as possible in discovery phase experiments without shooting yourself in the foot with appreciable numbers of false identifications. The protein summary table has basically one row for each protein, protein group, or protein family inferred from the confidently identified PSMs in the entire experiment. The "experiment" is whatever set of biological samples you want a unified protein report for (you do not explicitly specify an experimental design at any point in my pipeline). Samples can consist of multiple LC runs (fractions) or have hidden sample dimensions (SILAC channels, TMT channels, etc.). The columns in the protein summary table are organized by sample. There are three sets of PSM/peptide count columns: total PSM counts, unique PSM counts, and corrected total counts:

  • Total counts are PSM counts for the respective proteins where all peptides (shared and unique) are counted
  • Unique counts are PSM counts of just the unique peptides for the respective proteins
  • Corrected counts are PSM counts where shared peptide counts have been fractionally split based on the relative unique count evidence of the proteins sharing those peptides (see the sketch below)
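A minimal sketch of the fractional splitting idea (the function and the two-protein example are illustrative; the exact PAW weighting details may differ):

```python
def corrected_counts(unique_counts, shared_psms):
    """Split shared PSMs among proteins in proportion to their unique PSM counts.

    unique_counts: dict of protein -> unique PSM count for the proteins that
    share the peptides; shared_psms: total PSM count from those shared peptides."""
    total = sum(unique_counts.values())
    return {prot: u + shared_psms * (u / total)
            for prot, u in unique_counts.items()}

# proteins A and B share peptides worth 10 PSMs; A has 8 unique PSMs, B has 2
print(corrected_counts({"A": 8, "B": 2}, 10))   # {'A': 16.0, 'B': 4.0}
```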

The terms "shared" and "unique" need to be carefully defined in any bottom-up proteomics data analysis. Here, a unique peptide has a one-to-one mapping to a reported protein, protein group (indistinguishable peptide sets), or protein family (nearly indistinguishable peptide sets). Shared peptides map to more than one protein, protein group, or protein family. The context for the mapping is the final set of inferred proteins, not the original FASTA file (this distinction is extremely important for quantitative studies).

A protein row exists if that protein was identified by two distinct peptides in at least one sample of the experiment. The PSM counts for each sample (the cells in that row) can be greater than two, equal to two (this has to be true in at least one sample), or less than two; the counts can be zero or one in some of the columns. This strikes a sensible compromise between having sufficient evidence to report a protein and being able to track low abundance proteins across samples. Not all proteins that get reported in the protein summaries will be quantifiable, though. Many low abundance proteins can be identified but not quantified in large scale experiments.
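A sketch of that reporting logic (hypothetical function; input is the set of distinct peptides seen for the protein in each sample):

```python
def report_protein(distinct_peptides_by_sample):
    """Report a protein if at least one sample has two or more distinct peptides."""
    return any(len(peps) >= 2 for peps in distinct_peptides_by_sample.values())

# one peptide here and a different one there is not enough evidence...
print(report_protein({"T017": {"PEPTIDEA"}, "T018": {"PEPTIDEB"}}))       # False
# ...but two distinct peptides in any single sample is
print(report_protein({"T017": {"PEPTIDEA", "PEPTIDEB"}, "T018": set()}))  # True
```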

Re-analysis results

Five protein summaries were generated with different subsets of runs to get the total number of proteins from all of the data and the numbers of proteins identified by each of the four methods (two digestion methods times two dissociation types). The idea is to start with as much data as possible for protein inference and then work back down to the sample-level protein identifications.

This is my solution to the complicated problem of summarizing experiments with larger numbers of samples. The problem can be approached from the other direction, where proteins are inferred in each sample and some combination of those results is generated for the experiment-wide summary. That can be challenging because protein grouping and parsimony logic may produce different protein groups for related sets of proteins in each sample. Logic to say that two non-identical things are really the same between samples can be tricky to code.

Total protein identifications in the full experiment and by method. Net protein counts exclude decoy and contaminant proteins (keratins, trypsin, etc.). Decoy protein numbers are ballpark estimates of protein reporting error rates. There is no concept of protein FDR control in my pipeline (if results seem too noisy, the PSMs need to be filtered more stringently).

| Experiment subset | Number of samples | Net number of protein IDs | Number of decoy proteins |
| --- | ---: | ---: | ---: |
| Whole experiment | 44 | 478 | 9 |
| Extracted proteins + CID | 11 | 321 | 18 |
| Extracted proteins + HCD | 11 | 311 | 10 |
| In-strip digest + CID | 11 | 469 | 14 |
| In-strip digest + HCD | 11 | 441 | 14 |


The paper claimed over 3,000 identified proteins; this re-analysis has only 478. Note that the best method (in-strip + CID) had 469 proteins, almost the same as the 478 experiment-wide. The paper also reported average numbers of proteins identified per sample (and standard deviations) for each of the four methods. The re-analysis averages and the numbers from the paper are compared below.

Average number of protein IDs per sample by method. Values are averages plus/minus standard deviations.

| Method | Average protein IDs (re-analysis) | Average protein IDs (publication) |
| --- | ---: | ---: |
| Extracted proteins + CID | 163 +/- 41 | 489 +/- 90 |
| Extracted proteins + HCD | 157 +/- 34 | 496 +/- 85 |
| In-strip digest + CID | 235 +/- 65 | 666 +/- 161 |
| In-strip digest + HCD | 231 +/- 62 | 678 +/- 180 |

I know from analyses of many datasets that results from my pipeline and from PD will be very similar if two peptides per protein are required in PD. We can assume the extra identifications reported in the publication are from single peptide protein IDs. That is about 400 extra protein IDs per sample.

Are extra protein IDs reliable?

How reliable are these extra protein IDs? Random noise matches are unlikely to be the same from sample to sample. The paper reported four observation frequency categories for each method: high (seen in 9, 10, or 11 of the samples), medium (seen in 6, 7, or 8), low (seen in 3, 4, or 5), and rare (seen in just 1 or 2). We can compute similar quantities for the re-analysis results and compare them to the publication.
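The category binning is easy to reproduce (a minimal sketch; the per-protein observation counts are hypothetical):

```python
from collections import Counter

def frequency_category(n_seen):
    """Bin a protein by how many of the 11 samples it was seen in (paper's bins)."""
    if n_seen >= 9:
        return "high"      # 9, 10, or 11 samples
    if n_seen >= 6:
        return "medium"    # 6, 7, or 8 samples
    if n_seen >= 3:
        return "low"       # 3, 4, or 5 samples
    return "rare"          # 1 or 2 samples

seen_counts = {"PROT1": 11, "PROT2": 7, "PROT3": 1}   # hypothetical counts
print(Counter(frequency_category(n) for n in seen_counts.values()))
```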

Observation frequencies from this re-analysis:

| Category | extracted CID | extracted HCD | in-strip CID | in-strip HCD |
| --- | ---: | ---: | ---: | ---: |
| high | 103 | 98 | 149 | 144 |
| medium | 41 | 37 | 65 | 61 |
| low | 79 | 79 | 107 | 106 |
| rare | 105 | 99 | 122 | 122 |


Observation frequencies from publication:

| Category | extracted CID | extracted HCD | in-strip CID | in-strip HCD |
| --- | ---: | ---: | ---: | ---: |
| high | 122 | 112 | 178 | 182 |
| medium | 92 | 90 | 153 | 147 |
| low | 248 | 258 | 366 | 373 |
| rare | 2,237 | 2,310 | 2,596 | 2,668 |


There is a real explosion of protein IDs seen in just one or two of the 11 samples in the published results. Such a dramatic increase is a red flag. Identification numbers are much more stable across categories in the re-analysis.

If we exclude the "rare" category IDs and sum up the rest, we can get a ballpark estimate of the "more reliable" protein identifications.

Sum of high, medium, and low categories by method:

| Source | extracted CID | extracted HCD | in-strip CID | in-strip HCD |
| --- | ---: | ---: | ---: | ---: |
| Re-analysis | 223 | 214 | 321 | 311 |
| Publication | 462 | 460 | 697 | 702 |


The publication numbers are larger because there are some true single peptide per protein IDs. The two-peptide rule used in the re-analysis is stricter.

Estimate of how many single peptide per protein IDs are true

I can also allow single peptide per protein IDs with my pipeline (the results will be very noisy). With the target/decoy concatenated FASTA file, the inferred decoy protein count can be subtracted from the inferred target protein count to estimate the number of correct protein IDs. I did this for a results summary of the 44-sample full experiment. Common contaminant proteins were excluded, leaving 2,266 target protein matches. There were 1,558 decoy protein matches. Assuming the decoy protein count estimates the incorrect target protein count, we have a net estimated correct target protein count of 708.
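The estimate is simple decoy subtraction:

```python
target_proteins = 2_266   # single-peptide IDs allowed, contaminants removed
decoy_proteins = 1_558    # matches to decoy sequences
print(target_proteins - decoy_proteins)   # 708 estimated correct protein IDs
```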

This 708 number is very close to the roughly 700 proteins from the sums of the high, medium, and low observation frequency categories in the publication. Also, 708 is a little over 200 more protein IDs than the 478 proteins obtained with the two-peptide rule, suggesting that about 230 single peptide per protein IDs are correct. This seems like a reasonable number of barely detectable proteins in human tears. I speculate that the single peptide per protein IDs from PD in the publication were mostly incorrect with the chosen parameters. The protein inference details and protein FDR control method were not described in the paper, so I cannot speculate further.

Single shot experimental designs

Getting about 700 proteins from over 530K PSMs is nothing to write home about and (strongly) suggests that single shot experimental designs have severe limitations. We have done some TMT-labeled human tear samples collected with Schirmer's strips in our core. We extracted proteins, did S-Trap digests, labeled with TMTpro, and ran 20 high-pH reverse phase fractions on a Fusion Tribrid with a 2-hour low-pH RP gradient for each fraction. An SPS-MS3 TMT method was used. More about the experiments can be found in this publication. Each 20-fraction experiment had about 370K MS2/MS3 scans acquired. About 66K scans were identified at 1% FDR in each experiment. Those PSMs mapped to around 3K proteins with a two-peptide per protein requirement for each experiment.

Fractionation gives almost 5 times as many protein IDs from 8-fold fewer PSMs. You could do a fractionated 18-plex TMT experiment, where over 3K proteins could be accurately quantified from tears, in the same clock time as 18 single shot runs, where a few hundred proteins could be quantified. The two experimental designs use the same number of hours of instrument time, but they are not remotely equivalent in terms of quantitative proteome profiling.

If you are not a fan of DDA experiments, DIA won't change the situation. Wide window DDA won't do it either. Allowing chimeric precursors isn't going to help either. Nothing gets past the fact that you have limited dynamic range for peptide binding on the LC column and limited inter-scan dynamic range on the mass spec instrument. These are fundamental measurement science limitations. Think for a minute or two about acquiring 3 million MS2 scans, filtering to over 530K confident PSMs, and getting only 500-700 proteins from samples known to have well over 3,000 proteins (human tears).

There were real reasons why MudPIT methods were developed 25 years ago. Loading more digest and fractionating is not just about instrument scan rates and sampling. It is also about how you can get more of the peptides from a complex digest into the instrument. More total digest is used in fractionated experiments. Single shot experiments are limited by the laws of physics. Very short gradient single shot experiments seem like an incredibly bad idea to me, with my scant 20 years of experience in proteomics. I do not understand why smart folks think deep proteome profiling is possible with little separation of very complex peptide digests.


Phil Wilmarth
November 22, 2023
