Help needed: How to interpret the output file from "X_aggregate_comparison_bound" with multiple conditions and others...

Question

c2b2pss opened this issue 4 months ago · comments

c2b2pss commented 4 months ago

Hi,

I am comparing the progression of 3 samples in the order WT >> PR >> CR, like a pseudo time series.

The attached file "...._aggregate_comparison_bound.pdf" compares footprints across WT, PR, and CR, but it is a 4 x 4 matrix of figures. How and why is it plotted this way? How do I interpret each comparison?
CREB5_HUMAN.H11MO.0.D_CREB5_HUMAN.H11MO.0.D_aggregate_comparison_bound.pdf
In the aggregated "bindetect_results" file (see example below), how are the comparisons done, like A vs B? According to the wiki: "The differential binding score for the TF between the two conditions. Negative values imply more bound in condition2". Can I conclude that more of the TF in row 1 is bound in PR than in WT?
Is there a fold-change kind of metric that can be interpreted? How would I say "PR binds x times more than WT" or something like that? Or is that really not possible?

Thanks!

| WT_mean_score | WT_bound | PR_mean_score | PR_bound | CR_mean_score | CR_bound | WT_PR_change | WT_PR_pvalue | PR_CR_change | PR_CR_pvalue |
|:-------------:|:--------:|:-------------:|:--------:|:-------------:|:--------:|:------------:|:------------:|:------------:|:------------:|
| 0.12317       | 9176     | 0.1576        | 12962    | 0.12729       | 10392    | -0.39182     | 4.91846E-196 | 0.26183      | 7.30073E-180 |
| 0.12048       | 9206     | 0.1542        | 12956    | 0.12466       | 10353    | -0.38908     | 1.55481E-193 | 0.26221      | 1.57677E-178 |
| 0.12065       | 9146     | 0.15503       | 12801    | 0.12446       | 10190    | -0.38507     | 8.11777E-196 | 0.2623       | 1.69253E-175 |
| 0.11889       | 9269     | 0.15139       | 12902    | 0.12237       | 10345    | -0.371       | 8.53009E-190 | 0.25516      | 3.64172E-175 |
| 0.12249       | 9165     | 0.15483       | 12726    | 0.12736       | 10384    | -0.36164     | 8.01726E-193 | 0.23559      | 6.60438E-176 |
| 0.12688       | 9777     | 0.15732       | 13264    | 0.13116       | 10980    | -0.33052     | 4.48453E-192 | 0.21647      | 1.56954E-173 |
| 0.11789       | 6394     | 0.14368       | 8370     | 0.12053       | 7177     | -0.32588     | 2.63666E-175 | 0.22962      | 3.20349E-167 |
| 0.12538       | 10090    | 0.15298       | 13428    | 0.12722       | 11098    | -0.30566     | 6.51881E-190 | 0.21202      | 2.39010E-172 |

Moritz Hobein · Answer 1 · Thu May 23 2024 19:28:15 GMT+0800 (China Standard Time)

Hey,

The plots with white background form a grid with all possible combinations of average signals for the bound TFBS in each condition. The plots at the edges combine their row/column, showing all 3 graphs in the same figure. The bottom right plot shows the diagonal, i.e. each signal with just the TFBS predicted as bound in the corresponding condition. For example, in the first row, the plot on the left shows the average signal of your WT condition at all locations where CREB5 was bound in WT. The next plot shows the WT signal again, but around all locations where CREB5 was bound in PR, and the one after that shows the same thing for all bound TFBS in condition CR. The last plot in the row makes it easier to compare them visually. While there is a visible footprint on average for all of these TFBS sets, the footprint seems slightly stronger for the WT set (which makes sense, as we used the signal from that condition). All things considered, however, there is not a huge difference (unlike one example in the wiki, where one condition shows a footprint and the other does not). So for this TF, the changes between conditions seem to be minimal. If, for example, 50% of all possible TFBS for this TF were bound in one condition and the other 50% were bound in the other, we would see that here: the average signals would only show strong footprints with their own condition's set of bound TFBS. In your example, the bound TFBS seem to overlap a lot between conditions, or at least the signals seem quite similar in each subset.
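The grid layout described above can be sketched roughly as follows. This is only an illustration of how the 4 x 4 panels relate to each other, not TOBIAS code: the function name and panel labels are made up, and the assignment of rows to signals and columns to bound sets is inferred from the first-row example.

```python
# Illustrative sketch only: enumerates the panels of the 4 x 4 figure.
# Rows = which condition's signal is plotted (assumption from the first-row example),
# columns = which condition's set of bound TFBS is used.
conditions = ["WT", "PR", "CR"]


def panel_grid(conds):
    """Return a (n+1) x (n+1) grid of panel descriptions (hypothetical labels)."""
    grid = []
    for signal in conds:
        # One panel per bound set, all using the same condition's signal.
        row = [f"{signal} signal @ {bound}-bound sites" for bound in conds]
        # Edge panel: overlay of the whole row for easier visual comparison.
        row.append(f"{signal} signal @ each bound set (overlay)")
        grid.append(row)
    # Bottom edge: overlay of each column.
    last = [f"each signal @ {bound}-bound sites (overlay)" for bound in conds]
    # Bottom right: the diagonal, each signal at its own bound sites.
    last.append("each signal @ its own bound sites (diagonal overlay)")
    grid.append(last)
    return grid


grid = panel_grid(conditions)
for row in grid:
    print(" | ".join(row))
```

With three conditions this gives three rows and columns of single-signal panels plus one comparison row and one comparison column, which is why the figure ends up as 4 x 4.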
Yes, that is the correct conclusion.
For each individual TF, there is also an *_overview.xlsx file. There, you can find a column at the end with the log2 fold changes between the scores of each comparison per TFBS. To get this metric per TF instead of per TFBS, I guess you could aggregate this column. The differential binding score in column <comparison>_change (e.g. WT_PR_change in your table) is the effect size compared to the background, but it is based on these log2FCs.
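A minimal sketch of such an aggregation, assuming you have already read the per-TFBS log2FC column of one comparison into a list. The values below are made up for illustration and the variable names are hypothetical. Averaging log2 fold changes and converting back with 2**x gives a geometric-mean fold change, which is one way (not the only one) to phrase "x times more".

```python
from statistics import mean, median

# Made-up per-TFBS log2 fold changes for one comparison (hypothetical values,
# not taken from any real *_overview.xlsx file).
log2fc_wt_pr = [-0.8, -0.1, 0.3, -0.5, -0.4]


def summarize_log2fc(values):
    """Aggregate per-TFBS log2FCs into per-TF summary numbers."""
    mean_l2fc = mean(values)
    return {
        "mean_log2fc": mean_l2fc,
        "median_log2fc": median(values),
        # Back-transform: the mean of log2 ratios is the log2 of the geometric-mean ratio.
        "fold_change": 2 ** mean_l2fc,
    }


summary = summarize_log2fc(log2fc_wt_pr)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

Assuming the log2FC follows the same sign convention as the change score quoted from the wiki (negative means higher in condition2), a fold change below 1 here would read as "scores are lower in WT than in PR", and its reciprocal as the PR over WT ratio. Please check the sign convention against your own files before reporting numbers.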

c2b2pss · Answer 2 · Fri May 24 2024 00:25:15 GMT+0800 (China Standard Time)

Thank you again for your kind and detailed explanation.
So this brings up the question of what is considered significant in TOBIAS.

Going back to the figure CREB5_HUMAN.H11MO.0.D_CREB5_HUMAN.H11MO.0.D_aggregate_comparison_bound.pdf linked above: the last row is labeled "comparison" and has a grey background. There is a difference in the CR footprint, it is a little higher. What does this mean? And how is this reflected in the results?
I still have doubts about which metric I should use from all of the results files. All the p-values are fantastic, but which metric reflects the difference in sites bound? I understand that binding of TFs to different target genes has different effects, so a gene-by-gene understanding is the final real test. But the question for now is: with a combination of TF changes going in one direction, can I predict what the biological state is? For this, which of the metrics provides the best way to look at the cumulative effects of TFs? Don't get me wrong, I am not asking for an explanation of the metrics. Just which metric I should choose and why. Then I can go from there.
I should point out that the information TOBIAS provides is excellent, so much better than other similar algorithms, and with so much ease of running it. I would like to make the most of the information TOBIAS is giving me. So kudos to your team!

Best regards!

Moritz Hobein · Answer 3 · Fri May 24 2024 16:05:22 GMT+0800 (China Standard Time)

So the signal you see is the bias-corrected cutsites. Right at the motif (depicted by the grey dashed lines), we see a depletion, meaning fewer cutsites compared to the motif flanks (= something occupied the DNA here). TOBIAS calculates a score based on how much depletion there is compared to the flanks, and this is the score you see in the output file columns with the name _score. The plot is an average visual representation of the footprints that were used to calculate the scores, and it correlates with the TOBIAS scores. So slightly more pronounced footprints should, on average, also be supported by slightly higher footprinting scores in your results files. Because it is an average, this might mean that the cutsite depletion was stronger in CR, or that a higher fraction of bound TFBS had a very clear footprint. In this example with CREB5, I am not sure whether this difference means that much, as it is not that large and you can see footprints in all conditions. The _change and _logfc columns might tell you more about the scale of those differences between conditions.
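To make the "depletion compared to the flanks" idea concrete, here is a deliberately simplified toy sketch. It is not the formula TOBIAS uses; the function name, the window positions, and the signal values are all made up for illustration.

```python
from statistics import mean


def depletion_score(signal, motif_start, motif_end):
    """Toy footprint score: mean flank signal minus mean signal inside the motif.

    NOT the TOBIAS scoring formula, only an illustration of the idea that
    fewer cutsites at the motif than in the flanks gives a higher score.
    """
    center = signal[motif_start:motif_end]
    flanks = signal[:motif_start] + signal[motif_end:]
    return mean(flanks) - mean(center)


# Made-up cutsite signals around a motif occupying positions 4..6.
clear_footprint = [5, 5, 4, 5, 1, 1, 2, 5, 4, 5]
no_footprint = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

score_clear = depletion_score(clear_footprint, 4, 7)
score_flat = depletion_score(no_footprint, 4, 7)
print(f"clear footprint: {score_clear:.2f}")
print(f"no footprint:    {score_flat:.2f}")
```

In this toy version, a deeper dip at the motif or higher flanks both increase the score, which matches the intuition that a more pronounced average footprint goes along with higher scores.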
That depends a little bit on your biological question. You are talking about the impact of these TF binding dynamics on genes. Perhaps this figure from the TOBIAS paper might be a good reference for that. Here, a single TF (DUX) was taken, and its TFBS were filtered to only contain those associated with known target genes. Then, the log2FC of the footprint scores between control and overexpression conditions was analyzed and compared with RNA-seq results. This revealed which genes were regulated by DUX activity. If you want to look at your experiment from the standpoint of general TF activity in your conditions, the raw scores are better suited. That was for example used here to check which TFs are active in different conditions, in that case developmental stages in a time series experiment. The scores were Z-score transformed over the entire time series, which shows during which stages each TF was most active. Alternatively, you can use the _change column, which summarizes the scores globally per TF and directly compares exactly two conditions, if that suffices.
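As a rough sketch of the Z-score idea, using the mean scores from the first row of the table posted above (WT, PR, CR): the function name is hypothetical, and using the population standard deviation is just one possible choice, not necessarily what was done in the paper.

```python
from statistics import mean, pstdev

# Mean scores of one TF per condition, copied from the first table row above.
scores = {"WT": 0.12317, "PR": 0.1576, "CR": 0.12729}


def zscores(by_condition):
    """Z-transform one TF's mean scores across conditions.

    Uses the population standard deviation (an assumption; the sample
    standard deviation would rescale the values but keep their order).
    """
    values = list(by_condition.values())
    mu = mean(values)
    sigma = pstdev(values)
    return {name: (value - mu) / sigma for name, value in by_condition.items()}


z = zscores(scores)
for name, value in z.items():
    print(f"{name}: {value:+.2f}")
```

Doing this per TF and collecting the results into a TF x condition matrix gives the kind of overview where you can see in which condition each TF scores highest relative to its own average. With only three conditions the Z-scores are coarse, so treat them as a ranking aid rather than a test.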
Thanks, we believe that is how it should be :)

github-actions · Answer 4 · Tue Jul 16 2024 15:24:47 GMT+0800 (China Standard Time)

No activity for at least 30 days. Marking issue as stale. Stale issues are closed after one week.