mcmero / SVclone

A computational method for inferring the cancer cell fraction of tumour structural variation from whole-genome sequencing data.

Several questions regarding CCF estimation results

willhooper opened this issue · comments

Hi,

I’ve been benchmarking SVclone on our in-house cell line purity ladders, where each tumor sample is an in-silico mix of tumor and normal (i.e., known purity/ploidy). I’ve been reviewing the results and have a few questions.

For SVs I’m using breakpoints that were called by two or more callers (svaba/manta/lumpy), and for SNVs I’m using calls made by two or more callers (mutect2/strelka2/lancet). For CNVs I used ASCAT, run with known purities/ploidies.

1. Oscillating CCF estimates

Looking at how individual variant CCF estimates change with purity, I noticed an oscillating pattern in a subset of variants. This seems to occur for variants with CCF estimates farther away from the cluster mean. I’m seeing this for both SVs and SNVs, visualized below. Each line traces a given variant call’s CCF estimate across purity levels. For SVs, I’m using the CCF estimate from breakend 1. All CCFs are capped at 1.

Screen Shot 2021-02-05 at 3 47 19 PM

Have you seen this before? Is this just an artifact of the clustering algorithm?

2. Collapse of clusters

For the same cell line, the number of clusters as estimated by SVclone collapses from 4 clusters to 1 halfway down the purity ladder. While I assume you’d expect a diminishing ability to resolve clusters as purity decreases, this sudden decrease suggests that something else is going wrong. The boxplot below shows CCFs for each purity level, split by cluster (not capped at 1 here).

Screen Shot 2021-02-05 at 3 48 49 PM

My first thought was that this was simply due to a decrease in variant calls at lower purity levels. While you do see some variant dropout, it doesn’t seem to be enough to be the cause of the issue. Below is a plot similar to those in (1), but including variants that were only called in a subset of purities.

Screen Shot 2021-02-05 at 3 49 12 PM

Have you encountered something like this before? Do you have any idea what could be happening here?

3. Inconsistent cluster assignments

Variants will change cluster assignment between purities, even when the same number of clusters is present in both purity levels. Below is a similar plot to the line plot in (2), except cluster mean CCF is used instead of individual breakend CCF. The numbers in the plot represent the number of variants present at a given cluster/purity combination. The collapse in clusters mentioned in (2) is visible here as well. Is this expected?

Screen Shot 2021-02-05 at 3 48 23 PM

Thanks for your continued help.
-Will

Re 1: I'm guessing something is going on with the actual read counts and copy number. Cluster assignments mainly affect instance CCFs via multiplicity calls. The ups and downs look like a difference of 1 in multiplicity.
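To illustrate why a one-step change in multiplicity can produce jumps like these, here is a minimal sketch using the commonly used relation between VAF, purity, copy number and multiplicity. This is an assumption for illustration, not necessarily the exact model SVclone fits; the function name `ccf_from_vaf` and the example numbers are hypothetical.

```python
def ccf_from_vaf(vaf, purity, tumour_cn, multiplicity, normal_cn=2):
    """Point estimate of CCF from a variant allele fraction.

    Assumes the common relation (not necessarily SVclone's exact model):
        VAF = purity * multiplicity * CCF
              / (purity * tumour_cn + (1 - purity) * normal_cn)
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    # Average number of copies of the locus per cell in the mixed sample.
    mean_cn = purity * tumour_cn + (1 - purity) * normal_cn
    return vaf * mean_cn / (purity * multiplicity)


if __name__ == "__main__":
    # Hypothetical variant: same read evidence, different multiplicity calls.
    vaf, purity, tumour_cn = 0.3, 0.6, 3
    for m in (1, 2, 3):
        print(m, round(ccf_from_vaf(vaf, purity, tumour_cn, m), 3))
```

With the read evidence held fixed, the estimate scales as 1/multiplicity, so a variant whose multiplicity call flips between 1 and 2 from one purity level to the next will see its instance CCF halve or double.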

Re 2: For the collapse of clusters, I think this could be related to variant read counts being too high in those low-purity samples. You can see instance CCFs as high as 5 in the first box and strong superclonal clusters in the first four boxes. We don't expect a gradual drop in the number of clusters as purity decreases. During fitting, the maximum allowed cluster centre is fixed at 1. If many variants have instance CCFs larger than 1, then at some point the fitting will have difficulty calling clusters below 1. This is an intended effect that helps flag problematic samples.
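One quick way to see how much of a sample sits above the cap is to compute the fraction of uncapped instance CCFs greater than 1 at each purity level. The sketch below is only a diagnostic idea, not part of SVclone; the function name, the `tolerance` parameter and the example values are hypothetical.

```python
def fraction_above_one(ccfs, tolerance=0.0):
    """Fraction of instance CCFs that exceed 1 (plus an optional tolerance)."""
    ccfs = list(ccfs)
    if not ccfs:
        raise ValueError("need at least one CCF value")
    n_above = sum(1 for c in ccfs if c > 1 + tolerance)
    return n_above / len(ccfs)


if __name__ == "__main__":
    # Hypothetical uncapped instance CCFs, keyed by purity level.
    by_purity = {
        0.2: [0.9, 1.8, 3.1, 5.0],
        0.8: [0.4, 0.7, 1.0, 1.1],
    }
    for purity, ccfs in sorted(by_purity.items()):
        print(purity, fraction_above_one(ccfs))
```

A sharp rise in this fraction at the purity levels where the cluster count collapses would be consistent with the explanation above.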

Re 3: Generally speaking, assignment changes can happen because of the uncertainty allowed in the probability model. Another measurement we use as a sanity check is instance CCF * multiplicity. In datasets where you know the truth, good agreement usually indicates the fitting is working as intended.
Finally, your trace plots show a trend toward more low instance CCF estimates as purity increases, which fits our expectation.
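The instance CCF * multiplicity check can be sketched as below. How the product is compared against the truth is not spelled out in this thread, so the mean absolute difference used here is an assumption, and the function names and example values are hypothetical.

```python
def ccf_times_multiplicity(ccf, multiplicity):
    """Product of instance CCF and multiplicity for a single variant."""
    return ccf * multiplicity


def mean_abs_difference(estimated, expected):
    """Mean absolute difference between two equal-length sequences."""
    estimated, expected = list(estimated), list(expected)
    if not estimated or len(estimated) != len(expected):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(e - t) for e, t in zip(estimated, expected))
    return total / len(estimated)


if __name__ == "__main__":
    # Hypothetical (instance CCF, multiplicity) calls for three variants.
    calls = [(1.0, 1), (0.5, 2), (0.9, 2)]
    # Hypothetical expected CCF * multiplicity from the known truth.
    truth = [1.0, 1.0, 2.0]
    products = [ccf_times_multiplicity(c, m) for c, m in calls]
    print(products, mean_abs_difference(products, truth))
```

Because the product does not change when a multiplicity call and its instance CCF shift in opposite directions, it gives a view of agreement with the truth that is less sensitive to the multiplicity flips discussed under Re 1.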

Hope this helps.