kensung-lab / SurVirus

How to judge the order between the host and the virus in the integrated sequence

huerqiang opened this issue · comments

Hi, thank you for your great software! I have some questions about how to read its output.
In the results.txt file, the first line is:
0 chr9:+:5468046:5468561 NC_001526.2:+:7155:7771
And in results.t1.txt, the first line is:
ID=0 chr9:+5468561 NC_001526.2:+7771
I don't know in what order the host and the virus are connected. Should it be:
----------------------||-----------------
NC_001526.2:+:7760:7771 || chr9:+:5468561:5468551 ?
Or:
----------------------||-----------------
chr9:+:5468551:5468561 || NC_001526.2:+:7771:7760 ?
I have seen issue #1, but it did not resolve my doubts.

When I checked another result, I found that the positive and negative strands in the results may be reversed.
The first line in results.txt is:
0 chr8:+:127885488:127886048 NC_001526.2:+:4055:4667 1836 1310 1399 165237 0.972534 0.990243 0 0 0.788219 0.859748
The first line in results.t1.txt is:
ID=0 chr8:+127886048 NC_001526.2:+4667 SUPPORTING_PAIRS=1310 SPLIT_READS=1399 HOST_PBS=0.972534 COVERAGE=0.823983 PAIRED WITH ID=3
The SurVirus result says the integrated sequence is on the positive strand in the virus and on the positive strand in the human.
But when I checked the BAM file and the FASTQ file, I found that the integrated sequence should be:
NC_001526.2:+:4570:4667 || chr8:-:127886048:127885994
image
The actual reads show that the integrated sequence is on the positive strand in the virus and on the negative strand in the human.

commented

One way of thinking about it is this: for every pair of reads supporting the integration, one will align to the human genome in the region chr8:127885488-127886048 on the positive strand, and the mate will align to NC_001526.2:4055-4667 also on the positive strand.

Therefore the junction will look like this, because we know that one of the reads must be sequenced from the reverse strand:

human 5' -- human 3' || virus 3' -- virus 5'
|==========================||=========================|
chr8:127885488 -- chr8:127886048 || NC_001526.2:4667 -- NC_001526.2:4570

You can also reverse complement the whole sequence, i.e. "virus 5' --> 3' | human 3' <-- 5'" and you would get the same results, in terms of read pairs.
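The equivalence of the two orientations can be checked on toy data. The sketch below uses two made-up sequences (`HUMAN` and `VIRUS` are hypothetical and are not real chr8 or NC_001526.2 sequence) and a small `revcomp` helper written for this example. It only shows that "human 5'->3' || virus 3'<-5'" and "virus 5'->3' || human 3'<-5'" are the same double-stranded molecule read from opposite strands.

```python
# Toy sequences for illustration only; not real chr8 or NC_001526.2 sequence.
HUMAN = "AAGCTTGG"
VIRUS = "CCATGTAC"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# Junction as drawn above: human read 5'->3', then the virus part
# appearing reverse complemented (virus 3' -- virus 5').
junction = HUMAN + revcomp(VIRUS)

# Reading the same molecule from the other strand gives
# virus 5'->3' followed by the reverse-complemented human part.
flipped = revcomp(junction)

print(junction)
print(flipped)
print(flipped == VIRUS + revcomp(HUMAN))
```

Because both forms describe the same junction, a read pair (or split read) that supports one form supports the other as well.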

Thank you very much. One more question: is there a way to tell whether a result in results.txt was obtained from read pairs or from split reads? Take the result above as an example.
The first line in results.txt is:
0 chr8:+:127885488:127886048 NC_001526.2:+:4055:4667 1836 1310 1399 165237 0.972534 0.990243 0 0 0.788219 0.859748
The first line in results.t1.txt is:
ID=0 chr8:+127886048 NC_001526.2:+4667 SUPPORTING_PAIRS=1310 SPLIT_READS=1399 HOST_PBS=0.972534 COVERAGE=0.823983 PAIRED WITH ID=3
If the result was obtained from a pair of reads, the real read should be:
human 5' -- human 3' (negative) || virus 3' -- virus 5' (positive)
|==========================||=========================|
chr8:127885488 -- chr8:127886048 || NC_001526.2:4667 -- NC_001526.2:4570
But if the result was obtained from a split read, the real read should be:
human 5' -- human 3' (positive) || virus 3' -- virus 5' (positive)
|==========================||=========================|
chr8:127885488 -- chr8:127886048 || NC_001526.2:4667 -- NC_001526.2:4570

commented

In your case the result is supported by both:
SUPPORTING_PAIRS=1310 means 1310 read pairs support the junction
SPLIT_READS=1399 means 1399 split reads support the junction
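If you want to read these counts programmatically, something like the following may help. It is a minimal sketch that assumes the results.t1.txt layout is exactly what the lines quoted in this thread show (an `ID=` token, a host breakpoint, a virus breakpoint, `KEY=VALUE` fields, and an optional `PAIRED WITH ID=n` suffix). `parse_t1_line` and `evidence_types` are names made up for this example, not part of SurVirus.

```python
import re


def parse_t1_line(line):
    """Parse one results.t1.txt line (layout inferred from this thread)."""
    tokens = line.split()
    call = {"id": int(tokens[0].split("=")[1])}
    # Breakpoints look like "chr8:+127886048": name, strand, position.
    for key, tok in (("host", tokens[1]), ("virus", tokens[2])):
        m = re.fullmatch(r"(.+):([+-])(\d+)", tok)
        call[key] = {"name": m.group(1), "strand": m.group(2), "pos": int(m.group(3))}
    # Remaining KEY=VALUE fields, skipping the ID in "PAIRED WITH ID=n".
    for tok in tokens[3:]:
        if "=" in tok and not tok.startswith("ID="):
            k, v = tok.split("=", 1)
            call[k] = float(v) if "." in v else int(v)
    m = re.search(r"PAIRED WITH ID=(\d+)", line)
    call["paired_with"] = int(m.group(1)) if m else None
    return call


def evidence_types(call):
    """List which kinds of reads support the junction."""
    kinds = []
    if call.get("SUPPORTING_PAIRS", 0) > 0:
        kinds.append("read pairs")
    if call.get("SPLIT_READS", 0) > 0:
        kinds.append("split reads")
    return kinds


line = ("ID=0 chr8:+127886048 NC_001526.2:+4667 SUPPORTING_PAIRS=1310 "
        "SPLIT_READS=1399 HOST_PBS=0.972534 COVERAGE=0.823983 PAIRED WITH ID=3")
call = parse_t1_line(line)
print(call["SUPPORTING_PAIRS"], call["SPLIT_READS"], evidence_types(call))
```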

Sorry to bother you again. When I get the result "chr8:+:127885488:127886048 NC_001526.2:+:4055:4667 1836 1310", I would like to know whether the positive strand of the virus is inserted into the positive strand of the host, or into the negative strand of the host. Are there any other ways for me to examine this besides manually selecting split reads from the FASTQ files?

commented

One way of thinking about it is this: for every pair of reads supporting the integration, one will align to the human genome in the region chr8:127885488-127886048 on the positive strand, and the mate will align to NC_001526.2:4055-4667 also on the positive strand.

Therefore the junction will look like this, because we know that one of the reads must be sequenced from the reverse strand:

human 5' -- human 3' || virus 3' -- virus 5'
|==========================||=========================|
chr8:127885488 -- chr8:127886048 || NC_001526.2:4667 -- NC_001526.2:4570

You can also reverse complement the whole sequence, i.e. "virus 5' --> 3' | human 3' <-- 5'" and you would get the same results, in terms of read pairs.

This is a junction, i.e. a human sequence and a virus sequence are "connected". We don't know whether the virus was inserted into the host genome unless we have a paired junction representing the second breakpoint of the insertion. You can see my previous reply for what it looks like.
If the virus is inserted into the host genome, then the positive strand of the virus must be inserted into the negative strand of the host (or vice versa).

Edit: you have PAIRED WITH ID=3 on your call. Can you show ID=3?

Here is the results.t1.txt file:

ID=0 chr8:+127886048 NC_001526.2:+4667 SUPPORTING_PAIRS=1310 SPLIT_READS=1399 HOST_PBS=0.972534 COVERAGE=0.823983 PAIRED WITH ID=3
ID=1 chr8:-127874880 NC_001526.2:-5296 SUPPORTING_PAIRS=563 SPLIT_READS=479 HOST_PBS=0.912176 COVERAGE=0.678821
ID=2 chr8:+127727167 NC_001526.2:+2354 SUPPORTING_PAIRS=415 SPLIT_READS=342 HOST_PBS=0.979958 COVERAGE=0.722300
ID=3 chr8:-127886058 NC_001526.2:-5296 SUPPORTING_PAIRS=292 SPLIT_READS=320 HOST_PBS=0.978892 COVERAGE=0.675316 PAIRED WITH ID=9
ID=5 chr8:+127884965 NC_001526.2:+3224 SUPPORTING_PAIRS=122 SPLIT_READS=142 HOST_PBS=0.977738 COVERAGE=0.669705
ID=6 chr8:-127882447 NC_001526.2:-2102 SUPPORTING_PAIRS=88 SPLIT_READS=70 HOST_PBS=0.961891 COVERAGE=0.561010
ID=7 chr8:-127740814 NC_001526.2:-5307 SUPPORTING_PAIRS=61 SPLIT_READS=61 HOST_PBS=0.980265 COVERAGE=0.541374
ID=8 chr8:+127885269 NC_001526.2:+3763 SUPPORTING_PAIRS=66 SPLIT_READS=52 HOST_PBS=0.981123 COVERAGE=0.635344 PAIRED WITH ID=16
ID=10 chr8:+127877991 NC_001526.2:+7035 SUPPORTING_PAIRS=32 SPLIT_READS=28 HOST_PBS=0.979961 COVERAGE=0.379383
ID=21 chr8:+127748887 NC_001526.2:+2096 SUPPORTING_PAIRS=4 SPLIT_READS=3 HOST_PBS=0.911489 COVERAGE=0.323983

commented

From ID=0 and ID=3, it looks like the virus was inserted reverse complemented into chr8:127886048-127886058.
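The host interval can be derived from the two calls as follows. This is only a sketch: it assumes the results.t1.txt layout shown above, `host_breakpoint` and `insertion_site` are hypothetical helper names, and it simply takes the span between the two host breakpoints without checking anything on the virus side.

```python
import re

# The two calls quoted above (ID=0 and the call it is paired with, ID=3).
LINE_ID0 = ("ID=0 chr8:+127886048 NC_001526.2:+4667 SUPPORTING_PAIRS=1310 "
            "SPLIT_READS=1399 HOST_PBS=0.972534 COVERAGE=0.823983 PAIRED WITH ID=3")
LINE_ID3 = ("ID=3 chr8:-127886058 NC_001526.2:-5296 SUPPORTING_PAIRS=292 "
            "SPLIT_READS=320 HOST_PBS=0.978892 COVERAGE=0.675316 PAIRED WITH ID=9")


def host_breakpoint(line):
    """Return (chromosome, strand, position) of the host breakpoint."""
    m = re.search(r"^ID=\d+\s+(\S+):([+-])(\d+)\s", line)
    return m.group(1), m.group(2), int(m.group(3))


def insertion_site(line_a, line_b):
    """Host interval spanned by two paired junctions on one chromosome."""
    chrom_a, strand_a, pos_a = host_breakpoint(line_a)
    chrom_b, strand_b, pos_b = host_breakpoint(line_b)
    if chrom_a != chrom_b:
        raise ValueError("paired junctions are on different chromosomes")
    return chrom_a, min(pos_a, pos_b), max(pos_a, pos_b), (strand_a, strand_b)


chrom, start, end, strands = insertion_site(LINE_ID0, LINE_ID3)
print(f"{chrom}:{start}-{end} host strands {strands}")
```

For ID=0 and ID=3 this gives chr8:127886048-127886058, with the two junctions on opposite host strands.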