Reproducing results for 5'CAGE data
nannabarnkob opened this issue
Hi there
First of all thank you for your work on publishing the code and pipeline.
I was wondering if you could share more details on how you have processed your data.
I have downloaded the data you generated for the COV413A cell line and processed it according to your pipeline. Some additional preprocessing steps were necessary, including splitting the interleaved FASTQ files into individual files and running STAR and StringTie.
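For anyone repeating the interleaved-to-paired FASTQ step mentioned above, here is a minimal sketch of one way to do it. It assumes strict 4-line FASTQ records with mates alternating (R1, R2, R1, R2, ...); the function name and the file names in the usage note are hypothetical and not part of the pipeline.

```python
import io
from itertools import islice


def deinterleave_fastq(interleaved, out_r1, out_r2):
    """Split an interleaved FASTQ stream into separate R1 and R2 streams.

    Assumes 4-line records and alternating mates (assumption, not checked
    against read names). Returns the number of read pairs written.
    """
    pairs = 0
    while True:
        record_r1 = list(islice(interleaved, 4))
        if not record_r1:
            break  # clean end of input
        record_r2 = list(islice(interleaved, 4))
        if len(record_r1) != 4 or len(record_r2) != 4:
            raise ValueError("truncated record or unpaired read at end of input")
        out_r1.writelines(record_r1)
        out_r2.writelines(record_r2)
        pairs += 1
    return pairs
```

With real files this could be called as `deinterleave_fastq(open("sample.fastq"), open("sample_R1.fastq", "w"), open("sample_R2.fastq", "w"))`, where the paths are placeholders.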
These are the candidate transcripts you recover (Supplementary Table 9 filtered on COV413A):
Transcript ID | Class | Family | Subfam | Chr TE | Start TE | End TE | Location TE | Gene | Splice Target | Strand | Cell Line | CAGE TPM |
---|---|---|---|---|---|---|---|---|---|---|---|---|
TCONS_00027238 | DNA | hAT-Charlie | MER1B | chr12 | 130340312 | 130340636 | intron_1 | PIWIL1 | exon_2 | + | COV413A | 0,396505544 |
TCONS_00034780 | LINE | L1 | L1PA2 | chr14 | 71842964 | 71848996 | Intergenic | RGS6 | exon_2 | + | COV413A | 3,105960093 |
TCONS_00055478 | LINE | L1 | L1PA2 | chr18 | 34552378 | 34558395 | Intergenic | DTNA | exon_2 | + | COV413A | 0,660842573 |
TCONS_00086600 | LINE | L1 | L1PA2 | chr3 | 58842154 | 58848179 | Intergenic | FAM3D | exon_2 | - | COV413A | 0,396505544 |
TCONS_00098838 | LINE | L1 | L1PA2 | chr5 | 102671229 | 102677260 | Intergenic | SLCO6A1 | exon_2 | - | COV413A | 0,72692683 |
TCONS_00103663 | LINE | L1 | L1PB1 | chr6 | 7347074 | 7349650 | intron_8 | CAGE1 | exon_9 | - | COV413A | 0,396505544 |
TCONS_00107032 | LINE | L1 | L1HS | chr7 | 12497211 | 12500000 | Intergenic | AC005281.1 | exon_2 | + | COV413A | 0,72692683 |
TCONS_00107035 | LINE | L1 | L1HS | chr7 | 12497211 | 12500000 | Intergenic | AC005281.1 | exon_5 | + | COV413A | 0,72692683 |
TCONS_00107037 | LINE | L1 | L1HS | chr7 | 12497211 | 12500000 | Intergenic | SCIN | exon_2 | + | COV413A | 0,72692683 |
TCONS_00116734 | LINE | L1 | L1PA2 | chr8 | 66949103 | 66955119 | intron_3 | TCF24 | exon_4 | - | COV413A | 0,660842573 |
TCONS_00119408 | LINE | L1 | L1PA2 | chr9 | 94089082 | 94095103 | intron_4 | PTPDC1 | exon_5 | + | COV413A | 0,330421286 |
TCONS_00070187 | LTR | ERV1 | LTR7 | chr2 | 38086114 | 38086512 | Intergenic | CYP1B1 | exon_2 | - | COV413A | 15,92630601 |
TCONS_00074167 | LTR | ERV1 | LTR2B | chr20 | 15985767 | 15986246 | intron_13 | MACROD2 | exon_14 | + | COV413A | 0,396505544 |
TCONS_00089490 | LTR | ERV1 | LTR2B | chr4 | 37546188 | 37546669 | intron_1 | C4orf19 | exon_2 | + | COV413A | 0,859095345 |
TCONS_00105271 | LTR | ERVL | LTR18A | chr6 | 79313214 | 79313548 | Intergenic | HMGN3 | exon_1 | - | COV413A | 2,841623064 |
TCONS_00016149 | SINE | Alu | AluY | chr10 | 101729855 | 101730163 | Intergenic | FBXW4 | exon_1 | - | COV413A | 0,991263859 |
TCONS_00016150 | SINE | Alu | AluY | chr10 | 101729855 | 101730163 | Intergenic | FBXW4 | exon_2 | - | COV413A | 0,991263859 |
TCONS_00030551 | SINE | Alu | AluJo | chr12 | 121847358 | 121847535 | intron_9 | HPD | exon_10 | - | COV413A | 0,330421286 |
TCONS_00041268 | SINE | Alu | AluY | chr15 | 51603584 | 51603891 | intron_1 | DMXL2 | exon_2 | - | COV413A | 1,652106432 |
I recover these - sorry for the truncated output.
I have used the hg38 reference genome and GTF, your reference data download, and your pre-defined arguments.txt.
I hope that together we can get to the bottom of why I don't recover any of the same TE chimeras as you.
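A rough sketch of how the overlap between two candidate tables could be checked. It compares rows by TE locus and gene rather than by transcript ID, on the assumption that `TCONS_` identifiers are assigned per assembly/merge run and so need not match between runs. The column positions follow the table header above; the function name is hypothetical.

```python
def locus_keys(table_text):
    """Return a set of (subfamily, chrom, te_start, te_end, gene) keys
    from a pipe-separated table laid out like the one in this thread."""
    keys = set()
    for line in table_text.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().rstrip("|").split("|")]
        # Skip short lines, the header row, and the ---|--- separator row.
        if len(cells) < 9 or cells[0] == "Transcript ID" or cells[0].startswith("-"):
            continue
        subfam, chrom, start, end, gene = cells[3], cells[4], cells[5], cells[6], cells[8]
        keys.add((subfam, chrom, int(start), int(end), gene))
    return keys


def shared_candidates(published_table, recovered_table):
    """Keys present in both tables, ignoring transcript IDs."""
    return locus_keys(published_table) & locus_keys(recovered_table)
```

An empty result from `shared_candidates` would confirm that the two runs share no TE-gene locus at all, rather than differing only in transcript naming.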
Best regards
Nanna
Hello,
This pipeline is to be used with short-read (ideally paired-end) RNA sequencing data to help find potential TE promoters. The data that you downloaded was nanoCAGE data, which can help validate promoter locations. Thus, you should not use this pipeline on the nanoCAGE data itself. The nanoCAGE data will help define promoters accurately, but it will normally not be able to assemble the full-length transcript.
Details on how to process the nanoCAGE data are in the Supplementary Methods section of our paper: https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-023-01349-3/MediaObjects/41588_2023_1349_MOESM1_ESM.pdf
In addition, the following paper introduced the method and has more details on it: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1670-6
Also, note that we used the cell lines to validate the TE-gene chimeras seen in the tumor samples. The cell lines could contain new TE-gene chimeras that were not part of our reference.
Best,
Nakul