twlab / TEProf2Paper

TEProf2 Pipeline used to find promoters and predict protein sequences from RNA-sequencing data

Reproducing results for 5'CAGE data

nannabarnkob opened this issue

Hi there

First of all, thank you for publishing the code and pipeline.

I was wondering if you could share more details on how you processed your data.
I have downloaded the data you generated for the COV413A cell line and processed it according to your pipeline. Of course, some additional preprocessing steps were necessary, including generating individual FASTQ files from the interleaved format and running STAR and StringTie.
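
For readers following along, the de-interleaving step mentioned above could look roughly like the sketch below. It is not taken from the TEProf2 pipeline; it assumes standard four-line FASTQ records with mates strictly alternating (R1, R2, R1, R2, ...), and the function names `read_fastq_records` and `deinterleave` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def read_fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of four lines (header, sequence, '+', quality)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def deinterleave(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split interleaved records into two lists of lines (assumes R1, R2, R1, R2, ...)."""
    r1: List[str] = []
    r2: List[str] = []
    for index, record in enumerate(read_fastq_records(lines)):
        # Even-numbered records are treated as mate 1, odd-numbered as mate 2.
        (r1 if index % 2 == 0 else r2).extend(record)
    if len(r1) != len(r2):
        raise ValueError("odd number of records; input is not cleanly interleaved")
    return r1, r2
```

To use it on a file, pass an open text handle as `lines` and write the two returned lists out with `writelines`. For large files a streaming variant that writes each record as it is read would be preferable to holding both mates in memory.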
These are the candidate transcripts you recover (Supplementary Table 9 filtered on COV413A):

| Transcript ID | Class | Family | Subfam | Chr TE | Start TE | End TE | Location TE | Gene | Splice Target | Strand | Cell Line | CAGE TPM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TCONS_00027238 | DNA | hAT-Charlie | MER1B | chr12 | 130340312 | 130340636 | intron_1 | PIWIL1 | exon_2 | + | COV413A | 0,396505544 |
| TCONS_00034780 | LINE | L1 | L1PA2 | chr14 | 71842964 | 71848996 | Intergenic | RGS6 | exon_2 | + | COV413A | 3,105960093 |
| TCONS_00055478 | LINE | L1 | L1PA2 | chr18 | 34552378 | 34558395 | Intergenic | DTNA | exon_2 | + | COV413A | 0,660842573 |
| TCONS_00086600 | LINE | L1 | L1PA2 | chr3 | 58842154 | 58848179 | Intergenic | FAM3D | exon_2 | - | COV413A | 0,396505544 |
| TCONS_00098838 | LINE | L1 | L1PA2 | chr5 | 102671229 | 102677260 | Intergenic | SLCO6A1 | exon_2 | - | COV413A | 0,72692683 |
| TCONS_00103663 | LINE | L1 | L1PB1 | chr6 | 7347074 | 7349650 | intron_8 | CAGE1 | exon_9 | - | COV413A | 0,396505544 |
| TCONS_00107032 | LINE | L1 | L1HS | chr7 | 12497211 | 12500000 | Intergenic | AC005281.1 | exon_2 | + | COV413A | 0,72692683 |
| TCONS_00107035 | LINE | L1 | L1HS | chr7 | 12497211 | 12500000 | Intergenic | AC005281.1 | exon_5 | + | COV413A | 0,72692683 |
| TCONS_00107037 | LINE | L1 | L1HS | chr7 | 12497211 | 12500000 | Intergenic | SCIN | exon_2 | + | COV413A | 0,72692683 |
| TCONS_00116734 | LINE | L1 | L1PA2 | chr8 | 66949103 | 66955119 | intron_3 | TCF24 | exon_4 | - | COV413A | 0,660842573 |
| TCONS_00119408 | LINE | L1 | L1PA2 | chr9 | 94089082 | 94095103 | intron_4 | PTPDC1 | exon_5 | + | COV413A | 0,330421286 |
| TCONS_00070187 | LTR | ERV1 | LTR7 | chr2 | 38086114 | 38086512 | Intergenic | CYP1B1 | exon_2 | - | COV413A | 15,92630601 |
| TCONS_00074167 | LTR | ERV1 | LTR2B | chr20 | 15985767 | 15986246 | intron_13 | MACROD2 | exon_14 | + | COV413A | 0,396505544 |
| TCONS_00089490 | LTR | ERV1 | LTR2B | chr4 | 37546188 | 37546669 | intron_1 | C4orf19 | exon_2 | + | COV413A | 0,859095345 |
| TCONS_00105271 | LTR | ERVL | LTR18A | chr6 | 79313214 | 79313548 | Intergenic | HMGN3 | exon_1 | - | COV413A | 2,841623064 |
| TCONS_00016149 | SINE | Alu | AluY | chr10 | 101729855 | 101730163 | Intergenic | FBXW4 | exon_1 | - | COV413A | 0,991263859 |
| TCONS_00016150 | SINE | Alu | AluY | chr10 | 101729855 | 101730163 | Intergenic | FBXW4 | exon_2 | - | COV413A | 0,991263859 |
| TCONS_00030551 | SINE | Alu | AluJo | chr12 | 121847358 | 121847535 | intron_9 | HPD | exon_10 | - | COV413A | 0,330421286 |
| TCONS_00041268 | SINE | Alu | AluY | chr15 | 51603584 | 51603891 | intron_1 | DMXL2 | exon_2 | - | COV413A | 1,652106432 |

These are the ones I recover instead (sorry for the truncated output):
image

I have used the hg38 reference genome and GTF, your reference data download, and your pre-defined arguments.txt.
I hope that together we can get to the bottom of why I don't recover any of the same TE chimeras as you.
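
One way to check the overlap systematically is to compare candidates by TE coordinates, gene, splice target and strand rather than by transcript ID, since TCONS identifiers are generally assigned per assembly and need not match between runs. The sketch below is only an illustration under assumptions: it expects each candidate to be already split into the 13 column values of the table above, and the helper names (`candidate_key`, `parse_tpm`, `shared_candidates`) are hypothetical rather than part of TEProf2.

```python
from typing import Iterable, Sequence, Set, Tuple

# (chromosome, TE start, TE end, gene, splice target, strand)
Key = Tuple[str, int, int, str, str, str]


def candidate_key(fields: Sequence[str]) -> Key:
    """Build an assembly-independent key from one 13-column candidate row."""
    if len(fields) != 13:
        raise ValueError(f"expected 13 columns, got {len(fields)}")
    (_tid, _cls, _family, _subfam, chrom, start, end,
     _location, gene, target, strand, _cell_line, _tpm) = fields
    return (chrom, int(start), int(end), gene, target, strand)


def parse_tpm(text: str) -> float:
    """Parse a TPM value written with a decimal comma, e.g. '0,396505544'."""
    return float(text.replace(",", "."))


def shared_candidates(published: Iterable[Sequence[str]],
                      own: Iterable[Sequence[str]]) -> Set[Key]:
    """Return the keys present in both the published and the locally produced rows."""
    return {candidate_key(r) for r in published} & {candidate_key(r) for r in own}
```

Exact coordinate matching is strict; if TE annotations differ slightly between reference versions, an interval-overlap comparison may be more appropriate.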

Best regards
Nanna

Hello,

This pipeline is meant to be used with short-read (ideally paired-end) RNA sequencing data to help find potential TE promoters. The data that you downloaded is nanoCAGE data, which can help validate promoter locations. Thus, you should not run this pipeline on the nanoCAGE data itself. The nanoCAGE data will help define promoters accurately, but it normally cannot be used to assemble full-length transcripts.

For details on how to process the nanoCAGE data, see the Supplementary Methods section of the paper: https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-023-01349-3/MediaObjects/41588_2023_1349_MOESM1_ESM.pdf
In addition, the following paper introduced the method and has more details on it: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1670-6

Note also that we used the cell lines to validate the TE-gene chimeras seen in the tumor samples. The cell lines may contain TE-gene chimeras that are not part of our reference and could therefore be new.

Best,
Nakul