Reproducing results for 5'CAGE data

Question

nannabarnkob opened this issue a year ago · comments

Hi there

First of all thank you for your work on publishing the code and pipeline.

I was wondering if you could share more details on how you have processed your data.
I have downloaded the data you generated for the COV413A cell line and processed it according to your pipeline. Some additional preprocessing steps were necessary, including splitting the interleaved FASTQ files into individual per-mate files and running STAR and StringTie.
These are the candidate transcripts you recover (Supplementary Table 9 filtered on COV413A):

Transcript ID	Class	Family	Subfam	Chr TE	Start TE	End TE	Location TE	Gene	Splice Target	Strand	Cell Line	CAGE TPM
TCONS_00027238	DNA	hAT-Charlie	MER1B	chr12	130340312	130340636	intron_1	PIWIL1	exon_2	+	COV413A	0,396505544
TCONS_00034780	LINE	L1	L1PA2	chr14	71842964	71848996	Intergenic	RGS6	exon_2	+	COV413A	3,105960093
TCONS_00055478	LINE	L1	L1PA2	chr18	34552378	34558395	Intergenic	DTNA	exon_2	+	COV413A	0,660842573
TCONS_00086600	LINE	L1	L1PA2	chr3	58842154	58848179	Intergenic	FAM3D	exon_2	-	COV413A	0,396505544
TCONS_00098838	LINE	L1	L1PA2	chr5	102671229	102677260	Intergenic	SLCO6A1	exon_2	-	COV413A	0,72692683
TCONS_00103663	LINE	L1	L1PB1	chr6	7347074	7349650	intron_8	CAGE1	exon_9	-	COV413A	0,396505544
TCONS_00107032	LINE	L1	L1HS	chr7	12497211	12500000	Intergenic	AC005281.1	exon_2	+	COV413A	0,72692683
TCONS_00107035	LINE	L1	L1HS	chr7	12497211	12500000	Intergenic	AC005281.1	exon_5	+	COV413A	0,72692683
TCONS_00107037	LINE	L1	L1HS	chr7	12497211	12500000	Intergenic	SCIN	exon_2	+	COV413A	0,72692683
TCONS_00116734	LINE	L1	L1PA2	chr8	66949103	66955119	intron_3	TCF24	exon_4	-	COV413A	0,660842573
TCONS_00119408	LINE	L1	L1PA2	chr9	94089082	94095103	intron_4	PTPDC1	exon_5	+	COV413A	0,330421286
TCONS_00070187	LTR	ERV1	LTR7	chr2	38086114	38086512	Intergenic	CYP1B1	exon_2	-	COV413A	15,92630601
TCONS_00074167	LTR	ERV1	LTR2B	chr20	15985767	15986246	intron_13	MACROD2	exon_14	+	COV413A	0,396505544
TCONS_00089490	LTR	ERV1	LTR2B	chr4	37546188	37546669	intron_1	C4orf19	exon_2	+	COV413A	0,859095345
TCONS_00105271	LTR	ERVL	LTR18A	chr6	79313214	79313548	Intergenic	HMGN3	exon_1	-	COV413A	2,841623064
TCONS_00016149	SINE	Alu	AluY	chr10	101729855	101730163	Intergenic	FBXW4	exon_1	-	COV413A	0,991263859
TCONS_00016150	SINE	Alu	AluY	chr10	101729855	101730163	Intergenic	FBXW4	exon_2	-	COV413A	0,991263859
TCONS_00030551	SINE	Alu	AluJo	chr12	121847358	121847535	intron_9	HPD	exon_10	-	COV413A	0,330421286
TCONS_00041268	SINE	Alu	AluY	chr15	51603584	51603891	intron_1	DMXL2	exon_2	-	COV413A	1,652106432

I recover these - sorry for the truncated output.

I have used the hg38 reference genome and GTF, your reference data download, and your pre-defined arguments.txt.
I hope that together we can get to the bottom of why I don't recover any of the same TE chimeras as you.

Best regards
Nanna
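
For readers following along, the deinterleaving step mentioned in the question could look roughly like the sketch below. It is not taken from the pipeline or the paper; it assumes plain 4-line FASTQ records with no blank lines, with mates strictly alternating (R1, R2, R1, R2, ...). The function names `read_fastq` and `deinterleave` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator ('+'), quality string.
Record = Tuple[str, ...]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records from an iterable of lines.

    Assumption: every record is exactly four lines and there are no blank lines.
    """
    it = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(it, 4)]
        if not chunk:
            return
        if len(chunk) != 4 or not chunk[0].startswith("@"):
            raise ValueError("malformed FASTQ record")
        yield tuple(chunk)


def deinterleave(lines: Iterable[str]) -> Tuple[List[Record], List[Record]]:
    """Split interleaved records (R1, R2, R1, R2, ...) into two mate lists."""
    records = list(read_fastq(lines))
    if len(records) % 2:
        raise ValueError("odd number of records; input is not interleaved pairs")
    # Even positions are first mates, odd positions are second mates.
    return records[0::2], records[1::2]
```

Passing an open file handle as `lines` works because file objects iterate line by line. This version holds all records in memory, so for large files a streaming variant that writes each mate as it is read would be a better fit.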

Nakul Shah · Answer 1 · Thu May 11 2023 05:01:37 GMT+0800 (China Standard Time)

Hello,

This pipeline is to be used with short-read (ideally paired-end) RNA sequencing data to help find potential TE promoters. The data that you downloaded was nanoCAGE data, which can help validate promoter locations. Thus, you should not use this pipeline on the nanoCAGE data itself. The nanoCAGE data will help define promoters accurately, but it will normally not be able to assemble the full-length transcript.

Details on how to process the nanoCAGE data are in the Supplementary Methods section of our paper: https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-023-01349-3/MediaObjects/41588_2023_1349_MOESM1_ESM.pdf
In addition, the following paper introduced the method and has more details on it: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1670-6

In addition, we used the cell lines to validate the TE-gene chimeras seen in the tumor samples. The cell lines could also contain new TE-gene chimeras that were not part of our reference.

Best,
Nakul
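
As a rough illustration of how candidates from a short-read RNA-seq run might be compared against the table in the question, the sketch below parses the tab-separated rows (converting the decimal-comma CAGE TPM values to floats) and splits candidates into shared, missing, and potentially new sets. This is not the authors' method: the function names are hypothetical, and matching on exact TE coordinates plus gene name is an assumption that may be too strict in practice.

```python
import csv
import io


def parse_reference(tsv_text):
    """Parse tab-separated rows; convert decimal-comma TPM values to float."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        row["CAGE TPM"] = float(row["CAGE TPM"].replace(",", "."))
        rows.append(row)
    return rows


def te_key(row):
    """Key a row by TE chromosome, start, end, and gene (assumed matching rule)."""
    return (row["Chr TE"], int(row["Start TE"]), int(row["End TE"]), row["Gene"])


def compare(reference_rows, recovered_keys):
    """Return (shared, missing from recovered, not in reference) as sets of keys."""
    ref = {te_key(r) for r in reference_rows}
    recovered = set(recovered_keys)
    return ref & recovered, ref - recovered, recovered - ref
```

Keys in the third set are the ones that, per the answer above, could be chimeras absent from the reference rather than errors.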