STOmics / SAW

Ask for GEM file format description

jphe opened this issue

Hello,

I'm trying to understand the format of a GEM file. As I understand it, the first column is the gene name, the second and third columns are the protein CID coordinates, and the fourth column (MIDCount) is the number of UMIs detected for that gene in the spot. Is that right?

But what does the fifth column, ExonCount, represent? Does it mean the exon ID? It sometimes has a value of 0:

$ less E16.5_E1S1.tissue.gem.gz | head -1000000 |grep 0610005C13Rik | cut -f 1,4,5 | sort | uniq -c
     72 0610005C13Rik	1	0
   2330 0610005C13Rik	1	1
      9 0610005C13Rik	2	0
      2 0610005C13Rik	2	1
    295 0610005C13Rik	2	2
      3 0610005C13Rik	3	0
      1 0610005C13Rik	3	2
     85 0610005C13Rik	3	3
      1 0610005C13Rik	4	1
     26 0610005C13Rik	4	4
      1 0610005C13Rik	5	5

Hi, you're right about the first and fourth columns. The second and third columns are the CID coordinates minus the X and Y offsets given in the header.
The fifth column, ExonCount, is not an exon ID. The expression matrix also includes reads mapped to introns, not only exons, so ExonCount can be lower than MIDCount, and it is 0 when none of the MIDs counted for that gene at that spot came from exonic reads.
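
To make the column meanings concrete, here is a minimal Python sketch that reads GEM-style text, adds the header offsets back to the coordinates, and derives a non-exonic count as MIDCount minus ExonCount. It rests on several assumptions not confirmed in this thread: that the offsets appear as `#OffsetX=` and `#OffsetY=` comment lines, that the column header row reads `geneID x y MIDCount ExonCount`, and that the file is tab-separated. The function name `parse_gem` and the sample values are made up for illustration.

```python
import csv
import io


def parse_gem(text):
    """Parse GEM-style text into (offsets, rows).

    Assumed layout (not confirmed by the thread): '#Key=Value' header
    lines, then a tab-separated table with a column header row.
    """
    offsets = {"OffsetX": 0, "OffsetY": 0}
    data_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Header line, e.g. '#OffsetX=100' (assumed key names).
            key, _, value = line[1:].partition("=")
            if key in offsets:
                offsets[key] = int(value)
        elif line.strip():
            data_lines.append(line)

    reader = csv.DictReader(io.StringIO("\n".join(data_lines)), delimiter="\t")
    rows = []
    for rec in reader:
        mid = int(rec["MIDCount"])
        exon = int(rec["ExonCount"])
        rows.append({
            "gene": rec["geneID"],
            # File coordinates are CID coordinates minus the offset,
            # so adding the offset back recovers the original position.
            "abs_x": int(rec["x"]) + offsets["OffsetX"],
            "abs_y": int(rec["y"]) + offsets["OffsetY"],
            "mid": mid,
            "exon": exon,
            # MIDs not attributed to exons (e.g. intronic reads).
            "non_exon": mid - exon,
        })
    return offsets, rows


# Made-up sample data; coordinates and offsets are illustrative only.
SAMPLE = (
    "#OffsetX=100\n"
    "#OffsetY=200\n"
    "geneID\tx\ty\tMIDCount\tExonCount\n"
    "0610005C13Rik\t5\t7\t1\t0\n"
    "0610005C13Rik\t6\t7\t3\t2\n"
)

if __name__ == "__main__":
    offsets, rows = parse_gem(SAMPLE)
    for row in rows:
        print(row)
```

For a real `.gem.gz` file, the text could be read with `gzip.open(path, "rt")` before passing it to the function. A row with MIDCount 1 and ExonCount 0, like the first line of the output in the question, would then show one non-exonic MID.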