Ask for GEM file format description

Question

jphe opened this issue 2 months ago · comments

Hello,

I'm trying to understand the format of a GEM file. As I understand it, the first column is the gene name, the second and third columns are the protein CID coordinates, and the fourth column (MIDCount) is the number of UMIs detected for that gene in the spot. Is that right?

But what does the fifth column, ExonCount, represent? Does it mean the exon ID? In some rows it has a value of 0.

$ less E16.5_E1S1.tissue.gem.gz | head -1000000 |grep 0610005C13Rik | cut -f 1,4,5 | sort | uniq -c
     72 0610005C13Rik	1	0
   2330 0610005C13Rik	1	1
      9 0610005C13Rik	2	0
      2 0610005C13Rik	2	1
    295 0610005C13Rik	2	2
      3 0610005C13Rik	3	0
      1 0610005C13Rik	3	2
     85 0610005C13Rik	3	3
      1 0610005C13Rik	4	1
     26 0610005C13Rik	4	4
      1 0610005C13Rik	5	5

KentLE · Answer 1 · Sun May 19 2024 17:57:04 GMT+0800 (China Standard Time)

Hi, you're right about the first and fourth columns. In the GEM file, the second and third columns are the CID coordinates minus the X or Y offset given in the header.
The fifth column, ExonCount, can be smaller than MIDCount, or even 0, because the expression matrix also includes reads mapped to introns, not only exons.
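
To make the column layout concrete, here is a minimal Python sketch of reading a GEM file along the lines described above. It is an illustration under assumptions, not the official parser: it assumes header lines start with `#` and look like `#OffsetX=...` and `#OffsetY=...`, and that the column header row uses the names `geneID`, `x`, `y`, `MIDCount`, `ExonCount`. The function names `read_gem` and `raw_coords` and the sample values are hypothetical. Treating `MIDCount - ExonCount` as the non-exonic (for example intronic) part is an interpretation of the answer above, so check it against your own header and documentation.

```python
import io


def read_gem(handle):
    """Parse GEM text from an open text handle.

    Returns (header, rows). `header` maps '#key=value' lines to strings
    (assumed header style). `rows` is a list of dicts, one per data line.
    """
    header = {}
    rows = []
    columns = None
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Assumed header form: '#OffsetX=100'
            key, _, value = line[1:].partition("=")
            header[key] = value
            continue
        fields = line.split("\t")
        if columns is None:
            # First non-comment line is taken as the column header row.
            columns = fields
            continue
        rec = dict(zip(columns, fields))
        mid = int(rec["MIDCount"])
        exon = int(rec["ExonCount"])
        rows.append({
            "gene": rec["geneID"],
            "x": int(rec["x"]),
            "y": int(rec["y"]),
            "mid": mid,
            "exon": exon,
            # MIDs not mapped to exons (assumed meaning, see lead-in).
            "non_exon": mid - exon,
        })
    return header, rows


def raw_coords(row, header):
    """Undo the offset subtraction: stored x, y plus the header offsets."""
    return (
        row["x"] + int(header.get("OffsetX", 0)),
        row["y"] + int(header.get("OffsetY", 0)),
    )


# Hypothetical sample data, only to show the expected layout.
SAMPLE = (
    "#OffsetX=100\n"
    "#OffsetY=200\n"
    "geneID\tx\ty\tMIDCount\tExonCount\n"
    "0610005C13Rik\t5\t7\t2\t0\n"
    "0610005C13Rik\t6\t7\t3\t2\n"
)

header, rows = read_gem(io.StringIO(SAMPLE))
```

For a real file, the same function can be fed a gzip text handle, for example `gzip.open("E16.5_E1S1.tissue.gem.gz", "rt")` from the standard library. A row with `MIDCount` 2 and `ExonCount` 0 then simply means that neither of the two MIDs was assigned to an exon.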