davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics

Home Page:https://davidemms.github.io/

How to interpret OrthologuesStats*.tsv files?

bbalog87 opened this issue

Hello,

It is not clear to me how to interpret the OrthologuesStats*.tsv files.

For instance, this matrix from the OrthologuesStats_one-to-one.tsv file does not look symmetric to me. It is not clear how to infer the total number of one-to-one orthologs for each species. Is it the row sum or the column sum?
         Chaar    Latma    Pagma    Parch    Perfl    Sanlu    Silsi
Chaar      0.0  12540.0  11361.0  13307.0   9480.0   6867.0  14564.0
Latma  12540.0      0.0  10242.0  11736.0   8457.0   6323.0  12891.0
Pagma  11361.0  10242.0      0.0  11388.0   7496.0   6068.0  12220.0
Parch  13307.0  11736.0  11388.0      0.0   9261.0   6840.0  13963.0
Perfl   9480.0   8457.0   7496.0   9261.0      0.0   5292.0   9781.0
Sanlu   6867.0   6323.0   6068.0   6840.0   5292.0      0.0   7035.0
Silsi  14564.0  12891.0  12220.0  13963.0   9781.0   7035.0      0.0

Thank you,
Julien

Hi Julien

I've just checked the matrix in your post and it is symmetric. For example, it reports that the number of one-to-one orthologs between Chaar and Latma is 12540, and that is the same whether you look at the M(1,0) entry of the matrix or the M(0,1) entry. So for each pair of species, the corresponding number in the matrix is the number of one-to-one orthologs between that pair of species. You don't need to take the sum over the rows or columns.
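If you want to check this yourself, here is a minimal Python sketch. It is only an illustration, not part of OrthoFinder: the counts are typed in from the matrix posted above, and the helper names `is_symmetric` and `pair_count` are made up for this example.

```python
# One-to-one counts copied by hand from the matrix posted above.
species = ["Chaar", "Latma", "Pagma", "Parch", "Perfl", "Sanlu", "Silsi"]
one_to_one = [
    [0, 12540, 11361, 13307, 9480, 6867, 14564],
    [12540, 0, 10242, 11736, 8457, 6323, 12891],
    [11361, 10242, 0, 11388, 7496, 6068, 12220],
    [13307, 11736, 11388, 0, 9261, 6840, 13963],
    [9480, 8457, 7496, 9261, 0, 5292, 9781],
    [6867, 6323, 6068, 6840, 5292, 0, 7035],
    [14564, 12891, 12220, 13963, 9781, 7035, 0],
]


def is_symmetric(m):
    """Return True if m[i][j] == m[j][i] for every pair of indices."""
    n = len(m)
    return all(m[i][j] == m[j][i] for i in range(n) for j in range(i + 1, n))


def pair_count(m, names, a, b):
    """Look up the entry for the species pair (a, b) by name."""
    return m[names.index(a)][names.index(b)]


print(is_symmetric(one_to_one))
print(pair_count(one_to_one, species, "Chaar", "Latma"))
print(pair_count(one_to_one, species, "Latma", "Chaar"))
```

Each entry is read on its own as the count for one species pair, so no row or column sums are involved.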

All the best
David

Hi David,

Thanks for the helpful answer. I have now understood how to read the matrix.

How about this one-to-many matrix?

        Chaar   Latma   Pagma   Parch   Perfl   Sanlu   Silsi
Chaar     0.0   816.0  1269.0  1511.0  6740.0   597.0   871.0
Latma   431.0     0.0  1156.0  1388.0  5992.0   625.0   787.0
Pagma  1007.0  1218.0     0.0  2021.0  6088.0   738.0  1387.0
Parch   383.0   690.0  1110.0     0.0  6552.0   513.0   714.0
Perfl   197.0   441.0   772.0   706.0     0.0   448.0   430.0
Sanlu   239.0   426.0   719.0   827.0  3587.0     0.0   431.0
Silsi   441.0   822.0  1135.0  1492.0  6923.0   608.0     0.0

Best,
Julien

Hi Julien

This matrix is not symmetric because one-to-many, unlike one-to-one, is not a symmetrical relationship. Thanks for bringing this up; below is an explanation of how this works. I'll add something to the README file to describe these results files more fully, as I realise now that there's not enough info for users to interpret them currently.

For some gene trees you will have multiple duplication events post-speciation. This could lead to, for example, 2 genes in Latma being orthologs of 3 genes in Chaar. All of these occurrences are summed up in the many-to-many matrix. This case would add 2 to the entry for M(Latma, Chaar) and 3 to the entry for M(Chaar, Latma). This is a tree showing 3 genes in arabidopsis (AT2G07671, ATMG01080, ATMG00040) that are orthologs to 2 genes in volvox (Vocar.0009s0017.1, Vocar.0009s0018.1):

[Figure: gene tree illustrating the many-to-many case]
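The bookkeeping for the Latma/Chaar example above can be sketched in a few lines of Python. This is only an illustration of the counting rule as described here, not OrthoFinder's actual code; the function `add_many_to_many` and the gene IDs are hypothetical.

```python
from collections import defaultdict


def add_many_to_many(matrix, sp_a, genes_a, sp_b, genes_b):
    """Add one many-to-many ortholog relationship to the running totals.

    matrix[sp_a][sp_b] counts genes in sp_a that are in a many-to-many
    relationship with genes in sp_b, so the two entries can differ.
    """
    if len(genes_a) > 1 and len(genes_b) > 1:
        matrix[sp_a][sp_b] += len(genes_a)
        matrix[sp_b][sp_a] += len(genes_b)


# Nested dict of counts, defaulting to 0 for unseen species pairs.
matrix = defaultdict(lambda: defaultdict(int))

# Hypothetical gene IDs: 2 genes in Latma are orthologs of 3 genes in Chaar.
add_many_to_many(
    matrix,
    "Latma", ["Latma_g1", "Latma_g2"],
    "Chaar", ["Chaar_g1", "Chaar_g2", "Chaar_g3"],
)

print(matrix["Latma"]["Chaar"])  # 2
print(matrix["Chaar"]["Latma"])  # 3
```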

For the one-to-many/many-to-one relationships, you might have matrices like this:

one-to-many, X=

             A. thaliana   O. sativa   P. patens  V. carteri
 A. thaliana           0        1601        1614         115
   O. sativa        1893           0        1686         108
   P. patens         906         880           0         123
  V. carteri        1693        1606        2155           0

many-to-one, Y=

             A. thaliana   O. sativa   P. patens  V. carteri
 A. thaliana           0        4683        2463        5596
   O. sativa        4135           0        2483        5510
   P. patens        4099        4347           0        6439
  V. carteri         282         269         329           0

This means there are 1693 genes in V. carteri that are in a one-to-many relationship with orthologs in A. thaliana, whereas there are only 115 genes in A. thaliana that are in a one-to-many relationship with genes in V. carteri. That corresponds to what should be expected: the genome of A. thaliana is larger, and there have been more gene duplication events in the lineage leading to A. thaliana than in the lineage leading to the green alga V. carteri.

A little care needs to be taken when reading these files though as the 1693 genes in volvox are orthologs of the 5596 genes in arabidopsis (i.e. X(i,j) genes are orthologs of Y(j,i) genes) and the 115 genes in arabidopsis are orthologs of the 282 genes in volvox. This makes sense in terms of the naming of the matrices and the ordering of the entries, but might be different from what might naively be expected.
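To make the X(i,j) / Y(j,i) pairing concrete, here is a small Python sketch using the two example matrices above. The numbers are copied from this post; the helper `paired_counts` is just a name chosen for the illustration and is not something OrthoFinder provides.

```python
species = ["A. thaliana", "O. sativa", "P. patens", "V. carteri"]

# one-to-many matrix X, rows and columns in the order of `species`
X = [
    [0, 1601, 1614, 115],
    [1893, 0, 1686, 108],
    [906, 880, 0, 123],
    [1693, 1606, 2155, 0],
]

# many-to-one matrix Y, same ordering
Y = [
    [0, 4683, 2463, 5596],
    [4135, 0, 2483, 5510],
    [4099, 4347, 0, 6439],
    [282, 269, 329, 0],
]


def paired_counts(a, b):
    """Return (X[a][b], Y[b][a]): the X[a][b] genes in species a are
    orthologs of the Y[b][a] genes in species b."""
    i, j = species.index(a), species.index(b)
    return X[i][j], Y[j][i]


print(paired_counts("V. carteri", "A. thaliana"))  # (1693, 5596)
print(paired_counts("A. thaliana", "V. carteri"))  # (115, 282)
```

Note that the index order is swapped for the Y lookup, which is the point that is easy to get wrong when reading the two files side by side.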

All the best
David

Hi David,

Thank you for the comprehensive explanations. It would really be great if you could edit the README, in order to help users to better interpret those results.

Best,
Julien.

PS: I deleted the previous post by mistake. I'll just repost the one-to-many matrix here for other readers who might be interested in this issue.

        Chaar   Latma   Pagma   Parch   Perfl   Sanlu   Silsi
Chaar     0.0   816.0  1269.0  1511.0  6740.0   597.0   871.0
Latma   431.0     0.0  1156.0  1388.0  5992.0   625.0   787.0
Pagma  1007.0  1218.0     0.0  2021.0  6088.0   738.0  1387.0
Parch   383.0   690.0  1110.0     0.0  6552.0   513.0   714.0
Perfl   197.0   441.0   772.0   706.0     0.0   448.0   430.0
Sanlu   239.0   426.0   719.0   827.0  3587.0     0.0   431.0
Silsi   441.0   822.0  1135.0  1492.0  6923.0   608.0     0.0