rderelle / Broccoli

orthology assignment using phylogenetic and network analyses

Low single copy OGs

francicco opened this issue

Hi

I’m running a fairly large study on 63 species of butterflies, mainly Heliconiini. The phylogenetic framework spans ~70 My, though most of it falls within 30 My.

I seem to retrieve very few single-copy orthogroups (scOGs), only ~400. This number varies depending on how I treat missing genes, but it always stays very low. My guess is that many OGs with multiple paralogs contain false paralogs. This is an example:

[Two screenshots, 2021-06-29 at 14 42 07 and 14 42 02; images not preserved]

Two clearly different proteins.
How can I fix this? Is there a parameter controlling the granularity, like the inflation in MCL? Any advice?

Seeing these results makes me panic a bit.
Cheers
F
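A minimal sketch of how the scOG count can depend on the treatment of missing genes, as described above. It assumes the orthogroups are already available as a per-OG, per-species copy count; the function name, the `max_missing` parameter and the toy data are invented for illustration and are not part of Broccoli.

```python
def count_single_copy_ogs(counts, species, max_missing=0):
    """Count OGs in which no species has more than one copy and
    at most `max_missing` species are absent (hypothetical helper)."""
    n = 0
    for per_species in counts.values():
        copies = [per_species.get(sp, 0) for sp in species]
        if max(copies) > 1:
            continue  # at least one species has a duplicate
        if sum(copies) > 0 and copies.count(0) <= max_missing:
            n += 1
    return n


# Toy data, invented for illustration: OG -> species -> number of copies.
species = ["sp1", "sp2", "sp3", "sp4"]
counts = {
    "OG_1": {"sp1": 1, "sp2": 1, "sp3": 1, "sp4": 1},
    "OG_2": {"sp1": 1, "sp2": 1, "sp3": 1},
    "OG_3": {"sp1": 1, "sp2": 2, "sp3": 1, "sp4": 1},
    "OG_4": {"sp1": 1, "sp2": 1},
}

for max_missing in (0, 1, 2):
    print(max_missing, count_single_copy_ogs(counts, species, max_missing))
```

Relaxing `max_missing` only recovers OGs that lack some species; an OG with a single extra copy in one species (like `OG_3` here) stays excluded whatever the tolerance.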

Hi,

Two things:
- Some spurious hits do indeed exist. They will be eliminated in the next version, which I'm currently working on. For the time being, you could use a stricter e-value for the Diamond search (e.g. 1e-5).
- Low number of scOGs: this is very strange. You might have a species (or two) in your dataset with high levels of duplicates. Are these proteomes from 'gold genomes' or your own?

These are my genomes.
1e-5 seems pretty high; at the moment I'm using 1e-20 and I was thinking of decreasing it further. These are closely related species, so I'm expecting the e-values to be even lower.

How do I check the level of duplication among species?

F
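One possible way to check the level of duplication per species is sketched below. It is only an illustration: it assumes the OGs have been loaded as a per-OG, per-species copy count, and the function names and toy data are hypothetical rather than anything Broccoli provides.

```python
from collections import Counter


def duplication_rate(counts, species):
    """For each species, the fraction of OGs containing it in which
    it has more than one copy (hypothetical helper)."""
    present, duplicated = Counter(), Counter()
    for per_species in counts.values():
        for sp in species:
            c = per_species.get(sp, 0)
            if c >= 1:
                present[sp] += 1
            if c > 1:
                duplicated[sp] += 1
    return {sp: (duplicated[sp] / present[sp] if present[sp] else 0.0)
            for sp in species}


def sole_offenders(counts, species):
    """Count the OGs in which exactly one species has more than one copy,
    i.e. OGs that would be single-copy without that species."""
    blame = Counter()
    for per_species in counts.values():
        multi = [sp for sp in species if per_species.get(sp, 0) > 1]
        if len(multi) == 1:
            blame[multi[0]] += 1
    return blame


# Toy data, invented for illustration: OG -> species -> number of copies.
species = ["sp1", "sp2", "sp3"]
counts = {
    "OG_1": {"sp1": 1, "sp2": 1, "sp3": 1},
    "OG_2": {"sp1": 1, "sp2": 3, "sp3": 1},
    "OG_3": {"sp1": 1, "sp2": 2, "sp3": 1},
    "OG_4": {"sp1": 2, "sp2": 2, "sp3": 1},
}

rates = duplication_rate(counts, species)
blame = sole_offenders(counts, species)
for sp in species:
    print(sp, round(rates[sp], 2), blame[sp])
```

A species with a much higher rate than the others, or one that is frequently the only species with extra copies in an OG, would be a candidate for the "high levels of duplicates" mentioned earlier in the thread.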

Hi,

Actually, my previous answer was not very good. Here is a better one.

If the goal is to get scOGs, I would (i) lower the species overlap parameter (from 0.5 to 0.3-0.4) and (ii) increase the number of species required to validate a chimeric protein (from 3 to 40, for instance).

Finally, since some of your species are likely to be closely related, you should increase the kmer-size parameter at step 1 (from 100 to 200-300).

best,
Romain

Hi Romain,

Thanks a lot for the clarification. I actually lowered the overlap to 0.1 and it works better. I also changed the e-value to 1e-40 and set -chimeric_nb_sp to 10. I didn't want to mess with the kmer parameter, but I'll try 200 and see.

Best,
F

So, the results improve with a larger kmer size. I tried 100, 200, 300 and 400. There is very little difference between 300 and 400, so I'm going to use 300.
Thanks a lot
F