In the post-genomics era, we have been flooded with large amounts of genome-wide information. Many taxonomic relationships between organisms are being revised now that their genome sequences are available. Studying core orthologous genes in a particular group of organisms can shed light on their evolution and on many pathogenic traits. Programs such as Multiparanoid have been very useful for studying orthologous relationships between organisms. I will try to explain the program and its output in this blog.
This is a rather confusing subject. Recently, one of my students asked me a question that led me to put things up here for public consumption. In the post-genomics era, many new terms have been coined based on sequence similarity and divergence. We already know that genes undergo duplication within a genome. The copy sometimes stays as a duplicate gene, sometimes mutates into an isoform, and sometimes changes so completely that it codes for an entirely different protein product. As long as the copies stay in the same genome, we call them paralogs. So, is there a difference between paralogs and alleles? Yes, there is. By the classical genetic definition, alleles are alternative versions of a gene at the same chromosomal location (locus), while paralogs are separate copies at different loci, whether in tandem or elsewhere in the genome.
Orthologs, on the other hand, are genes in two different organisms that code for more or less the same protein product and have retained their identity through speciation. Now, what are in-paralogs and out-paralogs? These terms describe cases that involve both gene duplication and speciation. Let's say there is a gene A that undergoes duplication to A' in the same species. Speciation then takes place, giving rise to species X and species Y, which carry genes XA and XA', and YA and YA', respectively. Now we know that XA - YA and XA' - YA' are orthologs of each other. What about XA - XA', YA - YA', XA - YA' and XA' - YA? Because the duplication happened before the speciation, all of these pairs are out-paralogs with respect to the X/Y split. In-paralogs arise the other way round: if, after speciation, gene XA duplicates again within species X, the two resulting copies are in-paralogs of each other, and both are co-orthologs of YA. Diagrammatically, this can be represented as:
[Taken from Inparanoid paper]
By definition, out-paralogs arise from a duplication that happened before speciation, so they are not orthologs!
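To make the definitions concrete, here is a minimal Python sketch. It assumes the usual convention that a duplication before speciation gives out-paralogs and a duplication after speciation gives in-paralogs. The function name `relationship` and the (species, copy) labels are hypothetical, made up for this illustration only.

```python
def relationship(gene_1, gene_2, duplication_before_speciation):
    """Classify a gene pair; each gene is a (species, copy) tuple.

    Toy model with one duplication and one speciation event.
    The labels are hypothetical, e.g. ("X", "A") or ("X", "A'").
    """
    species_1, copy_1 = gene_1
    species_2, copy_2 = gene_2

    if copy_1 == copy_2:
        # Same copy of the gene: separated by speciation only.
        return "ortholog" if species_1 != species_2 else "same gene"

    # Different copies: the pair is separated by the duplication.
    if duplication_before_speciation:
        return "out-paralog"
    if species_1 == species_2:
        return "in-paralog"
    # Duplication after speciation, genes in different species:
    # each new copy is still orthologous to the single gene in the other species.
    return "co-ortholog"


# Duplication A -> A' happened BEFORE the X/Y speciation.
print(relationship(("X", "A"), ("Y", "A"), True))    # ortholog
print(relationship(("X", "A"), ("Y", "A'"), True))   # out-paralog
print(relationship(("X", "A"), ("X", "A'"), True))   # out-paralog

# Duplication happened AFTER speciation, inside species X only.
print(relationship(("X", "A1"), ("X", "A2"), False))  # in-paralog
print(relationship(("X", "A1"), ("Y", "A"), False))   # co-ortholog
```

This is only a bookkeeping exercise on labels. Real programs have to infer the order of duplication and speciation from sequence similarity scores.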
We are presently working on finding core orthologs in a set of genomes belonging to one group. When we taxonomically classify organisms into a group, we do so based on morphological and behavioral patterns. But now we have the luxury of looking at the genes. So, in order to get to the holy grail of a core set of proteins in oomycetes, we used Inparanoid and Multiparanoid. For a particular species pair (A and B), Inparanoid runs four blastp searches: A-B, B-A, A-A and B-B. It checks for out-paralogs and keeps them out of the ortholog groups.
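The four searches for one species pair can be listed with a few lines of Python. This is just a sketch of the bookkeeping: `blast_searches` is a hypothetical helper written for this post, and it does not call blastp or reproduce any Inparanoid code.

```python
def blast_searches(species_a, species_b):
    """Return the four (query, database) blastp searches for one species pair."""
    return [
        (species_a, species_b),  # A-B: between species
        (species_b, species_a),  # B-A: between species, reverse direction
        (species_a, species_a),  # A-A: within species A
        (species_b, species_b),  # B-B: within species B
    ]


for query, database in blast_searches("A", "B"):
    kind = "within-species" if query == database else "between-species"
    print(f"{query} vs {database} ({kind})")
```

The two between-species searches compare the genomes in both directions, while the two within-species searches compare each genome against itself.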
Multiparanoid, on the other hand, takes the N * (N-1)/2 pairwise species comparison outputs from Inparanoid (for N species) and carries out seed clustering until it has merged all the inputs. Because Multiparanoid extends overlapping pairwise ortholog groups from Inparanoid, an error in one pairwise group can get propagated into the merged clusters. That is why Inparanoid needs to be run with stringent parameters (the defaults are also fine). It is also important to note that Multiparanoid is designed to cluster species that originated at about the same time. If species that diverged on different time scales are clustered with Multiparanoid, you may not get the desired outcome.
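The sketch below illustrates the two ideas in this paragraph: the number of pairwise runs, and how merging overlapping pairwise groups can carry an error along. It is a simplified single-linkage merge written for illustration, not Multiparanoid's actual algorithm, and the names (`pairwise_runs`, `merge_groups`, the gene IDs) are all hypothetical.

```python
from itertools import combinations


def pairwise_runs(species):
    """All unordered species pairs: N * (N - 1) / 2 pairwise runs."""
    return list(combinations(species, 2))


def merge_groups(pairwise_groups):
    """Merge pairwise ortholog groups that share at least one gene."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path compression
            gene = parent[gene]
        return gene

    for group in pairwise_groups:
        genes = list(group)
        if not genes:
            continue
        root = find(genes[0])
        for gene in genes[1:]:
            other = find(gene)
            if other != root:
                parent[other] = root  # join the two clusters

    clusters = {}
    for gene in parent:
        clusters.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(cluster) for cluster in clusters.values())


print(pairwise_runs(["X", "Y", "Z"]))  # 3 species -> 3 pairwise runs

# Hypothetical pairwise groups; the last one is a wrong assignment.
groups = [
    {"X_g1", "Y_g1"}, {"Y_g1", "Z_g1"}, {"X_g1", "Z_g1"},
    {"X_g2", "Y_g2"},
    {"Y_g2", "Z_g9"},  # spurious pair from one pairwise run
]
for cluster in merge_groups(groups):
    print(cluster)
```

In this toy input, the single spurious pair (Y_g2, Z_g9) is enough to pull Z_g9 into the same cluster as X_g2, even though X_g2 and Z_g9 were never paired directly. That is the kind of propagation that stringent pairwise settings help to limit.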