Protocol of Gene Homology Pipeline
Species
Molgula oculata; Molgula occidentalis; Botryllus schlosseri; Halocynthia roretzi; Halocynthia aurantium; Ciona intestinalis; Ciona savignyi; Phallusia mammillata; Phallusia fumigata; Latimeria chalumnae; Callorhinchus milii; Pelodiscus sinensis; Gallus gallus; Homo sapiens; Mus musculus; Branchiostoma belcheri; Saccoglossus kowalevskii; Strongylocentrotus purpuratus
Authors
Paul Simion
Céline Scornavacca
Frédéric Delsuc
Emmanuel J.P. Douzery
We downloaded proteomic data for 12 species spanning the diversity of chordates
and including outgroups, to which we added data produced here
for 6 new tunicate species. These 18 datasets were dereplicated
(i.e. only the longest transcript for each gene was kept) and then used as input
for the clustering software package SiLiX (Miele et al. 2012).
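The dereplication step can be illustrated with a minimal Python sketch. It is not the code used in the pipeline. It assumes FASTA headers of the hypothetical form "GENE.TRANSCRIPT", which the real datasets may not follow, so the gene_id helper would need to be adapted to each source.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header):
    """Extract a gene identifier from a header.

    Assumption (hypothetical convention): the first word of the header
    is "GENE.TRANSCRIPT", so the gene is everything before the last dot.
    """
    return header.split()[0].rsplit(".", 1)[0]


def dereplicate(records):
    """Keep only the longest sequence for each gene.

    Returns a dict mapping gene identifier to (header, sequence).
    Ties are resolved in favour of the first sequence encountered.
    """
    longest = {}
    for header, seq in records:
        gene = gene_id(header)
        if gene not in longest or len(seq) > len(longest[gene][1]):
            longest[gene] = (header, seq)
    return longest
```

Calling dereplicate(parse_fasta(open_file)) on one proteome returns one representative sequence per gene, which can then be written back out as the input for clustering.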
Two rounds of clustering were run, the second one designed
to break apart and recluster the roughly 30% of all sequences that had been assigned
to a single "mega-cluster". We then kept only the clusters that contained
either at least two tunicate sequences, or at least one tunicate and one vertebrate
sequence. This resulted in 12,885 clusters of homologous sequences, containing both
orthologs and paralogs.
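The cluster selection rule can be written as a short Python predicate. This is a sketch under assumptions: a cluster is represented here as a list of (species, sequence identifier) pairs, which is an illustrative data structure and not the SiLiX output format, and the tunicate and vertebrate groupings are taken from the species list at the top of this protocol.

```python
# Taxonomic groups taken from the species list of this protocol.
TUNICATES = {
    "Molgula oculata", "Molgula occidentalis", "Botryllus schlosseri",
    "Halocynthia roretzi", "Halocynthia aurantium", "Ciona intestinalis",
    "Ciona savignyi", "Phallusia mammillata", "Phallusia fumigata",
}
VERTEBRATES = {
    "Latimeria chalumnae", "Callorhinchus milii", "Pelodiscus sinensis",
    "Gallus gallus", "Homo sapiens", "Mus musculus",
}


def keep_cluster(members):
    """Return True if a cluster passes the selection rule.

    members: list of (species, sequence_id) pairs (hypothetical layout).
    A cluster is kept if it holds at least two tunicate sequences,
    or at least one tunicate and one vertebrate sequence.
    """
    n_tunicate = sum(1 for species, _ in members if species in TUNICATES)
    n_vertebrate = sum(1 for species, _ in members if species in VERTEBRATES)
    return n_tunicate >= 2 or (n_tunicate >= 1 and n_vertebrate >= 1)
```

The rule counts sequences rather than species, so two paralogs from a single tunicate species are enough for a cluster to be retained.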
These clusters were then aligned using MAFFT (Katoh & Standley 2013), and fragmented
sequences (i.e. sequences that were short both in absolute length and relative to the rest of
the alignment) were discarded. For each of these alignments, we computed a
phylogenetic tree, together with 100 bootstrap replicates to estimate node support,
using the LG+G4+F evolutionary model in RAxML (Stamatakis 2014). These trees of
homologous sequences were then analyzed with a custom program (written in C++) in
order to detect the vertebrate orthologs of each tunicate sequence.
This phylogeny-based orthology information was subsequently used to assign a
definitive name to each tunicate gene, following recently published recommendations
for tunicate gene nomenclature (Stolfi et al. 2015).
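The removal of fragmented sequences after alignment can be sketched as follows. The protocol does not state the thresholds that were used, so min_length and min_fraction are hypothetical parameters, and taking the median ungapped length as the reference for "the rest of the alignment" is likewise an assumption made for illustration.

```python
from statistics import median


def ungapped_length(aligned_seq):
    """Count residues in an aligned sequence, ignoring gap characters."""
    return sum(1 for char in aligned_seq if char not in "-.")


def drop_fragments(alignment, min_length=100, min_fraction=0.5):
    """Discard sequences that are short in both absolute and relative terms.

    alignment: dict mapping sequence name to aligned sequence.
    min_length: hypothetical absolute threshold, in residues.
    min_fraction: hypothetical threshold relative to the median
                  ungapped length of the alignment.
    A sequence is removed only if it fails both criteria.
    """
    if not alignment:
        return {}
    lengths = {name: ungapped_length(seq) for name, seq in alignment.items()}
    typical = median(lengths.values())
    kept = {}
    for name, seq in alignment.items():
        short_absolute = lengths[name] < min_length
        short_relative = lengths[name] < min_fraction * typical
        if not (short_absolute and short_relative):
            kept[name] = seq
    return kept
```

Requiring both conditions means that short but complete proteins are preserved when the whole family is short, while genuine fragments of long proteins are dropped.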
Miele et al. Ultra-fast sequence clustering from similarity networks with SiLiX. BMC Bioinformatics.
Katoh K, Standley DM (2013). MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol 30(4):772-780.
Stamatakis A (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9):1312-1313.
Stolfi A et al. (2015). Guidelines for the Nomenclature of Genetic Elements in Tunicate Genomes. Genesis 53(1):1-14.