|
4 Dec. 2024 revised |
|
ORTHOSCOPE* (star) is an analysis pipeline to infer evolutionary histories of genes at genome scale. By estimating orthogroups and gene trees for a complete set of protein coding genes, ORTHOSCOPE* evaluates:
- gene duplication events that occurred at species nodes.
- the presence or absence of genes in species lineages.
- the sister group of a focal species/group in the gene tree.
The code is derived from the ORTHOSCOPE web version. ORTHOSCOPE* is designed for use on supercomputers (high-performance computers) for genome-scale analyses.
|
|
|
|
|
|
|
ORTHOSCOPE* |
You can install ORTHOSCOPE* by downloading it from GitHub: https://github.com/jun-inoue/ORTHOSCOPE_STAR
|
|
ORTHOSCOPE* is written in Python and requires Python 3. I use ORTHOSCOPE* on Linux and Mac; some modification would be needed to run it on Windows.
|
Dependencies |
ORTHOSCOPE* requires seven dependencies in the tools directory: blastp, makeblastdb, mafft, trimal, pal2nal.pl, Rscript (with R itself and the APE package installed), and Notung.jar. On a Mac, some newly downloaded programs may need to be approved with "Open Anyway" in System Preferences > Security & Privacy.
blastp, makeblastdb: BLAST+
Available here: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/.
Download the appropriate file for your computer (ncbi-blast-x.xx.x+-x64-macosx.tar.gz for Mac users). Copy the blastp and makeblastdb files into the tools directory, then change their permissions so that they are executable.
$ cp DOWNLOADED_DIR/bin/blastp tools
$ cp DOWNLOADED_DIR/bin/makeblastdb tools
$ chmod +x tools/blastp tools/makeblastdb
blastp and makeblastdb should be the same version.
mafft: MAFFT v7.475
Available here: https://mafft.cbrc.jp/alignment/software/.
After compilation, copy mafft to the tools directory.
$ which mafft
/usr/local/bin/mafft
$ cp /usr/local/bin/mafft tools
trimal: trimAl v1.4.1
Available here: https://github.com/inab/trimal/releases/tag/v1.4.1.
cd to trimAl/source and type make.
$ make
$ cp trimal tools
pal2nal.pl: pal2nal.v14
Available here: http://www.bork.embl.de/pal2nal/#Download.
$ cp pal2nal.pl tools
APE, Rscript, R: R
R(4.0.1) is available from here.
APE can be installed from the R console as follows:
> install.packages("ape")
Rscript is installed automatically with R. After installation, copy Rscript into the tools directory.
$ which Rscript
$ cp /usr/local/bin/Rscript tools
On macOS Big Sur, "which Rscript" may not show the path. Instead, run the following in the R console:
> R.home()
[1] "/Library/Frameworks/R.framework/Resources"
Then you can confirm the location of Rscript from the Terminal application (on Mac, Applications > Utilities):
$ cd /Library/Frameworks/R.framework/
$ ls
R
Rscript
....
Notung.jar: NOTUNG 2.9
Available here: http://amberjack.compbio.cs.cmu.edu/Notung/download29.html.
Download the Notung-2.9.x.x.zip file. Rename Notung-2.9.jar to Notung.jar while copying it into the tools directory. Java is needed to run Notung.
$ cp Notung-2.9.jar tools/Notung.jar
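Once all files have been copied, it can be useful to confirm that the tools directory is complete before starting a long analysis. The following is a minimal sketch, not part of ORTHOSCOPE*: the function name check_tools is hypothetical, and it only assumes the seven file names listed above and a directory called tools.

```python
import os

# The seven files named in the dependency list above.
REQUIRED_TOOLS = ["blastp", "makeblastdb", "mafft", "trimal",
                  "pal2nal.pl", "Rscript", "Notung.jar"]


def check_tools(tools_dir="tools"):
    """Return a list of problems found in tools_dir (empty list if none)."""
    problems = []
    for name in REQUIRED_TOOLS:
        path = os.path.join(tools_dir, name)
        if not os.path.isfile(path):
            problems.append(f"{name}: missing")
        # Notung.jar is run through Java, so it does not need the executable bit.
        elif not name.endswith(".jar") and not os.access(path, os.X_OK):
            problems.append(f"{name}: not executable (try chmod +x)")
    return problems


if __name__ == "__main__":
    for line in check_tools() or ["all seven dependencies found"]:
        print(line)
```

This sketch checks only that the files exist and are executable; it does not verify versions or that R has the APE package installed.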
|
|
Download the example from GitHub: https://github.com/jun-inoue/ORTHOSCOPE_STAR. |
|
The downloaded file contains teleost data with the taxon sampling employed in Satoh et al. (2009).
|
Using the Terminal, cd to where you downloaded the package. ORTHOSCOPE* has three modes that can be specified using the Mode parameter in the control.txt file. |
|
control.txt |
Mode E: Estimating gene trees and orthogroups.
Type:
python3 orthoscope_star.py ENSORLT00000003136.1
Multiple analyses can be tested as follows:
sh command.sh
For genome-wide analyses using multiple query sequences, use a supercomputer with a job scheduler. A script to generate job scripts is described below (see the "Analysis using job schedulers" section).
Mode S: Summarizing results.
This mode uses the result files of Mode E analyses, which are saved in the outdir directory for each query. Mode S can be run on your PC by using the result directory downloaded from the supercomputer.
python3 orthoscope_star.py list_geneIDs.txt
Then, a summary is saved in the results.csv file.
|
|
Mode D: Drawing gene trees.
Draw gene trees by using a result derived from the Mode E analysis. Please make sure that >Mode is set to D in the control.txt file.
python3 orthoscope_star.py ENSORLT00000003136.1
Double-click the ENSORLT00000003136.1.html file to check the estimated gene trees in your web browser: |
|
|
The species tree hypothesis can be found in my GitHub repository. For the "SpeciesTree" parameter in the control.txt file, the Newick format can be constructed by using tree-manipulating programs such as TreeGraph 2. |
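For small trees, a Newick string can also be written from a script. The sketch below is an illustration only: the helper names to_newick and species_tree are hypothetical, the three-species topology is just an example, and it produces a bare topology without node labels or branch lengths. Check the control.txt file in the downloaded example for the exact form that the SpeciesTree parameter expects.

```python
def to_newick(node):
    """Convert a nested tuple of species names into a Newick subtree string."""
    if isinstance(node, str):
        return node  # a leaf: just the species name
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def species_tree(root):
    """Return a complete Newick string, terminated by a semicolon."""
    return to_newick(root) + ";"


# Illustrative topology using species names that appear in the databases.
example = (("Danio-rerio", "Oryzias-latipes"), "Homo-sapiens")
print(species_tree(example))  # ((Danio-rerio,Oryzias-latipes),Homo-sapiens);
```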
|
|
|
Version difference between blastp and makeblastdb
[inouejunmp:ORTHOSCOPE_STAR-main]$ ./orthoscope_star.py ENSORLT00000004629.1
############### ENSORLT00000004629.1 ################
##### SpeciesTree draw ######
##### 1st tree: BLAST ######
BLAST Database error: Error: Not a valid version 4 database.
BLAST Database error: Error: Not a valid version 4 database.
......
Use the same version for blastp and makeblastdb.
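The version match can be checked before running an analysis. The sketch below assumes that the BLAST+ programs report their version with the -version option (printing a line such as "blastp: 2.11.0+", where the number here is only an example) and that both programs are in the tools directory. The function names are hypothetical and not part of ORTHOSCOPE*.

```python
import re
import subprocess


def parse_blast_version(text):
    """Extract 'x.y.z' from BLAST+ '-version' output such as 'blastp: 2.11.0+'."""
    match = re.search(r"(\d+\.\d+\.\d+)", text)
    return match.group(1) if match else None


def tool_version(path):
    """Run '<path> -version' and return the parsed version string (or None)."""
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    return parse_blast_version(result.stdout)


def same_blast_version(blastp="tools/blastp", makeblastdb="tools/makeblastdb"):
    """Return True only if both tools report the same, parseable version."""
    v_blastp, v_makeblastdb = tool_version(blastp), tool_version(makeblastdb)
    return v_blastp is not None and v_blastp == v_makeblastdb
```

Calling same_blast_version() from the ORTHOSCOPE* directory returns False when the two versions differ, which is the situation that produces the database error shown above.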
|
Analysis using job schedulers |
|
ORTHOSCOPE* can be run as array jobs on job schedulers.
SGE
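Split the gene ID list (list_geneIDs_Oryzias_latipes.txt in this example) into several files with the split command. The -l option sets the number of gene IDs per file: the command below writes 100 per file, and -l 20 would write 20.
$ split -l 100 -a 3 -d list_geneIDs_Oryzias_latipes.txt --numeric-suffixes=1 --additional-suffix=.txt list_geneIDs-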
Make a new directory, list_geneIDs_split, and move list_geneIDs-00* files to this new directory.
$ mkdir list_geneIDs_split
$ mv list_geneIDs-00* list_geneIDs_split
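Where the split command is unavailable or lacks these options, the same splitting and moving can be sketched in Python. This is a hypothetical helper, not part of ORTHOSCOPE*; it assumes one gene ID per line, skips blank lines, and writes files named list_geneIDs-001.txt, list_geneIDs-002.txt, and so on, which is the naming that the job file in this section reads.

```python
import os


def split_gene_ids(list_file, out_dir="list_geneIDs_split", ids_per_file=100,
                   prefix="list_geneIDs-"):
    """Write consecutive chunks of gene IDs to <prefix>001.txt, <prefix>002.txt, ..."""
    with open(list_file) as handle:
        gene_ids = [line.strip() for line in handle if line.strip()]
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(gene_ids), ids_per_file):
        # Chunk numbers start at 1 and are zero-padded to three digits,
        # matching PADDING_N in the job file.
        number = start // ids_per_file + 1
        path = os.path.join(out_dir, f"{prefix}{number:03d}.txt")
        with open(path, "w") as out:
            out.write("\n".join(gene_ids[start:start + ids_per_file]) + "\n")
        paths.append(path)
    return paths
```

The number of files returned should match the task range given to the scheduler (the -t option in the job file).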
Then qsub the following job file.
#$ -S /bin/bash
#$ -cwd
#$ -t 1-7
#$ -tc 7
#$ -l s_vmem=32G
#$ -l mem_req=32G
#$ -N an_array_job
echo running on `hostname`
echo starting at
date
echo -e ""
#echo SGE_TASK_ID $SGE_TASK_ID
#SGE_TASK_ID=`expr $SGE_TASK_ID - 1`
echo SGE_TASK_ID $SGE_TASK_ID
PADDING_N=$(printf "%03d" ${SGE_TASK_ID})
echo PADDING_N $PADDING_N
for gene in `cat ./list_geneIDs_split/list_geneIDs-${PADDING_N}.txt`
do
echo "########### $gene ###########"
python3 /home/jun-inoue/ORTHOSCOPE_STAR-main/orthoscope_star_v1.1.7.py ${gene}
done
echo -e ""
echo ending at
date
For a test run, qsub the following batch file.
#!/bin/bash
#$ -cwd
#$ -V
#$ -l short
#$ -l d_rt=00:10:00
#$ -l s_rt=00:10:00
#$ -l s_vmem=16G
#$ -l mem_req=16G
#$ -N an_batchJob
#$ -S /bin/bash
python3 /home/jun-inoue/ORTHOSCOPE_STAR-main/orthoscope_star_v1.1.7.py ENSORLT00000004629.1
|
Slurm
The downloaded example file contains 910_run_scheduler.py. This Python script produces files for array jobs on the Slurm job scheduler.
1. To make query files (containing query gene IDs) and a job file for slurm jobs, type:
python3 910_run_scheduler.py
Query files: 910_list*.txt
Job file: 920_arrayJob.slurm
2. Then run slurm jobs:
sbatch 920_arrayJob.slurm
|
Exploring the result file |
|
Results.csv file
The results.csv file produced by the Mode S analysis contains the integrated results derived from multiple query analyses. This file is best viewed in a spreadsheet program like Excel or LibreOffice Calc. Your spreadsheet program may open the file correctly on its own, or you may need to specify explicitly that it is comma-delimited.
In this table, each row contains the result for one orthogroup, estimated from one query sequence.
Terms used in the results.csv file
Column name | Description
QueryGeneID | Query gene ID
QueryLength | Query gene length (bp)
SpeciesWithGeneFunction | Gene ID of species with a gene function to represent functions of the orthogroup members
BS_of_orthogroupBasalNode | Bootstrap value of orthogroup basal node
2ndGeneTree | Presence/absence of 2nd gene tree
BHnum_Medaka | BLAST hit numbers of Medaka
OGnum_Medaka | Orthogroup-member numbers of Medaka
BS_of_Teleostei_monophyly | BS value of Teleostei monophyly. Note that the other basal teleost node not in the query gene lineage (e.g., the node marked with 87_Teleostei_D=N, below) is not counted.
dupStatus_Teleostei | Duplication status (D=Y [gene duplication] or D=N [speciation]) of Teleostei
Sister_of_Teleostei | Estimated sistergroup of Teleostei
BS_with_Teleostei | BS value of Teleostei vs sistergroup
|
|
Variable | Description
D=Y | Gene node status as gene duplication.
D=N | Gene node status as speciation.
NoGeneNode | No gene node was identified for the species node shown in its column.
leaf | Node consisting of one sequence.
r | Rearranged gene nodes due to bootstrap values lower than the threshold defined in the BSthreshold option.
No_orthogroup | No orthogroup was delineated in the 1st gene tree.
|
By sorting this file with Excel/LibreOffice options, users can count the number of orthogroups fulfilling a BS value criterion, monophyletic teleost gene groups, etc.
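The same kind of count can be made with a short script instead of a spreadsheet. The sketch below uses only the Python standard library and the column name BS_of_Teleostei_monophyly from the table above. It assumes that supported cells hold a plain number and treats anything else (for example NoGeneNode) as not supported; the function names are hypothetical.

```python
import csv


def to_number(value):
    """Return value as a float, or None for non-numeric entries such as 'NoGeneNode'."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def count_supported(rows, column, threshold=70):
    """Count rows whose numeric value in `column` is greater than `threshold`."""
    count = 0
    for row in rows:
        value = to_number(row.get(column))
        if value is not None and value > threshold:
            count += 1
    return count


def read_results(path="results.csv"):
    """Read results.csv into a list of dictionaries keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```

For example, count_supported(read_results(), "BS_of_Teleostei_monophyly") counts the orthogroups whose teleost gene clade has more than 70% bootstrap support.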
|
Sister group evaluation
To evaluate sistergroup hypotheses in the species tree, users should count the number of genes manually using the results.csv file with Excel or similar software.
|
↓
|
In the results.csv file (Table S3), the following bold numbers from Inoue (2022) are highlighted in color:
Sistergroup Evaluation. ....Results obtained for the Case Study 2 data set can be used to evaluate three sistergroup hypotheses for the Percomorpha: (A) Protacanthopterygii, (B) Otophysi, (C) Protacanthopterygii + Otophysi. Among 6,269 orthogroups having teleost-gene clades supported by >70% bootstrap values, 4,752 orthogroups showed a Percomorph gene clade with >70% bootstrap probability (BS_of_Percomorpha_monophyly in supplementary table S3, Supplementary Material online). Of these, 2,850 orthogroups supported one of the three sistergroup hypothesis (Sister_of_Percomorpha) with >70% bootstrap support (BS_with_Percomorpha). As expected, the number of orthogroups supporting the Protacanthopterygii hypothesis (2,520) was much larger than the number of remaining orthogroups (Otophysi, 66; Protacanthopterygii + Otophysi, 264).
The orthogroup counting is as follows:
Traces of TGD. Among these 11,539 orthogroups, 6,269 orthogroups contained monophyletic teleost-gene clades supported by >70% BS values (BS_of_Teleostei_monophyly in supplementary table S3, Supplementary Material online).....
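The successive filters described in these passages can be sketched as a tally over the rows of the results file. The column names are those quoted from Inoue (2022); the function names are hypothetical, the rows are assumed to be dictionaries as produced by csv.DictReader, and the exact labels stored in Sister_of_Percomorpha are not assumed here, so the tally simply groups whatever strings the file contains.

```python
from collections import Counter


def bs_above(row, column, threshold):
    """True if the cell holds a number greater than threshold; False otherwise."""
    try:
        return float(row.get(column, "")) > threshold
    except (TypeError, ValueError):
        return False


def tally_sister_groups(rows, threshold=70):
    """Count supported sistergroup hypotheses for Percomorpha, one row per orthogroup."""
    tally = Counter()
    for row in rows:
        if (bs_above(row, "BS_of_Teleostei_monophyly", threshold)
                and bs_above(row, "BS_of_Percomorpha_monophyly", threshold)
                and bs_above(row, "BS_with_Percomorpha", threshold)):
            tally[row.get("Sister_of_Percomorpha", "")] += 1
    return tally
```

The result maps each sistergroup label to the number of orthogroups that pass all three bootstrap thresholds, which corresponds to the manual counting described above.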
|
(2024/10/16)
|
|
Extracting genes from the results.csv file
extract_genes_from resultFile.zip
By analyzing the results.xlsx file generated with Mode S and Excel, this script extracts orthogroups fulfilling the conditions for:
- environmental gene markers in Case Study 2
(1:1 single copy genes that have lost one of a pair after teleost genome duplication, but before teleost diversification).
- horizontal gene transfer in Case Study 3
(genes with orthologs in all sequenced tunicate genomes, but absent in other metazoan genomes).
Users need pandas (Python library) to run this script.
python3 extract_genes.py
It takes a few minutes to produce results.
(2021/7/9)
|
|
|
ORTHOSCOPE* employs a genome-scale protein-coding gene database (coding and amino acid sequences: gene models) for each species. To count the number of orthologs in each species, only the longest sequence should be used when transcript variants exist for a single locus. Such gene models can be downloaded from the ORTHOSCOPE website: |
|
|
|
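When preparing gene models for an additional species, the longest-sequence rule can be applied as in the sketch below. It is an illustration under stated assumptions: the input is an iterable of (locus ID, sequence ID, sequence) tuples that the user has already parsed from an annotation, and the function name and the IDs in the example are hypothetical.

```python
def longest_per_locus(records):
    """Keep the longest sequence for each locus.

    records: iterable of (locus_id, sequence_id, sequence) tuples.
    Returns {locus_id: (sequence_id, sequence)}; on ties the first one seen is kept.
    """
    best = {}
    for locus_id, seq_id, seq in records:
        if locus_id not in best or len(seq) > len(best[locus_id][1]):
            best[locus_id] = (seq_id, seq)
    return best


# Hypothetical records: two transcript variants for locus1, one for locus2.
records = [
    ("locus1", "locus1.t1", "MKT"),
    ("locus1", "locus1.t2", "MKTAYIAK"),
    ("locus2", "locus2.t1", "MSL"),
]
for locus, (seq_id, seq) in longest_per_locus(records).items():
    print(locus, seq_id, len(seq))
```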
Data used in Inoue (2022)
# Case Study 1 (Fig. 4A, 9 species)
database_SHN.tar.gz (154 MB).
Control.file
Species name | Ver. in OS | Species name | Ver. in OS
Drosophila-melanogaster | EnsMet38 | Danio-rerio | Ens91
Ciona-intestinalis | Ens91 | Gasterosteus aculeatus | Ens91
Xenopus-tropicalis | Ens91 | Tetraodon-nigroviridis | Ens91
Gallus-gallus | Ens102 | Oryzias-latipes | Ens91
Homo-sapiens | Ens102 | |
|
|
# Case Study 2 (Fig. 4B, 15 species)
Control.file
Species name | Ver. in OS | Species name | Ver. in OS
Drosophila-melanogaster | EnsMet38 | Lepisosteus-oculatus | Ens91
Acanthaster-planci | OIST-S | Pangasianodon-hypophthalmus | RefSeq100
Branchiostoma-floridae | RefSeq89 | Danio-rerio | Ens91
Xenopus-tropicalis | Ens91 | Esox-lucius | RefSeq
Anolis-carolinensis | Ens91 | Oncorhynchus-kisutch | RefSeq
Gallus-gallus | Ens102 | Tetraodon-nigroviridis | Ens91
Homo-sapiens | Ens102 | Oryzias-latipes | Ens91
Acipenser-ruthenus | RefSeq | |
|
|
# Case Study 3 (Fig. 5, 49 metazoans)
Control.file
|
|
Inoue J. 2022. ORTHOSCOPE*: a phylogenetic pipeline for inferring gene histories from genome-wide data. Molecular Biology and Evolution, 39(1):msab301. |
|
jinoueATg.ecc.u-tokyo.ac.jp |