ORTHOSCOPE* Instruction

Revised 4 Dec. 2024

ORTHOSCOPE* (star) is an analysis pipeline to infer evolutionary histories of genes at genome scale. By estimating orthogroups and gene trees for a complete set of protein coding genes, ORTHOSCOPE* evaluates:

- gene duplication events that occurred at species nodes.
- the presence or absence of genes in species lineages.
- the sister group of a focal species/group in the gene tree.

The code is derived from the ORTHOSCOPE web version. ORTHOSCOPE* is designed to be used on supercomputers (high-performance computing systems) for genome-scale analyses.

 

Installation
ORTHOSCOPE*
You can install ORTHOSCOPE* by downloading it from GitHub: https://github.com/jun-inoue/ORTHOSCOPE_STAR.

ORTHOSCOPE* is written in Python and requires Python 3 in your environment. I use ORTHOSCOPE* on Linux and Mac. Some modification would be needed to run it on Windows.

 

Dependencies

ORTHOSCOPE* requires seven dependencies in the tools directory: blastp, makeblastdb, mafft, trimal, pal2nal.pl, Rscript (which needs R and the APE package), and Notung.jar. On a Mac, newly downloaded software may need to be approved with "Open Anyway" in System Preferences > Security & Privacy.

 

blastp, makeblastdb: BLAST+
Available here: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/.
Download the appropriate file for your computer (for Mac users, ncbi-blast-x.xx.x+-x64-macosx.tar.gz). Copy the blastp and makeblastdb files into the tools directory, then change their permissions so that they can be executed.

$ cp DOWNLOADED_DIR/bin/blastp tools
$ cp DOWNLOADED_DIR/bin/makeblastdb tools

blastp and makeblastdb should be the same version.
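The version match can also be checked from Python. The following is a minimal sketch, assuming that both programs sit in the tools directory and that "-version" prints a first line such as "blastp: 2.12.0+"; the function names are hypothetical and are not part of ORTHOSCOPE*.

```python
import re
import subprocess


def parse_blast_version(version_output):
    """Return the version string (e.g. '2.12.0+') found in '-version' output.

    Assumes the output contains '<program>: <major>.<minor>.<patch>+'.
    Returns None when no version string is found.
    """
    match = re.search(r":\s*(\d+\.\d+\.\d+\+?)", version_output)
    return match.group(1) if match else None


def blast_version(program_path):
    """Run a BLAST+ program with -version and parse the reported version."""
    result = subprocess.run([program_path, "-version"],
                            capture_output=True, text=True, check=True)
    return parse_blast_version(result.stdout)


def same_blast_versions(tools_dir="tools"):
    """True when tools/blastp and tools/makeblastdb report the same version."""
    blastp = blast_version(f"{tools_dir}/blastp")
    makeblastdb = blast_version(f"{tools_dir}/makeblastdb")
    return blastp is not None and blastp == makeblastdb
```

Calling same_blast_versions("tools") from the ORTHOSCOPE* directory returns False when the two versions differ or cannot be parsed.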

mafft: MAFFT v7.475
Available here: https://mafft.cbrc.jp/alignment/software/.
After compilation, copy mafft to the tools directory.

$ which mafft
/usr/local/bin/mafft
$ cp /usr/local/bin/mafft tools

trimal: trimAl v1.4.1
Available here: https://github.com/inab/trimal/releases/tag/v1.4.1.
cd to trimAl/source, and type make.

$ make
$ cp trimal tools

pal2nal.pl: pal2nal.v14
Available here: http://www.bork.embl.de/pal2nal/#Download.

$ cp pal2nal.pl tools

APE, Rscript, R: R
R (4.0.1) is available from CRAN (https://cran.r-project.org/).
APE can be installed from the R console as follows:

> install.packages("ape")

Rscript is installed automatically when you install R. After installation, copy Rscript into the tools directory.

$ which Rscript
$ cp /usr/local/bin/Rscript tools

On macOS Big Sur, "which Rscript" may not show the path. Instead, run the following command in the R console:

> R.home()
[1] "/Library/Frameworks/R.framework/Resources"

Then you can locate Rscript in that directory from the Terminal application (on Mac, Applications > Utilities):

$ cd /Library/Frameworks/R.framework/Resources
$ ls
R
Rscript ....

 

Notung.jar: NOTUNG 2.9
Available here: http://amberjack.compbio.cs.cmu.edu/Notung/download29.html.
Download the Notung-2.9.x.x.zip file. Rename Notung-2.9.jar to Notung.jar while copying it into the tools directory. Java is needed to run Notung.

$ cp Notung-2.9.jar tools/Notung.jar
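Once all files have been copied, the tools directory can be checked before starting an analysis. The following is a minimal sketch, assuming the seven file names listed under Dependencies; the function name is hypothetical and the check only tests for presence, not for correct versions.

```python
import os

# File names expected in the tools directory, as listed under Dependencies.
REQUIRED_TOOLS = ["blastp", "makeblastdb", "mafft", "trimal",
                  "pal2nal.pl", "Rscript", "Notung.jar"]


def missing_tools(tools_dir="tools"):
    """Return the names of required files that are absent from tools_dir."""
    return [name for name in REQUIRED_TOOLS
            if not os.path.isfile(os.path.join(tools_dir, name))]


if __name__ == "__main__":
    missing = missing_tools("tools")
    if missing:
        print("Missing from tools:", ", ".join(missing))
    else:
        print("All seven dependencies were found in tools.")
```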

 

Usage
Download the example from GitHub: https://github.com/jun-inoue/ORTHOSCOPE_STAR.

The downloaded file contains teleost data with the taxon sampling employed in Satoh et al. (2009).

Using the Terminal, cd to where you downloaded the package. ORTHOSCOPE* has three modes that can be specified using the Mode parameter in the control.txt file.
control.txt

Mode E: Estimating gene trees and orthogroups.

Type:

python3 orthoscope_star.py ENSORLT00000003136.1

Multiple analyses can be tested as follows:

sh command.sh

For genome-wide analyses using multiple query sequences, use a supercomputer with a job scheduler. A script to generate job scripts can be found below (see the "Analysis using job schedulers" section).

Mode S: Summarizing results.

This mode uses result files derived from Mode E analyses. A Mode E analysis saves result files in the outdir directory for each query. Mode S can be run on your PC by using the result directory downloaded from the supercomputer.

python3 orthoscope_star.py list_geneIDs.txt

Then, a summary is saved in the results.csv file.


Mode D: Drawing gene trees.
Draw gene trees by using a result derived from the Mode E analysis. Please make sure that >Mode is set to D in the control.txt file.

python3 orthoscope_star.py ENSORLT00000003136.1

Double-click the ENSORLT00000003136.1.html file to check the estimated gene trees in your web browser.




Species tree hypothesis
A species tree hypothesis can be found in my GitHub repository. For the "SpeciesTree" parameter in the control.txt file, the Newick format can be constructed by using tree-manipulating programs such as TreeGraph 2.

Error handling

Version difference between blastp and makeblastdb

[inouejunmp:ORTHOSCOPE_STAR-main]$ ./orthoscope_star.py ENSORLT00000004629.1

############### ENSORLT00000004629.1 ################

##### SpeciesTree draw ######

##### 1st tree: BLAST ######

BLAST Database error: Error: Not a valid version 4 database.
BLAST Database error: Error: Not a valid version 4 database.

......

Use the same version for blastp and makeblastdb.

 

Analysis using job schedulers

ORTHOSCOPE* analyses can be run as array jobs on a job scheduler.

SGE

Split the list_geneIDs.txt file into several files. The -l option sets the number of gene IDs per file (100 in the command below).

$ split -l 100 -a 3 -d list_geneIDs_Oryzias_latipes.txt --numeric-suffixes=1 --additional-suffix=.txt list_geneIDs-

Make a new directory, list_geneIDs_split, and move list_geneIDs-00* files to this new directory.

$ mkdir list_geneIDs_split
$ mv list_geneIDs-00* list_geneIDs_split
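Where GNU split is not available, the same splitting can be sketched in Python. This is a minimal sketch, assuming one gene ID per line and the list_geneIDs-001.txt style of file name expected by the job file; the function name and the default of 100 IDs per file are assumptions.

```python
from pathlib import Path


def split_gene_ids(list_file, out_dir="list_geneIDs_split", per_file=100):
    """Write gene IDs from list_file into numbered files of per_file IDs each.

    Output files are named list_geneIDs-001.txt, list_geneIDs-002.txt, ...
    so that they match the zero-padded task IDs of an array job.
    Returns the list of written paths.
    """
    ids = [line.strip()
           for line in Path(list_file).read_text().splitlines()
           if line.strip()]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(ids), per_file):
        index = start // per_file + 1  # numbering starts at 1
        path = out / f"list_geneIDs-{index:03d}.txt"
        path.write_text("\n".join(ids[start:start + per_file]) + "\n")
        written.append(path)
    return written
```

The number of files returned gives the upper bound for the array job range (the "-t 1-N" line of the job file).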

Then qsub the following job file.

#$ -S /bin/bash
#$ -cwd
#$ -t 1-7
#$ -tc 7
#$ -l s_vmem=32G
#$ -l mem_req=32G
#$ -N an_array_job

echo running on `hostname`
echo starting at
date
echo -e ""
#echo SGE_TASK_ID $SGE_TASK_ID
#SGE_TASK_ID=`expr $SGE_TASK_ID - 1`
echo SGE_TASK_ID $SGE_TASK_ID

PADDING_N=$(printf "%03d" ${SGE_TASK_ID})
echo PADDING_N $PADDING_N

for gene in `cat ./list_geneIDs_split/list_geneIDs-${PADDING_N}.txt`
do
    # Quote the banner: an unquoted # would start a shell comment.
    echo "########### $gene ###########"
    python3 /home/jun-inoue/ORTHOSCOPE_STAR-main/orthoscope_star_v1.1.7.py ${gene}
done

echo -e ""
echo ending at
date

For a test run, qsub the following batch file.

#!/bin/bash

#$ -cwd
#$ -V
#$ -l short
#$ -l d_rt=00:10:00
#$ -l s_rt=00:10:00
#$ -l s_vmem=16G
#$ -l mem_req=16G
#$ -N an_batchJob
#$ -S /bin/bash

python3 /home/jun-inoue/ORTHOSCOPE_STAR-main/orthoscope_star_v1.1.7.py ENSORLT00000004629.1

 

Slurm

The downloaded example file contains 910_run_scheduler.py. This Python script produces files for array jobs on the Slurm job scheduler.

1. To make query files (containing query gene IDs) and a job file for Slurm jobs, type:

python3 910_run_scheduler.py

Query files: 910_list*.txt
Job file: 920_arrayJob.slurm

2. Then run the Slurm jobs:

sbatch 920_arrayJob.slurm

 

Exploring the result file
results.csv file

The results.csv file produced by the mode S analysis contains the integrated results derived from multiple query analyses. This file is best viewed in a spreadsheet program like Excel or LibreOffice Calc. These files might be handled correctly on your computer automatically, or you might need to tell it explicitly that they are comma-delimited.

In this table, each line contains the result for one orthogroup, estimated from one query sequence.

Terms used in the results.csv file

Column name: Description
QueryGeneID: Query gene ID
QueryLength: Query gene length (bp)
SpeciesWithGeneFunction: Gene ID of species with a gene function to represent functions of the orthogroup members
BS_of_orthogroupBasalNode: Bootstrap value of orthogroup basal node
2ndGeneTree: Presence/absence of 2nd gene tree
BHnum_Medaka: BLAST hit numbers of Medaka
OGnum_Medaka: Orthogroup-member numbers of Medaka
BS_of_Teleostei_monophyly: BS value of Teleostei monophyly. Note that the other basal teleost node not in the query gene lineage (e.g., the node marked with 87_Teleostei_D=N, below) is not counted.
dupStatus_Teleostei: Duplication status (D=Y [gene duplication] or D=N [speciation]) of Teleostei
Sister_of_Teleostei: Estimated sistergroup of Teleostei
BS_with_Teleostei: BS value of Teleostei vs sistergroup

Variable: Description
D=Y: Gene node status as gene duplication.
D=N: Gene node status as speciation.
NoGeneNode: No gene node was identified for the species node shown in its column.
leaf: Node consisting of one sequence.
r: Rearranged gene nodes due to bootstrap values lower than the threshold defined in the BSthreshold option.
No_orthogroup: No orthogroup was delineated in the 1st gene tree.

By sorting this file with Excel/LibreOffice options, users can count the number of orthogroups fulfilling a BS value criterion, the number of monophyletic teleost gene groups, etc.
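The same kind of count can be made without a spreadsheet program. The following is a minimal sketch using Python's csv module, assuming that the BS columns hold plain numbers for supported nodes and text such as NoGeneNode otherwise; the function names and the threshold of 70 are assumptions for illustration.

```python
import csv


def to_float(value):
    """Convert a cell to float, or return None for text such as 'NoGeneNode'."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def count_supported(csv_path, bs_column="BS_of_Teleostei_monophyly",
                    threshold=70.0):
    """Count rows (orthogroups) whose value in bs_column exceeds threshold."""
    count = 0
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            bs = to_float(row.get(bs_column))
            if bs is not None and bs > threshold:
                count += 1
    return count
```

For example, count_supported("results.csv") counts the orthogroups with a teleost gene clade supported by more than 70% bootstrap probability, provided that the column is present in your file.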


Sister group evaluation

To evaluate sistergroup hypotheses in the species tree, users should count the number of genes manually using the results.csv file with Excel or similar software.




In the results.csv file (Table S3), the numbers shown in bold in the following passage from Inoue (2022) are highlighted with colors:

Sistergroup Evaluation. ....Results obtained for the Case Study 2 data set can be used to evaluate three sistergroup hypotheses for the Percomorpha: (A) Protacanthopterygii, (B) Otophysi, (C) Protacanthopterygii + Otophysi. Among 6,269 orthogroups having teleost-gene clades supported by >70% bootstrap values, 4,752 orthogroups showed a Percomorph gene clade with >70% bootstrap probability (BS_of_Percomorpha_monophyly in supplementary table S3, Supplementary Material online). Of these, 2,850 orthogroups supported one of the three sistergroup hypothesis (Sister_of_Percomorpha) with >70% bootstrap support (BS_with_Percomorpha). As expected, the number of orthogroups supporting the Protacanthopterygii hypothesis (2,520) was much larger than the number of remaining orthogroups (Otophysi, 66; Protacanthopterygii + Otophysi, 264).

The orthogroup counting is as follows:

Traces of TGD. Among these 11,539 orthogroups, 6,269 orthogroups contained monophyletic teleost-gene clades supported by >70% BS values (BS_of_Teleostei_monophyly in supplementary table S3, Supplementary Material online).....
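The counting described in the two passages above can be sketched in Python as a tally with successive bootstrap filters. This is a minimal sketch, assuming the column names quoted from Inoue (2022) exist in your results.csv file; the function names are hypothetical, and the labels found in the sistergroup column depend on your data.

```python
import csv
from collections import Counter


def to_float(value):
    """Convert a cell to float, or return None for non-numeric text."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def tally_sister_groups(csv_path,
                        sister_column="Sister_of_Percomorpha",
                        bs_columns=("BS_of_Teleostei_monophyly",
                                    "BS_of_Percomorpha_monophyly",
                                    "BS_with_Percomorpha"),
                        threshold=70.0):
    """Count sistergroup labels among rows passing every bootstrap filter.

    A row is kept only when each column in bs_columns holds a number
    greater than threshold.
    """
    tally = Counter()
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            values = [to_float(row.get(column)) for column in bs_columns]
            if all(v is not None and v > threshold for v in values):
                tally[row.get(sister_column, "")] += 1
    return tally
```

The returned Counter maps each sistergroup label to the number of orthogroups supporting it, which corresponds to the manual count made in a spreadsheet.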

(2024/10/16)

 

Additional scripts

Extracting genes from the results.csv file
extract_genes_from resultFile.zip
By analyzing the results.xlsx file generated with Mode S and Excel, this script extracts orthogroups fulfilling the conditions for:

- environmental gene markers in Case Study 2
(1:1 single copy genes that have lost one of a pair after teleost genome duplication, but before teleost diversification).
- horizontal gene transfer in Case Study 3
(genes with orthologs in all sequenced tunicate genomes, but absent in other metazoan genomes).

Users need pandas (Python library) to run this script.

python3 extract_genes.py

It takes a few minutes to produce results.
(2021/7/9)


Databases

ORTHOSCOPE* employs a genome-scale protein-coding gene database (coding and amino acid sequences: gene models) for each species. To count the number of orthologs in each species, only the longest sequence should be used when transcript variants exist for a single locus. Such gene models can be downloaded from the ORTHOSCOPE website:
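For users preparing their own gene models, keeping only the longest sequence per locus can be sketched as follows. This is a minimal sketch, assuming FASTA headers that carry an Ensembl-style "gene:ID" token; the function names are hypothetical, and a different locus_of function should be supplied for other header formats.

```python
import re


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def locus_from_header(header):
    """Assumed header format: a 'gene:ID' token; else use the first word."""
    match = re.search(r"gene:(\S+)", header)
    if match:
        return match.group(1)
    return (header.split() or [""])[0]


def longest_per_locus(records, locus_of=locus_from_header):
    """Keep the longest sequence for each locus (first one wins on ties)."""
    best = {}
    for header, seq in records:
        locus = locus_of(header)
        if locus not in best or len(seq) > len(best[locus][1]):
            best[locus] = (header, seq)
    return list(best.values())
```

The records returned by longest_per_locus can then be written back to a FASTA file, one sequence per locus.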



Data used in Inoue (2022)

# Case Study 1 (Fig. 4A, 9 teleosts)
database_SHN.tar.gz (154 MB).
Control.file

Species name             Ver. in OS   Species name             Ver. in OS
Drosophila-melanogaster  EnsMet38     Danio-rerio              Ens91
Ciona-intestinalis       Ens91        Gasterosteus-aculeatus   Ens91
Xenopus-tropicalis       Ens91        Tetraodon-nigroviridis   Ens91
Gallus-gallus            Ens102       Oryzias-latipes          Ens91
Homo-sapiens             Ens102

 

# Case Study 2 (Fig. 4B, 15 teleosts)
Control.file

Species name             Ver. in OS   Species name                 Ver. in OS
Drosophila-melanogaster  EnsMet38     Lepisosteus-oculatus         Ens91
Acanthaster-planci       OIST-S       Pangasianodon-hypophthalmus  RefSeq100
Branchiostoma-floridae   RefSeq89     Danio-rerio                  Ens91
Xenopus-tropicalis       Ens91        Esox-lucius                  RefSeq
Anolis-carolinensis      Ens91        Oncorhynchus-kisutch         RefSeq
Gallus-gallus            Ens102       Tetraodon-nigroviridis       Ens91
Homo-sapiens             Ens102       Oryzias-latipes              Ens91
Acipenser-ruthenus       RefSeq

 

# Case Study 3 (Fig. 5, 49 metazoans)
Control.file



Citation
Inoue J. 2022. ORTHOSCOPE*: a phylogenetic pipeline for inferring gene histories from genome-wide data. Molecular Biology and Evolution, 39(1):msab301.

jinoueATg.ecc.u-tokyo.ac.jp