|
|||
18 Dec 2012 Jun Inoue
|
|||
This page shows how to collect ortholog sequences (cDNA and amino acid) that have 1:1 relationships between human and other species from the Ensembl database, for use in phylogenetic analysis. Throughout, we assume that the orthology information in Ensembl is reliable. Download the working directory, EnsOrthoCollection.tar.gz, which includes example files and Perl scripts. |
|||
|
|||
Before collecting the sequences, we need to make the orthology table. This table lists the Ensembl gene IDs of the orthologs between human and the other species.
Open the Ensembl top page and choose BioMart from the menu bar at the top of the site.
Human protein gene IDs
Click "Dataset" in the blue left column and select the database. Here I chose "Ensembl Genes 69" and "Homo sapiens genes (GRCh37.p8)". Then click "Filters" in the left column and check "ID list limit" in the "GENE" section. Set the pull-down menu to "Ensembl Gene ID(s) [e.g. ENSG000xxx]" as follows. In BioMart, always check your current settings in the blue left column (Dataset, Filters, Attributes, etc.). |
|||
Click "Attributes" in the left column and select "GENE". Then select "Ensembl Gene ID" from the "Ensembl" section as follows: |
|||
Press the "Count" button at the upper left and check the number of genes. Then press the "Results" button at the upper left and download the mart_export.txt file, which contains all human protein-coding gene IDs. Make sure you have checked the two "Unique results only" boxes as follows. |
|||
Check the downloaded file mart_export.txt. It should look like the "EnsOrthoCollection/mart_export_HumanTEST.txt" file. | |||
Orthologous protein gene IDs from the other species
Next we collect, for each species, the IDs of the genes orthologous to the human genes. Press the "Attributes" button in the left column and check "Homologs" in the upper section. For Chimp, choose "Chimp Ensembl GeneID" and "Homology Type" as follows. Then check the number of genes and download the mart_export.txt file in the same way as in the human section (see above). Check the downloaded file. Its format should look like "EnsOrthoCollection/mart_export_ChimpTEST.txt". |
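To make the idea of "1:1 relationships" concrete, here is a minimal Python sketch of how the rows of such an export could be filtered. It is not the Perl script in the working directory. It assumes a tab-separated file with three columns (human gene ID, ortholog gene ID, homology type) and assumes that 1:1 orthologs are labeled "ortholog_one2one"; check both against your own mart_export.txt. The chimp gene ID in the sample is a made-up placeholder.

```python
import csv
import io

def one_to_one_orthologs(handle):
    """Return {human_gene_id: ortholog_gene_id} for rows marked one-to-one.

    Assumed layout (not confirmed by the tutorial): tab-separated columns
    human gene ID, ortholog gene ID, homology type, with one header row.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    pairs = {}
    for row in reader:
        if len(row) < 3:
            continue  # incomplete row
        human_id, other_id, homology = row[0], row[1], row[2]
        # Keep only rows with an ortholog ID and a one-to-one label.
        if other_id and homology == "ortholog_one2one":
            pairs[human_id] = other_id
    return pairs

# Tiny in-memory example; the chimp ID is a hypothetical placeholder.
sample = (
    "Ensembl Gene ID\tChimpanzee Ensembl Gene ID\tHomology Type\n"
    "ENSG00000000003\tENSPTRG00000000001\tortholog_one2one\n"
    "ENSG00000000005\t\t\n"
    "ENSG00000000419\tENSPTRG00000000002\tortholog_one2many\n"
)
chimp_pairs = one_to_one_orthologs(io.StringIO(sample))
print(chimp_pairs)
```

With a real file, pass `open("mart_export.txt", newline="")` instead of the in-memory sample.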
|||
|
|||
Download the cDNA and amino acid databases from the following FTP site and decompress them. I log in to the FTP site as a guest user from my Mac.
For example, you can download the human cDNA database from
and corresponding amino acid database from
Download the cDNA and amino acid databases for Pan troglodytes (chimp) and Canis familiaris (dog) in the same way. Save the 6 downloaded files (amino acid and cDNA databases for 3 species) into the EnsOrthoCollection directory. | |||
| |||
Merge ortholog information of multiple species
Make sure the command above is typed on a single line in your terminal. Then try the 2nd and 3rd commands written in 010_orthoTableMaker.pl. | |||
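The merging step can be pictured with the following Python sketch. It is an illustration only: the actual output layout of 010_orthoTableMaker.pl is not shown on this page, so the column order, the "NA" placeholder for a missing ortholog, and the non-human gene IDs are all assumptions made for the example.

```python
def merge_ortholog_tables(human_ids, per_species):
    """Combine per-species 1:1 ortholog maps into one table.

    human_ids:   list of human gene IDs, in the order to be written.
    per_species: {species_name: {human_gene_id: ortholog_gene_id}}
    Returns (header, rows); a missing ortholog is written as "NA"
    (a placeholder chosen for this sketch).
    """
    species = list(per_species)
    header = ["Human"] + species
    rows = []
    for gene in human_ids:
        row = [gene] + [per_species[sp].get(gene, "NA") for sp in species]
        rows.append(row)
    return header, rows

# Hypothetical input: the non-human IDs are placeholders.
human_ids = ["ENSG00000000003", "ENSG00000000005"]
per_species = {
    "Chimp": {"ENSG00000000003": "ENSPTRG00000000001"},
    "Dog": {"ENSG00000000003": "ENSCAFG00000000001"},
}
header, rows = merge_ortholog_tables(human_ids, per_species)
for row in [header] + rows:
    print("\t".join(row))
```

Keeping one row per human gene makes it easy to see later which genes lack an ortholog in one or more species.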
| |||
We will retrieve the amino acid sequences from the downloaded databases according to the orthology table (020_orthoFile.txt). Note that this program automatically chooses the longest transcript (amino acid sequence) among the records with the same gene ID. Before running the following analysis, rename the output directory, 020_out_multiProtFiles, or simply delete it if it exists.
020_orthoFile.txt: Orthology information file. | |||
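The "longest transcript per gene" rule can be sketched in Python as follows. This is not the author's Perl program. It assumes that each FASTA header in the Ensembl amino acid file carries a "gene:ENSG..." field, as in ">ENSP... pep:known ... gene:ENSG... transcript:ENST..."; verify this against your downloaded file. The protein IDs and sequences below are invented for the example.

```python
import re

def read_fasta(lines):
    """Yield (header_without_'>', sequence) for each FASTA record."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def longest_protein_per_gene(lines):
    """Return {gene_id: (protein_id, sequence)}, keeping the longest sequence."""
    best = {}
    for header, seq in read_fasta(lines):
        match = re.search(r"gene:(\S+)", header)  # assumed header field
        if not match:
            continue
        gene = match.group(1)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header.split()[0], seq)
    return best

# Invented records: two proteins of the same gene, one of another gene.
sample = """>ENSP_A pep:known gene:ENSG00000000003 transcript:ENST_A
MKV
>ENSP_B pep:known gene:ENSG00000000003 transcript:ENST_B
MKVLA
AG
>ENSP_C pep:known gene:ENSG00000000419 transcript:ENST_C
MT
""".splitlines()
longest = longest_protein_per_gene(sample)
print(longest)
```

When two proteins of a gene have the same length, this sketch keeps the first one encountered; the original script may break ties differently.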
| |||
Based on the amino acid files, each of which includes several species, we will retrieve the corresponding cDNA sequences from the database. Rename or delete 030_out_multiCDNAFiles if it exists.
020_out_multiProtFiles: Directory including amino acid files. | |||
Type the 2nd and 3rd commands shown in the program. | |||
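One plausible way to pair each selected protein with its cDNA is through the transcript ID, sketched below in Python. How the provided Perl script actually does the matching is not described here, so treat this as an assumption: it presumes that the protein header contains a "transcript:ENST..." field and that the cDNA records are keyed by that transcript ID. All IDs and sequences are placeholders.

```python
import re

def transcript_id(protein_header):
    """Extract the transcript ID from a protein FASTA header, or None."""
    match = re.search(r"transcript:(\S+)", protein_header)
    return match.group(1) if match else None

def cdna_for_proteins(protein_headers, cdna_by_transcript):
    """Return {protein_id: cDNA sequence} for proteins whose transcript is found."""
    found = {}
    for header in protein_headers:
        tid = transcript_id(header)
        if tid in cdna_by_transcript:
            found[header.split()[0]] = cdna_by_transcript[tid]
    return found

# Placeholder data for illustration only.
protein_headers = [
    "ENSP_B pep:known gene:ENSG00000000003 transcript:ENST_B",
    "ENSP_X pep:known gene:ENSG00000000005 transcript:ENST_X",
]
cdna_by_transcript = {"ENST_B": "GGCATGAAAGTTTAA"}
matched = cdna_for_proteins(protein_headers, cdna_by_transcript)
print(matched)
```

Proteins without a matching cDNA record are silently skipped in this sketch; in practice you may prefer to report them.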
| |||
Align all amino acid files using MAFFT. Install MAFFT before running the following command. The program expects MAFFT to be callable from the current directory; otherwise, rewrite the corresponding line of the following program.
020_out_multiAAFiles: Directory including amino acid files. | |||
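For readers who prefer to see the loop spelled out, the batch alignment can be sketched in Python as below. This is not the provided program. It assumes that `mafft` is on your PATH, that the input files end in ".txt", and that `--auto` is an acceptable strategy; the directory names in the usage note are the ones listed on this page.

```python
import subprocess
from pathlib import Path

def mafft_command(fasta_path, mafft="mafft"):
    """Build the MAFFT command line; MAFFT writes the alignment to stdout."""
    return [mafft, "--auto", str(fasta_path)]

def align_directory(in_dir, out_dir, mafft="mafft"):
    """Align every *.txt FASTA file in in_dir and write results to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(in_dir).glob("*.txt")):
        with open(out_dir / fasta.name, "w") as out:
            # check=True raises CalledProcessError if MAFFT fails on a file.
            subprocess.run(mafft_command(fasta, mafft), stdout=out, check=True)

# Show the command that would be run for one file.
print(" ".join(mafft_command("ENSG00000000003.txt")))
```

To run it on the example data, call `align_directory("020_out_multiAAFiles", "040_out_aligneAAfileDir")` from the EnsOrthoCollection directory. If MAFFT is installed elsewhere, pass its full path as the `mafft` argument.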
| |||
Align the cDNA sequences according to the aligned amino acid sequences using PAL2NAL. PAL2NAL is included in this example directory. PAL2NAL automatically assigns the corresponding codon sequence even if the input DNA sequence contains UTRs or polyA tails.
040_out_aligneAAfileDir: Directory including aligned amino acid files.
For "2. ENSG00000000005.txt." and "3. ENSG00000000419.txt.", PAL2NAL will return the error message as follows:
This means that no sequence was found in the amino acid file. In this example, these 2 human genes do not have any orthologs, so this is the expected result for them.
Type,
I am not sure this works on Windows. | |||
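To illustrate what aligning cDNA "according to the aligned amino acid sequences" means, here is a simplified Python sketch of the core idea: each aligned residue is replaced by its codon and each gap by three gap characters. It is not PAL2NAL itself. Unlike PAL2NAL, it assumes that the DNA is already a clean coding sequence starting at the first codon, with no UTRs or polyA tail, and it does not check that the codons actually encode the residues.

```python
def thread_codons(aligned_protein, cds):
    """Map a gapped protein sequence onto its coding sequence, codon by codon."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = aligned_protein.replace("-", "")
    if len(codons) < len(residues):
        raise ValueError("coding sequence is shorter than the protein")
    out, k = [], 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")      # one protein gap becomes three DNA gaps
        else:
            out.append(codons[k])  # next unused codon
            k += 1
    return "".join(out)

print(thread_codons("M-KV", "ATGAAAGTT"))  # ATG---AAAGTT
```

Because gaps are always inserted in multiples of three, the reading frame is preserved in the resulting codon alignment.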