TransDecoder

Transcript sequence prediction: TransDecoder

Revised November 4, 2020
Jun Inoue

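TransDecoder is software that finds the 1st codon position (the reading frame) in raw transcriptome DNA sequences and translates them into amino acids. It outputs the amino acid sequences together with the corresponding cDNA sequences.
　For the many sequences obtained by assembling transcriptome data from a next-generation sequencer, neither the direction of translation nor the 1st codon position is known. TransDecoder resolves both.
　Sequences judged untranslatable are removed, so the output contains fewer records than the input.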

TransDecoder is written as Perl scripts, so no compilation is needed. When I ran the analysis inside a Dropbox folder, the input file was not read correctly (November 2020).

Download

Example: TSA data

This example analyzes TSA data of Oryzias melastigma. For details on TSA (transcriptome shotgun assembly), see here.

Copy the downloaded GBFV01.1.fsa_nt file into the TransDecoder-TransDecoder-v5.3.0 directory. Then, in the terminal, move into the TransDecoder-TransDecoder-v5.3.0 directory and enter:

./TransDecoder.LongOrfs -t GBFV01.1.fsa_nt

The analysis finishes in a few minutes. As output, the translated amino acid sequences (longest_orfs.pep) and the corresponding coding sequences (longest_orfs.cds) are written to the GBFV01.1.fsa_nt.transdecoder_dir directory.
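To see how the number of records changed, the FASTA headers in the input and output files can be counted. The following is a minimal sketch, not part of TransDecoder itself; the function names count_fasta_records and report_counts are hypothetical helpers introduced only for this illustration.

```shell
#!/bin/bash
# Count FASTA records, i.e. lines that start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print the record counts of the input transcripts and of longest_orfs.pep.
# $1: transcript FASTA file, $2: the *.transdecoder_dir output directory.
report_counts() {
    local infile="$1"
    local outdir="$2"
    echo "input transcripts: $(count_fasta_records "$infile")"
    echo "predicted ORFs: $(count_fasta_records "$outdir/longest_orfs.pep")"
}

# Example call, to be run after TransDecoder.LongOrfs has finished:
# report_counts GBFV01.1.fsa_nt GBFV01.1.fsa_nt.transdecoder_dir
```

The example call is left commented out so that the block can be sourced safely before the analysis has been run.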

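Note: Depending on where the TransDecoder-TransDecoder-v5.3.0 directory is stored, output like the following may appear instead of a normal run. On macOS Ventura 13.2.1, the analysis did not run on the Desktop or in Downloads. In that case, move the directory somewhere else, such as the home directory, and try again (August 2023).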

[inouejun:TransDecoder-TransDecoder-v5.7.1]$ ./TransDecoder.LongOrfs -t GBFV01.1.fsa_nt
-- Skipping CMD: /Users/inouejun/Dropbox/My Mac (rrcs-172-254-99-49.nyc.biz.rr.com)/Desktop/TransDecoder-TransDecoder-v5.7.1/util/compute_base_probs.pl GBFV01.1.fsa_nt 0 > /Users/inouejun/Dropbox/My Mac (rrcs-172-254-99-49.nyc.biz.rr.com)/Desktop/TransDecoder-TransDecoder-v5.7.1/GBFV01.1.fsa_nt.transdecoder_dir/base_freqs.dat, checkpoint [/Users/inouejun/Dropbox/My Mac (rrcs-172-254-99-49.nyc.biz.rr.com)/Desktop/TransDecoder-TransDecoder-v5.7.1/GBFV01.1.fsa_nt.transdecoder_dir/__checkpoints_longorfs/base_freqs_file.ok] exists.
-skipping long orf extraction, already completed earlier as per checkpoint: /Users/inouejun/Dropbox/My Mac (rrcs-172-254-99-49.nyc.biz.rr.com)/Desktop/TransDecoder-TransDecoder-v5.7.1/GBFV01.1.fsa_nt.transdecoder_dir/__checkpoints_longorfs/TD.longorfs.ok
[inouejun:TransDecoder-TransDecoder-v5.7.1]$
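The path in the log above contains spaces and parentheses, and the messages report that steps were skipped because checkpoint files already existed. Whether either of these is the actual cause is not confirmed here, so the following is only a rough pre-flight check under that assumption. The function name path_looks_safe is hypothetical.

```shell
#!/bin/bash
# Return 0 if the path contains no whitespace or parentheses, 1 otherwise.
# Assumption: such characters may interfere with the run; this is a guess
# based on the path shown in the log, not a documented TransDecoder rule.
path_looks_safe() {
    case "$1" in
        *[[:space:]]*|*"("*|*")"*) return 1 ;;
        *) return 0 ;;
    esac
}

# Check the current working directory and print advice only.
if path_looks_safe "$PWD"; then
    echo "Working directory looks fine: $PWD"
else
    echo "Consider moving the TransDecoder directory, e.g. under the home directory."
fi

# If an earlier attempt left a *.transdecoder_dir with checkpoints behind,
# deleting it before rerunning may also be needed (again an assumption):
# rm -r GBFV01.1.fsa_nt.transdecoder_dir
```

The check only prints a message and changes nothing on disk; the rm line is left commented out on purpose.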

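Collecting similar sequences with BLAST+

A BLAST search collects a set of sequences (amino acid and cDNA) that are similar to a given amino acid sequence. Run the following steps in the terminal inside the GBFV01.1.fsa_nt.transdecoder_dir directory.

Building the databases: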

makeblastdb -in longest_orfs.pep -dbtype prot -parse_seqids
makeblastdb -in longest_orfs.cds -dbtype nucl -parse_seqids

BLAST search:

Search the amino acid sequence database with the amino acid query in query.txt.

blastp -query query.txt -db longest_orfs.pep -num_alignments 10 -evalue 1e-12 -out 010_out.txt

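Collecting the BLAST hit sequences:

The amino acid and cDNA records share the same IDs, so the cDNA sequences can be retrieved by ID. queryIDs.txt lists the hit IDs, one per line.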

blastdbcmd -db longest_orfs.cds -dbtype nucl -entry_batch queryIDs.txt -out 020_out.txt
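How queryIDs.txt is prepared is not shown above. One possible approach, sketched here as an assumption rather than the author's method, is to rerun the search with tabular output (-outfmt 6, where column 2 is the subject ID) and extract the unique hit IDs. The function name hit_ids_from_tabular and the file name 010_out.tsv are hypothetical.

```shell
#!/bin/bash
# Extract unique subject (hit) IDs from BLAST tabular output (-outfmt 6).
# In that format, column 2 holds the subject sequence ID.
hit_ids_from_tabular() {
    cut -f2 "$1" | sort -u
}

# Example workflow (assumes blastp and blastdbcmd are on the PATH).
# -max_target_seqs is used because -num_alignments does not apply to
# tabular output formats.
# blastp -query query.txt -db longest_orfs.pep -max_target_seqs 10 \
#     -evalue 1e-12 -outfmt 6 -out 010_out.tsv
# hit_ids_from_tabular 010_out.tsv > queryIDs.txt
# blastdbcmd -db longest_orfs.cds -dbtype nucl -entry_batch queryIDs.txt -out 020_out.txt
```

Sorting with -u removes duplicate IDs, which appear when one subject sequence produces several alignments.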

Example: Slurm job

Available here.
Input file: the output of the Trinity example analysis after CD-HIT processing (infile_cdhit_out.fs).

#!/bin/bash
#SBATCH --job-name=transD
#SBATCH --mail-user="jun.inoue@oist.jp"
#SBATCH --partition=compute
#SBATCH --mem=2G
#SBATCH --cpus-per-task=4
#SBATCH --ntasks=1 # 1 task

TransDecoder-TransDecoder-v5.5.0/TransDecoder.LongOrfs \
-t infile_cdhit_out.fs

(October 2020)

Links

Informatics on the Mac

Transcriptome data analysis series

Next: "6. Removing similar sequences: CD-HIT".

1. Downloading SRA data

2. Checking fastq data: fastqc

3. Removing adapter sequences: Trimmomatic

4. Assembly: Trinity

5. Transcript sequence prediction: TransDecoder

6. Removing similar sequences: CD-HIT

7. Selecting genes with the same function: ORTHOSCOPE

I learned about this from AA, a colleague at OIST. Thank you very much.