Thursday, July 16, 2015

Using Scipio to generate a training dataset for the Augustus gene modeler

Training Augustus for a brand new species can be a bit of a hassle. However, there is a very easy way to build the training set using Scipio. Scipio is a wrapper around the BLAT program that works from two inputs: a protein file and a genome file of a reference organism. It is easy and fast to generate GenBank files from these two files using Scipio and BLAT.

Since Scipio consists of only three Perl scripts, you can simply place them inside the Augustus installation directory.

Proceed as follows:
[Make sure BLAT is in your PATH and that the Perl YAML module is installed on your system.]

./scipio.1.4.1.pl --blat_output=test.psl genome.fa proteins.aa > test.yaml
cat test.yaml | yaml2gff.1.4.pl > test.scipiogff
# scipiogff2gff.pl comes with the Augustus distribution
scipiogff2gff.pl --in=test.scipiogff --out=scipio.gff
cat test.yaml | yaml2log.1.4.pl > scipio.log

# Convert the gff into GenBank format for training purposes.
# Here 1000 is the maximum length of flanking DNA kept on each side of a gene.
# gff2gbSmallDNA.pl also comes with the Augustus package.
gff2gbSmallDNA.pl scipio.gff genome.fa 1000 genes.raw.gb

# First training pass on the raw gb file; complaints about problematic genes go to train.err
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.raw.gb 2> train.err

# Remove the genes etraining complained about to turn the crude gb file into a cleaner one
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
grep -c "LOCUS" genes.raw.gb genes.gb

# Now run etraining again on the filtered set
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.gb 2> train.err

[NOTE: Before running etraining, remember a few things: 1) AUGUSTUS_CONFIG_PATH must be set to the config directory inside the Augustus installation path. 2) Make a directory named 'myspecies' inside the config/species directory and place two files in it: 'myspecies_parameters.cfg' and 'generic_weightmatrix.txt'. Both can be copied from the generic species directory: ../generic/generic_parameters.cfg (rename this file) and ../generic/generic_weightmatrix.txt (leave the name as it is). Then run etraining. Several files will be created under the config/species/myspecies directory. Now you are ready to roll.]
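
The directory setup from the note can be sketched as a small shell function. The name setup_species is made up for this example, and the file layout simply follows the note above; check it against your own Augustus installation before relying on it.

```shell
#!/bin/sh
# Hypothetical helper following the note above.
# $1 = Augustus config directory, $2 = species name
setup_species() {
    config_dir=$1
    species=$2
    generic_dir="$config_dir/species/generic"
    species_dir="$config_dir/species/$species"

    mkdir -p "$species_dir" || return 1
    # The parameters file is renamed after the species.
    cp "$generic_dir/generic_parameters.cfg" \
       "$species_dir/${species}_parameters.cfg" || return 1
    # The weight matrix keeps its generic name, as described in the note.
    cp "$generic_dir/generic_weightmatrix.txt" \
       "$species_dir/generic_weightmatrix.txt" || return 1
}

# Example call, assuming AUGUSTUS_CONFIG_PATH is already exported:
# setup_species "$AUGUSTUS_CONFIG_PATH" myspecies
```

Augustus also ships a new_species.pl script intended for creating a new species directory; its usage message is worth a look as an alternative to copying the files by hand.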

# Now run Augustus for gene prediction
# A single command is enough:

augustus --species=myspecies genome.fa > test.gff

Augustus 3.1 now produces very neat gff files that are both easy to visualize and easy to understand:

# start gene g1
Scaffold_1      AUGUSTUS        gene    2248    2811    0.62    -       .       g1
Scaffold_1      AUGUSTUS        transcript      2248    2811    0.62    -       .       g1.t1
Scaffold_1      AUGUSTUS        stop_codon      2248    2250    .       -       0       transcript_id "g1.t1"; gene_id "g1";
Scaffold_1      AUGUSTUS        CDS     2248    2811    0.62    -       0       transcript_id "g1.t1"; gene_id "g1";
Scaffold_1      AUGUSTUS        start_codon     2809    2811    .       -       0       transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MLLTTPNRIAIYSGLDTAMATFSFEVSLRQSSYYELFFAHSVCFLRKSGERDADFWQYCGGRADVFYLHQWCEHRGAD
# REFCSANIYSDGEDDSQKKGTSRAKKGNRKRRGSSQAEVLATLAESVSAITAANNTREAQEVAWKDQALLLHDKRISHRLGMLTEIFDRTNCYIFDLK
# KHMKMTMATTS]
# end gene g1
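
Because the output is plain whitespace-separated text, it is easy to summarize with standard tools. Here is a minimal sketch that lists each predicted gene with its location and length; the function name list_genes is invented for this example, and it assumes gene lines look like the one shown above (feature type in column 3, gene id in column 9).

```shell
#!/bin/sh
# Hypothetical helper: read Augustus gff on stdin and print, for every
# "gene" line: id, sequence, start, end, strand, length in bp.
list_genes() {
    awk '!/^#/ && $3 == "gene" { print $9, $1, $4, $5, $7, $5 - $4 + 1 }'
}

# The g1 record from above, trimmed to three lines.
list_genes <<'EOF'
# start gene g1
Scaffold_1 AUGUSTUS gene 2248 2811 0.62 - . g1
Scaffold_1 AUGUSTUS CDS 2248 2811 0.62 - 0 transcript_id "g1.t1"; gene_id "g1";
EOF
# prints: g1 Scaffold_1 2248 2811 - 564
```

On a full prediction run, `list_genes < test.gff | wc -l` gives the number of predicted genes.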

