It is a bit of a hassel if you have a brand new species and would like to train augustus for your dataset. However, there is an incredibly easy way to do this using scipio. Scipio is a wrpper for BLAT program that takes advantage of having a protein file and a genome file of a reference organism. It is very easy and fast to generate genbank files from this 2 files using scipio and BLAT.
Since Scipio only has 3 perl scripts, one can install it inside the augustus installation directory.
Proceed the following way:
[Make sure BLAT is in your path. Also make sure YAML module is installed in your system]
./scipio.1.4.1.pl --blat_output=test.psl genome.fa proteins.aa > test.yaml
cat test.yaml | yaml2gff.1.4.pl > test.scipiogff
scipiogff2gff.pl --in=test.scipiogff --out=scipio.gff -> this script comes with augustus distribution
cat test.yaml | yaml2log.1.4.pl > scipio.log
# Convert gff into Genbank format for training purposes.
# Here 1000 means intergenic distance is minimum 1000
gff2gbSmallDNA.pl scipio.gff genome.fa 1000 genes.raw.gb -> This script also comes with augustus package
Generate train.err file first using the following command
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.raw.gb 2> train.err
# Modify these crude gb files into a more cleaner gb file
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
grep -c "LOCUS" genes.raw.gb genes.gb
# Running Training for gene prediction
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.gb 2> train.err
# Modify these crude gb files into a more cleaner gb file
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
grep -c "LOCUS" genes.raw.gb genes.gb
# Now run etraining again
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.gb 2> train.err
[NOTE: Here you have to remember few things: 1) first the AUGUSTUS_CONFIG_PATH should be set to the config directory inside the augustus installation path. 2) make a directory named 'myspecies' inside config/species directory and place a file 'myspecies_parameters.cfg' under config/species sub directory and one 'generic_weightmatrix.txt'. Both these files can be copied from ../generic/generic_parameters.cfg (rename this file) and ../generic/generic_weightmatrix.txt (leave it as such). Then run etraining. Several files will be created under config/species/myspecies directory. Now you are ready to roll]
Since Scipio only has 3 perl scripts, one can install it inside the augustus installation directory.
Proceed the following way:
[Make sure BLAT is in your path. Also make sure YAML module is installed in your system]
./scipio.1.4.1.pl --blat_output=test.psl genome.fa proteins.aa > test.yaml
cat test.yaml | yaml2gff.1.4.pl > test.scipiogff
scipiogff2gff.pl --in=test.scipiogff --out=scipio.gff -> this script comes with augustus distribution
cat test.yaml | yaml2log.1.4.pl > scipio.log
# Convert gff into Genbank format for training purposes.
# Here 1000 means intergenic distance is minimum 1000
gff2gbSmallDNA.pl scipio.gff genome.fa 1000 genes.raw.gb -> This script also comes with augustus package
Generate train.err file first using the following command
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.raw.gb 2> train.err
# Modify these crude gb files into a more cleaner gb file
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
grep -c "LOCUS" genes.raw.gb genes.gb
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.gb 2> train.err
# Modify these crude gb files into a more cleaner gb file
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
grep -c "LOCUS" genes.raw.gb genes.gb
# Now run etraining again
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.gb 2> train.err
[NOTE: Here you have to remember few things: 1) first the AUGUSTUS_CONFIG_PATH should be set to the config directory inside the augustus installation path. 2) make a directory named 'myspecies' inside config/species directory and place a file 'myspecies_parameters.cfg' under config/species sub directory and one 'generic_weightmatrix.txt'. Both these files can be copied from ../generic/generic_parameters.cfg (rename this file) and ../generic/generic_weightmatrix.txt (leave it as such). Then run etraining. Several files will be created under config/species/myspecies directory. Now you are ready to roll]
# Now run Augustus for gene prediction
# It can be run with these simple commandline steps:
augustus --species=myspecies genome.fa > test.gff
Augustus 3.1 now produces very neat gff files that is both easy to visualize and easy to understand
# start gene g1
Scaffold_1 AUGUSTUS gene 2248 2811 0.62 - . g1
Scaffold_1 AUGUSTUS transcript 2248 2811 0.62 - . g1.t1
Scaffold_1 AUGUSTUS stop_codon 2248 2250 . - 0 transcript_id "g1.t1"; gene_id "g1";
Scaffold_1 AUGUSTUS CDS 2248 2811 0.62 - 0 transcript_id "g1.t1"; gene_id "g1";
Scaffold_1 AUGUSTUS start_codon 2809 2811 . - 0 transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MLLTTPNRIAIYSGLDTAMATFSFEVSLRQSSYYELFFAHSVCFLRKSGERDADFWQYCGGRADVFYLHQWCEHRGAD
# REFCSANIYSDGEDDSQKKGTSRAKKGNRKRRGSSQAEVLATLAESVSAITAANNTREAQEVAWKDQALLLHDKRISHRLGMLTEIFDRTNCYIFDLK
# KHMKMTMATTS]
# end gene g1