Thursday, July 16, 2015

Using Scipio for generating training dataset for Augustus gene modeler

It is a bit of a hassle if you have a brand new species and would like to train Augustus on your dataset. However, there is an incredibly easy way to do this using Scipio. Scipio is a wrapper for the BLAT program that takes advantage of having a protein file and a genome file of a reference organism. It is very easy and fast to generate GenBank files from these two files using Scipio and BLAT.

Since Scipio consists of only 3 Perl scripts, one can install it inside the Augustus installation directory.

Proceed as follows:
[Make sure BLAT is in your path. Also make sure the Perl YAML module is installed on your system]
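
The two prerequisites above can be checked before starting. This is only a minimal sketch: require_cmd and require_perl_module are hypothetical helper names, not part of Scipio or Augustus.

```shell
# Hypothetical helper: succeed only if the named command is on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || { echo "missing command: $1" >&2; return 1; }
}

# Hypothetical helper: succeed only if Perl can load the named module.
require_perl_module() {
  perl -M"$1" -e 1 >/dev/null 2>&1 || { echo "missing Perl module: $1" >&2; return 1; }
}

# Usage:
# require_cmd blat && require_perl_module YAML && echo "prerequisites look fine"
```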

./scipio.1.4.1.pl --blat_output=test.psl genome.fa proteins.aa > test.yaml
cat test.yaml | yaml2gff.1.4.pl > test.scipiogff
# scipiogff2gff.pl comes with the Augustus distribution
scipiogff2gff.pl --in=test.scipiogff --out=scipio.gff
cat test.yaml | yaml2log.1.4.pl > scipio.log

# Convert the GFF into GenBank format for training purposes.
# Here 1000 is the maximum length of flanking DNA kept on each side of a gene
# gff2gbSmallDNA.pl also comes with the Augustus package
gff2gbSmallDNA.pl scipio.gff genome.fa 1000 genes.raw.gb

Generate the train.err file first using the following command (the species directory described in the note further down must already exist):
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.raw.gb 2> train.err

# Filter the genes that etraining complained about out of the crude GenBank file to get a cleaner one
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
grep -c "LOCUS" genes.raw.gb genes.gb
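
One caveat with the perl one-liner above: perl -pe prints every input line, so any line in train.err that does not match the pattern lands in badgenes.lst unchanged. A slightly more defensive variant is sketched below. It swaps in sed for perl and assumes, as the pattern above does, that problem lines contain "in sequence <name>: ". list_bad_genes is a hypothetical helper name.

```shell
# Hypothetical helper: print the unique sequence names mentioned in an
# etraining error log. Assumes problem lines contain "in sequence <name>: ".
# sed -n ... p prints only the lines where the substitution matched,
# and sort -u removes repeated names.
list_bad_genes() {
  sed -n 's/.*in sequence \([^[:space:]]\{1,\}\): .*/\1/p' "$1" | sort -u
}

# Usage:
# list_bad_genes train.err > badgenes.lst
```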

# Running Training for gene prediction
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.gb 2> train.err

# Second pass: append any genes reported this time to badgenes.lst (>>), so the genes filtered out in the first pass stay excluded
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' >> badgenes.lst
filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
grep -c "LOCUS" genes.raw.gb genes.gb

# Now run etraining again
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.gb 2> train.err

[NOTE: A few things to remember here: 1) AUGUSTUS_CONFIG_PATH should be set to the config directory inside the Augustus installation path. 2) Before running etraining, make a directory named 'myspecies' inside the config/species directory and place two files in it: 'myspecies_parameters.cfg' and 'generic_weightmatrix.txt'. Both files can be copied from ../generic/generic_parameters.cfg (rename this file) and ../generic/generic_weightmatrix.txt (leave it as is). Then run etraining. Several files will be created under the config/species/myspecies directory. Now you are ready to roll.]
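
The setup described in this note can be sketched as a small shell function. setup_species_dir is a hypothetical helper name and the path in the usage lines is a placeholder. Augustus also ships a new_species.pl script intended for this job, which may be the better choice if your installation includes it.

```shell
# Hypothetical helper: create config/species/<name> from the generic templates.
# $1 = Augustus config directory, $2 = species name.
setup_species_dir() {
  config_dir=$1
  species=$2
  generic="$config_dir/species/generic"
  target="$config_dir/species/$species"
  mkdir -p "$target" || return 1
  # The parameters file is renamed after the species ...
  cp "$generic/generic_parameters.cfg" "$target/${species}_parameters.cfg" || return 1
  # ... while the weight matrix keeps its generic name, as in the note above.
  cp "$generic/generic_weightmatrix.txt" "$target/generic_weightmatrix.txt" || return 1
}

# Usage (replace the placeholder path with your own installation):
# export AUGUSTUS_CONFIG_PATH=/path/to/augustus/config
# setup_species_dir "$AUGUSTUS_CONFIG_PATH" myspecies
```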

# Now run Augustus for gene prediction
# It can be run with this simple command:

augustus --species=myspecies genome.fa > test.gff

Augustus 3.1 now produces very neat GFF files that are both easy to visualize and easy to understand:

# start gene g1
Scaffold_1      AUGUSTUS        gene    2248    2811    0.62    -       .       g1
Scaffold_1      AUGUSTUS        transcript      2248    2811    0.62    -       .       g1.t1
Scaffold_1      AUGUSTUS        stop_codon      2248    2250    .       -       0       transcript_id "g1.t1"; gene_id "g1";
Scaffold_1      AUGUSTUS        CDS     2248    2811    0.62    -       0       transcript_id "g1.t1"; gene_id "g1";
Scaffold_1      AUGUSTUS        start_codon     2809    2811    .       -       0       transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MLLTTPNRIAIYSGLDTAMATFSFEVSLRQSSYYELFFAHSVCFLRKSGERDADFWQYCGGRADVFYLHQWCEHRGAD
# REFCSANIYSDGEDDSQKKGTSRAKKGNRKRRGSSQAEVLATLAESVSAITAANNTREAQEVAWKDQALLLHDKRISHRLGMLTEIFDRTNCYIFDLK
# KHMKMTMATTS]
# end gene g1


Wednesday, July 8, 2015

Hypocrisy of scientific journals

I don't want to sound like a bitter, dejected, negative person; however, this has been my unbiased view of the publication policies of some of the well-known, prestigious journals.

When I consider publishing somewhere, I quickly browse to the section that lists the publication cost. While the cost of publication is usually very high, journals sometimes offer concessions for List A and List B countries, which are economically underdeveloped. You will not find India in either list, which means we have to pay the full publication cost! But open a newspaper in any of the Western countries, or just look at the economic rating given to India by Western rating agencies, and it is always at the junk level. So, why this hypocrisy? On one hand you want to rate India way below many developing countries, and on the other hand you consider India developed enough to pay the full publication fee...

My experiences with pre-publication inquiries to some of the journals I published in before (when I was in the US) are also very alarming. The same paper will get published if sent by a US lab, but a lab from India will face rejection at the pre-publication inquiry stage itself. I wonder if anyone else has noticed this. But we don't want this to deter us from publishing good articles. We will rise above the clouds and publish our work in higher journals.

Tuesday, July 7, 2015

Trend line and regression analysis. How to do it easily

Today I came across the thesis of a student who used this, and it was a long-forgotten topic for me. Nevertheless, it was refreshing to read and revisit this very simple, statistically inclined topic that we encounter in day-to-day life. So what is a trend line?

We all, knowingly or unknowingly, try to predict the future from our past experience. For example, if a student has scored good grades in previous years, we tend to predict what his/her grades are going to be in the coming years. There are many more such examples, like predicting the stock market, predicting the weather and so on. So, what we do internally is generate a trend and try to fit the future to that trend.

So, let's take a simple example of the scores of a student (month in the left column, score in the right column):

1      300
2      340
3      320
4      400
5      420
9      500

So, here the right column holds the scores the student had in the months listed in the left column. Eyeballing this data, what we see is a gradual improvement in scores. We also observe that some points are missing here (months 6, 7 and 8). So, can we predict the score for the 10th month? Can we say what the score would have been in the 6th, 7th and 8th months?

So, in this case, we may like to fit the data to a trendline. The easiest way to do it is to use Excel. So let's fill in the data in an Excel sheet as shown below (Figure 1).

So, now the formula is Y = 25.5X + 278 with R2 = 0.928. Using this equation, we can predict the score for the 6th month: Y = 25.5 * 6 + 278 = 431
For the 7th month it is 25.5 * 7 + 278 = 456.5
For the 8th month it is 25.5 * 8 + 278 = 482
For the 10th month it is going to be 25.5 * 10 + 278 = 533
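
If Excel is not at hand, the same least-squares line can be recomputed on the command line as a cross-check of the arithmetic. The awk sketch below uses the standard closed-form formulas for slope, intercept and R2. fit_trendline is a hypothetical helper name; it reads "month score" pairs on standard input and takes the months to predict as arguments.

```shell
# Hypothetical helper: ordinary least-squares fit of y against x.
# Input: "x y" pairs on stdin. Arguments: x values to predict for.
fit_trendline() {
  awk -v months="$*" '
    NF == 2 { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
    END {
      num   = n * sxy - sx * sy
      slope = num / (n * sxx - sx * sx)
      icpt  = (sy - slope * sx) / n
      r2    = num * num / ((n * sxx - sx * sx) * (n * syy - sy * sy))
      printf "slope=%.1f intercept=%.1f R2=%.4f\n", slope, icpt, r2
      # Predict a value for each requested month using y = slope * x + intercept.
      k = split(months, m, " ")
      for (i = 1; i <= k; i++)
        printf "month %d: %.1f\n", m[i], slope * m[i] + icpt
    }'
}

printf '%s\n' '1 300' '2 340' '3 320' '4 400' '5 420' '9 500' | fit_trendline 6 7 8 10
```

For the six data points above this prints slope=25.5 intercept=278.0 R2=0.9289, followed by 431.0, 456.5, 482.0 and 533.0 for months 6, 7, 8 and 10.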

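Now, putting these values back into Excel, we get the following figure (Figure 2).
So, what do we see there? The equation and R2 have changed!! Slightly though. So, what does it tell us? Points that lie exactly on the fitted line leave the slope and intercept unchanged and can only raise R2, so a shifted equation means the filled-in values did not sit exactly on the line (a rounding or calculation slip). It is also possible that a linear trendline is simply not the best fit for this data series.
Figure 1: Excel screenshot of how to select a trendline in Excel and display the formula.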

Figure 2: Changed R2 value as well as trendline formula