Thursday, March 28, 2013

Randomizing experimental samples

Often we encounter this problem, where we have to randomize samples and group them separately. While, this is a very simple problem, sometimes new users find it difficult to comprehend. There are several randomizing functions available in many programming languages that can generate random numbers. Here, I have used rand() function of excel for doing this simple task and group my samples into 2 categories.

Here is little bit about the problem I am trying to solve:

I have 4 experimental replicates: 1. Tounge sample at N0 metastasis 2. Tounge sample at N1 metastasis 3. Buccal mucosa sample at N0 metastasis 4. Buccal mucosa sample at N1 metastasis. Now I want to divide all these data into 2 random groups; perform analysis on each group independent of each other and compare the results. This can be done in several ways, but one of the simplest ways I have adopted is as below.

1. List each of these groups in one excel sheets in column A.
2. In column B perform excel rand() function by doing =rand() [fig-1]
3. Now drag this till the end of data rows.
4. This will generate random numbers in column B.
5. Sort column B with expand sort option.
6. Split the number of data points into 2 halfs and store them separately.
7. Repeat this for all other samples.

Friday, March 22, 2013

Training Augustus Gene Prediction program

Lately I have been asked by multiple people to solve the training problem of Augustus for their organism data. Although I have done it earlier, this time, I faced unusually long time in solving this issue. Augustus wants you to provide a training dataset either in genbank format or in gff format. It typically wants users to remove duplicate genes; proteins with 70% identity, in order to get over the over fitting problem. Also, remove any kind of overlapping genes from your training files. If you could not find a script that can do it efficiently, please contact me directly I will provide you one (sorry, I am yet to set my source code page up).

I have generated a genbank file from my gff and the annotation files (again script can be sent upon request). Howver, few things I have overlooked and they were leaving empty lines between 2 genomic LOCUS entries. If empty space is left then the that is used to split the gb training file into 2 random parts fails. You either get all your data in .test file or all of them in .train file. Suppose you decide to move on with your .train file with all the data and .test as a empty file. Surprises that you will end up with std::bad_alloc() error from augustus, that has nothing to do with the memory of your system. So, if you have an autogenerated genbank file with empty lines, open the file in vim editor and get rid of empty lines using  :g/^$/d. 

Once done, try splitting your *.gb file with This will do the magic. Now try to train your program and it should run file.

Here is a check list of commands what you should be doing to train for Augustus.

perl scripts/ --species=sojae
perl scripts/ 30
./bin/etraining --species=sojae
# mock prediction
./bin/augustus --species=sojae | tee firsttest.out
grep -A 22 Evaluation firsttest.out

This is what you get when you run the last command. This is a pretty good indicator that your training is almost good to go.

*******      Evaluation of gene prediction     *******

                 | sensitivity | specificity |
nucleotide level |       0.861 |       0.674 |

           |  #pred |  #anno |      |    FP = false pos. |    FN = false neg. |
            |             |
           | total/ | total/ |   TP |--------------------|--------------------|
sensitivity | specificity |
           | unique | unique |      | part | ovlp | wrng | part | ovlp | wrng |
            |             |
           |        |        |      |              24559 |              20733 |
            |             |
exon level |  34213 |  30387 | 9654 | ------------------ | ------------------ |
      0.318 |       0.282 |
           |  34213 |  30387 |      | 8709 | 4167 | 11683 | 9312 | 5502 | 5919 |
             |             |

transcript | #pred | #anno |   TP |   FP |   FN | sensitivity | specificity |
gene level | 10348 | 12700 | 2547 | 7801 | 10153 |       0.201 |       0.246 |

Sorry this output is jumbled but in essence it tells that 2547 genes were predicted exactly. Although exon and gene level results does not look that good, you could actually try and train the software using various types of exons that are accurate.

perl scripts/ # This will take a while....

Thursday, March 7, 2013

Installing perl GD

I never thought I will get stuck installing perl GD module so badly. I use GD::Graph module from bioperl for most of my plotting work. Our old servers had all of these installed over a period of time. Now I am in a hurry to plot something, but could not get this thing working. It keeps complaining about libgd-config file. You google but you will never find a proper answer. I went down my own memory lane and remembered about the gd module. Also looking up in the error profile, you will locate somewhere it is asking you to install libgd 2.0.8 or higher.

Finally I located the module at  and looked for manual at at . Unpacked and did a ./configure followed by a make and it just exited with errors...
Tracing back into ./configure log again, I figured out, it wanted zlib library.
The zlib package can be found in this site:

Once downloaded it, it complains about libpng package.

Get libpng package from
do  ./configure followed by make install.
Then go back and install zlib then install gd package.

then go back and do a perl -MCPAN -e "install GD"

But my script did not run even then. It asked about GD::Graph. So, do a
perl -MCPAN -e "install GD"