
Monday, December 9, 2013

Renaming a MySQL database

RENAME {DATABASE | SCHEMA} db_name TO new_db_name;

This statement does not work anymore; it was removed from MySQL because it was considered quite dangerous. So the best way now is to first create a database at the mysql prompt with the command: create database myDB, and then do a:

mysqldump -u root -p oldname | mysql -u root -ppassword myDB


This may or may not work depending on what your max_allowed_packet is set to. In many cases it may fail, giving an error such as:

ERROR 1153 (08S01) at line 1441: Got a packet bigger than 'max_allowed_packet' bytes

For this, you may have to go back to your mysql prompt and run the command:

mysql> show GLOBAL variables like 'max_allowed_packet%';
+--------------------+---------+
| Variable_name      | Value   |
+--------------------+---------+
| max_allowed_packet | 1048576 |
+--------------------+---------+

Now go to your my.cnf file as root (most likely it will be located in your /etc folder) and edit the line below your [mysqld] line, or add it if there is no max_allowed_packet variable yet:

[mysqld]
max_allowed_packet = 1024M

Now as root do a:

service mysqld restart

Stopping mysqld:                                           [  OK  ]

Starting mysqld:                                           [  OK  ]

Then go to the mysql prompt and check whether your max_allowed_packet size has changed by querying:

show global variables like 'max_allowed_packet%';

This time you may see a changed number such as this:

+--------------------+------------+
| Variable_name      | Value      |
+--------------------+------------+
| max_allowed_packet | 1073741824 |
+--------------------+------------+
1 row in set (0.00 sec)

Now run your copy-database command again:
mysqldump -u root -p oldname | mysql -u root -ppassword myDB

and see if it works this time...
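Putting the whole rename together, a minimal sketch (with hypothetical database names olddb and newdb; each command prompts for the password separately):

mysql -u root -p -e "CREATE DATABASE newdb;"
mysqldump -u root -p olddb | mysql -u root -p newdb
mysql -u root -p -e "SHOW TABLES IN newdb;"    # verify the copy first
mysql -u root -p -e "DROP DATABASE olddb;"     # only drop the old one when satisfied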





Friday, November 29, 2013

Analysing the transcriptome data

Here I had an opportunity to analyse transcriptome data. I have been analyzing the transcriptomic data using two different assembly programs, the Ray and Trinity assemblers; we prefer the Trinity assembler for transcriptomic data. Trinity does the job in 3 stages:
Inchworm:
assembles the reads into unique transcript sequences, often full-length for the dominant isoform, and reports the unique portions of alternatively spliced transcripts
Chrysalis:
takes the Inchworm contigs into clusters and constructs de Bruijn graphs for those clusters
Butterfly:
makes the full-length transcripts for alternatively spliced isoforms and separates transcripts that correspond to paralogous genes
Suppose you have sequences in FASTQ format: Trinity converts them into FASTA using fastool, which takes much time to do. If we have a large amount of data, like millions of reads, we can shorten the run by using the option called --min_kmer_cov. In the case of paired-end data it uses the Bowtie short-read aligner, and the command line is:

trinityrnaseq_r2013-02-25/Trinity.pl --seqType fq --JM 30G --min_kmer_cov 1  --left left_file1.fastq --right right_file2.fastq --CPU 10 --output trinity_out_dir > assembly_log

Once Trinity is running, you can check on your process with top, and the jobs will be shown as inchworm, chrysalis and butterfly. If the sequence is in FASTQ format you will also see fastool converting it to FASTA as it processes; it uses the cat command for joining the contigs at the end, and when you do top you can even see the sort command as well.
Once the assembly is done you can run TrinityStats.pl from the scripts directory and get the N50, the total number of transcripts and the total number of Trinity components; for example:
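(A sketch: in my release the script sits under util/ and the assembly is written as Trinity.fasta inside the output directory; both may differ in your release.)

trinityrnaseq_r2013-02-25/util/TrinityStats.pl trinity_out_dir/Trinity.fasta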
I have tried various assemblers like Ray, but it did not give me the expected output; Ray works perfectly for genome assembly, while for transcriptomic data Trinity is doing well.
Analysing differential expression using edgeR through RSEM:

I have transcriptome samples from three different conditions: one at an early passage, one at an intermediate passage, and one at a late passage.

The basic steps I follow are:
1. Make one assembly of all the conditions together, as below:
trinityrnaseq_r2013-02-25/Trinity.pl --seqType fq --JM 30G --min_kmer_cov 1  --left condA1.fastq condB1.fastq condC1.fastq --right condA2.fastq condB2.fastq condC2.fastq --CPU 10 --output trinity_out > log
2. Map the reads back to the assembled transcripts, separately for condA, then condB, then condC, using Bowtie or whatever mapping tool is convenient.
3. Now estimate the gene-level abundance using RSEM. Suppose I want to check the differentially expressed genes between condA and condB, condA and condC, and condB and condC: the counts matrix can be created with RSEM for each pair. The other way, if you want to check all three conditions A, B and C at once, is also possible: make one matrix of all 3 conditions together and do the differential expression on that.
4. Then run the differential expression test you want in edgeR, a Bioconductor package, with a sample description file (a sketch of one follows below):
Analysis/DifferentialExpression/run_DE_analysis.pl --matrix counts.matrix --method edgeR --samples_file samples_described.txt
5. Then extract the differentially expressed genes with the following command; with the log-transformed values you will get the expression data. The TMM_normalized.FPKM file is created during the RSEM matrix calculation, so provide that normalized matrix here:
cd edgeR/
/Analysis/DifferentialExpression/analyze_diff_expr.pl --matrix ../matrix.TMM_normalized.FPKM     -P 1e-3 -C 2
The output will be in tab-delimited format: IDs, logFC values, FPKM, condA expression value, condB expression value, and so on.
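For reference, samples_described.txt is just a tab-delimited text file mapping a condition name to each replicate, something like this (hypothetical names; check the run_DE_analysis.pl documentation in your Trinity release for the exact columns expected):

condA    condA_rep1
condB    condB_rep1
condC    condC_rep1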





Wednesday, October 23, 2013

Full length cDNA Technologies

        Transcriptome analysis is proving its immense importance day by day, as RNAs are crucial to understanding the dynamic nature of cellular events. Since RNAs are not stable, they are reverse transcribed to cDNA, and the cDNAs are sequenced instead of the RNAs to obtain the RNA sequences. But a mammoth obstacle in this procedure is the incomplete synthesis of cDNA.

        There are several reasons why the first strand of cDNA is not fully synthesized during reverse transcription. Some of the causes and the ways of overcoming those difficulties are discussed here.


1) Problem:  inefficiency in the synthesis of the first cDNA strand
      Cause:       inefficiency of reverse transcriptase enzymes.
      Solution:    addition of trehalose.
      Mechanism:   trehalose confers heat resistance on enzymes, including reverse transcriptase. From this discovery, it became possible to run the reverse transcription reaction at 60°C instead of the 42°C that was previously the norm. At 60°C the template RNA's secondary structure is weaker, and regions with stable secondary structure, which often exist at the non-coding 5' ends of mRNAs, can be reverse transcribed very efficiently.

2) Problem:    absence of the actual 3' end of the cDNA (i.e. the 5' end of the mRNA)
     Cause:      inefficiency of reverse transcriptase
     Solution:   the cap-trapper method.
     Mechanism:  biotinylation of the cap structure that is specific to the mRNA of eukaryotic organisms, followed by selective removal of non-full-length cDNA (enzymatic degradation of single-stranded mRNA). The remaining full-length cDNA carrying a biotinylated cap can be captured using magnetic beads coated with streptavidin. The selected full-length cDNA can then be eluted by alkaline treatment. Synthesizing the second strand from the first-strand cDNA (which became single stranded when separated from the beads) makes it possible to obtain full-length cDNA selectively.



Fig: Cap Trapper method


Monday, October 21, 2013

Installing Ensembl API

Ensembl has a cool set of subroutines that are very useful. For some people, installing and testing it is just very easy, but for others it may be very difficult, especially when you are setting up a new server. You get the API from http://asia.ensembl.org/info/docs/api/core/core_tutorial.html where step-by-step instructions are given.

Many times, running /ensembl/misc-scripts/ping_ensembl.pl tells you what the problem is. In our instance it kept complaining about the installation of DBI even though DBI was very much there. Looking into the complaining part:

eval {
  require DBI;
  require DBD::mysql;
  require Bio::Perl;
  require Bio::EnsEMBL::Registry;
  require Bio::EnsEMBL::ApiVersion;
  require Bio::EnsEMBL::LookUp if $ensembl_genomes;
  $api_version = Bio::EnsEMBL::ApiVersion::software_version();

I realized that DBD::mysql is required and was not available on my machine. So log in as root and type cpan at the command prompt, then do an 'install DBD::mysql'; it exits with a lot of errors. Tracing back, it says it can't find mysql.h. I looked for the source and never found it, since I had only done a yum install mysql-server. In this case, you first have to do a yum install mysql-devel. Then go to cpan and do a force install DBD::mysql. That should get it installed.
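Roughly, the sequence of commands was (a sketch for a yum-based system like ours; run as root):

yum install mysql-devel           # provides mysql.h and the client libraries
cpan                              # start the CPAN shell
cpan> install DBD::mysql          # try the normal install first
cpan> force install DBD::mysql    # only if the tests still fail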

Set Path for Ensembl:

As written in the document, set paths for ensembl packages as below:

export PERL5LIB=$PERL5LIB:/home/sutripa/ensembl/modules
export PERL5LIB=$PERL5LIB:/home/sutripa/ensembl-compara/modules
export PERL5LIB=$PERL5LIB:/home/sutripa/ensembl-functgenomics/modules
export PERL5LIB=$PERL5LIB:/home/sutripa/ensembl-variation/modules

export PERL5LIB=$PERL5LIB:/home/sutripa/ensembl-tools/

Then go to:

ensembl/misc-scripts and run ./ping_ensembl.pl and see if you get a message like this:
Installation is good. Connection to Ensembl works and you can query the human core database

Then you are good to go....



Thursday, October 17, 2013

Installing SecretomeP on your server

We are also working on probiotic bacteria and analyzed a few secretory proteins by MALDI. As it turns out, some of the secretory proteins are predicted to be SignalP positive and some are not. Intrigued, we ran a SecretomeP analysis on those and found all the secretory proteins to be either SignalP positive or SecretomeP positive. We got a copy of SecretomeP from http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?secretomep, untarred it and got the package. However, it has a large number of dependencies, and they are:

1. ProP 1.0 http://www.cbs.dtu.dk/services/ProP/
2. PSORT II http://www.psort.org/
3. seg ftp://ftp.ncbi.nih.gov/pub/seg/seg
4. TMHMM 1.0 http://www.cbs.dtu.dk/services/TMHMM/
5. SignalP 3.0 http://www.cbs.dtu.dk/services/SignalP/

ProP 1.0
ProP predicts arginine and lysine cleavage sites using an ensemble of neural networks. By default it does a furin-specific prediction. It also does a proprotein convertase prediction and has been integrated into SignalP.
For installation of this package, you have to open the 'prop' executable file and start editing the paths in the following section:

else if ( $SYSTEM == "Linux" ) then  # typical Linux
   setenv AWK /usr/bin/gawk
   setenv ECHO "/bin/echo -e"
   setenv GNUPLOT /usr/local/bin/gnuplot  # /usr/local/bin/gnuplot-3.7
   #setenv PPM2GIF /usr/bin/ppmtogif
   # I could not find a ppmtogif package in my installation so I did the following:
   setenv PPM2GIF /usr/bin/ppm2tiff
   setenv SIGNALP /usr/cbs/bio/bin/signalp

Set the appropriate paths after checking where the above-mentioned programs live.
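A quick way to do that checking (a generic shell loop, not part of the ProP package; adjust the program names to your system):

for p in gawk gnuplot ppmtogif ppm2tiff signalp; do
    which $p || echo "$p not found on PATH"
done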

Now set PROPHOME correctly in prop script.

setenv PROPHOME /usr/adadata/prop-1.0c

Then just test by running ./prop and see if it runs fine.

PSORT II

I obtained a copy by emailing Yuki Saito <yuki-s@hgc.jp>, secretary to Prof. Kenta Nakai.
This is a very useful program for predicting the sub-cellular localization of a protein, and it is a Perl script. All you need to do is set the correct path for perl in the shebang line!

Test it by running ./psort . If it runs fine, then installation is OK.

Seg:

It comes with the blast distribution.

SignalP and TMHMM : 

Both can be obtained from the same place (http://www.cbs.dtu.dk) and installed.

Once all the installations are done, check that all of these programs are working. If all run fine, then go back to the secretomep script and set the correct paths to all the installed programs:

Here is a template of my changes:

else if ( `uname` == "Linux" ) then

   setenv ECHO  "/bin/echo -e"
   setenv AWK   /usr/bin/gawk
   setenv PERL  /usr/bin/perl

   set prop     = /usr/adadata/prop-1.0c/prop
   set psort    = /usr/adadata/psort/psort
   set seg      = /usr/adadata/ncbi-blast-2.2.28+/bin/segmasker
   set tmhmm    = /usr/adadata/tmhmm-2.0c/bin/tmhmm
   set signalp  = /usr/adadata/signalp-4.1/signalp

endif


Once done just run ./secretomep in silent mode and see if it runs fine...

[our installation was successful]



Tuesday, July 30, 2013

Synteny and the world of genomics

Comparative genomics is one of my interests. In India, many people know bioinformatics only as drug design, simulation, protein folding and predicting 3D structures, but I got to the right place to do this... I will share the little that I have learnt...


Comparative genomics:
Comparative genomics is one of the bioinformatics approaches in which we compare unknown genomes with known genomes by genome size, number of genes and chromosomes. By comparing two genomes we can identify the core set of genes shared between two organisms, and we can also identify genes that are involved in mutation or in the ability to cause disease. Comparative genomics also reveals the evolutionary relationships of organisms.
Synteny? Synteny refers to two or more genes being located on the same chromosome, and to the linkage between those genes. Most often it is based on collinearity data showing that two or more chromosomes or segments are derived from a common ancestor, so we can say that synteny is used to tie homologous genes to their ancestral chromosomal position. For example, in humans, we all know we have 23 pairs of chromosomes, and human chromosome 17 corresponds to an entire portion of mouse chromosome 11.
Some terms related to synteny:
Single gene transposition:
refers to the insertion of one gene into a new location
Fractionation:
the process by which a duplicated gene, chromosomal segment or genome tends to return to its pre-duplication gene content
Subfunctionalization:
the selectively neutral tendency of a duplicated cis-acting unit of function to lose dispensable sequences on one, but not both, duplicates, such that the ancestral function is spread over both duplicates
CoGe is one of the online databases used for analysing synteny.
The identification of syntenic regions can be done by
1. finding the putative regions or homologous genes between the two genomes.
2. identifying the co-linear sets of genes or the regions of sequence similarity.
SynMap methods:
1. extracts the sequences for comparison and builds the FASTA files.
2. creates the database and compares using the BLAST algorithm.
3. uses a default e-value cutoff of 0.001.
4. identifies tandem gene duplicates with blast2raw.
5. filters the repetitive matches.
6. identifies syntenic pairs by finding co-linear putative homologous sequences.
7. calculates the synonymous and non-synonymous mutation rates for syntenic gene pairs.
8. generates the dotplot of putative homologous matches.
9. colors the dotplots based on the synonymous and non-synonymous mutations.

Analysis options:
1. breaks the sequences into multiple pieces and searches.
2. filter repetitive matches: adjusts the e-values of the BLAST hits to lower the significance of sequences that occur multiple times in the genome.
3. DAGChainer options: identify syntenic regions between genomes and gene spaces in genomes.
4. average distance between syntenic regions.
5. maximum distance between two matches.
6. minimum range of aligned pairs.
7. syntenic depth: gives the best syntenic regions that cover each genome.



Fig: a syntenic dotplot constructed between E. coli genomes. I am just giving this as an example so that everybody can have an idea of a dotplot and how it looks..


Saturday, July 27, 2013

UIDs of cyanobacteria..

I had been tired of downloading the genomes and genes of cyanobacteria one by one. Then I found something called UIDs and collected all the UIDs for my genomes. I think this will be useful for everybody who is working with cyanobacteria.
cyanobacterium_UCYN_A_uid43697
Nostoc_azollae__0708_uid49725
Trichodesmium_erythraeum_IMS101_uid57925
Thermosynechococcus_elongatus_BP_1_uid57907
Synechocystis_PCC_6803_uid57659
Synechocystis_PCC_6803_uid189748
Synechocystis_PCC_6803_uid159873
Synechocystis_PCC_6803_substr__PCC_N_uid159835
Synechocystis_PCC_6803_substr__GT_I_uid158059
Synechocystis_PCC_6803_substr__GT_I_uid157913
Synechococcus_elongatus_PCC_7942_uid58045
Synechococcus_elongatus_PCC_6301_uid58235
Synechococcus_WH_8102_uid61581
Synechococcus_WH_7803_uid61607
Synechococcus_RCC307_uid61609
Synechococcus_PCC_7502_uid183008
Synechococcus_PCC_7002_uid59137
Synechococcus_PCC_6312_uid182934
Synechococcus_JA_3_3Ab_uid58535
Synechococcus_JA_2_3B_a_2_13__uid58537
Synechococcus_CC9902_uid58323
Synechococcus_CC9605_uid58319
Synechococcus_CC9311_uid58123
Rivularia_PCC_7116_uid182929
Prochlorococcus_marinus_pastoris_CCMP1986_uid57761
Prochlorococcus_marinus_NATL2A_uid58359
Prochlorococcus_marinus_NATL1A_uid58423
Prochlorococcus_marinus_MIT_9515_uid58313
Prochlorococcus_marinus_MIT_9313_uid57773
Prochlorococcus_marinus_MIT_9312_uid58357
Prochlorococcus_marinus_MIT_9303_uid58305
Prochlorococcus_marinus_MIT_9301_uid58437
Prochlorococcus_marinus_MIT_9215_uid58819
Prochlorococcus_marinus_MIT_9211_uid58309
Prochlorococcus_marinus_CCMP1375_uid57995
Prochlorococcus_marinus_AS9601_uid58307
Oscillatoria_acuminata_PCC_6304_uid183003
Oscillatoria_PCC_7112_uid183110
Nostoc_punctiforme_PCC_73102_uid57767
Nostoc_PCC_7524_uid182933
Nostoc_PCC_7120_uid57803
Nostoc_PCC_7107_uid182932
Microcystis_aeruginosa_NIES_843_uid59101
Microcoleus_PCC_7113_uid183114
Leptolyngbya_PCC_7376_uid182928
Halothece_PCC_7418_uid183338
Gloeocapsa_PCC_7428_uid183112
Gloeobacter_violaceus_PCC_7421_uid58011
Geitlerinema_PCC_7407_uid183007
Dactylococcopsis_salina_PCC_8305_uid183341
Cylindrospermum_stagnale_PCC_7417_uid183111
Cyanothece_PCC_8802_uid59143
Cyanothece_PCC_8801_uid59027
Cyanothece_PCC_7822_uid52547
Cyanothece_PCC_7425_uid59435
Cyanothece_PCC_7424_uid59025
Cyanothece_ATCC_51142_uid59013
Cyanobium_gracile_PCC_6307_uid182931
Cyanobacterium_stanieri_PCC_7202_uid183337
Cyanobacterium_PCC_10605_uid183340
Crinalium_epipsammum_PCC_9333_uid183113
Chroococcidiopsis_thermalis_PCC_7203_uid183002
Chamaesiphon_PCC_6605_uid183005
Calothrix_PCC_7507_uid182930
Calothrix_PCC_6303_uid183109
Arthrospira_platensis_NIES_39_uid197171
Anabaena_variabilis_ATCC_29413_uid58043
Anabaena_cylindrica_PCC_7122_uid183339
Anabaena_90_uid179383
Acaryochloris_marina_MBIC11017_uid58167

Using these, you can download the genome information with the help of the NCBI E-utilities; a few Perl scripts (or even a plain shell loop, as sketched below) can save a lot of time...
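The names above follow the directory layout that NCBI's FTP site used at the time, so something like this works (a sketch assuming the old genomes/Bacteria layout; uids.txt holds the list above, one name per line):

#!/bin/bash
# mirror each uid-named genome directory from the old NCBI FTP layout
while read uid; do
    wget -r -np -nH "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/${uid}/"
done < uids.txt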

Monday, July 22, 2013

Importing & exporting MySQL dump files including/excluding data


Let us suppose we have a database created in MySQL as: testDB

EXPORT

1. DATA + STRUCTURE
[user@pc]$ mysqldump -u user -p testDB > /path-to-export/testDB.sql

2. STRUCTURE Only
[user@pc]$ mysqldump -u user -p --no-data testDB > /path-to-export/testDB.sql

3. DATA only
[user@pc]$ mysqldump -u user -p --no-create-db --no-create-info testDB > /path-to-export/testDB.sql

IMPORT

1. STRUCTURE + DATA
[user@pc]$ mysql -u user -p testDB < /path-from-import/testDB.sql

I would like to mention an issue that you might face while importing a huge dump file (especially genome databases). The default MySQL configuration will give you an error:
 "Got a packet bigger than 'max_allowed_packet' bytes"

The solution is to globally increase the import size for both the MySQL server and client:

1. While importing add --max_allowed_packet=100M or your specified size.
e.g.: [user@pc]$ mysql -u user -p --max_allowed_packet=100M testDB < /path-from-import/testDB.sql

2. Open another terminal, log in to the MySQL server as root and add the following lines:
mysql>  set global net_buffer_length=10000000;
mysql>  set global max_allowed_packet=10000000000;

Now proceed with the import command.

Hope it helps you!

Friday, July 19, 2013

mysql.sock file missing or appears to be missing

I had an episode when we tried to install the MySQL data directory in a location other than the standard /var/lib location on our server. For that we uninstalled MySQL several times, but some remnants were still there elsewhere and were interfering. If you need to re-install MySQL, just do the following.

1. yum erase mysql  -> This will erase your MySQL from any location
2. yum install mysql-server -> This will install MySQL again
3. Log in as root (or use sudo) and run: service mysqld start -> This will start MySQL
4. Create a root password for your MySQL using: mysqladmin -u root password lklklklskl -> This will set the root password to lklklklskl
5. If step 4 runs fine then all is well. But in my case it complained a lot:
mysqladmin: connect to server at 'localhost' failed
error: 'Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)'
Check that mysqld is running and that the socket: '/var/lib/mysql/mysql.sock' exists!

I googled a lot and then found that the mysql.sock file path is written in the /etc/my.cnf file. Inside my.cnf I found the path was /var/run/mysqld/mysqld.sock. Notice that it is mysqld.sock, NOT mysql.sock. With a lot of caution I created a mysql.sock symlink at /var/lib/mysql pointing to it, using:
ln -s /var/run/mysqld/mysqld.sock /var/lib/mysql/mysql.sock

Then I retried command 4 and it ran fine....

If your problem persists and it is unable to connect through the mysql.sock file, then try connecting through TCP/IP. By default, mysql tries to connect to the server via localhost, which uses the socket connector, but connecting to 127.0.0.1 uses the TCP/IP connector. So, you can try connecting using:

mysql --protocol TCP -u xxx -p

or
If you don't want to create a symlink to the mysqld.sock file, try giving the path to the socket file with:

mysql --socket=/var/run/mysqld/mysqld.sock -u xxx -p

Alternatively, you can change the host name in the /etc/my.cnf file from localhost to 127.0.0.1. Then it will not look for the socket connector.
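For that last option, the change would look roughly like this in /etc/my.cnf (a sketch; put it under the [client] section so that only client connections are affected):

[client]
host = 127.0.0.1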


Wednesday, June 26, 2013

EuMicrobedb, TranscriptomicsDB and the TOOLKIT with EMBOSS interface are finally resurrected....

Countless emails, worries, warnings, anger and frustration over these databases are finally over. The good news is that Eumicrobedb, TranscriptomicsDB and the toolkit that served the oomycetes community for a long time are now finally up and available. They are now hosted at IICB instead of Virginia Tech. These resources were among the first efforts to organize and disseminate oomycete genomes and transcriptomes, and did so for 7 long years. In October 2012, after a hacker attack on the server, the web site went down; subsequently many of the web files were completely erased, and we could only recover part of the software from the backup server. We tried to revive it at VBI, but that failed, and it was unavailable for 8 months.

Why did it take almost 8 months to recreate it?
After joining IICB, I thought I would just maintain a mirror site, with the main site at VBI. But I realized that the cost of maintaining a server at VBI was one of the biggest limiting factors in keeping this resource alive. Then it fell completely on me to recreate it, for which I was not completely prepared. One of the biggest problems was that this database was designed on an Oracle framework with many complicated tables and views. I knew, after using it for a long time, that we only used about 50% of the total architecture for real-time data storage. I was aware of this, but did not really have the time or inclination to sit and rework the schema. Now that I had only small servers and no Oracle license, I was pushed into a corner, and thus began the effort to recreate this schema in MySQL.

Finally there was a lot of pressure to do this (I work best under pressure!) and I spent two solid months, sitting literally day and night, working on the schema. I merged several tables, removed several obsolete fields and created a few new tables to finally come up with a new schema. Although we have retained most of the nicest features of GUS, at the same time we have reduced the overheads substantially.

Another challenging feature of this schema is the data upload and the dependencies. Earlier, a BioPerl layer with several plugins served the purpose, but that would again mean installations and dependencies, and I decided to get rid of all that. We re-wrote most of the interfacing software in Python, with some Perl, and it is extremely lightweight.

One of the most challenging tasks was the data transfer from the VBI Oracle server to the new schema. The data needed remapping at various points. At the same time, the VBI data (VMD) needed some cleaning, because I had kept many genomes that were either obsolete or not very clean. So this required a lot of soul searching to remove unwanted stuff. Finally I was able to remap, reanalyze where necessary, and upload the data to our new schema.

The front end of the database was also riddled with several challenges, such as a complete change in the queries and migration to a different Linux server setup with limited backed-up software. Finally all of this has been done, and today I released the first MySQL version of Eumicrobedb. There are still a lot of glitches that I know exist, and there are some unknown bugs too. In the coming days all of these will be cleared.

 So, altogether it was a herculean task that one of my MTech students, Akash, and I undertook amidst our other responsibilities...
Database is now available at www.eumicrobedb.org/eumicrobedb/

Sunday, June 23, 2013

Installation of Samtools and BEDTools in my new RH 6 64 bit server

Troubles come as a package when you start your new lab. In my effort to resurrect our old database Eumicrobedb.org, I am facing many issues, such as installing a myriad of open-source packages. What I did over a period of a decade now needs to be done right away...

I never recall having any issues with the installation of easy-to-install packages such as samtools and BEDTools. One of the reasons may be that a system admin used to take care of the several system-level library installations that are a prerequisite. While compiling BEDTools, it complained about "undefined reference to `gzopen64'". Searching the forums, I figured out that it was complaining about not finding zlib. I checked the installation several times, and zlib was right there and on the path. Checking the stdout of the make commands, I found that it looks for zlib in the right place, yet comes back in the end complaining about it. However, finally changing the LIBS path in the Makefile solved the problem.


Look for the line in Makefile
 'export LIBS'          
Change it to:
  export LIBS = YOUR_PATH/libz.so.1.2.7 -lz

$make clean
$make
It should compile fine.

Samtools:

With samtools, the problem lasted a lot longer. It always exited with the error:
"samtools error bam_import.c:76: undefined reference to gzopen64"

I tried re-installing zlib, adding the zlib path to LD_LIBRARY_PATH, and installing the latest version of zlib; nothing worked.
Finally, changing CFLAGS in the Makefile did the trick:

Change
CFLAGS=         -g -Wall -O2
to
CFLAGS=         -g -Wall -O2 -L /usr/local/lib (This is where my zlib libraries are located)

.... It finally worked.
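As a quick sanity check after a build, you can confirm which zlib the binary actually linked against (a generic diagnostic, not part of either package):

ldd ./samtools | grep libz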

NOTE: Install as a regular user. Later you can copy the binary files to the system directories.

Wednesday, June 12, 2013

Bioinformatics: An Inseparable part of Systems Biology.

Hello everybody,

As per the title of this post, today I'm going to discuss two widely used terms: Bioinformatics and Systems Biology. Well, this time I'm not here to illuminate you with some new information, since neither of these is a newly coined term; rather, my objective behind this post is to share my personal view on these popular disciplines of modern biology, their relation, and their interdependency.

Let's first come to Systems Biology; what does it mean?

Systems biology deals with biological phenomena at the systems level, where a system is defined as a cluster/collection of smaller components in which we, the viewers, have an interest. Now, if we consider the cell as a system, then its building materials are organelles, macro- and micro-molecules and the genetic material (DNA and RNA). If we consider an organ as a system, then its component tissues and the signalling system are the building materials. So, what appears from the discussion so far is that this term "system" in a biological context is very flexible, and it follows some kind of hierarchical construction.

We have gathered a huge amount of information about these biological entities from a variety of experiments, and it needs to be organized and accessible in order to extract other valuable information from it; and this is nothing but bioinformatics. So, without the proper organization of biological data, and without the facility to extract valuable information from it, it is impossible to stitch those entities together to build a system.

Thus, bioinformatics should be viewed as an inseparable part of systems biology, rather than only as an important part of it.

Monday, June 3, 2013


NANO SLEUTH

Came across a very interesting article this morning and wanted to share it.



 Researchers Yamuna Krishnan & Souvik Modi, based at the National Centre for Biological Sciences (NCBS), Bangalore, have gifted the world a tool that can play sleuth (detective) within a living cell & help scientists develop the best treatment for a disease.
They demonstrated that 2 or more extremely tiny DNA devices, which they call nano-devices, can be dispatched inside a cell to report on the goings-on within. They created the device by heating & cooling commercially available DNA strands with potassium chloride.
This tiny device, 14nm long and 2nm in diameter, helped the scientists accurately measure the pH values of the subcellular locations where the devices were parked. The authors further added that "if the pH is found to be different from normal, we know something is amiss".
The application potential of the technique is humongous. It can be a great tool for drug discoverers, and it may also give us the next generation of probes for sensing intracellular signals.
This study appeared in Nature Nanotechnology recently.











Photo: Souvik Modi and Yamuna Krishnan

Thursday, May 30, 2013

Newly discovered species provides a probable link between cyanobacteria and chloroplasts!

This is another of my blogs on cyanobacteria. It is not only about a newly discovered species but also about a newly discovered symbiotic relationship. This cyanobacterium has not been cultured yet and so has been given a provisional name, Candidatus Atelocyanobacterium thalassa. This cyanobacterium is unusual in the sense that it lacks some very basic components without which it is not able to perform photosynthesis and is thus unable to fix carbon to sustain itself; these components are RuBisCO, photosystem II and the tricarboxylic acid cycle.
Despite the absence of these enzymes and pathways, this cyanobacterium survives; this makes us wonder how. The answer lies in a phenomenon that could provide us insights into the very birth of the chloroplast!

This cyanobacterium is found in a symbiotic association with an alga: a single-celled, free-living, photosynthetic picoeukaryote (a prymnesiophyte). The cyanobacterium gets its carbon requirements from this alga, and in return it provides the alga with fixed nitrogen, which it efficiently fixes from the marine environment. There is an imbalance in the synergy, because the cyanobacterium gives around 95% of its fixed nitrogen to the alga whereas the alga gives a very little amount of carbon (1-17%), but this imbalance can be attributed to the difference in their sizes.

This discovery not only gives an insight into the planktonic symbiotic world but also gives some proof for the much-sought answer about the link between ancient cyanobacteria and chloroplasts. Chloroplasts are organelles present inside photosynthetic cells and are responsible for fixing carbon and thus supporting all life forms. This cyanobacterium lives on the surface of the alga in a groove-like structure. This type of association might have preceded the event of the alga engulfing the cyanobacterium, thus giving rise to the present chloroplast.
Fig: Microscopy showing the symbiotic partners


Reference: Thompson AW, Foster RA, Krupke A, Carter BJ, Musat N, Vaulot D, Kuypers MM, Zehr JP. Unicellular cyanobacterium symbiotic with a single-celled eukaryotic alga. Science. 2012 Sep 21;337

Wednesday, May 29, 2013

Cyanobacteria that sequester carbonate granules internally

Cyanobacterial metabolism is quite interesting, leading to different kinds of products ranging from biofuels to sugars, isoprene and carbonates. Until now, what had been found in cyanobacteria was extracellular carbonate deposits, but it was recently found that a species of cyanobacteria produces intracellular carbonate deposits. This species has been given the provisional name Candidatus Gloeomargarita lithophora, in the order Gloeobacterales.

During photosynthesis, CO2 is fixed, leading to a more alkaline environment inside the cell. This makes the cells excrete alkalinity outside, and if calcium is present in the environment, carbonate precipitates and forms crystals with it, leading to calcium carbonate deposits.

But in these cyanobacteria the mechanism for excreting carbonates is either not present because of their ancient origin, or the reason could be something that makes this phenomenon advantageous for them. The probable advantage of these inclusions is that they raise the density of the cells by ~12%, which has been predicted to be useful for these ancient benthic organisms to stay grounded.


The intracellular granules are found to be ~270nm in size, and the average number of inclusions per cell is ~21. These inclusions are rich in Ca, Ba, Sr and Mg, and their ratios in the granules indicate the abundance of these elements in the local environment.
Reference: Couradeau E, Benzerara K, Gérard E, Moreira D, Bernard S, Brown GE Jr, López-García P. An early-branching microbialite cyanobacterium forms intracellular carbonates. Science. 2012 Apr 27;336(6080):459-62. 

Friday, May 24, 2013

Genome sequencing..

Genome sequencing is famous and one of the hottest areas of bioinformatics, and I would like to share my thoughts on it:

Why sequence a genome?

Each and every gene should be studied, and the function of the genes, the operons and the base-pair composition should be known for the entire genome or chromosome. Genome sequencing is also useful for personalized medicine and for curing diseases. The first whole genomes to be sequenced were those of the bacteriophage MS2 and, later, H. influenzae. Early genome sequencing was done by the Maxam-Gilbert and Sanger methods; these are very popular but tedious methods. As time passed, the science improved and grew fast, and many sequencing technologies were developed; the most recent is next-generation sequencing.


Sequencing costs:
The human genome sequencing project took 7-8 years to complete the work, at a cost of about $1 billion. After that, they proceeded to sequence a 2nd human genome; it took around 3-4 months and about $30-40 million. Nowadays it costs around $5,000 and takes around 3 to 4 months, 6 months at maximum.

SOLiD:
The method we are interested in is SOLiD; let's see what the method is and how it's done.
It was introduced by Life Technologies. Here the sequencing is done by DNA ligase, and it leads to the formation of beads with adapters; each bead contains identical copies of a DNA fragment, and those base pairs are further used, with each base represented in 4 different colors. You then get a lot of data, and the first step is the assembly of the data. We first clean the data (remove vector and adaptor sequences); the sequencing gives many short fragments whose overlaps form contigs, and those contigs may be around 4-5 kb in size. Then the scaffolds are formed, and we can find the size of the scaffolds by using the BAC insert size. The assembly is checked by using a FASTX file, and we can get the plot.

LIBRARY FORMATION IN SOLiD:

These are the steps I have learnt so far from my PI, and I am interested in working on this, so I am writing this blog to show how much I understood; there might be some mistakes, or I may be unable to express things properly. The remaining part of sequencing I will write about in the next blog soon, after I learn it....

Let's see some steps in Linux which I learnt:

1. Once you start to work with Linux, you need to install the fastx tool.
2. Download the software (it might download in a zipped format).
3. Then untar it by giving the command tar -xjf followed by the filename.
4. Then go to the directory where the software was downloaded; to know the path, use the command pwd.
5. Then start installing the file.
6. If you get any errors like PKG_CONFIG or GTEXTUTILS not found, export this path: $ export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
7. Then run ./configure
8. Type make
9. Type make install
Then the software is installed successfully (a consolidated sketch of the whole run follows below). Then enjoy the sequencing.. happy sequencing...
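Put together as a single run, it looks roughly like this (a sketch with hypothetical tarball versions; libgtextutils must be built before fastx_toolkit, and make install may need root):

# build the gtextutils dependency first
tar -xjf libgtextutils-0.6.tar.bz2
cd libgtextutils-0.6 && ./configure && make && make install && cd ..

# now the toolkit itself, telling pkg-config where gtextutils landed
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
tar -xjf fastx_toolkit-0.0.13.tar.bz2
cd fastx_toolkit-0.0.13 && ./configure && make && make install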






Some Unix and Perl One liners....

Reposting...

   File format conversion/line counting/counting number of files etc.

1.    $ wc -l   : count number of lines in a file.
2.    $ ls | wc -l        : count number of files in a directory.
3.    $ tac     : print the file in reverse line order, e.g. last line first, first line last.
4.    $ rev     : reverse each line of the file character by character.
5.    $ sed 's/.$//' or sed 's/^M$//' or sed 's/\x0D$//' : converts a DOS file into unix mode.
6.    $ sed "s/$/`echo -e \\\r`/" or sed 's/$/\r/' : converts unix newlines into DOS newlines.
7.    $ awk '1; { print "" }' : double space a file.
8.    $ awk '{ total = total + NF }; END { print total+0 }' : prints the number of words in a file.
9.    $ sed '/^$/d' or grep '.' : delete all blank lines in a file.
10.    $ sed '/./,$!d' : delete all blank lines at the beginning of the file.
11.    $ sed -e :a -e '/^\n*$/{$d;N;ba' -e '}' : delete all blank lines at the end of the file.
12.    $ sed -e :a -e 's/<[^>]*>//g;/</N;//ba' : delete HTML tags from a file, including tags that span multiple lines.
13.    $ sed 's/^[ \t]*//' : delete all leading whitespace and tabs in a file.
14.    $ sed 's/[ \t]*$//' : delete all trailing whitespace and tabs in a file.
15.    $ sed 's/^[ \t]*//;s/[ \t]*$//' : delete both leading and trailing whitespace and tabs in a file.


2.2    Working with Patterns/numbers in a sequence file
16.    $ awk '/pattern/ { n++ }; END { print n+0 }' : print the total number of lines containing the word 'pattern'.
17.    $ sed 10q : print the first 10 lines.
18.    $ sed -n '/regexp/p' : print the lines that match the pattern.
19.    $ sed '/regexp/d' : delete the lines that match the regexp.
20.    $ sed -n '/regexp/!p' : print the lines that do not match the pattern.
21.    $ sed '/regexp/!d' : delete the lines that do NOT match the regular expression.
22.    $ sed -n '/^.\{65\}/p' : print lines that are longer than 65 characters.
23.    $ sed -n '/^.\{65\}/!p' : print lines that are shorter than 65 characters.
24.    $ sed -n '/regexp/{g;1!p;};h' : print one line before the pattern match.
25.    $ sed -n '/regexp/{n;p;}' : print one line after the pattern match.
26.    $ sed -n '/^.\{65\}/ {g;1!p;};h' < sojae_seq > tmp : print the names of the sequences that are longer than 65 nucleotides.
27.    $ sed -n '/regexp/,$p' : print from the regular expression to the end of file.
28.    $ sed -n '8,12p' : print lines 8 to 12 (inclusive).
29.    $ sed -n '52p' : print only line number 52.
30.    $ sed '/pattern1/,/pattern2/d' < inputfile > outfile : will delete all the lines between pattern1 and pattern2 (inclusive).
31.    $ sed '20,30d' < inputfile > outfile : will delete lines 20 through 30 (inclusive).
32.    awk '/baz/ { gsub(/foo/, "bar") }; { print }' : substitute foo with bar in lines that contain 'baz'.
33.    awk '!/baz/ { gsub(/foo/, "bar") }; { print }' : substitute foo with bar in lines that do not contain 'baz'.
34.    grep -i -B 1 'pattern' filename > out : will print the name of the sequence and the sequence having the pattern, case-insensitively (make sure the sequence name and the sequence each occupy a single line).
35.    grep -i -A 1 'seqname' filename > out : will print the sequence name as well as the sequence into the file 'out'.

2.3    Inserting Data into a file:

36.    gawk --re-interval 'BEGIN{ while(a++<49) s=s "x" }; { sub(/^.{6}/,"&" s) }; 1' > fileout : will insert 49 'x' characters after the first 6 characters of every line.

37.    gawk --re-interval 'BEGIN{ s="YourName" }; { sub(/^.{6}/,"&" s) }; 1' : insert your name after the first 6 characters of every line.

3.    Working with Data Files [tab-delimited files]:

3.1    Error Checking and data handling:
38.    awk '{ print NF ":" $0 } ' : print the number of fields of each line, followed by the line.
39.    awk '{ print $NF }' : print the last field of each line.
40.    awk 'NF > n' : print every line with more than 'n' fields.
41.    awk '$NF > n' : print every line where the last field is greater than 'n'.
42.    awk '{ print $2, $1 }' : print just the first 2 fields of a data file, in reverse order.
43.    awk '{ temp = $1; $1 = $2; $2 = temp; print }' : print all the fields in the original order, except with the first 2 fields swapped.
44.    awk '{ for (i=NF; i>0; i--) printf("%s ", $i); printf ("\n") }' : print all the fields in reverse order.
45.    awk '{ $2 = ""; print }' : delete the 2nd field in each line.
46.    awk '$5 == "abc123"' : print each line where the 5th field is equal to 'abc123'.
47.    awk '$5 != "abc123"' : print each line where the 5th field is NOT equal to 'abc123'.
48.    awk '$7  ~ /^[a-f]/' : print each line whose 7th field matches the regular expression.
49.    awk '$7 !~ /^[a-f]/' : print each line whose 7th field does NOT match the regular expression.
50.    cut -f n1,n2,n3 inputfile > outputfile : will cut columns (fields) n1, n2 and n3 from the input file and print them to the output file. If the delimiter is other than TAB, give an additional argument, e.g. cut -d ',' -f n1,n2 inputfile > out.
51.    sort -n -k 2,2 -k 4,4 file > fileout : will conduct a numerical sort on column 2, and then on column 4. If -n is not specified, sort will do a lexicographical sort (by ASCII value).

4.    Miscellaneous:
52.    uniq -u inputfile > out : will print only the lines that appear exactly once in the sorted input file.
53.    uniq -d inputfile > out : will print only the lines that appear more than once in the sorted input file.
54.    cat file1 file2 file3 ... fileN > outfile : will concatenate files back to back into outfile.
55.    paste file1 file2 > outfile : will merge two files horizontally. This is good for merging files with the same number of rows but different column widths.
56.    !pattern:p : will print (without running) the most recent command that starts with 'pattern'.
57.    !! : repeat the last command entered at the shell.
58.    ~ : go back to the home directory.
59.    echo {a,t,g,c}{a,t,g,c}{a,t,g,c}{a,t,g,c} : will generate all tetramers using 'atgc'. If you want pentamers/hexamers etc., just increase the number of bracketed entities. NOTE: this is not an efficient sequence shuffler; if you wish to generate longer sequences then use other means.
60.    kill -HUP ` ps -aef | grep -i firefox | sort -k 2 -r | sed 1d | awk ' { print $2 } ' ` : kills a hanging firefox process.
61.    csplit -n 7 input.fasta '/>/' '{*}' : will split the file 'input.fasta' wherever it encounters the delimiter '>'. The file names will carry 7-digit suffixes.
62.    find . -name data.txt -print : finds and prints the path for the file data.txt.
Sample script to run set operations on sequence files:
63.    grep '>' filenameA > list1  # will list just the sequence names in file A
grep '>' filenameB > list2 # will list the names for file B
cat list1 list2 > tmp # concatenates list1 and list2 into tmp
sort tmp > tmp1 # file sorted
uniq -u tmp1 > uniq    # AUB - A ∩ B (OR (A-B) U (B-A))
uniq -d tmp1 > double  # is the intersection (A ∩ B)
cat uniq double > Union # AUB
cat list1 double > tmp
sort tmp | uniq -u > list1uniq # A - B
cat list2 double > tmp
sort tmp | uniq -u > list2uniq # B - A



PERL ONELINERS:

1.    perl -pe '$\="\n"'   : double space a file
2.    perl -pe '$_ .= "\n" unless /^$/' : double space a file, except for blank lines
3.    perl -pe '$_.="\n"x7' : insert 7 blank lines after every line.
4.    perl -ne 'print unless /^$/' : remove all blank lines
5.    perl -lne 'print if length($_) < 20' : print all lines with length less than 20.
6.    perl -00 -pe '' : if there are multiple consecutive blank lines, collapse them into one (make the file single spaced between paragraphs).
7.    perl -00 -pe '$_.="\n"x4' : expand single blank lines into 4 consecutive blank lines
8.    perl -pe '$_ = "$. $_"' : number all lines in a file
9.    perl -pe '$_ = ++$a." $_" if /./' : number only non-empty lines in a file
10.    perl -ne 'print ++$a." $_" if /./' : number and print only non-empty lines in a file
11.    perl -pe '$_ = ++$a." $_" if /regex/' : number only lines that match a pattern
12.    perl -ne 'print ++$a." $_" if /regex/' : number and print only lines that match a pattern
13.    perl -ne 'printf "%-5d %s", $., $_ if /regex/' : left-align the line number in a 5-character column for lines matching a pattern (perl -ne 'printf "%-5d %s", $., $_' : for all the lines)
14.    perl -le 'print scalar(grep{/./}<>)' : prints the total number of non-empty lines in a file
15.    perl -lne '$a++ if /regex/; END {print $a+0}' : print the total number of lines that match the pattern
16.    perl -alne 'print scalar @F' : print the total number of fields (words) in each line.
17.    perl -alne '$t += @F; END { print $t }' : find the total number of words in the file
18.    perl -alne 'map { /regex/ && $t++ } @F; END { print $t }' : find the total number of fields that match the pattern
19.    perl -lne '/regex/ && $t++; END { print $t }' : find the total number of lines that match a pattern
20.    perl -le '$n = 20; $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $m' : will calculate the GCD of two numbers.
21.    perl -le '$a = $n = 20; $b = $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $a*$b/$m' : will calculate the LCM of 20 and 35.
22.    perl -le '$n=10; $min=5; $max=15; $, = " "; print map { int(rand($max-$min))+$min } 1..$n' : generates 10 random numbers between 5 and 15.
23.    perl -le 'print map { ("a".."z","0".."9")[rand 36] } 1..8' : generates an 8-character password from a-z and 0-9.
24.    perl -le 'print map { ("a","t","g","c")[rand 4] } 1..20' : generates a 20-nucleotide-long random sequence.
25.    perl -le 'print "a"x50' : generates a string of 50 'a' characters.
26.    perl -le 'print join ", ", map { ord } split //, "hello world"' : will print the ASCII values of the string "hello world".
27.    perl -le '@ascii = (99, 111, 100, 105, 110, 103); print pack("C*", @ascii)' : converts ASCII values into a character string.
28.    perl -le '@odd = grep {$_ % 2 == 1} 1..100; print "@odd"' : generates an array of odd numbers.
29.    perl -le '@even = grep {$_ % 2 == 0} 1..100; print "@even"' : generates an array of even numbers.
30.    perl -lpe 'y/A-Za-z/N-ZA-Mn-za-m/' file : converts the entire file into a 13-character offset (ROT13).
31.    perl -nle 'print uc' : convert all text to uppercase.
32.    perl -nle 'print lc' : convert text to lowercase.
33.    perl -nle 'print ucfirst lc' : convert only the first letter of the first word to uppercase.
34.    perl -ple 'y/A-Za-z/a-zA-Z/' : convert upper case to lower case and vice versa.
35.    perl -ple 's/(\w+)/\u$1/g' : camel-case all words.
36.    perl -pe 's|\n|\r\n|' : convert unix newlines into DOS newlines.
37.    perl -pe 's|\r\n|\n|' : convert DOS newlines into unix newlines.
38.    perl -pe 's|\n|\r|' : convert unix newlines into MAC newlines.
39.    perl -pe '/regexp/ && s/foo/bar/' : substitute foo with bar, but only on lines that match the regexp.

Some other Perl Tricks

Want to display a progress bar while perl does your job?

For this perl provides a nice utility called "pipe opens" ('perldoc -f open' will provide more info):

open(my $file, '-|', 'command', 'option', 'option') or die "Could not run command - $!";
while (<$file>) {
    print "-";    # print one dash per line the command writes to stdout
}
print "\n";
close($file);
 
This will print a '-' on the screen for each line of output until the process completes.