Thursday, May 30, 2013

Newly discovered species provides probable link between cyanobacteria and the chloroplast!

This is another of my blogs on cyanobacteria. It is not only about a newly discovered species but also about a newly discovered symbiotic relationship. This cyanobacterium has not been cultured yet and so has been given the provisional name Candidatus Atelocyanobacterium thalassa. It is unusual in that it lacks some very basic components, namely RuBisCO, photosystem II, and the tricarboxylic acid cycle, without which it cannot perform photosynthesis and thus cannot fix carbon to sustain itself.
Despite the absence of these enzymes and pathways, this cyanobacterium survives, which raises the question: how? The answer lies in a phenomenon that could give us insights into the very birth of the chloroplast!

This cyanobacterium is found in symbiotic association with an alga: a single-celled, free-living, photosynthetic picoeukaryote, a prymnesiophyte. The cyanobacterium gets its carbon requirements from this alga and in return provides the alga with fixed nitrogen, which it efficiently fixes from the marine environment. The exchange is imbalanced, because the cyanobacterium gives around 95% of its fixed nitrogen to the alga whereas the alga gives only a small amount of its fixed carbon (1-17%), but this imbalance can be attributed to the difference in their sizes.

This discovery not only gives insight into the planktonic symbiotic world but also provides some evidence for the much sought-after link between ancient cyanobacteria and the chloroplast. Chloroplasts are organelles present inside photosynthetic cells and are responsible for fixing carbon, thus supporting all life forms. This cyanobacterium lives on the surface of the alga in a groove-like structure. This type of association might have preceded the event of the alga engulfing the cyanobacterium, which would have given rise to the present-day chloroplast.
[Figure: Microscopy showing the symbiotic partners]

Reference: Thompson AW, Foster RA, Krupke A, Carter BJ, Musat N, Vaulot D, Kuypers MM, Zehr JP. Unicellular cyanobacterium symbiotic with a single-celled eukaryotic alga. Science. 2012 Sep 21;337

Wednesday, May 29, 2013

A cyanobacterium that sequesters carbonate granules internally

Cyanobacterial metabolism is quite interesting, leading to different kinds of products ranging from biofuels and sugars to isoprene and carbonates. Until now, only extracellular carbonate deposits had been found in cyanobacteria, but it has recently been found that a species of cyanobacteria produces intracellular carbonate deposits. This species has been given the provisional name Candidatus Gloeomargarita lithophora, in the order Gloeobacterales.

During photosynthesis, CO2 is fixed, leading to a more alkaline environment inside the cell. This makes the cells excrete alkalinity, and if calcium is present in the environment, carbonate precipitates with it, forming crystals and leading to calcium carbonate deposits.

In these cyanobacteria, however, the mechanism for excreting carbonates is either absent because of their ancient origin, or something makes retaining them advantageous. The probable advantage of these inclusions is that they raise the density of the cells by ~12%, which has been predicted to be useful for these ancient benthic organisms to stay grounded.

The intracellular granules are ~270 nm in size, and the average number of inclusions per cell is ~21. These inclusions are rich in Ca, Ba, Sr, and Mg, and their ratios in the granules reflect the abundance of these elements in the local environment.
Reference: Couradeau E, Benzerara K, Gérard E, Moreira D, Bernard S, Brown GE Jr, López-García P. An early-branching microbialite cyanobacterium forms intracellular carbonates. Science. 2012 Apr 27;336(6080):459-62. 

Friday, May 24, 2013

Genome sequencing

Genome sequencing is one of the hottest areas of bioinformatics, and I would like to share my thoughts on it:

Why sequence a genome?

Each and every gene should be studied, and the function of the genes, the operons, and the base pair composition should be known for the entire genome or the entire chromosome. Genome sequencing is also useful for personalized medicine and for curing diseases. The first sequenced whole genome was that of the bacteriophage MS2, and the first bacterial genome was H. influenzae. Early genome sequencing was done by the Maxam-Gilbert and Sanger methods; the Sanger method is very popular but tedious. As time changes, science also improves and grows fast, and many sequencing technologies have grown with it; the most recent is next-generation sequencing.

Sequencing Costs:
The Human Genome Project took 7-8 years to complete and cost about $1 billion. After this, sequencing the 2nd human genome took around 3-4 months and about $30 million - $40 million. Nowadays it costs around $5,000 and takes around 3-4 months, 6 months at maximum.

     The one method we are interested in is SOLiD; let's see how the method works.
SOLiD, introduced by Life Technologies, is a sequencing-by-ligation method: the sequencing reaction is driven by DNA ligase rather than a polymerase. Template DNA fragments are attached to beads via adapters, and each bead carries many identical copies of the same DNA fragment. The base calls are represented in 4 different colours (each colour encodes two adjacent bases). A run produces a huge amount of data, and the next step is assembly. First we clean the data (remove vector and adapter sequence); the sequencer gives many short fragments, and overlapping fragments are joined into contigs, which may be around 4-5 kb in size. Then scaffolds are formed, and we can estimate the size of the scaffolds by using the BAC insert size. Read quality can be checked with the FASTX-Toolkit, from which we can get quality plots.
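Before assembly, it helps to sanity-check the raw reads. A minimal sketch using only awk on a tiny, made-up FASTQ file (the file and reads here are hypothetical, just for illustration) that counts the reads and reports their average length:

```shell
# Make a tiny FASTQ file for illustration (hypothetical reads).
cat > reads.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTACGTACGT
+
IIIIIIIIIIII
EOF

# Every FASTQ record is 4 lines; line 2 of each record is the sequence.
# Count the reads and report the average read length.
awk 'NR % 4 == 2 { n++; t += length($0) }
     END { printf "reads=%d avg_len=%.1f\n", n, t/n }' reads.fastq
# prints: reads=2 avg_len=10.0
```

The same idea scales to real files; dedicated tools report much more (per-base quality, adapter content), but a quick awk pass like this catches truncated files early.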


These are the steps I have learnt so far from my PI. I am interested in working on this, so I am writing this blog to show how much I have understood; there might be some mistakes, or I may not have expressed things properly. I will write about the remaining part of sequencing in the next blog soon....

Let's see some steps in Linux which I learnt:

1. Once you start working with Linux, you need to install the FASTX-Toolkit.
2. Download the software (it might download in a zipped format).
3. Then untar it by giving the command tar -xjf followed by the filename.
4. Then go to the path where the software was downloaded; to know the current path, use the command pwd.
5. Then start installing the file.
6. If you get any errors like PKG_CONFIG or GTEXTUTILS, set this path: export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
7. Then configure by giving ./configure
8. Type make
9. Type make install

Then the software is installed successfully. Enjoy the sequencing... happy sequencing...
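Putting the steps above together, here is a minimal sketch of the whole install as one shell session. The archive names and version numbers below are assumptions (substitute whatever you actually downloaded), and the dependency library is built first so that pkg-config can find it:

```shell
# Assumed archive names/versions -- adjust to match your downloads.
tar -xjf libgtextutils-0.6.tar.bz2      # the GTEXTUTILS dependency first
cd libgtextutils-0.6
./configure
make
sudo make install
cd ..

# Make the freshly installed library visible to pkg-config (step 6 above).
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH

tar -xjf fastx_toolkit-0.0.13.tar.bz2   # then the FASTX-Toolkit itself
cd fastx_toolkit-0.0.13
./configure
make
sudo make install
```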

Some Unix and Perl One liners....


   File format conversion/line counting/counting number of files etc.

1.    $ wc -l : count the number of lines in a file.
2.    $ ls | wc -l : count the number of files in a directory.
3.    $ tac : print the file in reverse line order, e.g. last line first, first line last.
4.    $ rev : reverse each line of the file character by character.
5.    $ sed 's/.$//' or sed 's/^M$//' or sed 's/\x0D$//' : converts a DOS file into Unix mode.
6.    $ sed "s/$/`echo -e \\\r`/" or sed 's/$/\r/' : converts Unix newlines into DOS newlines.
7.    $ awk '1; { print "" }' : double space a file.
8.    $ awk '{ total = total + NF }; END { print total+0 }' : prints the number of words in a file.
9.    $ sed '/^$/d' or grep '.' : delete all blank lines in a file.
10.    $ sed '/./,$!d' : delete all blank lines at the beginning of the file.
11.    $ sed -e :a -e '/^\n*$/{$d;N;ba' -e '}' : delete all blank lines at the end of the file.
12.    $ sed -e :a -e 's/<[^>]*>//g;/</N;//ba' : delete HTML tags, including tags spanning multiple lines.
13.    $ sed 's/^[ \t]*//' : delete all leading whitespace and tabs in a file.
14.    $ sed 's/[ \t]*$//' : delete all trailing whitespace and tabs in a file.
15.    $ sed 's/^[ \t]*//;s/[ \t]*$//' : delete both leading and trailing whitespace and tabs in a file.
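As a quick sanity check, the whitespace-trimming one-liner (number 15) can be tried on a sample string (GNU sed; the \t inside the brackets is interpreted as a tab):

```shell
# One-liner 15 applied to a sample string with messy whitespace.
printf '   \thello world \t \n' | sed 's/^[ \t]*//;s/[ \t]*$//'
# prints: hello world
```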

2.2    Working with Patterns/numbers in a sequence file
16.    $ awk '/pattern/ { n++ }; END { print n+0 }' : print the total number of lines containing the word 'pattern'.
17.    $ sed 10q : print the first 10 lines.
18.    $ sed -n '/regexp/p' : print the lines that match the pattern.
19.    $ sed '/regexp/d' : delete the lines that match the regexp.
20.    $ sed -n '/regexp/!p' : print the lines that do NOT match the pattern.
21.    $ sed '/regexp/!d' : delete the lines that do NOT match the regular expression.
22.    $ sed -n '/^.\{65\}/p' : print lines that are 65 characters or longer.
23.    $ sed -n '/^.\{65\}/!p' : print lines that are shorter than 65 characters.
24.    $ sed -n '/regexp/{g;1!p;};h' : print the line before the pattern match.
25.    $ sed -n '/regexp/{n;p;}' : print the line after the pattern match.
26.    $ sed -n '/^.\{65\}/ {g;1!p;};h' < sojae_seq > tmp : print the names of the sequences that are 65 nucleotides or longer.
27.    $ sed -n '/regexp/,$p' : print from the regular expression to the end of the file.
28.    $ sed -n '8,12p' : print lines 8 to 12 (inclusive).
29.    $ sed -n '52p' : print only line number 52.
30.    $ sed '/pattern1/,/pattern2/d' < inputfile > outfile : will delete all the lines from pattern1 through pattern2 (inclusive).
31.    $ sed '20,30d' < inputfile > outfile : will delete lines 20 through 30.
32.    $ awk '/baz/ { gsub(/foo/, "bar") }; { print }' : substitute 'foo' with 'bar' in lines that contain 'baz'.
33.    $ awk '!/baz/ { gsub(/foo/, "bar") }; { print }' : substitute 'foo' with 'bar' in lines that do not contain 'baz'.
34.    $ grep -i -B 1 'pattern' filename > out : will print the name of the sequence and the sequence containing the pattern, case-insensitively (make sure the sequence name and the sequence each occupy a single line).
35.    $ grep -i -A 1 'seqname' filename > out : will print the sequence name as well as the sequence into the file 'out'.
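For example, one-liner 35 can pull one sequence out of a small FASTA file where each name and sequence occupy a single line (the file and sequence names below are made up):

```shell
# A two-sequence FASTA file; each name and sequence occupy one line.
cat > seqs.fasta <<'EOF'
>seq1
ACGTACGT
>seq2
TTTTAAAA
EOF

# One-liner 35: print the matching header plus the sequence line after it.
grep -i -A 1 'seq2' seqs.fasta
# prints:
# >seq2
# TTTTAAAA
```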

2.3    Inserting Data into a file:

36.    $ gawk --re-interval 'BEGIN{ while(a++<49) s=s "x" }; { sub(/^.{6}/,"&" s) }; 1' > fileout : will insert 49 'x' characters at the sixth position of every line.

37.    $ gawk --re-interval 'BEGIN{ s="YourName" }; { sub(/^.{6}/,"&" s) }; 1' : insert your name at the 6th position in every line.

3.    Working with Data Files[Tab delimited files]:

3.1    Error Checking and data handling:
38.    $ awk '{ print NF ":" $0 }' : print the number of fields of each line, followed by the line.
39.    $ awk '{ print $NF }' : print the last field of each line.
40.    $ awk 'NF > n' : print every line with more than 'n' fields.
41.    $ awk '$NF > n' : print every line where the last field is greater than 'n'.
42.    $ awk '{ print $2, $1 }' : prints just the first 2 fields of a data file, in reverse order.
43.    $ awk '{ temp = $1; $1 = $2; $2 = temp; print }' : prints all the fields in the original order, except that the first 2 fields are swapped.
44.    $ awk '{ for (i=NF; i>0; i--) printf("%s ", $i); printf ("\n") }' : prints all the fields in reverse order.
45.    $ awk '{ $2 = ""; print }' : deletes the 2nd field in each line.
46.    $ awk '$5 == "abc123"' : print each line where the 5th field is equal to 'abc123'.
47.    $ awk '$5 != "abc123"' : print each line where the 5th field is NOT equal to 'abc123'.
48.    $ awk '$7 ~ /^[a-f]/' : print each line whose 7th field matches the regular expression.
49.    $ awk '$7 !~ /^[a-f]/' : print each line whose 7th field does NOT match the regular expression.
50.    $ cut -f n1,n2,n3 inputfile > outputfile : will cut columns (fields) n1, n2, n3 from the input file and print them to the output file. If the delimiter is other than TAB, give an additional argument, such as: cut -d ',' -f n1,n2 inputfile > out
51.    $ sort -n -k 2,2 -k 4,4 file > fileout : will perform a numerical sort on column 2 and then on column 4. If -n is not specified, sort will do a lexicographical sort (by ASCII value).
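A small worked example of one-liners 50 and 51 combined, on a hypothetical tab-delimited file (the file name and gene entries are made up):

```shell
# A made-up 3-column tab-delimited file: gene, length, chromosome.
printf 'geneA\t300\tchr1\ngeneB\t120\tchr2\ngeneC\t210\tchr1\n' > genes.tsv

# One-liners 50 + 51: keep columns 1-2, then sort numerically on column 2.
cut -f 1,2 genes.tsv | sort -n -k 2,2
# prints (tab-separated):
# geneB 120
# geneC 210
# geneA 300
```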

4.    Miscellaneous:
52.    $ uniq -u inputfile > out : will print only the unique lines present in the sorted input file.
53.    $ uniq -d inputfile > out : will print only the duplicated lines from the sorted input file.
54.    $ cat file1 file2 file3 ... fileN > outfile : will concatenate files back to back into outfile.
55.    $ paste file1 file2 > outfile : will merge two files horizontally. This is good for merging files with the same number of rows but different columns.
56.    $ !pattern:p : will print the previous command run that starts with 'pattern'.
57.    $ !! : repeat the last command entered at the shell.
58.    $ cd ~ : go back to the home directory.
59.    $ echo {a,t,g,c}{a,t,g,c}{a,t,g,c}{a,t,g,c} : will generate all tetramers over 'atgc'. If you want pentamers/hexamers etc., just increase the number of bracketed entities. NOTE: this is not an efficient sequence shuffler; if you wish to generate longer sequences, use other means.
60.    $ kill -HUP `ps -aef | grep -i firefox | sort -k 2 -r | sed 1d | awk '{ print $2 }'` : kills a hanging firefox process.
61.    $ csplit -n 7 input.fasta '/>/' '{*}' : will split the file 'input.fasta' wherever it encounters the delimiter '>'. The file names will appear as 7-digit-long strings.
62.    $ find . -name data.txt -print : finds and prints the path for the file data.txt.
Sample script for set operations on sequence files:
63.    grep '>' filenameA > list1  # list just the sequence names in file A
grep '>' filenameB > list2       # list the names for file B
cat list1 list2 > tmp            # concatenate list1 and list2 into tmp
sort tmp > tmp1                  # sort the file
uniq -u tmp1 > uniq              # AUB - A ∩ B (i.e. (A-B) U (B-A))
uniq -d tmp1 > double            # the intersection (A ∩ B)
cat uniq double > Union          # AUB
cat list1 double > tmp
sort tmp | uniq -u > list1uniq   # A - B
cat list2 double > tmp
sort tmp | uniq -u > list2uniq   # B - A
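The set-operation recipe above can be exercised end to end on two toy name lists (the file names and the >s1-style headers are made up):

```shell
# Two toy header lists standing in for the output of grep '>' on FASTA files.
printf '>s1\n>s2\n>s3\n' > list1
printf '>s2\n>s3\n>s4\n' > list2

cat list1 list2 | sort > tmp1   # combine and sort, as in the recipe above

uniq -u tmp1   # symmetric difference (A-B) U (B-A): >s1 and >s4
uniq -d tmp1   # intersection A ∩ B: >s2 and >s3
```

Note that uniq only detects adjacent duplicates, which is why the sort step is essential.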


1.    perl -pe '$\="\n"' : double space a file.
2.    perl -pe '$_ .= "\n" unless /^$/' : double space a file, except blank lines.
3.    perl -pe '$_.="\n"x7' : insert 7 blank lines after each line.
4.    perl -ne 'print unless /^$/' : remove all blank lines.
5.    perl -lne 'print if length($_) < 20' : print all lines with length less than 20.
6.    perl -00 -pe '' : if there are multiple blank lines, delete all but one (make the file single spaced).
7.    perl -00 -pe '$_.="\n"x4' : expand single blank lines into 4 consecutive blank lines.
8.    perl -pe '$_ = "$. $_"' : number all lines in a file.
9.    perl -pe '$_ = ++$a." $_" if /./' : number only non-empty lines in a file.
10.    perl -ne 'print ++$a." $_" if /./' : number and print only non-empty lines in a file.
11.    perl -pe '$_ = ++$a." $_" if /regex/' : number only lines that match a pattern.
12.    perl -ne 'print ++$a." $_" if /regex/' : number and print only lines that match a pattern.
13.    perl -ne 'printf "%-5d %s", $., $_ if /regex/' : number lines that match a pattern, left-aligning the number in a 5-character field (perl -ne 'printf "%-5d %s", $., $_' : for all the lines).
14.    perl -le 'print scalar(grep{/./}<>)' : prints the total number of non-empty lines in a file.
15.    perl -lne '$a++ if /regex/; END {print $a+0}' : print the total number of lines that match the pattern.
16.    perl -alne 'print scalar @F' : print the total number of fields (words) in each line.
17.    perl -alne '$t += @F; END { print $t }' : find the total number of words in the file.
18.    perl -alne 'map { /regex/ && $t++ } @F; END { print $t }' : find the total number of fields that match the pattern.
19.    perl -lne '/regex/ && $t++; END { print $t }' : find the total number of lines that match a pattern.
20.    perl -le '$n = 20; $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $m' : will calculate the GCD of two numbers.
21.    perl -le '$a = $n = 20; $b = $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $a*$b/$m' : will calculate the LCM of 20 and 35.
22.    perl -le '$n=10; $min=5; $max=15; $, = " "; print map { int(rand($max-$min))+$min } 1..$n' : generates 10 random numbers between 5 and 15.
23.    perl -le 'print map { ("a".."z","0".."9")[rand 36] } 1..8' : generates an 8-character password from a-z and 0-9.
24.    perl -le 'print map { ("a","t","g","c")[rand 4] } 1..20' : generates a 20-nucleotide-long random sequence.
25.    perl -le 'print "a"x50' : generates a string of 'a' 50 characters long.
26.    perl -le 'print join ", ", map { ord } split //, "hello world"' : will print the ASCII values of the string "hello world".
27.    perl -le '@ascii = (99, 111, 100, 105, 110, 103); print pack("C*", @ascii)' : converts ASCII values into a character string.
28.    perl -le '@odd = grep {$_ % 2 == 1} 1..100; print "@odd"' : generates an array of odd numbers.
29.    perl -le '@even = grep {$_ % 2 == 0} 1..100; print "@even"' : generates an array of even numbers.
30.    perl -lpe 'y/A-Za-z/N-ZA-Mn-za-m/' file : convert the entire file into a 13-character offset (ROT13).
31.    perl -nle 'print uc' : convert all text to uppercase.
32.    perl -nle 'print lc' : convert all text to lowercase.
33.    perl -nle 'print ucfirst lc' : convert only the first letter of the first word to uppercase.
34.    perl -ple 'y/A-Za-z/a-zA-Z/' : convert upper case to lower case and vice versa.
35.    perl -ple 's/(\w+)/\u$1/g' : camel-case every word.
36.    perl -pe 's|\n|\r\n|' : convert Unix newlines into DOS newlines.
37.    perl -pe 's|\r\n|\n|' : convert DOS newlines into Unix newlines.
38.    perl -pe 's|\n|\r|' : convert Unix newlines into Mac newlines.
39.    perl -pe '/regexp/ && s/foo/bar/' : substitute 'foo' with 'bar' only on lines that match the regexp.

Some other Perl Tricks

Want to display a progress bar while Perl does your job?

For this Perl provides a nice facility called "pipe opens" ('perldoc -f open' will provide more info):

open(my $fh, '-|', 'command', 'option', 'option') or die "Could not run command - $!";
while (<$fh>) {
    print "-";
}
print "\n";

This will print a '-' on the screen for every line of output until the process is completed.

Friday, May 17, 2013

Installing CEGMA

Installing CEGMA can be a bit of a challenge for most users, including me! I had installed it earlier, but this time it really gave me a very hard time on RH 6.2. Apart from installing a set of prerequisites, the installation of GeneWise is a bit of a challenge, especially since its status is now archival. The INSTALL instructions that come with the package are not very useful.

Genewise has several pre-requisites:

glib -> Provides low level data structure support for C programming. Can be found here:

glib has several pre-requisites:
1. pkg-config: This package provides tools for passing include and library paths so that packages can be built using the configure and make utilities.
2. libffi: This library provides a high-level portable interface for programmers to call any function at run time, and can be found at:
This package also requires a patch that can be found at:
3. Python 2.7.5, if it is not already there:
4. PCRE library: This contains Perl-like regular expression libraries:

All these installations will go smoothly, unless you have some problems with your system. Once done,
download the wise2.4.1 source from

Do the following:

go to wise-2.4.1/src/HMMer2,
then replace 'getline' with 'my_getline' in sqio.c,
and replace 'isnumber' in src/models/phasemodel.c with 'isdigit'.

Then check the makefile under each directory under src and replace 'glib-config --libs' with 'pkg-config --libs glib-2.0', and also 'glib-config --cflags' with 'pkg-config --cflags glib-2.0',
using the following command:

find ./ -type f -name "makefile" -exec sed -i.old 's/glib-config --cflags/pkg-config --cflags glib-2.0/g' "{}" +;

find ./ -type f -name "makefile" -exec sed -i.old 's/glib-config --libs/pkg-config --libs glib-2.0/g' "{}" +;

Also set the wiseconfigdir path (WISECONFIGDIR) to the wisecfg directory that is somewhere inside the distribution.

Then do

make all

This works fine on RH 6.2.

Make sure the BLAST program and HMMER3 are on your path. Also install geneid on your system, as that is also a prerequisite. Once everything is installed, first set the following environment variables:


export WISECONFIGDIR=/usr/adadata/wise2.4.0/wisecfg
export CEGMA=/usr/adadata/cegma_v2.4.010312
export CEGMATMP=/usr/adadata/cegma_v2.4.010312/tmp/
export PERL5LIB=$CEGMA/lib

Now, finally, run CEGMA with the default parameters:

bin/cegma --genome sample/sample.dna --protein sample/sample.prot -o sample