Computational Genomics Lab at IICB: 2016

Wednesday, October 26, 2016

Structural variation in the genomes

Structural variation: Structural variation is a change that alters the structure of an organism's chromosome. Structural variants can be insertions, duplications, inversions and translocations. People working in human genomics generally call a variant structural when it changes more than about 50 base pairs. It is believed that some genetic diseases are caused by structural variations.

What is the difference between SNPs and structural variation? SNPs are single-nucleotide differences between two genomes that have been validated as present in more than 1% of the population. Structural variations are any mutations that change an organism's chromosome structure, such as insertions, deletions, copy number variations, duplications, inversions and translocations. SNPs and INDELs describe low-level genomic variation, while structural variants affect the genome at larger scales: events like gene duplications, tandem repeats, transposon insertions, inversions and other chromosomal rearrangements. Long-read sequencing technology paves the way to understanding structural variants using split-read alignment.

[Information from literature: Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly, Yingrui Li, et al.] Methods that detect structural variations from short sequencing reads are hampered by one or more of the following limitations: (i) the methods may favor a particular length range of structural variations; (ii) they may favor discovery of particular types of structural variations; (iii) they may be unable to resolve the exact structural variation genotypes and/or breakpoints at single nucleotide resolution; and (iv) because of difficulties mapping reads to the genome, they may not be able to accurately identify complex rearrangements.
Paired-end mapping, for example, can only predict insertion breakpoints within a few base pairs of the exact breakpoint position, and it can only detect insertions when the entire sequence is contained within the DNA fragment whose ends are being sequenced; thus, the maximum size of an insertion that can be detected by paired-end mapping is limited by the largest insert size present in a library. Split-read methods, on the other hand, can precisely define a breakpoint and genotype of an insertion, but only when it is shorter than the read length. Thus, studies carried out so far have been of limited completeness, accuracy and/or resolution.
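
The size limits described above can be written down as a small rule of thumb. The sketch below is only an illustration in POSIX shell: the function name insertion_detectable_by and its three arguments are hypothetical, and real variant callers depend on much more than these two thresholds.

```shell
# Hypothetical helper (not part of any SV caller): report which short-read
# approach could, in principle, resolve an insertion of a given length.
#   $1 = insertion length (bp)
#   $2 = read length (bp)
#   $3 = largest insert size present in the library (bp)
insertion_detectable_by() {
  ins_len=$1
  read_len=$2
  max_insert=$3
  methods=""
  # Split reads define the breakpoint only if the insertion is shorter than a read.
  if [ "$ins_len" -lt "$read_len" ]; then
    methods="split-read"
  fi
  # Paired-end mapping needs the whole insertion inside the sequenced fragment.
  if [ "$ins_len" -lt "$max_insert" ]; then
    methods="${methods:+$methods,}paired-end"
  fi
  echo "${methods:-neither}"
}
```

For example, insertion_detectable_by 300 100 500 reports only paired-end, because a 300 bp insertion does not fit inside a 100 bp read but does fit inside a 500 bp fragment.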

BWA-MEM or BLASR

http://lh3.github.io/2014/12/10/bwa-mem-for-long-error-prone-reads/ This is a very nice blog post that discusses alignment methods useful for PacBio long reads.

https://www.biostars.org/p/63306/ This forum thread discusses split-read alignments.

Tips for structural variant analysis:

1. As many reads as possible should map across the breakpoints on the chromosome, and the coverage there should be high.

2. To identify translocations, check how many individual reads support the translocation versus how many support the assembly.
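
One rough way to get a first count of reads supporting a translocation is to look for split alignments whose two parts land on different chromosomes. The sketch below assumes a SAM stream in which the aligner records supplementary alignments in the SA:Z: tag (as defined in the SAM specification and written by BWA-MEM); the function name count_interchrom_split_reads is hypothetical, and it applies no mapping-quality or breakpoint-position filter, so treat the number as an upper bound.

```shell
# Hypothetical helper: count primary alignment records that carry an SA tag
# pointing to a different reference sequence (candidate translocation support).
# Reads SAM text on stdin and prints a single number.
count_interchrom_split_reads() {
  awk -F'\t' '
    /^@/ { next }                      # skip header lines
    {
      flag = $2
      # Skip secondary (0x100) and supplementary (0x800) records so that
      # each alignment is counted once, via its primary record.
      if (int(flag / 256) % 2 == 1 || int(flag / 2048) % 2 == 1) next
      for (i = 12; i <= NF; i++) {
        if (substr($i, 1, 5) == "SA:Z:") {
          # SA entries look like rname,pos,strand,CIGAR,mapQ,NM; repeated
          n = split(substr($i, 6), parts, ";")
          for (j = 1; j <= n; j++) {
            if (parts[j] == "") continue
            split(parts[j], f, ",")
            if (f[1] != $3) { count++; break }
          }
          break
        }
      }
    }
    END { print count + 0 }
  '
}
```

With samtools installed, a typical call would be samtools view -h aln.bam | count_interchrom_split_reads, where aln.bam is a placeholder file name.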

[I spoke with some of the developers and asked whether tools built for predicting structural variants in human can be used on a draft PacBio assembly of a plant pathogen. They said I can use them, and I am trying this for one plant pathogen genome.]

A 2014 paper discusses all of these approaches:

https://bib.oxfordjournals.org/content/early/2014/12/12/bib.bbu047.full#sec-9

Tuesday, September 27, 2016

Posters from ECCB2016

I found some interesting posters and thought they would help my friends who work in the same area, and the same type of work is going on in my lab. Here they are.

Sunday, September 25, 2016

ECCB 2016, The Hague, Netherlands: computational biologists and bioinformaticians gather in a sweet Dutch country!

I have been to several conferences within India, but ECCB 2016, held in The Hague from September 3 to September 7, 2016, was the first time I travelled outside India. I had butterflies in my stomach the day before I travelled, but the trip went really well. It was a gathering of computational biologists and bioinformaticians from all over the world. I should thank the Department of Science and Technology, Government of India, for providing me the travel award.

The meeting started with a workshop on PacBio and Nanopore data. Experts from both fields discussed the problems with long reads. People were complaining about the "error rates" of these reads and the difficulties of assembling genomes from them. I had a great opportunity to talk with the experts. The Nanopore experts suggested that the Canu assembler does better when handling the problematic regions of a genome, and that Miniasm and Racon can also be tried. There were sessions about using Irys to create a genome map and aligning that map back to the genome assembly to get a better assembly. Structural variants and comparative genomics are also studied from the graph. The next topic was using Iso-Seq from Pacific Biosciences to produce full-length transcripts without assembly, followed by the PromethION and squiggle sequencing systems from Nanopore and the "Read Until" approach, which enables selection of individual DNA molecules for sequencing from a pool of DNA molecules. Then there was a session on minoTour, where base calling of the nanopore reads was done without cloud base calling, since that depends on high-speed internet. The developers of the tools and technologies were very friendly and gave suggestions on working with long reads.

After the workshop the scientific sessions of the conference started, and there were many interesting talks. I was most interested in error correction algorithm development, genome assembly tools and new ortholog prediction tools. Most of the sessions and posters were about cancer (a devil) and ENCODE. I would say that about 60% of people presented on cancer transcriptomics and genomics and about 30% on ENCODE; the rest were on plants, bacteria, database development, docking and simulations. The talks and discussions can be retrieved from Twitter via #ECCB2016. I liked the theme of the conference: not only PhD students and scientists presented their work, but people working in companies also showcased their ongoing work. Some people were very happy with the poster of my PhD work and showed interest in it, since it is on a plant pathogen. I am more interested in studying environmental organisms and human pathogens, and in discovering new species from the environment.

The food was good, with many varieties of cheese. I had time to visit Amsterdam, a very nice place with polite people. I visited churches and a museum and had a good canal boat ride. I joined a few friends from the conference, and we rented a boat and rode around Amsterdam. The next conference, ECCB2017, will be held in Prague.

Tuesday, September 20, 2016

Analyzing differential expression data using the Tuxedo suite (cummeRbund)

The Tuxedo suite comprises Bowtie, TopHat, Cufflinks, cummeRbund and many more accessory tools.

First get your genome fasta file (final genome assembly file).
1. Map your RNA-seq fastq files using TopHat (if all is well your run will be seamless).
2. Run Cufflinks on your TopHat output file (cufflinks accepted_hits.bam). This run will take a while, since Cufflinks assembles the mapped reads into transcripts, isoforms, genes and so on. If your files are large, expect it to run for 8-12 hours even on a good server.
3. Run cuffmerge: cuffmerge list.txt, where list.txt lists the paths of the *_transcripts.gtf files, one per line. This runs very fast and merges the assemblies so that the gene_ids are the same across all your samples. The output of this step is a merged.gtf file.
4. For differential expression analysis, run the following:
cuffdiff merged.gtf tophat_HTI1-vs-HTI4/accepted_hits.bam tophat_HTI2-vs-HTI4/accepted_hits.bam tophat_HTI3-vs-HTI4/accepted_hits.bam
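
The list.txt manifest for step 3 can be generated instead of typed by hand. This is a minimal sketch, assuming the per-sample Cufflinks outputs sit somewhere under one parent directory; the function name make_cuffmerge_list and the directory layout are hypothetical.

```shell
# Hypothetical helper: collect Cufflinks assemblies into a cuffmerge manifest.
#   $1 = directory that holds the per-sample Cufflinks output folders
#   $2 = manifest file to write (for example list.txt)
make_cuffmerge_list() {
  # Matches both transcripts.gtf and sample-prefixed *_transcripts.gtf files.
  find "$1" -type f -name '*transcripts.gtf' | sort > "$2"
  # An empty manifest usually means the wrong directory was given.
  if [ ! -s "$2" ]; then
    echo "no transcripts.gtf files found under $1" >&2
    return 1
  fi
}
```

A typical use would be make_cuffmerge_list cufflinks_out list.txt followed by cuffmerge list.txt, where cufflinks_out is a placeholder directory name.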

This will create a plethora of files, but the following files are the ones you will be proceeding with for cummeRbund for result visualization and generating publication quality images.

For running cummeRbund, get all these files to your working directory

isoforms.fpkm_tracking

isoform_exp.diff

genes.fpkm_tracking

gene_exp.diff

tss_groups.fpkm_tracking

tss_group_exp.diff

cds.fpkm_tracking

cds_exp.diff

cds.diff

promoters.diff

splicing.diff

The best option is to put all 11 of these files into a separate directory inside your working directory, say 'diff_exp'.
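
Before moving on to R, it can save time to confirm that nothing was left behind when copying. The sketch below checks a directory for the 11 files listed above; the function name check_cuffdiff_outputs is hypothetical.

```shell
# Hypothetical helper: verify that a directory holds the 11 cuffdiff output
# files listed above. Prints one line per missing file and returns
# non-zero if any file is absent.
#   $1 = directory to check (for example diff_exp)
check_cuffdiff_outputs() {
  dir=$1
  missing=0
  for f in isoforms.fpkm_tracking isoform_exp.diff \
           genes.fpkm_tracking gene_exp.diff \
           tss_groups.fpkm_tracking tss_group_exp.diff \
           cds.fpkm_tracking cds_exp.diff \
           cds.diff promoters.diff splicing.diff; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Running check_cuffdiff_outputs diff_exp prints nothing and succeeds when the directory is complete.
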
You can run RStudio on your Windows machine if you like, or run R on your server. For running cummeRbund you will need the following packages, which you can go ahead and download upfront:

RSQLite
ggplot2 v0.9.2
reshape2
plyr
fastcluster
rtracklayer
Gviz
BiocGenerics (>=0.3.2)
Hmisc

In case you have forgotten how to install R packages, go this way:
source('http://www.bioconductor.org/biocLite.R')
biocLite('cummeRbund')
Follow the same protocol for installing the other R packages. (On current Bioconductor releases biocLite has been replaced by BiocManager::install('cummeRbund').) Once done, you can start by setting your working directory using the setwd() command.
For example: setwd("C:/Users/Sucheta/Documents/MyLabIICB/AllCollaborations/NahidAliCollaboration/companion")
Then load the library:

library(cummeRbund)

Now read your 11 files using this command:
data <- readCufflinks("diff_exp")

This will take a while to read, but it will create a database file (cuffData.db by default) inside the diff_exp directory. This is your database file.

Now you can plot gene density using the following command:

csDensity(genes(data))

Or you can draw a volcano plot of differentially expressed genes using:

v<-csVolcanoMatrix(genes(data))
v

As you can see from this plot, the different conditions show very little difference among themselves.

This will continue in the next blog post...
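
To put a number on how little the conditions differ, the significant calls in gene_exp.diff can be tallied per comparison. This is a minimal sketch, assuming the file is tab-separated with a header row containing the columns sample_1, sample_2 and significant (with values yes or no); the function name count_significant is hypothetical.

```shell
# Hypothetical helper: count genes flagged significant for each sample pair
# in a cuffdiff gene_exp.diff file.
#   $1 = path to gene_exp.diff
count_significant() {
  awk -F'\t' '
    NR == 1 {
      # Locate columns by header name instead of hard-coding positions.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("significant" in col) || !("sample_1" in col) || !("sample_2" in col)) {
        print "unexpected header in input" > "/dev/stderr"
        bad = 1
        exit 1
      }
      next
    }
    $col["significant"] == "yes" {
      pair = $col["sample_1"] " vs " $col["sample_2"]
      n[pair]++
    }
    END { if (!bad) for (p in n) print p "\t" n[p] }
  ' "$1" | sort
}
```

Calling count_significant diff_exp/gene_exp.diff prints one line per sample pair that has at least one significant gene; no output means no gene passed the test in any comparison.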

Friday, June 17, 2016

#OMGN2016 Malmo, Sweden - Between Then and Now...

Many things have changed in front of my eyes in the years since I started attending OMGN meetings. My first meeting was in 2005, when the first oomycete genomes were being sequenced and analyzed at their own pace (read: a very slow pace). We used to get excited even when we got SSRs or repeats predicted. I distinctly remember the 2004 Joint Genome Institute sequence jamboree, when we used to gather in the evenings to discuss what had been done during the day. On the second or third day of the jamboree, Brett came up with a multiple sequence alignment that presumably indicated an RXLR motif in the effector proteins. It was a huge deal then. Subsequently, everybody started discussing these proteins in all the meetings. Initially this small 4-letter motif appeared too good to be true, but a lot of work was done, especially in Brett's lab, to prove that it was indeed a significant motif. The effector prediction algorithms got published in high-flying journals and everybody was excited. Slowly, until 2010, many more papers came out on RXLRs, their prediction methods and their characterization.

Now, in 2016, I see that the level of science has gone up much higher. Genome sequencing using PacBio or Illumina is no big deal, and neither is analyzing the genomes. Effector prediction has become just a day's job (thanks to all the hard work done by the pioneers). Genome analyses are now carried out by single individuals in a few months' time. This meeting was skewed towards miRNAs, CRISPR technology for genome editing, PacBio sequencing and RNA silencing. Effector biology has moved many, many steps up: many more things are now known and many more proteins have been characterized. It is an exciting time in the history of oomycete biology, with many things happening right in front of our eyes. For those who could not attend, please check #OMGN16 for more details. For me now, bye bye lovely Malmo!!
