Computational Genomics Lab at IICB

Tuesday, October 19, 2021

A dangerous $# in perl arrays - Not recommended for iteration

The innocent looking $# operator that we often use for determining the maximum index of an array in for loop can be sometimes dangerous.

I spent a sizable amount of time wondering why my for loop is becoming an infinite loop without realizing that writing something like this actually changes the max index value or $# value of an array

Say you have 2 dimensional array @sorted and you want to print the component. The easiest way would be:

for(my $i=0; $i <= $#sorted; $i++){

for(my $j=0; $j <= @{$sorted[$i]}; $j++){

print "$sorted[$i][$j]\t";

}

print "$#sorted\n";

}

The output will be a neat:

3349 4097 gene-PR001_g8806 9 ----> The last column indicates the max index

6662 6832 gene-PR001_g8807 9

11316 11696 gene-PR001_g8808 9

13158 13334 gene-PR001_g8809 9

18688 19095 gene-PR001_g8810 9

25175 25342 gene-PR001_g8811 9

26554 26883 gene-PR001_g8812 9

28100 29059 gene-PR001_g8813 9

29128 30235 gene-PR001_g8814 9

30266 30786 gene-PR001_g8815 9

Here one can notice the value of the last column is that of the max index of the array that remains unchanged and hence the lop terminates.

However, something as innocent as a c style this involving accessing the $i+1 element of the array actually changes the array maximum index!! This came as a surprise to me where I hit the infinite loop leading to out of memory and locked file alert.

Check this out::

for(my $i=0; $i <= $#sorted; $i++){

for(my $j=0; $j <= @{$sorted[$i]}; $j++){

print "$sorted[$i+1][$j]\t";---> Accessing the $i+1 value rather than $i value

}

print "$#sorted\n";

}

Here all the hell breaks loose where you hit a infinite loop when you suspect the least. Therefore, it

will be prudent to first pass the value of $#sorted to a variable and loop over that value instead of looping directly over $#sorted.

3349 4097 gene-PR001_g8806 9

6662 6832 gene-PR001_g8807 9

11316 11696 gene-PR001_g8808 9

13158 13334 gene-PR001_g8809 9

18688 19095 gene-PR001_g8810 9

25175 25342 gene-PR001_g8811 9

26554 26883 gene-PR001_g8812 9

28100 29059 gene-PR001_g8813 9

29128 30235 gene-PR001_g8814 9

30266 30786 gene-PR001_g8815 9

... --> Increases infinitely!

A potentially dangerous infinite loop where you are least suspicious!!!

A neat solution for this problem will be:

my $index = $#sorted;--> notice this statement

for(my $i=0; $i <= $index; $i++){

for(my $j=0; $j <= @{$sorted[$i]}; $j++){

print "$sorted[$i+1][$j]\t";

}

print "$#sorted\n";

}

6662 6832 gene-PR001_g8807 9

11316 11696 gene-PR001_g8808 9

13158 13334 gene-PR001_g8809 9

18688 19095 gene-PR001_g8810 9

25175 25342 gene-PR001_g8811 9

26554 26883 gene-PR001_g8812 9

28100 29059 gene-PR001_g8813 9

29128 30235 gene-PR001_g8814 9

30266 30786 gene-PR001_g8815 9

10 --> Notice this 10 below.

This means that the value of index is on rise but the loop terminates nevertheless!! Is this a bug in perl??

Thursday, December 10, 2020

Piping server for transferring data back and forth between any device

Today I came across a system called as piping server. The beauty of this technology is you can transfer any file between the devices using simple commands such as curl. If you are ubuntu or any other linux user, the files that you want to transfer from machine A to machine B involves the following simple commands.

Suppose in 'A' you have a file called as mandel1.jpg and you need to transfer that to 'B' then simply go to the terminal in A and type:

$ curl -T mandel1.jpg https://ppng.io/mandel1

[The following will prompt in your terminal @ A]

[INFO] Waiting for 1 receiver(s)...

Then go to terminal 'B' and type:

sutripa@amrit:~$ curl https://ppng.io/mandel1 > mandel1.jpg

You will get the following prompt @B

% Total % Received % Xferd Average Speed Time Time Time Current

Dload Upload Total Spent Left Speed

100 137 100 137 0 0 2 0 0:01:08 0:01:06 0:00:02 33

Then finally do an ls at B

you will see the file you have transferred e.g; mandel1.jpg.

Transferring directories between devices is also pretty easy using this server.

At the sending server just do a

$tar zfcp - ./QC | curl -T - https://ppng.io/such

or if you want to compress using zip do the following:

$zip -q -r - ./QC | curl -T - https://ppng.io/such

At the receiving server do a

sutripa@amrit:~$ curl https://ppng.io/such > QC

Then you will see

% Total % Received % Xferd Average Speed Time Time Time Current

Dload Upload Total Spent Left Speed

100 855k 0 855k 0 0 60467 0 --:--:-- 0:00:14 --:--:-- 216k

Directory transferred.

For more information check this link out:

https://ostechnix.com/transfer-files-between-any-devices-using-piping-server/

Tuesday, April 11, 2017

two-speed genome analysis using R and perl

I have discussed about my speed genome analysis on my previous blog, now am writing the steps how to do that

1. calculate the intergenic distance of ur organism "x" from gtf file using the perl script , I have used the augustus predicted gtf file , the file needs to be modified according to the perl script so that it looks for the particular feature and pattern such as it looks for gene/exon/mRNA feature

sample of the augustus gtf file
(make sure the last column is mentioned in the gene row)

scaffold_1904 AUGUSTUS gene 1 775 0.09 + . transcript_id "g18705"; gene_id "temp";
scaffold_1 AUGUSTUS transcript 1 4141 0.05 + . g1.t1
scaffold_1 AUGUSTUS intron 1 180 0.55 + . transcript_id "g1.t1"; gene_id "g1";
scaffold_1 AUGUSTUS intron 2022 3511 0.21 + . transcript_id "g1.t1"; gene_id "g1";
scaffold_1 AUGUSTUS CDS 181 2021 0.09 + 2 transcript_id "g1.t1"; gene_id "g1";
scaffold_1 AUGUSTUS CDS 3512 4141 0.49 + 0 transcript_id "g1.t1"; gene_id "g1";
scaffold_1 AUGUSTUS stop_codon 4139 4141 . + 0 transcript_id "g1.t1"; gene_id "g1";

use the perl script Calculate_FIR_length.pl using the gtf file to calculate the intergenic distance between the features, already complimentbed is there but that is not the one which we need, that is for difference purpose when I posted in a github forum to author I came to know the difference, this is the link https://github.com/Adamtaranto/density-Mapr/issues/1#issuecomment-291475238

the intergenic distance between the gene file looks like this

"geneid","strand","fiveprime","threeprime"

"g10","+",5020,927

"g84","+",3316,8625

"g42","+",1558,1773

"g156","+",4460,13837

"g93","-",2035,553

"g30","+",117,361

"g106","+",1656,874

"g39","+",1380,720

"g1","+",NA,1222

"g70","+",2614,419

2. make a gtf file of Avh's with the location and calculate the intergenic distance for Avh's , intergenic 5 prime and 3 prime distance needs to be calculated the intergenic distance of effector file should look like this

"geneid","strand","fiveprime","threeprime"

"Avh92","-",4946,80942

"Avh48","+",137224,80942

"Avh102","+",38474,24067

"Avh127","-",137224,6955

"Avh304","-",7882,12043

"Avh313","+",23825,26166

"Avh91","-",24826,4946

"Avh303","+",16698,12043

"Avh34","-",41377,26166

"Avh61","+",4031,5317

"Avh311","-",14467,1021

"Avh310","+",4389,1021

"Avh93","+",41377,38474

3. then just Run the R script which is pasted below just by changing the names of the file, if the points or plots are going out the cut off please change the bin size and num of bins before its creating the heatmap don't change after the heatmap is made

whole_intergene=read.csv(file="intergene_whole.csv",sep=",")

NumBins=50

if ((max(whole_intergene$fiveprime, na.rm=TRUE)>max(whole_intergene$threeprime, na.rm=TRUE)) == TRUE) { whole_intergene2Bin=whole_intergene$fiveprime} else { whole_intergene2Bin=whole_intergene$threeprime}

whole_intergene2Bin=whole_intergene2Bin[which(whole_intergene2Bin!=0)]

whole_intergene2Bin=na.omit(whole_intergene2Bin)

BinSteps=round(length(whole_intergene2Bin)/ (NumBins-20) , digits=10)

whole_intergene2BinOrd=sort(whole_intergene2Bin)

#### The [2*BinSteps] has been changed to [1*Binsteps it was producing an error after googling the error has been fixed]

TempBinLimits=whole_intergene2BinOrd[seq(whole_intergene2BinOrd[2*BinSteps],length(whole_intergene2BinOrd),BinSteps)]

TempBinLimits[length(TempBinLimits)+1]=max(whole_intergene2Bin, na.rm=TRUE)

x<-seq(length(TempBinLimits))

fit<-nls(log(TempBinLimits) ~ a*x + b, start= c(a=0, b=0),algorithm='port',weights=((x-1.0* NumBins)^2))

pred=predict(fit, x)

BinLimits=c(1, round(exp(pred),0), max(whole_intergene2Bin))

xbin=cut(whole_intergene$fiveprime, breaks=c(BinLimits))

ybin=cut(whole_intergene$threeprime, breaks=c(BinLimits))

whole_intergene=cbind(whole_intergene, xbin, ybin, genevalue=rep(1, length (whole_intergene$fiveprime)))

GenValMatrix<-with(whole_intergene, tapply (genevalue, list(xbin, ybin), sum))

x<-1:ncol(GenValMatrix)

y<-1:nrow(GenValMatrix)

zlim = range(as.numeric (unlist(GenValMatrix)) , finite=TRUE)

mypalette<-colorRampPalette(c( "white","darkblue", "forestgreen", "goldenrod1","orangered", "red3", "darkred"), space="rgb")

mycol=mypalette(2*max(GenValMatrix, na.rm=TRUE))

mylabels<-paste(BinLimits[1:length(BinLimits)-1], BinLimits[2:length(BinLimits)], sep="- ", collapse=NULL)

filled.contour(x, y, z=GenValMatrix,plot.title = title(main ="Phytophthora ramorum Pr102 genome",xlab = "five prime intergenic regions", ylab= "three prime intergenic regions", cex.main=0.8, cex.lab=0.8),key.title = title(main ="Number ofgenes", cex.main=0.5,line=1),col=mycol,levels = pretty(zlim, 1*max(GenValMatrix,na.rm=TRUE)),plot.axes={axis(1,at=x, labels=mylabels, las=2,cex.axis=0.5);axis(2,at=y, labels=mylabels,cex.axis=0.5)})

#wget http://wiki.cbr.washington.edu/qerm/sites/qerm/images/1/16/Filled.contour3.R

source('Filled.contour3.R')

library(png)

library(gridExtra)

library(ggplot2)

image_name<-paste(as.character(format(Sys.time(),"%Y%m%d%H%M%S")), "_graph", sep="")

png(filename = paste(image_name, ".png", sep=""))

par(mar=c(0,0,0,0))

filled.contour(x, y, z=GenValMatrix,col=mycol,levels = pretty(zlim, 2*max(GenValMatrix,na.rm=TRUE)),frame.plot = FALSE,axes = FALSE)

dev.off()

img <- readPNG(paste(image_name, ".png", sep=""))

library(gridExtra)

library(grid)

library(ggplot2)

library(lattice)

g <- rasterGrob(img, interpolate=TRUE)

rxlrData=as.data.frame(read.csv('rxrl_whole_intergenic.csv',header=TRUE))

ggplot(data=rxlrData,aes(x=rxlrData$fiveprime,y=rxlrData$threeprime,geom="blank"))+annotation_custom(g,xmin=-Inf,xmax=Inf,ymin=-Inf,ymax=Inf)+coord_fixed(ratio=1)+geom_point(shape=21,fill="red",colour="black",size=2,alpha=0.7,na.rm=FALSE)+scale_y_log10(breaks=BinLimits[2:length(BinLimits)],limits=c(BinLimits[2],BinLimits[NumBins +1]))+scale_x_log10(breaks= BinLimits[2:length(BinLimits)],limits= c(BinLimits[2] ,BinLimits[NumBins +1]))+theme(axis.text.y=element_text(size = 10,vjust=0.5))+theme(axis.text.x=element_text(size=10,vjust=0.5,angle=90))+theme(axis.title.x = element_text(face="bold",size=12))+xlab("five prime intergenic region")+theme(axis.title.y = element_text(face="bold",size=12)) + ylab("threeprime intergenic region")

Below I have put the screen shot for the Phytophthora sojae genome from a sample dataset

minor edits can be done to make it good !!

This code and concept has been acquired from http://biorxiv.org/content/early/2015/07/01/021774

Sunday, March 26, 2017

analysing the alleles from haplotypes on pacbio data

I had been working in pacbio data and when am trying to identify the alleles from haplotypes from diploid assembly, in the very early step itself i got many errors, because i had been following the illumina dataset method like for pacbio data, but the developed tools behaves strange with the data and I got stuck for 3-4 days i googled the maximum and tried various approaches, Finally i posted in the forums and interacted with the GATK developers, they suggested me a simple solution for solving my errors, so those who are working in long reads and want to identify the haplotypes here is my commandline and verified one
[ Any aligners can be used even BLASR initially i was thinking there was a problem with my aligner, but really not] and no need to mark duplicates in case of long reads only for illumina reads its been recommended by the developer, i had reached till the step of HaplotypeCaller so far no error its running smooth, If i change commands or face any problems, will be updated, once the output is ready maybe i can paste some of my output

bwa index 2017_V6_Pr102_assembly.fa
bwa mem -x pacbio 2017_V6_Pr102_assembly.fa /data/results1/STLab/Takao_data/Raw_data/ND886/all_ND886.fastq > aln.sam
samtools view -b -S aln.sam -o aln.bam
samtools sort aln.bam > aln_sorted.bam
samtools index aln_sorted.bam
samtools mpileup -uf 2017_V2_ND886_assembly.fa aln_sorted.bam | /share/apps/bcftools-1.2/bcftools call -cv - > out.vcf [use bcftools1.2 otherwise its not producing the genotype information]

Use any of your favourite haplotype phaser (whatshap/ hapcut) along with the above produced bam and vcf file

Now u get the phased alleles from haplotypes u can compare them and these can be used to downstream analysis

Friday, March 24, 2017

Fancy genomics “Iam taking you all to the world of two-Speed genomes concept"

My Phd problem includes the various approaches for solving genome assembly problems. When I was working on oomycetes project, I was attracted by the effector proteins, Evolution, pathogenicity, synteny, transposon, Repeat regions, suddenly the fancy thing which came in the mind after reading an interesting paper from biorxiv that is verticullum genome, a group from Netherlands have sequenced and studied the 2-speed genome concepts among the strains. http://genome.cshlp.org/content/early/2016/07/12/gr.204974.116.full.pdf+html I was impressed by the work, then I showed the work to my PI even she was impressed by the speed genomes. I work in a collaborative program where exactly my collaborator also was fascinated by the speed genome work.

Let me explain what is 2 speed genomes?

It was already known that fungi and the plant pathogen genomes comprises of Effector proteins. Which plays an important role in causing pathogenicity to the host. These Effector genes are not randomly distributed across the genomes, tend to be associated with the compartments enriched with repeat sequences and transposons. This led to the ‘two-speed genome’ model in which filamentous pathogen genomes have a bipartite architecture with gene sparse, repeat rich compartments for adaptive evolution. The unusual genome architecture and occurrence of effector genes in specific genome compartments is a feature that has evolved repeatedly in independent phylogenetic lineages of filamentous pathogens. Genome analyses of P. infestans and three of its sister species revealed uneven evolutionary rates across genomes with genes in repeat-rich regions showing higher rates of structural polymorphisms and positive selection. Two-speed genome architecture with the effector genes populating the more rapidly evolving sections of the genomes. Lineages that acquired two-speed genomes have increased survivability — they are less probabe to go extinct compared to lineages with less adaptable genomes, which are more probabe to be purged out of the biota as their hosts develop full resistance or become extinct. In this ‘jump or die’ model, pathogen lineages that have an increased likelihood to produce virulent genotypes on resistant hosts and non-hosts benefit from a macroevolutionary advantage and end up dominating the biota. Several filamentous plant pathogens have evolved by shifting or jumping from one host plant to another.

The information has been shared from this paper a great detailed review by Sophien and Raffaele et al its available here http://www.sciencedirect.com/science/article/pii/S0959437X15000945 .

For who don’t have access to science direct the same paper is available at biorxiv repository please find the link http://biorxiv.org/content/early/2015/07/01/021774

Wednesday, October 26, 2016

Structural variation in the genomes

Structural variation:Structural variation is a change or variation which leads to change in the structure of organisms's chromosome. structural variants can be of Insertions, duplication, Inversion and translocation. According to the human genome or people work in genome say that if there is a variant more than of 50 base pairs changes in the human genome of 1%. Its believed that some of the genetic diseases are caused due to the structural variations.whats the difference between the SNP's and structural variation?SNP's are single nucleotide base mutations which have been validated to be present in more than 1% of the population when a single base differes between the 2 genomes. These are any mutations which cause a change in the organism's chromosome structure, such as Insertions, deletions, copy number variations, duplications, inversions and translocation. SNPs and INDELs are about low-level genomic variation. The structural variants which affect the genome at larger scales. Events like gene duplications, tandem repeats, transposon insertions, inversions, and other chromosomal rearrangements. The long read sequencing technology paves the way to understand the structural variants using the split read alignment.[Information from literature Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly Yingrui Li, et al] structural variations from short sequencing reads are hampered by one or more of the following limitations: (i) the methods may favor a particular length range of structural variations; (ii) they may favor discovery of particular types of structural variations; (iii) they may be unable to resolve the exact structural variation genotypes and/or breakpoints at single nucleotide resolution; and (iv) because of difficulties mapping reads to the genome, they may not be able to accurately identify complex rearrangements. Paired-end mapping, for example, can only predict insertion breakpoints within a few base pairs of the exact breakpoint position, and it can only detect insertions when the entire sequence is contained within the DNA fragment whose ends are being sequenced; thus, the maximum size of an insertion that can be detected by paired-end mapping is limited by the largest insert size present in a library. Split-read methods, on the other hand, can precisely define a breakpoint and genotype of an insertion, but only when it is shorter than the read length. Thus, studies carried out so far have been of limited completeness, accuracy and/or resolution.

BWA-MEM or BLASR

http://lh3.github.io/2014/12/10/bwa-mem-for-long-error-prone-reads/ this is a very nice blog discusses about the alignment methods useful of the pacbio long reads.

https://www.biostars.org/p/63306/ forum discusses about the split read alignments.

Tips for structural variant analysis:

1. The maximum number of Reads should be mapped in the breakpoints of the chromosome and the coverage should be high.

2. How many Individual reads are supporting the translocation versus supporting assembly for identifying the translocations.

[ I spoke with some of the developers asking about the structural variants of draft pacbio assembly plant pathogen human said completely I can use the tools for predicting , am trying to do for one of the plant pathogen genome]

one of the paper in 2014 talks about all approaches

https://bib.oxfordjournals.org/content/early/2014/12/12/bib.bbu047.full#sec-9

Tuesday, September 27, 2016

Posters from ECCB2016

I found some interesting poster and thought it will help my friends who are working on the same area , and same type of work going in my lab those are here