Computational Genomics Lab at IICB: 2015

Monday, December 14, 2015

Variant Calling - The bowtie - picard - samtools - gatk pipeline....

Nextgen sequencing has caused a sudden surge in data deluge, but the informatics pipelines and algorithms are unable to keep up with the pace. While most of the exome sequencing data finally focuses on SNP calling and there are various ways of doing this, I decided to discuss one pipeline that has been accepted all over as one of the most sophisticated methods. It is the bowtie - picard - gatk pipeline.

When you are dealing with colorspace data the choice of mappers get limited. Howver, my favorite mapper is still bowtie for several reasons. Lifescope has its own inhouse mapper; which claims to have a all round better approach in mapping colorspace data, but the lack of transparency on what happens within puts me off using this tool. Once bowtie maps the reads by default parameter, the next thing to do is to convert the sam file into bam file, sort it and index it. All these can be done using samtools. However, if the file size is large, you could do the sorting job using sort operations from unix commandline.

1. sort sam file
export TMPDIR=DIR_WITH_LOTS_OF_SPACE
LC_ALL="C" sort -k 3,3 -k 4,4n input_sam > output_sam # This step will take a long time

sort options for samtools works but only on bam files and on many instances downstream analysis softwares complain about co-ordinates not being sorted...

or Use picard:

java -jar /share/apps/picard-tools-1.56/SortSam.jar I=bowtie.sam O=bowtie.bam SO=coordinate # This took one hour in a HPC with 48 GB RAM on each node for a file size of 30 GB

2. Make an index file of bam file
samtools index bowtie.bam bowtie.bai

3. MarkDuplicates using picard
java -jar /share/apps/picard-tools-1.56/MarkDuplicates.jar I=bowtie.bam M=metrics.bam O=duplicateMarked.bam

4. sort this bam file and make index using samtools
samtools sort duplicateMarked.bam duplicateMarked.sorted
samtools index duplicateMarked.sorted.bam duplicateMarked.sorted.bai

5. Then run IndelRealignerTargetCreator using GATK
java -jar /share/apps/GenomeAnalysisTK-2.4-9-g532efad/resources/GenomeAnalysisTK.jar -T RealignerTargetCreator -I duplicateMarked.sorted.bam -R /share/reference/human/samtools/hg19.fa -o dM.bam.list
# The output file dM.bam.list returns 0 output. Check it later.

6. Now run this picard tool to get RG updated since indelRealigner complains.

java -jar /share/apps/picard-tools-1.56/AddOrReplaceReadGroups.jar I=duplicateMarked.bam O=readGroupReplaced.bam RGLB="LINK_TO_FASTA" RGPL=SOLID RGPU=run barcode RGSM=9111 SORT_ORDER=coordinate CREATE_INDEX=TRUE VALIDATION_STRINGENCY=LENIENT'

Posters I could take pictures of Beyond Genome 2014

Here are few of the posters in Beyond Genome meeting that I could take pictures of. There were many more, but access to take their pictures was less...

Our Poster

Monday, December 7, 2015

SNP calling using GATK for de novo genome

I have a got a chance to work in leishmania genome, where i have a genome assembly and i dont have any deposited dbSNP or any other reference file to do variant calling, i have been working and stuck in many steps and posted in GATK forums they replied to some of my queries at one point stopped to reply since People were having a fat and busy holiday on thanks giving , and figured out how to do the variant calling, i think this blog will be much useful for the naive person like me, lets see the workflow and please refer GATK documentation for the explanation.
#first build the index for the reference genome
/share/apps/bowtie2-2.1.0/bowtie2-build after_removing_2k.fasta leishmania.index.bt2
#after index map the reference to the reads
/share/apps/bowtie2-2.1.0/bowtie2 -x leishmania.index.bt2 -1 /data/results/STLab/NahidAli/141218_SND393_A_L005_HTI-5_trim_R1_filtered.fastq -2 /data/results/STLab/NahidAli/141218_SND393_A_L005_HTI-5_trim_R2_filtered.fastq -S bowtie_aligned.sam
#convert the sam file to bam
samtools view -S bowtie_aligned.sam -b -o bowtie_aligned.bam
#sort the bam file
samtools sort bowtie_aligned.bam bowtie_aligned_sorted
#create a pileup file
samtools mpileup -uf after_removing_2k.fasta bowtie_aligned_sorted.bam|/share/apps/samtools-0.1.18/bcftools/bcftools view -bvcg - > leishmania.raw.bcf
#convert bcf to vcf
/share/apps/samtools-0.1.18/bcftools/bcftools view leishmania.raw.bcf > leishmania.raw.vcf
*********************************************************************************

The bove commands are just initial way of mapping the reads to the reference and the real GATK pipeline starts below since i don't have the any known sites i have done without base cailbration

#create the dictionary
#java -jar /share/apps/picard-tools-1.56/CreateSequenceDictionary.jar R=after_removing_2k.fasta O=after_removing_2k.dict
#add or mark group ids
#java -jar /share/apps/picard-tools-1.56/AddOrReplaceReadGroups.jar I=bowtie_aligned.sam O=group_added_read.bam SO=coordinate RGID=1 RGLB=library1 RGPL=illumina RGPU=1 RGSM=leishmania VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE
#mark duplicates
#java -jar /share/apps/picard-tools-1.56/MarkDuplicates.jar I=group_added_read.bam O=mapped_reads_dup.bam METRICS_FILE=metricsFile CREATE_INDEX=true
#sort bam file
#java -jar /share/apps/picard-tools-1.56/BuildBamIndex.jar INPUT=mapped_reads_dup.bam
#create realign target creator
#share/apps/GenomeAnalysisTK.jar -T RealignerTargetCreator -R after_removing_2k.fasta -o target_interval.intervals -I mapped_reads_dup.bam
#indel realigner
#/share/apps/GenomeAnalysisTK.jar -T IndelRealigner -R after_removing_2k.fasta -I mapped_reads_dup.bam -targetIntervals target_interval.intervals -o Indel_realigned.bam
#haplotype caller
#java -jar /share/apps/GenomeAnalysisTK.jar -T HaplotypeCaller -R after_removing_2k.fasta -I Indel_realigned.bam -stand_call_conf 30 -stand_emit_conf 10 -o raw_variants.vcf
#choose the variants from the raw vcf file
#java -jar /share/apps/GenomeAnalysisTK.jar -T SelectVariants -R after_removing_2k.fasta -V raw_variants.vcf -selectType SNP -o raw_snps.vcf
#do the filtration
#java -jar /share/apps/GenomeAnalysisTK.jar -T VariantFiltration -R after_removing_2k.fasta -V raw_variants.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "my_snp_filter" -o filtered_snps.vcf
#Extract the indels
#java -jar /share/apps/GenomeAnalysisTK.jar -T SelectVariants -R after_removing_2k.fasta -V raw_variants.vcf -selectType INDEL -o raw_indels.vcf
#do the filteration
#java -jar /share/apps/GenomeAnalysisTK.jar -T VariantFiltration -R after_removing_2k.fasta -V raw_variants.vcf --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" --filterName "my_indel_filter" -o filtered_indels.vcf

from the above predicted snps and indels extract the regions and further annotate and work on it happy variant calling !!!!!!!!!!

Monday, November 30, 2015

5C bed file data format

5C and 3C are the newer technologies in sequencing where the chromatin inetraction data can be obtained. If you looking for such data and happen to download from UCSC genome browser, it may be hard to look around for format describing the fields. We asked the authors and here is the explanation:

The site from which you may download data may be this: https://www.encodeproject.org/experiments/ENCSR000CYD/

BED file format descrition can be found from : https://genome.ucsc.edu/FAQ/FAQformat.html#format1

Here is a sample data for GM12878 cell line:

chr22 31998728 33247041 5C_301_ENm004_FOR_292.5C_301_ENm004_REV_

32 1000 . 31998728 33247041 0 2 12744,40

98, 0,1244215,

chr5 131346229 132145236 5C_299_ENm002_FOR_241.5C_299_ENm002_REV_

33 1000 . 131346229 132145236 0 2 2609,210

5, 0,796902,

col1: Chromosome name

col2: Chromosome start

col3: chromosome end

col4: Name of the interacting sites (primer names)

col5:

col7: chromosome start

col8: chromosome end

col11: block sizes in comma separated list

col12: block offset in comma separated list

Now I will explain what col11 and col12 means...

the beginning of interacting site is the cromosome start and the beginning of offset is 0.

So, the interacting site begins at 31998728 + 0 and the interacting block length is 12744.

The beginning position of interacting site 2 is: 31998728 + 1244215 = 33242943

The size of interacting block 2 is 4098. so, end of interacting site is 33242943 + 4098 = 33247041.

Here is a diagrammatic representation:

Monday, November 23, 2015

Algal Biotechnology Workshop at IIT Mumbai on 21st Nov 2015

It was an insightful workshop on algal biotechnology at VMCC hall, IIT Mumbai during 21st November 2015. The organizers managed to have the world leaders as speakers in this area. The workshop started with handing over the materials to the participants followed by welcome address by Dr. Wangikar from IIT Mumbai followed by an insightful talk by Dr. Santanu Dasgupta from Reliance Industries.

Summaries of some of the interesting talks are discussed here:

De. Duu-long Jee from Department of Chemical Engineering, National Taiwan University:
Lutein, one of the 600 naturally occurring carotenoids is abundantly found in marigold flowers as well as in Micro-algae. Dr. jee presented an overview of cost-effectiveness of Lutein production with microalgae vs marigold flowers. Although microalgae produces about 3-4 times more lutein compared to Marigold flowers, but the energy required to extract those from micro-alga makes it an expansive option. Marigold on the other hand needs less nutrient, less power but more space... So, there is a need for engineering micro-algae that can have enhanced Lutein production with lesser energy dependence for extraction.

John Beardall, Monash University:

Extremophiles will play a major role in algal biotechnology, since they have altered metabolism. It is a well known fact that CO2 is sequestered in algae to enhance growth. But growth and lipid accumulation are oxymoron. Don't happen at the same time. They have explored media as a way for determining what favors the optimized fatty acid production. Their observation indicates that some micro-algae grow really well in media with altered source of carbon (glycerol and xylose) and also produce optimal fatty acid. Myxotrophic growth is favored for higher fatty acid production.

Jo-Shu Chang, Department of chemical engineering, National Cheng Kung University, Tanian, Taiwan:

Talked about CO2 sequestration by micro-algae and production of economically important compounds. He discussed about major energy components from micro-algae as Butanol, ethanol, H2, Diols, lactic acids and succinic acids. The effluent gas composed of 23.1% CO2, SOx 85 ppm, NOx 75 ppm and at temperature of 230C can be used for growing microalgae. Burkholderia (a proteo bacter) can be used for lipase production.

Min S. Park, Advanced Biomass R & D Center, BioEnergy Engineering and Research Laboratories, Dept. of chemical and Biomolecular Engineering, Daejeon, Republic of Korea:

Nanocloropsis is the choicest microalgae used for studying bioenergy production. These organisms have lipid droplets in their chloroplast. They have done series of signalling work involving Nanochloropsis and came to conclusion that JNK type of MAPK was highly activated under osmotic stress. NaCl induces osmotic stress -> acts upon MAPKK -> acts on MAPK -> represses Transcription factor -> inducing lipid production. They also observed that lipid production is inhibited by treatment of MEK specific inhibitor. The microbial culture community comprising the treatment plants mostly contained scenedesmus, Golenkinia, Microspora, Micractinium etc.

Jong Moon Park, Department of Chemical Engineering, School of Environmental Science and Engineering, Division of Advanced Nuclear Engineering, POSTECH, Republic of Korea:

He presented 2 different aspects of Bio-enegy production: 1. Enhanced fatty acid production from microalgae and ethanol production of Cyanobacteria.
In Cyanobacteria, they have used several approaches for enhancing ethanol production directly by manipulating few enzymes. One is glucose-6-phosphate 1-dehydrogenase, encoded by zwf and the other is Pdc. His admission is that ethanol from these engineered bacteria is released out of the cell and hence is not dangerous for the organism itself.
His notable work is also on microalgae where they have used food waste water and municipal sludge as one of the combinations for optimal growth of microalgae. He has also suggested that the municipality wate or food waste water can be diluted 20 times for growing micro-algae in them.
Chlorella was used for bio-diesel production.
Article look up are: Dexter and Fu, 2009; Li, C. 2015 for ethanol from Cyanobacteria.

Apart from this there were many more interesting talks, that I am not delving upon here. So, in all, everyone is looking for a breakthrough in growing these organisms faster and producing fatty acids quickly....

Thursday, November 5, 2015

Installing R packages that use shared library in Linux

Many R packages use scripts (or libraries) written in other languages like C, FORTRAN etc from shared libraries. Normally the main scripts (and their dependencies like header files(.h files)) are kept in the src directory inside the package. During installation of the package from the source file(.gz) using R CMD INSTALL somepackage.tar.gz, the scripts are compiled and generates some shared objects in the local directory which dynamically links to the shared library (to libsomename.so.some_number file) which is generally /usr/local/lib. This linking happens through some configuration file (/etc/ld.so.conf ) and some environment variables (e.g. LD_LIBRARY_PATH). Often the conf file and the environment variable does not contain the path of the shared library (mostly happens when users use their own shared library instead of the default) and thus during installation it shows the error: "shared object not found.... no such file or directory".

one way to solve this problem is problem is to run the ldconfig (or /sbin/ldconfig) commands(preferably in verbose mode(-v) ). This program creates the required links and cache to the most recent shared libraries.

example:

I faced similar type of error "shared object not found.... no such file or directory" during installation of the package fftwtools (R CMD INSTALL fftwtools.tar.gz). The steps I followed to fix the problem are:

1. error obsereved : can not open .../fftwtools/src/fftwtools.so: ... libfftw3.so.3... no such file or directory.

2. located the file using: locate libfftw3.so.3 (to be sure that the file exists)

output:

/usr/local/lib/libfftw3.so.3

/usr/local/lib/libfftw3.so.3.3.2

3. run /sbin/ldconfig -v

output:

/sbin/ldconfig: Path `/lib64' given more than once

/sbin/ldconfig: Path `/usr/lib64' given more than once

........................

/opt/bio/EMBOSS/lib:

libajax.so.6 -> libajax.so.6.0.3

libnucleus.so.6 -> libnucleus.so.6.0.3

.................

/lib:

libdevmapper-event.so.1.02 -> libdevmapper-event.so.1.02

libiw.so.28 -> libiw.so.28

.......................

/lib64:

libdevmapper-event.so.1.02 -> libdevmapper-event.so.1.02

libiw.so.28 -> libiw.so.28

.........................

/usr/local/lib:

libnucleus.so.6 -> libnucleus.so.6.0.5

libeplplot.so.3 -> libeplplot.so.3.2.7

.........

libfftw3.so.3 -> libfftw3.so.3.3.2

libezlib.so.1 -> libezlib.so.1.1.0

.........................

4. Install the package: R CMD INSTALL fftwtools.tar.gz

Hope the steps works for you. I will be happy to answer any queries regarding this issue. Thanks a lot for reading the post.

Thursday, July 16, 2015

Using Scipio for generating training dataset for Augustus gene modeler

It is a bit of a hassel if you have a brand new species and would like to train augustus for your dataset. However, there is an incredibly easy way to do this using scipio. Scipio is a wrpper for BLAT program that takes advantage of having a protein file and a genome file of a reference organism. It is very easy and fast to generate genbank files from this 2 files using scipio and BLAT.

Since Scipio only has 3 perl scripts, one can install it inside the augustus installation directory.

Proceed the following way:
[Make sure BLAT is in your path. Also make sure YAML module is installed in your system]

./scipio.1.4.1.pl --blat_output=test.psl genome.fa proteins.aa > test.yaml
cat test.yaml | yaml2gff.1.4.pl > test.scipiogff
scipiogff2gff.pl --in=test.scipiogff --out=scipio.gff -> this script comes with augustus distribution
cat test.yaml | yaml2log.1.4.pl > scipio.log

# Convert gff into Genbank format for training purposes.
# Here 1000 means intergenic distance is minimum 1000
gff2gbSmallDNA.pl scipio.gff genome.fa 1000 genes.raw.gb -> This script also comes with augustus package

Generate train.err file first using the following command
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.raw.gb 2> train.err

# Modify these crude gb files into a more cleaner gb file
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
grep -c "LOCUS" genes.raw.gb genes.gb

# Running Training for gene prediction
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.gb 2> train.err

# Modify these crude gb files into a more cleaner gb file
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
grep -c "LOCUS" genes.raw.gb genes.gb

# Now run etraining again
etraining --species=myspecies --stopCodonExcludedFromCDS=true genes.gb 2> train.err

[NOTE: Here you have to remember few things: 1) first the AUGUSTUS_CONFIG_PATH should be set to the config directory inside the augustus installation path. 2) make a directory named 'myspecies' inside config/species directory and place a file 'myspecies_parameters.cfg' under config/species sub directory and one 'generic_weightmatrix.txt'. Both these files can be copied from ../generic/generic_parameters.cfg (rename this file) and ../generic/generic_weightmatrix.txt (leave it as such). Then run etraining. Several files will be created under config/species/myspecies directory. Now you are ready to roll]

# Now run Augustus for gene prediction

# It can be run with these simple commandline steps:

augustus --species=myspecies genome.fa > test.gff

Augustus 3.1 now produces very neat gff files that is both easy to visualize and easy to understand

# start gene g1

Scaffold_1 AUGUSTUS gene 2248 2811 0.62 - . g1

Scaffold_1 AUGUSTUS transcript 2248 2811 0.62 - . g1.t1

Scaffold_1 AUGUSTUS stop_codon 2248 2250 . - 0 transcript_id "g1.t1"; gene_id "g1";

Scaffold_1 AUGUSTUS CDS 2248 2811 0.62 - 0 transcript_id "g1.t1"; gene_id "g1";

Scaffold_1 AUGUSTUS start_codon 2809 2811 . - 0 transcript_id "g1.t1"; gene_id "g1";

# protein sequence = [MLLTTPNRIAIYSGLDTAMATFSFEVSLRQSSYYELFFAHSVCFLRKSGERDADFWQYCGGRADVFYLHQWCEHRGAD

# REFCSANIYSDGEDDSQKKGTSRAKKGNRKRRGSSQAEVLATLAESVSAITAANNTREAQEVAWKDQALLLHDKRISHRLGMLTEIFDRTNCYIFDLK

# KHMKMTMATTS]

# end gene g1

Wednesday, July 8, 2015

Hypocrisy of scientific journals

I dont want to sound like a bitter dejected negative person, however, this has been my un-biased view of publication policies of some of the well known prestigious journals.

When I consider publishing somewhere I quickly browse and get to the section where it lists the publication cost. While the cost of publication is usually very high, they do have sometimes concession for List A and List B countries which is economically un-developed. In both the lists you will not find India there, so that means we have to pay the full publication cost! But you open a news paper in any of the western countries or just see the economic rating given to India by wetern raters and it is always in the junk level. So, why this hypocrisy? On one hand you want to rate India way below many developing countries, and on the other hand you consider India to be considerably developed to pay the full publication fee...

My experiences with pre-publication inquiry to some of the journals that I published before (when I was in US) is also very alarming. The same paper will get published if sent by an US lab, but a lab from India will face rejection at the pre-publication inquiry stage itself. I am wondering if anyone else has noticed this. But we dont want this to deter us from publishing good articles. We will rise above the cloud and publish our work in higher journals.

Tuesday, July 7, 2015

Trend line and regression analysis. How to do it easily

Today I came across a thesis os a student who used this and it was a long forgotten topic for me. nevertheless, it was refreshing to read and revisit this very simple and statistically inclined topic that we encounter in day to day life. So what is a trend line?

We all knowingly or unknowingly try to predict future from our past experience. For example if a student is scoring good grades in previous years, we tend to predict what his/her grades are going to be in coming years. There could be many more such examples as predicting the stock market, predicting weather and so on. So, what we do internally is generate a trend and try to fit in the future based on that trend.

So, let take a simple example of scores of a student:

1 300
2 340
3 320
4 400
5 420
9 500

So, here these are some values/scores the student had in the months in the left column. Eye balling this data what we see is, there is a gradual improvement in scores. And we also observe that there are some missing points here (6,7,8). So, can we predict the score for the 10th month? Can we say what the score would have been on 6th, 7th and 8th month?

So, in this case, we may like to fit this into a trendline. The easiest way to do it is to use excel. So lets fill in the data in an excel sheet as shown below: [figure 1]

So, now the formula is Y = 25.5X + 278 at R2=0.928. Using this equation, we can predict the scores on 6th month e.g; Y = 25.5 X 6 + 278 = 309.5;
On 7th month it is = 25.5 X 7 + 278 = 454.4
On 8th month it is = 25.5 X 8 + 278 = 479.6
On 10th month, it is going to be 25.5 X 10 + 278 = 533

Now putting these values back into excel we get the following figure (Figure - 2)
So, what we see there? The equation and R2 has changed!! Slightly though.. So, what does it tell us? Probably the linear trendline is not correct for this data series.

Figure 1: Excel screen shot of how to select trendline in excel and display the formula.

Figure -2: Changed R value as well as trend line formula

Tuesday, June 9, 2015

Installing Bio Python in local Ubuntu server

I am trying to install biopython in my ubuntu server but hit the wall several times with this error message:
Cannot fetch index base URL https://pypi.python.org/simple/

I checked my pip version using the following command:

$pip --version and the output is 1.5.4.

I googled little bit about it and saw this nice post at http://stackoverflow.com/questions/17416938/pip-cannot-install-anything

I followed this exactly and installed pip version 1.2.1 and checked the path where it is installed. For instance my earlier pip version is not over written and it is installed in a separate place, then calling pip will still use it from the previous location.

So, I did the following:

/usr/local/bin/pip install biopython and it worked like a charm.

Monday, June 1, 2015

downloading a research articles even if your institute dont have access you can download ...

Guys,
no need to put request in facebook to download papers also here to download research articles
Just follow this procedure to download any Research Article
Step 1: Goto freestuffy.in
Step 2 Then goto Research Papers
Step 3 Then goto Scientific Journals
Step 3 Type here DOI no of your article that usually starts from 10.10.....
Step 4 Hit Search button to download same

If unable to download it
Just follow these steps
Step 1: Goto freestuffy.in
Step 2 Then goto Research Papers
Step 3 Then goto server 1
Step 4 Just type URL of paper and hit search and paper is ready to download
and Method1 doesnt work...go to server1 option and don't type URL ,doi, pmid etc there , instead paste the full title of paper...it works...

Friday, February 27, 2015

Quantifying Microbiome of self

Here is a very interesting paper that I came across today in Genome Biology Journal. [ http://genomebiology.com/2014/15/7/R89 ]

The authors took time series data of their microbiota at 10,000 longitudinal data points and concluded that the gut microbiota remains more or less same over a period of time unless there is a real change in life style or disease condition.

For example in the two individuals this study was undertaken, visibly the gut microbiome was upset when the subject visited from developed world to developing world. In case of subject 2 there was a clear change in the gut microbiome pattern when he had bouts of diarrhea [Image below].

It was also reported that the variation between individuals is much larger than variation within an individual [Figure below]

Thursday, January 1, 2015

Determining Outlier

Reference: http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm

Boxplot Construction:

The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distributions. The box plot uses the medianand the lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range or IQ.

Box plot with fences:

A box plot is constructed by drawing a box between the upper and lower quartiles with a solid line drawn across the box to locate the median. The following quantities (called fences) are needed for identifying extreme values in the tails of the distribution:

lower inner fence: Q1 - 1.5*IQ
upper inner fence: Q3 + 1.5*IQ
lower outer fence: Q1 - 3*IQ
upper outer fence: Q3 + 3*IQ

Outlier Detection:

A point beyond an inner fence on either side is considered a mild outlier. A point beyond an outer fence is considered an extreme outlier.

My script to detect outlier:

#!/usr/bin/perl -w

# This script is to detect outliers in a dataset

`sort -g $ARGV[0] > 'tmpSorted'`;

open FH , "tmpSorted" or die "Cant open file for reading $! \n";

my $count=1;
my @arr;

while(<FH>){
chomp;
$arr[$count]=$_;
$count++;
}
close(FH);

my ($median, $lowQ, $upQ, $low1B, $up1B, $low2B, $up2B);

my $length = scalar(@arr);

my $m = findMidpoint($length);

my @med = @$m;

if(scalar(@med) > 1){
$median = ($arr[$med[0]] + $arr[$med[1]])/2;
print "Inside outer if and $med[0] and $med[1] and median is $median\n";
$m = findMidpoint($med[0]);
my @tmp = @$m;
$m = findMidpoint($length - $med[1] + 1);
my @tmp1 = @$m;
print "printing...",join("|",@tmp),"and ", join("=",@tmp1) , "\n
";

if(scalar(@tmp) > 1){
$lowQ = ($arr[$tmp[0]] + $arr[$tmp[1]]) /2;
}
else{
$lowQ = $arr[$tmp[0]];
}
if(scalar(@tmp1) > 1){
$upQ = ($arr[ $med[1] + $tmp[0] - 1] + $arr[ $med[1] + $
tmp[1] - 1]) / 2;

}
else{
$upQ = $arr[$med[1] + $tmp1[0] - 1];
}

}

else{
$median = $arr[$med[0]];
print "Inside outer else and median is $median\n";
$m = findMidpoint($med[0]);
my @tmp = @$m;
$m = findMidpoint($length - $med[0] + 1);
my @tmp1 = @$m;
print "printing",@tmp, @tmp1 , "\n";

if(scalar(@tmp) > 1){
$lowQ = ($arr[$tmp[0]] + $arr[$tmp[1]]) /2;
}
else{
$lowQ = $arr[$tmp[0]];
}
if(scalar(@tmp1) > 1){
$upQ = ($arr[ $med[0] + $tmp[0] - 1] + $arr[ $med[0] + $
tmp[1] - 1]) / 2;

}
else{
$upQ = $arr[$med[0] + $tmp1[0] - 1];
}

}

my $innerRange = ($upQ - $lowQ) * 1.5;
my $outerRange = ($upQ - $lowQ) * 3;

$low1B = $lowQ - $innerRange;
$low2B = $lowQ - $outerRange;
$up1B = $upQ + $innerRange;
$up2B = $upQ + $outerRange;

print "Median,lowq,upq,innerRange, outerRange, up1b,up2b, low1b, low2b is $media
n, $lowQ, $upQ, $innerRange, $outerRange, $up1B, $up2B, $low1B, $low2B \n";

sub findMidpoint{

my $length = $_[0];
my @arr;

# Even number return 2 values
if(length($length) % 2 == 0){
$arr[0] = $length / 2;
$arr[1] = $arr[0] + 1;
}
else{
$arr[0] = ($length + 1)/2;
}

return \@arr;
}

Computational Genomics Lab at IICB

Followers