Monday, December 14, 2015

Variant Calling - The bowtie - picard - samtools - gatk pipeline....

Nextgen sequencing has caused a sudden surge in data deluge, but the informatics pipelines and algorithms are unable to keep up with the pace. While most of the exome sequencing data finally focuses on SNP calling and there are various ways of doing this, I decided to discuss one pipeline that has been accepted all over as one of the most sophisticated methods. It is the bowtie - picard - gatk pipeline.

When you are dealing with colorspace data the choice of mappers get limited. Howver, my favorite mapper is still bowtie for several reasons. Lifescope has its own inhouse mapper; which claims to have a all round better approach in mapping colorspace data, but the lack of transparency on what happens within puts me off using this tool. Once bowtie maps the reads by default parameter, the next thing to do is to convert the sam file into bam file, sort it and index it. All these can be done using samtools. However, if the file size is large, you could do the sorting job using sort operations from unix commandline.

1. sort sam file
LC_ALL="C" sort -k 3,3 -k 4,4n input_sam > output_sam # This step will take a long time

sort options for samtools works but only on bam files and on many instances downstream analysis softwares complain about co-ordinates not being sorted...

or Use picard:

java -jar /share/apps/picard-tools-1.56/SortSam.jar I=bowtie.sam O=bowtie.bam SO=coordinate # This took one hour in a HPC with 48 GB RAM on each node for a file size of 30 GB

2. Make an index file of bam file
samtools index bowtie.bam bowtie.bai

3. MarkDuplicates using picard
java -jar /share/apps/picard-tools-1.56/MarkDuplicates.jar I=bowtie.bam M=metrics.bam O=duplicateMarked.bam

4. sort this bam file and make index using samtools
samtools sort duplicateMarked.bam duplicateMarked.sorted
samtools index duplicateMarked.sorted.bam duplicateMarked.sorted.bai

5. Then run IndelRealignerTargetCreator using GATK
java -jar /share/apps/GenomeAnalysisTK-2.4-9-g532efad/resources/GenomeAnalysisTK.jar -T RealignerTargetCreator -I duplicateMarked.sorted.bam  -R /share/reference/human/samtools/hg19.fa -o dM.bam.list 
# The output file dM.bam.list returns 0 output. Check it later.

6. Now run this picard tool to get RG updated since indelRealigner complains.
java -jar /share/apps/picard-tools-1.56/AddOrReplaceReadGroups.jar I=duplicateMarked.bam O=readGroupReplaced.bam RGLB="LINK_TO_FASTA" RGPL=SOLID RGPU=run barcode RGSM=9111 SORT_ORDER=coordinate CREATE_INDEX=TRUE VALIDATION_STRINGENCY=LENIENT'

Posters I could take pictures of Beyond Genome 2014

Here are few of the posters in Beyond Genome meeting that I could take pictures of. There were many more, but access to take their pictures was less...
Our Poster


Monday, December 7, 2015

SNP calling using GATK for de novo genome

I have a got a chance to work in leishmania genome, where i have a genome assembly and i dont have any deposited dbSNP or any other reference file to do variant calling, i have been working and stuck in many steps and posted in GATK forums they replied to some of my queries at one point stopped to reply since People were having a fat and busy holiday on thanks giving , and figured out how to do the variant calling, i think this blog will be much useful for the naive person like me, lets see the workflow and please refer GATK documentation for the explanation.
#first build the index for the reference genome
/share/apps/bowtie2-2.1.0/bowtie2-build after_removing_2k.fasta  leishmania.index.bt2
#after index map the reference to the reads
/share/apps/bowtie2-2.1.0/bowtie2 -x leishmania.index.bt2 -1 /data/results/STLab/NahidAli/141218_SND393_A_L005_HTI-5_trim_R1_filtered.fastq -2 /data/results/STLab/NahidAli/141218_SND393_A_L005_HTI-5_trim_R2_filtered.fastq -S bowtie_aligned.sam
#convert the sam file to bam
samtools view -S bowtie_aligned.sam -b -o bowtie_aligned.bam
#sort the bam file
samtools sort bowtie_aligned.bam bowtie_aligned_sorted
#create a pileup file
samtools mpileup -uf after_removing_2k.fasta bowtie_aligned_sorted.bam|/share/apps/samtools-0.1.18/bcftools/bcftools view -bvcg - > leishmania.raw.bcf
#convert bcf to vcf
/share/apps/samtools-0.1.18/bcftools/bcftools view leishmania.raw.bcf > leishmania.raw.vcf
 The bove commands are just initial way of mapping the reads to the reference and the real GATK pipeline starts below since i don't have the any known sites  i have done without base cailbration 
#create the dictionary
#java -jar /share/apps/picard-tools-1.56/CreateSequenceDictionary.jar R=after_removing_2k.fasta O=after_removing_2k.dict
#add or mark group ids
#java -jar /share/apps/picard-tools-1.56/AddOrReplaceReadGroups.jar I=bowtie_aligned.sam O=group_added_read.bam SO=coordinate RGID=1 RGLB=library1 RGPL=illumina RGPU=1 RGSM=leishmania VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE
#mark duplicates
#java -jar /share/apps/picard-tools-1.56/MarkDuplicates.jar I=group_added_read.bam O=mapped_reads_dup.bam METRICS_FILE=metricsFile CREATE_INDEX=true
#sort bam file
#java -jar /share/apps/picard-tools-1.56/BuildBamIndex.jar INPUT=mapped_reads_dup.bam
#create realign target creator
#share/apps/GenomeAnalysisTK.jar -T RealignerTargetCreator -R after_removing_2k.fasta -o target_interval.intervals -I mapped_reads_dup.bam
#indel realigner
#/share/apps/GenomeAnalysisTK.jar -T IndelRealigner -R after_removing_2k.fasta -I mapped_reads_dup.bam -targetIntervals target_interval.intervals -o Indel_realigned.bam
#haplotype caller
#java -jar /share/apps/GenomeAnalysisTK.jar -T HaplotypeCaller -R after_removing_2k.fasta -I Indel_realigned.bam -stand_call_conf 30 -stand_emit_conf 10 -o raw_variants.vcf
#choose the variants from the raw vcf file
#java -jar /share/apps/GenomeAnalysisTK.jar -T SelectVariants -R after_removing_2k.fasta -V raw_variants.vcf -selectType SNP -o raw_snps.vcf
#do the filtration
#java -jar /share/apps/GenomeAnalysisTK.jar -T VariantFiltration -R after_removing_2k.fasta -V raw_variants.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "my_snp_filter" -o filtered_snps.vcf
#Extract the indels
#java -jar /share/apps/GenomeAnalysisTK.jar -T SelectVariants -R after_removing_2k.fasta -V raw_variants.vcf -selectType INDEL -o raw_indels.vcf
#do the filteration
#java -jar /share/apps/GenomeAnalysisTK.jar -T VariantFiltration -R after_removing_2k.fasta -V raw_variants.vcf --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" --filterName "my_indel_filter" -o filtered_indels.vcf

from the above predicted snps and indels extract the regions and further annotate and work on it happy variant calling !!!!!!!!!!