Monday, December 7, 2015

SNP calling using GATK for a de novo genome

I got a chance to work on the Leishmania genome, where I have a genome assembly but no deposited dbSNP or any other known-sites file to use for variant calling. I got stuck at many steps and posted my queries on the GATK forums; they replied to some of them, but at one point the replies stopped (people were having a busy Thanksgiving holiday), so I figured out the rest of the variant calling myself. I think this blog will be useful for a naive person like me. Let's see the workflow, and please refer to the GATK documentation for the explanations.
#first build the index for the reference genome
/share/apps/bowtie2-2.1.0/bowtie2-build after_removing_2k.fasta  leishmania.index.bt2
#after indexing, map the reads to the reference
/share/apps/bowtie2-2.1.0/bowtie2 -x leishmania.index.bt2 -1 /data/results/STLab/NahidAli/141218_SND393_A_L005_HTI-5_trim_R1_filtered.fastq -2 /data/results/STLab/NahidAli/141218_SND393_A_L005_HTI-5_trim_R2_filtered.fastq -S bowtie_aligned.sam
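(Optional) If your bowtie2 version supports it, the read group can be attached at mapping time with the --rg-id/--rg options, which would make the AddOrReplaceReadGroups step further below unnecessary. This is just a sketch; the read-group values are placeholders matching the ones used later:
#alternative: map with read group information added at alignment time
/share/apps/bowtie2-2.1.0/bowtie2 --rg-id 1 --rg SM:leishmania --rg PL:illumina --rg LB:library1 -x leishmania.index.bt2 -1 /data/results/STLab/NahidAli/141218_SND393_A_L005_HTI-5_trim_R1_filtered.fastq -2 /data/results/STLab/NahidAli/141218_SND393_A_L005_HTI-5_trim_R2_filtered.fastq -S bowtie_aligned.sam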
#convert the sam file to bam
samtools view -S bowtie_aligned.sam -b -o bowtie_aligned.bam
#sort the bam file
samtools sort bowtie_aligned.bam bowtie_aligned_sorted
#create a pileup file
samtools mpileup -uf after_removing_2k.fasta bowtie_aligned_sorted.bam|/share/apps/samtools-0.1.18/bcftools/bcftools view -bvcg - > leishmania.raw.bcf
#convert bcf to vcf
/share/apps/samtools-0.1.18/bcftools/bcftools view leishmania.raw.bcf > leishmania.raw.vcf
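Before moving on, it is worth a quick sanity check of how well the reads mapped; a minimal check with samtools:
#optional sanity check of the mapping statistics
samtools flagstat bowtie_aligned_sorted.bam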
*********************************************************************************
The above commands are just an initial way of mapping the reads to the reference and calling variants with samtools/bcftools; the real GATK pipeline starts below. Since I don't have any known sites, I have done it without base recalibration.
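If you do want base recalibration later, the usual workaround for organisms without known sites is to bootstrap it: call variants once, hard-filter them, and feed the confident calls back in as the known-sites set. A minimal sketch, assuming the filtered_snps.vcf produced at the end of this pipeline is used as -knownSites (recal_data.table and recalibrated.bam are placeholder names):
#optional: bootstrap base recalibration from the hard-filtered calls
java -jar /share/apps/GenomeAnalysisTK.jar -T BaseRecalibrator -R after_removing_2k.fasta -I Indel_realigned.bam -knownSites filtered_snps.vcf -o recal_data.table
java -jar /share/apps/GenomeAnalysisTK.jar -T PrintReads -R after_removing_2k.fasta -I Indel_realigned.bam -BQSR recal_data.table -o recalibrated.bam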
#create the dictionary
java -jar /share/apps/picard-tools-1.56/CreateSequenceDictionary.jar R=after_removing_2k.fasta O=after_removing_2k.dict
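GATK also expects a samtools FASTA index (.fai) alongside the sequence dictionary, so create it if it does not already exist:
#create the fasta index required by GATK
samtools faidx after_removing_2k.fasta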
#add or mark group ids
java -jar /share/apps/picard-tools-1.56/AddOrReplaceReadGroups.jar I=bowtie_aligned.sam O=group_added_read.bam SO=coordinate RGID=1 RGLB=library1 RGPL=illumina RGPU=1 RGSM=leishmania VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE
#mark duplicates
java -jar /share/apps/picard-tools-1.56/MarkDuplicates.jar I=group_added_read.bam O=mapped_reads_dup.bam METRICS_FILE=metricsFile CREATE_INDEX=true
#index the bam file
java -jar /share/apps/picard-tools-1.56/BuildBamIndex.jar INPUT=mapped_reads_dup.bam
#create the realignment target intervals
java -jar /share/apps/GenomeAnalysisTK.jar -T RealignerTargetCreator -R after_removing_2k.fasta -o target_interval.intervals -I mapped_reads_dup.bam
#indel realigner
java -jar /share/apps/GenomeAnalysisTK.jar -T IndelRealigner -R after_removing_2k.fasta -I mapped_reads_dup.bam -targetIntervals target_interval.intervals -o Indel_realigned.bam
#haplotype caller
java -jar /share/apps/GenomeAnalysisTK.jar -T HaplotypeCaller -R after_removing_2k.fasta -I Indel_realigned.bam -stand_call_conf 30 -stand_emit_conf 10 -o raw_variants.vcf
#select the SNPs from the raw vcf file
java -jar /share/apps/GenomeAnalysisTK.jar -T SelectVariants -R after_removing_2k.fasta -V raw_variants.vcf -selectType SNP -o raw_snps.vcf
#do the SNP filtration
java -jar /share/apps/GenomeAnalysisTK.jar -T VariantFiltration -R after_removing_2k.fasta -V raw_snps.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "my_snp_filter" -o filtered_snps.vcf
#extract the indels
java -jar /share/apps/GenomeAnalysisTK.jar -T SelectVariants -R after_removing_2k.fasta -V raw_variants.vcf -selectType INDEL -o raw_indels.vcf
#do the indel filtration
java -jar /share/apps/GenomeAnalysisTK.jar -T VariantFiltration -R after_removing_2k.fasta -V raw_indels.vcf --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" --filterName "my_indel_filter" -o filtered_indels.vcf

From the above predicted SNPs and indels, extract the regions of interest, annotate them further, and work on them. Happy variant calling!
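If you only want the calls that survived the hard filters before annotation, SelectVariants can drop the filtered records. A minimal sketch; the pass_snps.vcf and pass_indels.vcf output names are placeholders:
#optional: keep only the variants that passed the filters
java -jar /share/apps/GenomeAnalysisTK.jar -T SelectVariants -R after_removing_2k.fasta -V filtered_snps.vcf --excludeFiltered -o pass_snps.vcf
java -jar /share/apps/GenomeAnalysisTK.jar -T SelectVariants -R after_removing_2k.fasta -V filtered_indels.vcf --excludeFiltered -o pass_indels.vcf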
