Monday, December 7, 2015

SNP calling using GATK for a de novo genome

I got a chance to work on the Leishmania genome, where I have a genome assembly but no deposited dbSNP or any other known-sites file to use for variant calling. I got stuck at many steps and posted my queries on the GATK forums; they replied to some of them, but at one point the replies stopped (people were having a busy Thanksgiving holiday), so I figured out the rest of the variant calling myself. I think this blog will be useful for a naive person like me. Let's see the workflow, and please refer to the GATK documentation for the explanations.
#first build the index for the reference genome
/share/apps/bowtie2-2.1.0/bowtie2-build after_removing_2k.fasta  leishmania.index.bt2
#after indexing, map the reads to the reference
/share/apps/bowtie2-2.1.0/bowtie2 -x leishmania.index.bt2 -1 /data/results/STLab/NahidAli/141218_SND393_A_L005_HTI-5_trim_R1_filtered.fastq -2 /data/results/STLab/NahidAli/141218_SND393_A_L005_HTI-5_trim_R2_filtered.fastq -S bowtie_aligned.sam
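(Optional) If your bowtie2 version supports it, the read group can be attached at mapping time with the --rg-id/--rg options, which would make the AddOrReplaceReadGroups step further below unnecessary. This is just a sketch; the read-group values are placeholders matching the ones used later:
#alternative: map with read group information added at alignment time
/share/apps/bowtie2-2.1.0/bowtie2 --rg-id 1 --rg SM:leishmania --rg PL:illumina --rg LB:library1 -x leishmania.index.bt2 -1 /data/results/STLab/NahidAli/141218_SND393_A_L005_HTI-5_trim_R1_filtered.fastq -2 /data/results/STLab/NahidAli/141218_SND393_A_L005_HTI-5_trim_R2_filtered.fastq -S bowtie_aligned.sam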
#convert the sam file to bam
samtools view -S bowtie_aligned.sam -b -o bowtie_aligned.bam
#sort the bam file
samtools sort bowtie_aligned.bam bowtie_aligned_sorted
#create a pileup file
samtools mpileup -uf after_removing_2k.fasta bowtie_aligned_sorted.bam|/share/apps/samtools-0.1.18/bcftools/bcftools view -bvcg - > leishmania.raw.bcf
#convert bcf to vcf
/share/apps/samtools-0.1.18/bcftools/bcftools view leishmania.raw.bcf > leishmania.raw.vcf
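Before moving on, it is worth a quick sanity check of how well the reads mapped; a minimal check with samtools:
#optional sanity check of the mapping statistics
samtools flagstat bowtie_aligned_sorted.bam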
*********************************************************************************
The above commands are just an initial way of mapping the reads to the reference and calling variants with samtools/bcftools; the real GATK pipeline starts below. Since I don't have any known sites, I have done it without base recalibration.
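If you do want base recalibration later, the usual workaround for organisms without known sites is to bootstrap it: call variants once, hard-filter them, and feed the confident calls back in as the known-sites set. A minimal sketch, assuming the filtered_snps.vcf produced at the end of this pipeline is used as -knownSites (recal_data.table and recalibrated.bam are placeholder names):
#optional: bootstrap base recalibration from the hard-filtered calls
java -jar /share/apps/GenomeAnalysisTK.jar -T BaseRecalibrator -R after_removing_2k.fasta -I Indel_realigned.bam -knownSites filtered_snps.vcf -o recal_data.table
java -jar /share/apps/GenomeAnalysisTK.jar -T PrintReads -R after_removing_2k.fasta -I Indel_realigned.bam -BQSR recal_data.table -o recalibrated.bam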
#create the dictionary
java -jar /share/apps/picard-tools-1.56/CreateSequenceDictionary.jar R=after_removing_2k.fasta O=after_removing_2k.dict
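GATK also expects a samtools FASTA index (.fai) alongside the sequence dictionary, so create it if it does not already exist:
#create the fasta index required by GATK
samtools faidx after_removing_2k.fasta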
#add or mark group ids
java -jar /share/apps/picard-tools-1.56/AddOrReplaceReadGroups.jar I=bowtie_aligned.sam O=group_added_read.bam SO=coordinate RGID=1 RGLB=library1 RGPL=illumina RGPU=1 RGSM=leishmania VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE
#mark duplicates
java -jar /share/apps/picard-tools-1.56/MarkDuplicates.jar I=group_added_read.bam O=mapped_reads_dup.bam METRICS_FILE=metricsFile CREATE_INDEX=true
#index the bam file
java -jar /share/apps/picard-tools-1.56/BuildBamIndex.jar INPUT=mapped_reads_dup.bam
#create the realignment target intervals
java -jar /share/apps/GenomeAnalysisTK.jar -T RealignerTargetCreator -R after_removing_2k.fasta -o target_interval.intervals -I mapped_reads_dup.bam
#indel realigner
java -jar /share/apps/GenomeAnalysisTK.jar -T IndelRealigner -R after_removing_2k.fasta -I mapped_reads_dup.bam -targetIntervals target_interval.intervals -o Indel_realigned.bam
#haplotype caller
java -jar /share/apps/GenomeAnalysisTK.jar -T HaplotypeCaller -R after_removing_2k.fasta -I Indel_realigned.bam -stand_call_conf 30 -stand_emit_conf 10 -o raw_variants.vcf
#select the SNPs from the raw vcf file
java -jar /share/apps/GenomeAnalysisTK.jar -T SelectVariants -R after_removing_2k.fasta -V raw_variants.vcf -selectType SNP -o raw_snps.vcf
#do the SNP filtration
java -jar /share/apps/GenomeAnalysisTK.jar -T VariantFiltration -R after_removing_2k.fasta -V raw_snps.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "my_snp_filter" -o filtered_snps.vcf
#extract the indels
java -jar /share/apps/GenomeAnalysisTK.jar -T SelectVariants -R after_removing_2k.fasta -V raw_variants.vcf -selectType INDEL -o raw_indels.vcf
#do the indel filtration
java -jar /share/apps/GenomeAnalysisTK.jar -T VariantFiltration -R after_removing_2k.fasta -V raw_indels.vcf --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" --filterName "my_indel_filter" -o filtered_indels.vcf

From the above predicted SNPs and indels, extract the regions of interest, annotate them further, and work on them. Happy variant calling!
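If you only want the calls that survived the hard filters before annotation, SelectVariants can drop the filtered records. A minimal sketch; the pass_snps.vcf and pass_indels.vcf output names are placeholders:
#optional: keep only the variants that passed the filters
java -jar /share/apps/GenomeAnalysisTK.jar -T SelectVariants -R after_removing_2k.fasta -V filtered_snps.vcf --excludeFiltered -o pass_snps.vcf
java -jar /share/apps/GenomeAnalysisTK.jar -T SelectVariants -R after_removing_2k.fasta -V filtered_indels.vcf --excludeFiltered -o pass_indels.vcf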
