Sunday, March 26, 2017

Analysing alleles from haplotypes in PacBio data

I have been working with PacBio data, trying to identify the alleles from haplotypes in a diploid assembly. I hit many errors at the very first steps because I was following the method I use for Illumina datasets, and the tools developed for short reads behave strangely with long reads. I was stuck for 3-4 days, googled as much as I could and tried various approaches. Finally I posted in the forums and talked with the GATK developers, who suggested a simple solution for my errors. So for those who are working with long reads and want to identify haplotypes, here is my command line, which I have verified.

Any aligner can be used, even BLASR. Initially I thought the problem was my aligner, but it was not. There is also no need to mark duplicates for long reads; the developers recommend that step only for Illumina reads. So far I have reached the HaplotypeCaller step and it is running smoothly with no errors. If I change the commands or run into problems I will update this post, and once the output is ready I may paste some of it here.

bwa index 2017_V6_Pr102_assembly.fa
bwa mem -x pacbio 2017_V6_Pr102_assembly.fa /data/results1/STLab/Takao_data/Raw_data/ND886/all_ND886.fastq > aln.sam
samtools view -b -S aln.sam -o aln.bam
samtools sort aln.bam > aln_sorted.bam
samtools index aln_sorted.bam
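samtools mpileup -uf 2017_V2_ND886_assembly.fa aln_sorted.bam | /share/apps/bcftools-1.2/bcftools call -cv - > out.vcf

[Use bcftools 1.2; with other versions I did not get the genotype information. Also note that the FASTA given to mpileup -f must be the same assembly the reads were aligned to with bwa mem, so substitute your own assembly file in both places.]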
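
Since the phasing step depends on genotypes, it can help to confirm that out.vcf really carries a GT field before going further. The following is only a minimal sketch, assuming an uncompressed VCF; the function name has_genotypes is made up for illustration and is not part of samtools or bcftools.

```shell
# Report whether every record in a VCF has GT in its FORMAT column.
# Assumes a plain-text (not bgzipped) VCF; has_genotypes is a hypothetical helper name.
has_genotypes() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            records++
            n = split($9, fmt, ":")        # column 9 is FORMAT
            for (i = 1; i <= n; i++) {
                if (fmt[i] == "GT") { with_gt++; break }
            }
        }
        END {
            printf "%d records, %d with GT\n", records, with_gt
            # succeed only if there is at least one record and all of them have GT
            exit ((records > 0 && records == with_gt) ? 0 : 1)
        }
    ' "$1"
}
```

Calling has_genotypes out.vcf prints a one-line summary and returns a non-zero status if any record lacks GT.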

Use your favourite haplotype phaser (WhatsHap or HapCUT) with the BAM and VCF files produced above.
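
As a rough sketch of what the phasing step could look like with WhatsHap, assuming whatshap is installed and on the PATH: the wrapper function phase_with_whatshap and the output name phased.vcf are hypothetical, chosen here for illustration. The --ignore-read-groups flag is included on the assumption that the BAM has no read groups, since the bwa mem command above does not set one with -R.

```shell
# Phase the variants in a VCF using the reads in a sorted, indexed BAM.
# phase_with_whatshap is a hypothetical wrapper; arguments: input VCF, BAM, output VCF.
phase_with_whatshap() {
    vcf=$1; bam=$2; out=$3
    # --ignore-read-groups: treat all reads as one sample (assumes a single-sample VCF)
    whatshap phase --ignore-read-groups -o "$out" "$vcf" "$bam"
}
```

With the file names used in this post, the call would be: phase_with_whatshap out.vcf aln_sorted.bam phased.vcf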

You now have the phased alleles from the haplotypes. You can compare them and use them for downstream analysis.
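
For a quick look at how much was actually phased, one option is to count genotypes written with "|" (phased) versus "/" (unphased) in the phaser's output. This is a minimal sketch, assuming the phaser wrote an uncompressed, single-sample VCF, as WhatsHap does; count_phased is a hypothetical helper name. Homozygous calls normally stay written with "/", so they land in the unphased count.

```shell
# Count phased (0|1 style) and unphased (0/1 style) genotypes in the first sample column.
# Assumes a plain-text VCF with GT as the first FORMAT field; count_phased is a hypothetical name.
count_phased() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            split($10, sample, ":")        # column 10 is the first sample
            gt = sample[1]                 # GT comes first when present
            if (index(gt, "|") > 0) phased++
            else if (index(gt, "/") > 0) unphased++
        }
        END { printf "phased=%d unphased=%d\n", phased, unphased }
    ' "$1"
}
```

Running count_phased phased.vcf prints the two counts on one line.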
