Sequence assembly is a very important step in microbial genomics programs. However, it is equally important how we design our libraries and platforms we choose in order to get the optimal outcome. Insert size length, their standard deviation, read length, read overlap size, mate pair insert size, read length etc. are critical parameters in determining quality of the final assembly.
Out of the many organism we chose to sequence, one of the organisms had visibly better data. Although the sequencing company suggested that for paired end libaries we get 75X data and mate pair we get 10X, but actually we got way more coverage. In the last part of this article I describe how to calculate the X value for genome sequencing coverage.
We evaluated several assemblers and none of them produced optimal result. However, we were very interested to run allpaths on our genomes since it predicted the genome size correctly and had an extremely effective qc step. However, allpaths always returned with errors for our dataset, suggesting faulty library and faulty data. Here I describe the steps involved in this process.
Prepare your data:
=============
/PrepareAllPathsInputs.pl DATA_DIR=<PATH WHERE DATA-DIR WILL BE CREATED> PLOIDY=1 or 2
IN_GROUPS_CSV=<PATH FOR IN_GROUP.CSV FILE>
IN_LIBS_CSV=<PATH FOR IN_LIBS FILE>
OVERWRITE=True
Out of the many organism we chose to sequence, one of the organisms had visibly better data. Although the sequencing company suggested that for paired end libaries we get 75X data and mate pair we get 10X, but actually we got way more coverage. In the last part of this article I describe how to calculate the X value for genome sequencing coverage.
We evaluated several assemblers and none of them produced optimal result. However, we were very interested to run allpaths on our genomes since it predicted the genome size correctly and had an extremely effective qc step. However, allpaths always returned with errors for our dataset, suggesting faulty library and faulty data. Here I describe the steps involved in this process.
Prepare your data:
=============
/PrepareAllPathsInputs.pl DATA_DIR=<PATH WHERE DATA-DIR WILL BE CREATED> PLOIDY=1 or 2
IN_GROUPS_CSV=<PATH FOR IN_GROUP.CSV FILE>
IN_LIBS_CSV=<PATH FOR IN_LIBS FILE>
OVERWRITE=True
In IN_GROUPS.CSV file, the following information need to be there:
-------------------------------------------------------------------------------------
file_name,library_name,group_name
<actual_path>/L.R1.fq,illuminawgs1,frags1
<actual_path>/L.R2.fq,illuminawgs2,frags2
<actual_path>/M.R1.fq,illumina_jump1,jumping1
<actual_path>/M.R2.fq,illumina_jump2,jumping2
In IN_LIBS.CSV file more crucial information about the library need to be there:
--------------------------------------------------------------------------------------------------
library_name, project_name, organism_name, type,paired, frag_size, frag_stddev, insert_size, insert_stddev, read_orientation, genomic_start, genomic_end
illuminawgs1, L, L, fragment, 1, 280,15 , , , inward, 0, 0
illuminawgs2, L, L, fragment, 1, 280,15 , , , inward, 0, 0
illumina_jump1, L, L, jumping, 1, , , 2780,50 , outward, 0, 0
illumina_jump2, L, L, jumping, 1, , , 2780,50 , outward, 0, 0
Here one has to remember that fragment size is size of the actual fragment. For example if the read length is 151 each on both the ends, the fragment size will be the actual distance between the ends of these reads. In other words, this can be calculated as Read_Length1 + Read_Length2 - 2(overlap length). Suppose overlapping region is 10 bases then fragment size will be 151 + 151 - 20 = 282. And then there is insert SD that also needs to carefully calculated and mentioned.
After the first step that is PrepareAllPathsInputs.pl runs successfully, it will create a bunch of files under the DATA_DIR mentioned above.
The best thing would be to download the test data supplied along wth allpaths and try doing the runs. If it runs fine then your installation is probably good.
I had tough time completing the assembly and at every stage, it had this quirky error complaining about
Tue Sep 23 16:24:25 2014 Filled 0.0436214% of 4153923 pairs.
No library parameter adjustment: too few pairs closed.
Fatal error (pid=17593) at Tue Sep 23 16:24:25 2014:
Less than 10% of fragment pairs were filled.
There may be a problem with the library.
Here is what I tried to curb this:
1. set the FF_MAX_STRETCH=5 FF_MIN_OVERLAP=10
and still got the same error. I tried making the FF_MAX_STRETCH to a very high value and FF_MIN_OVERLAP to 0, still got the same problem.
The best thing would be to download the test data supplied along wth allpaths and try doing the runs. If it runs fine then your installation is probably good.
I had tough time completing the assembly and at every stage, it had this quirky error complaining about
Tue Sep 23 16:24:25 2014 Filled 0.0436214% of 4153923 pairs.
No library parameter adjustment: too few pairs closed.
Fatal error (pid=17593) at Tue Sep 23 16:24:25 2014:
Less than 10% of fragment pairs were filled.
There may be a problem with the library.
Here is what I tried to curb this:
1. set the FF_MAX_STRETCH=5 FF_MIN_OVERLAP=10
and still got the same error. I tried making the FF_MAX_STRETCH to a very high value and FF_MIN_OVERLAP to 0, still got the same problem.
2. Then actually went ahead and ran a pairwise alignment between my paired samples, and found about 93% indeed had overlaps. I calculated the fragment length and fragment SD and supplied that to in_libs.csv, still had the same problem.
3. I took the jump library data and tried to find their actual insert size using a reference and calculated approximate insert size and SD and provided that info in in_libs.csv, still failed.
4. Used randomized read fragment from another assembly, it failed. Tried an established genome already published with defined insert size, sd, read length and overlap and made mock jump library and fragment library still got the same error.
5. Used reference guided assembly, still the same.
Finally, I decided to use the test data that comes from allpaths distribution at
ftp://ftp.broadinstitute.org/pub/crd/ALLPATHS/Release-LG/ and it ran fine to my relief. Now I started working on those in_libs.csv files than my own providing my files.
6. First I ran the mock data, but it failed. A quick look suggested the Genome_size parameter that comes along the test data is 200000. So, I quickly changed that to 2000000 (2 MB), and lo and behold, it ran to completion!! The number of scaffolds were also very good (33)!!
7. Then I finally provided my sequencing data and running them and it seems to have surpassed the fillfragment stage. Keeping my fingers crossed! Hope it ends in positive outcome!
Yay!! Assembly is DONE!!!!!!!!!!!!!!!!!! With great statistics.. :) :)