Of the many organisms we chose to sequence, one had visibly better data. The sequencing company suggested that we would get 75X coverage from the paired-end libraries and 10X from the mate-pair libraries, but we actually got far more coverage than that. In the last part of this article I describe how to calculate the X value for genome sequencing coverage.
We evaluated several assemblers, and none of them produced an optimal result. We were particularly interested in running allpaths on our genomes, since it predicted the genome size correctly and has an extremely effective QC step. However, allpaths always returned errors for our dataset, suggesting a faulty library and faulty data. Here I describe the steps involved in this process.
Prepare your data:
/PrepareAllPathsInputs.pl DATA_DIR=<PATH WHERE DATA-DIR WILL BE CREATED> PLOIDY=<1 or 2> \
IN_GROUPS_CSV=<PATH FOR IN_GROUPS.CSV FILE> \
IN_LIBS_CSV=<PATH FOR IN_LIBS FILE>
The best approach is to download the test data supplied with allpaths and try a run on it first. If that runs fine, your installation is probably good.
I had a tough time completing the assembly; at every stage it failed with this quirky error:
Tue Sep 23 16:24:25 2014 Filled 0.0436214% of 4153923 pairs.
No library parameter adjustment: too few pairs closed.
Fatal error (pid=17593) at Tue Sep 23 16:24:25 2014:
Less than 10% of fragment pairs were filled.
There may be a problem with the library.
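To put that percentage in perspective, here is a minimal Python sketch that parses the "Filled ...% of ... pairs" line from the log above and works out how many pairs were actually filled versus how many would be needed to clear the 10% mark named in the error. The function name parse_fill_line and the constant names are my own, for illustration only; they are not part of allpaths.

```python
import math
import re

# Log line copied verbatim from the error output above.
LOG_LINE = "Tue Sep 23 16:24:25 2014 Filled 0.0436214% of 4153923 pairs."

# Threshold taken from the error text: "Less than 10% of fragment pairs were filled."
MIN_FILLED_PERCENT = 10.0


def parse_fill_line(line):
    """Return (percent, total_pairs, filled_pairs), or None if the line does not match.

    Hypothetical helper for illustration; not an allpaths function.
    """
    match = re.search(r"Filled\s+([0-9.]+)%\s+of\s+(\d+)\s+pairs", line)
    if match is None:
        return None
    percent = float(match.group(1))
    total_pairs = int(match.group(2))
    # The log only reports a percentage, so the filled count is an estimate.
    filled_pairs = round(total_pairs * percent / 100.0)
    return percent, total_pairs, filled_pairs


percent, total_pairs, filled_pairs = parse_fill_line(LOG_LINE)

# Smallest number of filled pairs that would reach the 10% threshold.
needed_pairs = math.ceil(total_pairs * MIN_FILLED_PERCENT / 100.0)

print(f"Filled roughly {filled_pairs} of {total_pairs} pairs ({percent}%)")
print(f"At least {needed_pairs} filled pairs are needed to reach {MIN_FILLED_PERCENT}%")
```

In other words, only on the order of a couple of thousand pairs out of more than four million were filled, which is far below the threshold, so this does not look like a borderline case that a small parameter tweak would rescue.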
Here is what I tried to curb this:
1. Set FF_MAX_STRETCH=5 and FF_MIN_OVERLAP=10. I still got the same error. I then tried setting FF_MAX_STRETCH to a very high value and FF_MIN_OVERLAP to 0, and still got the same problem.