Friday, October 24, 2014

Upgrading Ubuntu and the consequences

When you upgrade Ubuntu, there may be many unpleasant side effects. For instance, I got an email saying that our server was not accessible for citation purposes. I checked the web document roots and changed some permissions (which seemed to have changed during the upgrade), but the site still came up blank.

To check the Ubuntu version, run:
lsb_release -a
Mine was 14.04.

So I went ahead and restarted Apache; the commands are slightly different from those on Red Hat Linux.

sudo /etc/init.d/apache2 restart

Restarting web server apache2
apache2: Could not reliably determine the server's fully qualified domain name, using for ServerName
... waiting apache2:
Could not reliably determine the server's fully qualified domain name, using for ServerName

Then, after browsing several web sites, I did the following:

Created a file servername.conf inside /etc/apache2/conf-available:
sudo vim /etc/apache2/conf-available/servername.conf

Inside this file, entered the line:

ServerName   MyDomainName

Enabled it (the argument is the name of the file created, without the .conf extension):
sudo a2enconf servername

Then reloaded Apache:
sudo service apache2 reload
and restarted it using:

sudo /etc/init.d/apache2 restart

The warning message disappeared, but the web page was still not up.
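
The ServerName steps above can be collected into one short script. This is only a sketch: it writes the snippet to a staging directory first (CONF_DIR and SERVER_NAME are hypothetical variables introduced here), so it can be inspected before anything under /etc is touched. The privileged commands are left as comments, and the apache2ctl configtest syntax check is an extra step that was not part of what I originally ran.

```shell
# Stage the ServerName snippet in a scratch directory (assumption: on the
# real server the target is /etc/apache2/conf-available).
conf_dir="${CONF_DIR:-$(mktemp -d)}"
server_name="${SERVER_NAME:-MyDomainName}"   # placeholder domain name

printf 'ServerName %s\n' "$server_name" > "$conf_dir/servername.conf"
cat "$conf_dir/servername.conf"

# On the real server (needs root, so not executed here):
#   sudo cp "$conf_dir/servername.conf" /etc/apache2/conf-available/
#   sudo a2enconf servername
#   sudo apache2ctl configtest
#   sudo service apache2 reload
```

Staging the file first makes it easy to check for typos before Apache ever reads it.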

Then you may have to change the document root directory. In our case it was /var/www earlier, but it is now /var/www/html.
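
To see which document root Apache is actually configured with after an upgrade, the DocumentRoot directives can be listed. A minimal sketch, assuming the virtual host files live under /etc/apache2/sites-enabled (the usual Ubuntu layout); show_docroot is a hypothetical helper name, and the demo runs on a throwaway file so that it works anywhere.

```shell
# Hypothetical helper: print the distinct DocumentRoot values found under
# a configuration directory.
show_docroot() {
    grep -Rhi '^[[:space:]]*DocumentRoot' "$1" | awk '{ print $2 }' | sort -u
}

# Demo on a scratch virtual host file.
demo_dir="$(mktemp -d)"
printf '<VirtualHost *:80>\n    DocumentRoot /var/www/html\n</VirtualHost>\n' \
    > "$demo_dir/000-default.conf"
show_docroot "$demo_dir"

# On the server itself:
#   show_docroot /etc/apache2/sites-enabled
```

If the printed path differs from where the site files actually live, either move the files or edit the DocumentRoot line and reload Apache.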

If you depend on BioPerl modules, you may also find that most of your Perl modules have disappeared. In that case, search for the particular module using the command:

find / -name

You may see that your @INC path has changed. You will then want to place your BioPerl files in a directory that is on the @INC path.

Change the permissions of some files and then it will start working!
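
The module hunt described above can be sketched as follows. The helper name find_module and the scratch module tree are hypothetical and only there so that the example runs without BioPerl installed. Setting PERL5LIB is shown as an alternative to copying files into an @INC directory; it is a standard Perl environment variable, but it is not what I did at the time.

```shell
# Hypothetical helper: look for a module file by name under a directory tree.
find_module() {
    find "$1" -type f -name "$2" 2>/dev/null
}

# Demo on a scratch tree; on the server one would search from / instead,
# e.g. find_module / SeqIO.pm
demo_lib="$(mktemp -d)"
mkdir -p "$demo_lib/Bio"
printf 'package Bio::SeqIO;\n1;\n' > "$demo_lib/Bio/SeqIO.pm"
find_module "$demo_lib" SeqIO.pm

if command -v perl >/dev/null 2>&1; then
    # Directories Perl currently searches for modules.
    perl -e 'print "$_\n" for @INC'
    # PERL5LIB prepends a directory to @INC without moving any files.
    PERL5LIB="$demo_lib" perl -MBio::SeqIO \
        -e 'print "loaded from ", $INC{"Bio/SeqIO.pm"}, "\n"'
fi
```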

Thursday, October 16, 2014

Days 2 and 3, Beyond the Genome 2014

Talk 2: Genetics of Glioblastoma:
Different cell populations exist in glioblastoma, which fits the cancer stem cell model
ChIP-seq and functional elements. In vitro model. Differentiated glioblastoma cells. Xenograft model. Introduced TFs in vitro to induce tumor.
Core TFs bind to active TPC regulatory elements
Single-cell RNA-seq for glioblastoma. There is receptor diversity inside the glioblastoma tumor. 430 cells were sequenced. Around 6000 genes were detected in each cell. PDGFRA and TGFR are negatively co-regulated.
Core TFs are highly correlated with stemness. Negative correlation with MES. Then classify cells into high stemness or low stemness. For each cell there is a dominant transcription signature. Cells can switch from one subclass to another. Tumors are more heterogeneous than was thought before
Master regulators of tumor initiation, progression…
SynGen algorithm for predicting synergism of molecules. ARACNe algorithm was used for reconstructing the genetic network.
Transcription profile: regulatory network and functional network
Cell types, perturbations and phenotypic assays. Cell types can be cancer cell lines or any other cell lines; perturbations can be drug related. Assays can be transcriptomic or proteomic. LINCS L1000 data (CMapIII). 22119 genetic reagents, 77 cellular contexts, 20413 chemical reagents
Genomics, epigenomic…
Myc cell cycle, apoptosis, cellular transformation, cell proliferation. Myc is over expressed in many cancers.
Drugging the cancer interactions
Multifaceted target assessment for druggability
Cancer drug targets make distinct subnetworks inside a network.

A complete catalogue, identification of drivers. Data sets are fewer for epigenetic modifications.
There are 40 epigenetic marks
Understand the chromatin states between Normal -> Tumor -> Metastasis
ChIP-seq for 35 chromatin marks. Generated a lot of data. Chromatin state prediction with ChromHMM
Relative changes in chromatin marks.
Loss of acylation in tumorigenic cells.
Six billion reads have been generated. Epigenomic plasticity
Personalized medicine
BRAF is mutated in human melanoma
Everolimus -> 17000 somatic mutations in a person who responded well. MAP2K1 15 bp deletion.
What kind of mutation each patient with a solid tumor carries needs to be determined before deciding on the treatment type.
341 genes are listed for assay
Ten trillion bacterial cells, ten times more bacterial cells than human cells
100 times more genes
Circulating tumor DNA. 90% of cell-free DNA is from hematopoietic stem cells. Cell-free DNA increases in cancer patients. Plasma-seq. The coverage is 0.1X depth
GRAIL is a text-mining tool
Finding cancer driver genes
Blair, Cell 2013: co-morbidity studies. 15 cancers in TCGA have somatic mutations (gain, mutated, loss). OMIM has germline mutations. Genetic links: network, pathways.
Cancer is co-morbid with other genetic diseases that happen due to mutation. Albinism is associated with some common genes associated with melanoma.
1/3 of the Mendelian diseases have co-morbidity with cancer.
Bacterial-human somatic lateral gene transfer for cancer.
The fourth chromosome of Drosophila has 20% of its genome from bacteria.
Day 3
Talk1: Anchored Assembly: You can try at
Bioinformatics challenge:
BTG Informatics challenge: Single cell Copy Number analysis

Baslan Nature protocol 2012
Visualization of multi-dimensional cancer data, Genome Medicine 2013
Copy number prediction using TITAN, Ha et al., Genome Research 2014
Genomic media and clinical cancer medicine: Dana-Farber Cancer Institute:
Guided visualization exploration of cancer genomics

Jian Ma

Thursday, October 9, 2014

Beyond the Genome meeting, 8-10 October 2014: highlights of the first day

The day opened at 11 AM for registration, followed by a sumptuous lunch, and the sessions started at 12.30 PM at the Joseph B. Martin Conference Center, Harvard Medical School. The talks were fabulous. While listening to the talks, I was also scribbling down the points. The notes may not be very coherent, but they make some sense. Here are the excerpts:

FIRST TALK: MutSigCV: Tool for correcting the mutation rate - Cancer genes and evolution
1 mut/Mb, 5-20 drivers
Driver event: an event that increased the fitness of the cell when it occurred
Cancer genes: genes that harbor driver events.
Gene length plays a role: the longer the gene, the greater the chance that it carries a mutation
Lung cancer: 10 mutations/Mb -> 450 genes
Tumor types have different mutation rates; blood cancer has the lowest and lung cancer the highest
21 tumor types, 4729 tumors -> 3 * 10^6 mutations
When mutations are clustered, they are not there by chance and may have a role. When they are in a non-conserved region, they may be there by chance. They used FDR < 0.01 and a Q-Q plot to analyze the data that does not fall on the linear axis. This gave new genes in different cancer types.
With more data we get more cancer genes
The genes that are mutated at a higher rate (>20%) don't vary much with down-sampling. But as the % of mutation per gene went down, the sample size mattered, e.g. an increase in sample size led to an increased number of genes.
2000 tumor samples will be appropriate for discovery for each cancer type, so altogether we will need 2000 * 50 = 100,000 samples

Origin and consequences of genomic structural variation
Illumina, followed by SOLiD and least by 454. Depth is 8X, read length is 90 bp.
Deletion: BreakDancer, DELLY, CNVnator, Pindel
Many novel deletions
Inversions are very complex types in cancer genomes in the 1000 Genomes Project
Using MinION (Oxford Nanopore) to evaluate PacBio sequencing.
International Cancer Genome Consortium (ICGC): sequence entire genomes for normal and cancer patients.
Chromothripsis is a major cause of cancer occurrence. Conclusion: existing data is not enough; more data is needed.
Some chromosomes do circularize as a result of chromothripsis.
Pan cancer Analysis of whole genomes (PCAWG). Deeply sequenced data from 2000 patients.
Discovery of driver alterations in intergenic regions.
Therapeutic aspects of cancer drivers
Vemurafenib blocks BRAF.
6792 tumor samples covering 28 cancer types.
1. Identify the drivers, find drugs targeting these drivers and assign drugs to the patients for testing
2. 4068 tumors from 16 tumor types for somatic mutation
3. Yates and Campbell et al 2012
4. Finding positive selection signals could be indicative of a driver mutation. MuSiC-SMG/MutSigCV tools are used for detecting the positive selection that points to driver mutations. OncodriveFM (functional impact bias) scores mutations from synonymous changes up to stop codon gains and frameshifts, which get the highest scores.
5. Some hotspot mutations can be identified by OncodriveCLUST
6. has the mutation pipeline. You select the cancer type and then look for the driver genes.
7. Pipelines are also available to download and run locally.
8. Different methods can be used for detecting lowly visible driver mutations.
9. 460 cancer drivers identified
10. Pooled analysis and per-project analysis do bring about some non-overlapping genes.
11. 200 or so new driver genes, including regulatory genes
12. Act = gain of function (activating), LoF = loss of function (tumor suppressor genes)
13. 207 LoF, 170 Act and 83 are unclassified
14. 73 are major drivers
15. 460 are mutational drivers, 29 are CNV (copy number variation) drivers
16. 90% of the tumors have at least one driver event
17. 91 targeted drivers used for testing. 65.3% are in clinical trials
Epigenome Alterations
1. Third most common pediatric brain tumor, 45% incurable. Chemotherapy never works for these cases.
2. Two clear groups identified. Group B patients survive; group A tumors are very aggressive. So patients can be reclassified.
3. Deep exome sequencing for PFA. Recurrent tumors have no SNV, no CNV, no mutation. Looked at the epigenome. CpG methylation patterns are very different. PFA tumors have much more methylation than PFB, with a high number of promoter methylation events / gene silencing. In PFB no patterns converged on a pathway. In PFA they converged on a pathway where genes like to stay undifferentiated.
4. Recapitulating embryonic state
5. It is a non-heritable disease. This is a de novo disease.
Tumor/normal exome sequencing in dogs for somatic mutations.
1. Variant calling was done using MuTect, and significantly mutated genes were identified using MuSiC
2. Human data from Tumor Portal, Lawrence et al., Nature 505 (7484), 2014.
Pan cancer gene fusion
Oligo dimerization and tyrosine kinase domain fusions
1. COSMIC and Mitelman database -> 212 fusion genes and >3 samples
2. Discordant reads and anchor reads are used for fusion discovery. This leads to a large number of false positives. Minimum overlap junction optimizer (MOJO).
3. Fewer false positives. Excluded all fusion genes from GTEx data and also ignored those in >1% of TCGA normals
4. Most of the gene fusions occur between genes < 1 Mb apart.
5. 90% of the tumors have at least one fusion
6. 1578 are recurrent fusions in >2 samples.
7. 38 fusions are recurrent in 10 or more samples
Enriching NA NGS analysis for CNVs, SNVs and gene fusions (sponsored talk)
Human Genome Analysis:
Hubs are more sensitive to mutation. Do they break a motif? Do they hit a network hub?
Delivering large-scale clinical testing of cancer predisposition genes – what does it take?
Nazneen Rahman
TruSight workflow:
96 sample pipeline
Clonal evolution in breast cancer revealed by single nucleus genome sequencing
TCGA and ICGC (inter and intra tumor heterogeneity)
Mono-clonal, poly-clonal, self seeding, mutator phenotype or cancer stem cells.
For this, single-cell genome sequencing was pioneered. Nuc-seq: whole exome/genome sequencing using G2/M single cells. Minimum coverage depth is 60X and coverage breadth in exomes is about 90%.
Randomly picked cells from suspension
Exome sequencing identifies highly recurrent MED12 somatic mutations in breast fibroadenoma:
MED12 driver mutation in young women. Mutations occur at a hotspot and are in-frame deletions. Mutation at codon 44. These are benign and are in the stromal cells.
These mutations are reported in prostate cancer, adrenocortical carcinoma and uterine leiomyosarcoma. Possible dysregulation of estrogen signaling.
They show higher expression in response to estrogen. They arise in stromal regions.
Extensive variation between primary disease and metastatic disease.

Wednesday, September 24, 2014

Running Allpaths assembler on our microbial genome sequences

Sequence assembly is a very important step in microbial genomics programs. However, how we design our libraries and which platforms we choose are equally important for getting the optimal outcome. Insert size, its standard deviation, read length, read overlap size, mate pair insert size etc. are critical parameters in determining the quality of the final assembly.

Out of the many organisms we chose to sequence, one had visibly better data. Although the sequencing company suggested that we would get 75X data for the paired-end libraries and 10X for the mate pair libraries, we actually got far more coverage. In the last part of this article I describe how to calculate the X value for genome sequencing coverage.

We evaluated several assemblers and none of them produced an optimal result. However, we were very interested in running allpaths on our genomes since it predicted the genome size correctly and had an extremely effective QC step. Unfortunately, allpaths always returned errors for our dataset, suggesting a faulty library and faulty data. Here I describe the steps involved in this process.

Prepare your data:

The IN_GROUPS.CSV file needs to contain the following information:

The IN_LIBS.CSV file needs to contain more crucial information about the libraries:
library_name, project_name, organism_name, type, paired, frag_size, frag_stddev, insert_size, insert_stddev, read_orientation, genomic_start, genomic_end
illuminawgs1, L, L, fragment, 1, 280, 15, , , inward, 0, 0
illuminawgs2, L, L, fragment, 1, 280, 15, , , inward, 0, 0
illumina_jump1, L, L, jumping, 1, , , 2780, 50, outward, 0, 0
illumina_jump2, L, L, jumping, 1, , , 2780, 50, outward, 0, 0

Here one has to remember that the fragment size is the size of the actual fragment. For example, if the read length is 151 at each end, the fragment size is the distance between the outer ends of the two reads. When the two reads overlap, this can be calculated as Read_Length1 + Read_Length2 - overlap length, because the overlapping bases are covered by both reads and must be counted only once. Suppose the overlapping region is 10 bases; then the fragment size will be 151 + 151 - 10 = 292. Then there is the insert SD, which also needs to be carefully calculated and entered.
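
The frag_size and frag_stddev values can be estimated from measured fragment lengths (for example from aligning the read pairs against each other or against a reference) rather than taken from the library prep sheet. A minimal sketch, assuming one fragment length per line on standard input; frag_stats is a hypothetical helper, the three lengths are made up, and the population standard deviation is used for simplicity.

```shell
# Hypothetical helper: read one fragment length per line and print the
# rounded mean and population standard deviation.
frag_stats() {
    awk '{ s += $1; ss += $1 * $1; n++ }
         END {
             if (n == 0) exit 1
             m = s / n
             v = ss / n - m * m
             if (v < 0) v = 0          # guard against rounding error
             printf "%d %d\n", m + 0.5, sqrt(v) + 0.5
         }'
}

# Made-up lengths, for illustration only.
printf '%s\n' 270 280 290 | frag_stats
```

The two numbers printed are what would go into the frag_size and frag_stddev columns for that library.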

After the first step runs successfully, it creates a number of files under the DATA_DIR mentioned above.
The best thing would be to download the test data supplied along with allpaths and try doing the runs. If it runs fine, then your installation is probably good.

I had a tough time completing the assembly, and at every stage it stopped with this quirky error:
Tue Sep 23 16:24:25 2014 Filled 0.0436214% of 4153923 pairs.
No library parameter adjustment:  too few pairs closed.

Fatal error (pid=17593) at Tue Sep 23 16:24:25 2014:
Less than 10% of fragment pairs were filled.
There may be a problem with the library.

Here is what I tried to curb this:

and still got the same error. I tried setting FF_MAX_STRETCH to a very high value and FF_MIN_OVERLAP to 0, and still got the same problem.

2. Then I actually went ahead and ran a pairwise alignment between my paired samples, and found that about 93% did indeed overlap. I calculated the fragment length and fragment SD and supplied them in in_libs.csv, but still had the same problem.

3. I took the jump library data, estimated the actual insert size and SD using a reference, and provided that information in in_libs.csv. It still failed.

4. I used randomized read fragments from another assembly, and it failed. I tried an established, already published genome with defined insert size, SD, read length and overlap, made a mock jump library and fragment library from it, and still got the same error.

5. I used reference-guided assembly, with the same result.
Finally, I decided to use the test data that comes from allpaths distribution at and it ran fine, to my relief. I then started working from those in_libs.csv files rather than my own, supplying my own data files.

6. First I ran the mock data, but it failed. A quick look showed that the Genome_size parameter that comes with the test data is 200000. So I quickly changed that to 2000000 (2 Mb), and lo and behold, it ran to completion!! The number of scaffolds was also very good (33)!!

7. Then I finally supplied my own sequencing data and ran it, and it seems to have got past the fill-fragments stage. Keeping my fingers crossed! Hope it ends with a positive outcome!

Yay!! Assembly is DONE!!!!!!!!!!!!!!!!!! With great statistics.. :) :)
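
As for the X value mentioned at the start: sequencing coverage is conventionally the total number of sequenced bases divided by the genome size. A minimal sketch, assuming paired reads of equal length; coverage is a hypothetical helper, the pair count is the one from the error log above, and the 5 Mb genome size is a made-up figure for illustration, not the size of our organism.

```shell
# coverage PAIRS READ_LEN GENOME_SIZE -> depth (X) with one decimal place.
# Each pair contributes two reads of READ_LEN bases.
coverage() {
    awk -v pairs="$1" -v len="$2" -v size="$3" \
        'BEGIN { printf "%.1f\n", (pairs * 2 * len) / size }'
}

# 4153923 pairs of 151 bp reads on a hypothetical 5 Mb genome.
coverage 4153923 151 5000000
```

Comparing this figure with what the sequencing company promised is a quick sanity check before starting an assembly.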

Monday, March 10, 2014

Installing EMBOSS GUI for your local data

EMBOSS is a great sequence analysis package that comes with a number of very useful programs. For those who are wary of using command-line options, there is good news: GUI applications are also available for EMBOSS. EMBOSS-GUI, developed by Luke McCarthy, is one of the most popular and widely used GUI applications for interfacing with EMBOSS programs. The recent version of this GUI, called EMBOSS-explorer, has some bugs and I could not get it working on my system. However, I have a copy of the earlier, stable release. You can write to me for a copy if you want. First install it; the procedure is fairly simple. You may need certain privileges to be able to install this GUI. After installation you will get a file in your cgi-bin directory that will have paths for the following directories:

print "Content-type: text/html\n\n";
init('/dir1', '/dir2', '/dir3', '', '')

where dir1 is where you have installed the GUI package,
      dir2 is where you have the binaries of the EMBOSS programs installed,
      dir3 is the document root.
      The next two are the http address of the web site and the cgi-bin address.
After this is all in place your GUI should work fine.

Here is a catch: for some reason, in my last installation, the path I gave for installation and the actual path from which the GUI package was being read were different. So, before making any change, first check which GUI is in the path by writing a small Perl script:

print $_, "\n" for @INC;
The copy of the GUI found in one of these paths is the one you will have to modify.

After installing it, I wanted my own data to be read without the user copy-pasting the sequences. In figure-1, in the section where "To access a sequence from a database, enter the USA path here: (dbname:entry)" is written, one can actually supply the name of the organism and a particular sequence in a fasta file to be worked on. This is a value addition if you are dealing with a lot of sequences. I did a bit of poking around and figured out that the file located under your dir1/ (as listed in file) has several functions that actually run the show. For instance, the list of programs that is loaded is nothing but a call to the EMBOSS program wossname. For the GUI, several programs should be disabled since they are computationally very intensive. There is an $exclude option where these programs can be named.
All the GUI options are derived from the acd files that are stored in the acd directory under /dir1. These acd files are named program_name.acd. Now, coming back to setting your files to be read internally, go to the line where it says:
 elsif ($item =~ /^[\w-]+:/),
 and change it so that the input string you provide is recognized. Set the path to your data. I replaced this line with:

 elsif ($item =~ /\//) {
 push @command, ("-$param", $item);

Put your genome/sequence files under MYPATH.
Figure-1: Screen shot of EMBOSS GUI.

If for some reason the installation script for the GUI does not work and you have the GUI package, then do the manual installation in the following way:

1. In the html directory place the path of the correctly (Mostly it should be in the cgi-bin directory)
2. Inside the cgi-bin directory, check the paths given as init() parameters, as described above. Correct them accordingly.
3. Check that you are accessing the right EMBOSS::GUI module. For that you have to check @INC.
4. Inside the right EMBOSS directory, you have to place an acd directory, which is missing from this package. This acd directory can be obtained from the EMBOSS executable package. The acd directory should have a number of .acd files. These files are read to build the forms for each of the EMBOSS programs.
5. Look for the exclude file, which should be inside GUI/data, and put it under the EMBOSS/GUI package.

Now probably you have 

Thursday, February 6, 2014

Making a local yum repository

Have you ever run the 'yum install' command and got messages like "package not found"? Or have you run 'yum repolist' and got a result of 0? In that case, you have to mount a local yum repository so that you can install packages safely.

What do you need?

You need the .iso image file for your operating system. Here is the step-by-step process.

1. Place your .iso file somewhere on your system, say /home/mine/RHEL.iso
2. mkdir -p /mypath/myName
3. mount -o loop /home/mine/RHEL.iso /mypath/myName
4. Run df -h and you will see the mounted file system [Optional]
5. Create the file /etc/yum.repos.d/myName.repo and edit it the following way:
      name=Red Hat Enterprise Linux
6. yum clean all
7. yum repolist -> This may give the following output:
Loaded plugins: product-id, refresh-packagekit, security, subscription-manager
Updating certificate-based repositories.
repo id                       repo name                                      status
server                        Red Hat Enterprise Linux                       3,529
repolist: 3,529

Now you can do your requisite installations using yum.
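
For reference, a complete repo file for step 5 might look like the sketch below. Only the name= line and the repo id "server" (seen in the yum repolist output) come from my setup; the baseurl, enabled and gpgcheck lines are assumptions. In particular, baseurl must point at the directory on the mounted image that contains repodata/, which on some releases is a subdirectory of the mount point. The file is written to a scratch directory here (REPO_DIR is a hypothetical variable) and can be copied to /etc/yum.repos.d afterwards.

```shell
# Stage the repo definition in a scratch directory first.
repo_dir="${REPO_DIR:-$(mktemp -d)}"

cat > "$repo_dir/myName.repo" <<'EOF'
[server]
name=Red Hat Enterprise Linux
# Assumption: repodata/ sits directly under the mount point.
baseurl=file:///mypath/myName
enabled=1
# Assumption: signature checking switched off for the local image.
gpgcheck=0
EOF

cat "$repo_dir/myName.repo"

# As root on the real system (not executed here):
#   cp "$repo_dir/myName.repo" /etc/yum.repos.d/
#   yum clean all && yum repolist
```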

Monday, January 6, 2014

Using the cluster

To shut down all compute nodes:
 rocks run host "/sbin/shutdown -h now"
To switch off the master node:
 shutdown -h now
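
Since the order matters (compute nodes first, master node last), the two commands can be wrapped in a small script with a dry-run switch. This is a sketch only: run and DRY_RUN are hypothetical names introduced here, and by default the script just prints what it would do.

```shell
# Hypothetical wrapper: print commands by default, execute them when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Shut down all compute nodes from the master node.
run rocks run host "/sbin/shutdown -h now"
# 2. Then shut down the master node itself.
run /sbin/shutdown -h now
```

Running it once in dry-run mode before setting DRY_RUN=0 avoids powering off the master node by accident.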

Switch on

Storage head
1              5 (switch on from here)
2              4 -> wait till the blue light starts blinking
3              3
4              2
5              1 -> the orange light will blink at the start

Server: master node and compute nodes

Head node                1              6 -> (start from here) blue light blinks
Compute node-0-0         2              7
Compute node-0-1         3              8
Compute node-0-2         4              9
Compute node-0-3         5              10

Note: the storage power switches are at the back of the server; each storage unit has two power switches.
When we switch on the compute nodes, the blue lights blink. Wait for some time and watch the monitor; if it gets stuck on a black screen, hit F1 and it will take you to the following screen.

The picture shows a compute node that is switched on.

The above picture shows the master node and the compute nodes.

The picture shows the storage. One unit is the master storage and the rest are the storage for our compute nodes. It should have a blinking blue light, but because of a power-related fault it blinks the amber light.
Once you log into the compute nodes, it may ask for a password. In that case the following commands can be given:
rocks sync host sharedkey compute-0-0 (the nodes for which you wish to set up a password-less connection)
 /sbin/service autofs restart
rocks run host '/sbin/service autofs restart'
rocks sync users

rocks sync config