Friday, December 14, 2012
How to estimate significance in genomewide studies
By Subhadeep Das
With the advent of large-scale sequencing and microarray technology, genome-wide studies have become a powerful tool for delineating the regulatory networks of genes. They are widely used to identify differentially expressed genes, to detect regulatory regions across the genome, and so on. While these studies produce valuable information, a major difficulty lies in picking out the truly significant results from the huge datasets they generate. Moreover, analysing these data involves running statistical significance tests on thousands of features (e.g. genes, enhancer sites, transcription factor binding sites, SNPs) simultaneously. In traditional linkage studies (which resemble present-day genome-wide studies), strict P value cutoffs were used to keep the number of false positives low. In a genome-wide study, however, as opposed to the linkage case, many more than one or two of the tested features are expected to be truly significant, so a strict P value cutoff may lead to many missed findings.
Here, two newer quantities, the false discovery rate (FDR) and the q value, come to the rescue. So, what do these terms mean? Let us take a look at them:
P value- The probability that a truly null (random) feature would show a score as extreme as, or more extreme than, the score actually observed for this feature.
FDR- It is the proportion of false positives among all the features called significant in a study at a given P value cutoff.
FDR is expressed as the expectation of the ratio F/S, written E[F(t)/S(t)], where F = number of false positives, S = total number of features called significant, and t = P value threshold or cutoff.
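F and S can only be counted directly when the truth about each feature is known, for example in a simulation. The sketch below assumes such labelled data; the function name, the P values and the null labels are made up purely for illustration and do not come from any real study.

```python
def realized_false_discovery_proportion(p_values, is_null, t):
    """Return (F, S, F/S) at P value threshold t for labelled features."""
    # S(t): every feature called significant at threshold t
    called = [null for p, null in zip(p_values, is_null) if p <= t]
    S = len(called)
    # F(t): the truly null features among those calls (True counts as 1)
    F = sum(called)
    # Define the proportion as 0 when nothing is called significant
    return F, S, (F / S if S > 0 else 0.0)


# Hypothetical toy data: eight features, four of them truly null
p_values = [0.001, 0.004, 0.012, 0.030, 0.041, 0.200, 0.550, 0.900]
is_null = [False, False, False, True, False, True, True, True]

F, S, fdp = realized_false_discovery_proportion(p_values, is_null, t=0.05)
print(F, S, fdp)
```

Here five features pass the cutoff and one of them is truly null, so the realized proportion is 0.2. The FDR is the expectation of this proportion over repeated experiments; in a real study the labels are unknown, which is why the FDR has to be estimated.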
Q value- The expected proportion of false positives among all the features called significant when the P value of this feature (the one whose q value is being calculated) is used as the cutoff.
Thus, a q value of 5% for a feature indicates that, if that feature is called significant, about 5% of all the features called significant along with it (those with P values as small or smaller) are expected to be false positives.
The next major challenge is calculating the FDR and the q value: the q value is calculated from the FDR, and the expected values of the underlying quantities F and S are unknown (F itself cannot even be observed). These quantities are therefore estimated rather than calculated, and to do so they are broken down into several new terms. The simplified form of the FDR estimate is
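FDR = P*m*t/(number of features with P value <= t)
where P (not to be confused with the P value) = proportion of features that are truly null, i.e. m0/m, with m0 = number of truly null features and m = total number of features. The numerator P*m*t = m0*t is the expected number of false positives, because the P values of null features are spread uniformly between 0 and 1; the denominator is the observed number of features called significant.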
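The formula can be sketched in a few lines of Python. Two things in this sketch are assumptions rather than part of the text above. First, P is not known in practice; the hypothetical parameter pi0 stands in for it and defaults to 1.0, the most conservative choice (every feature assumed null). Second, the q value of a feature is taken as the smallest estimated FDR over all cutoffs t at or above its P value, which is a common convention. The function names and the example P values are invented for illustration.

```python
def estimate_fdr(p_values, t, pi0=1.0):
    """Estimated FDR at cutoff t: pi0 * m * t / #{P values <= t}."""
    m = len(p_values)
    n_significant = sum(1 for p in p_values if p <= t)
    if n_significant == 0:
        return 0.0  # nothing called significant, so no false discoveries
    # Cap at 1 because a proportion cannot exceed 100%
    return min(1.0, pi0 * m * t / n_significant)


def q_values(p_values, pi0=1.0):
    """q value per feature: minimum estimated FDR over cutoffs >= its P value."""
    m = len(p_values)
    # Feature indices ordered from smallest to largest P value
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, carrying the smallest FDR seen so far
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        # With the cutoff set to this P value, `rank` features are significant
        fdr_at_p = pi0 * m * p_values[i] / rank
        running_min = min(running_min, fdr_at_p)
        q[i] = running_min
    return q


# Hypothetical P values for four features, in arbitrary order
example_p = [0.9, 0.04, 0.01, 0.041]
example_fdr = estimate_fdr(example_p, t=0.05)
example_q = q_values(example_p)
print(example_fdr)
print(example_q)
```

Note that the features with P values 0.04 and 0.041 end up with the same q value: a cutoff of 0.041 admits one more feature at almost no extra cost, so it gives a lower estimated FDR than a cutoff of 0.04, and the minimum step passes that lower value down.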
So, using these simple but useful terms, we can filter out the true positives, the features that make a real contribution to the result of the genome-wide study.