Tuesday, February 12, 2013
Coefficient of Determination: what does it mean?
In any regression model (linear or non-linear), the coefficient of determination (R²) is used as a measure of how well the model fits the data. The standard definition is the proportion of the variance in the dependent variable that is explained by the independent variable. That sounds neat and is easy enough to follow mathematically, but the definition alone does not give a clear picture of what the metric is good for.
In this post I will try to describe R² with pictures, to give some insight into the rather abstract term "coefficient of determination".
In the scatter plot above, the black dots are data points and the straight line passing through them is the regression line. Suppose there are n points in total. (X1,Y1) and (X2,Y2) are two arbitrary points, and (X̅,Y̅) (in red) is the mean of the data points.
It is evident from the figure that the data points are positively correlated, so the regression line has a positive slope. Let's assume the correlation coefficient R is 0.67. The coefficient of determination, R², is then 0.67 × 0.67 = 0.4489. That means 44.89% of the variance in Y is explained by the variance in X. How?
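The variation of the Y data around its mean Y̅ is the sum of the squared deviations of every Y from Y̅. We will call it the TOTAL VARIANCE (strictly speaking it is the total sum of squares; dividing it by n − 1 would give the sample variance).
So, the TOTAL VARIANCE V = Σ(Y − Y̅)².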
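To make V concrete, here is a minimal sketch in Python that adds up the squared deviations of every Y from the mean. The function name `total_variation` and the five Y values are made up for illustration; they do not come from the figure.

```python
def total_variation(ys):
    """Sum of squared deviations of each Y from the mean of Y (V in the text)."""
    y_mean = sum(ys) / len(ys)
    return sum((y - y_mean) ** 2 for y in ys)


# Made-up toy data, for illustration only (mean of Y is 4)
ys = [2, 4, 5, 4, 5]
V = total_variation(ys)
print(V)  # 6.0
```

The deviations from the mean are -2, 0, 1, 0 and 1, so the squares add up to 4 + 0 + 1 + 0 + 1 = 6.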
Let Ŷ be the value on the regression line corresponding to each X, so every point on the regression line is (X, Ŷ). If the regression line described the data completely, every point in the scatter plot would lie on the line. In reality, most of the data points lie off the regression line. So here comes another error term; let's call the individual error e.
In the above picture, the green lines show the individual error (e) of each data point with respect to the regression line. Let's call the sum of the squares of all the e values the "Not Described Error" of the regression model, or NDE for short.
So, NDE = Σ e² = Σ(Y − Ŷ)² = Σ(Y − (m*X + C))²
where m = slope of the regression line and C = intercept of the regression line.
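The NDE can be sketched in a few lines of Python. This post does not say how the regression line is fitted, so the sketch assumes ordinary least squares, which is the usual choice. The function names and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Slope m and intercept C of the least-squares line (assumed fitting method)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    c = y_mean - m * x_mean
    return m, c


def not_described_error(xs, ys, m, c):
    """Sum of the squared errors e = Y - (m*X + C) over all points (NDE)."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Made-up toy data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, c = fit_line(xs, ys)
nde = not_described_error(xs, ys, m, c)
print(m, c, nde)  # roughly 0.6, 2.2 and 2.4, up to floating-point rounding
```

Each green line in the picture corresponds to one term `y - (m * x + c)` in the sum.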
So, the error described by the model = (total error − NDE) = V − NDE
So, the proportion of the total error described by the model = (V − NDE)/V = 1 − (NDE/V).
This term is called the coefficient of determination, R².
So, R² = 1 − (NDE/V) = 1 − Σ(Y − (m*X + C))² / Σ(Y − Y̅)²
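Putting the two sums together, R² can be computed for any candidate line. The sketch below is only an illustration: `r_squared` is a hypothetical helper, the data are made up, and m = 0.6, C = 2.2 is the least-squares line for these five toy points.

```python
def r_squared(xs, ys, m, c):
    """R² = 1 - NDE/V for the line Y = m*X + C."""
    y_mean = sum(ys) / len(ys)
    v = sum((y - y_mean) ** 2 for y in ys)                       # total variation V
    nde = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))    # not described error
    return 1 - nde / v


# Made-up toy data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(round(r_squared(xs, ys, 0.6, 2.2), 4))  # 0.6
print(r_squared(xs, ys, 0.0, 4.0))            # 0.0
```

The second call uses a flat line at the mean of Y (slope 0, intercept 4). For that line NDE equals V, so R² is 0: a line that ignores X describes none of the variation.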
It can be shown that, for a straight line fitted by least squares, squaring R leads to exactly the above equation, but we will not go into that in this discussion.
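Although the proof is skipped here, the claim is easy to check numerically. The sketch below computes the Pearson correlation R and, separately, 1 − NDE/V for a least-squares line, and compares the two. The function names and the toy data are made up, and a numerical check on one dataset is of course not a proof.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient R between xs and ys."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_squared_of_fit(xs, ys):
    """1 - NDE/V for the least-squares line through the data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    c = y_mean - m * x_mean
    v = sum((y - y_mean) ** 2 for y in ys)
    nde = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    return 1 - nde / v


# Made-up toy data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
r2 = r_squared_of_fit(xs, ys)
print(round(r ** 2, 6), round(r2, 6))  # 0.6 0.6
```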
Now it is clear why R² is described as "the proportion of the variation in Y explained by X".
Now the question arises: if only 44.89% of the variance is described by the regression model, what other factors describe the remaining 55.11%?
We will explain this with a biological example, since this blog is written for students with a biological science background.
Let's assume that the Xs in the example are transcription factor binding signals and the Ys are the expression levels of several genes. The regression line can then approximate the expression level of a gene from its corresponding transcription factor binding data. R² = 0.4489 here means that 44.89% of the variance of expression around its mean value is accounted for by the variance of the binding data. Then where is the remaining 55.11% of the variance? To answer this question, we should go back to basic molecular biology. The expression of a gene depends not only on the binding of transcription factors but also on several other factors, such as 1) chromatin structure, 2) CpG content of the promoter, 3) cell type, etc.
So these factors, together with measurement noise, account for the remaining 55.11% of the variance in expression level.
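This idea can be illustrated with a small simulation. Everything below is synthetic: the variables `binding`, `chromatin` and `expression` are random numbers with hypothetical names, not real binding or expression data, and the model "expression = binding + chromatin" is an assumption made only to show the effect. The sketch fits expression on binding alone, then checks how much of the leftover error the hidden factor describes.

```python
import random


def r_squared(xs, ys):
    """R² of the least-squares line of ys on xs, plus the individual errors e."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = y_mean - m * x_mean
    errors = [y - (m * x + c) for x, y in zip(xs, ys)]
    v = sum((y - y_mean) ** 2 for y in ys)
    nde = sum(e ** 2 for e in errors)
    return 1 - nde / v, errors


random.seed(0)  # fixed seed so the run is repeatable
n = 500
binding = [random.gauss(0, 1) for _ in range(n)]    # synthetic X
chromatin = [random.gauss(0, 1) for _ in range(n)]  # synthetic hidden factor
expression = [b + z for b, z in zip(binding, chromatin)]  # assumed toy model

# Step 1: how much of the expression variance does binding alone describe?
r2_binding, leftover = r_squared(binding, expression)
# Step 2: how much of the leftover error does the hidden factor describe?
r2_hidden, _ = r_squared(chromatin, leftover)
print(round(r2_binding, 2), round(r2_hidden, 2))
```

In this toy setup binding describes only about half of the variance, and nearly all of what is left over is described by the second factor. With real data, several factors and noise would share the remainder.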
So, that's it. Hope you now understand the term R² a bit better.
We'll be back with more such topics, soon. Till then, Bbyeeee :)