Let us start the discussion with two very simple examples.

Example 1:-

Seq1- GAGGTAAAC

Seq2- TCCGTAAGT

Seq3- CAGGTTGGA

Seq4- ACAGTCAGT

Seq5- TAGGTCATT

Seq6- TAGGTACTG

Seq7- ATGGTAACT

Seq8- CAGGTATAC

Seq9- TGTGTGAGT

Seq10- AAGGTAAGT

Let us assume that each of the above ten 9-mer DNA sequences is an experimentally derived binding site of a transcription factor (TF), say X. Which of the following questions is phrased correctly?

a) What is the probability that G occurs at position 4 of the binding site of X?

b) What is the likelihood of G at position 4?

Example 2:-

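| Base | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|------|---|---|---|----|----|---|---|---|---|
| A | 3 | 6 | 1 | 0 | 0 | 6 | 7 | 2 | 1 |
| C | 2 | 2 | 1 | 0 | 0 | 2 | 1 | 1 | 2 |
| G | 1 | 1 | 7 | 10 | 0 | 1 | 1 | 5 | 1 |
| T | 4 | 1 | 1 | 0 | 10 | 1 | 1 | 2 | 6 |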

The above matrix is the position-specific frequency matrix derived from the sequences of Example 1. Now, which of the following questions is phrased correctly?

a) What is the probability that the sequence CAGGTTGGA is a binding site of the TF X?

b) What is the likelihood that the sequence CAGGTTGGA is a binding site of X?
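To make the numbers concrete, here is a minimal Python sketch that rebuilds the count matrix from the ten sequences of Example 1 and scores CAGGTTGGA against it. It rests on two assumptions that are not part of the post: the positions are treated as independent, and raw frequencies are used with no pseudocounts. The function names `count_matrix` and `sequence_probability` are made up for illustration.

```python
from collections import Counter

# The ten binding sites from Example 1 (each is 9 bases long).
sites = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]

def count_matrix(seqs):
    """Return {base: [count at position 1, count at position 2, ...]}."""
    length = len(seqs[0])
    # One Counter per column; a Counter returns 0 for bases it never saw.
    columns = [Counter(s[i] for s in seqs) for i in range(length)]
    return {base: [col[base] for col in columns] for base in "ACGT"}

def sequence_probability(seq, counts, n_sites):
    """Product of column frequencies (assumes independent positions)."""
    p = 1.0
    for i, base in enumerate(seq):
        p *= counts[base][i] / n_sites
    return p

counts = count_matrix(sites)
for base in "ACGT":
    print(base, counts[base])

# Score of CAGGTTGGA under the matrix, roughly 4.2e-05.
print(sequence_probability("CAGGTTGGA", counts, len(sites)))
```

Because no pseudocounts are added, any sequence with a base that was never observed at some position (for example, anything other than G at position 4) gets a score of exactly zero.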

See, in both cases it seems like both questions are correct, since the two terms sound very similar until we carefully review their definitions.

The term likelihood is used when we describe a function of a parameter given a **fixed outcome**. In both examples, the outcome, that is, the set of experimentally derived sequences, is given. So the outcome is fixed, and we are asking a question based on these observations. Hence, in both cases the term "likelihood" should be used.

What? Not convinced? OK, let me give you an example of a situation where the term "probability" should be used.

a) What is the probability that the 9-mer sequence CAGGTTGGA occurs in our genome?

b) What is the likelihood that the 9-mer sequence CAGGTTGGA occurs in our genome?

See, this time no prior data, that is, no observation, is given (our genome size is fixed). So "likelihood" is not the right word here. The 9-mer has a probability, not a likelihood, of being present in our genome.
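As a rough illustration of such a probability, the sketch below assumes the simplest possible background model: every base is independent and A, C, G and T are equally likely (0.25 each). Real genomes do not behave like this, and the genome length of 3,000,000,000 bases is only a placeholder value, so treat the numbers as an illustration of the idea rather than a real estimate. The function names are hypothetical.

```python
def kmer_probability(kmer, base_prob=0.25):
    """Probability of seeing exactly this k-mer at one given position,
    assuming independent, equally likely bases (an assumption)."""
    return base_prob ** len(kmer)

def expected_occurrences(kmer, genome_length, base_prob=0.25):
    """Expected number of occurrences on one strand under the same model."""
    # Number of possible start positions for the k-mer.
    windows = genome_length - len(kmer) + 1
    return windows * kmer_probability(kmer, base_prob)

p = kmer_probability("CAGGTTGGA")
print(p)  # 0.25 ** 9

# Placeholder genome length, chosen only for illustration.
print(expected_occurrences("CAGGTTGGA", genome_length=3_000_000_000))
```

Note that nothing observed goes into this calculation: the model is fixed in advance and we ask how probable an outcome is, which is why "probability" is the right word here.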

Clear? A bit? I hope so. If not, feel free to post your query in the comment box below. Time to go; I will return soon with more such topics. Till then, bye! :)

@Dr. Sucheta Tripathy:- Is the pen ready ma'am?? :D

Pen is not for one post... It will be decided at the end of the month - the quality writer gets it.

That's great ma'am. Let me remind you, today is 30th April, end of the month.

So, I guess this means that likelihood is used when there is some experimental data, and probability when there is no foolproof data?

I also have similar queries to Deeksha's: How do you derive the probability from the given information? You have scored a PSSM, but it is not clear whether this PSSM is derived from sequences that actually have a binding site or whether these are the query dataset.

Let me try with another example.

Suppose you are tossing a fair coin. Then you can ask the question "What is the probability of getting 5 heads in 10 trials?". In this case we are assuming that the coin is fair, and we have no prior data from observation.

But suppose that in an observation of 100 trials you got 80 heads. Then the question naturally arises whether the coin is fair or not. Since this question is based on the observation, you cannot ask what the probability is that the coin is fair. The question should be framed like this:

"What is the likelihood that the coin is fair?"

Hope it is clear now, a bit, at least??? :)
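The coin example can also be sketched in a few lines of Python. This is only an illustration under the usual binomial model (independent tosses with the same heads probability each time); the function name `binomial_pmf` and the list of candidate values for p are chosen here for the example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k heads in n tosses | heads probability p), binomial model."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability: the parameter is fixed (fair coin, p = 0.5), the outcome varies.
print(binomial_pmf(5, 10, 0.5))  # 252 / 1024 = 0.24609375

# Likelihood: the outcome is fixed (80 heads in 100 tosses), the parameter varies.
candidates = (0.5, 0.6, 0.7, 0.8, 0.9)
for p in candidates:
    print(p, binomial_pmf(80, 100, p))
```

The same formula is used both times. What changes is which quantity is held fixed: in the first call the coin is fixed and we ask about an outcome, while in the loop the outcome is fixed and we compare different coins. Among the candidate values tried here, p = 0.8 gives the largest likelihood, and p = 0.5 (the fair coin) a far smaller one.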

@Sucheta Tripathy PI: The sequences used to build the PWM or PSSM are the experimentally derived binding sites for a transcription factor.
