Monday, November 30, 2015

5C bed file data format

5C and 3C are newer sequencing-based technologies for obtaining chromatin interaction data. If you are looking for such data and happen to download it from the UCSC Genome Browser, it may be hard to find a description of the fields in the file. We asked the authors, and here is the explanation:

The data can be downloaded from, for example: https://www.encodeproject.org/experiments/ENCSR000CYD/

The BED file format description can be found at: https://genome.ucsc.edu/FAQ/FAQformat.html#format1

Here is some sample data for the GM12878 cell line:



chr22   31998728        33247041        5C_301_ENm004_FOR_292.5C_301_ENm004_REV_32      1000    .       31998728        33247041        0       2       12744,4098,     0,1244215,
chr5    131346229       132145236       5C_299_ENm002_FOR_241.5C_299_ENm002_REV_33      1000    .       131346229       132145236       0       2       2609,2105,      0,796902,

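col1: Chromosome name
col2: Chromosome start
col3: Chromosome end
col4: Name of the interacting sites (primer names)
col5: Score (0 to 1000 in the BED specification; 1000 in these rows)
col6: Strand ("." means no strand)
col7: Chromosome start (thickStart in BED terms; same value as col2 here)
col8: Chromosome end (thickEnd in BED terms; same value as col3 here)
col9: Item RGB colour (0 here)
col10: Number of blocks (2 here, one per interacting site)
col11: Block sizes as a comma-separated list
col12: Block offsets as a comma-separated list, relative to the chromosome start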
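
The column layout above can also be read programmatically. Here is a minimal Python sketch, assuming the file is a standard 12-column BED file. The field names are the ones used in the UCSC BED description, not names given by the data authors, and parse_bed12_line is a hypothetical helper name chosen for this example.

```python
# Field names as used in the UCSC BED format description (12-column BED).
BED12_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount",
    "blockSizes", "blockStarts",
]

INT_FIELDS = ("chromStart", "chromEnd", "score",
              "thickStart", "thickEnd", "blockCount")
LIST_FIELDS = ("blockSizes", "blockStarts")


def parse_bed12_line(line):
    """Split one BED12 line into a dict keyed by field name (hypothetical helper)."""
    values = line.split()  # splits on tabs or runs of spaces
    if len(values) != len(BED12_FIELDS):
        raise ValueError(f"expected 12 columns, got {len(values)}")
    record = dict(zip(BED12_FIELDS, values))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    # The comma-separated lists end with a trailing comma, so skip empty items.
    for key in LIST_FIELDS:
        record[key] = [int(x) for x in record[key].split(",") if x]
    return record


# First sample row from this post, written with explicit tab separators.
sample = ("chr22\t31998728\t33247041\t"
          "5C_301_ENm004_FOR_292.5C_301_ENm004_REV_32\t1000\t.\t"
          "31998728\t33247041\t0\t2\t12744,4098,\t0,1244215,")

record = parse_bed12_line(sample)
print(record["name"])         # 5C_301_ENm004_FOR_292.5C_301_ENm004_REV_32
print(record["blockSizes"])   # [12744, 4098]
print(record["blockStarts"])  # [0, 1244215]
```

Splitting on any whitespace keeps the sketch tolerant of rows that were copied from a web page and lost their tabs; for a real file, splitting on "\t" only would be stricter.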

Now I will explain what col11 and col12 mean, using the chr22 row.

Each offset in col12 is counted from the chromosome start (col2), and the first offset is 0.

So, interacting site 1 begins at 31998728 + 0 = 31998728 and its block length is 12744, which means it ends at 31998728 + 12744 = 32011472.

The beginning position of interacting site 2 is 31998728 + 1244215 = 33242943.
The size of interacting block 2 is 4098, so the end of interacting site 2 is 33242943 + 4098 = 33247041, which is the chromosome end in col3.
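
The same arithmetic can be written as a short Python sketch. It assumes the offsets in col12 are relative to the chromosome start in col2, as described above. The function name interacting_sites is hypothetical, and the numbers are taken from the two sample rows in this post.

```python
def interacting_sites(chrom_start, block_sizes, block_offsets):
    """Return (start, end) genomic coordinates for each interacting block.

    start = chromosome start + block offset
    end   = start + block size
    """
    sites = []
    for size, offset in zip(block_sizes, block_offsets):
        start = chrom_start + offset
        sites.append((start, start + size))
    return sites


# chr22 row: col2 = 31998728, col11 = 12744,4098, col12 = 0,1244215
chr22_sites = interacting_sites(31998728, [12744, 4098], [0, 1244215])
print(chr22_sites)  # [(31998728, 32011472), (33242943, 33247041)]

# chr5 row: col2 = 131346229, col11 = 2609,2105, col12 = 0,796902
chr5_sites = interacting_sites(131346229, [2609, 2105], [0, 796902])
print(chr5_sites)   # [(131346229, 131348838), (132143131, 132145236)]
```

In both rows the end of the second site equals the chromosome end in col3, which is a handy sanity check when reading these files.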

Here is a diagrammatic representation:


