Friday, November 29, 2013

Analysing the transcriptome data

I recently had an opportunity to analyse transcriptome data. I used two different assembly programs, Ray and the Trinity assembler; we prefer Trinity for transcriptomic data. Trinity does the job in three processes:
Inchworm assembles the unique transcript sequences and reports only the unique portions of alternatively spliced transcripts.
Chrysalis groups the Inchworm contigs into clusters and constructs a de Bruijn graph for each cluster.
Butterfly reports the full-length transcripts for alternatively spliced isoforms and separates transcripts that correspond to paralogous genes.
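To make the de Bruijn graph idea a bit more concrete, here is a minimal Python sketch. It is only an illustration of the general technique, not Trinity's actual implementation; the function name build_de_bruijn, the toy sequences and k=4 are my own assumptions.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Build a toy de Bruijn graph: each (k-1)-mer points to the (k-1)-mers
    that follow it in at least one k-mer of the input sequences."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # An edge joins the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


# Two hypothetical overlapping contigs; shared k-mers collapse into shared edges.
toy_contigs = ["ATGGCGT", "GGCGTAC"]
toy_graph = build_de_bruijn(toy_contigs, k=4)
for node, successors in sorted(toy_graph.items()):
    print(node, "->", ",".join(successors))
```

Because the two toy contigs overlap, they end up as one connected path in the graph, which is roughly why clustering contigs before building the graphs makes sense.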
If your sequences are in FASTQ format, Trinity converts them to FASTA using fastool, which takes a long time. If you have a large amount of data, such as millions of reads, you can reduce the run time with the --min_kmer_cov option. For paired-end reads it uses the bowtie short-read aligner. The command line is:
trinityrnaseq_r2013-02-25/ --seqType fq --JM 30G --min_kmer_cov 1  --left left_file1.fastq --right right_file2.fastq --CPU 10 --output trinity_out_dir > assembly_log
While Trinity is running you can check its progress with top; the jobs are shown as inchworm, chrysalis and butterfly. With FASTQ input you will also see fastool converting the reads to FASTA before they are processed. At the end Trinity uses the cat command to join the contigs, and the sort command also shows up in top.
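For anyone wondering what the FASTQ to FASTA step actually does, here is a small Python sketch of the idea. It is not fastool itself; the function name fastq_to_fasta is hypothetical, and I am assuming plain 4-line FASTQ records (no wrapped sequence lines).

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert 4-line FASTQ records to FASTA and return the record count."""
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header.strip():
            break  # end of file (or a trailing blank line)
        sequence = fastq_handle.readline()
        fastq_handle.readline()  # '+' separator line, not needed in FASTA
        fastq_handle.readline()  # quality string, dropped in FASTA
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n")
        fasta_handle.write(sequence.rstrip("\n") + "\n")
        count += 1
    return count


# Made-up reads, kept in memory so the example runs without any files.
demo_fastq = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\nIIII\n")
demo_fasta = io.StringIO()
demo_count = fastq_to_fasta(demo_fastq, demo_fasta)
print(demo_fasta.getvalue(), end="")
```

Every read has to be touched once, so with millions of reads it is easy to see why this step alone takes a while.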
    Once the assembly is done, you can run the stats script from the script directory to get the N50, the total number of transcripts and the total number of Trinity components.
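If you want to see how an N50 value is arrived at, this is a short Python sketch of the usual definition: sort the transcripts from longest to shortest and take the length at which the running total reaches half of all assembled bases. The function name n50 and the example lengths are made up for illustration; this is not the Trinity script.

```python
def n50(lengths):
    """Return the N50 of a list of contig or transcript lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first length where at least half the bases are covered.
        if running * 2 >= total:
            return length


example_lengths = [100, 200, 300, 400, 500]
print(n50(example_lengths))  # prints 400
```

Here the total is 1500 bases; the two longest transcripts (500 + 400) already cover more than half, so the N50 is 400.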
    I have tried other assemblers such as Ray, but it did not give me the expected output. Ray works well for genome assembly; for transcriptomic data, Trinity does better.
Analysing differential expression using edgeR through RSEM:

I have transcriptome samples from 3 different conditions: one at an early passage, one at an intermediate passage and one at the last passage.

The basic steps I follow are:
1. Make an assembly of all conditions together, as shown below:
trinityrnaseq_r2013-02-25/ --seqType fq --JM 30G --min_kmer_cov 1  --left condA1.fastq condB1.fastq condC1.fastq --right condA2.fastq condB2.fastq condC2.fastq --CPU 10 --output trinity_out > log
2. Map the reads of each condition back to the obtained transcripts. Basically, you map condA, then condB, then condC separately against the transcripts, using bowtie or whichever mapping tool is convenient.
3. Now estimate the gene-level abundance using RSEM. Suppose I want to check the differentially expressed genes between condA and condB, condA and condC, and condB and condC: the matrix for each comparison can be created from the RSEM results. Alternatively, if you want to check all three conditions A, B and C at once, make one matrix of all 3 conditions together and run the differential expression on that.
4. Then run the differential expression analysis with edgeR, a Bioconductor package, providing a file that describes the samples:
Analysis/DifferentialExpression/ --matrix counts.matrix --method edgeR --samples_file samples_described.txt
5. Then extract the differentially expressed genes with the following command, which reports log-transformed expression values. The TMM_normalized.FPKM file is created during the RSEM matrix calculation; provide that normalized matrix to the command:
cd edgeR/
/Analysis/DifferentialExpression/ --matrix ../matrix.TMM_normalized.FPKM     -P 1e-3 -C 2
  The output is in tab-delimited format with the ids, logFC values, FPKM, condA expression value, condB expression value and so on.
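To show what the cutoffs in step 5 amount to, here is a rough Python sketch that filters a tab-delimited result table. I am assuming that -P 1e-3 is a significance cutoff and -C 2 a minimum absolute log2 fold change; the column names (id, logFC, FDR), the function name filter_de_rows and the rows themselves are hypothetical, so adjust them to whatever your own output file contains.

```python
import csv
import io


def filter_de_rows(handle, max_fdr=1e-3, min_abs_logfc=2.0):
    """Keep rows that pass both the significance and fold-change cutoffs."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row for row in reader
        if float(row["FDR"]) <= max_fdr
        and abs(float(row["logFC"])) >= min_abs_logfc
    ]


# Made-up example table; the real column layout may differ.
demo_table = io.StringIO(
    "id\tlogFC\tFDR\n"
    "comp1_c0\t3.1\t0.0001\n"
    "comp2_c0\t-2.5\t0.0005\n"
    "comp3_c0\t1.2\t0.00001\n"
    "comp4_c0\t4.0\t0.05\n"
)
demo_kept = filter_de_rows(demo_table)
for row in demo_kept:
    print(row["id"], row["logFC"], row["FDR"])
```

Note that the fold-change test uses the absolute value, so both up-regulated and down-regulated genes are kept as long as they are significant.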