Wednesday, September 29, 2010

Empirical Cumulative Distribution Function to Find Differentially Expressed Genes from Microarrays

The microarray data usually has a large number of genes that are not differentially expressed and only a small proportion of differentially expressed genes (DEGs). There are several methods that are being developed that estimate the true proportion of DEGs. It is a common practice to discard non-DEGs as uninformative based on these methods. The distribution of the putative non-DEGs is complex and may be assumed to follow a mixture of distributions (The distribution is mostly uniform). Considering this distribution under null hypothesis, the significance of the genes may be estimated reasonably accurately by using the empirical cumulative distribution function (ECDF) approach.The p-values thus obtained are an indication of significance of the genes.
The first step in finding the differentially expressed genes is to rank the genes based on some ranking algorithm. Ranking algorithm can be simple as a fold change or t-statistics or complex as Data adaptive test statistics (DATS). This is just to obtain reasonable number of putative non-DEGs. Let 'M' be the pooled expression values of all the putative non-DEGs and let N be their number. Using the ECDF approach, one can test the expression values of the putative DEGs under null hypothesis. Let X(i,j) be the jth expression value of ith putative DEG, then using ECDF, p(i,j)=(#M<=X(i,j))/N. The p-value for each putative DEG may be obtained by pooling t he probability values for that gene using the Logit Method (see Figure 1).  FDR may be calculated as shown in Figure 2. For more details, the reader is encouraged to read this paper.
The code for ECDF approach may be downloaded from the following link. The program MainARTEstPi.m shows an example of application of ECDF approach on artificial microarray data. There are many sub-programs required by this main program, which are also included in this package. Further, the program mainp2q.m converts the obtained p-values to q-values similar to Significance analysis of microarrays. You can follow the earlier post on R-Test to get the tips to calculate FDR. If it is challenging, write to me and i will include that code as well.

As usual, please write to me at for questions and suggestions.

No comments:

Post a Comment