Wednesday, September 29, 2010

Empirical Cumulative Distribution Function to Find Differentially Expressed Genes from Microarrays

The microarray data usually has a large number of genes that are not differentially expressed and only a small proportion of differentially expressed genes (DEGs). There are several methods that are being developed that estimate the true proportion of DEGs. It is a common practice to discard non-DEGs as uninformative based on these methods. The distribution of the putative non-DEGs is complex and may be assumed to follow a mixture of distributions (The distribution is mostly uniform). Considering this distribution under null hypothesis, the significance of the genes may be estimated reasonably accurately by using the empirical cumulative distribution function (ECDF) approach.The p-values thus obtained are an indication of significance of the genes.
The first step in finding the differentially expressed genes is to rank the genes based on some ranking algorithm. Ranking algorithm can be simple as a fold change or t-statistics or complex as Data adaptive test statistics (DATS). This is just to obtain reasonable number of putative non-DEGs. Let 'M' be the pooled expression values of all the putative non-DEGs and let N be their number. Using the ECDF approach, one can test the expression values of the putative DEGs under null hypothesis. Let X(i,j) be the jth expression value of ith putative DEG, then using ECDF, p(i,j)=(#M<=X(i,j))/N. The p-value for each putative DEG may be obtained by pooling t he probability values for that gene using the Logit Method (see Figure 1).  FDR may be calculated as shown in Figure 2. For more details, the reader is encouraged to read this paper.
The code for ECDF approach may be downloaded from the following link. The program MainARTEstPi.m shows an example of application of ECDF approach on artificial microarray data. There are many sub-programs required by this main program, which are also included in this package. Further, the program mainp2q.m converts the obtained p-values to q-values similar to Significance analysis of microarrays. You can follow the earlier post on R-Test to get the tips to calculate FDR. If it is challenging, write to me and i will include that code as well.

As usual, please write to me at for questions and suggestions.

Monday, September 27, 2010

What is R-test and how is it useful for a bioinformatician?

Bioinformatics society as a whole is striving to make the philosophy of bench to bedside a reality. Although many experts to analyze the high throughput genomic data exist, there are only a few of those experts who take up the algorithmic development as a challenge. Remaining of them have a broad knowledge of the available tools and different parameters of those tools to perform analyses. This article is of interest to the researchers who engage in algorithmic development.

 With each new technology  being developed, there is a surge of methods that support the analysis of the data that is derived from that technology. Similarly, for the microarray technology, many methods for classification and gene selection have been developed. Some of these methods such as Data-adaptive test statistics (DATS) are pretty robust in finding the ranking of genes based on differential expression. However, these methods fail to provide significance of ranking. The methods such as significance analysis of microarrays are very popular because they not only provide the ranking but also provide the p-values and a notion of FDR associated with selection of each gene. 

The availability of the p-value makes it possible to investigate the microarray data under the solid framework of statistical inference. DATS, which employs bootstrapping in selection of samples and learns the parameters based on consistency of ranking in different bootstrapped datasets is more robust. It is however not a popular choice because it does not provide p-values and FDR. In this article I will try to provide the community with a piece of code based on an article "significance of gene ranking for classification of microarray samples" that will address this concern. This article highlights a method called R-test that converts the gene rankings into p-values. Once the p-values are determined, the FDR can be further calculated.

 The significance of ranking may be formulated as a hypothesis testing problem, the null hypothesis being "a gene is not differentially expressed". Assuming the ranking algorithm does a pretty good job at finding ranks of the genes based on differential expression, it may be seen that the distribution of the genes with high ranks (non-differentially expressed (non-DEG)) is uniform. Although it is not practical to find an exact set of non-DEGs (you wouldn't be running this experiment if you already knew non-DEGs), it is possible to come up with a set of putative non-DEGs. This can be done by following these simple steps
1)      Take the original microarray dataset and create several bootstrapped microarray datasets from it
2)      Use a ranking algorithm such as DATS and rank the genes for each bootstrapped dataset
3)      Arrange the genes by mean rank value in ascending order and plot genes vs ranks
4)      Select the last few gene ranks that follow uniform distribution (null distribution)
5)      For each gene, find the number of ranks under null distribution that are less than or equal to mean rank value of that gene and divide it by total number of ranks under null hypothesis to obtain p-value.
 In case you still didn’t understand what I just said, you can read the paper by Zhang et. Al. If this is too much for you, just use this code. The parameters are defined in the code and it should be self explanatory. The code that implements the R-test is "convertScorestoPVals.m". I have also included a main function (main.m) that uses this code on an artificial dataset.  I have extensively used this method in several of my papers (paper1, paper2, paper3). 

Okay, you now found the p-values and how are you going to determine the FDR? Let us assume you started with X number of genes, X being in the order of several thousands (typically 10000 to 40000). Let us assume you want to find FDR at certain alpha (typically goes from 0.05 to 0.0001). Expected number of false positive genes selected at that alpha is X*alpha. Find observed number of genes by selecting genes with p-value<=alpha. FDR may thus be calculated by #( p-value<=alpha)/(X*alpha). select alpha that provides less FDR to obtain true set of DEGs.
If you need assistance or have suggestions, as always, write to me at

Saturday, September 25, 2010

3D Star Co-ordinate Projection for the Visualization of Microarray Data

Mining of high-dimensional datasets such as microarray data to gain insight into the data has been the topic of research for decades. The first step in understanding the data is to visually examine the data to be able to make a better judgement about the kinds of algorithms that may best suit ones needs down the pipeline. Since humans can see and understand the data in 3 dimensions or less and since high dimensional datasets often have more than 3 dimensions, visualization algorithms provide a mechanism  to view the high dimensional data in 3 dimensions or less. Although visualization algorithms such as principal component analysis (PCA), linear discriminant analysis(LDA) and locally linear embedding (LLE) are often used by the research community, it can be easily seen that when using orthogonal dimensional anchors, one often runs out of space. Co-ordinate based methods such as parallel co-ordinates provide a solution by employing non-orthogonal parallel co-ordinates as dimensional anchors. But, the use of parallel co-ordinates is limited by the dimensionality and for extremely high-dimensional datasets, the plots become messy. To address this issue, Kandogan introduced 2D-star-coordinate projection (2DSCP) method. This method instead of using parallel-coordinates now uses dimensional anchors as spokes radiating into 2 dimensional space. One advantage of using this method is that one can represent the high-dimensional data as a dot in 2-dimensional space instead of a line as in the case of 2DSCP. This algorithm has flavor of both projection based algorithms such as PCA and co-ordinate based methods such as parallel co-ordinates because it provides plots just like PCA but uses non-orthogonal dimensional anchors such as parallel-coordinates. The 2DSCP however does not convey the depth information and for one to be able to obtain that, the data needs to be projected to 3 dimensions. In this work, we have extended the 2DSCP method to 3 dimensions, called 3DSCP. Also, since the dimensional anchors for the coordinate based methods are non-orthogonal, the projection results are not unique as in the case of PCA. We have therefore developed an automation algorithm to find the configuration of best projection. The algorithm has been successfully applied to high-dimensional datasets from various fields including microarray data. Please refer to the following paper for more details. To understand the concept, the reader is encouraged to read the following notes. Finally, the Matlab code is available from this link. The archive at this link consists of two use cases, one showing application of 3DSCP on artificial dataset and the other showing the application of 3DSCP on colon cancer microarray data. The archive also consists of 3 videos showing the use of automated 3DSCP algorithm on 3 different datasets. The readers are encouraged to write to for suggestions and help.

Thursday, September 23, 2010

GRANITE- Gene RelAtional Network of InTeracting Elements

This is the tool that is generated by me as a part of center of biomedical informatics (CBMI) at Washington University School of Medicine. The advent of new profiling technologies has enabled the comprehensive study of complex multi-factorial diseases such as cancers. The end product of such experiments is often a set of genes that are of interest to the researchers. Gene Relational Networks (GRNs) offer a possible means of establishing functional relationship among these genes. Although many tools to generate GRNs exist, they depend only on certain types of databases to infer relations among genes, preventing the user from having comprehensive interaction information. To address this issue, we developed a novel application termed GRANITE. GRANITE employs a unique workflow based on prioritization of databases to construct GRNs. Unlike some of the current available tools that employ one or two types of databases, GRANITE is comprehensive in the types of databases that it uses. A web application employing a Cytoscape plugin allows researchers to construct GRNs by entering a set of genes of interest and by selecting one or more databases from which connections are determined. A caBIG®-compatible stateful analytical service is also built for GRANITE using Introduce toolkit. GRANITE is publicly available through a web-based user interface, which guides users through the construction and visualization of GRNs ( A user manual illustrating different features is also available at GRANITE is also programmatically accessible through a caBIG® compliant analytical service. The GRANITE analytical service is available at ( The client code for this analytical service is available at . The user manual that provides instructions to run the client code are available at .