Monday, September 27, 2010

What is R-test and how is it useful for a bioinformatician?

Bioinformatics society as a whole is striving to make the philosophy of bench to bedside a reality. Although many experts to analyze the high throughput genomic data exist, there are only a few of those experts who take up the algorithmic development as a challenge. Remaining of them have a broad knowledge of the available tools and different parameters of those tools to perform analyses. This article is of interest to the researchers who engage in algorithmic development.

 With each new technology  being developed, there is a surge of methods that support the analysis of the data that is derived from that technology. Similarly, for the microarray technology, many methods for classification and gene selection have been developed. Some of these methods such as Data-adaptive test statistics (DATS) are pretty robust in finding the ranking of genes based on differential expression. However, these methods fail to provide significance of ranking. The methods such as significance analysis of microarrays are very popular because they not only provide the ranking but also provide the p-values and a notion of FDR associated with selection of each gene. 

The availability of the p-value makes it possible to investigate the microarray data under the solid framework of statistical inference. DATS, which employs bootstrapping in selection of samples and learns the parameters based on consistency of ranking in different bootstrapped datasets is more robust. It is however not a popular choice because it does not provide p-values and FDR. In this article I will try to provide the community with a piece of code based on an article "significance of gene ranking for classification of microarray samples" that will address this concern. This article highlights a method called R-test that converts the gene rankings into p-values. Once the p-values are determined, the FDR can be further calculated.

 The significance of ranking may be formulated as a hypothesis testing problem, the null hypothesis being "a gene is not differentially expressed". Assuming the ranking algorithm does a pretty good job at finding ranks of the genes based on differential expression, it may be seen that the distribution of the genes with high ranks (non-differentially expressed (non-DEG)) is uniform. Although it is not practical to find an exact set of non-DEGs (you wouldn't be running this experiment if you already knew non-DEGs), it is possible to come up with a set of putative non-DEGs. This can be done by following these simple steps
1)      Take the original microarray dataset and create several bootstrapped microarray datasets from it
2)      Use a ranking algorithm such as DATS and rank the genes for each bootstrapped dataset
3)      Arrange the genes by mean rank value in ascending order and plot genes vs ranks
4)      Select the last few gene ranks that follow uniform distribution (null distribution)
5)      For each gene, find the number of ranks under null distribution that are less than or equal to mean rank value of that gene and divide it by total number of ranks under null hypothesis to obtain p-value.
 In case you still didn’t understand what I just said, you can read the paper by Zhang et. Al. If this is too much for you, just use this code. The parameters are defined in the code and it should be self explanatory. The code that implements the R-test is "convertScorestoPVals.m". I have also included a main function (main.m) that uses this code on an artificial dataset.  I have extensively used this method in several of my papers (paper1, paper2, paper3). 

Okay, you now found the p-values and how are you going to determine the FDR? Let us assume you started with X number of genes, X being in the order of several thousands (typically 10000 to 40000). Let us assume you want to find FDR at certain alpha (typically goes from 0.05 to 0.0001). Expected number of false positive genes selected at that alpha is X*alpha. Find observed number of genes by selecting genes with p-value<=alpha. FDR may thus be calculated by #( p-value<=alpha)/(X*alpha). select alpha that provides less FDR to obtain true set of DEGs.
If you need assistance or have suggestions, as always, write to me at

1 comment:

  1. hye...i'm currently doing my thesis on clustering algorithm for microarray dat...but i'm stuck in creating artificial dataset using R language.. may i get the coding?