Tuesday, April 12, 2011

The Bird I am studying doesn't have a sequenced genome. I am I Doomed?

Bioinformatics is a rapidly growing field of study. The  sequencing technologies are evolving at an unprecedented rate and has necessitated the bioinformaticians to devise innovative strategies to handle the data. Most commercial applications provide a pipeline to analyze ChIP-seq, ChIP-chip, RNA-seq, Exome-seq data but only for common species such as human, mouse and rat. For the groups that cannot afford commercial software, pipelines can be created using several tools that are freely available. Since these tools have been developed by different groups, they are developed keeping in mind one particular paradigm or are tailored towards one particular technology. If the tool has to be used by the user working with different technology or a different paradigm, some additional work has to be done. Since the bioinformatics group is coming to a consensus on standard formats for the data, this problem is expected to minimize over time.  However, executing a pipeline at present requires some software development skills, specially in creating adapters to handle the output from one tool and presenting it as input to the other. Different turning parameters for different tools produce different results and finding a set of tuning parameters that work the best might itself be a challenge. For the species with a reference genome, all that has to be done is aligning the reads to the reference genome, and finding the coverage. Most of these tools are therefore catered towards analyzing the data from the species whose genome has been sequenced. How about if i want to study a species such as a japanese quail whose genome has not been sequenced?

The latest interest in De novo assembly technology has grown in the recent years especially to handle the transcriptome sequence data with no reference genome.  Previously the assembly programs have been generated for genomic data with larger read length. But since the new sequencing platforms generate reads which are of the order of 50 bp, several programs have been generated that handle smaller read lengths mostly using de Brujin graphs. De novo assembly in simple words means to create a longer contiguous sequence out of short reads. The de novo assembly programs usually generate contigs with N50 of several kilobases. N50 is the average read length of the first 50% of the contigs that have been arranged by the read length in descending order. So far, we have managed to take a bunch of reads with shorter read length and generate larger contiguous sequences. In other words, we now have a template that can potentially serve as our reference to estimate the coverage at each region. 

The reads from the RNA-seq data have been assembled, which could potentially be used as a reference and you have individual reads from your experiment and all you have to do is use the RNA-seq pipeline right? The answer is no. Since for many species the sequenced genome has been readily available, these programs have been generated to work with genome as a reference not a transcriptome. What we have have is the reads assembled into contigs that we are going to use as a reference. We do not really want to use a complex algorithm that has been developed to handle alternative splicing. Instead, you can directly use the programs such as MAQ or BWA that have been designed for alignment of ChIP-seq data. Moreover, the programs developed for peak calling can be used on the aligned data to find the regions with greater or lower coverage with respect to normal sample coverage. you identified the regions that are important for your study but you still are faced with the problem of knowing what those sequences are. Given that the genome of the species you are studying has not been sequenced, the very bottleneck you started with, the only alternative at this time is to find homologous sequences in similar species. For a japanese quail for example, the closest ancestor may be chicken, for the chickpea plant for example, the closest ancestor might be legumes etc. This can be performed by performing a BLASTX of the sequences of interest with the closest species at hand to find homologous sequences, which provide insight into genes of interest, transcription factors regulating that region and  transcription factors coded by that region that regulate other genes etc. 

Wednesday, September 29, 2010

Empirical Cumulative Distribution Function to Find Differentially Expressed Genes from Microarrays

The microarray data usually has a large number of genes that are not differentially expressed and only a small proportion of differentially expressed genes (DEGs). There are several methods that are being developed that estimate the true proportion of DEGs. It is a common practice to discard non-DEGs as uninformative based on these methods. The distribution of the putative non-DEGs is complex and may be assumed to follow a mixture of distributions (The distribution is mostly uniform). Considering this distribution under null hypothesis, the significance of the genes may be estimated reasonably accurately by using the empirical cumulative distribution function (ECDF) approach.The p-values thus obtained are an indication of significance of the genes.
The first step in finding the differentially expressed genes is to rank the genes based on some ranking algorithm. Ranking algorithm can be simple as a fold change or t-statistics or complex as Data adaptive test statistics (DATS). This is just to obtain reasonable number of putative non-DEGs. Let 'M' be the pooled expression values of all the putative non-DEGs and let N be their number. Using the ECDF approach, one can test the expression values of the putative DEGs under null hypothesis. Let X(i,j) be the jth expression value of ith putative DEG, then using ECDF, p(i,j)=(#M<=X(i,j))/N. The p-value for each putative DEG may be obtained by pooling t he probability values for that gene using the Logit Method (see Figure 1).  FDR may be calculated as shown in Figure 2. For more details, the reader is encouraged to read this paper.
The code for ECDF approach may be downloaded from the following link. The program MainARTEstPi.m shows an example of application of ECDF approach on artificial microarray data. There are many sub-programs required by this main program, which are also included in this package. Further, the program mainp2q.m converts the obtained p-values to q-values similar to Significance analysis of microarrays. You can follow the earlier post on R-Test to get the tips to calculate FDR. If it is challenging, write to me and i will include that code as well.

As usual, please write to me at jahangeer.shaik@gmail.com for questions and suggestions.

Monday, September 27, 2010

What is R-test and how is it useful for a bioinformatician?

Bioinformatics society as a whole is striving to make the philosophy of bench to bedside a reality. Although many experts to analyze the high throughput genomic data exist, there are only a few of those experts who take up the algorithmic development as a challenge. Remaining of them have a broad knowledge of the available tools and different parameters of those tools to perform analyses. This article is of interest to the researchers who engage in algorithmic development.

 With each new technology  being developed, there is a surge of methods that support the analysis of the data that is derived from that technology. Similarly, for the microarray technology, many methods for classification and gene selection have been developed. Some of these methods such as Data-adaptive test statistics (DATS) are pretty robust in finding the ranking of genes based on differential expression. However, these methods fail to provide significance of ranking. The methods such as significance analysis of microarrays are very popular because they not only provide the ranking but also provide the p-values and a notion of FDR associated with selection of each gene. 

The availability of the p-value makes it possible to investigate the microarray data under the solid framework of statistical inference. DATS, which employs bootstrapping in selection of samples and learns the parameters based on consistency of ranking in different bootstrapped datasets is more robust. It is however not a popular choice because it does not provide p-values and FDR. In this article I will try to provide the community with a piece of code based on an article "significance of gene ranking for classification of microarray samples" that will address this concern. This article highlights a method called R-test that converts the gene rankings into p-values. Once the p-values are determined, the FDR can be further calculated.

 The significance of ranking may be formulated as a hypothesis testing problem, the null hypothesis being "a gene is not differentially expressed". Assuming the ranking algorithm does a pretty good job at finding ranks of the genes based on differential expression, it may be seen that the distribution of the genes with high ranks (non-differentially expressed (non-DEG)) is uniform. Although it is not practical to find an exact set of non-DEGs (you wouldn't be running this experiment if you already knew non-DEGs), it is possible to come up with a set of putative non-DEGs. This can be done by following these simple steps
1)      Take the original microarray dataset and create several bootstrapped microarray datasets from it
2)      Use a ranking algorithm such as DATS and rank the genes for each bootstrapped dataset
3)      Arrange the genes by mean rank value in ascending order and plot genes vs ranks
4)      Select the last few gene ranks that follow uniform distribution (null distribution)
5)      For each gene, find the number of ranks under null distribution that are less than or equal to mean rank value of that gene and divide it by total number of ranks under null hypothesis to obtain p-value.
 In case you still didn’t understand what I just said, you can read the paper by Zhang et. Al. If this is too much for you, just use this code. The parameters are defined in the code and it should be self explanatory. The code that implements the R-test is "convertScorestoPVals.m". I have also included a main function (main.m) that uses this code on an artificial dataset.  I have extensively used this method in several of my papers (paper1, paper2, paper3). 

Okay, you now found the p-values and how are you going to determine the FDR? Let us assume you started with X number of genes, X being in the order of several thousands (typically 10000 to 40000). Let us assume you want to find FDR at certain alpha (typically goes from 0.05 to 0.0001). Expected number of false positive genes selected at that alpha is X*alpha. Find observed number of genes by selecting genes with p-value<=alpha. FDR may thus be calculated by #( p-value<=alpha)/(X*alpha). select alpha that provides less FDR to obtain true set of DEGs.
If you need assistance or have suggestions, as always, write to me at jahangeer.shaik@gmail.com

Saturday, September 25, 2010

3D Star Co-ordinate Projection for the Visualization of Microarray Data

Mining of high-dimensional datasets such as microarray data to gain insight into the data has been the topic of research for decades. The first step in understanding the data is to visually examine the data to be able to make a better judgement about the kinds of algorithms that may best suit ones needs down the pipeline. Since humans can see and understand the data in 3 dimensions or less and since high dimensional datasets often have more than 3 dimensions, visualization algorithms provide a mechanism  to view the high dimensional data in 3 dimensions or less. Although visualization algorithms such as principal component analysis (PCA), linear discriminant analysis(LDA) and locally linear embedding (LLE) are often used by the research community, it can be easily seen that when using orthogonal dimensional anchors, one often runs out of space. Co-ordinate based methods such as parallel co-ordinates provide a solution by employing non-orthogonal parallel co-ordinates as dimensional anchors. But, the use of parallel co-ordinates is limited by the dimensionality and for extremely high-dimensional datasets, the plots become messy. To address this issue, Kandogan introduced 2D-star-coordinate projection (2DSCP) method. This method instead of using parallel-coordinates now uses dimensional anchors as spokes radiating into 2 dimensional space. One advantage of using this method is that one can represent the high-dimensional data as a dot in 2-dimensional space instead of a line as in the case of 2DSCP. This algorithm has flavor of both projection based algorithms such as PCA and co-ordinate based methods such as parallel co-ordinates because it provides plots just like PCA but uses non-orthogonal dimensional anchors such as parallel-coordinates. The 2DSCP however does not convey the depth information and for one to be able to obtain that, the data needs to be projected to 3 dimensions. In this work, we have extended the 2DSCP method to 3 dimensions, called 3DSCP. Also, since the dimensional anchors for the coordinate based methods are non-orthogonal, the projection results are not unique as in the case of PCA. We have therefore developed an automation algorithm to find the configuration of best projection. The algorithm has been successfully applied to high-dimensional datasets from various fields including microarray data. Please refer to the following paper for more details. To understand the concept, the reader is encouraged to read the following notes. Finally, the Matlab code is available from this link. The archive at this link consists of two use cases, one showing application of 3DSCP on artificial dataset and the other showing the application of 3DSCP on colon cancer microarray data. The archive also consists of 3 videos showing the use of automated 3DSCP algorithm on 3 different datasets. The readers are encouraged to write to jahangeer.shaik@gmail.com for suggestions and help.

Thursday, September 23, 2010

GRANITE- Gene RelAtional Network of InTeracting Elements


This is the tool that is generated by me as a part of center of biomedical informatics (CBMI) at Washington University School of Medicine. The advent of new profiling technologies has enabled the comprehensive study of complex multi-factorial diseases such as cancers. The end product of such experiments is often a set of genes that are of interest to the researchers. Gene Relational Networks (GRNs) offer a possible means of establishing functional relationship among these genes. Although many tools to generate GRNs exist, they depend only on certain types of databases to infer relations among genes, preventing the user from having comprehensive interaction information. To address this issue, we developed a novel application termed GRANITE. GRANITE employs a unique workflow based on prioritization of databases to construct GRNs. Unlike some of the current available tools that employ one or two types of databases, GRANITE is comprehensive in the types of databases that it uses. A web application employing a Cytoscape plugin allows researchers to construct GRNs by entering a set of genes of interest and by selecting one or more databases from which connections are determined. A caBIG®-compatible stateful analytical service is also built for GRANITE using Introduce toolkit. GRANITE is publicly available through a web-based user interface, which guides users through the construction and visualization of GRNs (http://bioinformatics.wustl.edu/webTools/HomeAction.do). A user manual illustrating different features is also available at http://cvswiki1.wustl.edu/nagarajanwiki/GRANITE?action=AttachFile&do=view&target=GRANITEuserManual1.pdf. GRANITE is also programmatically accessible through a caBIG® compliant analytical service. The GRANITE analytical service is available at (http://analysis1.wustl.edu/wsrf/services/cagrid/GRANITE). The client code for this analytical service is available at http://cvswiki1.wustl.edu/nagarajanwiki/GRANITE?action=AttachFile&do=get&target=GRANITEregNminClient.zip . The user manual that provides instructions to run the client code are available at http://cvswiki1.wustl.edu/nagarajanwiki/GRANITE?action=AttachFile&do=get&target=UserManual.docx .