Tuesday, April 12, 2011

The Bird I am studying doesn't have a sequenced genome. Am I Doomed?

Bioinformatics is a rapidly growing field. Sequencing technologies are evolving at an unprecedented rate, which has forced bioinformaticians to devise innovative strategies to handle the data. Most commercial applications provide pipelines to analyze ChIP-seq, ChIP-chip, RNA-seq and exome-seq data, but only for common species such as human, mouse and rat. Groups that cannot afford commercial software can build pipelines from several freely available tools. Because these tools have been developed by different groups, each is designed with one particular paradigm in mind or tailored to one particular technology. Using a tool with a different technology or a different paradigm takes some additional work. As the bioinformatics community converges on standard data formats, this problem is expected to diminish over time. At present, however, running a pipeline requires some software development skills, especially for writing adapters that take the output of one tool and present it as input to the next. Different tuning parameters produce different results, and finding the set of tuning parameters that works best can itself be a challenge.

For a species with a reference genome, all that has to be done is to align the reads to the reference genome and compute the coverage. Most of these tools are therefore geared towards analyzing data from species whose genomes have been sequenced. What if I want to study a species such as the Japanese quail, whose genome has not been sequenced?

Interest in de novo assembly has grown in recent years, especially for handling transcriptome sequence data when there is no reference genome. Earlier assembly programs were written for genomic data with longer reads. Because the newer sequencing platforms generate reads on the order of 50 bp, several programs have been written that handle these shorter reads, mostly using de Bruijn graphs. De novo assembly, in simple words, means creating longer contiguous sequences (contigs) out of short reads. De novo assembly programs often generate contigs with an N50 of several kilobases. N50 is not an average: with the contigs sorted by length in descending order, it is the length of the contig at which the running total of contig lengths first reaches 50% of the total assembled length. So far, we have managed to take a bunch of short reads and generate longer contiguous sequences. In other words, we now have a template that can potentially serve as our reference for estimating the coverage of each region.
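The N50 definition above can be made concrete with a few lines of Python. This is only a minimal sketch: the function name `n50` and the example contig lengths are made up for illustration, and real assemblers report this statistic themselves.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    # Sort contigs from longest to shortest.
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        # The first contig that pushes the running total to at least
        # half of the assembly length defines the N50.
        if running_total >= half_total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))  # total is 54; 10 + 9 + 8 = 27 reaches half, so N50 is 8
```

Note that the result is the length of a single contig, not a mean, which is why one very long contig can raise the N50 considerably.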

The reads from the RNA-seq data have been assembled into something that could potentially be used as a reference, and you have the individual reads from your experiment, so all you have to do is run the RNA-seq pipeline, right? The answer is no. Since a sequenced genome has been readily available for many species, these programs have been written to work with a genome as the reference, not a transcriptome. What we have is reads assembled into contigs that we are going to use as a reference. Because these contigs are already transcript sequences, reads should align to them without spanning introns, so we do not really want a complex algorithm that was developed to handle alternative splicing. Instead, you can directly use general short-read aligners such as MAQ or BWA, which are commonly used to align ChIP-seq data. Moreover, the programs developed for peak calling can be used on the aligned data to find the regions with higher or lower coverage relative to the coverage in a normal sample.

You have now identified the regions that are important for your study, but you are still faced with the problem of knowing what those sequences are. Given that the genome of the species you are studying has not been sequenced, the very bottleneck you started with, the only alternative at this time is to find homologous sequences in related species. For the Japanese quail, for example, the closest sequenced relative may be the chicken; for the chickpea plant, it might be another legume. This can be done by running BLASTX with the sequences of interest against protein sequences from the closest species at hand to find homologous sequences, which provide insight into the genes of interest, the transcription factors regulating that region, the transcription factors coded by that region that regulate other genes, and so on.
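To illustrate the idea of comparing coverage against a normal sample, here is a minimal sketch that works on per-contig read counts. It is not a peak caller and not what any particular tool does: the function name, the counts-per-million normalization, the pseudocount and the example contig names are all assumptions chosen for illustration. It assumes you have already counted how many reads from each sample aligned to each assembled contig.

```python
import math


def log2_fold_changes(treatment_counts, control_counts, pseudocount=1.0):
    """Compare per-contig coverage between two samples.

    Both arguments map a contig name to the number of reads aligned to it.
    Counts are scaled to reads per million aligned reads so that samples
    sequenced to different depths can be compared, and a pseudocount is
    added to avoid dividing by zero for contigs with no reads.
    """
    treatment_total = sum(treatment_counts.values())
    control_total = sum(control_counts.values())
    if treatment_total == 0 or control_total == 0:
        raise ValueError("each sample needs at least one aligned read")

    result = {}
    for contig in sorted(set(treatment_counts) | set(control_counts)):
        treatment_cpm = treatment_counts.get(contig, 0) / treatment_total * 1e6
        control_cpm = control_counts.get(contig, 0) / control_total * 1e6
        result[contig] = math.log2(
            (treatment_cpm + pseudocount) / (control_cpm + pseudocount)
        )
    return result


# Hypothetical read counts per contig, for illustration only.
treatment = {"contig_1": 300, "contig_2": 100}
control = {"contig_1": 100, "contig_2": 100}
for name, value in log2_fold_changes(treatment, control).items():
    print(name, round(value, 2))
```

Positive values mark contigs with relatively higher coverage in the treatment sample and negative values mark lower coverage. Contigs with large absolute values would be the natural candidates to take forward to the homology search.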