DNA microarray technologies are useful for addressing a broad range of biological problems - including the measurement of mRNA expression levels in target cells. These studies typically produce large data sets that contain measurements on thousands of genes under hundreds of conditions. There is a critical need to summarize these data and to pick out the important details. The most common activities, therefore, are to group microarray data and to reduce the number of features. Both of these activities can be done using only the raw microarray data (unsupervised methods) or using external information that provides labels for the microarray data (supervised methods). We briefly review supervised and unsupervised methods for grouping and reducing data in the context of a publicly available suite of tools called CLEAVER, and illustrate their application on a representative data set collected to study lymphoma.
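The distinction between unsupervised and supervised feature reduction can be illustrated with a minimal sketch. This is not CLEAVER's implementation; the function names, the toy expression values, and the sample labels are all hypothetical, and the two scoring rules (variance across samples, and difference in class means scaled by within-class spread) are simple stand-ins chosen for illustration.

```python
from statistics import mean, pvariance

# Toy expression matrix: one list of sample measurements per gene (made-up values).
expression = {
    "gene_a": [1.0, 1.1, 0.9, 5.0, 5.2, 4.9],
    "gene_b": [2.0, 2.1, 1.9, 2.0, 2.1, 2.0],
    "gene_c": [0.0, 6.0, 0.5, 5.5, 0.2, 6.1],
}
# Hypothetical external labels, one per sample; only the supervised method uses them.
labels = ["A", "A", "A", "B", "B", "B"]


def top_variance_genes(expr, k):
    """Unsupervised: keep the k genes with the highest variance across samples."""
    ranked = sorted(expr, key=lambda g: pvariance(expr[g]), reverse=True)
    return ranked[:k]


def top_discriminating_genes(expr, sample_labels, k):
    """Supervised: keep the k genes whose two class means differ most,
    scaled by the summed within-class standard deviations."""
    classes = sorted(set(sample_labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")

    def score(gene):
        groups = [
            [v for v, lab in zip(expr[gene], sample_labels) if lab == c]
            for c in classes
        ]
        spread = sum(pvariance(g) ** 0.5 for g in groups)
        # Small constant avoids division by zero for perfectly flat groups.
        return abs(mean(groups[0]) - mean(groups[1])) / (spread + 1e-9)

    return sorted(expr, key=score, reverse=True)[:k]


print(top_variance_genes(expression, 1))
print(top_discriminating_genes(expression, labels, 1))
```

In this toy data the two approaches select different genes: the variance filter favors the noisiest gene regardless of class, while the label-aware score favors the gene that separates the two classes cleanly.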
Annotating the tremendous amount of sequence information being generated requires accurate automated methods for recognizing homology. Although sequence similarity is only one of many indicators of evolutionary homology, it is often the only one used. Here we find that supplementing sequence similarity with information from biomedical literature increases the accuracy of homology search results. We modified the PSI-BLAST algorithm to use literature similarity in each iteration of its database search. The modified algorithm is evaluated and compared to standard PSI-BLAST in searching for homologous proteins. The modified algorithm achieved 32% recall with 95% precision, whereas the original achieved 33% recall with 84% precision; the literature similarity requirement thus preserved the sensitivity of the PSI-BLAST algorithm while improving its precision.
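The idea of requiring literature similarity before accepting a sequence hit, and of scoring the result by precision and recall, can be sketched as follows. This is an illustrative assumption, not the modified PSI-BLAST described above: the bag-of-words cosine measure, the threshold value, the function names, and the protein identifiers and texts are all hypothetical, and no actual sequence search is performed.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term counts of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_hits_by_literature(query_text, hits, literature, threshold=0.2):
    """Keep only hits whose associated literature resembles the query's.

    The threshold is an arbitrary illustrative value.
    """
    return [
        h for h in hits
        if cosine_similarity(query_text, literature.get(h, "")) >= threshold
    ]


def precision_recall(predicted, true_homologs):
    """Precision and recall of a predicted set against a reference set."""
    predicted, true_homologs = set(predicted), set(true_homologs)
    true_positives = len(predicted & true_homologs)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_homologs) if true_homologs else 0.0
    return precision, recall


# Made-up example: three sequence hits, one of which has unrelated literature.
query_text = "protein kinase phosphorylation signaling"
literature = {
    "P1": "serine threonine protein kinase signaling",
    "P2": "ribosomal rna processing",
    "P3": "kinase phosphorylation cascade",
}
hits = ["P1", "P2", "P3"]
truth = {"P1", "P3", "P4"}

filtered = filter_hits_by_literature(query_text, hits, literature)
print(precision_recall(hits, truth))
print(precision_recall(filtered, truth))
```

In this toy case the filter removes the one false hit without discarding any true homolog, so precision rises while recall is unchanged, which mirrors the qualitative trade-off reported above.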
Measuring the expression of most or all of the genes in a biological system raises major analytic challenges. A wealth of recent reports uses microarray expression data to examine diverse biological phenomena - from basic processes in model organisms to complex aspects of human disease. After an initial flurry of methods for clustering the data on the basis of similarity, the field has recognized some longer-term challenges. Firstly, there are efforts to understand the sources of noise and variation in microarray experiments in order to increase the biological signal. Secondly, there are efforts to combine expression data with other sources of information to improve the range and quality of conclusions that can be drawn. Finally, techniques are now emerging to reconstruct networks of genetic interactions in order to create integrated and systematic models of biological systems.