Functional characterizations of thousands of gene products from many species are described in the published literature. These discussions are extremely valuable for characterizing the functions not only of these gene products, but also of their homologs in other organisms. The Gene Ontology (GO) is an effort to create a controlled terminology for labeling gene functions in a more precise, reliable, computer-readable manner. Currently, the best annotations of gene function with the GO are performed by highly trained biologists who read the literature and select appropriate codes. In this study, we explored the possibility that statistical natural language processing techniques can be used to assign GO codes. We compared three document classification methods (maximum entropy modeling, naïve Bayes classification, and nearest-neighbor classification) to the problem of associating a set of GO codes (for biological process) to literature abstracts and thus to the genes associated with the abstracts. We showed that maximum entropy modeling outperforms the other methods and achieves an accuracy of 72% when ascertaining the function discussed within an abstract. The maximum entropy method provides confidence measures that correlate well with performance. We conclude that statistical methods may be used to assign GO codes and may be useful for the difficult task of reassignment as terminology standards evolve over time.
The mycobacterial insertion sequence IS6110 has been exploited extensively as a clonal marker in molecular epidemiologic studies of tuberculosis. In addition, it has been hypothesized that this element is an important driving force behind genotypic variability that may have phenotypic consequences. We present here a novel, DNA microarray-based methodology, designated SiteMapping, that simultaneously maps the locations and orientations of multiple copies of IS6110 within the genome. To investigate the sensitivity, accuracy, and limitations of the technique, it was applied to eight Mycobacterium tuberculosis strains for which complete or partial IS6110 insertion site information had been determined previously. SiteMapping correctly located 64% (38 of 59) of the IS6110 copies predicted by restriction fragment length polymorphism analysis. The technique is highly specific; 97% of the predicted insertion sites were true insertions. Eight previously unknown insertions were identified and confirmed by PCR or sequencing. The performance could be improved by modifications in the experimental protocol and in the approach to data analysis. SiteMapping has general applicability and demonstrates an expansion in the applications of microarrays that complements conventional approaches in the study of genome architecture.
The analysis of large-scale genomic information (such as sequence data or expression patterns) frequently involves grouping genes on the basis of common experimental features. Often, as with gene expression clustering, there are too many groups to easily identify the functionally relevant ones. One valuable source of information about gene function is the published literature. We present a method, neighbor divergence, for assessing whether the genes within a group share a common biological function based on their associated scientific literature. The method uses statistical natural language processing techniques to interpret biological text. It requires only a corpus of documents relevant to the genes being studied (e.g., all genes in an organism) and an index connecting the documents to appropriate genes. Given a group of genes, neighbor divergence assigns a numerical score indicating how "functionally coherent" the gene group is from the perspective of the published literature. We evaluate our method by testing its ability to distinguish 19 known functional gene groups from 1900 randomly assembled groups. Neighbor divergence achieves 79% sensitivity at 100% specificity, comparing favorably to other tested methods. We also apply neighbor divergence to previously published gene expression clusters to assess its ability to recognize gene groups that had been manually identified as representative of a common function.