Mycobacterium tuberculosis (M. tb.) strains differ in the number and locations of a transposon-like insertion sequence known as IS6110. Accurate detection of this sequence can be used as a fingerprint for individual strains, but can be difficult because of noisy data. In this paper, we propose a non-parametric discriminant analysis method for predicting the locations of the IS6110 sequence from microarray data. Polymerase chain reaction extension products generated from primers specific for the insertion sequence are hybridized to a microarray containing targets corresponding to each open reading frame in M. tb. To test for insertion sites, we use microarray intensity values extracted from small windows of contiguous open reading frames. Rank-transformation of spot intensities and first-order differences in local windows provide enough information to reliably determine the presence of an insertion sequence. The nonparametric approach outperforms all other methods tested in this study.
A series of microarray experiments produces observations of differential expression for thousands of genes across multiple conditions. It is often not clear whether a set of experiments are measuring fundamentally different gene expression states or are measuring similar states created through different mechanisms. It is useful, therefore, to define a core set of independent features for the expression states that allow them to be compared directly. Principal components analysis (PCA) is a statistical technique for determining the key variables in a multidimensional data set that explain the differences in the observations, and can be used to simplify the analysis and visualization of multidimensional data sets. We show that application of PCA to expression data (where the experimental conditions are the variables, and the gene expression measurements are the observations) allows us to summarize the ways in which gene responses vary under different conditions. Examination of the components also provides insight into the underlying factors that are measured in the experiments. We applied PCA to the publicly released yeast sporulation data set (Chu et al. 1998). In that work, 7 different measurements of gene expression were made over time. PCA on the time-points suggests that much of the observed variability in the experiment can be summarized in just 2 components--i.e. 2 variables capture most of the information. These components appear to represent (1) overall induction level and (2) change in induction level over time. We also examined the clusters proposed in the original paper, and show how they are manifested in principal component space. Our results are available on the internet at http:¿www.smi.stanford.edu/project/helix/PCArray .