Recent advances in single-cell technologies and integration algorithms make it possible to construct large, comprehensive reference atlases from multiple datasets encompassing many donors, studies, disease states, and sequencing platforms. Much like mapping sequencing reads to a reference genome, it is essential to be able to map new query cells onto complex, multimillion-cell reference atlases to rapidly identify relevant cell states and phenotypes. We present Symphony, a novel algorithm for building compressed, integrated reference atlases of ≥10⁶ cells and enabling efficient query mapping within seconds. Based on a linear mixture model framework, Symphony precisely localizes query cells within a low-dimensional reference embedding without the need to reintegrate the reference cells, facilitating the downstream transfer of many types of reference-defined annotations to the query cells. We demonstrate the power of Symphony by (1) mapping a query containing multiple levels of experimental design to predict pancreatic cell types in human and mouse, (2) localizing query cells along a smooth developmental trajectory of human fetal liver hematopoiesis, and (3) harnessing a multimodal CITE-seq reference atlas to infer query surface protein expression in memory T cells. Symphony will enable the sharing of comprehensive integrated reference atlases in a convenient, portable format that powers fast, reproducible querying and downstream analyses.
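As a rough illustration of what transferring reference-defined annotations to mapped query cells can look like, the sketch below assigns each query cell the majority label of its nearest reference cells in a shared low-dimensional embedding. This is a generic k-nearest-neighbor vote under stated assumptions, not Symphony's mapping algorithm or its interface; the function name `knn_transfer_labels`, its parameters, and the toy coordinates are all hypothetical, and the query is assumed to have already been placed in the reference embedding.

```python
from collections import Counter

import numpy as np


def knn_transfer_labels(ref_embedding, ref_labels, query_embedding, k=5):
    """Hypothetical helper: majority-vote label transfer in a shared embedding.

    Assumes query cells have already been mapped into the same
    low-dimensional space as the reference cells.
    """
    ref = np.asarray(ref_embedding, dtype=float)
    qry = np.asarray(query_embedding, dtype=float)

    # Squared Euclidean distance from every query cell to every reference cell,
    # giving an array of shape (n_query, n_ref).
    d2 = ((qry[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2)

    # Indices of the k closest reference cells for each query cell.
    nearest = np.argsort(d2, axis=1)[:, :k]

    predicted = []
    for idx in nearest:
        votes = Counter(ref_labels[i] for i in idx)
        predicted.append(votes.most_common(1)[0][0])
    return predicted


# Toy example: two well-separated reference clusters in a 2-D embedding.
ref_embedding = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
ref_labels = ["alpha", "alpha", "alpha", "beta", "beta", "beta"]
query_embedding = [[0.05, 0.05], [4.9, 5.2]]

predictions = knn_transfer_labels(ref_embedding, ref_labels, query_embedding, k=3)
```

The brute-force distance matrix is only practical for small inputs; at atlas scale an approximate nearest-neighbor index would be the more plausible choice.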
Polymorphisms in the human leukocyte antigen (HLA) genes within the major histocompatibility complex (MHC) locus strongly influence autoimmune disease risk [1–5]. Two non-exclusive hypotheses exist about the pathogenic role of HLA alleles: (i) the central hypothesis, where HLA risk alleles influence thymic selection so that the probability of T cell receptors (TCRs) reactive to pathogenic antigens is increased [6–8]; and (ii) the peripheral hypothesis, where HLA risk alleles increase the affinity for pathogenic antigens [9–11]. The peripheral hypothesis has been the main research focus in autoimmunity, while human data on the central hypothesis are lacking. Here, we investigated the influence of HLA alleles on TCR composition at the highly diverse complementarity determining region 3 (CDR3), where the TCR recognizes antigens. We demonstrated unexpectedly powerful HLA-CDR3 associations. The strongest association was found at HLA-DRB1 amino acid position 13 (n = 628 subjects, explained variance = 9.4%; P = 4.1 × 10⁻¹³⁸). This HLA position mediates genetic risk for multiple autoimmune diseases. In structural analysis of TCR-peptide-MHC complexes, we observed that HLA-DRB1 position 13 does not interact directly with CDR3, but is proximate to antigenic peptide residues that are also close to CDR3. We identified multiple CDR3 amino acid features enriched by HLA risk alleles; for example, the risk alleles of rheumatoid arthritis, type 1 diabetes, and celiac disease all increase the hydrophobicity of CDR3 position 109 (P < 2.1 × 10⁻⁵). In the setting of celiac disease, the CDR3 features favored by HLA risk alleles are more enriched among candidate pathogenic TCRs than control TCRs (P = 2.4 × 10⁻⁶ for gliadin-specific TCRs). Together, these results provide novel genetic evidence supporting the central hypothesis.
As advances in single-cell technologies enable the unbiased assay of thousands of cells simultaneously, human disease studies are able to identify clinically associated cell states using case-control study designs. These studies require precious clinical samples and costly technologies; therefore, it is critical to employ study design principles that maximize power to detect cell state frequency shifts between conditions, such as disease versus healthy. Here, we present single-cell Power Simulation Tool (scPOST), a method that enables users to estimate power under different study designs. To approximate the specific experimental and clinical scenarios being investigated, scPOST takes prototype (public or pilot) single-cell data as input and generates large numbers of single-cell datasets in silico. We use scPOST to perform power analyses on three independent single-cell datasets that span diverse experimental conditions: a batch-corrected 21-sample rheumatoid arthritis dataset (5,265 cells) from synovial tissue, a 259-sample tuberculosis progression dataset (496,517 memory T cells) from peripheral blood mononuclear cells (PBMCs), and a 30-sample ulcerative colitis dataset (235,229 cells) from intestinal biopsies. Over thousands of simulations, we consistently observe that power to detect frequency shifts in cell states is maximized by larger numbers of independent clinical samples, reduced batch effects, and smaller variation in a cell state’s frequency across samples.
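The general idea of estimating power by simulation can be sketched in a few lines. The example below is a deliberately simplified toy model, not scPOST's procedure or interface: it does not use prototype data or model batch effects, and every name and parameter (`simulate_power`, `concentration`, the beta-binomial sampling, the normal-approximation Welch test) is an assumption chosen for illustration. Each simulated sample gets its own cell-state frequency, cells are drawn from it, and power is the fraction of simulated studies in which the case-control difference is detected.

```python
import math

import numpy as np


def _sample_props(rng, n_samples, cells_per_sample, mean_freq, concentration):
    # Per-sample true frequency varies around mean_freq; a larger
    # concentration means LESS sample-to-sample variation.
    true_freq = rng.beta(mean_freq * concentration,
                         (1.0 - mean_freq) * concentration,
                         size=n_samples)
    counts = rng.binomial(cells_per_sample, true_freq)
    return counts / cells_per_sample


def _welch_p(a, b):
    # Two-sided p-value from a Welch statistic using a normal approximation
    # (an assumption made to keep the sketch dependency-free).
    diff = a.mean() - b.mean()
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    if se == 0.0:
        return 1.0 if diff == 0.0 else 0.0
    return math.erfc(abs(diff / se) / math.sqrt(2.0))


def simulate_power(n_per_group, cells_per_sample, base_freq, fold_change,
                   concentration, n_sims=500, alpha=0.05, seed=0):
    """Fraction of simulated studies that detect the frequency shift.

    Requires n_per_group >= 2 so that sample variances are defined.
    """
    rng = np.random.default_rng(seed)
    case_freq = min(base_freq * fold_change, 0.99)
    hits = 0
    for _ in range(n_sims):
        ctrl = _sample_props(rng, n_per_group, cells_per_sample,
                             base_freq, concentration)
        case = _sample_props(rng, n_per_group, cells_per_sample,
                             case_freq, concentration)
        if _welch_p(ctrl, case) < alpha:
            hits += 1
    return hits / n_sims


# Compare two hypothetical designs that differ only in the number of samples.
power_small = simulate_power(5, 500, 0.10, 1.5, 50.0, n_sims=300)
power_large = simulate_power(40, 500, 0.10, 1.5, 50.0, n_sims=300)
```

Varying `n_per_group` and `concentration` in this toy model gives a feel for how sample count and across-sample variation affect power, in the direction the abstract reports.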