In Press
Reshef Y, Rumker L, Kang JB, Nathan A, Korsunsky I, Asgari S, Murray MB, Moody DB, Raychaudhuri S. Axes of inter-sample variability among transcriptional neighborhoods reveal disease-associated cell states in single-cell data [Internet]. Nature Biotech In Press; bioRxiv
As single-cell datasets grow in sample size, there is a critical need to characterize cell states that vary across samples and associate with sample attributes like clinical phenotypes. Current statistical approaches typically map cells to cell-type clusters and examine sample differences through that lens alone. Here we present covarying neighborhood analysis (CNA), an unbiased method to identify cell populations of interest with greater flexibility and granularity. CNA characterizes dominant axes of variation across samples by identifying groups of very small regions in transcriptional space (termed neighborhoods) that covary in abundance across samples, suggesting shared function or regulation. CNA can then rigorously test for associations between any sample-level attribute and the abundances of these covarying neighborhood groups. We show in simulation that CNA enables more powerful and accurate identification of disease-associated cell states than a cluster-based approach. When applied to published datasets, CNA captures a Notch activation signature in rheumatoid arthritis, redefines monocyte populations expanded in sepsis, and identifies a previously undiscovered T-cell population associated with progression to active tuberculosis. Competing Interest Statement: The authors have declared no competing interest.
Luo Y, Kanai M, Choi W, Li X, Yamamoto K, Ogawa K, Gutierrez-Arcelus M, Gregersen PK, Stuart PE, Elder JT, Fellay J, Carrington M, Haas DW, Guo X, Palmer ND, Chen Y-DI, Rotter JI, Taylor KD, Rich SS, Correa A, Wilson JG, Kathiresan S, Cho MH, Metspalu A, Esko T, Okada Y, Han B, for Consortium NHLBIT-OPM (TOPM), McLaren PJ, Raychaudhuri S. A high-resolution HLA reference panel capturing global population diversity enables multi-ethnic fine-mapping in HIV host response [Internet]. Nature Genetics In Press; Preprint
Defining causal variation by fine-mapping can be more effective in multi-ethnic genetic studies, particularly in regions such as the MHC with highly population-specific structure. To enable such studies, we constructed a large (N=21,546) high-resolution HLA reference panel spanning five global populations based on whole-genome sequencing data. As expected, we observed unique long-range HLA haplotypes within each population group. Despite this, we demonstrated consistently accurate imputation at G-group resolution (94.2%, 93.7%, 97.8% and 93.7% in Admixed African (AA), East Asian (EAS), European (EUR) and Latino (LAT)). We jointly analyzed genome-wide association studies (GWAS) of HIV-1 viral load from EUR, AA and LAT populations. Our analysis pinpointed the MHC association to three amino acid positions (97, 67 and 156) marking three consecutive pockets (C, B and D) within the HLA-B peptide binding groove, explaining 12.9% of trait variance, and obviating effects of previously reported associations from population-specific HIV studies. Competing Interest Statement: M.H.C. has received consulting or speaking fees from Illumina and AstraZeneca, and grant support from GSK and Bayer. Funding Statement: The study was supported by the National Institutes of Health (NIH) TB Research Unit Network, Grant U19 AI111224-01. The views expressed in this manuscript are those of the authors and do not necessarily represent the views of the National Heart, Lung, and Blood Institute; the National Institutes of Health; or the U.S. Department of Health and Human Services. The Genotype and Phenotype (GaP) Registry at The Feinstein Institute for Medical Research provided fresh, de-identified human plasma; blood was collected from control subjects under an IRB-approved protocol (IRB# 09-081) and processed to isolate plasma. The GaP is a sub-protocol of the Tissue Donation Program (TDP) at Northwell Health and a national resource for genotype-phenotype studies. A.M. 
is supported by Gentransmed grant 2014-2020.4.01.15-0012.; D.W.H. is supported by NIH grants AI110527, AI077505, TR000445, AI069439, and AI110527. D.H.S. was supported by R01 HL92301, R01 HL67348, R01 NS058700, R01 AR48797, R01 DK071891, R01 AG058921, the General Clinical Research Center of the Wake Forest University School of Medicine (M01 RR07122, F32 HL085989), the American Diabetes Association, and a pilot grant from the Claude Pepper Older Americans Independence Center of Wake Forest University Health Sciences (P60 AG10484). J.T.E. and P.E.S. were supported by NIH/NIAMS R01 AR042742, R01 AR050511, and R01 AR063611. For some HIV cohort participants, DNA and data collection was supported by NIH/NIAID AIDS Clinical Trial Group (ACTG) grants UM1 AI068634, UM1 AI068636 and UM1 AI106701, and ACTG clinical research site grants A1069412, A1069423, A1069424, A1069503, AI025859, AI025868, AI027658, AI027661, AI027666, AI027675, AI032782, AI034853, AI038858, AI045008, AI046370, AI046376, AI050409, AI050410, AI050410, AI058740, AI060354, AI068636, AI069412, AI069415, AI069418, AI069419, AI069423, AI069424, AI069428, AI069432, AI069432, AI069434, AI069439, AI069447, AI069450, AI069452, AI069465, AI069467, AI069470, AI069471, AI069472, AI069474, AI069477, AI069481, AI069484, AI069494, AI069495, AI069496, AI069501, AI069501, AI069502, AI069503, AI069511, AI069513, AI069532, AI069534, AI069556, AI072626, AI073961, RR000046, RR000425, RR023561, RR024156, RR024160, RR024996, RR025008, RR025747, RR025777, RR025780, TR000004, TR000058, TR000124, TR000170, TR000439, TR000445, TR000457, TR001079, TR001082, TR001111, and TR024160. Molecular data for the Trans-Omics in Precision Medicine (TOPMed) program was supported by the National Heart, Lung and Blood Institute (NHLBI). See the TOPMed Omics Support Table (Supplementary Table 16) for study specific omics support information. 
Core support including centralized genomic read mapping and genotype calling, along with variant quality metrics and filtering were provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1; contract HHSN268201800002I). Core support including phenotype harmonization, data management, sample-identity QC, and general program coordination were provided by the TOPMed Data Coordinating Center (R01HL-120393; U01HL-120393; contract HHSN268201800001I). We gratefully acknowledge the studies and participants who provided biological samples and data for TOPMed. The COPDGene project was supported by Award Number U01 HL089897 and Award Number U01 HL089856 from the National Heart, Lung, and Blood Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the National Institutes of Health. The COPDGene project is also supported by the COPD Foundation through contributions made to an Industry Advisory Board comprised of AstraZeneca, Boehringer Ingelheim, GlaxoSmithKline, Novartis, Pfizer, Siemens and Sunovion. A full listing of COPDGene investigators can be found at: The Jackson Heart Study (JHS) is supported and conducted in collaboration with Jackson State University (HHSN268201800013I), Tougaloo College (HHSN268201800014I), the Mississippi State Department of Health (HHSN268201800015I) and the University of Mississippi Medical Center (HHSN268201800010I, HHSN268201800011I and HHSN268201800012I) contracts from the National Heart, Lung, and Blood Institute (NHLBI) and the National Institute on Minority Health and Health Disparities (NIMHD). The authors also wish to thank the staffs and participants of the JHS. MESA and the MESA SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. 
Support for MESA is provided by contracts 75N92020D00001, HHSN268201500003I, N01-HC-95159, 75N92020D00005, N01-HC-95160, 75N92020D00002, N01-HC-95161, 75N92020D00003, N01-HC-95162, 75N92020D00006, N01-HC-95163, 75N92020D00004, N01-HC-95164, 75N92020D00007, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-000040, UL1-TR-001079, UL1-TR-001420. MESA Family is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support is provided by grants and contracts R01HL071051, R01HL071205, R01HL071250, R01HL071251, R01HL071258, R01HL071259, by the National Center for Research Resources, Grant UL1RR033176. The provision of genotyping data was supported in part by the National Center for Advancing Translational Sciences, CTSI grant UL1TR001881, and the National Institute of Diabetes and Digestive and Kidney Disease Diabetes Research Center (DRC) grant DK063491 to the Southern California Diabetes Endocrinology Research Center. This project has been funded in whole or in part with federal funds from the Frederick National Laboratory for Cancer Research, under Contract No. HHSN261200800001E. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. 
This research was supported in part by the Intramural Research Program of the NIH, Frederick National Lab, Center for Cancer Research. Each study was previously approved by respective institutional review boards (IRBs), including for the generation of WGS data and association with phenotypes. All participants provided written consent.
Kang J, Nathan A, Millard N, Rumker L, Moody DB, Korsunsky I, Raychaudhuri S. Efficient and precise single-cell reference atlas mapping with Symphony [Internet]. Nature Communications In Press; bioRxiv
Recent advances in single-cell technologies and integration algorithms make it possible to construct large, comprehensive reference atlases from multiple datasets encompassing many donors, studies, disease states, and sequencing platforms. Much like mapping sequencing reads to a reference genome, it is essential to be able to map new query cells onto complex, multimillion-cell reference atlases to rapidly identify relevant cell states and phenotypes. We present Symphony, a novel algorithm for building compressed, integrated reference atlases of ≥10⁶ cells and enabling efficient query mapping within seconds. Based on a linear mixture model framework, Symphony precisely localizes query cells within a low-dimensional reference embedding without the need to reintegrate the reference cells, facilitating the downstream transfer of many types of reference-defined annotations to the query cells. We demonstrate the power of Symphony by (1) mapping a query containing multiple levels of experimental design to predict pancreatic cell types in human and mouse, (2) localizing query cells along a smooth developmental trajectory of human fetal liver hematopoiesis, and (3) harnessing a multimodal CITE-seq reference atlas to infer query surface protein expression in memory T cells. Symphony will enable the sharing of comprehensive integrated reference atlases in a convenient, portable format that powers fast, reproducible querying and downstream analyses.
Ishigaki K, Lagattuta K, Luo Y, James E, Buckner J, Raychaudhuri S. HLA autoimmune risk alleles restrict the hypervariable region of T cell receptors [Internet]. Submitted; medRxiv
Polymorphisms in the human leukocyte antigen (HLA) genes within the major histocompatibility complex (MHC) locus strongly influence autoimmune disease risk [1-5]. Two non-exclusive hypotheses exist about the pathogenic role of HLA alleles: (i) the central hypothesis, where HLA risk alleles influence thymic selection so that the probability of T cell receptors (TCRs) reactive to pathogenic antigens is increased [6-8]; and (ii) the peripheral hypothesis, where HLA risk alleles increase the affinity for pathogenic antigens [9-11]. The peripheral hypothesis has been the main research focus in autoimmunity, while human data on the central hypothesis are lacking. Here, we investigated the influence of HLA alleles on TCR composition at the highly diverse complementarity determining region 3 (CDR3), where the TCR recognizes antigens. We demonstrated unexpectedly powerful HLA-CDR3 associations. The strongest association was found at HLA-DRB1 amino acid position 13 (n = 628 subjects, explained variance = 9.4%; P = 4.1 × 10⁻¹³⁸). This HLA position mediates genetic risk for multiple autoimmune diseases. In structural analysis of TCR-peptide-MHC complexes, we observed that HLA-DRB1 position 13 does not interact directly with CDR3, but is proximate to antigenic peptide residues that are also close to CDR3. We identified multiple CDR3 amino acid features enriched by HLA risk alleles; for example, the risk alleles of rheumatoid arthritis, type 1 diabetes, and celiac disease all increase the hydrophobicity of CDR3 position 109 (P < 2.1 × 10⁻⁵). In the setting of celiac disease, the CDR3 features favored by HLA risk alleles are more enriched among candidate pathogenic TCRs than control TCRs (P = 2.4 × 10⁻⁶ for gliadin-specific TCRs). Together, these results provide novel genetic evidence supporting the central hypothesis.
Zhang F, Mears J, Shakib L, Beynor J, Shanaj S, Korsunsky I, Nathan A, Consortium AMP, Donlin L, Raychaudhuri S. IFN-γ and TNF-α drive a CXCL10+ CCL2+ macrophage phenotype expanded in severe COVID-19 lungs and inflammatory diseases with tissue inflammation [Internet]. Genome Med 2021;13(1):64. Publisher's Version

Background: Immunosuppressive and anti-cytokine treatment may have a protective effect for patients with COVID-19. Understanding the immune cell states shared between COVID-19 and other inflammatory diseases with established therapies may help nominate immunomodulatory therapies.

Methods: To identify cellular phenotypes that may be shared across tissues affected by disparate inflammatory diseases, we developed a meta-analysis and integration pipeline that models and removes the effects of technology, tissue of origin, and donor that confound cell-type identification. Using this approach, we integrated > 300,000 single-cell transcriptomic profiles from COVID-19-affected lungs and tissues from healthy subjects and patients with five inflammatory diseases: rheumatoid arthritis (RA), Crohn's disease (CD), ulcerative colitis (UC), systemic lupus erythematosus (SLE), and interstitial lung disease. We tested the association of shared immune states with severe/inflamed status compared to healthy control using mixed-effects modeling. To define environmental factors within these tissues that shape shared macrophage phenotypes, we stimulated human blood-derived macrophages with defined combinations of inflammatory factors, emphasizing in particular antiviral interferons IFN-beta (IFN-β) and IFN-gamma (IFN-γ), and pro-inflammatory cytokines such as TNF.

Results: We built an immune cell reference consisting of > 300,000 single-cell profiles from 125 healthy or disease-affected donors from COVID-19 and five inflammatory diseases. We observed a CXCL10+ CCL2+ inflammatory macrophage state that is shared and strikingly abundant in severe COVID-19 bronchoalveolar lavage samples, inflamed RA synovium, inflamed CD ileum, and UC colon. These cells exhibited a distinct arrangement of pro-inflammatory and interferon response genes, including elevated levels of CXCL10, CXCL9, CCL2, CCL3, GBP1, STAT1, and IL1B. Further, we found this macrophage phenotype is induced upon co-stimulation by IFN-γ and TNF-α.

Conclusions: Our integrative analysis identified immune cell states shared across inflamed tissues affected by inflammatory diseases and COVID-19. Our study supports a key role for IFN-γ together with TNF-α in driving an abundant inflammatory macrophage phenotype in severe COVID-19-affected lungs, as well as inflamed RA synovium, CD ileum, and UC colon, which may be targeted by existing immunomodulatory therapies.

Keywords: COVID-19; Inflammatory diseases; Macrophage heterogeneity; Macrophage stimulation; Single-cell multi-disease tissue integration; Single-cell transcriptomics.

Nathan A, Beynor JI, Baglaenko Y, Suliman S, Ishigaki K, Asgari S, Huang CC, Luo Y, Zhang Z, Lopez K, Lindestam Arlehamn CS, Ernst JD, Jimenez J, Calderon RI, Lecca L, Van Rhijn I, Moody DB, Murray MB, Raychaudhuri S. Multimodally profiling memory T cells from a tuberculosis cohort identifies cell state associations with demographics, environment and disease [Internet]. Nat Immunol 2021;22(6):781-793. Publisher's Version
Multimodal T cell profiling can enable more precise characterization of elusive cell states underlying disease. Here, we integrated single-cell RNA and surface protein data from 500,089 memory T cells to define 31 cell states from 259 individuals in a Peruvian tuberculosis (TB) progression cohort. At immune steady state >4 years after infection and disease resolution, we found that, after accounting for significant effects of age, sex, season and genetic ancestry on T cell composition, a polyfunctional type 17 helper T (TH17) cell-like effector state was reduced in abundance and function in individuals who previously progressed from Mycobacterium tuberculosis (M.tb) infection to active TB disease. These cells are capable of responding to M.tb peptides. Deconvoluting this state (uniquely identifiable with multimodal analysis) from public data demonstrated that its depletion may precede and persist beyond active disease. Our study demonstrates the power of integrative multimodal single-cell profiling to define cell states relevant to disease and other traits.
Shoop-Worrall SJW, Hyrich KL, Wedderburn LR, Thomson W, Geifman N, Baildam E, Barnes M, Beresford MW, Carlsson E, Chieng A, Ciurtin C, Cleary G, Davidson J, Dekaj F, Dews S-A, Dick A, Diogo GR, Duerr T, Fairlie J, Foster H, Gritzfeld JF, Ioannou Y, Jebson B, Kartawinata M, Kent T, Kimonyo A, Lawson-Tovey S, Lin W-Y, Martin P, McErlane F, Merali F, Morris A, Neale H, Neisen J, Ng S, Ralph E, Ramanan AV, Raychaudhuri S, Robinson E, Smith S, Sumner E, Tarasek D, Wallace C, Wanstall Z, Yarwood A. Patient-reported wellbeing and clinical disease measures over time captured by multivariate trajectories of disease activity in individuals with juvenile idiopathic arthritis in the UK: a multicentre prospective longitudinal study [Internet]. The Lancet Rheumatology 2021;3(2):e111-e121. Publisher's Version
Background: Juvenile idiopathic arthritis (JIA) is a heterogeneous disease, the signs and symptoms of which can be summarised with use of composite disease activity measures, including the clinical Juvenile Arthritis Disease Activity Score (cJADAS). However, clusters of children and young people might experience different global patterns in their signs and symptoms of disease, which might run in parallel or diverge over time. We aimed to identify such clusters in the 3 years after a diagnosis of JIA. The identification of these clusters would allow for a greater understanding of disease progression in JIA, including how physician-reported and patient-reported outcomes relate to each other over the JIA disease course.

Methods: In this multicentre prospective longitudinal study, we included children and young people recruited before Jan 1, 2015, to the Childhood Arthritis Prospective Study (CAPS), a UK multicentre inception cohort. Participants without a cJADAS score were excluded. To assess groups of children and young people with similar disease patterns in active joint count, physician's global assessment, and patient or parental global evaluation, we used latent profile analysis at initial presentation to paediatric rheumatology and multivariate group-based trajectory models for the following 3 years. Optimal models were selected on the basis of a combination of model fit, clinical plausibility, and model parsimony.

Findings: Between Jan 1, 2001, and Dec 31, 2014, 1423 children and young people with JIA were recruited to CAPS, 239 of whom were excluded, resulting in a final study population of 1184 children and young people. We identified five clusters at baseline and six trajectory groups using longitudinal follow-up data. Disease course was not well predicted from clusters at baseline; however, in both cross-sectional and longitudinal analyses, substantial proportions of children and young people had high patient or parent global scores despite low or improving joint counts and physician global scores. Participants in these groups were older, and a higher proportion of them had enthesitis-related JIA and lower socioeconomic status, compared with those in other groups.

Interpretation: Almost one in four children and young people with JIA in our study reported persistent, high patient or parent global scores despite having low or improving active joint counts and physician's global scores. Distinct patient subgroups defined by disease manifestation or trajectories of progression could help to better personalise health-care services and treatment plans for individuals with JIA.

Funding: Medical Research Council, Versus Arthritis, Great Ormond Street Hospital Children's Charity, Olivia's Vision, and National Institute for Health Research.
Khan A, Shang N, Petukhova L, Zhang J, Shen Y, Hebbring SJ, Moncrieffe H, Kottyan LC, Namjou-Khales B, Knevel R, Raychaudhuri S, Karlson EW, Harley JB, Stanaway IB, Crosslin D, Denny JC, Elkind MSV, Gharavi AG, Hripcsak G, Weng C, Kiryluk K. Medical Records-Based Genetic Studies of the Complement System [Internet]. Journal of the American Society of Nephrology 2021; Publisher's Version
The complement pathway represents one of the critical arms of the innate immune system. We combined genome-wide and phenome-wide association studies using medical records data for C3 and C4 levels to discover common genetic variants controlling systemic complement activation. Three genome-wide significant loci had large effects on complement levels. These loci encode three critical complement genes: CFH, C3, and C4. We performed detailed functional annotations of the significant loci, including multiallelic copy number variant analysis of the C4 locus to define two structural genomic variants with large effects on C4 levels. Blood C4 levels were strongly correlated with the copy number of C4A and C4B genes. Lastly, using genome-wide genetic correlations and electronic health records-based phenome-wide association studies in 102,138 participants, we catalogued a spectrum of human diseases genetically related to systemic complement activation, including inflammatory, autoimmune, cardiometabolic, and kidney diseases.

Background: Genetic variants in complement genes have been associated with a wide range of human disease states, but well-powered genetic association studies of complement activation have not been performed in large multiethnic cohorts.

Methods: We performed medical records-based genome-wide and phenome-wide association studies for plasma C3 and C4 levels among participants of the Electronic Medical Records and Genomics (eMERGE) network.

Results: In a GWAS for C3 levels in 3949 individuals, we detected two genome-wide significant loci: chr.1q31.3 (CFH locus; rs3753396-A; β=0.20; 95% CI, 0.14 to 0.25; P=1.52×10⁻¹¹) and chr.19p13.3 (C3 locus; rs11569470-G; β=0.19; 95% CI, 0.13 to 0.24; P=1.29×10⁻⁸). These two loci explained approximately 2% of variance in C3 levels. GWAS for C4 levels involved 3998 individuals and revealed a genome-wide significant locus at chr.6p21.32 (C4 locus; rs3135353-C; β=0.40; 95% CI, 0.34 to 0.45; P=4.58×10⁻³⁵). This locus explained approximately 13% of variance in C4 levels. The multiallelic copy number variant analysis defined two structural genomic C4 variants with large effect on blood C4 levels: C4-BS (β=-0.36; 95% CI, -0.42 to -0.30; P=2.98×10⁻²²) and C4-AL-BS (β=0.25; 95% CI, 0.21 to 0.29; P=8.11×10⁻²³). Overall, C4 levels were strongly correlated with copy numbers of C4A and C4B genes. In comprehensive phenome-wide association studies involving 102,138 eMERGE participants, we cataloged a full spectrum of autoimmune, cardiometabolic, and kidney diseases genetically related to systemic complement activation.

Conclusions: We discovered genetic determinants of plasma C3 and C4 levels using eMERGE genomic data linked to electronic medical records. Genetic variants regulating C3 and C4 levels have large effects and multiple clinical correlations across the spectrum of complement-related diseases in humans.
Shi H, Gazal S, Kanai M, Koch EM, Schoech AP, Siewert KM, Kim SS, Luo Y, Amariuta T, Huang H, Okada Y, Raychaudhuri S, Sunyaev SR, Price AL. Population-specific causal disease effect sizes in functionally important regions impacted by selection [Internet]. Nat Commun 2021;12(1):1098. Publisher's Version
Many diseases exhibit population-specific causal effect sizes with trans-ethnic genetic correlations significantly less than 1, limiting trans-ethnic polygenic risk prediction. We develop a new method, S-LDXR, for stratifying squared trans-ethnic genetic correlation across genomic annotations, and apply S-LDXR to genome-wide summary statistics for 31 diseases and complex traits in East Asians (average N = 90K) and Europeans (average N = 267K) with an average trans-ethnic genetic correlation of 0.85. We determine that squared trans-ethnic genetic correlation is 0.82× (s.e. 0.01) depleted in the top quintile of background selection statistic, implying more population-specific causal effect sizes. Accordingly, causal effect sizes are more population-specific in functionally important regions, including conserved and regulatory regions. In regions surrounding specifically expressed genes, causal effect sizes are most population-specific for skin and immune genes, and least population-specific for brain genes. Our results could potentially be explained by stronger gene-environment interaction at loci impacted by selection, particularly positive selection.
Cook S, Choi W, Lim H, Luo Y, Kim K, Jia X, Raychaudhuri S, Han B. Accurate imputation of human leukocyte antigens with CookHLA [Internet]. Nat Commun 2021;12(1):1264. Publisher's Version
The recent development of imputation methods enabled the prediction of human leukocyte antigen (HLA) alleles from intergenic SNP data, allowing studies to fine-map HLA for immune phenotypes. Here we report an accurate HLA imputation method, CookHLA, which has superior imputation accuracy compared to previous methods. CookHLA differs from other approaches in that it locally embeds prediction markers into highly polymorphic exons to account for exonic variability, and in that it adaptively learns the genetic map within MHC from the data to facilitate imputation. Our benchmarking with real datasets shows that our method achieves high imputation accuracy in a wide range of scenarios, including situations where the reference panel is small or ethnically unmatched.
Choi W, Luo Y, Raychaudhuri S, Han B. HATK: HLA Analysis Toolkit [Internet]. Bioinformatics 2021;37(3):416-418. Publisher's Version
Degenhardt F, Mayr G, Wendorff M, Boucher G, Ellinghaus E, Ellinghaus D, ElAbd H, Rosati E, Hubenthal M, Juzenas S, Abedian S, Vahedi H, Thelma BK, Yang S-K, Ye BD, Cheon JH, Datta LW, Daryani NE, Ellul P, Esaki M, Fuyuno Y, McGovern DPB, Haritunians T, Hong M, Juyal G, Jung ES, Kubo M, Kugathasan S, Lenz TL, Leslie S, Malekzadeh R, Midha V, Motyer A, Ng SC, Okou DT, Raychaudhuri S, Schembri J, Schreiber S, Song K, Sood A, Takahashi A, Torres EA, Umeno J, Alizadeh BZ, Weersma RK, Wong SH, Yamazaki K, Karlsen TH, Rioux JD, Brant SR, Center MAAISR, Franke A, Consortium IIBDG. Transethnic analysis of the human leukocyte antigen region for ulcerative colitis reveals not only shared but also ethnicity-specific disease associations [Internet]. Hum Mol Genet 2021;30(5):356-369. Publisher's Version
Inflammatory bowel disease (IBD) is a chronic inflammatory disease of the gut. Genetic association studies have identified the highly variable human leukocyte antigen (HLA) region as the strongest susceptibility locus for IBD and specifically DRB1*01:03 as a determining factor for ulcerative colitis (UC). However, for most of the association signal such a delineation could not be made because of tight structures of linkage disequilibrium within the HLA. The aim of this study was therefore to further characterize the HLA signal using a transethnic approach. We performed a comprehensive fine mapping of single HLA alleles in UC in a cohort of 9272 individuals of African American, East Asian, Puerto Rican, Indian and Iranian descent and 40 691 previously analyzed Caucasians, additionally analyzing whole HLA haplotypes. We computationally characterized the binding of associated HLA alleles to human self-peptides and analyzed the physicochemical properties of the HLA proteins and predicted self-peptidomes. Highlighting alleles of the HLA-DRB1*15 group and their correlated HLA-DQ-DR haplotypes, we not only identified consistent associations (regarding effect directions/magnitudes) across different ethnicities but also identified population-specific signals (regarding differences in allele frequencies). We observed that DRB1*01:03 is mostly present in individuals of Western European descent and hardly present in non-Caucasian individuals. We found peptides predicted to bind to risk HLA alleles to be rich in positively charged amino acids. We conclude that the HLA plays an important role for UC susceptibility across different ethnicities. This research further implicates specific features of peptides that are predicted to bind risk and protective HLA proteins.
Millard N, Korsunsky I, Weinand K, Fonseka CY, Nathan A, Kang JB, Raychaudhuri S. Maximizing statistical power to detect clinically associated cell states with scPOST [Internet]. Cell Rep Methods 2021;1(8):PMCID: PMC8740883. bioRxiv
As advances in single-cell technologies enable the unbiased assay of thousands of cells simultaneously, human disease studies are able to identify clinically associated cell states using case-control study designs. These studies require precious clinical samples and costly technologies; therefore, it is critical to employ study design principles that maximize power to detect cell state frequency shifts between conditions, such as disease versus healthy. Here, we present single-cell Power Simulation Tool (scPOST), a method that enables users to estimate power under different study designs. To approximate the specific experimental and clinical scenarios being investigated, scPOST takes prototype (public or pilot) single-cell data as input and generates large numbers of single-cell datasets in silico. We use scPOST to perform power analyses on three independent single-cell datasets that span diverse experimental conditions: a batch-corrected 21-sample rheumatoid arthritis dataset (5,265 cells) from synovial tissue, a 259-sample tuberculosis progression dataset (496,517 memory T cells) from peripheral blood mononuclear cells (PBMCs), and a 30-sample ulcerative colitis dataset (235,229 cells) from intestinal biopsies. Over thousands of simulations, we consistently observe that power to detect frequency shifts in cell states is maximized by larger numbers of independent clinical samples, reduced batch effects, and smaller variation in a cell state’s frequency across samples.
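The idea of estimating power by repeated in silico simulation can be illustrated with a toy sketch. This is not scPOST's model or API: the function names, the Gaussian between-sample variation, the Bernoulli cell sampling, and the fixed cutoff `t_crit=2.0` (a rough stand-in for a two-sided 0.05 test) are all assumptions made for illustration, and batch effects are not modeled.

```python
import math
import random
import statistics


def simulate_sample_frequency(rng, mean_freq, between_sample_sd, n_cells):
    """Observed frequency of one cell state in one simulated sample."""
    # The sample's true frequency varies around the group mean (assumed Gaussian).
    true_freq = min(1.0, max(0.0, rng.gauss(mean_freq, between_sample_sd)))
    # Each sequenced cell belongs to the state with probability true_freq.
    hits = sum(rng.random() < true_freq for _ in range(n_cells))
    return hits / n_cells


def welch_t(a, b):
    """Welch's t statistic for two groups of per-sample frequencies."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0.0:
        return 0.0  # degenerate case: no variation in either group
    return (statistics.mean(a) - statistics.mean(b)) / se


def estimate_power(n_per_group, shift, base_freq=0.10, between_sample_sd=0.02,
                   n_cells=200, n_sims=100, t_crit=2.0, seed=0):
    """Fraction of simulated case-control studies in which the shift is detected."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_sims):
        controls = [simulate_sample_frequency(rng, base_freq, between_sample_sd, n_cells)
                    for _ in range(n_per_group)]
        cases = [simulate_sample_frequency(rng, base_freq + shift, between_sample_sd, n_cells)
                 for _ in range(n_per_group)]
        if abs(welch_t(cases, controls)) > t_crit:
            detected += 1
    return detected / n_sims
```

Calling `estimate_power` over a grid of `n_per_group` and `between_sample_sd` values gives a rough picture of how sample count and between-sample variation trade off, which is the kind of question the abstract describes.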
Amariuta T, Ishigaki K, Sugishita H, Ohta T, Koido M, Dey KK, Matsuda K, Murakami Y, Price AL, Kawakami E, Terao C, Raychaudhuri S. Improving the trans-ancestry portability of polygenic risk scores by prioritizing variants in predicted cell-type-specific regulatory elements [Internet]. Nature Genetics 2020;52:1346-1354. Publisher's VersionAbstract
Poor trans-ancestry portability of polygenic risk scores is a consequence of Eurocentric genetic studies and limited knowledge of shared causal variants. Leveraging regulatory annotations may improve portability by prioritizing functional over tagging variants. We constructed a resource of 707 cell-type-specific IMPACT regulatory annotations by aggregating 5,345 epigenetic datasets to predict binding patterns of 142 transcription factors across 245 cell types. We then partitioned the common SNP heritability of 111 genome-wide association study summary statistics of European (average n ≈ 189,000) and East Asian (average n ≈ 157,000) origin. IMPACT annotations captured consistent SNP heritability between populations, suggesting prioritization of shared functional variants. Variant prioritization using IMPACT resulted in increased trans-ancestry portability of polygenic risk scores from Europeans to East Asians across all 21 phenotypes analyzed (49.9% mean relative increase in R²). Our study identifies a crucial role for functional annotations such as IMPACT to improve the trans-ancestry portability of genetic data.
Asgari S, Luo Y, Belbin GM, Bartell E, Calderon R, Slowikowski K, Contreras C, Yataco R, Galea JT, Jimenez J, Coit JM, Farroñay C, Nazarian RM, O’Connor TD, Dietz HC, Hirschhorn J, Guio H, Lecca L, Kenny EE, Freeman E, Murray MB, Raychaudhuri S. A positively selected, common, missense variant in FBN1 confers a 2.2 centimeter reduction of height in the Peruvian population [Internet]. Nature 2020;582(7811):234-239. NCBI LinkAbstract
Peruvians are among the shortest people in the world. To understand the genetic basis of short stature in Peru, we examined an ethnically diverse group of Peruvians and identified a novel, population-specific, missense variant in FBN1 (E1297G) that is significantly associated with lower height in the Peruvian population. Each copy of the minor allele (frequency = 4.7%) reduces height by 2.2 cm (4.4 cm in homozygous individuals). This is the largest effect size known for a common height-associated variant. This variant shows strong evidence of positive selection within the Peruvian population and is significantly more frequent in Native American populations from coastal regions of Peru compared to populations from the Andes or the Amazon, suggesting that short stature in Peruvians is the result of adaptation to the coastal environment. One Sentence Summary: A mutation found in Peruvians has the largest known effect on height for a common variant. This variant is specific to Native American ancestry.
Gutierrez-Arcelus M#, Baglaenko Y#, Arora J, Hannes S, Luo Y, Amariuta T, Teslovich N, Rao DA, Ermann J, Jonsson AH, for Consortium NHLBIT-OPM (TOPM), Navarrete C, Rich SS, Taylor KD, Rotter JI, Gregersen PK, Esko T, Brenner MB, Raychaudhuri S. Allele-specific expression changes dynamically during T cell activation in HLA and other autoimmune loci [Internet]. Nature Genetics 2020;52:247-253. Publisher's VersionAbstract
Genetic studies have revealed that autoimmune susceptibility variants are over-represented in memory CD4+ T cell regulatory elements [1-3]. Understanding how genetic variation affects gene expression in different T cell physiological states is essential for deciphering genetic mechanisms of autoimmunity [4,5]. Here, we characterized the dynamics of genetic regulatory effects at eight time points during memory CD4+ T cell activation with high-depth RNA-seq in healthy individuals. We discovered widespread, dynamic allele-specific expression across the genome, where the balance of alleles changes over time. These genes were enriched fourfold within autoimmune loci. We found pervasive dynamic regulatory effects within six HLA genes. HLA-DQB1 alleles had one of three distinct transcriptional regulatory programs. Using CRISPR-Cas9 genomic editing we demonstrated that a promoter variant is causal for T cell-specific control of HLA-DQB1 expression. Our study shows that genetic variation in cis-regulatory elements affects gene expression in a manner dependent on lymphocyte activation status, contributing to the interindividual complexity of immune responses.
Wei K#, Korsunsky I#, Marshall JL, Gao A, Watts GFM, Major T, Croft AP, Watts J, Blazar P, Lange J, Thornhill T, Filer A, Raza K, Donlin LT, Accelerating Medicines Partnership-Rheumatoid arthritis/Systemic Lupus Erythematosus (AMP RA/SLE), Siebel CW, Buckley CD, Raychaudhuri S*, Brenner* MB. Notch signalling drives synovial fibroblast identity and arthritis pathology [Internet]. Nature 2020;582(7811):259-264. Publisher's Version
Cui J, Raychaudhuri S, Karlson EW, Speyer C, Malspeis S, Guan H, Sparks JA, Ni H, Liu X, Stevens E, Williams JN, Davenport EE, Knevel R, Costenbader KH. Interactions Between Genome-Wide Genetic Factors and Smoking Influencing Risk of Systemic Lupus Erythematosus [Internet]. Arthritis & Rheumatology 2020;72(11):1863-1871. Publisher's VersionAbstract
Objective: To identify interactions between genetic factors and current or recent smoking in relation to risk of developing systemic lupus erythematosus (SLE). Methods: For the study, 673 patients with SLE (diagnosed according to the American College of Rheumatology 1997 updated classification criteria) were matched by age, sex, and race (first 3 genetic principal components) to 3,272 control subjects without a history of connective tissue disease. Smoking status was classified as current smoking/having recently quit smoking within 4 years before diagnosis (or matched index date for controls) versus distant past/never smoking. In total, 86 single-nucleotide polymorphisms and 10 classic HLA alleles previously associated with SLE were included in a weighted genetic risk score (wGRS), with scores dichotomized as either low or high based on the median value in control subjects (low wGRS being defined as less than or equal to the control median; high wGRS being defined as greater than the control median). Conditional logistic regression models were used to estimate both the risk of SLE and risk of anti–double-stranded DNA autoantibody–positive (dsDNA+) SLE. Additive interactions were assessed using the attributable proportion (AP) due to interaction, and multiplicative interactions were assessed using a chi-square test (with 1 degree of freedom) for the wGRS and for individual risk alleles. Separate repeated analyses were carried out among subjects of European ancestry only. Results: The mean ± SD age of the SLE patients at the time of diagnosis was 36.4 ± 15.3 years. Among the 673 SLE patients included, 92.3% were female and 59.3% were dsDNA+. Ethnic distributions were as follows: 75.6% of European ancestry, 4.5% of Asian ancestry, 11.7% of African ancestry, and 8.2% classified as other ancestry. A high wGRS (odds ratio [OR] 2.0
Huizinga TWJ, Holers MV, Anolik J, Brenner MB, Buckley CD, Bykerk V, Connolly SE, Deane KD, Guo J, Hodge M, Hoffmann S, Nestle F, Pitzalis C, Raychaudhuri S, Yamamoto K, Li Z, Klareskog L. Disruptive innovation in rheumatology: new networks of global public–private partnerships are needed to take advantage of scientific progress [Internet]. Annals of the Rheumatic Diseases 2020;79(5):553-555. Publisher's Version
Amariuta T, Luo Y, Knevel R, Okada Y, Raychaudhuri S. Advances in genetics toward identifying pathogenic cell states of rheumatoid arthritis [Internet]. Immunological Reviews 2020;294(1):188-204. Publisher's VersionAbstract
Rheumatoid arthritis (RA) risk has a large genetic component (~60%) that is still not fully understood. This has hampered the design of effective treatments that could promise lifelong remission. RA is a polygenic disease with 106 known genome-wide significant associated loci and thousands of small effect causal variants. Our current understanding of RA risk has suggested cell-type-specific contexts for causal variants, implicating CD4+ effector memory T cells, as well as monocytes, B cells and stromal fibroblasts. While these cellular states and categories are still mechanistically broad, future studies may identify causal cell subpopulations. These efforts are propelled by advances in single cell profiling. Identification of causal cell subpopulations may accelerate therapeutic intervention to achieve lifelong remission.