OBJECTIVES: Generalizable, high-throughput phenotyping methods based on supervised machine learning (ML) algorithms could significantly accelerate the use of electronic health records data for clinical and translational research. However, they often require large numbers of annotated samples, which are costly and time-consuming to review. We investigated the use of active learning (AL) in ML-based phenotyping algorithms.
METHODS: We integrated an uncertainty sampling AL approach with support vector machines-based phenotyping algorithms and evaluated its performance using three annotated disease cohorts including rheumatoid arthritis (RA), colorectal cancer (CRC), and venous thromboembolism (VTE). We investigated performance using two types of feature sets: unrefined features, which contained at least all clinical concepts extracted from notes and billing codes; and a smaller set of refined features selected by domain experts. The performance of the AL was compared with a passive learning (PL) approach based on random sampling.
RESULTS: Our evaluation showed that AL outperformed PL on three phenotyping tasks. When unrefined features were used in the RA and CRC tasks, AL reduced the number of annotated samples required to achieve an area under the curve (AUC) score of 0.95 by 68% and 23%, respectively. AL also achieved a reduction of 68% for VTE with an optimal AUC of 0.70 using refined features. As expected, refined features improved the performance of phenotyping classifiers and required fewer annotated samples.
CONCLUSIONS: This study demonstrated that AL can be useful in ML-based phenotyping methods. Moreover, AL and feature engineering based on domain knowledge could be combined to develop efficient and generalizable phenotyping methods.
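As an illustration of the uncertainty sampling step described in the METHODS, the following minimal sketch picks the unlabeled samples whose SVM decision values lie closest to the separating hyperplane. The function name, the batch size, and the example scores are hypothetical; the decision values are assumed to come from an already-trained SVM (for example, scikit-learn's `SVC.decision_function`), and this is not the study's actual implementation.

```python
def select_most_uncertain(decision_values, labeled_indices, batch_size):
    """Return indices of the unlabeled samples closest to the SVM hyperplane.

    decision_values: signed distances from an SVM (assumed precomputed).
    labeled_indices: indices that have already been annotated.
    batch_size: how many samples to send to the annotator next.
    """
    labeled = set(labeled_indices)
    candidates = [i for i in range(len(decision_values)) if i not in labeled]
    # A small absolute decision value means the classifier is least certain.
    candidates.sort(key=lambda i: abs(decision_values[i]))
    return candidates[:batch_size]


# Hypothetical decision values for six chart-review candidates.
scores = [2.1, -0.05, 0.9, -1.7, 0.2, -0.3]
next_batch = select_most_uncertain(scores, labeled_indices=[4], batch_size=2)
```

A passive learning baseline would instead draw the next batch uniformly at random from the same candidate pool.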
OBJECTIVE: To identify genetic determinants of granulomatosis with polyangiitis (Wegener's) (GPA).
METHODS: We carried out a genome-wide association study (GWAS) of 492 GPA cases and 1,506 healthy controls (white subjects of European descent), followed by replication analysis of the most strongly associated signals in an independent cohort of 528 GPA cases and 1,228 controls.
RESULTS: Genome-wide significant associations were identified in 32 single-nucleotide polymorphic (SNP) markers across the HLA region, the majority of which were located in the HLA-DPB1 and HLA-DPA1 genes encoding the class II major histocompatibility complex (MHC) DPβ chain 1 and DPα chain 1 proteins, respectively. Peak association signals in these 2 genes, emanating from SNPs rs9277554 (for DPβ chain 1) and rs9277341 (for DPα chain 1), were strongly replicated in an independent cohort (in the combined analysis of the initial cohort and the replication cohort, P = 1.92 × 10⁻⁵⁰ and 2.18 × 10⁻³⁹, respectively). Imputation of classic HLA alleles and conditional analyses revealed that the SNP association signal was fully accounted for by the classic HLA-DPB1*04 allele. An independent single SNP, rs26595, near SEMA6A (the gene for semaphorin 6A) on chromosome 5, was also associated with GPA, reaching genome-wide significance in a combined analysis of the GWAS and replication cohorts (P = 2.09 × 10⁻⁸).
CONCLUSION: We identified the SEMA6A and HLA-DP loci as significant contributors to risk for GPA, with the HLA-DPB1*04 allele almost completely accounting for the MHC association. These two associations confirm the critical role of immunogenetic factors in the development of GPA.
OBJECTIVE: The significance of non-rheumatoid arthritis (RA) autoantibodies in patients with RA is unclear. The aim of this study was to assess associations of autoantibodies with autoimmune risk alleles and with clinical diagnoses from the electronic medical records (EMRs) among RA cases and non-RA controls.
METHODS: Data on 1,290 RA cases and 1,236 non-RA controls of European genetic ancestry were obtained from the EMRs of 2 large academic centers. The levels of anti-citrullinated protein antibodies (ACPAs), antinuclear antibodies (ANAs), anti-tissue transglutaminase antibodies (AGTAs), and anti-thyroid peroxidase (anti-TPO) antibodies were measured. All subjects were genotyped for autoimmune risk alleles, and the association between number of autoimmune risk alleles present and number of types of autoantibodies present was studied. A phenome-wide association study (PheWAS) was conducted to study potential associations between autoantibodies and clinical diagnoses among RA cases and non-RA controls.
RESULTS: The mean ages were 60.7 years in RA cases and 64.6 years in non-RA controls. The proportion of female subjects was 79% in each group. The prevalence of ACPAs and ANAs was higher in RA cases compared to controls (each P < 0.0001); there were no differences in the prevalence of anti-TPO antibodies and AGTAs. Carriage of higher numbers of autoimmune risk alleles was associated with increasing numbers of autoantibody types in RA cases (P = 2.1 × 10⁻⁵) and non-RA controls (P = 5.0 × 10⁻³). From the PheWAS, the presence of ANAs was significantly associated with a diagnosis of Sjögren's/sicca syndrome in RA cases.
CONCLUSION: The increased frequency of autoantibodies in RA cases and non-RA controls was associated with the number of autoimmune risk alleles carried by an individual. PheWAS of EMR data, with linkage to laboratory data obtained from blood samples, provides a novel method to test for the clinical significance of biomarkers in disease.
OBJECTIVE: We aimed to mine the data in the Electronic Medical Record to automatically discover patients' Rheumatoid Arthritis disease activity at discrete rheumatology clinic visits. We cast the problem as a document classification task where the feature space includes concepts from the clinical narrative and lab values as stored in the Electronic Medical Record.
MATERIALS AND METHODS: The Training Set consisted of 2792 clinical notes and associated lab values. Test Set 1 included 1749 clinical notes and associated lab values. Test Set 2 included 344 clinical notes for which there were no associated lab values. The Apache clinical Text Analysis and Knowledge Extraction System was used to analyze the text and transform it into informative features to be combined with relevant lab values.
RESULTS: Experiments over a range of machine learning algorithms and features were conducted. The best performing combination was linear kernel Support Vector Machines with Unified Medical Language System Concept Unique Identifier features with feature selection and lab values. The Area Under the Receiver Operating Characteristic Curve (AUC) is 0.831 (σ = 0.0317), statistically significant as compared to two baselines (AUC = 0.758, σ = 0.0291). Algorithms demonstrated superior performance on cases clinically defined as extreme categories of disease activity (Remission and High) compared to those defined as intermediate categories (Moderate and Low) and included laboratory data on inflammatory markers.
CONCLUSION: Automatic Rheumatoid Arthritis disease activity discovery from Electronic Medical Record data is a learnable task approximating human performance. As a result, this approach might have several research applications, such as the identification of patients for genome-wide pharmacogenetic studies that require large sample sizes with precise definitions of disease activity and response to therapies.
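For readers less familiar with the AUC metric reported in the RESULTS, the sketch below computes it directly from its pairwise definition: the fraction of (positive, negative) pairs that the classifier ranks correctly, with ties counted as one half. The function name and the toy scores are hypothetical and serve only to illustrate the metric, not the study's evaluation pipeline.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: classifier outputs, higher meaning more likely positive.
    labels: 1 for the positive class, 0 for the negative class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # pair ranked correctly
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = auc_from_scores([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

The quadratic loop is adequate for small sets; a rank-based computation would be preferable for thousands of notes.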
Lee HS, Ripke S, Neale BM, Faraone SV, Purcell SM, Perlis RH, Mowry BJ, Thapar A, Goddard ME, Witte JS, Absher D, Agartz I, Akil H, Amin F, Andreassen OA, Anjorin A, Anney R, Anttila V, Arking DE, Asherson P, Azevedo MH, Backlund L, Badner JA, Bailey AJ, Banaschewski T, Barchas JD, Barnes MR, Barrett TB, Bass N, Battaglia A, Bauer M, Bayés M, Bellivier F, Bergen SE, Berrettini W, Betancur C, Bettecken T, Biederman J, Binder EB, Black DW, Blackwood DHR, Bloss CS, Boehnke M, Boomsma DI, Breen G, Breuer R, Bruggeman R, Cormican P, Buccola NG, Buitelaar JK, Bunney WE, Buxbaum JD, Byerley WF, Byrne EM, Caesar S, Cahn W, Cantor RM, Casas M, Chakravarti A, Chambert K, Choudhury K, Cichon S, Cloninger RC, Collier DA, Cook EH, Coon H, Cormand B, Corvin A, Coryell WH, Craig DW, Craig IW, Crosbie J, Cuccaro ML, Curtis D, Czamara D, Datta S, Dawson G, Day R, De Geus EJ, Degenhardt F, Djurovic S, Donohoe GJ, Doyle AE, Duan J, Dudbridge F, Duketis E, Ebstein RP, Edenberg HJ, Elia J, Ennis S, Etain B, Fanous A, Farmer AE, Ferrier NI, Flickinger M, Fombonne E, Foroud T, Frank J, Franke B, Fraser C, Freedman R, Freimer NB, Freitag CM, Friedl M, Frisén L, Gallagher L, Gejman PV, Georgieva L, Gershon ES, Geschwind DH, Giegling I, Gill M, Gordon SD, Gordon-Smith K, Green EK, Greenwood TA, Grice DE, Gross M, Grozeva D, Guan W, Gurling H, De Haan L, Haines JL, Hakonarson H, Hallmayer J, Hamilton SP, Hamshere ML, Hansen TF, Hartmann AM, Hautzinger M, Heath AC, Henders AK, Herms S, Hickie IB, Hipolito M, Hoefels S, Holmans PA, Holsboer F, Hoogendijk WJ, Hottenga J-J, Hultman CM, Hus V, Ingason A, Ising M, Jamain S, Jones EG, Jones I, Jones L, Tzeng J-Y, Kähler AK, Kahn RS, Kandaswamy R, Keller MC, Kennedy JL, Kenny E, Kent L, Kim Y, Kirov GK, Klauck SM, Klei L, Knowles JA, Kohli MA, Koller DL, Konte B, Korszun A, Krabbendam L, Krasucki R, Kuntsi J, Kwan P, Landén M, Långström N, Lathrop M, Lawrence J, Lawson WB, Leboyer M, Ledbetter DH, Lee PH, Lencz T, Lesch K-P, Levinson DF, Lewis CM, 
Li J, Lichtenstein P, Lieberman JA, Lin D-Y, Linszen DH, Liu C, Lohoff FW, Loo SK, Lord C, Lowe JK, Lucae S, MacIntyre DJ, Madden PAF, Maestrini E, Magnusson PKE, Mahon PB, Maier W, Malhotra AK, Mane SM, Martin CL, Martin NG, Mattheisen M, Matthews K, Mattingsdal M, McCarroll SA, McGhee KA, McGough JJ, McGrath PJ, McGuffin P, McInnis MG, McIntosh A, McKinney R, McLean AW, McMahon FJ, McMahon WM, McQuillin A, Medeiros H, Medland SE, Meier S, Melle I, Meng F, Meyer J, Middeldorp CM, Middleton L, Milanova V, Miranda A, Monaco AP, Montgomery GW, Moran JL, Moreno-De-Luca D, Morken G, Morris DW, Morrow EM, Moskvina V, Muglia P, Mühleisen TW, Muir WJ, Müller-Myhsok B, Murtha M, Myers RM, Myin-Germeys I, Neale MC, Nelson SF, Nievergelt CM, Nikolov I, Nimgaonkar V, Nolen WA, Nöthen MM, Nurnberger JI, Nwulia EA, Nyholt DR, O'Dushlaine C, Oades RD, Olincy A, Oliveira G, Olsen L, Ophoff RA, Osby U, Owen MJ, Palotie A, Parr JR, Paterson AD, Pato CN, Pato MT, Penninx BW, Pergadia ML, Pericak-Vance MA, Pickard BS, Pimm J, Piven J, Posthuma D, Potash JB, Poustka F, Propping P, Puri V, Quested DJ, Quinn EM, Ramos-Quiroga JA, Rasmussen HB, Raychaudhuri S, Rehnström K, Reif A, Ribasés M, Rice JP, Rietschel M, Roeder K, Roeyers H, Rossin L, Rothenberger A, Rouleau G, Ruderfer D, Rujescu D, Sanders AR, Sanders SJ, Santangelo SL, Sergeant JA, Schachar R, Schalling M, Schatzberg AF, Scheftner WA, Schellenberg GD, Scherer SW, Schork NJ, Schulze TG, Schumacher J, Schwarz M, Scolnick E, Scott LJ, Shi J, Shilling PD, Shyn SI, Silverman JM, Slager SL, Smalley SL, Smit JH, Smith EN, Sonuga-Barke EJS, St Clair D, State M, Steffens M, Steinhausen H-C, Strauss JS, Strohmaier J, Stroup ST, Sutcliffe JS, Szatmari P, Szelinger S, Thirumalai S, Thompson RC, Todorov AA, Tozzi F, Treutlein J, Uhr M, van den Oord EJCG, Van Grootheest G, van Os J, Vicente AM, Vieland VJ, Vincent JB, Visscher PM, Walsh CA, Wassink TH, Watson SJ, Weissman MM, Werge T, Wienker TF, Wijsman EM, Willemsen G, Williams N, 
Willsey JA, Witt SH, Xu W, Young AH, Yu TW, Zammit S, Zandi PP, Zhang P, Zitman FG, Zöllner S, Devlin B, Kelsoe JR, Sklar P, Daly MJ, O'Donovan MC, Craddock N, Sullivan PF, Smoller JW, Kendler KS, Wray NR. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nat Genet 2013;45(9):984-94.
Most psychiatric disorders are moderately to highly heritable. The degree to which genetic variation is unique to individual disorders or shared across disorders is unclear. To examine shared genetic etiology, we use genome-wide genotype data from the Psychiatric Genomics Consortium (PGC) for cases and controls in schizophrenia, bipolar disorder, major depressive disorder, autism spectrum disorders (ASD) and attention-deficit/hyperactivity disorder (ADHD). We apply univariate and bivariate methods for the estimation of genetic variation within and covariation between disorders. SNPs explained 17-29% of the variance in liability. The genetic correlation calculated using common SNPs was high between schizophrenia and bipolar disorder (0.68 ± 0.04 s.e.), moderate between schizophrenia and major depressive disorder (0.43 ± 0.06 s.e.), bipolar disorder and major depressive disorder (0.47 ± 0.06 s.e.), and ADHD and major depressive disorder (0.32 ± 0.07 s.e.), low between schizophrenia and ASD (0.16 ± 0.06 s.e.) and non-significant for other pairs of disorders as well as between psychiatric disorders and the negative control of Crohn's disease. This empirical evidence of shared genetic etiology for psychiatric disorders can inform nosology and encourages the investigation of common pathophysiologies for related disorders.
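As a reminder of what the genetic correlations quoted above measure, the conventional definition from a bivariate variance-components model is sketched below. The symbols are assumptions introduced here for illustration and do not appear in the abstract: \sigma_{g_1 g_2} is the SNP-based genetic covariance between two disorders, and \sigma^{2}_{g_1} and \sigma^{2}_{g_2} are their SNP-based genetic variances.

```latex
r_g \;=\; \frac{\sigma_{g_1 g_2}}{\sqrt{\sigma^{2}_{g_1}\,\sigma^{2}_{g_2}}}
```

Under this definition a value near 1 indicates that the common-SNP effects on the two disorders are largely shared, and a value near 0 indicates that they are largely independent.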
Investigators have made key advances in rheumatoid arthritis (RA) genetics in the past 10 years. Although genetic studies have had limited influence on clinical practice and drug discovery, they are currently generating testable hypotheses to explain disease pathogenesis. Firstly, we review here the major advances in identifying RA genetic susceptibility markers both within and outside of the MHC. Understanding how genetic variants translate into pathogenic mechanisms and ultimately into phenotypes remains a mystery for most of the polymorphisms that confer susceptibility to RA, but functional data are emerging. Interplay between environmental and genetic factors is poorly understood and in need of further investigation. Secondly, we review current knowledge of the role of epigenetics in RA susceptibility. Differences in the epigenome could represent one of the ways in which environmental exposures translate into phenotypic outcomes. The best understood epigenetic phenomena include post-translational histone modifications and DNA methylation events, both of which have critical roles in gene regulation. Epigenetic studies in RA represent a new area of research with the potential to answer unsolved questions.
DNA sequence variation within human leukocyte antigen (HLA) genes mediates susceptibility to a wide range of human diseases. The complex genetic structure of the major histocompatibility complex (MHC) makes it difficult, however, to collect genotyping data in large cohorts. Long-range linkage disequilibrium between HLA loci and SNP markers across the MHC region offers an alternative approach through imputation to interrogate HLA variation in existing GWAS data sets. Here we describe a computational strategy, SNP2HLA, to impute classical alleles and amino acid polymorphisms at class I (HLA-A, -B, -C) and class II (-DPA1, -DPB1, -DQA1, -DQB1, and -DRB1) loci. To characterize performance of SNP2HLA, we constructed two European ancestry reference panels, one based on data collected in HapMap-CEPH pedigrees (90 individuals) and another based on data collected by the Type 1 Diabetes Genetics Consortium (T1DGC, 5,225 individuals). We imputed HLA alleles in an independent data set from the British 1958 Birth Cohort (N = 918) with gold standard four-digit HLA types and SNPs genotyped using the Affymetrix GeneChip 500 K and Illumina Immunochip microarrays. We demonstrate that the sample size of the reference panel, rather than SNP density of the genotyping platform, is critical to achieve high imputation accuracy. Using the larger T1DGC reference panel, the average accuracy at four-digit resolution is 94.7% using the low-density Affymetrix GeneChip 500 K, and 96.7% using the high-density Illumina Immunochip. For amino acid polymorphisms within HLA genes, we achieve 98.6% and 99.3% accuracy using the Affymetrix GeneChip 500 K and Illumina Immunochip, respectively. Finally, we demonstrate how imputation and association testing at amino acid resolution can facilitate fine-mapping of primary MHC association signals, giving a specific example from type 1 diabetes.
OBJECTIVE: Differences in lipid levels associated with cardiovascular (CV) risk between rheumatoid arthritis (RA) patients and the general population remain unclear. Determining these differences is important in understanding the role of lipids in CV risk in RA.
METHODS: We studied 2,005 RA subjects from 2 large academic medical centers. We extracted electronic medical record data on the first low-density lipoprotein (LDL) measurement, and total cholesterol and high-density lipoprotein (HDL) measurements within 1 year of the LDL measurement. Subjects with an electronic statin prescription prior to the first LDL measurement were excluded. We compared lipid levels in RA patients to recently published levels from the general US population using the t-test and stratifying by published parameters, i.e., 2007-2010, and women. We determined lipid trends using separate linear regression models for total cholesterol, LDL cholesterol, and HDL cholesterol, testing the association between year of measurement (1989-2010) and lipid level, adjusted by age and sex. Lipid trends in RA were qualitatively compared to the published general population trends.
RESULTS: Women with RA had a significantly lower total cholesterol (186 versus 200 mg/dl; P = 0.002) and LDL cholesterol (105 versus 118 mg/dl; P = 0.001) compared to the general population (2007-2010). HDL cholesterol was not significantly different in the 2 groups. In the RA cohort, total cholesterol and LDL cholesterol significantly decreased each year, while HDL cholesterol increased (all with P < 0.0001), consistent with overall trends observed in a previous study.
CONCLUSION: RA patients appear to have an overall lower total cholesterol and LDL cholesterol than the general population despite the general overall increased risk of CV disease in RA from observational studies.
To characterize the role of rare complete human knockouts in autism spectrum disorders (ASDs), we identify genes with homozygous or compound heterozygous loss-of-function (LoF) variants (defined as nonsense and essential splice sites) from exome sequencing of 933 cases and 869 controls. We identify a 2-fold increase in complete knockouts of autosomal genes with low rates of LoF variation (≤ 5% frequency) in cases and estimate a 3% contribution to ASD risk by these events, confirming this observation in an independent set of 563 probands and 4,605 controls. Outside the pseudoautosomal regions on the X chromosome, we similarly observe a significant 1.5-fold increase in rare hemizygous knockouts in males, contributing to another 2% of ASDs in males. Taken together, these results provide compelling evidence that rare autosomal and X chromosome complete gene knockouts are important inherited risk factors for ASD.
To define the role of rare variants in advanced age-related macular degeneration (AMD) risk, we sequenced the exons of 681 genes within all reported AMD loci and related pathways in 2,493 cases and controls. We first tested each gene for increased or decreased burden of rare variants in cases compared to controls. We found that 7.8% of AMD cases compared to 2.3% of controls are carriers of rare missense CFI variants (odds ratio (OR) = 3.6; P = 2 × 10⁻⁸). There was a predominance of dysfunctional variants in cases compared to controls. We then tested individual variants for association with disease. We observed significant association with rare missense alleles in genes other than CFI. Genotyping in 5,115 independent samples confirmed associations with AMD of an allele in C3 encoding p.Lys155Gln (replication P = 3.5 × 10⁻⁵, OR = 2.8; joint P = 5.2 × 10⁻⁹, OR = 3.8) and an allele in C9 encoding p.Pro167Ser (replication P = 2.4 × 10⁻⁵, OR = 2.2; joint P = 6.5 × 10⁻⁷, OR = 2.2). Finally, we show that the allele of C3 encoding Gln155 results in resistance to proteolytic inactivation by CFH and CFI. These results implicate loss of C3 protein regulation and excessive alternative complement activation in AMD pathogenesis, thus informing both the direction of effect and mechanistic underpinnings of this disorder.
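The odds ratio quoted for rare missense CFI variants can be roughly reproduced from the two carrier frequencies alone. The sketch below is a crude, unadjusted calculation; the function name is hypothetical, and the published estimate may come from a model-based test rather than this simple ratio of odds.

```python
def odds_ratio(carrier_freq_cases, carrier_freq_controls):
    """Crude odds ratio from carrier frequencies in cases and controls."""
    odds_cases = carrier_freq_cases / (1.0 - carrier_freq_cases)
    odds_controls = carrier_freq_controls / (1.0 - carrier_freq_controls)
    return odds_cases / odds_controls


# Carrier frequencies taken from the abstract: 7.8% of cases, 2.3% of controls.
cfi_or = odds_ratio(0.078, 0.023)
print(round(cfi_or, 1))  # 3.6, matching the reported OR
```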
Genome-wide association studies (GWASs) have identified hundreds of loci harboring genetic variation influencing inflammatory-disease susceptibility in humans. It has been hypothesized that present-day inflammatory diseases may have arisen, in part, due to pleiotropic effects of host resistance to pathogens over the course of human history, with significant selective pressures acting to increase host resistance to pathogens. The extent to which genetic factors underlying inflammatory-disease susceptibility have been influenced by selective processes can now be quantified more comprehensively than previously possible. To understand the evolutionary forces that have shaped inflammatory-disease susceptibility and to elucidate functional pathways affected by selection, we performed a systems-based analysis to integrate (1) published GWASs for inflammatory diseases, (2) a genome-wide scan for signatures of positive selection in a population of European ancestry, (3) functional genomics data comprised of protein-protein interaction networks, and (4) a genome-wide expression quantitative trait locus (eQTL) mapping study in peripheral blood mononuclear cells (PBMCs). We demonstrate that loci for inflammatory-disease susceptibility are enriched for genomic signatures of recent positive natural selection, with selected loci forming a highly interconnected protein-protein interaction network. Further, we identify 21 loci for inflammatory-disease susceptibility that display signatures of recent positive selection, of which 13 also show evidence of cis-regulatory effects on genes within the associated locus. Thus, our integrated analyses highlight a set of susceptibility loci that might subserve a shared molecular function and have experienced selective pressure over the course of human history; today, these loci play a key role in influencing susceptibility to multiple different inflammatory diseases, in part through alterations of gene expression in immune cells.
Cui J, Stahl EA, Saevarsdottir S, Miceli C, Diogo D, Trynka G, Raj T, Mirkov MU, Canhao H, Ikari K, Terao C, Okada Y, Wedrén S, Askling J, Yamanaka H, Momohara S, Taniguchi A, Ohmura K, Matsuda F, Mimori T, Gupta N, Kuchroo M, Morgan AW, Isaacs JD, Wilson AG, Hyrich KL, Herenius M, Doorenspleet ME, Tak P-P, Crusius BJA, van der Horst-Bruinsma IE, Wolbink GJ, van Riel PLCM, van de Laar M, Guchelaar H-J, Shadick NA, Allaart CF, Huizinga TWJ, Toes REM, Kimberly RP, Bridges LS, Criswell LA, Moreland LW, Fonseca JE, de Vries N, Stranger BE, De Jager PL, Raychaudhuri S, Weinblatt ME, Gregersen PK, Mariette X, Barton A, Padyukov L, Coenen MJ, Karlson EW, Plenge RM. Genome-wide association study and gene expression analysis identifies CD84 as a predictor of response to etanercept therapy in rheumatoid arthritis. PLoS Genet 2013;9(3):e1003394.
Anti-tumor necrosis factor alpha (anti-TNF) biologic therapy is a widely used treatment for rheumatoid arthritis (RA). It is unknown why some RA patients fail to respond adequately to anti-TNF therapy, which limits the development of clinical biomarkers to predict response or new drugs to target refractory cases. To understand the biological basis of response to anti-TNF therapy, we conducted a genome-wide association study (GWAS) meta-analysis of more than 2 million common variants in 2,706 RA patients from 13 different collections. Patients were treated with one of three anti-TNF medications: etanercept (n = 733), infliximab (n = 894), or adalimumab (n = 1,071). We identified a SNP (rs6427528) at the 1q23 locus that was associated with change in disease activity score (ΔDAS) in the etanercept subset of patients (P = 8 × 10⁻⁸), but not in the infliximab or adalimumab subsets (P > 0.05). The SNP is predicted to disrupt transcription factor binding site motifs in the 3' UTR of an immune-related gene, CD84, and the allele associated with better response to etanercept was associated with higher CD84 gene expression in peripheral blood mononuclear cells (P = 1 × 10⁻¹¹ in 228 non-RA patients and P = 0.004 in 132 RA patients). Consistent with the genetic findings, higher CD84 gene expression correlated with lower cross-sectional DAS (P = 0.02, n = 210) and showed a non-significant trend for better ΔDAS in a subset of RA patients with gene expression data (n = 31, etanercept-treated). A small, multi-ethnic replication showed a non-significant trend towards an association among etanercept-treated RA patients of Portuguese ancestry (n = 139, P = 0.4), but no association among patients of Japanese ancestry (n = 151, P = 0.8). Our study demonstrates that an allele associated with response to etanercept therapy is also associated with CD84 gene expression, and further that CD84 expression correlates with disease activity. These findings support a model in which CD84 genotypes and/or expression may serve as a useful biomarker for response to etanercept treatment in RA patients of European ancestry.
Although genetic and non-genetic studies in mouse and human implicate the CD40 pathway in rheumatoid arthritis (RA), there are no approved drugs that inhibit CD40 signaling for clinical care in RA or any other disease. Here, we sought to understand the biological consequences of a CD40 risk variant in RA discovered by a previous genome-wide association study (GWAS) and to perform a high-throughput drug screen for modulators of CD40 signaling based on human genetic findings. First, we fine-map the CD40 risk locus in 7,222 seropositive RA patients and 15,870 controls, together with deep sequencing of CD40 coding exons in 500 RA cases and 650 controls, to identify a single SNP that explains the entire signal of association (rs4810485, P = 1.4 × 10⁻⁹). Second, we demonstrate that subjects homozygous for the RA risk allele have ∼33% more CD40 on the surface of primary human CD19+ B lymphocytes than subjects homozygous for the non-risk allele (P = 10⁻⁹), a finding corroborated by expression quantitative trait loci (eQTL) analysis in peripheral blood mononuclear cells from 1,469 healthy control individuals. Third, we use retroviral shRNA infection to perturb the amount of CD40 on the surface of a human B lymphocyte cell line (BL2) and observe a direct correlation between amount of CD40 protein and phosphorylation of RelA (p65), a subunit of the NF-κB transcription factor. Finally, we develop a high-throughput NF-κB luciferase reporter assay in BL2 cells activated with trimerized CD40 ligand (tCD40L) and conduct a high-throughput screen (HTS) of 1,982 chemical compounds and FDA-approved drugs. After a series of counter-screens and testing in primary human CD19+ B cells, we identify 2 novel chemical inhibitors not previously implicated in inflammation or CD40-mediated NF-κB signaling. Our study demonstrates proof-of-concept that human genetics can be used to guide the development of phenotype-based, high-throughput small-molecule screens to identify potential novel therapies in complex traits such as RA.
Recent work has shown that much of the missing heritability of complex traits can be resolved by estimates of heritability explained by all genotyped SNPs. However, it is currently unknown how much heritability is missing due to poor tagging or additional causal variants at known GWAS loci. Here, we use variance components to quantify the heritability explained by all SNPs at known GWAS loci in nine diseases from WTCCC1 and WTCCC2. After accounting for expectation, we observed all SNPs at known GWAS loci to explain 1.29× more heritability than GWAS-associated SNPs on average (P = 3.3 × 10⁻⁵). For some diseases, this increase was individually significant: 2.07× for Multiple Sclerosis (MS) (P = 6.5 × 10⁻⁹) and 1.48× for Crohn's Disease (CD) (P = 1.3 × 10⁻³); all analyses of autoimmune diseases excluded the well-studied MHC region. Additionally, we found that GWAS loci from other related traits also explained significant heritability. The union of all autoimmune disease loci explained 7.15× more MS heritability than known MS SNPs (P < 1.0 × 10⁻¹⁶) and 2.20× more CD heritability than known CD SNPs (P = 6.1 × 10⁻⁹), with an analogous increase for all autoimmune diseases analyzed. We also observed significant increases in an analysis of > 20,000 Rheumatoid Arthritis (RA) samples typed on ImmunoChip, with 2.37× more heritability from all SNPs at GWAS loci (P = 2.3 × 10⁻⁶) and 5.33× more heritability from all autoimmune disease loci (P < 1 × 10⁻¹⁶) compared to known RA SNPs (including those identified in this cohort). Our methods adjust for LD between SNPs, which can bias standard estimates of heritability from SNPs even if all causal variants are typed. By comparing adjusted estimates, we hypothesize that the genome-wide distribution of causal variants is enriched for low-frequency alleles, but that causal variants at known GWAS loci are skewed towards common alleles. These findings have important ramifications for fine-mapping study design and our understanding of complex disease architecture.
The extent to which variants in the protein-coding sequence of genes contribute to risk of rheumatoid arthritis (RA) is unknown. In this study, we addressed this issue by deep exon sequencing and large-scale genotyping of 25 biological candidate genes located within RA risk loci discovered by genome-wide association studies (GWASs). First, we assessed the contribution of rare coding variants in the 25 genes to the risk of RA in a pooled sequencing study of 500 RA cases and 650 controls of European ancestry. We observed an accumulation of rare nonsynonymous variants exclusive to RA cases in IL2RA and IL2RB (burden test: p = 0.007 and p = 0.018, respectively). Next, we assessed the aggregate contribution of low-frequency and common coding variants to the risk of RA by dense genotyping of the 25 gene loci in 10,609 RA cases and 35,605 controls. We observed a strong enrichment of coding variants with a nominal signal of association with RA (p < 0.05) after adjusting for the best signal of association at the loci (p(enrichment) = 6.4 × 10⁻⁴). For one locus containing CD2, we found that a missense variant, rs699738 (c.798C>A [p.His266Gln]), and a noncoding variant, rs624988, reside on distinct haplotypes and independently contribute to the risk of RA (p = 4.6 × 10⁻⁶). Overall, our results indicate that variants (distributed across the allele-frequency spectrum) within the protein-coding portion of a subset of biological candidate genes identified by GWASs contribute to the risk of RA. Further, we have demonstrated that very large sample sizes will be required for comprehensively identifying the independent alleles contributing to the missing heritability of RA.
Defining and characterizing pathologies of the immune system requires precise and accurate quantification of abundances and functions of cellular subsets via cytometric studies. At this time, data analysis relies on manual gating, which is a major source of variability in large-scale studies. We devised an automated, user-guided method, X-Cyt, which specializes in rapidly and robustly identifying targeted populations of interest in large data sets. We first applied X-Cyt to quantify CD4⁺ effector and central memory T cells in 236 samples, demonstrating high concordance with manual analysis (r = 0.91 and 0.95, respectively) and superior performance to other available methods. We then quantified the rare mucosal associated invariant T cell population in 35 samples, achieving manual concordance of 0.98. Finally, we characterized the population dynamics of invariant natural killer T (iNKT) cells, a particularly rare peripheral lymphocyte, in 110 individuals by assaying 19 markers. We demonstrated that although iNKT cell numbers and marker expression are highly variable in the population, iNKT abundance correlates with sex and age, and the expression of phenotypic and functional markers correlates closely with CD4 expression.
If trait-associated variants alter regulatory regions, then they should fall within chromatin marks in relevant cell types. However, it is unclear which of the many marks are most useful in defining cell types associated with disease and fine mapping variants. We hypothesized that informative marks are phenotypically cell type specific; that is, SNPs associated with the same trait likely overlap marks in the same cell type. We examined 15 chromatin marks and found that those highlighting active gene regulation were phenotypically cell type specific. Trimethylation of histone H3 at lysine 4 (H3K4me3) was the most phenotypically cell type specific (P < 1 × 10⁻⁶), driven by colocalization of variants and marks rather than gene proximity (P < 0.001). H3K4me3 peaks overlapped with 37 SNPs for plasma low-density lipoprotein concentration in the liver (P < 7 × 10⁻⁵), 31 SNPs for rheumatoid arthritis within CD4⁺ regulatory T cells (P = 1 × 10⁻⁴), 67 SNPs for type 2 diabetes in pancreatic islet cells (P = 0.003) and the liver (P = 0.003), and 14 SNPs for neuropsychiatric disease in neuronal tissues (P = 0.007). We show how cell type-specific H3K4me3 peaks can inform the fine mapping of associated SNPs to identify causal variation.
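The basic operation behind the overlap counts above is checking whether each trait-associated SNP falls inside a chromatin-mark peak in a given cell type. The sketch below shows only that counting step, under simplifying assumptions: a single chromosome, non-overlapping peaks given as half-open (start, end) intervals, and hypothetical coordinates. It does not reproduce the study's significance testing.

```python
from bisect import bisect_right


def count_snps_in_peaks(snp_positions, peaks):
    """Count SNPs that fall inside any peak on one chromosome.

    snp_positions: base-pair positions of trait-associated SNPs.
    peaks: non-overlapping (start, end) intervals, end exclusive (assumed).
    """
    peaks = sorted(peaks)
    starts = [start for start, _ in peaks]
    hits = 0
    for pos in snp_positions:
        # Index of the last peak whose start is at or before this SNP.
        k = bisect_right(starts, pos) - 1
        if k >= 0 and pos < peaks[k][1]:
            hits += 1
    return hits


# Hypothetical H3K4me3 peaks and SNP positions for one cell type.
example_peaks = [(500, 650), (100, 200)]
example_snps = [150, 200, 499, 500, 649, 900]
overlap_count = count_snps_in_peaks(example_snps, example_peaks)
```

Repeating the count per cell type, and comparing it with counts for matched background SNPs, is one plausible way to turn these overlaps into an enrichment statistic.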
While studies associating genomic variants with complex traits have become increasingly productive, the molecular mechanisms that underlie these associations are rarely understood. Because only a small fraction of trait-associated variants can be linked to coding sequences, investigators have speculated that many of the underlying causal alleles influence non-coding gene regulatory sites. Recent studies have successfully identified examples of mechanisms for non-coding alleles at individual loci. Now, genome-wide chromatin assays have resulted in maps of dozens of genomic annotations of the non-coding genome across multiple different tissues, cell types and cell lines. This gives a tremendous opportunity to integrate these annotations with complex trait signals to globally interpret associated variants, and prioritize likely causal alleles. Here, we review the examples of mechanisms by which non-coding, common alleles result in phenotypes. We discuss the efforts to integrate common trait-associated variants with genomic annotations. Finally, we highlight some caveats of these approaches and outline future directions for improvement.