Elsevier

Advances in Genetics

Volume 72, 2010, Pages 101-116

Chapter 5 - Detecting, Characterizing, and Interpreting Nonlinear Gene–Gene Interactions Using Multifactor Dimensionality Reduction

https://doi.org/10.1016/B978-0-12-380862-2.00005-9

Abstract

Human health is a complex process that is dependent on many genes, many environmental factors, and chance events that are perhaps not measurable with current technology or are simply unknowable. Success in the design and execution of population-based association studies to identify those genetic and environmental factors that play an important role in human disease will depend on our ability to embrace, rather than ignore, complexity in the genotype to phenotype mapping relationship for any given human ecology. We review here three general computational challenges that must be addressed. First, data mining and machine learning methods are needed to model nonlinear interactions between multiple genetic and environmental factors. Second, filter and wrapper methods are needed to identify attribute interactions in large and complex solution landscapes. Third, visualization methods are needed to help interpret computational models and results. We provide here an overview of the multifactor dimensionality reduction (MDR) method that was developed for addressing each of these challenges.

Introduction

Human genetics has a long and rich history of research to understand the role of interindividual variation in the human genome and variation in biological traits. We have progressed rapidly from unmeasured genetic studies in families to the identification of common variation in the DNA sequence that can be used in population-based association studies. This is an exciting time because we now have access to technology that allows us to efficiently measure many DNA sequence variations from across the human genome. Within the next 5 years we will likely have access to cutting-edge technology that will deliver the entire genomic sequence for all subjects in our genetic and epidemiologic studies. Now that we have access to the basic hereditary information, it is time to shift our focus toward the analysis of these data. The focus of this chapter is on the important role of computer science and, more specifically, machine learning for mining patterns of genetic variations that are associated with susceptibility to common human diseases. This approach assumes that the relationship between genotype and phenotype is very complex. Specifically, we will focus on computational methods for identifying gene–gene interactions, or epistasis, which account for part of the complexity of genetic architecture.

Human genetics has been largely successful in identifying the causative mutations in single genes that determine with virtual certainty rare diseases such as sickle-cell anemia. However, the same success has not been had for common human diseases such as sporadic breast cancer, essential hypertension, or bipolar depression. This is because diseases that are common in the population have a much more complex etiology that requires different research strategies than were used to identify genes underlying rare diseases that follow a simpler Mendelian inheritance pattern. Complexity can arise from phenomena such as locus heterogeneity (i.e., different DNA sequence variations leading to the same phenotype), phenocopy (i.e., environmentally determined phenotypes), and the dependence of genotypic effects on environmental factors (i.e., gene–environment interactions or plastic reaction norms) and genotypes at other loci (i.e., gene–gene interactions or epistasis). It is this latter source of complexity, epistasis, that is of interest here. Epistasis has been recognized for many years as deviations from the simple inheritance patterns observed by Mendel (Bateson, 1909) or deviations from additivity in a linear statistical model (Fisher, 1918) and is likely due, in part, to canalization or mechanisms of stabilizing selection that evolve robust (i.e., redundant) gene networks (Waddington, 1942).

Epistasis has been defined in multiple different ways (e.g., Phillips, 1998; Phillips, 2008). We have reviewed two types of epistasis, biological and statistical (Moore & Williams, 2005; Moore & Williams, 2009; Tyler et al., 2009). Biological epistasis occurs when the physical interactions between biomolecules (e.g., DNA, RNA, proteins, enzymes, etc.) are influenced by genetic variation at multiple different loci. This type of epistasis occurs at the cellular level in an individual and is what Bateson (1909) had in mind when he coined the term. Statistical epistasis, on the other hand, occurs at the population level and is realized when there is interindividual variation in DNA sequences. The statistical phenomenon of epistasis is what Fisher (1918) had in mind. The relationship between biological and statistical epistasis is often confusing but will be important to understand if we are to make biological inferences from statistical results (Moore & Williams, 2005; Moore & Williams, 2009; Phillips, 1998; Phillips, 2008; Tyler et al., 2009). Moore (2003) has argued that epistasis is likely to be a ubiquitous phenomenon in complex human diseases. The focus of the present study is the detection and characterization of statistical epistasis in human populations using machine learning and data mining methods.

The fields of genetics and epidemiology are undergoing an information explosion and an understanding implosion. That is, our ability to generate data is far outpacing our ability to interpret it. This is especially true today, when it is technically and economically feasible to measure a million or more single nucleotide polymorphisms (SNPs) from across the human genome. An important goal in human genetics is to determine which of the millions of SNPs are useful for predicting who is at risk for common diseases. This “genome-wide” approach is expected to revolutionize the genetic analysis of common human diseases and, for better or worse, is quickly replacing the traditional “candidate-gene” approach that focuses on several genes selected by their known or suspected function.

Moore and Ritchie (2004) have outlined three significant challenges that must be overcome if we are to successfully identify genetic predictors of health and disease using a genome-wide approach. First, powerful data mining and machine learning methods will need to be developed to statistically model the relationship between combinations of DNA sequence variations and disease susceptibility. Traditional methods such as logistic regression have limited power for modeling high-order nonlinear interactions (Moore and Williams, 2002). A second challenge is the selection of genetic features or attributes that should be included for analysis. If interactions between genes explain most of the heritability of common diseases, then combinations of DNA sequence variations will need to be evaluated from a list of thousands of candidates. Filter (SNP selection) and wrapper (SNP searching) methods will play an important role because there are more combinations than can be exhaustively evaluated. A third challenge is the interpretation of gene–gene interaction models. Although a statistical model can be used to identify DNA sequence variations that confer risk for disease, this approach cannot be translated into specific prevention and treatment strategies without interpreting the results in the context of human biology. Making etiological inferences from computational models may be the most important and the most difficult challenge of all (Moore and Williams, 2005).

To illustrate the concept of statistical interaction, consider the following simple example of epistasis in the form of a penetrance function. Penetrance is simply the probability (P) of disease (D) given a particular combination of genotypes (G) that was inherited (i.e., P[D|G]). Let us assume for two SNPs labeled A and B that genotypes AA, aa, BB, and bb have population frequencies of 0.25, while genotypes Aa and Bb have frequencies of 0.5. Let us also assume that individuals have a very high risk of disease if they inherit Aa or Bb but not both (i.e., the exclusive OR or XOR logic function). What makes this model interesting is that disease risk is entirely dependent on the particular combination of genotypes inherited at more than one locus. The marginal penetrance of each single-locus genotype in this model is the same; it is computed by summing, over the genotypes at the other locus, the products of the genotype frequencies and penetrance values. Heritability can be calculated as outlined by Culverhouse et al. (2002). Thus, in this model there is no difference in disease risk for each single-locus genotype as specified by penetrance values. This model is labeled M170 by Li and Reich (2000) in their categorization of genetic models involving two SNPs and is an example of a pattern that is not separable by a simple linear function. This model is a special case where all of the heritability is due to epistasis or nonlinear gene–gene interaction.
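The marginal calculation for this XOR model can be checked numerically. The following is a minimal sketch, assuming penetrance values of 1 for the high-risk genotype combinations and 0 for all others (the text says only "very high risk", so these values are an illustrative assumption), and using the genotype frequencies given above. All function and variable names are hypothetical.

```python
from fractions import Fraction

# Genotype frequencies stated in the text (the same at both loci).
FREQ_A = {"AA": Fraction(1, 4), "Aa": Fraction(1, 2), "aa": Fraction(1, 4)}
FREQ_B = {"BB": Fraction(1, 4), "Bb": Fraction(1, 2), "bb": Fraction(1, 4)}

# ASSUMPTION: "very high risk" is modeled as penetrance 1, everything else as 0.
HIGH, LOW = Fraction(1), Fraction(0)


def penetrance(ga, gb):
    """P[D|G] under the XOR model: high risk if Aa or Bb is inherited, but not both."""
    return HIGH if (ga == "Aa") != (gb == "Bb") else LOW


def marginal_a(ga):
    """Penetrance of one genotype at locus A, averaged over genotypes at locus B."""
    return sum(FREQ_B[gb] * penetrance(ga, gb) for gb in FREQ_B)


def marginal_b(gb):
    """Penetrance of one genotype at locus B, averaged over genotypes at locus A."""
    return sum(FREQ_A[ga] * penetrance(ga, gb) for ga in FREQ_A)


marginals = {g: marginal_a(g) for g in FREQ_A}
marginals.update({g: marginal_b(g) for g in FREQ_B})
for genotype, value in marginals.items():
    print(genotype, value)
```

Under these assumed values every single-locus genotype has a marginal penetrance of 1/2, so neither SNP shows any effect when examined on its own, even though the two-locus genotype fully determines risk.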

Combining this type of statistical interaction with the challenge of variable selection yields what computer scientists have called a needle-in-a-haystack problem. That is, there may be a particular combination of SNPs or SNPs and environmental factors that together with the right nonlinear function are a significant predictor of disease susceptibility. However, individually they may not look any different than thousands of other SNPs that are not involved in the disease process and are thus noisy. Under these models, the learning algorithm is truly looking for a genetic needle in a genomic haystack. It is now commonly assumed that at least 1,000,000 carefully selected SNPs may be necessary to capture all of the relevant variation across the Caucasian human genome. Assuming this is true, we would need to scan approximately 500 billion pairwise combinations of SNPs to find a genetic needle. The number of higher order combinations is astronomical. What is the optimal computational approach to this problem?
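The size of this search space is easy to verify. The short sketch below counts unordered pairs and triples of SNPs for the 1,000,000-SNP figure quoted above; the variable names are illustrative only.

```python
import math

n_snps = 1_000_000  # SNP count quoted in the text

pairs = math.comb(n_snps, 2)    # unordered two-SNP combinations
triples = math.comb(n_snps, 3)  # unordered three-SNP combinations

print(f"pairwise combinations:  {pairs:,}")
print(f"three-way combinations: {triples:.3e}")
```

The pairwise count is 499,999,500,000, which matches the "approximately 500 billion" stated above; the three-way count is several orders of magnitude larger again.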

There are two general approaches to select attributes for predictive models. The filter approach preprocesses the data by algorithmically, statistically, or biologically assessing the quality or relevance of each variable and then using that information to select a subset for analysis. The wrapper approach iteratively selects subsets of attributes for classification using either a deterministic or stochastic algorithm. The key difference between the two approaches is that the learning algorithm plays no role in selecting which attributes to consider in the filter approach. The advantage of the filter is speed, while the wrapper approach has the potential to do a better job classifying subjects as sick or healthy. We first discuss a specific machine learning algorithm called multifactor dimensionality reduction (MDR) that has been applied to classifying healthy and diseased subjects using their DNA sequence information and then discuss filter and wrapper approaches for the specific problem of detecting epistasis or gene–gene interactions on a genome-wide scale.
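The contrast between the two approaches can be illustrated on toy data. The sketch below is not MDR and not any published filter; it assumes a hypothetical balanced dataset of five binary attributes in which attributes 0 and 1 determine case status by XOR and the rest are noise. The filter is a simple univariate score computed without a classifier, and the wrapper is a deterministic exhaustive search over attribute pairs, each scored by the training accuracy (no cross-validation) of a majority-vote lookup table.

```python
import itertools
from collections import defaultdict

N_ATTRS = 5  # attributes 0 and 1 interact (XOR); attributes 2-4 are noise

# Hypothetical toy data: every combination of five binary attributes, once each.
DATA = [(bits, bits[0] ^ bits[1]) for bits in itertools.product((0, 1), repeat=N_ATTRS)]


def filter_score(data, j):
    """Filter: univariate score for attribute j, computed without any classifier.

    Absolute difference in case rate between the two values of the attribute.
    """
    rate = {}
    for value in (0, 1):
        labels = [y for x, y in data if x[j] == value]
        rate[value] = sum(labels) / len(labels)
    return abs(rate[1] - rate[0])


def wrapper_score(data, subset):
    """Wrapper: training accuracy of a majority-vote lookup table built on `subset`."""
    cells = defaultdict(lambda: [0, 0])  # genotype cell -> [controls, cases]
    for x, y in data:
        cells[tuple(x[j] for j in subset)][y] += 1
    return sum(max(counts) for counts in cells.values()) / len(data)


filter_scores = [filter_score(DATA, j) for j in range(N_ATTRS)]
pair_scores = {
    pair: wrapper_score(DATA, pair)
    for pair in itertools.combinations(range(N_ATTRS), 2)
}
best_pair = max(pair_scores, key=pair_scores.get)

print("filter scores:", filter_scores)
print("best pair:", best_pair, "accuracy:", pair_scores[best_pair])
```

On this toy dataset the univariate filter assigns the same score to the interacting attributes as to the noise attributes, while the wrapper recovers the interacting pair at the cost of evaluating every pair with the classifier.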

Section snippets

Machine Learning Analysis of Gene–Gene Interactions Using MDR

As discussed above, one of the early definitions of epistasis was deviation from additivity in a linear model (Fisher, 1918). The linear model plays a very important role in modern genetics and epidemiology because it has a solid theoretical foundation, is easy to implement using a wide range of different software packages, and is easy to interpret. Despite these good reasons to use linear models, they do have limitations for detecting nonlinear patterns of interaction (e.g., Moore and Williams,

Filter Approaches to MDR

As discussed above, it is computationally infeasible to combinatorially explore all interactions among the DNA sequence variations in a genome-wide association study. One approach is to filter out a subset of variations that can then be efficiently analyzed using a method such as MDR. We review below a powerful filter method based on the ReliefF algorithm and then discuss prospects for using biological knowledge to filter genetic variations.
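To give a sense of why Relief-style filters are attractive for interactions, the following is a simplified sketch of the basic Relief idea rather than ReliefF itself: an attribute gains weight when it differs between an instance and its nearest neighbor of the other class (a miss), and loses weight when it differs from its nearest neighbor of the same class (a hit). Several simplifications are assumptions made for illustration: the toy data are binary rather than three-genotype SNPs, every instance is used instead of a random sample, and neighbors tied for nearest are averaged so that the result is deterministic.

```python
import itertools

# Hypothetical toy data: attributes 0 and 1 set the class by XOR; attribute 2 is noise.
DATA = [(bits, bits[0] ^ bits[1]) for bits in itertools.product((0, 1), repeat=3)]


def hamming(x, z):
    """Number of attributes at which two instances differ."""
    return sum(a != b for a, b in zip(x, z))


def nearest(data, i, same_class):
    """All instances tied for nearest to instance i, among hits or among misses."""
    x, y = data[i]
    pool = [z for k, (z, c) in enumerate(data) if k != i and (c == y) == same_class]
    d_min = min(hamming(x, z) for z in pool)
    return [z for z in pool if hamming(x, z) == d_min]


def relief_weights(data):
    """Relief-style attribute weights, averaged over all instances."""
    n_attrs = len(data[0][0])
    weights = [0.0] * n_attrs
    for i, (x, _) in enumerate(data):
        hits = nearest(data, i, same_class=True)
        misses = nearest(data, i, same_class=False)
        for j in range(n_attrs):
            hit_diff = sum(x[j] != z[j] for z in hits) / len(hits)
            miss_diff = sum(x[j] != z[j] for z in misses) / len(misses)
            # Differing from a near miss counts for the attribute;
            # differing from a near hit counts against it.
            weights[j] += (miss_diff - hit_diff) / len(data)
    return weights


weights = relief_weights(DATA)
print(weights)
```

Because neighbors are found using all attributes at once, the two interacting attributes receive higher weights than the noise attribute here, even though neither has a marginal effect on its own.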

Wrapper Approaches to MDR

Stochastic search or wrapper methods may be more powerful than filter approaches because no attributes are discarded in the process. As a result, every attribute retains some probability of being selected for evaluation by the classifier. There are many different stochastic wrapper algorithms that can be applied to this problem (Michalewicz and Fogel, 2004). However, when interactions are present in the absence of marginal effects, there is no reason to expect that any wrapper method would

Statistical Interpretation of MDR Models

The MDR method described above is a powerful attribute construction approach for detecting epistasis or nonlinear gene–gene interactions in epidemiologic studies of common human diseases. The models that MDR produces are by nature multidimensional and thus difficult to interpret. For example, an interaction model with four SNPs, each with three genotypes, summarizes 81 different genotype (i.e., level) combinations (i.e., 3^4). How does each of these level combinations relate back to biological

Summary

As human genetics and epidemiology move into the genomics age with access to all the information in the genome, we will become increasingly dependent on computer science for managing and making sense of these mountains of data. The specific challenge reviewed here is the detection, characterization, and interpretation of epistasis or gene–gene interactions that are predictive of susceptibility to common human diseases. Epistasis is an important source of complexity in the genotype to phenotype

Acknowledgments

This work was supported by National Institutes of Health (USA) grants LM009012, LM010098, and AI59694.

References (64)

  • A.S. Andrew et al. Concordance of multiple analytical approaches demonstrates a complex relationship between DNA repair gene SNPs, smoking, and bladder cancer susceptibility. Carcinogenesis (2006)
  • A.S. Andrew et al. DNA repair polymorphisms modify bladder cancer risk: A multi-factor analytic strategy. Hum. Hered. (2008)
  • K. Askland et al. Pathways-based analyses of whole-genome association study data in bipolar disorder reveal genes mediating ion channel activity and synaptic neurotransmission. Hum. Genet. (2009)
  • W. Bateson. Mendel's Principles of Heredity (1909)
  • W.S. Bush et al. Parallel multifactor dimensionality reduction: A tool for the large-scale analysis of gene–gene interactions. Bioinformatics (2006)
  • W.S. Bush et al. Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction. BMC Bioinform. (2008)
  • W.S. Bush et al. Biofilter: A knowledge-integration system for the multi-locus analysis of genome-wide association studies. Pac. Symp. Biocomput. (2009)
  • M.L. Calle et al. Improving strategies for detecting genetic patterns of disease susceptibility in association studies. Stat. Med. (2008)
  • M.L. Calle et al. mbmdr: An R package for exploring gene–gene interactions associated with binary or quantitative traits. Bioinformatics (2010)
  • T. Cattaert et al. FAM-MDR: A flexible family-based multifactor dimensionality reduction technique to detect epistasis using related individuals. PLoS ONE (2010)
  • Y. Chung et al. Odds ratio based multifactor-dimensionality reduction method for detecting gene–gene interactions. Bioinformatics (2007)
  • H.J. Cordell. Genome-wide association studies: Detecting gene–gene interactions that underlie human diseases. Nat. Rev. Genet. (2009)
  • L. De Lobel et al. A screening methodology based on Random Forests to improve the detection of gene–gene interactions. Eur. J. Hum. Genet. (2010)
  • T.L. Edwards et al. A general framework for formal tests of interaction after exhaustive search methods with applications to MDR and MDR-PDT. PLoS ONE (2010)
  • R.A. Fisher. The correlations between relatives on the supposition of Mendelian inheritance. Trans. R Soc. Edinb. (1918)
  • C.S. Greene et al. Spatially uniform ReliefF (SURF) for computationally-efficient filtering of gene–gene interactions. BioData Min. (2008)
  • C.S. Greene et al. Optimal use of expert knowledge in ant colony optimization for the analysis of epistasis in human disease. Lect. Notes Comput. Sci. (2009)
  • C.S. Greene et al. Enabling personal genomics with an explicit test of epistasis. Pac. Symp. Biocomput. (2010)
  • C.S. Greene et al. The informative extremes: Using both nearest and farthest individuals can improve Relief algorithms in the domain of human genetics. Lect. Notes Comput. Sci. (2010)
  • C.S. Greene et al. Multifactor dimensionality reduction for graphics processing units enables genome-wide testing of epistasis in sporadic ALS. Bioinformatics (2010)
  • J. Gui et al. A robust multifactor dimensionality reduction method for detecting gene–gene interactions with application to the genetic analysis of bladder cancer susceptibility. Ann. Hum. Genet. (2010)
  • L.W. Hahn et al. Multifactor dimensionality reduction software for detecting gene–gene and gene–environment interactions. Bioinformatics (2003)