Detection of pathogenic variants in the non-coding human genome

Identifying genetic variants associated with human diseases is one of the core challenges in precision medicine. It requires the design and application of a new generation of machine learning-based prediction methods able to prioritize potentially “deleterious” variants (i.e. causative or otherwise linked with disease risk) among the huge number of neutral variants that represent natural genetic variation across individuals.

Most state-of-the-art ML-based methods do not adopt imbalance-aware learning techniques to deal with the imbalanced data that naturally arise in several genome-wide variant scoring problems, which significantly reduces their sensitivity and precision. We developed HyperSMURF (hyper-ensemble of SMOTE under-sampled random forests), a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach to deal with highly imbalanced genomic data (Schubach et al, 2017, Schubach et al, 2017).
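
As a rough illustration of the resampling idea suggested by the name, the sketch below builds one balanced training set per partition of the majority (neutral) class: the minority (deleterious) class is over-sampled with SMOTE-style interpolation and the majority partition is under-sampled. It is a minimal sketch under stated assumptions, not the published implementation: the function names, the parameters `oversample_factor` and `undersample_ratio`, and the choice of Euclidean distance are all hypothetical, and the learning step (one random forest per training set, with the predictions averaged) is left out.

```python
import random


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours.
    Requires at least two minority points."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


def balanced_training_sets(minority, majority, n_partitions,
                           oversample_factor, undersample_ratio,
                           k=5, seed=0):
    """Return one (X, y) training set per majority partition.

    oversample_factor: synthetic minority points per original one (assumed).
    undersample_ratio: majority points kept per minority point (assumed).
    """
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    # Split the majority class into disjoint partitions.
    partitions = [shuffled[i::n_partitions] for i in range(n_partitions)]

    training_sets = []
    for part in partitions:
        synthetic = smote(minority, oversample_factor * len(minority), k, rng)
        positives = minority + synthetic
        # Under-sample the partition so the two classes are comparable in size.
        n_neg = min(len(part), undersample_ratio * len(positives))
        negatives = rng.sample(part, n_neg)
        X = positives + negatives
        y = [1] * len(positives) + [0] * len(negatives)
        training_sets.append((X, y))
    return training_sets
```

In such a scheme each training set would be handed to its own base learner, so that every majority example is seen by some member of the ensemble while no single learner is trained on heavily skewed data.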

This machine learning method has been successfully applied as part of Genomiser, a state-of-the-art software tool that uses both genotypic and phenotypic information to discover variants associated with specific Mendelian diseases in both coding and non-coding regulatory regions (Smedley et al, 2016).

Using a different approach, we are also developing imbalance-aware deep neural networks based on balanced mini-batch learning to predict rare pathogenic genetic variants (Cappelletti et al, 2019).
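
The general idea of balanced mini-batch learning can be sketched as a batch sampler that fixes the class proportions inside every mini-batch, regardless of how rare the positive class is in the full data set. This is a minimal, generic sketch rather than the procedure of the cited work: the function name `balanced_batches`, the `pos_fraction` parameter, and the choice to draw positives with replacement are assumptions made for illustration.

```python
import random


def balanced_batches(labels, batch_size, n_batches, pos_fraction=0.5, seed=0):
    """Yield lists of example indices with a fixed share of positives.

    labels: sequence of 0/1 class labels (1 = rare positive class).
    pos_fraction: share of each batch drawn from the positive class (assumed).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = max(1, round(batch_size * pos_fraction))
    n_neg = batch_size - n_pos
    for _ in range(n_batches):
        # Positives are rare, so they are drawn with replacement;
        # negatives are plentiful and drawn without replacement.
        batch = rng.choices(pos, k=n_pos) + rng.sample(neg, n_neg)
        rng.shuffle(batch)
        yield batch
```

Each yielded index list would then be used to slice the feature matrix and labels for one gradient step, so the network sees positives in every update instead of once every few thousand examples.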

Publications

L. Cappelletti, J. Gliozzo, A. Petrini and G. Valentini. Training Neural Networks with Balanced Mini-batch to Improve the Prediction of Pathogenic Genomic Variants in Mendelian Diseases. Sensors & Transducers, IFSA Publishing, SL 234(6), 2019.

M. Schubach, M. Re, P. Robinson and G. Valentini. Imbalance-aware machine learning for predicting rare and common disease-associated non-coding variants. Scientific Reports, Nature Publishing Group 7(1), 2017.

M. Schubach, M. Re, P. Robinson and G. Valentini. Variant relevance prediction in extremely imbalanced training sets. F1000Research 6, 2017.

D. Smedley, M. Schubach, J. Jacobsen, S. Köhler, T. Zemojtel, M. Spielmann, M. Jäger, H. Hochheiser, N. Washington, J. McMurry and others. A whole-genome analysis framework for effective identification of pathogenic regulatory variants in Mendelian disease. The American Journal of Human Genetics, Elsevier 99(3), 2016.