Hierarchical Ensembles for the structured prediction of protein functions and abnormal human phenotypes

Relevant concepts in molecular biology (for example, the functions of genes and proteins) and in medicine (for example, abnormal phenotypes associated with human pathologies) are organized into hierarchical ontologies structured as trees (e.g., FunCat for the functional classification of genes) or as directed acyclic graphs (DAGs) (e.g., the Gene Ontology (GO) for the classification of genes and proteins, and the Human Phenotype Ontology (HPO) for the classification of pathological human phenotypes).

In this context we developed hierarchical ensemble methods (Valentini 2014) based on the true path rule (TPR) (Valentini 2009, 2011; Re and Valentini 2010) and on cost-sensitive Bayesian methods that probabilistically reconcile the outputs of the base learners, for the structured prediction of gene and protein functions in tree-structured ontologies (Cesa-Bianchi and Valentini 2010; Cesa-Bianchi et al. 2009, 2010).
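To make the true path rule concrete, here is a minimal Python sketch of a TPR-style pass over a small tree. It is an illustration under stated assumptions, not the implementation used in the cited papers: the toy term names, the base-learner scores, the threshold `t = 0.5` and the function name `tpr_tree` are all hypothetical. Bottom-up, the score of each term is averaged with the scores of its children that were predicted positive. Top-down, each score is capped by the parent's score, so that a term never scores higher than its ancestors.

```python
def tpr_tree(children, scores, root, t=0.5):
    """TPR-style consensus on a tree (illustrative sketch, not the published code)."""
    consensus = {}

    def bottom_up(node):
        # Children are visited first, so their consensus scores already exist.
        positive = []
        for child in children.get(node, []):
            bottom_up(child)
            if consensus[child] > t:
                positive.append(consensus[child])
        # Average the local score with the scores of the "positive" children.
        consensus[node] = (scores[node] + sum(positive)) / (1 + len(positive))

    def top_down(node):
        # Cap each child's score by its parent's (already corrected) score.
        for child in children.get(node, []):
            consensus[child] = min(consensus[child], consensus[node])
            top_down(child)

    bottom_up(root)
    top_down(root)
    return consensus


# Hypothetical toy hierarchy and base-learner scores.
children = {"root": ["A", "B"], "A": ["A1", "A2"]}
scores = {"root": 0.5, "A": 0.4, "B": 0.25, "A1": 0.9, "A2": 0.25}
result = tpr_tree(children, scores, "root")
```

In this toy example the confident prediction for `A1` raises the score of its parent `A` above its local value of 0.4, while the final scores remain consistent with the hierarchy.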

We also showed that combining hierarchical ensemble methods, cost-sensitive learning strategies and the integration of different types of data significantly improves performance in genome-wide gene function prediction (Cesa-Bianchi et al. 2012).

For the prediction of abnormal human phenotypes according to the HPO, we recently proposed new hierarchical ensemble methods for structured prediction in DAG-structured ontologies, which have achieved state-of-the-art results (Notaro et al. 2017, 2019; Robinson et al. 2015; Valentini et al. 2015). Hierarchical ensemble methods have also been applied to the recent CAFA3 challenge for the prediction of HPO terms (Zhou et al. 2019).
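In a DAG a term can have several parents, so hierarchical consistency requires that its score does not exceed the score of any of them. The following Python sketch shows one simple way to enforce that constraint with a top-down pass in topological order. It is not the method of the cited papers: the function name `top_down_dag`, the toy graph and the scores are hypothetical, and the sketch relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def top_down_dag(parents, scores):
    """Cap each term's score by the minimum corrected score of its parents."""
    corrected = {}
    # `parents` maps each term to its parent terms, so static_order()
    # yields every parent before any of its children.
    for node in TopologicalSorter(parents).static_order():
        cap = min((corrected[p] for p in parents.get(node, [])), default=1.0)
        corrected[node] = min(scores[node], cap)
    return corrected


# Hypothetical toy DAG: term "D" has two parents, "B" and "C".
parents = {"B": ["root"], "C": ["root"], "D": ["B", "C"]}
scores = {"root": 0.9, "B": 0.6, "C": 0.3, "D": 0.8}
corrected = top_down_dag(parents, scores)
```

Here the score of `D` is lowered to that of its least confident parent `C`, so the corrected predictions respect every parent-child relation in the graph.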

More recently, new methods that combine isotonic regression with the TPR algorithm have led to state-of-the-art results in protein function prediction (paper in preparation).

Publications

N. Zhou, Y. Jiang, [...], M. Frasca, M. Notaro, G. Grossi, A. Petrini, M. Re, G. Valentini, M. Mesiti and others. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biology 20(1), 2019.

M. Notaro, M. Schubach, M. Frasca, M. Mesiti, P. Robinson and G. Valentini. Ensembling descendant term classifiers to improve gene-abnormal phenotype predictions. International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics, 2017.

M. Notaro, M. Schubach, P. Robinson and G. Valentini. Prediction of Human Phenotype Ontology terms by means of hierarchical ensemble methods. BMC Bioinformatics, BioMed Central 18(1), 2017.

P. Robinson, M. Frasca, S. Köhler, M. Notaro, M. Re and G. Valentini. A hierarchical ensemble method for DAG-structured taxonomies. International Workshop on Multiple Classifier Systems, 2015.

G. Valentini, S. Köhler, M. Re, M. Notaro and P. Robinson. Prediction of human gene-phenotype associations by exploiting the hierarchical structure of the human phenotype ontology. International Conference on Bioinformatics and Biomedical Engineering, 2015.

G. Valentini. Hierarchical ensemble methods for protein function prediction. ISRN Bioinformatics, Hindawi Publishing Corporation 2014, 2014.

N. Cesa-Bianchi, M. Re and G. Valentini. Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference. Machine Learning, Springer 88(1-2), 2012.

G. Valentini. True path rule hierarchical ensembles for genome-wide gene function prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, IEEE 8(3), 2010.

N. Cesa-Bianchi and G. Valentini. Hierarchical cost-sensitive algorithms for genome-wide gene function prediction. Machine Learning in Systems Biology, 2009.

M. Re and G. Valentini. An experimental comparison of Hierarchical Bayes and True Path Rule ensembles for protein function prediction. International Workshop on Multiple Classifier Systems, 2010.

N. Cesa-Bianchi, M. Re and G. Valentini. Functional inference in FunCat through the combination of hierarchical ensembles with data fusion methods. ICML Workshop on learning from Multi-Label Data MLD'10, 2010.

G. Valentini. True path rule hierarchical ensembles. International Workshop on Multiple Classifier Systems, 2009.