Supervised Machine Learning for biomolecular data integration, biomarker discovery and support to diagnosis and prognosis

The bio-molecular classification of pathological phenotypes requires the development of methods well suited to the characteristics of the "omics" data used, which are often characterized by high dimensionality.

In this context, we explored various supervised ensemble methods, including methods based on error-correcting codes (Valentini 2001, 2002), on dimensionality reduction through randomized projections (Bertoni et al. 2005), and on bagging and its variants (Valentini et al. 2003, 2004). We also applied univariate feature selection and cost-sensitive SVM methods to the analysis of radiographic images (classification of pulmonary nodules), with results comparable to the best in the literature (Campadelli et al. 2005).
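As an illustration of the ensemble idea, the following is a minimal sketch of a random subspace ensemble, in which each base learner sees only a random subset of the features and the predictions are combined by majority vote. It is not the implementation used in the cited papers: the nearest-centroid base learner stands in for the SVMs used there, and all function names, parameters and toy data are assumptions made for this example.

```python
import random
from collections import Counter

def nearest_centroid_fit(X, y, features):
    """Compute one centroid per class, using only the selected feature indices."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append([row[j] for j in features])
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}

def nearest_centroid_predict(centroids, row, features):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    x = [row[j] for j in features]
    return min(sorted(centroids),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)))

def random_subspace_fit(X, y, n_learners=5, subspace_dim=2, seed=0):
    """Train n_learners base classifiers, each on a random feature subset."""
    rng = random.Random(seed)
    n_features = len(X[0])
    ensemble = []
    for _ in range(n_learners):
        features = sorted(rng.sample(range(n_features), subspace_dim))
        ensemble.append((features, nearest_centroid_fit(X, y, features)))
    return ensemble

def random_subspace_predict(ensemble, row):
    """Combine the base predictions by majority vote."""
    votes = Counter(nearest_centroid_predict(c, row, f) for f, c in ensemble)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 4 samples, 4 "expression" features, 2 classes.
X = [[0.1, 0.2, 0.0, 0.1],
     [0.2, 0.1, 0.1, 0.0],
     [0.9, 0.8, 1.0, 0.9],
     [0.8, 0.9, 0.9, 1.0]]
y = ["normal", "normal", "tumor", "tumor"]

ensemble = random_subspace_fit(X, y)
print(random_subspace_predict(ensemble, [0.85, 0.9, 0.95, 0.8]))
print(random_subspace_predict(ensemble, [0.0, 0.1, 0.2, 0.1]))
```

With real gene expression data the subspace dimension would be far smaller than the number of features, which is what makes the approach attractive for high-dimensional "omics" data.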

Data integration plays a key role in several computational biology problems, since each data source can provide complementary, biologically relevant information needed to unravel the biological phenomenon of interest.

We investigated the impact of ensemble methods as "late" data fusion algorithms in gene function prediction problems: each learning machine is trained on a different source of data, and their decisions are combined according to a specific "consensus" algorithm (Re and Valentini 2009, 2010). We showed that even simple ensemble methods such as majority voting or Decision Templates can achieve results comparable with state-of-the-art methods (Re and Valentini 2010). Moreover, we showed that ensembles are less prone to errors due to noisy data (Re and Valentini 2010). We also applied data integration and ensemble methods to protein subcellular localization problems (Rozza et al. 2010, 2011), and we studied problems related to biomolecular database management, using XML to integrate heterogeneous biological data (Mesiti et al. 2009).
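The late fusion scheme described above can be sketched as follows: one classifier is trained per data source, and the per-source predictions for a gene are combined by majority voting. This is only a minimal illustration under stated assumptions, not the code used in the cited work: the nearest-centroid base learner, the source names and the toy data are all hypothetical.

```python
from collections import Counter

def train_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def predict_centroids(model, row):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(model[c], row))
    return min(sorted(model), key=dist)

def majority_vote(votes):
    """Most frequent label; ties are broken in favour of the smallest label."""
    counts = Counter(votes)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)

def late_fusion_predict(models, sample_views):
    """Each model votes using its own view of the sample; votes are then fused."""
    votes = [predict_centroids(m, v) for m, v in zip(models, sample_views)]
    return majority_vote(votes)

# Hypothetical toy data: the same 4 genes described by 3 different data sources.
labels = [0, 0, 1, 1]  # 1 = gene annotated to the functional class
sources = [
    [[0.1, 0.2], [0.0, 0.3], [0.9, 0.8], [1.0, 0.7]],  # e.g. expression
    [[1.0], [0.9], [0.1], [0.2]],                      # e.g. interactions
    [[0.5, 0.5], [0.4, 0.6], [0.6, 0.4], [0.5, 0.5]],  # e.g. a noisy source
]

# One base learner per data source.
models = [train_centroids(X, labels) for X in sources]

# A new gene, seen through each source; the noisy source votes differently.
new_gene = [[0.05, 0.25], [0.95], [0.6, 0.4]]
print(late_fusion_predict(models, new_gene))
```

In this toy case the noisy source is outvoted by the other two, which gives an intuition for why late fusion can tolerate noise in an individual data source.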

Publications

A. Rozza, G. Lombardi, M. Re, E. Casiraghi, G. Valentini and P. Campadelli. A novel ensemble technique for protein subcellular location prediction. Ensembles in Machine Learning Applications, Springer, 2011.

A. Rozza, G. Lombardi, M. Re, E. Casiraghi and G. Valentini. DDAG K-TIPCAC: an ensemble method for protein subcellular localization. ECML SUEMA 2010 workshop: supervised and unsupervised ensemble methods and their applications, 2010.

M. Re and G. Valentini. Noise tolerance of Multiple Classifier Systems in data integration-based gene function prediction. Journal of Integrative Bioinformatics, De Gruyter 7(3), 2010.

M. Re and G. Valentini. Simple ensemble methods are competitive with state-of-the-art data integration methods for gene function prediction. MLSB, 2010.

M. Re and G. Valentini. Integration of heterogeneous data sources for gene function prediction using decision templates and ensembles of learning machines. Neurocomputing, Elsevier 73(7-9), 2010.

M. Re and G. Valentini. Ensemble based data fusion for gene function prediction. International Workshop on Multiple Classifier Systems, 2009.

A. Bertoni, R. Folgieri and G. Valentini. Bio-molecular cancer prediction with random subspace ensembles of support vector machines. Neurocomputing, Elsevier 63, 2005.

P. Campadelli, E. Casiraghi and G. Valentini. Support vector machines for candidate nodules classification. Neurocomputing, Elsevier 68, 2005.

A. Bertoni, R. Folgieri and G. Valentini. Feature selection combined with random subspace ensemble for gene expression based diagnosis of malignancies. Biological and Artificial Intelligence Environments, Springer, 2005.

G. Valentini, M. Muselli and F. Ruffino. Cancer recognition with bagged ensembles of support vector machines. Neurocomputing, Elsevier 56, 2004.

G. Valentini. An application of low bias bagged SVMs to the classification of heterogeneous malignant tissues. Italian Workshop on Neural Nets, 2003.

G. Valentini, M. Muselli and F. Ruffino. Bagged ensembles of support vector machines for gene expression data analysis. Proceedings of the International Joint Conference on Neural Networks 3, 2003.

G. Valentini. Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles. Artificial Intelligence in Medicine, Elsevier 26(3), 2002.

G. Valentini. Supervised gene expression data analysis using Support Vector Machines and Multi-Layer perceptrons. Proc. of KES’2002, the Sixth International Conference on Knowledge-Based Intelligent Information & Engineering Systems, special session Machine Learning in Bioinformatics, 2002.

G. Valentini. Classification of human malignancies by machine learning methods using DNA microarray gene expression data. Fourth International Conference Neural Networks and Expert Systems in Medicine and HealthCare, 2001.