Unsupervised analysis for patient stratification and pattern discovery in complex biomolecular data

Validating the clusters discovered by clustering algorithms is a central problem in bioinformatics: algorithms can find clusters in biomolecular data, but we need to assess whether the discovered clusters are statistically significant and biologically meaningful.

We developed stability-based algorithms and dedicated statistical tests for a) assessing the overall reliability of a clustering and selecting the model order (the number of clusters) in an unsupervised setting (Bertoni and Valentini, 2006, 2007, 2008; Valentini, 2007); and b) assessing the reliability of individual clusters within a clustering (Bertoni and Valentini, 2005, 2006; Valentini, 2006). These methods have been applied to the analysis and validation of subclasses of pathologies characterized at the biomolecular level, and to the discovery of multiple structures (e.g. hierarchical structures) in complex biomolecular data generated through high-throughput biotechnologies (Bertoni and Valentini, 2006, 2007; Valentini and Ruffino, 2006). We also developed stability-based methods to assess the reliability of hierarchical clusterings with a large number of clusters and examples, aimed at the unsupervised search and validation of functional classes of genes (Avogadri et al., 2008, 2009).
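The general idea of stability-based model order selection can be illustrated with a minimal sketch. It is not the published algorithms or their statistical tests: it assumes Gaussian random projections as the perturbation, a plain k-means with farthest-first initialization as the base clusterer, and the mean pairwise Jaccard index as the stability score. All function names and the synthetic data are hypothetical.

```python
import itertools
import math
import random


def random_projection(data, target_dim, rng):
    """Map each row of `data` to `target_dim` dimensions with a Gaussian random matrix."""
    source_dim = len(data[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
              for _ in range(source_dim)]
    return [[sum(row[i] * matrix[i][j] for i in range(source_dim))
             for j in range(target_dim)] for row in data]


def kmeans(data, k, rng, iterations=20):
    """Lloyd's k-means with farthest-first initialization; returns one label per row."""
    centers = [list(rng.choice(data))]
    while len(centers) < k:
        # Next center: the point farthest from the centers chosen so far.
        centers.append(list(max(
            data, key=lambda p: min(math.dist(p, c) for c in centers))))
    labels = [0] * len(data)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(point, centers[c]))
                  for point in data]
        for c in range(k):
            members = [p for p, label in zip(data, labels) if label == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def jaccard(labels_a, labels_b):
    """Jaccard index between two clusterings, computed on co-clustered pairs."""
    both = either = 0
    for i, j in itertools.combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        either += same_a or same_b
    return both / either if either else 1.0


def stability(data, k, rng, n_projections=6, target_dim=10):
    """Mean pairwise Jaccard similarity of k-clusterings across random projections."""
    clusterings = [kmeans(random_projection(data, target_dim, rng), k, rng)
                   for _ in range(n_projections)]
    pair_scores = [jaccard(a, b)
                   for a, b in itertools.combinations(clusterings, 2)]
    return sum(pair_scores) / len(pair_scores)


# Hypothetical synthetic data: 2 groups of 15 samples in 50 dimensions.
rng = random.Random(0)
data = [[rng.gauss(5.0 * group, 0.3) for _ in range(50)]
        for group in (0, 1) for _ in range(15)]
scores = {k: stability(data, k, rng) for k in (2, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In this sketch the number of clusters whose clusterings change least across projections is taken as the most reliable one. A statistical test on the distribution of the similarity values, as in the cited work, would replace the simple maximum used here.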

The search for biomolecular patterns in data characterized by high dimensionality and low cardinality (for example, DNA microarray data or protein spectrometric data) motivated the design and development of clustering ensemble methods tailored to this type of data. In particular, we developed unsupervised methods based on randomized projections to analyze high-dimensional data (Bertoni and Valentini, 2005). We also developed clustering ensemble methods based on randomized projections that use a fuzzy approach both for the base clusterings that constitute the ensemble and for combining the clusterings obtained on multiple instances of the data. Starting from the initial algorithm (Avogadri and Valentini, 2007), we developed a more general algorithmic scheme from which different fuzzy ensemble clustering algorithms can be derived (Avogadri and Valentini, 2008). This approach has been applied to the analysis of gene expression data to search for pathological subclasses characterized at the biomolecular level (Avogadri and Valentini, 2009).
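The following minimal sketch shows one way a fuzzy clustering ensemble over random projections can be put together. It is not the published algorithm: it assumes fuzzy c-means as the base clusterer, the product of memberships summed over clusters as the fuzzy pairwise similarity, and a simple threshold on the averaged similarity matrix as the consensus step. All names, parameters and the synthetic data are hypothetical.

```python
import math
import random


def random_projection(data, target_dim, rng):
    """Map each row of `data` to `target_dim` dimensions with a Gaussian random matrix."""
    source_dim = len(data[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
              for _ in range(source_dim)]
    return [[sum(row[i] * matrix[i][j] for i in range(source_dim))
             for j in range(target_dim)] for row in data]


def fuzzy_c_means(data, n_clusters, rng, fuzziness=2.0, iterations=30):
    """Fuzzy c-means; returns memberships[i][c], each row summing to 1."""
    centers = [list(rng.choice(data))]
    while len(centers) < n_clusters:  # farthest-first initialization
        centers.append(list(max(
            data, key=lambda p: min(math.dist(p, c) for c in centers))))
    exponent = 2.0 / (fuzziness - 1.0)
    memberships = []
    for _ in range(iterations):
        memberships = []
        for point in data:
            inverse = [1.0 / max(math.dist(point, c), 1e-12) ** exponent
                       for c in centers]
            total = sum(inverse)
            memberships.append([v / total for v in inverse])
        for c in range(n_clusters):
            weights = [m[c] ** fuzziness for m in memberships]
            centers[c] = [sum(w * x[j] for w, x in zip(weights, data)) / sum(weights)
                          for j in range(len(data[0]))]
    return memberships


def fuzzy_ensemble(data, n_clusters, rng, n_projections=8, target_dim=10):
    """Average fuzzy pairwise similarity over clusterings of random projections."""
    n = len(data)
    similarity = [[0.0] * n for _ in range(n)]
    for _ in range(n_projections):
        u = fuzzy_c_means(random_projection(data, target_dim, rng), n_clusters, rng)
        for i in range(n):
            for j in range(n):
                # Degree to which samples i and j share a cluster.
                shared = sum(a * b for a, b in zip(u[i], u[j]))
                similarity[i][j] += shared / n_projections
    return similarity


def consensus_labels(similarity, threshold=0.5):
    """Group samples linked by a chain of similarities above `threshold`."""
    labels = [-1] * len(similarity)
    next_label = 0
    for start in range(len(similarity)):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            a = stack.pop()
            for b, s in enumerate(similarity[a]):
                if labels[b] == -1 and s > threshold:
                    labels[b] = next_label
                    stack.append(b)
        next_label += 1
    return labels


# Hypothetical synthetic data: 2 groups of 15 samples in 50 dimensions.
rng = random.Random(0)
data = [[rng.gauss(5.0 * group, 0.3) for _ in range(50)]
        for group in (0, 1) for _ in range(15)]
coassoc = fuzzy_ensemble(data, n_clusters=2, rng=rng)
consensus = consensus_labels(coassoc)
print(consensus)
```

Averaging the similarity over several projections is what makes the result less dependent on any single low-dimensional view of the data. Other choices for the similarity (a different t-norm) or for the consensus step (a further fuzzy clustering of the similarity matrix) fit the same scheme.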

Publications

R. Avogadri and G. Valentini. Fuzzy ensemble clustering based on random projections for DNA microarray data analysis. Artificial Intelligence in Medicine, Elsevier 45(2-3), 2009.

A. Bertoni and G. Valentini. Discovering multi-level structures in bio-molecular data through the Bernstein inequality. BMC Bioinformatics, BioMed Central 9(2), 2008.

R. Avogadri and G. Valentini. Ensemble clustering with a fuzzy approach. Supervised and unsupervised ensemble methods and their applications, Springer, 2008.

A. Bertoni and G. Valentini. Model order selection for bio-molecular data clustering. BMC Bioinformatics, BioMed Central 8(2), 2007.

R. Avogadri and G. Valentini. Fuzzy ensemble clustering for DNA microarray data analysis. International Workshop on Fuzzy Logic and Applications, 2007.

A. Bertoni and G. Valentini. Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses. Artificial Intelligence in Medicine, Elsevier 37(2), 2006.

A. Bertoni and G. Valentini. Model order selection for clustered bio-molecular data. Probabilistic Modeling and Machine Learning in Structural and Systems Biology Workshop, 2006.

G. Valentini. Mosclust: a software library for discovering significant structures in bio-molecular data. Bioinformatics, Oxford University Press 23(3), 2007.

A. Bertoni and G. Valentini. Random projections for assessing gene expression cluster stability. Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, vol. 1, 2005.

G. Valentini. Clusterv: a tool for assessing the reliability of clusters discovered in DNA microarray data. Bioinformatics, Oxford University Press 22(3), 2006.

G. Valentini and F. Ruffino. Characterization of lung tumor subtypes through gene expression cluster validity assessment. RAIRO-Theoretical Informatics and Applications, EDP Sciences 40(2), 2006.

R. Avogadri, M. Brioschi, F. Ferrazzi, M. Re, A. Beghini and G. Valentini. A stability-based algorithm to validate hierarchical clusters of genes. International Journal of Knowledge Engineering and Soft Data Paradigms, Inderscience Publishers 1(4), 2009.

R. Avogadri, M. Brioschi, F. Ruffino, F. Ferrazzi, A. Beghini and G. Valentini. An algorithm to assess the reliability of hierarchical clusters in gene expression data. International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, 2008.