Research projects

Charles Bouveyron > Research projects

This page presents a few representative research projects that I have recently worked on.

Bayesian Variable Selection for Globally Sparse Probabilistic PCA: applications in genomics

In the past decade, sparse methods have established themselves as simple and powerful tools for selecting relevant features in high-dimensional statistical problems, such as principal component analysis (PCA). However, when several sparse principal components are computed, the selected variables are difficult to interpret, since each axis has its own sparsity pattern and has to be interpreted separately. To overcome this drawback, we proposed a Bayesian procedure, called globally sparse probabilistic PCA (GSPPCA), that yields several sparse components sharing the same sparsity pattern. This allows the practitioner to identify the original variables that are relevant for describing the data. To this end, using Roweis' probabilistic interpretation of PCA and a Gaussian prior on the loading matrix, we provide the first exact computation of the marginal likelihood of a Bayesian PCA model. To avoid the drawbacks of discrete model selection, a simple relaxation approach produces a path of models using a variational expectation-maximization algorithm. The exact marginal likelihood is then maximized over this path.
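To make the idea of a shared sparsity pattern concrete, here is a minimal simulation sketch. It assumes a generative form x = V W y + eps, where V = diag(v) is a binary mask common to all components. The function name, the parameters and the unit-variance prior are illustrative assumptions, and the sketch performs no inference: it only shows that masking whole rows of the loading matrix gives every component the same set of active variables.

```python
import random


def sample_globally_sparse_ppca(n, p, d, support, noise_sd=0.1, seed=0):
    """Simulate n observations from an assumed model x = V W y + eps.

    `support` is the set of indices of the relevant variables. V = diag(v)
    zeroes out every row of W outside the support, so all d components
    share one sparsity pattern (hypothetical illustration, not an estimator).
    """
    rng = random.Random(seed)
    v = [1 if j in support else 0 for j in range(p)]
    # Gaussian prior on the loading matrix W (p rows, d columns).
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(p)]
    # Apply the global mask row by row.
    VW = [[v[j] * W[j][k] for k in range(d)] for j in range(p)]
    X = []
    for _ in range(n):
        y = [rng.gauss(0.0, 1.0) for _ in range(d)]  # latent coordinates
        x = [
            sum(VW[j][k] * y[k] for k in range(d)) + rng.gauss(0.0, noise_sd)
            for j in range(p)
        ]
        X.append(x)
    return X, VW, v


if __name__ == "__main__":
    X, VW, v = sample_globally_sparse_ppca(n=100, p=6, d=2, support={0, 3})
    for j, row in enumerate(VW):
        print(j, v[j], row)
```

In this simulation the variables outside the support carry only noise, which is why recovering the binary vector v amounts to selecting the relevant original variables.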

This approach may be particularly useful for genomic data, where identifying relevant genes is of great interest for understanding diseases. We applied GSPPCA to a breast cancer data set (n=334 patients, p=5391 genes). The figure below shows that the peak of the marginal likelihood corresponds to highly interpretable genes: more than 5% of the biological pathways in the Reactome family have a significant overlap with the genes selected by GSPPCA.

- C. Bouveyron, P. Latouche and P.-A. Mattei, Bayesian Variable Selection for Globally Sparse Probabilistic PCA, Preprint HAL n°01310409, Université Paris Descartes, 2016.

Model-based high-dimensional clustering and classification: applications in ecology

Model-based clustering is a popular tool, renowned for its probabilistic foundations and its flexibility. However, high-dimensional data are increasingly common and, unfortunately, classical model-based clustering techniques behave disappointingly in high-dimensional spaces. This is mainly because model-based clustering methods are dramatically over-parametrized in this case. Nevertheless, high-dimensional spaces have specific characteristics that are useful for clustering, and recent techniques exploit them. Subspace clustering methods assume that the data of each latent group live in a low-dimensional subspace. They thus combine the ideas of dimension reduction and parsimonious modeling. In particular, HDDC (high-dimensional data clustering) proposes a family of models allowing each group to have its own subspace, with its own intrinsic dimension.

As part of a collaboration with the Museum National d'Histoire Naturelle, we recently embedded HDDC in an audio data analysis workflow to automatically recover species in tropical forests. The method was tested in two distinct tropical environments in French Guiana, a lowland high rainforest (HF) and a rock savanna (RS), and we compared the automatic annotations with expert annotations. For both environments, the similarity between the manual and automated partitions was high and consistent (ARI > 0.75), indicating that the clusters found are intelligible and can be used for further analysis.
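For readers unfamiliar with the adjusted Rand index (ARI) used above to compare manual and automated partitions, the following sketch computes it from two label vectors using the standard contingency-table formula. It is a generic illustration with a hypothetical function name, not the code used in the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pair of items to compare
    # Pairs of items grouped together in both partitions.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each partition separately.
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(n, 2)  # value expected by chance
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (both - expected) / (maximum - expected)


if __name__ == "__main__":
    expert = [0, 0, 1, 2]
    automatic = [0, 0, 1, 1]
    print(adjusted_rand_index(expert, automatic))
```

The index equals 1 when the two partitions agree up to a relabeling of the clusters and is close to 0 when the agreement is what chance alone would produce.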

- C. Bouveyron and C. Brunet, Model-based clustering of high-dimensional data: A review, Computational Statistics and Data Analysis, vol. 71, pp. 52-78, 2014.
- J. Ulloa, T. Aubin, D. Llusia, C. Bouveyron and J. Sueur, Measuring animal acoustic diversity in a tropical forest using unsupervised multiresolution analysis, preprint Université Paris Descartes, 2016.

Modeling and clustering of networks with texts: analysis of PubMed co-authorships on diabetes

Due to the significant increase in communications between individuals via social media (Facebook, Twitter, LinkedIn) or electronic formats (email, web, e-publication) in the past two decades, network analysis has become an unavoidable discipline. Unfortunately, existing random graph models have been designed to extract information from networks based on person-to-person links only, without taking into account the contents of the exchanges, in particular texts. To overcome this limitation, we introduced the stochastic topic block model (STBM), a probabilistic model for networks with textual edges. STBM discovers meaningful clusters of vertices that are coherent with respect to both the network interactions and the text contents. A classification variational expectation-maximization algorithm has been proposed to perform inference.

In a medical context, STBM has been used by doctors and biologists for scientific monitoring on PubMed. We present below a result obtained when analyzing all articles from PubMed on diabetes published between 2008 and 2016. The data (author names, dates of publications, journal names, abstracts, ...) were extracted through the PubMed API. The data set consists of 963 articles by 4658 authors. The final network has 778 authors with at least 2 articles, and STBM identifies 9 groups of authors working on 7 research topics. The figure below shows the co-authorship network between the authors: node colors indicate the group memberships of authors, whereas edge colors refer to the research topics of collaborations.
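The preprocessing described above (keeping authors with at least 2 articles and linking co-authors through the texts they share) can be sketched as follows. The record layout, the function name and the choice of attaching each abstract to every pair of retained co-authors are assumptions made for illustration; the page does not specify how the edges and their texts were actually built.

```python
from collections import Counter
from itertools import combinations


def build_coauthor_network(articles, min_articles=2):
    """Build a co-authorship network with textual edges.

    `articles` is a list of dicts with keys "authors" (list of names) and
    "abstract" (str); this layout is hypothetical. Returns the retained
    authors and a dict mapping each author pair to the abstracts they share.
    """
    # Number of articles per author, counting each article once per author.
    counts = Counter(a for art in articles for a in set(art["authors"]))
    kept = {a for a, c in counts.items() if c >= min_articles}
    edges = {}
    for art in articles:
        authors = sorted(set(art["authors"]) & kept)
        # Assumed rule: one undirected edge per pair of retained co-authors.
        for pair in combinations(authors, 2):
            edges.setdefault(pair, []).append(art["abstract"])
    return kept, edges


if __name__ == "__main__":
    toy = [
        {"authors": ["A", "B", "C"], "abstract": "text 1"},
        {"authors": ["A", "B"], "abstract": "text 2"},
        {"authors": ["C", "D"], "abstract": "text 3"},
    ]
    nodes, edges = build_coauthor_network(toy)
    print(sorted(nodes))
    print(edges)
```

A model such as STBM would then take the vertex set and the texts carried by the edges as input; the clustering step itself is not shown here.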

We are also currently exploring, with Institut Curie, the possibility of using STBM to analyze patient files (surgery reports, physical examination reviews, letters, ...) in order to improve the diagnosis and prognosis of patient diseases.

- C. Bouveyron and P. Latouche, Des réseaux, des textes et de la Statistique, La lettre de l'INSMI, CNRS, December edition, 2016.
- C. Bouveyron, P. Latouche and R. Zreik, The Stochastic Topic Block Model for the Clustering of Networks with Textual Edges, Statistics and Computing, in press, 2017.