Unsupervised learning techniques detect clinically relevant structure in human gut microbiota
Author(s): Himmi Lindgren, Leo M Lahti, Aki Havulinna, Teemu Niiranen, Rob Knight, Guillaume Meric
Affiliation(s): Department of Computing, University of Turku, Turku, Finland
Unsupervised learning techniques can detect clinically relevant structure in population cohort data on the human gut microbiota. While gut microbiota composition is influenced by individual factors such as diet, medication, and the development of the immune system during early childhood, it has been proposed that individuals maintain a relatively stable microbiota ecosystem throughout adulthood. This stability makes it possible to divide individuals into subgroups based on their gut microbiota characteristics, which define the key features of microbiota community types within the population. To this end, we compared three unsupervised learning techniques (non-negative matrix factorization and two topic modelling techniques, Dirichlet Multinomial Mixtures and Latent Dirichlet Allocation) against a naive benchmark, using the strength of association with all-cause mortality as a quantitative metric, to distinguish biologically relevant structure in a large Finnish population cohort, FINRISK, with almost 18 years of follow-up. The techniques identify microbiota assemblages either as discrete enterotypes, which assign each sample to a single community type, or as continuous enterosignatures, which identify patterns of co-occurrence of microbiota community types within each sample. The mia package in Bioconductor, which builds on the TreeSummarizedExperiment infrastructure, provided tools for microbiome analysis. We found five relatively robust community types, characterized by the bacterial genera Bacteroides, Alistipes, Agathobacter, Escherichia, and Prevotella. In Cox regression, Latent Dirichlet Allocation detected the strongest early mortality signal, outperforming all other techniques. The replicability of Latent Dirichlet Allocation was assessed using cross-validation. The predicted community types revealed an ecological landscape similar to that of the community types obtained from the entire data set, confirming the clinical relevance, robustness, and scalability of the technique.
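The distinction between discrete enterotypes and continuous enterosignatures can be illustrated with a minimal sketch in Python. This is not the authors' pipeline, which relies on the mia package in Bioconductor. It assumes a hypothetical toy matrix of relative abundances (six samples, four taxa, two community types) and uses a plain implementation of non-negative matrix factorization with the standard multiplicative update rules. The names `nmf`, `community_memberships`, and `toy` are introduced here for illustration only.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def nmf(x, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate non-negative x (samples x taxa) as w @ h with k components.

    Uses multiplicative updates for the squared-error (Frobenius) loss.
    """
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # h <- h * (w^T x) / (w^T w h)
        wt = transpose(w)
        num = matmul(wt, x)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # w <- w * (x h^T) / (w h h^T)
        ht = transpose(h)
        num = matmul(x, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def community_memberships(w, h):
    """Derive continuous and discrete community assignments from w and h."""
    # Rescale so that each community profile (row of h) sums to 1,
    # which makes the sample weights comparable across components.
    scales = [sum(row) for row in h]
    h_norm = [[v / s for v in row] for row, s in zip(h, scales)]
    w_scaled = [[v * s for v, s in zip(row, scales)] for row in w]
    # Continuous view: each sample is a mixture of community types.
    signatures = [[v / sum(row) for v in row] for row in w_scaled]
    # Discrete view: each sample is assigned to its dominant community type.
    enterotypes = [max(range(len(row)), key=row.__getitem__)
                   for row in signatures]
    return signatures, enterotypes, h_norm


# Hypothetical relative abundances (rows: samples, columns: taxa).
toy = [
    [0.70, 0.25, 0.03, 0.02],
    [0.65, 0.30, 0.02, 0.03],
    [0.60, 0.30, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.02, 0.03, 0.55, 0.40],
    [0.03, 0.02, 0.45, 0.50],
]

w, h = nmf(toy, k=2)
signatures, enterotypes, h_norm = community_memberships(w, h)

for weights, label in zip(signatures, enterotypes):
    print([round(v, 2) for v in weights], "-> community type", label)
```

In this sketch, each row of `signatures` plays the role of a continuous enterosignature (mixture weights over community types), while `enterotypes` collapses that row to a single label. Taking the dominant component is only one simple way to obtain a discrete assignment. Dirichlet Multinomial Mixtures, by contrast, model discrete cluster membership directly.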