Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data

Cantor, E; Guauque-Olarte, S; León, R; Chabert, S; Salas, R

Keywords: protein-protein interaction, feature selection, prior knowledge, RNA-seq, high-dimensional, random forest, explainability, gene selection

Abstract

The use of prior knowledge in the machine learning framework has been considered a potential tool to handle the curse of dimensionality in genetic and genomic data. Although random forest (RF) is a flexible non-parametric approach with several advantages, it can provide poor accuracy in high-dimensional settings, mainly in scenarios with small sample sizes. We propose a knowledge-slanted RF that integrates biological networks as prior knowledge into the model to improve its performance and explainability, and we illustrate its use for selecting and identifying relevant genes. Knowledge-slanted RF combines two stages. First, prior knowledge represented by graphs is translated by running a random walk with restart algorithm to determine the relevance of each gene based on its connections and location in a protein-protein interaction network. Then, each relevance is used to modify the probability of drawing a gene as a candidate split feature in the conventional RF. Experiments on simulated datasets with very small sample sizes (n ≤ 30), comparing knowledge-slanted RF against conventional RF and logistic lasso regression, suggest improved precision in outcome prediction. The knowledge-slanted RF was completed with the introduction of a modified version of the Boruta feature selection algorithm. Finally, knowledge-slanted RF identified more biologically relevant genes, offering a higher level of explainability for users than conventional RF. These findings were corroborated in one real case study identifying genes relevant to calcific aortic valve stenosis. © The Author(s) 2024.
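
The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the restart probability of 0.3, the five-gene toy network, the choice of seed gene, and the direct use of normalised relevance scores as sampling probabilities are all assumptions made for illustration.

```python
import numpy as np


def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Stage 1 (sketch): score each node by its steady-state visiting probability.

    `restart` and `seeds` are illustrative assumptions, not values from the paper.
    """
    A = np.asarray(adjacency, dtype=float)
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0           # isolated nodes: avoid division by zero
    W = A / col_sums                        # column-normalised transition matrix
    p0 = np.zeros(A.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)      # the walk restarts at the seed genes
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (W @ p) + restart * p0
        converged = np.abs(p_next - p).sum() < tol
        p = p_next
        if converged:
            break
    return p / p.sum()                      # relevance scores that sum to 1


def draw_candidate_features(relevance, mtry, rng):
    """Stage 2 (sketch): draw `mtry` candidate split features without replacement,
    with probability proportional to relevance instead of uniformly."""
    probs = np.asarray(relevance, dtype=float)
    probs = probs / probs.sum()
    return rng.choice(len(probs), size=mtry, replace=False, p=probs)


# Hypothetical toy network: five genes connected in a chain 0-1-2-3-4.
toy_network = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
relevance = random_walk_with_restart(toy_network, seeds=[0])
rng = np.random.default_rng(0)
candidates = draw_candidate_features(relevance, mtry=2, rng=rng)
```

In this toy setting, genes closer to the seed receive higher relevance and are therefore proposed as split candidates more often; a conventional RF corresponds to the special case of equal relevance for every gene.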

More information

Title according to WOS: Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data
Title according to SCOPUS: Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data
Journal title: BioData Mining
Volume: 17
Issue: 1
Publisher: BIOMED CENTRAL LTD
Publication date: 2024
Language: English
DOI: 10.1186/s13040-024-00388-8

Notes: ISI, SCOPUS