Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data
Keywords: protein-protein interaction, feature selection, prior knowledge, RNA-seq, high-dimensional, random forest, explainability, gene selection
Abstract
The use of prior knowledge in the machine learning framework has been considered a potential tool to handle the curse of dimensionality in genetic and genomic data. Although random forest (RF) is a flexible non-parametric approach with several advantages, it can provide poor accuracy in high-dimensional settings, mainly in scenarios with small sample sizes. We propose a knowledge-slanted RF that integrates biological networks as prior knowledge into the model to improve its performance and explainability, exemplifying its use for selecting and identifying relevant genes. Knowledge-slanted RF combines two stages. First, prior knowledge represented by graphs is translated by running a random walk with restart algorithm to determine the relevance of each gene based on its connectivity and location in a protein-protein interaction network. Then, each relevance is used to modify the probability of drawing a gene as a candidate split feature in the conventional RF. Experiments on simulated datasets with very small sample sizes (n ≤ 30), comparing knowledge-slanted RF against conventional RF and logistic lasso regression, suggest improved precision in outcome prediction. The knowledge-slanted RF was completed with the introduction of a modified version of the Boruta feature selection algorithm. Finally, knowledge-slanted RF identified more biologically relevant genes, offering a higher level of explainability for users than conventional RF. These findings were corroborated in a real case study identifying genes relevant to calcific aortic valve stenosis. © The Author(s) 2024.
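The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy network, the seed gene, the restart probability of 0.3, the small probability floor `eps`, and the function names `random_walk_with_restart` and `sample_candidate_features` are all assumptions made for illustration.

```python
import numpy as np


def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W @ p + r * p0 until the vector stops changing."""
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0        # guard against division by zero
    W = adj / col_sums                   # column-normalised transition matrix
    p0 = np.zeros(adj.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)   # the walk restarts at the seed genes
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (W @ p) + restart * p0
        converged = np.abs(p_next - p).sum() < tol
        p = p_next
        if converged:
            break
    return p


def sample_candidate_features(relevance, mtry, rng, eps=1e-6):
    """Draw mtry distinct genes, each with probability proportional to relevance."""
    weights = np.asarray(relevance, dtype=float) + eps  # assumed floor: every gene stays drawable
    probs = weights / weights.sum()
    return rng.choice(len(probs), size=mtry, replace=False, p=probs)


# Hypothetical toy network: five genes connected in a chain 0-1-2-3-4.
adj = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

relevance = random_walk_with_restart(adj, seeds=[0])   # gene 0 is an assumed seed
rng = np.random.default_rng(0)
candidates = sample_candidate_features(relevance, mtry=2, rng=rng)
print(relevance)
print(candidates)
```

In a conventional RF the candidate split features at each node are drawn uniformly; here genes closer to the seed in the network are drawn more often, which is the sense in which the forest is slanted toward prior knowledge.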
More information
| Title according to WOS: | Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data |
| Title according to SCOPUS: | Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data |
| Journal title: | BioData Mining |
| Volume: | 17 |
| Issue: | 1 |
| Publisher: | BIOMED CENTRAL LTD |
| Publication date: | 2024 |
| Language: | English |
| DOI: | 10.1186/s13040-024-00388-8 |
| Notes: | ISI, SCOPUS |