Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data
Keywords: protein-protein interaction, feature selection, prior knowledge, RNA-seq, high-dimensional, random forest, explainability, gene selection
Abstract
The use of prior knowledge in the machine learning framework has been considered a potential tool to handle the curse of dimensionality in genetic and genomic data. Although random forest (RF) is a flexible non-parametric approach with several advantages, it can provide poor accuracy in high-dimensional settings, mainly in scenarios with small sample sizes. We propose a knowledge-slanted RF that integrates biological networks as prior knowledge into the model to improve its performance and explainability, exemplifying its use for selecting and identifying relevant genes. Knowledge-slanted RF combines two stages. First, prior knowledge represented by graphs is translated by running a random walk with restart algorithm to determine the relevance of each gene based on its connections and location in a protein-protein interaction network. Then, each relevance is used to modify the probability of drawing a gene as a candidate split feature in the conventional RF. Experiments on simulated datasets with very small sample sizes (n ≤ 30), comparing knowledge-slanted RF against conventional RF and logistic lasso regression, suggest improved precision in outcome prediction relative to the other methods. The knowledge-slanted RF was completed with the introduction of a modified version of the Boruta feature selection algorithm. Finally, knowledge-slanted RF identified more biologically relevant genes, offering a higher level of explainability for users than conventional RF. These findings were corroborated in one real case, identifying genes relevant to calcific aortic valve stenosis.
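The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy network, the seed gene, the restart probability of 0.3, the smoothing constant `eps`, and the function names `random_walk_with_restart` and `sample_candidate_features` are all assumptions made for illustration. The first function computes a relevance score per gene on a small network, and the second draws candidate split features with probability proportional to that relevance instead of uniformly.

```python
import numpy as np


def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 until the L1 change falls below tol."""
    A = np.asarray(adjacency, dtype=float)
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # leave isolated nodes as zero columns
    W = A / col_sums                       # column-normalized transition matrix
    p0 = np.zeros(A.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)     # the walk restarts at the seed genes
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (W @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


def sample_candidate_features(relevance, mtry, rng, eps=1e-6):
    """Draw mtry distinct features, weighted by relevance (eps is an assumed smoothing)."""
    weights = np.asarray(relevance, dtype=float) + eps  # keep every gene selectable
    probs = weights / weights.sum()
    return rng.choice(len(probs), size=mtry, replace=False, p=probs)


# Hypothetical toy network: five genes connected in a chain 0-1-2-3-4.
adjacency = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

relevance = random_walk_with_restart(adjacency, seeds=[0])
rng = np.random.default_rng(0)
candidates = sample_candidate_features(relevance, mtry=2, rng=rng)
print(relevance)
print(candidates)
```

In this sketch, genes closer to the seed receive higher relevance and are therefore drawn more often as split candidates at each tree node. How seeds are chosen and how relevance is mapped to selection probability in the published method is not specified in the abstract.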
More information
| Title according to WOS: | Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data |
| Journal title: | BIODATA MINING |
| Volume: | 17 |
| Issue: | 1 |
| Publisher: | BMC |
| Publication date: | 2024 |
| Language: | English |
| DOI: | 10.1186/s13040-024-00388-8 |
| Notes: | ISI |