RUDEUS: A Machine Learning Classification System to Study DNA-Binding Proteins

David Medina-Ortiz; Gabriel Cabas-Mora

Keywords: DNA-Binding Proteins, Single-Stranded and Double-Stranded DNA, Machine Learning, Protein Language Models.

Abstract

DNA-binding proteins play crucial roles in biological processes such as replication, transcription, packaging, and chromatin remodeling. Their study has gained importance across scientific fields, with computational biology complementing traditional methods. While machine learning has advanced bioinformatics, generalizable pipelines for identifying DNA-binding proteins and their specific interactions remain scarce. We present RUDEUS, a Python library with hierarchical classification models to identify DNA-binding proteins and distinguish between single- and double-stranded DNA interactions. RUDEUS integrates protein language models, supervised learning, and Bayesian optimization, achieving 95% precision in DNA-binding identification and 89% accuracy in distinguishing interaction types. The library also includes tools for annotating unknown sequences and validating DNA-protein interactions through molecular docking. RUDEUS delivers competitive performance and is easily integrated into protein engineering workflows. It is released under the MIT License, with the source code and models available in the GitHub repository https://github.com/ProteinEngineering-PESB2/RUDEUS.
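
The hierarchical scheme described in the abstract can be illustrated with a minimal sketch. This is not the RUDEUS API: the function name `hierarchical_predict`, the label strings, and the toy threshold "models" are all hypothetical, and the embeddings are made-up numbers standing in for protein language model outputs. The sketch only shows the control flow of a two-stage classifier, in which the single- versus double-stranded decision is made only for sequences first predicted to bind DNA.

```python
from typing import Callable, Sequence

Embedding = Sequence[float]


def hierarchical_predict(
    embedding: Embedding,
    binds_dna: Callable[[Embedding], bool],
    is_single_stranded: Callable[[Embedding], bool],
) -> str:
    """Two-stage decision: stage 2 runs only when stage 1 is positive."""
    # Stage 1: is this a DNA-binding protein at all?
    if not binds_dna(embedding):
        return "non-DNA-binding"
    # Stage 2: which kind of DNA does it interact with?
    if is_single_stranded(embedding):
        return "ssDNA-binding"
    return "dsDNA-binding"


# Toy stand-ins for trained classifiers (hypothetical rules, illustration only).
def toy_binds_dna(embedding: Embedding) -> bool:
    return sum(embedding) > 1.0


def toy_is_single_stranded(embedding: Embedding) -> bool:
    return embedding[0] > embedding[1]


# Made-up two-dimensional "embeddings"; real ones have many more dimensions.
examples = [(0.1, 0.2), (0.9, 0.4), (0.3, 1.2)]
labels = [
    hierarchical_predict(e, toy_binds_dna, toy_is_single_stranded)
    for e in examples
]
print(labels)
```

In practice, the two callables would be replaced by trained supervised models operating on real sequence embeddings; passing them in as arguments keeps the two stages independently replaceable.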

More information

Publication date: 2024
First page: 302
Last page: 310