Exploring Machine Learning Algorithms and Protein Language Models Strategies to Develop Enzyme Classification Systems

Fernández, Diego; Olivera-Nappa, Alvaro; Uribe-Paredes, Roberto; Medina-Ortiz, David


Discovering the functions of uncharacterized enzymes is one of the most common tasks in bioinformatics. Functional annotation methods based on phylogenetic properties have been the gold standard in genome annotation. However, these methods succeed only when the minimum requirements for establishing similarity or homology are met. As an alternative, machine learning and deep learning methods have proven helpful, yielding functional classification systems for various bioinformatics tasks. Nevertheless, a clear strategy is still lacking for building predictive models and for representing amino acid sequences numerically. In this work, we address the functional classification of enzyme sequences (EC number) with machine learning methods, exploring several alternatives for training predictive models and for numerical representation. The results show that the best performance is achieved with representations based on pre-trained models. A clear strategy for training the models is still needed, however. Across the alternatives explored, the CNN-based architectures proposed in this work learn and extract patterns from complex systems more readily, achieving performance above 97% and a binary cross-entropy loss below 0.05. Finally, we discuss the strategies explored and outline future work toward integrated methods for functional classification and the discovery of new enzymes, in support of current bioinformatics tools.

More information

SCOPUS ID: 85164944045
Journal title: Lecture Notes in Computer Science
Volume: 13919 LNBI
Publisher: Springer
Publication date: 2023
First page: 307
Last page: 319