De-MISTED: Image-based classification of erroneous multiple sequence alignments using convolutional neural networks

Khodji, Hiba; Collet, Pierre; Thompson, Julie D.; Jeannin-Girardon, Anne

Abstract

The widespread use of high-throughput genome sequencing technologies has resulted in a significant increase in the number of available sequences, creating new challenges for genome annotation and prediction of protein-coding genes in terms of error detection and quality control. Multiple Sequence Alignments (MSAs) of the predicted protein sequences provide important contextual information that can be used to distinguish errors (caused by artifacts in the raw genome data, badly predicted gene sequences, or the alignment methods themselves) from true biological events. This can be achieved either by human expertise or by statistical analysis of the sequence data. Here, we propose a new approach that uses visual representations of MSAs as inputs for Convolutional Neural Networks (CNNs) to classify MSAs into erroneous and non-erroneous categories. The MSAs are extracted from a unique in-house dataset in which errors are carefully identified. Our model, called De-MISTED (Deep learning for MultIple Sequence alignmenTs Error Detection), identifies MSAs containing erroneous sequences with high accuracy (87%) and sensitivity (92%). Visual explanation techniques show that our model correctly identifies the position of multiple errors of different types (insertions, deletions and mismatches). Close examination of the data shows that our model can also identify errors that were not previously annotated in the data. The De-MISTED method thus contributes to a more robust exploitation of genome data.
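As a rough illustration of what a "visual representation" of an MSA could look like before it reaches a CNN, the sketch below maps each aligned residue to an integer pixel value. The abstract does not specify the encoding used by De-MISTED, so the function name `msa_to_grid`, the residue ordering, and the value assigned to gaps and unknown symbols are all assumptions made for this example only.

```python
# Hypothetical residue-to-pixel mapping; the paper's actual colour scheme
# is not described in the abstract, so this ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP = "-"


def msa_to_grid(sequences):
    """Turn equal-length aligned sequences into a 2D grid of integer pixel values."""
    if not sequences:
        raise ValueError("alignment is empty")
    width = len(sequences[0])
    if any(len(seq) != width for seq in sequences):
        raise ValueError("aligned sequences must all have the same length")

    # 0 = gap, 1..20 = standard amino acids, 21 = any other symbol (e.g. X)
    code = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
    code[GAP] = 0
    unknown = len(AMINO_ACIDS) + 1

    # One row per sequence, one column per alignment position
    return [[code.get(ch, unknown) for ch in seq.upper()] for seq in sequences]


# Toy alignment, invented for illustration (not taken from the paper's dataset)
toy_msa = ["MKV-LA", "MKVQLA", "M--QLX"]
grid = msa_to_grid(toy_msa)
```

Each row of `grid` corresponds to one sequence and each column to one alignment position, so the result can be treated as a single-channel image; in such a picture, gaps and mismatched columns appear as local pattern breaks, which is the kind of signal a convolutional model is suited to pick up.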

More information

WOS ID: WOS:000931766000005 (title not found in local WOS database)
Journal title: APPLIED INTELLIGENCE
Volume: 53
Issue: 15
Publisher: Springer
Publication date: 2023
Start page: 18806
End page: 18820
DOI: 10.1007/s10489-022-04390-7

Notes: ISI