De Novo Mutation Detection for Long-Read Sequencing: Adapting DeNovoCNN to SMRT Data

Issue Date

2023-05-30

Language

en

Abstract

Given the sheer size of the human genome, finding particular variants manually is infeasible, so many tools exist to automate the task. De novo mutations (DNMs), genetic variants that arise in a child and are absent from both parents, have been shown to cause many of the most severe genetic disorders. The current best-performing model for DNM detection is DeNovoCNN, which predicts whether a DNM is present based on an image encoding of genetic data. However, a newer type of genome sequencing, called “long-read sequencing” (LRS), is quickly becoming the standard, and many tools have to be adapted to its properties. This research confirms that the original DeNovoCNN model performs worse on LRS data. Furthermore, a labelled dataset of LRS trios was created and used to train two DNM-calling models. The first model was retrained from the original DeNovoCNN, keeping the same architecture and encoding. The second model used a completely new encoding, developed specifically for LRS data, and an altered model architecture. This specialized encoding is better suited to the properties of LRS data and, crucially, incorporates phasing data. The results show that creating a high-performing DNM classifier is likely possible, but that larger and more complete datasets are necessary to achieve this. In the future, when more data becomes available, following the methodology of this research will probably yield an accurate model that makes DNM detection easy for any patient.

Faculty

Faculteit der Sociale Wetenschappen