De Novo Mutation Detection for Long-Read Sequencing: Adapting DeNovoCNN to SMRT Data
Issue Date
2023-05-30
Language
en
Abstract
Given the sheer size of the human genome, finding particular variants by hand is infeasible, so
many tools exist to automate this task. De novo mutations (DNMs) are one type of genetic
variant and have been shown to cause many of the most severe genetic disorders. The current
best-performing model for DNM detection is DeNovoCNN, which predicts whether a DNM is present
from an image encoding of genetic data. However, a new type of genome sequencing, called
“long-read sequencing” (LRS), is quickly becoming the standard, and many tools have to be
adapted to its properties. This research substantiates that the original DeNovoCNN model
performs worse on LRS data. Furthermore, a labelled dataset of LRS trios was created and used
to train two DNM-calling models. The first model was retrained from the original DeNovoCNN,
keeping the same architecture and encoding. The second model used a completely new encoding,
developed specifically for LRS data, and an altered model architecture. This specialized
encoding is better suited to the properties of LRS data and, crucially, incorporates phasing
data. The results show that creating a high-performing DNM classifier is likely possible, but
that larger and more complete datasets are necessary to achieve this. In the future, when more
data becomes available, following the methodology of this research will probably yield an
accurate model that makes DNM detection easy for any patient.
Faculty
Faculteit der Sociale Wetenschappen