Automatic Subtitle Generation for Dutch TV Content

Issue Date

2022-05-03

Language

en

Abstract

Subtitles are a necessary medium of communication for those who are hearing impaired. To develop methods that make these subtitles easier to create, this study investigates the relatively unexplored field of automatically generating subtitles for Dutch TV content. We study and implement three modules: speech recognition, punctuation restoration, and subtitle segmentation, which together form a pipeline for the automatic generation of subtitles. We implement, optimize, and evaluate state-of-the-art methods for each of these modules to provide a clear overview of the available techniques and their performance. To realize this, a representative, labelled speech dataset of fragments extracted from a Dutch TV show was created, alongside multiple subtitle-based datasets and language models. The pipeline consisting of the best-performing model for each module is implemented and evaluated by human annotators. Our contribution is a full-fledged pipeline to automatically create subtitles for Dutch TV content based on open-source models, as well as a framework to stimulate further research on the individual modules and on subtitle generation in general.

Keywords: Automatic Speech Recognition, Subtitle Generation, Punctuation Restoration, Subtitle Segmentation

Faculty

Faculteit der Sociale Wetenschappen