dMel: Speech Tokenization made Simple

Bai, He; Likhomanenko, Tatiana; Zhang, Ruixiang; Gu, Zijin; Aldeneh, Zakaria; Jaitly, Navdeep

arXiv:2407.15835 (cs)

[Submitted on 22 Jul 2024 (v1), last revised 2 Oct 2024 (this version, v2)]

Abstract: Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated complicated speech tokenization methods to discretize continuous speech signals so that language modeling techniques can be applied to speech data. However, existing approaches either model semantic (content) tokens, potentially losing acoustic information, or model acoustic tokens, risking the loss of semantic (content) information. Having multiple token types also complicates the architecture and requires additional pretraining. Here we show that discretizing mel-filterbank channels into discrete intensity bins produces a simple representation (dMel) that performs better than other existing speech tokenization methods. Using an LM-style transformer architecture for speech-text modeling, we comprehensively evaluate different speech tokenization methods on speech recognition (ASR) and speech synthesis (TTS). Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework, paving the way for efficient and effective joint modeling of speech and text.
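
The core idea in the abstract, mapping each mel-filterbank channel value to one of a small number of intensity bins, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the uniform bin spacing, the default of 16 bins, and the clipping range [-7, 2] are all assumptions made here for illustration, since the abstract does not specify them. The input is assumed to be a log-mel spectrogram already computed elsewhere, given as a list of frames with one float per channel.

```python
from typing import List, Sequence


def quantize_channels(
    log_mel: Sequence[Sequence[float]],
    num_bins: int = 16,   # assumed value; not stated in the abstract
    lo: float = -7.0,     # assumed lower edge of the intensity range
    hi: float = 2.0,      # assumed upper edge of the intensity range
) -> List[List[int]]:
    """Map every (frame, channel) value to an integer bin index in [0, num_bins)."""
    if num_bins < 2 or hi <= lo:
        raise ValueError("need num_bins >= 2 and hi > lo")
    width = (hi - lo) / num_bins  # uniform bin width (an assumption)
    tokens = []
    for frame in log_mel:
        row = []
        for value in frame:
            clipped = min(max(value, lo), hi)        # clamp into [lo, hi]
            index = int((clipped - lo) / width)      # uniform binning
            row.append(min(index, num_bins - 1))     # value == hi goes to the top bin
        tokens.append(row)
    return tokens


def dequantize_channels(
    tokens: Sequence[Sequence[int]],
    num_bins: int = 16,
    lo: float = -7.0,
    hi: float = 2.0,
) -> List[List[float]]:
    """Approximate inverse: replace each bin index with the centre of its bin."""
    width = (hi - lo) / num_bins
    return [[lo + (i + 0.5) * width for i in frame] for frame in tokens]
```

Under these assumptions each frame becomes a vector of small integers, one per channel, and the reconstruction error per value is bounded by half a bin width for inputs inside the clipping range.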

Comments:	under review
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2407.15835 [cs.CL]
	(or arXiv:2407.15835v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.15835

Submission history

From: He Bai
[v1] Mon, 22 Jul 2024 17:51:53 UTC (2,165 KB)
[v2] Wed, 2 Oct 2024 20:38:27 UTC (2,710 KB)
