Music Source Separation in the Waveform Domain

Défossez, Alexandre; Usunier, Nicolas; Bottou, Léon; Bach, Francis


arXiv:1911.13254v1 (cs)

[Submitted on 27 Nov 2019 (this version), latest version 28 Apr 2021 (v2)]

Title: Music Source Separation in the Waveform Domain

Authors: Alexandre Défossez (FAIR, SIERRA, PSL), Nicolas Usunier (FAIR), Léon Bottou (FAIR), Francis Bach (DI-ENS, PSL, SIERRA)


Abstract: Source separation for music is the task of isolating contributions, or stems, from different instruments recorded individually and arranged together to form a song. Such components include voice, bass, drums and any other accompaniments. Contrary to many audio synthesis tasks, where the best performance is achieved by models that directly generate the waveform, the state of the art in source separation for music is to compute masks on the magnitude spectrum. In this paper, we first show that an adaptation of Conv-Tasnet (Luo & Mesgarani, 2019), a waveform-to-waveform model for source separation for speech, significantly beats the state of the art on the MusDB dataset, the standard benchmark of multi-instrument source separation. Second, we observe that Conv-Tasnet follows a masking approach on the input signal, which has the potential drawback of removing parts of the relevant source without the capacity to reconstruct them. We propose Demucs, a new waveform-to-waveform model, which has an architecture closer to models for audio generation, with more capacity on the decoder. Experiments on the MusDB dataset show that Demucs beats previously reported results in terms of signal-to-distortion ratio (SDR), but scores lower than Conv-Tasnet. Human evaluations show that Demucs has significantly higher quality (as assessed by mean opinion score) than Conv-Tasnet, but slightly more contamination from other sources, which explains the difference in SDR. Additional experiments with a larger dataset suggest that the gap in SDR between Demucs and Conv-Tasnet shrinks, showing that our approach is promising.
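To make the SDR comparison above concrete, here is a minimal sketch of a plain signal-to-distortion ratio, defined as 10·log10(‖s‖² / ‖s − ŝ‖²) for a reference source s and an estimate ŝ. This simplified definition is an assumption for illustration: the abstract does not say how SDR is computed, and benchmark evaluations on MusDB generally rely on the more involved BSS Eval family of metrics. The function name `simple_sdr`, the `eps` parameter, and the synthetic test tone are all hypothetical.

```python
import math
import random


def simple_sdr(reference, estimate, eps=1e-10):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Illustrative only; not the BSS Eval SDR used in published benchmarks.
    eps guards against division by zero and log of zero.
    """
    signal_power = sum(r * r for r in reference)
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_power + eps) / (error_power + eps))


# Toy data (an assumption, not from the paper): one second of a 440 Hz tone
# sampled at 8 kHz stands in for a reference stem.
random.seed(0)
sample_rate = 8000
source = [math.sin(2 * math.pi * 440.0 * n / sample_rate)
          for n in range(sample_rate)]

# Two hypothetical "estimates": the reference plus light or heavy Gaussian noise.
light = [s + random.gauss(0.0, 0.01) for s in source]
heavy = [s + random.gauss(0.0, 0.1) for s in source]

sdr_light = simple_sdr(source, light)
sdr_heavy = simple_sdr(source, heavy)
print(f"light distortion: {sdr_light:.1f} dB")
print(f"heavy distortion: {sdr_heavy:.1f} dB")
```

A higher value means the estimate is closer to the reference. Because this measure sums every kind of error together, it cannot by itself distinguish artifacts from contamination by other sources, which is one reason the abstract pairs SDR with human evaluations.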

Subjects:	Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
Cite as:	arXiv:1911.13254 [cs.SD]
	(or arXiv:1911.13254v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.1911.13254

Submission history

From: Alexandre Defossez [via CCSD proxy]
[v1] Wed, 27 Nov 2019 13:50:45 UTC (247 KB)
[v2] Wed, 28 Apr 2021 14:37:48 UTC (113 KB)
