Multi-view MidiVAE: Fusing Track- and Bar-view Representations for Long Multi-track Symbolic Music Generation

Submitted to ICASSP 2024


Zhiwei Lin, Jun Chen, Boshi Tang, Binzhu Sha, Jing Yang, Yaolong Ju, Fan Fan, Shiyin Kang, Zhiyong Wu, Helen Meng

Abstract

Variational Autoencoders (VAEs) are a crucial component of neural symbolic music generation, and several VAE-based works have yielded outstanding results and attracted considerable attention. Nevertheless, previous VAEs still suffer from overly long feature sequences, and their generated results lack contextual coherence, so the challenge of modeling long multi-track symbolic music remains unaddressed. To this end, we propose Multi-view MidiVAE, one of the first VAE methods to effectively model and generate long multi-track symbolic music. Multi-view MidiVAE uses a two-dimensional (2-D) representation, OctupleMIDI, to capture relationships among notes while reducing the feature sequence length. Moreover, it attends to instrumental characteristics and harmony, as well as global and local information about the musical composition, by employing a hybrid variational encoding-decoding strategy that integrates both Track- and Bar-view MidiVAE features. Objective and subjective experimental results on the CocoChorales dataset demonstrate that, compared to the baseline, Multi-view MidiVAE significantly improves the modeling of long multi-track symbolic music.


Fig. 1: Overall diagram of the proposed Multi-view MidiVAE. The model mainly consists of Track- and Bar-view encoders, a multi-view information fusion (MIF) module, Track- and Bar-view decoders, and an adaptive feature fusion (AFF) module.
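
The abstract credits the 2-D OctupleMIDI representation with shortening feature sequences. The sketch below illustrates the idea under stated assumptions: each note is one row of eight attributes (time signature, tempo, bar, position, instrument, pitch, duration, velocity), and the comparison target is a simplified, hypothetical REMI-like flat event stream, not the exact REMI+ vocabulary. The class and function names are illustrative and do not come from the authors' code.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class OctupleNote:
    """One note as an eight-attribute tuple (illustrative field names)."""
    time_signature: str
    tempo: int
    bar: int
    position: int
    instrument: int
    pitch: int
    duration: int
    velocity: int


def to_octuple_sequence(notes):
    """2-D encoding: one row of 8 attributes per note (N x 8)."""
    return [astuple(n) for n in notes]


def to_flat_events(notes):
    """Simplified REMI-like 1-D encoding (an assumption, not exact REMI+):
    a Bar token whenever the bar changes, then five tokens per note."""
    events = []
    current_bar = None
    for n in notes:
        if n.bar != current_bar:
            events.append(f"Bar_{n.bar}")
            current_bar = n.bar
        events += [
            f"Position_{n.position}",
            f"Instrument_{n.instrument}",
            f"Pitch_{n.pitch}",
            f"Duration_{n.duration}",
            f"Velocity_{n.velocity}",
        ]
    return events


notes = [
    OctupleNote("4/4", 120, 0, 0, 0, 60, 4, 80),
    OctupleNote("4/4", 120, 0, 4, 0, 62, 4, 80),
    OctupleNote("4/4", 120, 1, 0, 40, 55, 8, 70),
]
octuples = to_octuple_sequence(notes)
events = to_flat_events(notes)
print(len(octuples), len(events))  # prints "3 17"
```

In this toy case the sequence axis has 3 steps in the 2-D form versus 17 in the flat form; the per-note attributes move into the second dimension instead of lengthening the sequence.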

Latent Space Sampling Demos

MusicVAE with REMI+

case 1
case 2
case 3
case 4

MusicVAE with OctupleMIDI

case 1
case 2
case 3
case 4

Bar-view MidiVAE

case 1
case 2
case 3
case 4

Track-view MidiVAE

case 1
case 2
case 3
case 4

Multi-view MidiVAE

case 1
case 2
case 3
case 4

Latent Space Interpolation Demos
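
Each interpolation case moves from a Start piece to an End piece through the model's latent space. This page does not state which interpolation scheme was used, so the following is only a minimal sketch that assumes plain linear interpolation between two latent vectors. The function names and the toy 3-dimensional latents are hypothetical.

```python
def lerp(z_start, z_end, alpha):
    """Linear interpolation: alpha=0 gives z_start, alpha=1 gives z_end."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(z_start, z_end)]


def interpolation_path(z_start, z_end, steps):
    """Return `steps` evenly spaced latent vectors, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]


# Toy latent codes standing in for the encoded Start and End pieces.
z_start = [0.0, 1.0, -2.0]
z_end = [1.0, -1.0, 2.0]
path = interpolation_path(z_start, z_end, steps=5)

# In a real pipeline each vector in `path` would be passed to the
# model's decoder (not shown here) to obtain one intermediate piece.
for z in path:
    print(z)
```

Spherical interpolation is a common alternative for Gaussian latent spaces; the linear form is used here only because it is the simplest to verify.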

case 1

Start
End
MusicVAE with REMI+
MusicVAE with OctupleMIDI
Bar-view MidiVAE
Track-view MidiVAE
Multi-view MidiVAE (proposed)

case 2

Start
End
MusicVAE with REMI+
MusicVAE with OctupleMIDI
Bar-view MidiVAE
Track-view MidiVAE
Multi-view MidiVAE (proposed)