Samples
The following unseen samples illustrate that an explicit energy input enables effective control over the dynamics and volume of the synthesized singing voice.
sample 1
ground-truth + HiFi-GAN
baseline + HiFi-GAN
phoneme-level + HiFi-GAN
frame-level + HiFi-GAN
sample 2
ground-truth + HiFi-GAN
baseline + HiFi-GAN
phoneme-level + HiFi-GAN
frame-level + HiFi-GAN
sample 3
ground-truth + HiFi-GAN
baseline + HiFi-GAN
phoneme-level + HiFi-GAN
frame-level + HiFi-GAN
sample 4
ground-truth + HiFi-GAN
baseline + HiFi-GAN
phoneme-level + HiFi-GAN
frame-level + HiFi-GAN