Improving controllable text-to-video diffusion models

In this work, we explore different interpretations of ControlNet and how the conditional control it provides to image diffusion models can be extended to video diffusion models. We observe that Control-a-Video extends ControlNet using strategies that diverge from ControlNet's training procedure. We explore whether restructuring the training procedure to be more analogous to ControlNet's allows for a higher degree of controllability, and we introduce a way to train the model while maintaining the fast convergence found in Control-a-Video. We propose the following interpretations, each more analogous to ControlNet: (1) decomposing the video diffusion model training from the Video ControlNet training in Control-a-Video; (2) connecting a frozen image diffusion model as the foundation for a Video ControlNet, which we call VideoNet; and (3) training the entire VideoNet rather than only its temporal layers. We find that decomposing the training process produces higher-quality generations, that pairing an image diffusion model with a VideoNet speeds up training at the cost of sample quality, and that training all spatio-temporal layers of a Video ControlNet causes the samples to degenerate.
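The frozen-backbone pattern described above (a pretrained image diffusion model kept fixed while a control branch is trained, with zero-initialized output projections so training starts from the frozen model's behavior) can be sketched as follows. This is a minimal toy illustration, not the thesis's implementation: the class names, dimensions, and use of `nn.Linear` stand-ins for U-Net blocks are all hypothetical.

```python
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Stand-in for a pretrained image diffusion U-Net block (kept frozen)."""
    def __init__(self, dim):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x):
        return self.layer(x)

class VideoNetBranch(nn.Module):
    """Trainable control branch (hypothetical stand-in for a VideoNet).

    The output projection is zero-initialized, ControlNet-style, so at the
    start of training the branch contributes nothing and the combined model
    reproduces the frozen backbone exactly.
    """
    def __init__(self, dim):
        super().__init__()
        self.control = nn.Linear(dim, dim)
        self.zero_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, x, cond):
        return self.zero_proj(self.control(x + cond))

dim = 8
backbone = FrozenBackbone(dim)
for p in backbone.parameters():
    p.requires_grad_(False)  # freeze the image diffusion model
videonet = VideoNetBranch(dim)  # only this branch receives gradients

x = torch.randn(2, dim)     # noisy latent
cond = torch.randn(2, dim)  # control signal (e.g. an edge-map embedding)
out = backbone(x) + videonet(x, cond)  # residual control injection
```

Because the projection starts at zero, `out` initially equals the frozen backbone's output, which is one reason ControlNet-style training converges quickly rather than disrupting the pretrained model.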