Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

Zhang, David Junhao; Li, Dongxu; Le, Hung; Shou, Mike Zheng; Xiong, Caiming; Sahoo, Doyen

arXiv:2401.01827 (cs)

[Submitted on 3 Jan 2024]

Abstract: Most existing video diffusion models (VDMs) are conditioned on text alone and therefore offer little control over the visual appearance and geometric structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatial-temporal layers for representing video features and a decoupled cross-attention layer that handles image and text inputs for appearance conditioning. In addition, we carefully design the model architecture so that it can optionally integrate with pre-trained image ControlNet modules for geometric visual conditions, without the extra training overhead that prior methods require. Experiments show that, with its versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvements in visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation. Models will be made public on this https URL.
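
The abstract names a decoupled cross-attention layer but gives no formula for it. The sketch below shows one common reading of the term, offered as an assumption and not as the authors' implementation: the same video queries attend to the text tokens and to the image tokens in two separate attention passes, and the two results are summed. The function names, the `image_scale` weight, and the omission of learned query/key/value projections (tokens serve directly as keys and values) are all illustrative choices that do not come from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim_out = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim_out)]
        )
    return outputs


def decoupled_cross_attention(queries, text_tokens, image_tokens, image_scale=1.0):
    """Attend to text and image tokens separately, then sum the results.

    Assumption: learned projections are omitted, so each token list is used
    as both keys and values. `image_scale` is a hypothetical knob for the
    strength of the image branch.
    """
    text_out = attention(queries, text_tokens, text_tokens)
    image_out = attention(queries, image_tokens, image_tokens)
    return [
        [t + image_scale * i for t, i in zip(text_row, image_row)]
        for text_row, image_row in zip(text_out, image_out)
    ]


if __name__ == "__main__":
    video_queries = [[1.0, 0.0], [0.0, 1.0]]  # two toy video-feature queries
    text_tokens = [[1.0, 0.0], [0.0, 1.0]]    # toy text embeddings
    image_tokens = [[0.5, 0.5]]               # single toy image embedding
    print(decoupled_cross_attention(video_queries, text_tokens, image_tokens))
```

Keeping the two branches separate means the image branch can be scaled or switched off (`image_scale=0.0`) without touching the text branch, which is one plausible motivation for decoupling the two conditions.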

Comments:	project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2401.01827 [cs.CV]
	(or arXiv:2401.01827v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.01827

Submission history

From: Junhao Zhang
[v1] Wed, 3 Jan 2024 16:43:47 UTC (23,665 KB)
