Nvidia Cosmos | Text & Image to Video Creation

Experience Nvidia's newly released Cosmos models (7B and 14B) for video generation in ComfyUI. This workflow covers both text-to-video generation and image interpolation. For text-to-video, create fluid 121-frame videos from detailed text descriptions. For image-to-video, set a start_image and an end_image to generate a smooth transition between them. Thanks to its efficient VAE, Cosmos can process 1280x704 videos on 12GB GPUs and is reportedly about 50x more memory-efficient than alternatives. It suits both realistic and stylized animation, and at 121 frames every sequence is generated with motion.

Special thanks to Nvidia for releasing the Cosmos model family, and to the ComfyUI team for their excellent native implementation that makes this workflow possible.

ComfyUI Nvidia Cosmos Examples

[Four example outputs from the workflow: example_1 through example_4]

ComfyUI Nvidia Cosmos Text & Image to Video Workflow

What is the Nvidia Cosmos Workflow

Turn your ideas into fluid videos using the newly released Nvidia Cosmos models in ComfyUI. This workflow demonstrates the text-to-video and image-to-video generation features of Nvidia Cosmos. Using the 7B and 14B models, you can create high-quality videos from either text descriptions or still images, and the efficient video VAE keeps processing fast and memory use low.

Key Features of Nvidia Cosmos

Dual Generation Modes: Nvidia Cosmos offers both text-to-video and image-to-video generation
Guaranteed Motion: Always generates videos with movement when using 121 frames
Effective Negative Prompts: Because the model is not distilled, negative prompts give real control over the output
Flexible Image Control: Generate from the last frame or create transitions between images
Ultra-Efficient VAE: Nvidia Cosmos uses a memory-efficient VAE that can process 1280x704 videos on 12GB GPUs
High Resolution Support: Create videos at resolutions of 704x704 and above
Precise Frame Control: Optimized for 121-frame sequences
Smart Image Interpolation: Generate smooth transitions between reference images

How to Use the Nvidia Cosmos Workflow

The Nvidia Cosmos workflow contains two main parts: text-to-video and image-to-video generation. By default, the image-to-video group is bypassed. To switch between the two modes:

For text-to-video: Keep the image-to-video group bypassed (default setting)
For image-to-video: Right-click the image-to-video group and select Set Group Nodes to Always
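
If you prefer to edit a saved workflow file instead of using the canvas, the same switch can be sketched in Python. This is a minimal sketch, not the workflow's own tooling. It assumes the workflow is saved in ComfyUI's UI-format JSON, where each node carries a numeric mode field (assumed here to be 0 for Always and 4 for Bypass, following the LiteGraph convention). The helper name set_nodes_mode, the node IDs, and the miniature workflow are hypothetical.

```python
import copy

# Assumed LiteGraph mode values as stored in ComfyUI workflow JSON.
MODE_ALWAYS = 0
MODE_BYPASS = 4


def set_nodes_mode(workflow, node_ids, mode):
    """Return a copy of the workflow with the listed nodes set to `mode`."""
    updated = copy.deepcopy(workflow)
    wanted = set(node_ids)
    for node in updated.get("nodes", []):
        if node.get("id") in wanted:
            node["mode"] = mode
    return updated


# Hypothetical miniature workflow: node 1 belongs to the text-to-video part,
# nodes 2 and 3 stand in for the bypassed image-to-video group.
workflow = {
    "nodes": [
        {"id": 1, "type": "KSampler", "mode": MODE_ALWAYS},
        {"id": 2, "type": "LoadImage", "mode": MODE_BYPASS},
        {"id": 3, "type": "LoadImage", "mode": MODE_BYPASS},
    ]
}

# Enable image-to-video by setting the group's nodes to Always.
image_to_video = set_nodes_mode(workflow, [2, 3], MODE_ALWAYS)
print([node["mode"] for node in image_to_video["nodes"]])
```

The helper returns a modified copy so the default text-to-video configuration stays available for comparison.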

1. Text to Video Generation with Nvidia Cosmos

Setup and Requirements

Choose your preferred Nvidia Cosmos model size (7B recommended for starting)

Set resolution (Default 1280x704; minimum 704x704)
Frame settings:
- Length: 121 frames (The model performs optimally with a length of 121; deviating too much from this can result in subpar video quality.)
- Frame rate: 24.00 (default rate for optimal quality)
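
To make the frame settings concrete, here is a small sketch that checks a resolution against the 704x704 minimum and works out the clip duration from the frame count and frame rate. The function name check_video_settings and the warning text are illustrative assumptions, not part of ComfyUI.

```python
def check_video_settings(width, height, length=121, fps=24.0):
    """Check settings against the values listed above.

    Returns the clip duration in seconds and a list of warnings.
    """
    if width < 704 or height < 704:
        raise ValueError("resolution is below the 704x704 minimum")
    if fps <= 0:
        raise ValueError("frame rate must be positive")

    warnings = []
    if length != 121:
        # The guide notes that quality drops when length strays far from 121.
        warnings.append(f"length {length} differs from the recommended 121 frames")

    duration = length / fps
    return duration, warnings


duration, warnings = check_video_settings(1280, 704)
print(f"{duration:.2f} seconds")  # 121 frames / 24 fps
print(warnings)
```

With the defaults, 121 frames at 24 frames per second gives a clip of roughly five seconds.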

Sampling Parameters for Nvidia Cosmos

Sampler: res_multistep (Nvidia's recommended sampler for Cosmos)
Scheduler: karras (default for stability)
Steps: 20 (higher = better quality but slower; lower = faster but less detailed)
CFG: 6.5 (prompt guidance strength)
Denoise: 1.00 (1.00 = complete transformation; lower values keep more original content)
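
For readers who drive ComfyUI through its API-format workflow JSON, the sampling parameters above can be sketched as a KSampler node entry. This is an illustrative sketch: the helper name build_ksampler_node and the linked node IDs ("1" through "4") are hypothetical placeholders, and the sampler and scheduler names are simply the values listed above, assuming your ComfyUI version provides them.

```python
import json


def build_ksampler_node(seed, steps=20, cfg=6.5, denoise=1.0):
    """Build an API-format KSampler entry using the settings listed above."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if not 0.0 < denoise <= 1.0:
        raise ValueError("denoise must be in the range (0, 1]")
    return {
        "class_type": "KSampler",
        "inputs": {
            "seed": seed,
            "steps": steps,
            "cfg": cfg,
            "sampler_name": "res_multistep",
            "scheduler": "karras",
            "denoise": denoise,
            # Links use [node_id, output_index]; these IDs are hypothetical.
            "model": ["1", 0],
            "positive": ["2", 0],
            "negative": ["3", 0],
            "latent_image": ["4", 0],
        },
    }


node = build_ksampler_node(seed=42)
print(json.dumps(node, indent=2))
```

Raising steps trades speed for detail, and lowering denoise below 1.0 keeps more of the input content, so those two are exposed as arguments while the sampler and scheduler stay fixed.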

Prompting Tips for Nvidia Cosmos

Use detailed, multi-sentence prompts for better results
Include comprehensive negative prompts
Short prompts may generate coherent videos but might not strictly follow instructions