Sophisticated text-to-speech tool with advanced machine learning for high-quality voice synthesis and cloning.
CosyVoiceNode is a sophisticated tool designed to facilitate text-to-speech (TTS) synthesis and cross-lingual voice cloning. It leverages advanced machine learning models to generate high-quality speech from text inputs, making it an invaluable asset for AI artists looking to create realistic and expressive voiceovers. The node supports multiple inference modes, including zero-shot TTS, cross-lingual voice cloning, and instruction-based TTS, providing flexibility and versatility in various applications. By utilizing pre-trained models and a robust inference pipeline, CosyVoiceNode ensures that the generated speech is natural and coherent, enhancing the overall user experience.
tts_text: This parameter is the text input you want to convert into speech. It is required and serves as the primary content for the TTS process. The quality and clarity of the generated speech depend heavily on the text provided. There are no specific minimum or maximum lengths, but well-structured sentences are recommended for optimal results.
prompt_text: This optional parameter is used in the zero-shot and cross-lingual inference modes to provide additional context or style cues for the generated speech. It helps the model understand the desired tone, style, or specific characteristics of the speech output. Providing a relevant prompt can significantly enhance the naturalness and expressiveness of the generated voice.
speech: This parameter is used to input a reference speech sample, particularly in cross-lingual voice cloning mode. The reference speech helps the model capture the unique characteristics and nuances of the speaker's voice, enabling it to generate speech that closely mimics the reference. The input should be a high-quality audio sample for best results.
seed: This optional parameter sets the random seed for the inference process, ensuring reproducibility of the results. By specifying a seed value, you can generate consistent outputs across different runs. The default is typically a random seed, but you can provide any integer to control the randomness.
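The effect of a fixed seed can be illustrated with a small stand-in that uses ordinary Python randomness in place of the node's actual sampler (the function name here is illustrative, not part of the node's API):

```python
import random

def synthesize_stub(text: str, seed: int) -> list[float]:
    # Stand-in for the node's sampling step: with the same seed,
    # the pseudo-random draws (and thus the output) are identical.
    rng = random.Random(seed)
    return [rng.random() for _ in range(len(text))]

a = synthesize_stub("hello", seed=42)
b = synthesize_stub("hello", seed=42)
c = synthesize_stub("hello", seed=7)
assert a == b   # same seed: reproducible output
assert a != c   # different seed: a different variation
```

The same principle applies to the node itself: reuse a seed to regenerate a take you liked, or change it to sample a new variation.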
This parameter is used in instruction-based TTS mode to select specific instructions or styles for the generated speech. It allows you to customize the speech output according to predefined styles or instructions, enhancing the versatility of the TTS system. The available options depend on the model's configuration and training data.
instruct_text: This optional parameter provides additional instructions or context for the instruction-based TTS mode. It helps the model understand the specific requirements or nuances of the desired speech output, enabling more precise and tailored speech generation. Clear and concise instructions improve the quality and relevance of the generated speech.
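Taken together, the inference modes determine which inputs matter. A minimal sketch of such input checks follows; the function signature and mode strings are assumptions for illustration, not the node's real API:

```python
def check_inputs(mode, tts_text, prompt_text=None, speech=None, instruct_text=None):
    # Hypothetical validation mirroring the three inference modes
    # described above: zero-shot TTS, cross-lingual cloning, instruction-based TTS.
    if not tts_text or not tts_text.strip():
        raise ValueError("tts_text is empty or not in a valid format")
    if mode == "cross_lingual" and speech is None:
        raise ValueError("cross-lingual cloning needs a reference speech sample")
    if mode == "instruct" and not instruct_text:
        raise ValueError("instruction-based TTS needs instruct_text")
    return True

check_inputs("zero_shot", "Hello there.", prompt_text="calm tone")
check_inputs("cross_lingual", "Bonjour.", speech=b"<audio bytes>")
```

Checks like these catch the most common failure cases (empty text, missing reference audio) before inference starts.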
The primary output of the CosyVoiceNode is the generated audio waveform. This output contains the synthesized speech based on the provided text and other input parameters. The audio is returned as a dictionary with the keys waveform and sample_rate, where waveform is a tensor representing the audio signal and sample_rate is the sampling rate of the audio. The generated audio is typically in 16-bit PCM format, ensuring high-quality playback.
Usage tips:
- Provide clear, well-structured text in the tts_text parameter.
- Supply a high-quality reference audio sample via the speech parameter to enhance the accuracy of cross-lingual voice cloning.
- Experiment with different seed values to explore variations in the generated speech and find the most suitable output for your needs.
- Use the prompt_text and instruct_text parameters to guide the model in generating speech with specific styles or characteristics.

Common errors and solutions:
- This error occurs when tts_text is empty or not in a valid format. Check the tts_text parameter to ensure it contains valid and well-structured text.
- This error occurs when the speech parameter is missing or the provided audio sample is not accessible. Ensure the speech parameter is correctly specified and that the audio file is accessible and in a supported format.

© Copyright 2024 RunComfy. All Rights Reserved.