Efficient voice cloning with minimal input using zero-shot learning for rapid text-to-speech synthesis.
The CosyVoiceZeroShotNode is a tool designed for rapid voice cloning using a zero-shot learning approach. It is part of the CosyVoice suite, which is tailored for text-to-speech (TTS) applications. Its primary function is to generate speech from text input by leveraging a minimal amount of prompt data, typically just a few seconds of audio, which makes it well suited to users who need to create voice models quickly without extensive training data. The node synthesizes speech that closely mimics the voice characteristics of the provided prompt, making it ideal for applications where voice personalization and quick turnaround are essential. By focusing on zero-shot inference, it produces high-quality, natural-sounding speech from minimal input.
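For readers who want to see what zero-shot cloning looks like outside of a ComfyUI graph, here is a minimal sketch adapted from the public CosyVoice Python examples; the model directory and file names are placeholders, and details may differ between releases.

```python
# Minimal zero-shot cloning sketch, adapted from the public CosyVoice examples.
# Model directory and audio file names are placeholders.
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M')

# A few seconds of reference audio, loaded and resampled to 16 kHz.
prompt_speech_16k = load_wav('reference_voice.wav', 16000)

for i, result in enumerate(cosyvoice.inference_zero_shot(
        'The text you want spoken in the cloned voice.',   # tts_text
        'A transcript of the reference audio.',            # prompt_text
        prompt_speech_16k)):
    # 22050 Hz is the output rate of the 300M checkpoints.
    torchaudio.save(f'zero_shot_{i}.wav', result['tts_speech'], 22050)
```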
The tts_text parameter is the text input that you want to convert into speech. It serves as the primary content for the text-to-speech synthesis process and determines the linguistic content of the output, so the quality and clarity of the generated speech depend directly on it. There is no fixed minimum or maximum length, but coherent, grammatically correct sentences or phrases give the best results.
The speed parameter controls the rate at which the synthesized speech is delivered. Adjusting it lets you speed up or slow down the output to match a desired speaking pace or the tempo a particular application requires. The default value is typically 1.0, representing normal speed; values greater than 1.0 increase the speed and values less than 1.0 decrease it.
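Newer releases of the open-source CosyVoice API expose a matching speed keyword on the inference call; assuming such a release and the setup from the sketch above, adjusting the pace might look like this:

```python
# Sketch: 1.25x faster delivery; values below 1.0 slow the speech down.
# Assumes `cosyvoice` and `prompt_speech_16k` from the earlier sketch, and a
# CosyVoice release whose inference_zero_shot accepts a `speed` keyword.
for result in cosyvoice.inference_zero_shot(
        'Text to be spoken a little faster than normal.',
        'A transcript of the reference audio.',
        prompt_speech_16k,
        speed=1.25):
    audio = result['tts_speech']  # audio tensor at the faster pace
```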
The seed parameter sets the random seed for the synthesis process, ensuring reproducibility of the results. By specifying a seed value, you can generate consistent outputs across multiple runs with the same input parameters, which is particularly useful for debugging or when you need identical results for comparison. The seed is an integer; there is no strict range, so choose a value to suit your application.
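In a standalone script, reproducibility means seeding every random number generator the synthesis touches; ComfyUI wires the node's seed input through internally, but a common pattern outside the graph looks like this:

```python
# Sketch: seeding the RNGs that influence synthesis, for reproducible output.
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)  # same seed + same inputs -> the same synthesized audio
```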
The use_25hz parameter is a boolean flag that determines whether the model's 25 Hz processing mode is used; this refers to the internal token rate of the 25 Hz model variant, not the audio sampling rate. The setting can affect the quality and fidelity of the generated speech, with potential trade-offs between audio quality and processing efficiency. The default value is usually False, meaning the standard mode is used unless specified otherwise.
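In practice this flag most plausibly selects the 25 Hz variant of the pretrained checkpoint; a hedged sketch, with directory names following the public CosyVoice releases:

```python
# Sketch: choosing between the standard and 25 Hz CosyVoice checkpoints.
# Directory names follow the public releases; adjust to your installation.
from cosyvoice.cli.cosyvoice import CosyVoice

use_25hz = True
model_dir = ('pretrained_models/CosyVoice-300M-25Hz' if use_25hz
             else 'pretrained_models/CosyVoice-300M')
cosyvoice = CosyVoice(model_dir)
```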
The prompt_text parameter is an optional input that provides additional context or guidance for the voice cloning process. It is used in conjunction with the prompt audio to enhance the accuracy and naturalness of the synthesized speech, and it is particularly important when no pre-existing speaker model is available, since it helps the system understand the desired voice characteristics. There are no specific constraints on its content, but it should be relevant to the intended voice style; in practice, a transcript of the prompt audio works well.
The prompt_wav parameter is an optional input consisting of a waveform and a sample rate, providing the audio sample used for voice cloning. This sample is crucial for the zero-shot learning process, since it serves as the reference for mimicking voice characteristics in the generated speech. The quality and length of the prompt audio can significantly affect the accuracy and naturalness of the output, so make sure the audio is clear and representative of the desired voice.
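Preparing a clean reference clip usually amounts to downmixing to mono and resampling. The sketch below assumes the node follows ComfyUI's usual AUDIO convention of a waveform/sample-rate pair and that the model expects 16 kHz input; both are assumptions, not guarantees from the node's documentation.

```python
# Sketch: preparing a prompt_wav input. The dict layout mirrors ComfyUI's
# usual AUDIO convention; the 16 kHz target is an assumption.
import torchaudio

waveform, sample_rate = torchaudio.load('reference_voice.wav')  # placeholder path

# Downmix stereo to mono if needed.
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to the rate the model expects.
target_rate = 16000
if sample_rate != target_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, target_rate)

# Assumed (batch, channels, samples) layout for ComfyUI AUDIO.
prompt_wav = {'waveform': waveform.unsqueeze(0), 'sample_rate': target_rate}
```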
The speaker_model parameter is an optional input that lets you specify a pre-trained speaker model for the synthesis process. If provided, this model guides the voice cloning and can improve the accuracy and consistency of the output. It is particularly useful when you already have a model that closely matches the desired voice characteristics; if no speaker model is supplied, the node relies on the prompt text and audio instead.
The output parameter is the synthesized speech generated by the node. It is the result of the text-to-speech conversion, incorporating the voice characteristics derived from the prompt audio or speaker model. Its quality and naturalness are influenced by the input parameters, and it is delivered as an audio waveform that can be played back or processed further as needed.
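If you want to persist the result outside the graph, and assuming the output follows the same waveform/sample-rate convention as above, saving it is straightforward with torchaudio:

```python
# Sketch: writing the node's audio output to disk. Assumes the output is a
# dict like {'waveform': tensor, 'sample_rate': int}, per ComfyUI convention.
import torchaudio

def save_output(output: dict, path: str = 'tts_output.wav') -> None:
    waveform = output['waveform']
    if waveform.dim() == 3:          # (batch, channels, samples) -> first item
        waveform = waveform[0]
    torchaudio.save(path, waveform, output['sample_rate'])
```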
The spk_model parameter is an output that provides the speaker model used or generated during the synthesis process. This model encapsulates the voice characteristics captured from the prompt audio or the supplied speaker model, and it can be reused in future synthesis tasks to keep the voice output consistent. The spk_model is particularly valuable for applications that require repeated use of the same voice style, as it allows efficient reuse without the need for additional prompt data.
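One plausible way to carry a cloned voice across sessions, assuming the speaker model is a picklable object such as a dict of tensors (the node's documentation does not guarantee this), is to serialize it with torch:

```python
# Sketch: persisting spk_model for reuse. Assumes it is picklable
# (e.g. a dict of tensors); the file name is a placeholder.
import torch

def save_speaker_model(spk_model, path: str = 'my_voice_spk_model.pt') -> None:
    torch.save(spk_model, path)

def load_speaker_model(path: str = 'my_voice_spk_model.pt'):
    # Feed the result into the node's speaker_model input in a later session
    # so the prompt audio does not need to be re-processed.
    return torch.load(path)
```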
Usage Tips:
- Ensure that the prompt_wav audio is clear and representative of the desired voice to achieve the best cloning results.
- Experiment with the speed parameter to find the optimal speaking rate for your application, especially if the default speed does not meet your needs.
- Use the seed parameter to reproduce results consistently, which is useful for testing and comparison purposes.

Common Errors and Solutions:
- The prompt_text parameter is empty even though it is required when no speaker model is provided. Solution: provide a non-empty prompt_text input to guide the voice cloning process.
- The prompt_wav is not in the expected format or sample rate. Solution: ensure that the prompt_wav audio is correctly formatted and resampled to the required specifications before inputting it into the node (see the sketch below).
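A small pre-flight check that mirrors these two failure modes might look like the following; the function name and the 16 kHz floor are illustrative assumptions rather than part of the node's API.

```python
# Sketch: input guard mirroring the common errors above. The function and
# the 16 kHz minimum are illustrative assumptions, not the node's API.
def validate_inputs(prompt_text, prompt_wav, speaker_model):
    if speaker_model is not None:
        return  # a pre-trained speaker model makes the prompts optional
    if not prompt_text or not prompt_text.strip():
        raise ValueError('prompt_text must be non-empty when no speaker model is provided')
    if prompt_wav is None:
        raise ValueError('prompt_wav is required when no speaker model is provided')
    if prompt_wav['sample_rate'] < 16000:
        raise ValueError('prompt_wav sample rate too low; resample to at least 16 kHz')
```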