Sophisticated text-to-speech tool with advanced machine learning for high-quality voice synthesis and cloning.
CosyVoiceNode is a sophisticated tool designed to facilitate text-to-speech (TTS) synthesis and cross-lingual voice cloning. It leverages advanced machine learning models to generate high-quality speech from text inputs, making it an invaluable asset for AI artists looking to create realistic and expressive voiceovers. The node supports multiple inference modes, including zero-shot TTS, cross-lingual voice cloning, and instruction-based TTS, providing flexibility and versatility in various applications. By utilizing pre-trained models and a robust inference pipeline, CosyVoiceNode ensures that the generated speech is natural and coherent, enhancing the overall user experience.
tts_text: This parameter is the text input you want to convert into speech. It is required and serves as the primary content for the TTS process. The quality and clarity of the generated speech depend heavily on the text provided. There are no specific minimum or maximum lengths, but well-structured sentences are recommended for optimal results.
prompt_text: This optional parameter is used in the zero-shot and cross-lingual inference modes to provide additional context or style cues for the generated speech. It helps the model understand the desired tone, style, or specific characteristics of the speech output. Providing a relevant prompt can significantly enhance the naturalness and expressiveness of the generated voice.
speech: This parameter is used to input a reference speech sample, particularly in cross-lingual voice cloning mode. The reference speech helps the model capture the unique characteristics and nuances of the speaker's voice, enabling it to generate speech that closely mimics the reference. The input should be a high-quality audio sample for best results.
seed: This optional parameter sets the random seed for the inference process, ensuring reproducibility of the results. By specifying a seed value, you can generate consistent outputs across different runs. The default is typically a random seed, but you can provide any integer to control the randomness.
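The effect of a fixed seed can be illustrated with a small stand-in that uses ordinary Python randomness in place of the node's actual sampler (the function name here is illustrative, not part of the node's API):

```python
import random

def synthesize_stub(text: str, seed: int) -> list[float]:
    # Stand-in for the node's sampling step: with the same seed,
    # the pseudo-random draws (and thus the output) are identical.
    rng = random.Random(seed)
    return [rng.random() for _ in range(len(text))]

a = synthesize_stub("hello", seed=42)
b = synthesize_stub("hello", seed=42)
c = synthesize_stub("hello", seed=7)
assert a == b   # same seed: reproducible output
assert a != c   # different seed: a different variation
```

The same principle applies to the node itself: reuse a seed to regenerate a take you liked, or change it to sample a new variation.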
This parameter is used in instruction-based TTS mode to select specific instructions or styles for the generated speech. It allows you to customize the speech output according to predefined styles or instructions, enhancing the versatility of the TTS system. The available options depend on the model's configuration and training data.
instruct_text: This optional parameter provides additional instructions or context for the instruction-based TTS mode. It helps the model understand the specific requirements or nuances of the desired speech output, enabling more precise and tailored speech generation. Clear and concise instructions improve the quality and relevance of the generated speech.
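Taken together, the inference modes determine which inputs matter. A minimal sketch of such input checks follows; the function signature and mode strings are assumptions for illustration, not the node's real API:

```python
def check_inputs(mode, tts_text, prompt_text=None, speech=None, instruct_text=None):
    # Hypothetical validation mirroring the three inference modes
    # described above: zero-shot TTS, cross-lingual cloning, instruction-based TTS.
    if not tts_text or not tts_text.strip():
        raise ValueError("tts_text is empty or not in a valid format")
    if mode == "cross_lingual" and speech is None:
        raise ValueError("cross-lingual cloning needs a reference speech sample")
    if mode == "instruct" and not instruct_text:
        raise ValueError("instruction-based TTS needs instruct_text")
    return True

check_inputs("zero_shot", "Hello there.", prompt_text="calm tone")
check_inputs("cross_lingual", "Bonjour.", speech=b"<audio bytes>")
```

Checks like these catch the most common failure cases (empty text, missing reference audio) before inference starts.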
The primary output of the CosyVoiceNode is the generated audio waveform. This output contains the synthesized speech based on the provided text and other input parameters. The audio is returned as a dictionary with the keys waveform and sample_rate, where waveform is a tensor representing the audio signal and sample_rate is the sampling rate of the audio. The generated audio is typically in 16-bit PCM format, ensuring high-quality playback.
Usage tips:
- Provide clear, well-structured text in the tts_text parameter.
- Supply a high-quality reference audio sample via the speech parameter to enhance the accuracy of cross-lingual voice cloning.
- Experiment with different seed values to explore variations in the generated speech and find the most suitable output for your needs.
- Use the prompt_text and instruct_text parameters to guide the model in generating speech with specific styles or characteristics.

Common errors and solutions:
- This error occurs when tts_text is empty or not in a valid format. Check the tts_text parameter to ensure it contains valid and well-structured text.
- This error occurs when the speech parameter is missing or the provided audio sample is not accessible. Ensure the speech parameter is correctly specified and that the audio file is accessible and in a supported format.

© Copyright 2024 RunComfy. All Rights Reserved.