Convert text to natural-sounding speech audio using the WhisperSpeech pipeline, with control over the speaker's voice, the speech speed, and the overlap between audio chunks.
The IF_WhisperSpeech node is designed to convert text into high-quality speech audio using advanced AI techniques. This node leverages the capabilities of the WhisperSpeech pipeline to generate natural-sounding speech from the provided text input. It allows you to specify various parameters such as the speaker's voice, the speed of speech, and the overlap between audio chunks to fine-tune the output. The node is particularly useful for creating voiceovers, narrations, and other audio content where natural and clear speech is required. By using this node, you can automate the process of generating speech, saving time and effort while ensuring consistent audio quality.
This parameter accepts the text that you want to convert into speech. It supports multiline input, allowing you to provide long passages of text. The default value is a sample text about electromagnetism. The text you input here will be processed and converted into audio.
The file_name parameter specifies the base name for the output audio file. The node appends a timestamp to this base name to create a unique file name for each generated audio file. The default value is IF_whisper_speech.
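The naming scheme described above can be sketched as follows. This is a minimal illustration, not the node's own code: the helper name build_output_path, the timestamp layout, and the .wav extension are all assumptions chosen for the example.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(output_dir, file_name="IF_whisper_speech", now=None, ext=".wav"):
    """Return output_dir/<file_name>_<timestamp><ext>.

    The timestamp layout (YYYYmmdd_HHMMSS) and the .wav extension are
    assumptions for illustration, not the node's documented format.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"{file_name}_{stamp}{ext}"


if __name__ == "__main__":
    # Each call at a different second yields a different file name.
    print(build_output_path("output"))
```

Because the timestamp has one-second resolution in this sketch, two files generated within the same second would collide; a real implementation might add a counter or finer resolution.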
The speaker parameter lets you choose the speaker's voice from a list of available audio files. The options are the pre-recorded voices stored in the whisperspeech/audio directory. The default option is None, which uses the default speaker voice.
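One way such a voice list could be assembled is sketched below. The function names, the set of accepted file extensions, and the error message are hypothetical; the source only states that the options come from the whisperspeech/audio directory and that None selects the default voice.

```python
from pathlib import Path

# Assumed set of audio extensions; the node's actual filter is not documented here.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac"}


def list_speaker_options(audio_dir):
    """Return the dropdown options: "None" first, then audio file names, sorted."""
    folder = Path(audio_dir)
    files = []
    if folder.is_dir():
        files = sorted(
            p.name
            for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
        )
    return ["None"] + files


def resolve_speaker(audio_dir, choice):
    """Map a dropdown choice to a file path, or None for the default voice."""
    if choice == "None":
        return None
    path = Path(audio_dir) / choice
    if not path.is_file():
        # Hypothetical message, not the node's actual error text.
        raise FileNotFoundError(f"Speaker file not found: {path}")
    return path
```

Calling list_speaker_options("whisperspeech/audio") on a missing directory returns just ["None"], so the default voice stays selectable.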
The torch_compile parameter is a boolean that determines whether to use PyTorch's torch.compile feature to optimize the model's performance. The default value is False. Enabling this option can speed up audio generation but may require additional computational resources.
The optional cps parameter stands for "characters per second" and controls the speed of the generated speech. The default value is 14.0, with a minimum of 10.0 and a maximum of 20.0. Lower values produce slower speech and higher values produce faster speech.
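Since cps is a rate in characters per second, a rough speech duration follows from dividing the text length by it. The sketch below shows that arithmetic together with a range check. The function names and the error message are hypothetical, and the estimate ignores pauses and chunking, so treat it as a ballpark figure only.

```python
CPS_MIN, CPS_DEFAULT, CPS_MAX = 10.0, 14.0, 20.0  # range and default from the text above


def validate_cps(cps=CPS_DEFAULT):
    """Return cps as a float, or raise if it is outside the documented range."""
    if not CPS_MIN <= cps <= CPS_MAX:
        # Hypothetical message, not the node's actual error text.
        raise ValueError(f"cps must be between {CPS_MIN} and {CPS_MAX}, got {cps}")
    return float(cps)


def estimate_duration_seconds(text, cps=CPS_DEFAULT):
    """Rough estimate: number of characters divided by characters per second."""
    return len(text) / validate_cps(cps)


if __name__ == "__main__":
    sample = "x" * 140  # a 140-character passage
    for rate in (10.0, 14.0, 20.0):
        print(rate, estimate_duration_seconds(sample, rate))
```

For a 140-character passage this gives 14 seconds at the minimum rate, 10 seconds at the default, and 7 seconds at the maximum.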
The optional overlap parameter specifies the overlap between audio chunks in milliseconds. The default value is 100.0, with a minimum of 0.0 and a maximum of 200.0. Increasing the overlap can smooth the transitions between chunks, making the speech sound more natural.
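To make the idea of overlapping chunks concrete, the sketch below converts an overlap in milliseconds to a sample count and joins two chunks with a linear crossfade. The source does not say how the node blends chunks, so the crossfade technique, the function names, and the sample rate used in the example are assumptions.

```python
def overlap_samples(overlap_ms, sample_rate):
    """Convert an overlap in milliseconds to a whole number of samples."""
    return int(round(sample_rate * overlap_ms / 1000.0))


def crossfade(a, b, n):
    """Join two sample lists, linearly blending the last n samples of a
    with the first n samples of b. The result is n samples shorter than
    a plain concatenation."""
    n = min(n, len(a), len(b))
    if n == 0:
        return list(a) + list(b)
    blended = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of b rises across the overlap region
        blended.append(a[len(a) - n + i] * (1.0 - w) + b[i] * w)
    return list(a[: len(a) - n]) + blended + list(b[n:])


if __name__ == "__main__":
    # 100 ms at an assumed 16 kHz sample rate is 1600 samples.
    print(overlap_samples(100.0, 16000))
    print(crossfade([1.0] * 4, [0.0] * 4, 2))
```

With an overlap of 0 the chunks are simply concatenated, which is where audible seams are most likely; a longer overlap spreads the transition over more samples.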
This output parameter contains the generated audio data in a format that can be further processed or directly used in your projects. The audio is generated based on the input text and the specified parameters, ensuring high-quality and natural-sounding speech.
This output parameter provides the file path to the generated audio file, resampled to 16 kHz. The file is saved in the output directory with a unique name based on the provided file_name and a timestamp. The 16 kHz resampling ensures compatibility with various audio processing tools and applications.
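As an illustration of what resampling to 16 kHz involves, here is a minimal linear-interpolation resampler in plain Python. It is a teaching sketch only: the function name is hypothetical, linear interpolation is a stand-in technique, and the source does not say which resampler the node uses. Audio libraries typically apply a filtered resampler to avoid aliasing.

```python
def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample a list of samples from src_rate to dst_rate by linear
    interpolation between neighbouring source samples."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = max(1, int(round(len(samples) * dst_rate / src_rate)))
    last = len(samples) - 1
    out = []
    for j in range(n_out):
        pos = j * src_rate / dst_rate  # position measured in source samples
        i = int(pos)
        frac = pos - i
        left = samples[min(i, last)]
        right = samples[min(i + 1, last)]
        out.append(left + (right - left) * frac)
    return out


if __name__ == "__main__":
    ramp = [float(v) for v in range(6)]
    # Going from an assumed 24 kHz source to 16 kHz keeps two of every three samples' worth of time.
    print(resample_linear(ramp, 24000))
```

The output has roughly len(samples) * 16000 / src_rate samples, so the duration in seconds is preserved while the sample count changes.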
Experiment with different cps values to find the optimal speech speed for your use case. A lower value results in slower speech, while a higher value makes the speech faster.
Adjust the overlap parameter to smooth out transitions between audio chunks, especially for longer texts. This can noticeably improve the naturalness of the generated speech.
Choose a speaker from the available audio files to match the voice to your project's needs.
If the selected speaker file (<speaker>) cannot be found, check that it exists in the whisperspeech/audio directory.
If the cps value is outside the allowed range of 10.0 to 20.0, adjust it to fall within that range. The default value of 14.0 is a good starting point.
If GPU memory is a constraint, try disabling the torch_compile option. You can also try running the node on a machine with more GPU memory.
Adjust the overlap value to ensure that the length of the tokens stays within the allowed limit. The default value of 100.0 is usually sufficient.
© Copyright 2024 RunComfy. All Rights Reserved.