Generate lifelike speech from text input with various speaker styles and languages using the Kokoro ONNX model.
The KokoroGenerator node synthesizes audio from text input, letting you create lifelike speech in a variety of speaker styles and languages. It uses the Kokoro ONNX model to turn written text into spoken words, a practical tool for AI artists who want realistic voiceovers in their projects. By specifying parameters such as the speaker's voice, speech speed, and language, you can generate customized audio output to suit your creative needs. KokoroGenerator is particularly useful when you want to add a human touch to AI-generated content, producing high-quality audio with minimal effort.
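The node's text-in, audio-out contract can be sketched as a plain function. Note that the synthesis step below is stubbed with silence, and the function name, 24 kHz sample rate, and per-character timing are illustrative assumptions, not the node's actual internals:

```python
# Sketch of a KokoroGenerator-style contract: text plus voice settings in,
# a {"waveform", "sample_rate"} dictionary out. The real node delegates the
# waveform computation to the Kokoro ONNX model; here it is stubbed.

def generate(text, speaker="af", speed=1.0, lang="en-us", sample_rate=24000):
    """Return audio in the node's documented output shape."""
    # Stub: roughly 0.08 s of silence per character, shortened as speed rises.
    n_samples = int(len(text) * 0.08 * sample_rate / speed)
    waveform = [0.0] * n_samples  # the real output is a tensor, not a list
    return {"waveform": waveform, "sample_rate": sample_rate}

audio = generate("I am a synthesized robot", speed=1.0)
```

Downstream nodes only need the two dictionary keys, so any backend that fills them in the same way is interchangeable from the caller's point of view.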
The `text` parameter is the string you wish to convert into speech. It supports multiline input, so longer passages can be synthesized in one pass. The default value is "I am a synthesized robot". This parameter directly determines the audio output: the spoken words reflect the text provided.
The `speaker` parameter specifies the voice used for audio generation. It is of the custom type `KOKORO_SPEAKER`, which represents the available speaker profiles. You can choose from a variety of pre-defined voices, each with its own characteristics, to match the tone and style of your project.
The `speed` parameter is a float that controls the rate of speech in the generated audio. It ranges from 0.1 to 4, with a default of 1: higher values produce faster speech, lower values slow it down, letting you tailor the pacing to your content.
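A value outside that range would need to be rejected or clamped before synthesis; a minimal clamping helper (illustrative only, not the node's actual behavior) could look like this:

```python
def clamp_speed(speed, lo=0.1, hi=4.0):
    """Clamp a requested speech rate into the node's accepted 0.1-4 range."""
    return max(lo, min(hi, float(speed)))

clamp_speed(10)    # too fast -> 4.0
clamp_speed(0.05)  # too slow -> 0.1
clamp_speed(1.5)   # in range -> 1.5
```

Clamping (rather than raising an error) keeps a workflow running when an upstream node feeds in an unexpected value.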
The `lang` parameter is a string that sets the language of the synthesized speech. It is single-line and defaults to "en-us". It ensures that the pronunciation and intonation of the generated audio match the specified language, enhancing the authenticity of the speech.
The `audio` output is a dictionary containing the generated waveform and its sample rate. The waveform is a tensor (a multi-dimensional array) holding the audio data; the sample rate is the number of samples per second, which determines the quality and fidelity of the sound. This output carries the actual audio content and is what downstream nodes use for playback or further processing.
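Given that dictionary shape, downstream code can derive the clip's duration from the sample count and sample rate. In this sketch a flat list stands in for the tensor, and the 24 kHz rate is an assumption for illustration:

```python
def duration_seconds(audio):
    """Duration of the node's audio output in seconds.

    Expects the documented shape: {"waveform": <samples>, "sample_rate": int}.
    A real waveform is a tensor; a flat list of samples stands in for it here.
    """
    return len(audio["waveform"]) / audio["sample_rate"]

clip = {"waveform": [0.0] * 48000, "sample_rate": 24000}
duration_seconds(clip)  # 48000 samples at 24 kHz -> 2.0 seconds
```

The same two fields are all a save-audio or preview node needs to write a correctly timed file.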