Text-to-speech node using the MARS5-TTS model for high-quality speech synthesis, with voice cloning and customization options.
The MARS5TTS_Node is a powerful tool designed to convert text into speech using advanced deep learning models. This node leverages the MARS5-TTS model, which is pre-trained to generate high-quality, natural-sounding speech. The primary goal of this node is to provide a seamless and efficient way to synthesize speech from text, with the added capability of cloning voices from reference audio files. This makes it an invaluable asset for AI artists looking to create personalized and dynamic audio content. The node supports various customization options, allowing you to fine-tune the speech synthesis process to match specific needs, such as adjusting the temperature for more creative outputs or using deep cloning for more accurate voice replication.
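For readers who want to see how the underlying model is typically driven outside of ComfyUI, the following is a minimal sketch based on the Camb-ai/mars5-tts project's published usage; the file path and example text are placeholders, and the node's internal implementation may differ.

```python
import torch
import librosa

# Load MARS5 via torch.hub (per the Camb-ai/mars5-tts project); config_class
# exposes tunable inference settings such as temperature and top_k.
mars5, config_class = torch.hub.load("Camb-ai/mars5-tts", "mars5_english", trust_repo=True)

# Reference voice to clone, resampled to the model's expected sample rate.
ref_wav, _ = librosa.load("reference_voice.wav", sr=mars5.sr, mono=True)
ref_wav = torch.from_numpy(ref_wav)

# Shallow (non-deep) clone: no reference transcript is needed.
cfg = config_class(deep_clone=False, temperature=0.7)
ar_codes, output_audio = mars5.tts(
    "Hello from the MARS5 text-to-speech model.",  # text to synthesize
    ref_wav,                                       # reference voice waveform
    "",                                            # transcript, only needed for deep cloning
    cfg=cfg,
)
```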
This parameter represents the text that you want to convert into speech. The input should be a string containing the text content. The quality and naturalness of the generated speech will depend on the clarity and structure of the input text.
This parameter is the file path to a reference audio file containing the voice you want to clone. The reference voice helps the model mimic the tone, pitch, and style of the provided audio. The file should be in a format supported by the librosa library, such as WAV.
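As a quick illustration of loading a compatible reference file, here is a small sketch using librosa; the path is a placeholder, and the 24 kHz sample rate is an assumption based on the rate MARS5 models operate at.

```python
import librosa
import torch

ref_path = "voices/speaker_sample.wav"  # hypothetical path to the reference recording

# Load as mono and resample to 24 kHz so the clone input matches the model's format.
ref_wav, sr = librosa.load(ref_path, sr=24000, mono=True)
ref_wav = torch.from_numpy(ref_wav)
print(f"Loaded {ref_wav.shape[0] / sr:.1f} s of reference audio")
```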
This boolean parameter determines whether to use deep cloning for voice replication. When set to True, the model requires a reference transcript to accurately clone the voice. This option is useful for achieving high fidelity in voice replication. The default value is False.
This parameter controls the repetition penalty window size. It helps in reducing repetitive patterns in the generated speech. A larger window size can lead to more varied and natural-sounding speech. The value should be an integer.
This parameter sets the number of top tokens to consider during the sampling process. A higher value allows for more diversity in the generated speech, while a lower value makes the output more deterministic. The value should be an integer.
This parameter adjusts the randomness of the speech generation process. A higher temperature results in more creative and varied outputs, while a lower temperature produces more stable and predictable speech. The value should be a float, typically between 0.7 and 1.5.
This parameter applies a penalty to frequent tokens, encouraging the model to use less common words and phrases. This can help in generating more diverse and interesting speech. The value should be a float.
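Taken together, these sampling controls correspond to fields of the MARS5 inference configuration. The sketch below shows one illustrative combination; the values are examples rather than the node's defaults, and the top_k and freq_penalty field names follow the MARS5-TTS project.

```python
import torch

mars5, config_class = torch.hub.load("Camb-ai/mars5-tts", "mars5_english", trust_repo=True)

# Illustrative sampling settings; tune them per the parameter descriptions above.
cfg = config_class(
    rep_penalty_window=100,  # window over which repeated patterns are penalized
    top_k=100,               # sample only from the 100 most likely tokens
    temperature=0.7,         # lower -> more stable, higher -> more varied speech
    freq_penalty=3,          # discourage overly frequent tokens
)
```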
This optional parameter is the transcript of the reference audio file. It is required if if_deep_clone is set to True. The transcript helps the model better understand and replicate the reference voice.
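A hedged sketch of the deep-clone path is shown below: the reference transcript is passed alongside the reference waveform, and the file path and texts are placeholders.

```python
import torch
import librosa

mars5, config_class = torch.hub.load("Camb-ai/mars5-tts", "mars5_english", trust_repo=True)
ref_wav, _ = librosa.load("voices/speaker_sample.wav", sr=mars5.sr, mono=True)
ref_wav = torch.from_numpy(ref_wav)

# Deep cloning requires the transcript of the reference recording.
cfg = config_class(deep_clone=True, temperature=0.7)
ar_codes, output_audio = mars5.tts(
    "This sentence will be spoken in the cloned voice.",
    ref_wav,
    "Exact words spoken in the reference recording.",  # reference transcript
    cfg=cfg,
)
```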
This parameter is the file path to the generated speech audio file. The output is a WAV file containing the synthesized speech based on the input text and reference voice. The file is saved in the specified output directory with a unique timestamp to avoid overwriting.
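The node's exact saving logic is not documented here; the snippet below is a minimal sketch of the same idea, writing the synthesized waveform to a timestamped WAV file. The soundfile dependency, the directory name, and the output_audio/mars5 variables carried over from the earlier sketches are assumptions.

```python
import os
import time
import soundfile as sf

output_dir = "output/mars5_tts"  # assumed output directory
os.makedirs(output_dir, exist_ok=True)

# A timestamp in the filename keeps successive runs from overwriting each other.
out_path = os.path.join(output_dir, f"mars5_tts_{int(time.time())}.wav")
sf.write(out_path, output_audio.cpu().numpy(), mars5.sr)  # output_audio, mars5 from the sketches above
print(f"Saved synthesized speech to {out_path}")
```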
Adjust the temperature parameter to find the right balance between creativity and stability in the generated speech, and use the rep_penalty_window parameter to reduce repetitive patterns and make the speech sound more natural.

Two common issues to watch for: if if_deep_clone is set to True but no reference transcript is provided, deep cloning cannot proceed, so supply the transcript of the reference audio or set if_deep_clone to False; and if the temperature parameter is set to a value outside the acceptable range, set the temperature parameter to a float value between 0.7 and 1.5.