Convert spoken language to written text with language detection and customizable transcription models for AI developers.
The Bjornulf_SpeechToText node is designed to convert spoken language into written text, a process known as speech-to-text transcription. This node is particularly useful for AI artists and developers who need to transcribe audio content into text format for further processing or analysis. It supports various audio input methods, including direct audio data and file paths, making it versatile for different use cases. The node can also detect the language of the spoken content, which enhances its utility in multilingual environments. By leveraging local transcription models, it keeps the transcription process efficient and lets you trade accuracy against performance through the model size you select. This node is essential for applications that require converting audio inputs into text, such as creating subtitles, transcribing interviews, or processing voice commands.
The model_size parameter determines the size of the transcription model used for processing the audio input. It affects both the accuracy and the speed of transcription. Available options are "tiny", "base", "small", "medium", and "large-v2", with "base" being the default. Smaller models are faster but may be less accurate, while larger models provide better accuracy at the cost of increased processing time.
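The valid sizes and the default can be captured in a small validation helper. This is an illustrative sketch, not part of the node's actual API; the `select_model_size` helper and its tuple of sizes simply restate the options documented above.

```python
from typing import Optional

# The five sizes the node accepts, as documented; "base" is the default.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large-v2")

def select_model_size(requested: Optional[str] = None) -> str:
    """Return a valid model size, falling back to the node's default of "base".

    Hypothetical helper: the node performs its own validation internally.
    """
    if requested is None:
        return "base"  # node default
    if requested not in MODEL_SIZES:
        raise ValueError(f"model_size must be one of {MODEL_SIZES}, got {requested!r}")
    return requested
```

A caller that does not care about the trade-off can omit the argument and get the default; passing an unknown size fails fast instead of surfacing a confusing model-loading error later.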
The AUDIO parameter is an optional input that allows you to provide audio data directly in tensor format. This is useful when the audio is already available in a digital format and needs to be processed without saving it to a file. The parameter expects a dictionary containing waveform and sample_rate keys, which represent the audio data and its sampling rate, respectively.
The audio_path parameter is an optional string input that specifies the file path to an audio file to be transcribed. This parameter is useful when the audio content is stored in a file. Make sure the file path is valid and accessible to avoid errors during transcription.
The transcript output provides the transcribed text from the audio input. It represents the spoken content in written form, ready for further text-based processing or analysis. If the transcription process fails, this output instead contains an error message describing the failure.
The detected_language output indicates the language code of the spoken content detected during transcription. This information is useful for understanding the language context of the audio input and can be used to tailor subsequent processing steps accordingly.
The language_name output provides the full name of the detected language, making it easier to interpret the code provided by the detected_language output. This enhances the readability and usability of the transcription results, especially in multilingual applications.
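The relationship between the two outputs amounts to a code-to-name lookup. A sketch with a small illustrative subset of languages (the node's own mapping is internal and may cover far more codes):

```python
# Illustrative subset of a language-code -> name table; the node's
# internal mapping may differ and is typically much larger.
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "ja": "Japanese",
}

def language_name(code: str) -> str:
    """Resolve a detected language code to a readable name.

    Unknown codes are passed through unchanged rather than raising,
    so downstream steps always receive a usable string.
    """
    return LANGUAGE_NAMES.get(code, code)
```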
Usage Tips

- Choose the model_size based on your accuracy and performance needs; larger models offer better accuracy but require more processing time.
- Ensure that the audio_path is correct and accessible to avoid errors during the transcription process.

Common Errors

- The node fails when neither the AUDIO parameter nor the audio_path is provided or valid. Resolve this by providing audio data through the AUDIO parameter or by specifying a correct audio_path.
- A transcription failure surfaces in the transcript output as an error string, with <error_message> providing specific details.