Convert spoken language to written text with language detection and customizable transcription models for AI developers.
The Bjornulf_SpeechToText node is designed to convert spoken language into written text, a process known as speech-to-text transcription. This node is particularly useful for AI artists and developers who need to transcribe audio content into text format for further processing or analysis. It supports various audio input methods, including direct audio data and file paths, making it versatile for different use cases. The node can also detect the language of the spoken content, which enhances its utility in multilingual environments. By leveraging local transcription models, it keeps the transcription process efficient and lets you trade accuracy against performance through the model size you select. This node is essential for applications that require converting audio inputs into text, such as creating subtitles, transcribing interviews, or processing voice commands.
The model_size parameter determines the size of the transcription model used for processing the audio input. It affects both the accuracy and the speed of transcription. Available options are "tiny", "base", "small", "medium", and "large-v2", with "base" being the default. Smaller models are faster but may be less accurate, while larger models provide better accuracy at the cost of increased processing time.
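The valid sizes and the default can be captured in a small validation helper. This is an illustrative sketch, not part of the node's actual API; the `select_model_size` helper and its tuple of sizes simply restate the options documented above.

```python
from typing import Optional

# The five sizes the node accepts, as documented; "base" is the default.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large-v2")

def select_model_size(requested: Optional[str] = None) -> str:
    """Return a valid model size, falling back to the node's default of "base".

    Hypothetical helper: the node performs its own validation internally.
    """
    if requested is None:
        return "base"  # node default
    if requested not in MODEL_SIZES:
        raise ValueError(f"model_size must be one of {MODEL_SIZES}, got {requested!r}")
    return requested
```

A caller that does not care about the trade-off can omit the argument and get the default; passing an unknown size fails fast instead of surfacing a confusing model-loading error later.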
The AUDIO parameter is an optional input that allows you to provide audio data directly in tensor format. This is useful when the audio is already available in a digital format and needs to be processed without saving it to a file. The parameter expects a dictionary containing waveform and sample_rate keys, which represent the audio data and its sampling rate, respectively.
The audio_path parameter is an optional string input that specifies the file path to an audio file to be transcribed. This parameter is useful when the audio content is stored in a file. Make sure the file path is valid and accessible to avoid errors during transcription.
The transcript output provides the transcribed text from the audio input. It represents the spoken content in written form, ready for further text-based processing or analysis. If the transcription process fails, this output instead contains an error message describing the failure.
The detected_language output indicates the language code of the spoken content detected during transcription. This information is useful for understanding the language context of the audio input and can be used to tailor subsequent processing steps accordingly.
The language_name output provides the full name of the detected language, making it easier to interpret the code provided by the detected_language output. This enhances the readability and usability of the transcription results, especially in multilingual applications.
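The relationship between the two outputs amounts to a code-to-name lookup. A sketch with a small illustrative subset of languages (the node's own mapping is internal and may cover far more codes):

```python
# Illustrative subset of a language-code -> name table; the node's
# internal mapping may differ and is typically much larger.
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "ja": "Japanese",
}

def language_name(code: str) -> str:
    """Resolve a detected language code to a readable name.

    Unknown codes are passed through unchanged rather than raising,
    so downstream steps always receive a usable string.
    """
    return LANGUAGE_NAMES.get(code, code)
```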
Usage Tips

- Choose the model_size based on your accuracy and performance needs; larger models offer better accuracy but require more processing time.
- Ensure that the audio_path is correct and accessible to avoid errors during the transcription process.

Common Errors

- The node fails when neither the AUDIO parameter nor the audio_path is provided or valid. Resolve this by providing audio data through the AUDIO parameter or by specifying a correct audio_path.
- A transcription failure surfaces in the transcript output as an error string, with <error_message> providing specific details.