Powerful node for audio transcription, alignment, and speaker labeling with GPU acceleration for fast processing.
WhisperX is a powerful node designed to transcribe and process audio files using advanced speech recognition and alignment techniques. It leverages the capabilities of the Whisper model to transcribe audio into text, align the transcriptions with the audio, and optionally assign speaker labels for multi-speaker scenarios. This node is particularly beneficial for tasks that require accurate and efficient transcription of audio content, such as creating subtitles, generating transcripts for meetings, or processing audio data for further analysis. By utilizing GPU resources when available, WhisperX ensures fast and efficient processing, making it a valuable tool for AI artists and developers working with audio data.
The audio parameter specifies the path to the audio file that you want to transcribe. This file should be in a supported audio format such as WAV or MP3. The quality and clarity of the audio can significantly impact the accuracy of the transcription.
This parameter defines the type of Whisper model to be used for transcription. Larger models are generally more accurate but slower and require more memory, so select the model type that best matches your accuracy and speed requirements.
The batch_size parameter determines the number of audio segments processed in a single batch. A larger batch size can speed up processing but may require more memory. The default value is typically set to balance performance and resource usage.
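To illustrate the tradeoff, batching can be sketched in plain Python. The make_batches helper and the segment list below are hypothetical, not WhisperX internals; they only show how a larger batch_size means fewer, larger work units:

```python
def make_batches(segments, batch_size):
    """Group audio segments into batches of at most batch_size items."""
    return [segments[i:i + batch_size] for i in range(0, len(segments), batch_size)]

# 10 hypothetical voice-activity segments, processed 4 at a time
segments = [f"segment_{i}" for i in range(10)]
batches = make_batches(segments, batch_size=4)
print(len(batches))            # → 3
print([len(b) for b in batches])  # → [4, 4, 2]
```

Each batch is transcribed in one forward pass, so doubling batch_size roughly halves the number of passes at the cost of more GPU memory per pass.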
The if_mutiple_speaker parameter is a boolean indicating whether the audio contains multiple speakers. If set to true, the node performs speaker diarization to assign speaker labels to the different segments of the audio.
The use_auth_token parameter is used for authentication when accessing models or services that require authorization. This is particularly relevant for gated models, such as the speaker-diarization models hosted on Hugging Face, which enforce access control.
The min_speakers parameter (optional) specifies the minimum number of speakers expected in the audio. Providing this hint to the model can improve the accuracy of speaker diarization.
The max_speakers parameter (optional) specifies the maximum number of speakers expected in the audio. Like min_speakers, it helps the model allocate speaker labels more reliably.
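A minimal sketch of how these hints constrain diarization is shown below. The clamp_speaker_count helper is hypothetical; in practice the clustering inside the diarization model uses the bounds directly, but the effect is the same: the final speaker count never falls outside [min_speakers, max_speakers]:

```python
def clamp_speaker_count(estimated, min_speakers=None, max_speakers=None):
    """Constrain a model's estimated speaker count to user-provided bounds."""
    if min_speakers is not None:
        estimated = max(estimated, min_speakers)
    if max_speakers is not None:
        estimated = min(estimated, max_speakers)
    return estimated

print(clamp_speaker_count(1, min_speakers=2))                   # → 2
print(clamp_speaker_count(7, min_speakers=2, max_speakers=4))   # → 4
```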
The transcribed_text parameter contains the text output of the transcription process. This is the primary result of the node, providing a textual representation of the spoken content in the audio file.
The aligned_segments parameter provides detailed information about the alignment of the transcribed text with the audio. This includes timestamps and other metadata that can be used for precise synchronization of text and audio.
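For example, aligned segments of the shape below (start and end times in seconds plus the text spoken in that span; the exact field names are assumptions for illustration) can be converted into SRT subtitle entries:

```python
def fmt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render a list of {'start', 'end', 'text'} dicts as an SRT string."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{fmt_time(seg['start'])} --> {fmt_time(seg['end'])}\n{seg['text']}"
        )
    return "\n\n".join(blocks)

segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": "Let's get started."},
]
print(segments_to_srt(segments))
```

Because the timestamps come from forced alignment rather than the coarse Whisper decoder output, subtitles produced this way stay tightly synchronized with the audio.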
If the if_mutiple_speaker parameter is set to true, the speaker_labels parameter will contain information about the different speakers identified in the audio. This includes which segments of the text were spoken by which speakers.
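As a sketch, speaker-labeled segments (again with assumed field names) can be merged into a readable multi-speaker transcript by joining consecutive segments from the same speaker:

```python
def to_transcript(labeled_segments):
    """Join consecutive segments from the same speaker into one transcript line."""
    lines = []
    for seg in labeled_segments:
        if lines and lines[-1][0] == seg["speaker"]:
            lines[-1][1].append(seg["text"])
        else:
            lines.append([seg["speaker"], [seg["text"]]])
    return "\n".join(f"{spk}: {' '.join(texts)}" for spk, texts in lines)

labeled = [
    {"speaker": "SPEAKER_00", "text": "Hi there."},
    {"speaker": "SPEAKER_00", "text": "How are you?"},
    {"speaker": "SPEAKER_01", "text": "Doing well, thanks."},
]
print(to_transcript(labeled))
# prints:
# SPEAKER_00: Hi there. How are you?
# SPEAKER_01: Doing well, thanks.
```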
© Copyright 2024 RunComfy. All Rights Reserved.