
ComfyUI Node: WhisperX Node

Class Name

WhisperX

Category
AIFSH_WhisperX
Author
AIFSH (Account age: 271 days)
Extension
ComfyUI-WhisperX
Last Updated
2024-06-14
Github Stars
0.03K

How to Install ComfyUI-WhisperX

Install this extension via the ComfyUI Manager by searching for ComfyUI-WhisperX:
  • 1. Click the Manager button in the main menu.
  • 2. Select the Custom Nodes Manager button.
  • 3. Enter ComfyUI-WhisperX in the search bar.
After installation, click the Restart button to restart ComfyUI. Then, manually refresh your browser to clear the cache and access the updated list of nodes.


WhisperX Node Description

A node for audio transcription, word-level alignment, and speaker labeling, with GPU acceleration for fast processing.

WhisperX Node:

WhisperX is a powerful node designed to transcribe and process audio files using advanced speech recognition and alignment techniques. It leverages the capabilities of the Whisper model to transcribe audio into text, align the transcriptions with the audio, and optionally assign speaker labels for multi-speaker scenarios. This node is particularly beneficial for tasks that require accurate and efficient transcription of audio content, such as creating subtitles, generating transcripts for meetings, or processing audio data for further analysis. By utilizing GPU resources when available, WhisperX ensures fast and efficient processing, making it a valuable tool for AI artists and developers working with audio data.
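Under the hood the node wraps the open-source whisperx Python library. The sketch below illustrates the same transcribe → align → diarize pipeline using whisperx's public API (load_model, load_audio, align, DiarizationPipeline); the run_whisperx wrapper, its defaults, and the device selection are illustrative assumptions, not the node's actual code:

```python
def diarization_bounds(min_speakers=None, max_speakers=None):
    """Build the optional speaker-count hints passed to diarization."""
    hints = {}
    if min_speakers is not None:
        hints["min_speakers"] = min_speakers
    if max_speakers is not None:
        hints["max_speakers"] = max_speakers
    return hints


def run_whisperx(audio_path, model_type="large-v2", batch_size=4,
                 if_mutiple_speaker=False, use_auth_token=None,
                 min_speakers=None, max_speakers=None):
    """Illustrative sketch of the node's pipeline on top of whisperx."""
    import torch
    import whisperx  # heavy dependencies imported lazily

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # 1. Transcribe with the chosen Whisper model.
    model = whisperx.load_model(model_type, device)
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=batch_size)

    # 2. Align the transcription to the audio for precise timestamps.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], align_model, metadata,
                            audio, device)

    # 3. Optionally assign speaker labels (needs a Hugging Face token).
    if if_mutiple_speaker:
        diarizer = whisperx.DiarizationPipeline(
            use_auth_token=use_auth_token, device=device)
        diarize_segments = diarizer(
            audio, **diarization_bounds(min_speakers, max_speakers))
        result = whisperx.assign_word_speakers(diarize_segments, result)

    return result
```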

WhisperX Node Input Parameters:

audio

The audio parameter specifies the path to the audio file that you want to transcribe. This file should be in a supported audio format such as WAV or MP3. The quality and clarity of the audio can significantly impact the accuracy of the transcription.

model_type

This parameter defines the type of Whisper model to be used for transcription. Different models may offer varying levels of accuracy and performance, so selecting the appropriate model type based on your needs is crucial.

batch_size

The batch_size parameter determines the number of audio segments processed in a single batch. A larger batch size can speed up processing but may require more memory. The default value is typically set to balance performance and resource usage.
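Conceptually, batching just groups the audio segments before each model call, trading memory for throughput. A minimal illustration (the batched helper is hypothetical, not part of the node):

```python
def batched(items, batch_size):
    """Yield successive batches of at most `batch_size` items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

For example, 5 audio segments with batch_size=2 yield three batches of sizes 2, 2, and 1, each processed in a single forward pass.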

if_mutiple_speaker

This boolean parameter indicates whether the audio contains multiple speakers (the misspelling "mutiple" is part of the parameter name as defined by the extension). If set to true, the node performs speaker diarization to assign speaker labels to different segments of the audio.

use_auth_token

The use_auth_token parameter is used for authentication when accessing certain models or services that require authorization. This is particularly relevant when using models hosted on platforms that enforce access control.

min_speakers

(Optional) Specifies the minimum number of speakers expected in the audio. This can help improve the accuracy of speaker diarization by providing a hint to the model.

max_speakers

(Optional) Specifies the maximum number of speakers expected in the audio. Similar to min_speakers, this helps the model better allocate speaker labels.

WhisperX Node Output Parameters:

transcribed_text

The transcribed_text parameter contains the text output of the transcription process. This is the primary result of the node, providing a textual representation of the spoken content in the audio file.

aligned_segments

The aligned_segments parameter provides detailed information about the alignment of the transcribed text with the audio. This includes timestamps and other metadata that can be used for precise synchronization of text and audio.
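Assuming each aligned segment carries start, end, and text fields (the typical whisperx segment shape), converting the output into subtitles is straightforward. This SRT formatter is an illustrative sketch, not part of the node:

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 3.5 -> 00:00:03,500."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render aligned segments as an SRT subtitle string."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_timestamp(seg['start'])} --> "
                      f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}")
    return "\n\n".join(blocks)
```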

speaker_labels

If the if_mutiple_speaker parameter is set to true, the speaker_labels parameter will contain information about the different speakers identified in the audio. This includes which segments of the text were spoken by which speakers.
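One common way to attach diarization output to transcript segments is by maximal time overlap. The sketch below illustrates the idea with hypothetical segment/turn dicts; it is not the node's exact algorithm (whisperx assigns speakers at the word level):

```python
def overlap(a_start, a_end, b_start, b_end):
    """Duration (in seconds) that two time intervals overlap."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns):
    """Assign each segment the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best = max(turns,
                   key=lambda t: overlap(seg["start"], seg["end"],
                                         t["start"], t["end"]),
                   default=None)
        labeled.append({**seg,
                        "speaker": best["speaker"] if best else None})
    return labeled
```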

WhisperX Node Usage Tips:

  • Ensure your audio files are clear and free from excessive background noise to improve transcription accuracy.
  • Select the appropriate model_type based on your specific needs for accuracy and performance.
  • Use the batch_size parameter to optimize processing speed, especially for longer audio files.
  • If your audio contains multiple speakers, set the if_mutiple_speaker parameter to true to enable speaker diarization.
  • Provide min_speakers and max_speakers values if you have prior knowledge of the number of speakers to enhance diarization accuracy.

WhisperX Node Common Errors and Solutions:

"CUDA out of memory"

  • Explanation: This error occurs when the GPU does not have enough memory to process the audio file.
  • Solution: Reduce the batch_size, select a smaller model_type, or run the node on CPU instead of the GPU.
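A common mitigation pattern is to retry with a halved batch size whenever a CUDA out-of-memory error is raised. This generic wrapper is an illustrative sketch (run stands for any function that accepts a batch size), not a feature of the node:

```python
def with_oom_fallback(run, batch_size, min_batch=1):
    """Call run(batch_size), halving the batch size on CUDA OOM errors."""
    while True:
        try:
            return run(batch_size)
        except RuntimeError as err:
            # Re-raise anything that is not an OOM, or if we cannot shrink.
            if "out of memory" not in str(err) or batch_size <= min_batch:
                raise
            batch_size = max(min_batch, batch_size // 2)
```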

"Audio file not found"

  • Explanation: The specified audio file path is incorrect or the file does not exist.
  • Solution: Verify the audio file path and ensure the file exists at the specified location.

"Model loading failed"

  • Explanation: The specified model_type could not be loaded, possibly due to an incorrect model name or missing files.
  • Solution: Check the model_type parameter and ensure the model files are correctly placed and accessible.

"Speaker diarization failed"

  • Explanation: The speaker diarization process encountered an issue, possibly due to insufficient audio quality or incorrect parameter settings.
  • Solution: Ensure the audio quality is sufficient and verify the min_speakers and max_speakers parameters if used.

WhisperX Node Related Nodes

Go back to the ComfyUI-WhisperX extension to check out more related nodes.

© Copyright 2024 RunComfy. All Rights Reserved.
