Powerful node for audio transcription, alignment, and speaker labeling with GPU acceleration for fast processing.
WhisperX is a powerful node designed to transcribe and process audio files using advanced speech recognition and alignment techniques. It leverages the capabilities of the Whisper model to transcribe audio into text, align the transcriptions with the audio, and optionally assign speaker labels for multi-speaker scenarios. This node is particularly beneficial for tasks that require accurate and efficient transcription of audio content, such as creating subtitles, generating transcripts for meetings, or processing audio data for further analysis. By utilizing GPU resources when available, WhisperX ensures fast and efficient processing, making it a valuable tool for AI artists and developers working with audio data.
The audio parameter specifies the path to the audio file that you want to transcribe. This file should be in a supported audio format such as WAV or MP3. The quality and clarity of the audio can significantly impact the accuracy of the transcription.
This parameter defines the type of Whisper model to be used for transcription. Larger models are generally more accurate but slower and require more memory, so select the model type that best matches your accuracy and speed requirements.
The batch_size parameter determines the number of audio segments processed in a single batch. A larger batch size can speed up processing but may require more memory. The default value is typically set to balance performance and resource usage.
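To illustrate the tradeoff, batching can be sketched in plain Python. The make_batches helper and the segment list below are hypothetical, not WhisperX internals; they only show how a larger batch_size means fewer, larger work units:

```python
def make_batches(segments, batch_size):
    """Group audio segments into batches of at most batch_size items."""
    return [segments[i:i + batch_size] for i in range(0, len(segments), batch_size)]

# 10 hypothetical voice-activity segments, processed 4 at a time
segments = [f"segment_{i}" for i in range(10)]
batches = make_batches(segments, batch_size=4)
print(len(batches))            # → 3
print([len(b) for b in batches])  # → [4, 4, 2]
```

Each batch is transcribed in one forward pass, so doubling batch_size roughly halves the number of passes at the cost of more GPU memory per pass.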
The if_mutiple_speaker parameter is a boolean indicating whether the audio contains multiple speakers. If set to true, the node performs speaker diarization to assign speaker labels to the different segments of the audio.
The use_auth_token parameter is used for authentication when accessing models or services that require authorization. This is particularly relevant for gated models, such as the speaker-diarization models hosted on Hugging Face, which enforce access control.
The min_speakers parameter (optional) specifies the minimum number of speakers expected in the audio. Providing this hint to the model can improve the accuracy of speaker diarization.
The max_speakers parameter (optional) specifies the maximum number of speakers expected in the audio. Like min_speakers, it helps the model allocate speaker labels more reliably.
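A minimal sketch of how these hints constrain diarization is shown below. The clamp_speaker_count helper is hypothetical; in practice the clustering inside the diarization model uses the bounds directly, but the effect is the same: the final speaker count never falls outside [min_speakers, max_speakers]:

```python
def clamp_speaker_count(estimated, min_speakers=None, max_speakers=None):
    """Constrain a model's estimated speaker count to user-provided bounds."""
    if min_speakers is not None:
        estimated = max(estimated, min_speakers)
    if max_speakers is not None:
        estimated = min(estimated, max_speakers)
    return estimated

print(clamp_speaker_count(1, min_speakers=2))                   # → 2
print(clamp_speaker_count(7, min_speakers=2, max_speakers=4))   # → 4
```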
The transcribed_text parameter contains the text output of the transcription process. This is the primary result of the node, providing a textual representation of the spoken content in the audio file.
The aligned_segments parameter provides detailed information about the alignment of the transcribed text with the audio. This includes timestamps and other metadata that can be used for precise synchronization of text and audio.
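For example, aligned segments of the shape below (start and end times in seconds plus the text spoken in that span; the exact field names are assumptions for illustration) can be converted into SRT subtitle entries:

```python
def fmt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render a list of {'start', 'end', 'text'} dicts as an SRT string."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{fmt_time(seg['start'])} --> {fmt_time(seg['end'])}\n{seg['text']}"
        )
    return "\n\n".join(blocks)

segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": "Let's get started."},
]
print(segments_to_srt(segments))
```

Because the timestamps come from forced alignment rather than the coarse Whisper decoder output, subtitles produced this way stay tightly synchronized with the audio.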
If the if_mutiple_speaker parameter is set to true, the speaker_labels parameter will contain information about the different speakers identified in the audio. This includes which segments of the text were spoken by which speakers.
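As a sketch, speaker-labeled segments (again with assumed field names) can be merged into a readable multi-speaker transcript by joining consecutive segments from the same speaker:

```python
def to_transcript(labeled_segments):
    """Join consecutive segments from the same speaker into one transcript line."""
    lines = []
    for seg in labeled_segments:
        if lines and lines[-1][0] == seg["speaker"]:
            lines[-1][1].append(seg["text"])
        else:
            lines.append([seg["speaker"], [seg["text"]]])
    return "\n".join(f"{spk}: {' '.join(texts)}" for spk, texts in lines)

labeled = [
    {"speaker": "SPEAKER_00", "text": "Hi there."},
    {"speaker": "SPEAKER_00", "text": "How are you?"},
    {"speaker": "SPEAKER_01", "text": "Doing well, thanks."},
]
print(to_transcript(labeled))
# prints:
# SPEAKER_00: Hi there. How are you?
# SPEAKER_01: Doing well, thanks.
```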
© Copyright 2024 RunComfy. All Rights Reserved.