Transcribe audio to text with high accuracy and precise timing using the Whisper model, designed for AI artists.
The Apply Whisper node is designed to transcribe audio files into text using the Whisper model, a state-of-the-art speech recognition system. This node is particularly useful for AI artists who need to convert spoken words into written text for further processing, such as adding subtitles to videos or creating text-based content from audio recordings. By leveraging the Whisper model, the node ensures high accuracy in transcription, capturing not only the text but also the precise timing of each word and segment. This detailed alignment information can be invaluable for synchronizing subtitles with audio or for any application requiring precise timing data.
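Under the hood, a node like this typically calls the open-source whisper package with word-level timestamps enabled. The following is a minimal sketch of that flow, not the node's actual source; the file name audio.wav is a placeholder, and the output field names mirror the node's outputs described below.

```python
import whisper  # pip install openai-whisper

# Load one of the available model sizes: tiny, base, small, medium, large.
model = whisper.load_model("base")

# word_timestamps=True asks Whisper to align individual words, not just segments.
result = model.transcribe("audio.wav", word_timestamps=True)

text = result["text"].strip()  # the full transcription, whitespace trimmed

# Segment-level alignment: one entry per phrase, with start/end in seconds.
segments_alignment = [
    {"value": seg["text"].strip(), "start": seg["start"], "end": seg["end"]}
    for seg in result["segments"]
]

# Word-level alignment: one entry per word, also in seconds.
words_alignment = [
    {"value": word["word"].strip(), "start": word["start"], "end": word["end"]}
    for seg in result["segments"]
    for word in seg.get("words", [])
]
```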
The `audio` parameter expects an input of type `VHS_AUDIO`. This parameter represents the audio data that you want to transcribe. The audio data should be provided in a format that the node can process, typically as a byte stream. The quality and clarity of the audio can significantly impact the accuracy of the transcription, so it is advisable to use clear and noise-free recordings.
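If you need to feed a similar byte stream to Whisper outside of ComfyUI, one approach is to spill the bytes to a temporary file, since whisper's transcribe() accepts a file path and decodes it with ffmpeg. A minimal sketch, assuming (as in ComfyUI-VideoHelperSuite) that a `VHS_AUDIO` value is a zero-argument callable returning raw audio bytes:

```python
import tempfile
import whisper  # pip install openai-whisper

def transcribe_vhs_audio(vhs_audio, model_name: str = "base") -> str:
    # Assumption: vhs_audio is a zero-argument callable returning raw audio
    # bytes (the Video Helper Suite convention). Fall back to plain bytes.
    audio_bytes = vhs_audio() if callable(vhs_audio) else vhs_audio

    # transcribe() takes a file path and decodes it via ffmpeg, so write the
    # byte stream to a temporary file first.
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(audio_bytes)
        tmp.flush()
        model = whisper.load_model(model_name)
        return model.transcribe(tmp.name)["text"].strip()
```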
The `model` parameter allows you to select the specific Whisper model to use for transcription. The available options are `base`, `tiny`, `small`, `medium`, and `large`. Each model varies in size and accuracy, with larger models generally providing more accurate transcriptions but requiring more computational resources. The choice of model can affect the speed and accuracy of the transcription process, so you should select the model that best fits your needs and available resources.
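When VRAM is the limiting factor, a common pattern is to try the preferred size and fall back to smaller ones. A hedged sketch of such a loader (the size ladder matches the options above; CUDA detection via torch is an assumption about your environment):

```python
import torch
import whisper

def load_best_model(preferred: str = "large"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    sizes = ["large", "medium", "small", "base", "tiny"]
    # Try the preferred size first, then progressively smaller ones on OOM.
    for name in sizes[sizes.index(preferred):]:
        try:
            return whisper.load_model(name, device=device)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
    raise RuntimeError("No Whisper model could be loaded")
```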
The `text` output parameter provides the transcribed text from the input audio. This is the main output of the node and contains the entire spoken content converted into written form. The text is stripped of any leading or trailing whitespace to ensure clean and accurate results.
The `segments_alignment` output parameter is a list of dictionaries, each representing a segment of the transcribed text. Each dictionary contains `value` (the transcribed text of the segment), `start` (the start time of the segment in the audio), and `end` (the end time of the segment). This detailed alignment information is useful for applications that require precise synchronization of text with audio, such as subtitle generation.
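For instance, turning `segments_alignment` into an SRT subtitle file is mostly timestamp formatting. A minimal sketch, assuming the list-of-dictionaries shape described above:

```python
def seconds_to_srt(t: float) -> str:
    # SRT timestamps use the form HH:MM:SS,mmm.
    total_ms = int(round(t * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments_alignment) -> str:
    # Number each segment and pair it with its start --> end time range.
    blocks = []
    for i, seg in enumerate(segments_alignment, start=1):
        blocks.append(
            f"{i}\n"
            f"{seconds_to_srt(seg['start'])} --> {seconds_to_srt(seg['end'])}\n"
            f"{seg['value']}\n"
        )
    return "\n".join(blocks)
```

Writing the returned string to a .srt file alongside your video is enough for most players to pick up the subtitles.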
The `words_alignment` output parameter is a list of dictionaries, each representing a word in the transcribed text. Each dictionary contains `value` (the transcribed word), `start` (the start time of the word in the audio), and `end` (the end time of the word). This fine-grained alignment data is essential for tasks that need exact word-level timing, such as creating karaoke-style lyrics or detailed subtitle tracks.
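As a quick illustration, the word timings can drive a karaoke-style printout that surfaces the pacing of a recording. A minimal sketch, again assuming the dictionary shape above:

```python
import time

def play_karaoke(words_alignment):
    # Emit each word at the moment it is spoken, relative to playback start.
    t0 = time.monotonic()
    for word in words_alignment:
        delay = word["start"] - (time.monotonic() - t0)
        if delay > 0:
            time.sleep(delay)
        print(word["value"], end=" ", flush=True)
    print()
```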
Larger models such as `medium` and `large` offer higher accuracy but require more computational power, so balance model size against your hardware and turnaround needs. Use the `segments_alignment` and `words_alignment` outputs to create precisely timed subtitles or to analyze the timing of spoken words in your audio.