ComfyUI Node: 🎤 Speech Recognition

Class Name: Speech Recognition
Category: 💠 Mana Nodes
Author: ForeignGods (Account age: 1241 days)
Extension: ComfyUI-Mana-Nodes
Last Updated: 5/29/2024
GitHub Stars: 0.2K

How to Install ComfyUI-Mana-Nodes

Install this extension through the ComfyUI Manager:
  1. Click the Manager button in the main menu.
  2. Select the Custom Nodes Manager button.
  3. Enter ComfyUI-Mana-Nodes in the search bar and install the extension from the results.
After installation, click the Restart button to restart ComfyUI. Then manually refresh your browser to clear the cache and load the updated list of nodes.

🎤 Speech Recognition Description

Convert spoken audio to text using Wav2Vec2 models, with per-word timestamps and optional spell-checking.

🎤 Speech Recognition:

The Speech Recognition node converts spoken language in audio files into written text. It uses Wav2Vec2 models to transcribe the audio and includes a timestamp for each word. A spell-checking pass can then correct likely misspellings in the result. Typical uses include creating subtitles, transcribing interviews, and converting spoken notes into text.

🎤 Speech Recognition Input Parameters:

audio_file

This parameter specifies the path to the audio file that you want to transcribe. The audio file should be in a format supported by the librosa library, such as WAV. The quality and clarity of the audio file can significantly impact the accuracy of the transcription.

wav2vec2_model

This parameter indicates the specific Wav2Vec2 model to be used for transcription. Different models may offer varying levels of accuracy and performance, so selecting the appropriate model can influence the quality of the transcription.

spell_check_language

This parameter sets the language for spell-checking the transcription. It accepts language names like "English", "Spanish", "French", etc. The spell checker will correct the transcription based on the selected language, improving the overall accuracy and readability of the text.
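
As an illustration of how a display name such as "English" might be turned into the short code that spell-checking libraries such as pyspellchecker typically expect, here is a minimal sketch. The `LANGUAGE_CODES` table and the `language_code` helper are hypothetical names introduced for this example; the node's actual lookup may differ.

```python
# Hypothetical mapping from display names to ISO 639-1 language codes.
# The node's real list of supported languages may be longer or different.
LANGUAGE_CODES = {
    "English": "en",
    "Spanish": "es",
    "French": "fr",
    "German": "de",
    "Portuguese": "pt",
}


def language_code(name: str) -> str:
    """Return the ISO 639-1 code for a language display name."""
    key = name.strip().title()  # tolerate stray spaces and lowercase input
    if key not in LANGUAGE_CODES:
        raise ValueError(f"Unsupported spell-check language: {name!r}")
    return LANGUAGE_CODES[key]
```

Failing loudly on an unknown name, rather than silently falling back to English, makes a misconfigured spell_check_language value easier to spot.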

framestamps_max_chars

This parameter defines the maximum number of characters allowed per frame in the transcription output. It helps in structuring the transcription into manageable segments, especially useful for creating subtitles or other time-coded text formats.
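
One plausible way to apply such a character limit is to pack words greedily into segments. The sketch below assumes that approach; `group_words` is a hypothetical helper, not the node's own code, and the node may split text differently.

```python
def group_words(words, max_chars):
    """Greedily pack words into segments of at most max_chars characters.

    A single word longer than max_chars is kept whole as its own segment.
    """
    segments = []
    current = ""
    for word in words:
        candidate = f"{current} {word}" if current else word
        if current and len(candidate) > max_chars:
            # Adding this word would exceed the limit: close the segment.
            segments.append(current)
            current = word
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

With `max_chars=11`, the words "hello world this is a test" become the segments "hello world", "this is a", and "test". Smaller limits give shorter on-screen captions at the cost of more frequent changes.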

fps

This optional parameter sets the frames per second for the transcription output. The default value is 30 fps. Adjusting this value can help synchronize the transcription with video content more accurately.
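
To show how fps relates word timestamps to video frames, the following sketch converts a time in seconds to a frame index by multiplying by fps. Both helper names are hypothetical, and rounding to the nearest frame is an assumption; the node may floor or round differently.

```python
def seconds_to_frame(seconds: float, fps: int = 30) -> int:
    """Convert a timestamp in seconds to the nearest frame index."""
    return round(seconds * fps)


def to_framestamps(word_times, fps: int = 30):
    """Map (word, start_seconds) pairs to (word, frame_index) pairs."""
    return [(word, seconds_to_frame(start, fps)) for word, start in word_times]
```

For example, a word that starts at 0.5 seconds lands on frame 15 at 30 fps but on frame 12 at 24 fps, which is why the fps value should match the target video.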

uppercase

This optional parameter determines whether the transcription should be converted to uppercase. If set to True, the entire transcription will be in uppercase letters. This can be useful for specific formatting requirements.

🎤 Speech Recognition Output Parameters:

audio_file

This output parameter returns the path to the transcribed audio file. The file will contain the transcription in a structured format, including timestamps and any applied spell-check corrections.

🎤 Speech Recognition Usage Tips:

  • Ensure your audio file is clear and free from background noise to improve transcription accuracy.
  • Choose the appropriate Wav2Vec2 model based on your specific needs; some models may perform better with certain accents or languages.
  • Use the spell_check_language parameter to automatically correct common spelling errors in the transcription.
  • Adjust the framestamps_max_chars and fps parameters to better align the transcription with video content, if applicable.
  • Consider setting the uppercase parameter to True if you need the transcription in uppercase for specific formatting purposes.

🎤 Speech Recognition Common Errors and Solutions:

Error loading audio file

  • Explanation: This error occurs when the audio file cannot be loaded, possibly due to an unsupported format or a corrupted file.
  • Solution: Ensure the audio file is in a supported format (e.g., WAV) and is not corrupted. Try re-saving the file in a different format if necessary.
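
Before handing a file to the node, a quick sanity check with Python's standard `wave` module can tell you whether it is a readable uncompressed PCM WAV file. This is a sketch under that assumption: `probe_wav` is a hypothetical helper, and a `None` result only means the standard library could not parse the file, not that every audio loader would reject it.

```python
import wave


def probe_wav(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file.

    Returns None if the standard-library parser cannot read the file,
    for example because it is truncated or not a PCM WAV file at all.
    """
    try:
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            return rate, wav.getnframes() / rate
    except (wave.Error, EOFError):
        return None
```

If the check returns `None` for a file that plays fine elsewhere, re-exporting it as 16-bit PCM WAV from an audio editor is a reasonable next step.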

SpellChecker module is NOT accessible.

  • Explanation: This error indicates that the SpellChecker module is not installed or cannot be accessed.
  • Solution: Install the SpellChecker module using pip install pyspellchecker and ensure it is accessible in your environment.
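
To confirm that the package is visible to the Python environment that runs ComfyUI, you can check whether the `spellchecker` module (the import name used by pyspellchecker) can be found. The helper name below is hypothetical; run it with the same interpreter ComfyUI uses, since a separate virtual environment or portable build has its own packages.

```python
import importlib.util


def spellchecker_available() -> bool:
    """Return True if the `spellchecker` module can be located for import."""
    return importlib.util.find_spec("spellchecker") is not None


if __name__ == "__main__":
    if spellchecker_available():
        print("spellchecker module found")
    else:
        print("spellchecker module missing: run `pip install pyspellchecker`")
```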

Model not found

  • Explanation: This error occurs when the specified Wav2Vec2 model cannot be found or loaded.
  • Solution: Verify that the model name is correct and that it is available in the Hugging Face model repository. Ensure you have an active internet connection to download the model if necessary.

Audio file path is invalid

  • Explanation: This error indicates that the provided path to the audio file is incorrect or the file does not exist.
  • Solution: Double-check the file path for any typos or errors and ensure the file exists at the specified location.
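
A small pre-flight check can catch path problems before the workflow runs. The sketch below uses only the standard library; `resolve_audio_path` is a hypothetical helper for illustration, not part of the node.

```python
from pathlib import Path


def resolve_audio_path(raw: str) -> Path:
    """Return the absolute path to an existing audio file.

    Strips surrounding whitespace and double quotes (common when a path is
    pasted from a file manager) and expands a leading "~".
    """
    path = Path(raw.strip().strip('"')).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Audio file not found: {path}")
    return path.resolve()
```

Passing the resolved absolute path to audio_file avoids surprises when ComfyUI's working directory differs from the one you expect.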

© Copyright 2024 RunComfy. All Rights Reserved.
