
ComfyUI Node: Image to Text - Auto Caption

Class Name: img2txt BLIP_Llava Multimodel Tagger
Category: img2txt
Author: christian-byrne (Account age: 1364 days)
Extension: img2txt-comfyui-nodes
Last Updated: 6/23/2024
GitHub Stars: 0.0K

How to Install img2txt-comfyui-nodes

Install this extension via the ComfyUI Manager by searching for img2txt-comfyui-nodes:

1. Click the Manager button in the main menu.
2. Select the Custom Nodes Manager button.
3. Enter img2txt-comfyui-nodes in the search bar.

After installation, click the Restart button to restart ComfyUI. Then manually refresh your browser to clear the cache and access the updated list of nodes.


Image to Text - Auto Caption Description

Automatically generates descriptive image captions using the BLIP, Llava, MiniCPM, and MS-GIT models, with customizable parameters and Chinese Q&A support via MiniCPM.

Image to Text - Auto Caption:

The img2txt BLIP_Llava Multimodel Tagger automatically generates descriptive captions for images using four vision-language models: BLIP, Llava, MiniCPM, and MS-GIT. You can use these models individually or in combination to produce rich, detailed descriptions of your images. Various parameters let you tailor the captions to your needs, for example by asking about the style, medium, or background of an image. The node downloads and manages the models automatically, so you can get started without extensive technical knowledge, and it supports questions and answers in Chinese via the MiniCPM model, broadening its applicability to diverse linguistic needs.

Image to Text - Auto Caption Input Parameters:

input_image

This parameter accepts a tensor representing the input image. The tensor should be in the format [Batch_n, H, W, 3-channel], where Batch_n is the batch size, H is the height, W is the width, and 3-channel represents the RGB color channels. This image will be processed to generate the captions.
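
For reference, here is a minimal sketch of building a tensor in this layout from a PIL image. The helper name and the float-in-[0, 1] scaling are assumptions based on the usual ComfyUI IMAGE convention, not taken from this node's source:

```python
import numpy as np
import torch
from PIL import Image

def pil_to_image_tensor(pil_image: Image.Image) -> torch.Tensor:
    # Hypothetical helper: RGB pixels scaled to [0, 1], shaped [Batch_n, H, W, 3].
    array = np.asarray(pil_image.convert("RGB"), dtype=np.float32) / 255.0
    return torch.from_numpy(array).unsqueeze(0)  # add the batch dimension

image = pil_to_image_tensor(Image.open("example.png"))
print(image.shape)  # e.g. torch.Size([1, 512, 512, 3])
```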

use_blip_model

A boolean parameter that determines whether to use the BLIP model for caption generation. When set to true, the BLIP model will be used, which requires approximately 2GB of disk space. The default value is true.

use_llava_model

A boolean parameter that determines whether to use the Llava model for caption generation. When set to true, the Llava model will be used, which requires approximately 15GB of disk space. The default value is false.

use_all_models

A boolean parameter that, when set to true, enables all available models (BLIP, Llava, MiniCPM, MS-GIT) and combines their outputs. This option requires over 20GB of disk space in total. The default value is false.

use_mini_pcm_model

A boolean parameter that determines whether to use the MiniCPM model for caption generation. When set to true, the MiniCPM model will be used, which requires approximately 6GB of disk space. The default value is false.

blip_caption_prefix

A string parameter that sets the prefix for captions generated by the BLIP model. This helps in conditioning the caption generation. The default value is "a photograph of".

prompt_questions

A string parameter that allows you to specify questions to ask about the image. These questions can be about the medium, art style, background, etc. Each question should be separated by a newline character.
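
For example, a questions string could be assembled like this (the questions themselves are illustrative):

```python
# Separate each question with a newline character.
prompt_questions = "\n".join([
    "What is the subject of the image?",
    "What medium is the image made in?",
    "What is in the background?",
])
```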

temperature

A float parameter that controls the randomness of caption generation: lower values make the output more deterministic, while higher values produce more varied, creative captions.

repetition_penalty

A float parameter that penalizes repeated phrases in the generated captions; raising it discourages repetitive output and yields more diverse, interesting descriptions.

min_words

An integer parameter that sets the minimum number of words in the generated caption. This ensures that the captions are sufficiently descriptive.

max_words

An integer parameter that sets the maximum number of words in the generated caption. This helps in keeping the captions concise and to the point.

search_beams

An integer parameter that determines the number of beams used in the search process for generating captions. More beams can lead to better results but may increase computation time.
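
The five generation parameters above (temperature, repetition_penalty, min_words, max_words, and search_beams) correspond conceptually to standard Hugging Face generate() arguments. The sketch below shows that mapping with a stock BLIP captioning model; it illustrates how such parameters are typically wired rather than this node's actual implementation, and word limits are approximated here with token limits:

```python
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.png").convert("RGB")
inputs = processor(images=image, text="a photograph of", return_tensors="pt")

output_ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,         # `temperature`: higher = more random
    repetition_penalty=1.2,  # `repetition_penalty`: >1 discourages repeats
    min_new_tokens=5,        # rough analogue of `min_words` (tokens, not words)
    max_new_tokens=50,       # rough analogue of `max_words`
    num_beams=4,             # `search_beams`
)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```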

exclude_terms

A string parameter that allows you to specify terms to be excluded from the generated captions. This can be useful for filtering out unwanted words or phrases.
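
The node's exact matching rules are not documented here, but a simple word-level post-filter illustrates the idea; the comma-separated input format and the helper are assumptions:

```python
def apply_exclude_terms(caption: str, exclude_terms: str) -> str:
    # Hypothetical post-filter: drop any word that appears in the
    # comma-separated exclusion list (case-insensitive).
    excluded = {t.strip().lower() for t in exclude_terms.split(",") if t.strip()}
    return " ".join(w for w in caption.split() if w.lower() not in excluded)

print(apply_exclude_terms("a blurry photograph of a dog", "blurry"))
# -> "a photograph of a dog"
```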

output_text

A string parameter that holds the generated caption text. This is an optional parameter and can be used to store or display the output.

unique_id

An optional parameter that can be used to assign a unique identifier to the process. This can be useful for tracking and managing multiple caption generation tasks.

extra_pnginfo

An optional parameter that can be used to store additional information in the PNG metadata. This can be useful for embedding extra details about the image or the caption generation process.

Image to Text - Auto Caption Output Parameters:

output_text

This parameter contains the generated caption(s) for the input image. The output is a string or a tuple of strings, depending on the number of models used and the configuration settings. Each string provides a descriptive caption that can be used for various purposes, such as image annotation, content creation, or enhancing accessibility.
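
Since downstream code may receive either form, a small hedged helper (the function name is illustrative) can normalize the result:

```python
def captions_as_list(output_text):
    # The node may return a single string or a tuple of strings,
    # depending on how many models were enabled.
    if isinstance(output_text, tuple):
        return list(output_text)
    return [output_text]

for caption in captions_as_list(("a dog", "a photograph of a dog")):
    print(caption)
```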

Image to Text - Auto Caption Usage Tips:

  • To get the most detailed captions, consider enabling the use_all_models parameter, which combines the strengths of all available models.
  • Use the blip_caption_prefix to condition the BLIP model's output, making it more relevant to your specific needs.
  • Adjust the temperature and repetition_penalty parameters to fine-tune the creativity and diversity of the generated captions.
  • If you have specific questions about the image, use the prompt_questions parameter to guide the Llava model in generating more targeted descriptions.

Image to Text - Auto Caption Common Errors and Solutions:

"Model not found"

  • Explanation: This error occurs when the specified model is not available for download or is incorrectly referenced.
  • Solution: Ensure that the model ID is correct and that you have an active internet connection for automatic model download.

"Insufficient disk space"

  • Explanation: This error occurs when there is not enough disk space to download and use the selected models.
  • Solution: Free up disk space or select fewer models to reduce the required disk space.

"Out of memory"

  • Explanation: This error occurs when the system runs out of memory while processing the image.
  • Solution: Reduce the image size or batch size, or use models with lower memory requirements; a simple downscale is sketched below.
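
For instance, you can downscale the source image with PIL before it enters the workflow (the target size here is illustrative):

```python
from PIL import Image

image = Image.open("example.png")
image.thumbnail((768, 768))  # downscale in place, preserving aspect ratio
image.save("example_small.png")
```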

"Invalid input image format"

  • Explanation: This error occurs when the input image tensor is not in the expected format.
  • Solution: Ensure that the input image tensor is in the format [Batch_n, H, W, 3-channel]; a common fix is sketched below.
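
A frequent cause is an upstream node producing channel-first tensors. A minimal sketch, assuming a standard PyTorch tensor, that permutes [B, 3, H, W] data into the layout this node expects:

```python
import torch

image = torch.rand(1, 3, 512, 512)  # example channel-first tensor

# Permute [B, 3, H, W] to the channel-last [B, H, W, 3] layout.
if image.ndim == 4 and image.shape[1] == 3 and image.shape[-1] != 3:
    image = image.permute(0, 2, 3, 1)

print(image.shape)  # torch.Size([1, 512, 512, 3])
```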

"Empty prompt questions"

  • Explanation: This error occurs when the prompt_questions parameter is empty or incorrectly formatted.
  • Solution: Ensure that each question is separated by a newline character and that the parameter is not empty.

Image to Text - Auto Caption Related Nodes

Go back to the extension to check out more related nodes.
img2txt-comfyui-nodes