
ComfyUI Node: Image to Text - Auto Caption

Class Name: img2txt BLIP_Llava Multimodel Tagger
Category: img2txt
Author: christian-byrne (Account age: 1364 days)
Extension: img2txt-comfyui-nodes
Last Updated: 2024-06-23
GitHub Stars: 0.03K

How to Install img2txt-comfyui-nodes

Install this extension via the ComfyUI Manager by searching for img2txt-comfyui-nodes:
  1. Click the Manager button in the main menu.
  2. Select Custom Nodes Manager.
  3. Enter img2txt-comfyui-nodes in the search bar and install the extension.
After installation, click the Restart button to restart ComfyUI, then manually refresh your browser to clear the cache and load the updated list of nodes.

Visit ComfyUI Online for a ready-to-use ComfyUI environment:

  • Free trial available
  • High-speed GPU machines
  • 200+ preloaded models/nodes
  • Freedom to upload custom models/nodes
  • 50+ ready-to-run workflows
  • 100% private workspace with up to 200GB storage
  • Dedicated Support

Run ComfyUI Online

Image to Text - Auto Caption Description

Automatically generate descriptive image captions using advanced models, customizable parameters, and Chinese Q&A support.

Image to Text - Auto Caption:

The img2txt BLIP_Llava Multimodel Tagger automatically generates descriptive captions for images using several vision-language models: BLIP, Llava, MiniCPM, and MS-GIT. You can run these models individually or in combination to produce rich, detailed descriptions of your images. Various parameters let you tailor the captions to your needs, for example by asking about the style, medium, or background of the image. Models are downloaded and managed automatically, so you can get started without extensive technical setup. The node also supports questions and answers in Chinese via the MiniCPM model, broadening its applicability for diverse linguistic needs.

Image to Text - Auto Caption Input Parameters:

input_image

This parameter accepts a tensor representing the input image. The tensor should have shape [Batch_n, H, W, 3], where Batch_n is the batch size, H the height, W the width, and 3 the RGB color channels. This image is processed to generate the captions.
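Inside a workflow, a Load Image node produces this format for you. If you are feeding images from your own code, a minimal sketch of the expected layout (the file name is a placeholder) looks like this:

```python
import numpy as np
import torch
from PIL import Image

img = Image.open("example.png").convert("RGB")    # force 3 channels
arr = np.asarray(img).astype(np.float32) / 255.0  # H x W x 3, values in [0, 1]
tensor = torch.from_numpy(arr).unsqueeze(0)       # 1 x H x W x 3 (Batch_n = 1)
print(tensor.shape)                               # torch.Size([1, H, W, 3])
```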

use_blip_model

A boolean parameter that determines whether to use the BLIP model for caption generation. When set to true, the BLIP model will be used, which requires approximately 2GB of disk space. The default value is true.

use_llava_model

A boolean parameter that determines whether to use the Llava model for caption generation. When set to true, the Llava model will be used, which requires approximately 15GB of disk space. The default value is false.

use_all_models

A boolean parameter that, when set to true, enables the use of all available models (BLIP, Llava, MiniCPM, MS-GIT) and combines their outputs. This option requires a total disk space of over 20GB. The default value is false.

use_mini_pcm_model

A boolean parameter that determines whether to use the MiniCPM model for caption generation. When set to true, the MiniCPM model will be used, which requires approximately 6GB of disk space. The default value is false.

blip_caption_prefix

A string parameter that sets the prefix for captions generated by the BLIP model. This helps in conditioning the caption generation. The default value is "a photograph of".
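As an illustration of how a prefix conditions caption generation, here is a minimal sketch using the Hugging Face transformers BLIP API. This mirrors, but is not necessarily identical to, the node's internal code, and the checkpoint name is a common public one that may differ from what the node downloads:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.png").convert("RGB")
# Passing text alongside the image makes BLIP continue from the prefix.
inputs = processor(image, text="a photograph of", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True))  # caption begins with the prefix
```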

prompt_questions

A string parameter that allows you to specify questions to ask about the image. These questions can be about the medium, art style, background, etc. Each question should be separated by a newline character.
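For example, a valid value asking about the medium, style, and background (one question per line) could be built like this:

```python
# Example prompt_questions value: one question per line.
prompt_questions = "\n".join([
    "What is the medium of this image?",
    "What art style is it in?",
    "What is in the background?",
])
```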

temperature

A float parameter that controls the randomness of caption generation: lower values make the output more deterministic, while higher values increase randomness. It is passed through to each model's text-generation step (see the combined sketch after search_beams below).

repetition_penalty

A float parameter that penalizes repeated phrases in the generated captions, producing more diverse and interesting descriptions. Raise this value if outputs become repetitive.

min_words

An integer parameter that sets the minimum number of words in the generated caption. This ensures that the captions are sufficiently descriptive.

max_words

An integer parameter that sets the maximum number of words in the generated caption. This helps in keeping the captions concise and to the point.

search_beams

An integer parameter that determines the number of beams used in the search process for generating captions. More beams can lead to better results but may increase computation time.
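temperature, repetition_penalty, min_words, max_words, and search_beams all map naturally onto a Hugging Face-style generate() call. The sketch below reuses model and inputs from the BLIP sketch above and is illustrative rather than the node's exact code; in particular, treating word counts as token counts is an assumption:

```python
# Hedged sketch: plausible mapping of the node's sampling controls onto
# Hugging Face generation arguments (not necessarily the node's exact code).
gen_kwargs = dict(
    do_sample=True,          # temperature only takes effect when sampling
    temperature=0.8,         # lower -> more deterministic captions
    repetition_penalty=1.2,  # values > 1.0 discourage repeated phrases
    min_new_tokens=8,        # stands in for min_words (assumed ~1 token per word)
    max_new_tokens=48,       # stands in for max_words (same assumption)
    num_beams=4,             # search_beams: more beams can improve quality but cost time
)
out = model.generate(**inputs, **gen_kwargs)
print(processor.decode(out[0], skip_special_tokens=True))
```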

exclude_terms

A string parameter that allows you to specify terms to be excluded from the generated captions. This can be useful for filtering out unwanted words or phrases.
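One simple way to honor an exclusion list is to filter terms out of the finished caption. Whether the node post-filters like this or constrains decoding instead is not documented here, and the comma delimiter below is an assumption:

```python
def apply_exclude_terms(caption: str, exclude_terms: str) -> str:
    # Assumes a comma-separated exclusion list; the node's actual
    # delimiter and filtering strategy may differ.
    excluded = {t.strip().lower() for t in exclude_terms.split(",") if t.strip()}
    return " ".join(w for w in caption.split() if w.lower() not in excluded)

print(apply_exclude_terms("a photograph of a watermark on a beach", "watermark"))
```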

output_text

A string parameter that holds the generated caption text. This is an optional parameter and can be used to store or display the output.

unique_id

An optional parameter that can be used to assign a unique identifier to the process. This can be useful for tracking and managing multiple caption generation tasks.

extra_pnginfo

An optional parameter that can be used to store additional information in the PNG metadata. This can be useful for embedding extra details about the image or the caption generation process.

Image to Text - Auto Caption Output Parameters:

output_text

This parameter contains the generated caption(s) for the input image. The output is a string or a tuple of strings, depending on the number of models used and the configuration settings. Each string provides a descriptive caption that can be used for various purposes, such as image annotation, content creation, or enhancing accessibility.
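Since downstream scripts may receive either form, a small normalizing helper (illustrative only) keeps handling uniform:

```python
def as_caption_list(output_text):
    # Normalize output_text to a list of captions, since it may be a
    # single string or a tuple of strings depending on the models used.
    return [output_text] if isinstance(output_text, str) else list(output_text)
```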

Image to Text - Auto Caption Usage Tips:

  • To get the most detailed captions, consider enabling the use_all_models parameter, which combines the strengths of all available models.
  • Use the blip_caption_prefix to condition the BLIP model's output, making it more relevant to your specific needs.
  • Adjust the temperature and repetition_penalty parameters to fine-tune the creativity and diversity of the generated captions.
  • If you have specific questions about the image, use the prompt_questions parameter to guide the Llava model in generating more targeted descriptions.

Image to Text - Auto Caption Common Errors and Solutions:

"Model not found"

  • Explanation: This error occurs when the specified model is not available for download or is incorrectly referenced.
  • Solution: Ensure that the model ID is correct and that you have an active internet connection for automatic model download.

"Insufficient disk space"

  • Explanation: This error occurs when there is not enough disk space to download and use the selected models.
  • Solution: Free up disk space or select fewer models to reduce the required disk space.

"Out of memory"

  • Explanation: This error occurs when the system runs out of memory while processing the image.
  • Solution: Reduce the image size or batch size, or use models with lower memory requirements.

"Invalid input image format"

  • Explanation: This error occurs when the input image tensor is not in the expected format.
  • Solution: Ensure that the input image tensor is in the format [Batch_n, H, W, 3-channel].
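A common cause is a channels-first tensor coming from other libraries; permuting the axes restores the expected layout (a hedged sketch; adapt it to wherever your tensor comes from):

```python
import torch

x = torch.rand(1, 3, 512, 512)  # [B, C, H, W] - channels-first, wrong for this node
if x.ndim == 4 and x.shape[1] == 3 and x.shape[-1] != 3:
    x = x.permute(0, 2, 3, 1)   # -> [B, H, W, 3] as the node expects
print(x.shape)                  # torch.Size([1, 512, 512, 3])
```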

"Empty prompt questions"

  • Explanation: This error occurs when the prompt_questions parameter is empty or incorrectly formatted.
  • Solution: Ensure that each question is separated by a newline character and that the parameter is not empty.

Image to Text - Auto Caption Related Nodes

Go back to the img2txt-comfyui-nodes extension to check out more related nodes.