ComfyUI Extension: VLM_nodes

Repo Name: ComfyUI_VLM_nodes
Author: gokayfem (Account age: 1058 days)
Nodes: 28
Latest Updated: 2024-07-31
GitHub Stars: 0.33K

How to Install VLM_nodes

Install this extension via the ComfyUI Manager by searching for VLM_nodes:
  1. Click the Manager button in the main menu.
  2. Select Custom Nodes Manager.
  3. Enter VLM_nodes in the search bar and install it.
After installation, click the Restart button to restart ComfyUI. Then manually refresh your browser to clear the cache and load the updated list of nodes.

Visit ComfyUI Online for a ready-to-use ComfyUI environment

  • Free trial available
  • High-speed GPU machines
  • 200+ preloaded models/nodes
  • Freedom to upload custom models/nodes
  • 50+ ready-to-run workflows
  • 100% private workspace with up to 200GB storage
  • Dedicated Support

Run ComfyUI Online

VLM_nodes Description

VLM_nodes offers custom nodes for Vision Language Models (VLM) and Large Language Models (LLM), enabling image captioning, automatic prompt generation, creative and consistent prompt suggestions, and keyword extraction.

VLM_nodes Introduction

ComfyUI_VLM_nodes is an extension designed to enhance the capabilities of AI artists by integrating Vision Language Models (VLMs) into the ComfyUI framework. This extension allows you to load and use various VLMs, enabling advanced functionalities such as structured output generation, image-to-music conversion, and automatic prompt generation. By leveraging models like LLaVa, ChatMusician, and InternLM-XComposer2-VL, ComfyUI_VLM_nodes provides a powerful toolset for creating and manipulating AI-generated content, making it easier for artists to achieve their creative goals.

How VLM_nodes Works

ComfyUI_VLM_nodes integrates VLMs into the ComfyUI environment through the llama-cpp-python library, loading models in the GGUF format that llama.cpp uses for quantized local inference. The extension downloads the necessary model files and CLIP projectors, places them in the appropriate directories, and then uses these models to process and generate content from user inputs. The Structured Output node, for example, can extract entities and numbers and classify prompts, while the image-to-music feature chains VLMs and LLMs to create music from images.
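
To make the loading flow concrete, here is a minimal sketch of how a GGUF vision-language model and its CLIP projector can be loaded with llama-cpp-python, the library the extension builds on. The file names and paths are illustrative; the extension normally downloads and places these files for you.

```python
# Sketch: loading a GGUF vision-language model with llama-cpp-python.
# File names are illustrative; the extension downloads them automatically.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(
    clip_model_path="models/LLavacheckpoints/mmproj-model-f16.gguf"  # CLIP projector
)
llm = Llama(
    model_path="models/LLavacheckpoints/llava-v1.5-7b.Q4_K_M.gguf",  # GGUF weights
    chat_handler=chat_handler,
    n_ctx=2048,       # enlarged context to accommodate image embeddings
    logits_all=True,  # required by the LLaVa chat handler
)
response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///path/to/image.png"}},
            {"type": "text", "text": "Describe this image for an art prompt."},
        ],
    }]
)
print(response["choices"][0]["message"]["content"])
```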

VLM_nodes Features

Structured Output

The Structured Output node simplifies obtaining reliable, machine-readable answers from VLMs. It can extract entities and numbers, classify prompts, and generate specific prompts. You can customize the output by adding descriptions to fields and selecting which attributes to return.
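
For a rough idea of how such constrained output can be produced, llama-cpp-python can restrict generation to a JSON schema, which is one way to guarantee parseable answers. This sketch reuses the llm handle from the first example; the field names are illustrative, not the node's actual schema.

```python
# Sketch: schema-constrained generation with llama-cpp-python, one way to get
# structured output. Field names below are illustrative examples.
schema = {
    "type": "object",
    "properties": {
        "entity": {"type": "string", "description": "main subject of the image"},
        "category": {"type": "string", "description": "one-word classification"},
        "count": {"type": "integer", "description": "number of subjects"},
    },
    "required": ["entity", "category"],
}
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "A photo of three red foxes in snow."}],
    response_format={"type": "json_object", "schema": schema},
)
print(result["choices"][0]["message"]["content"])  # e.g. {"entity": "red foxes", ...}
```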

Image to Music

This feature uses VLMs, LLMs, and AudioLDM-2 to create music from images. The SaveAudioNode allows you to save the generated music in the output folder. The necessary files are automatically downloaded into the models/LLavacheckpoints/files_for_audioldm2 directory.
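
For reference, the text-to-audio stage can be reproduced outside ComfyUI with the diffusers implementation of AudioLDM-2. This is a sketch, not the node's exact code; the prompt stands in for the image description a VLM would produce.

```python
# Sketch: text-to-audio with AudioLDM-2 via diffusers. In the node's pipeline
# the prompt would come from a VLM/LLM description of the input image.
import scipy.io.wavfile
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "calm piano melody over soft rain, cinematic"  # stand-in for a VLM caption
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

scipy.io.wavfile.write("output/music.wav", rate=16000, data=audio)  # 16 kHz native rate
```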

LLM to Music

This feature utilizes ChatMusician, an open-source LLM with intrinsic musical abilities, to generate music from text prompts. You can try prompts from the ChatMusician Demo Page. The recommended GGUF files are ChatMusician.Q5_K_M.gguf and ChatMusician.Q5_K_S.gguf.
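
Since ChatMusician ships as a GGUF file, it can be driven with llama-cpp-python much like the vision models. A hedged sketch, with an illustrative path:

```python
# Sketch: prompting a ChatMusician GGUF with llama-cpp-python. ChatMusician
# answers in ABC notation, which can then be rendered to audio downstream.
from llama_cpp import Llama

musician = Llama(
    model_path="models/LLavacheckpoints/ChatMusician.Q5_K_M.gguf",  # illustrative path
    n_ctx=4096,
)
out = musician.create_chat_completion(
    messages=[{
        "role": "user",
        "content": "Compose a short folk melody in D major as ABC notation.",
    }],
    temperature=0.8,
)
print(out["choices"][0]["message"]["content"])  # ABC notation, e.g. "X:1 ..."
```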

InternLM-XComposer2-VL Node

This node integrates the InternLM-XComposer2-VL Model using AutoGPTQ. It automatically downloads the necessary files into the models/LLavacheckpoints/files_for_internlm directory. This model is known for its excellent visual perception capabilities.
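
The node wraps an AutoGPTQ-quantized build, so its loading code differs, but the upstream model card's remote-code interface gives a feel for how the model is queried. The sketch below follows that card.

```python
# Sketch: querying InternLM-XComposer2-VL via its transformers remote-code
# interface, following the upstream model card. The extension itself loads an
# AutoGPTQ-quantized build, so the exact loading step differs.
import torch
from transformers import AutoModel, AutoTokenizer

path = "internlm/internlm-xcomposer2-vl-7b"
torch.set_grad_enabled(False)
model = AutoModel.from_pretrained(path, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# <ImageHere> marks where the image embedding is injected into the prompt
query = "<ImageHere>Describe this image as a detailed art prompt."
with torch.cuda.amp.autocast():
    response, _ = model.chat(tokenizer, query=query, image="input.png",
                             history=[], do_sample=False)
print(response)
```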

Automatic Prompt Generation and Suggestion Nodes

  • Get Keyword node: Extracts keywords from LLaVa outputs.
  • LLaVa PromptGenerator node: Creates prompts based on descriptions or keywords.
  • Suggester node: Generates multiple prompts from the original prompt, with options for consistent or random results; the full chain is sketched after this list.
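
The chain these three nodes implement can be pictured as plain functions around the llama-cpp-python model from the first sketch. All function names here are illustrative, not the extension's actual node APIs.

```python
# Sketch of the caption -> keywords -> prompt -> suggestions chain. Function
# names are illustrative; they are not the extension's node APIs.
def get_keywords(caption: str) -> str:
    out = llm.create_chat_completion(messages=[{
        "role": "user",
        "content": f"Extract 5 comma-separated keywords from: {caption}",
    }], temperature=0.2)  # low temperature: consistent extraction
    return out["choices"][0]["message"]["content"]

def suggest_prompts(prompt: str, n: int = 3, temperature: float = 1.0) -> list[str]:
    # higher temperature -> more varied suggestions; lower -> more consistent
    return [
        llm.create_chat_completion(messages=[{
            "role": "user",
            "content": f"Rewrite this image prompt creatively: {prompt}",
        }], temperature=temperature)["choices"][0]["message"]["content"]
        for _ in range(n)
    ]

keywords = get_keywords("a lighthouse on a stormy coast at dusk")
for suggestion in suggest_prompts(f"photo of {keywords}"):
    print(suggestion)
```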

VLM_nodes Models

Available Models

  • LLaVa 1.6 Mistral 7B: Model Link
  • Nous Hermes 2 Vision: Model Link
  • LLaVa 1.5 7B: Model Link
  • LLaVa 1.5 13B: Model Link
  • BakLLaVa: Model Link

Each model has its own strengths and is suited to different tasks. For example, the LLaVa models are excellent for visual question answering and image captioning, while ChatMusician is tailored for generating music from text prompts.

Troubleshooting VLM_nodes

Common Issues and Solutions

  1. Model Loading Errors: Ensure that all model files and CLIP projectors are correctly placed in the models/LLavacheckpoints directory; a quick check is sketched after this list.
  2. Python Version: Make sure you are using Python 3.9, which the extension requires.
  3. File Not Found: Verify that the necessary files have been downloaded and placed in the correct directories.
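
A quick way to run the check from item 1 (the file names are examples; substitute whichever GGUF pair you downloaded):

```python
# Sanity check: confirm the model weights and CLIP projector both exist under
# models/LLavacheckpoints. File names below are examples only.
from pathlib import Path

checkpoints = Path("models/LLavacheckpoints")
for name in ("llava-v1.5-7b.Q4_K_M.gguf", "mmproj-model-f16.gguf"):
    path = checkpoints / name
    print(f"{path}: {'found' if path.exists() else 'MISSING'}")
```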

Frequently Asked Questions

  • Q: What should I do if music generation fails?
    A: Check that the necessary files for AudioLDM-2 were correctly downloaded into the models/LLavacheckpoints/files_for_audioldm2 directory.
  • Q: How can I improve the creativity of the generated prompts?
    A: Raise the temperature setting in the prompt generation nodes; higher temperatures produce more creative (and more variable) outputs, as the snippet below illustrates.
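
To illustrate the temperature answer, here is a toy comparison using the llama-cpp-python model from the first sketch; the prompt is arbitrary.

```python
# Illustrative only: the same request at two temperatures. Low values stay
# close to the obvious completion; high values diversify the wording.
for temp in (0.2, 1.2):
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Suggest an image prompt about autumn."}],
        temperature=temp,
    )
    print(temp, "->", out["choices"][0]["message"]["content"])
```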

Learn More about VLM_nodes

For additional resources, tutorials, and community support, you can visit the following links:

  • Awesome VLM Architectures
  • Prompting Guide for LLM Settings (https://www.promptingguide.ai/introduction/settings)

These resources provide in-depth information on Vision Language Models, their architectures, and how to use them effectively within the ComfyUI framework.
