Enhance Wan 2.1 video generation with LoRA models for improved style and customization.

LivePortrait | Animate Portraits | Img2Vid

Animate portraits with facial expressions and motion using a single image and reference video.

IPAdapter Plus (V2) | One-Image Style Transfer

Use IPAdapter Plus and ControlNet for precise style transfer with a single reference image.

DreamO | Unified Multi-Task Image Customization Framework

Perform identity, style, try-on, and multi-condition image generation from 1–3 references

ComfyUI > Nodes > VLM_nodes

ComfyUI Extension: VLM_nodes

Repo Name

ComfyUI_VLM_nodes

Author
gokayfem (Account age: 1342 days) Nodes
View all nodes(28) Latest Updated
2025-02-13 Github Stars
0.48K

Github Ask gokayfem Current Questions Past Questions

Table of Content

Description
How VLM_nodes Works
VLM_nodes Features
VLM_nodes Models
Troubleshooting VLM_nodes
Learn More about VLM_nodes
Related Nodes

How to Install VLM_nodes

Install this extension via the ComfyUI Manager by searching for VLM_nodes

1. Click the Manager button in the main menu
2. Select Custom Nodes Manager button
3. Enter VLM_nodes in the search bar

After installation, click the Restart button to restart ComfyUI. Then, manually refresh your browser to clear the cache and access the updated list of nodes.

Visit ComfyUI Online for ready-to-use ComfyUI environment

Free trial available
16GB VRAM to 80GB VRAM GPU machines
400+ preloaded models/nodes
Freedom to upload custom models/nodes
200+ ready-to-run workflows
100% private workspace with up to 200GB storage
Dedicated Support

Run ComfyUI Online

VLM_nodes Description

VLM_nodes offers custom nodes for Vision Language Models (VLM) and Large Language Models (LLM), enabling image captioning, automatic prompt generation, creative and consistent prompt suggestions, and keyword extraction.

VLM_nodes Introduction

ComfyUI_VLM_nodes is an extension designed to enhance the capabilities of AI artists by integrating Vision Language Models (VLMs) into the ComfyUI framework. This extension allows you to load and use various VLMs, enabling advanced functionalities such as structured output generation, image-to-music conversion, and automatic prompt generation. By leveraging models like LLaVa, ChatMusician, and InternLM-XComposer2-VL, ComfyUI_VLM_nodes provides a powerful toolset for creating and manipulating AI-generated content, making it easier for artists to achieve their creative goals.

How VLM_nodes Works

ComfyUI_VLM_nodes operates by integrating VLMs into the ComfyUI environment using the llama-cpp-python library. This integration allows the extension to load and utilize models in GGUF format, which are specifically designed for vision-language tasks. The extension works by downloading the necessary model files and clip projectors, placing them in the appropriate directories, and then using these models to process and generate content based on user inputs. The structured output node, for example, can extract entities, numbers, and classify prompts, while the image-to-music feature uses VLMs and LLMs to create music from images.

VLM_nodes Features

Structured Output

The Structured Output node simplifies the process of obtaining reliable answers from VLMs. It can extract entities, numbers, classify prompts, and generate specific prompts. You can customize the output by adding descriptions to fields and selecting the attributes you want to return.

structured

Image to Music

This feature uses VLMs, LLMs, and AudioLDM-2 to create music from images. The SaveAudioNode allows you to save the generated music in the output folder. The necessary files are automatically downloaded into the models/LLavacheckpoints/files_for_audioldm2 directory.

image to music

LLM to Music

Utilizes Chat Musician, an open-source LLM with intrinsic musical abilities, to generate music from text prompts. You can try prompts from the ChatMusician Demo Page. Recommended GGUF files are ChatMusician.Q5_K_M.gguf or ChatMusician.Q5_K_S.gguf.

LLM to music

InternLM-XComposer2-VL Node

This node integrates the InternLM-XComposer2-VL Model using AutoGPTQ. It automatically downloads the necessary files into the models/LLavacheckpoints/files_for_internlm directory. This model is known for its excellent visual perception capabilities.

InternLM-XComposer2

Automatic Prompt Generation and Suggestion Nodes

Get Keyword node: Extracts keywords from LLava outputs.
LLava PromptGenerator node: Creates prompts based on descriptions or keywords.
Suggester node: Generates multiple prompts based on the original prompt, with options for consistent or random results. Automatic Prompt Generation

VLM_nodes Models

Available Models

LlaVa 1.6 Mistral 7B: Model Link
Nous Hermes 2 Vision: Model Link
LlaVa 1.5 7B: Model Link
LlaVa 1.5 13B: Model Link
BakLLaVa: Model Link Each model has its unique capabilities and is suited for different tasks. For example, LlaVa models are excellent for visual question answering and image captioning, while ChatMusician is tailored for generating music from text prompts.

Troubleshooting VLM_nodes

Common Issues and Solutions

Model Loading Errors: Ensure that all model files and clip projectors are correctly placed in the models/LLavacheckpoints directory.
Python Version: Make sure you are using Python 3.9, as this is a requirement for the extension.
File Not Found: Verify that the necessary files are downloaded and placed in the correct directories.

Frequently Asked Questions

Q: What should I do if the music generation fails?
A: Check if the necessary files for AudioLDM-2 are correctly downloaded into the models/LLavacheckpoints/files_for_audioldm2 directory.
Q: How can I improve the creativity of the generated prompts?
A: Adjust the temperature setting in the prompt generation nodes. Higher temperatures result in more creative outputs.

Learn More about VLM_nodes

For additional resources, tutorials, and community support, you can visit the following links:

Awesome VLM Architectures
Prompting Guide for LLM Settings (https://www.promptingguide.ai/introduction/settings) These resources provide in-depth information on Vision Language Models, their architectures, and how to effectively use them within the ComfyUI framework.

VLM_nodes Related Nodes

AudioLDM-2 Node

ChatMusician

Creative Art PromptGenerator

Internlm Node

JsonToText

Get Keywords

Kosmos-2 Node

LLMLoader

LLM PromptGenerator

LLMSampler

LLava Loader Simple

LLava Optional Memory Free Advanced

LLava Optional Memory Free Simple

LLava PromptGenerator

LLava Sampler Advanced

LLava Sampler Simple

Llava Clip Loader

MC-LLaVA Node

MoonDream Node

Moondream-2 Node

PlayMusic Node

API PromptGenerator

Save Audio Node

SimpleText

Structured Output

Suggester

UformGen2 Qwen Node

ViewText

Table of Content

Description
How VLM_nodes Works
VLM_nodes Features
VLM_nodes Models
Troubleshooting VLM_nodes
Learn More about VLM_nodes
Related Nodes

MMAudio | Video-to-Audio

MMAudio: Advanced video-to-audio model for high-quality audio generation.

Flux Consistent Characters | Input Text

Create consistent characters and ensure they look uniform by inputting text.

ReActor | Fast Face Swap

Professional face swapping toolkit for ComfyUI that enables natural face replacement and enhancement.

Mochi Edit UnSampling | Video-to-Video

Mochi Edit: Modify Videos Using Text-Based Prompts and Unsampling.

Support

Resources

Legal

RunComfy

RunComfy is the premier ComfyUI platform, offering ComfyUI online environment and services, along with ComfyUI workflows featuring stunning visuals. RunComfy also provides AI Playground, enabling artists to harness the latest AI tools to create incredible art.