Generate text using the Pixtral model for image captioning from combined visual and textual inputs, enhancing projects with dynamically generated text.
The PixtralGenerateText node is designed to generate text using a Pixtral model, which is particularly adept at processing visual inputs alongside textual prompts. This node is ideal for tasks such as image captioning, where you provide a series of images and a corresponding prompt that includes placeholders for these images. The node leverages advanced text generation techniques to produce coherent and contextually relevant text outputs based on the visual and textual inputs provided. By integrating image data directly into the text generation process, it offers a powerful tool for creating descriptive narratives or captions that are informed by visual content. This capability is especially beneficial for AI artists and content creators looking to enhance their projects with dynamically generated text that aligns with visual elements.
This optional parameter accepts a list of images that the Pixtral model will use to generate text. Each image corresponds to an [IMG] token in the prompt, allowing the model to integrate visual information into the text generation process. The images should be provided in a format compatible with the model's processing capabilities.
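Since each image must have a matching [IMG] placeholder, the correspondence can be sketched with a hypothetical `validate_prompt` helper (illustrative only; the node performs an equivalent check internally, which is why a mismatch produces an error):

```python
def validate_prompt(prompt: str, images: list) -> str:
    """Check that the prompt contains exactly one [IMG] token per image.

    Hypothetical helper for illustration; not the node's actual code.
    """
    n_tokens = prompt.count("[IMG]")
    if n_tokens != len(images):
        raise ValueError(
            f"Prompt has {n_tokens} [IMG] tokens but "
            f"{len(images)} images were provided"
        )
    return prompt

# One placeholder per image: two images, two [IMG] tokens
prompt = validate_prompt("Compare these:\n[IMG]\n[IMG]", ["img_a", "img_b"])
```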
This required parameter specifies the vision model to be used for text generation. It must be a model that is compatible with the Pixtral framework, capable of processing both visual and textual data to produce meaningful text outputs.
The prompt is a required string parameter that serves as the initial text input for the model. It should include [IMG] tokens corresponding to the number of images provided, guiding the model on where to incorporate visual information. The default prompt is "Caption this image:\n[IMG]", and it supports multiline input for more complex prompts.
This integer parameter defines the maximum number of new tokens the model can generate. It ranges from 1 to 4096, with a default value of 256. Adjusting this value impacts the length of the generated text, allowing for concise or more detailed outputs.
A boolean parameter that determines whether sampling is used during text generation. When set to True (default), the model samples from the distribution of possible next tokens, introducing variability and creativity into the output. Setting it to False results in deterministic outputs.
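The difference between the two modes can be sketched with a toy distribution (the `next_token` helper below is purely illustrative, not the node's actual code):

```python
import random

def next_token(probs: dict, do_sample: bool, rng: random.Random) -> str:
    """Pick the next token from a token -> probability map.

    do_sample=False takes the argmax (deterministic);
    do_sample=True draws from the distribution. Illustrative sketch.
    """
    if not do_sample:
        return max(probs, key=probs.get)
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"cat": 0.6, "dog": 0.3, "fox": 0.1}
print(next_token(probs, do_sample=False, rng=random.Random(0)))  # → cat
```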
This float parameter controls the randomness of the sampling process, with a default value of 0.3. It ranges from 0 to 1, where lower values make the model more conservative and higher values increase creativity and diversity in the generated text.
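Temperature scaling can be sketched as dividing the logits before the softmax; lower values sharpen the distribution toward the top token, higher values flatten it (a standalone illustration, not the node's internals):

```python
import math

def apply_temperature(logits: list, temperature: float) -> list:
    """Divide logits by temperature, then softmax. Illustrative sketch."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sharp = apply_temperature([2.0, 1.0, 0.5], 0.3)  # conservative
flat = apply_temperature([2.0, 1.0, 0.5], 1.0)   # more diverse
# With temperature 0.3 the top token gets more probability mass than with 1.0
```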
A float parameter that implements nucleus sampling, where only the top-p probability mass is considered for generating the next token. It ranges from 0.0 to 1.0, with a default of 0.9, balancing between diversity and coherence in the output.
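Nucleus sampling can be sketched as keeping the smallest set of tokens whose cumulative probability reaches top_p, then renormalizing (a toy illustration, not the node's actual implementation):

```python
def top_p_filter(probs: dict, top_p: float = 0.9) -> dict:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize. Illustrative sketch."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, p in items:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# With top_p=0.8, only "a" and "b" survive (0.5 + 0.3 = 0.8)
print(top_p_filter({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, top_p=0.8))
```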
This integer parameter limits the sampling pool to the top-k tokens, with a default value of 40. It helps in controlling the diversity of the generated text by restricting the number of potential next tokens.
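Top-k filtering can be sketched as keeping only the k most likely tokens and renormalizing their probabilities (illustrative only, not the node's code):

```python
def top_k_filter(probs: dict, top_k: int = 40) -> dict:
    """Restrict the sampling pool to the top_k most likely tokens,
    renormalized to sum to 1. Illustrative sketch."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in items)
    return {tok: p / total for tok, p in items}

# With top_k=2, only the two most likely tokens remain
print(top_k_filter({"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}, top_k=2))
```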
A float parameter that penalizes the model for repeating the same token, with a default value of 1.1. This helps in reducing redundancy and ensuring more varied text outputs.
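A common way to apply such a penalty (the scheme popularized by the CTRL paper and used in Hugging Face Transformers; whether Pixtral's pipeline does exactly this is an assumption) is to divide positive logits and multiply negative logits of tokens that have already been generated:

```python
def apply_repetition_penalty(logits: list, generated_ids: list,
                             penalty: float = 1.1) -> list:
    """Discourage repeats: divide positive logits and multiply negative
    logits of already-generated token ids. Illustrative sketch; `logits`
    is a list indexed by token id."""
    out = list(logits)
    for tid in set(generated_ids):
        if out[tid] > 0:
            out[tid] /= penalty  # lower a likely token's score
        else:
            out[tid] *= penalty  # push an unlikely token further down
    return out

# Tokens 0 and 1 were already generated; token 2 is untouched
print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1]))
```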
A string parameter that specifies one or more tokens that, when generated, will stop the text generation process. The default value is "</s>", which is commonly used to signify the end of a sequence.
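Stopping on a string can be sketched as truncating the decoded text at the first occurrence of any stop string (an illustrative helper, not the node's code):

```python
def truncate_at_stop(text: str, stop_strings=("</s>",)) -> str:
    """Cut the generated text at the earliest stop string, if any."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(truncate_at_stop("A cat on a mat.</s> extra tokens"))  # → A cat on a mat.
```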
An integer parameter used to initialize the random number generator for reproducibility. It ranges from 0 to 0xffffffff, with a default value of 0. Setting a specific seed ensures consistent outputs across runs.
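The effect of seeding can be illustrated with Python's random module (a toy sketch; how the node seeds its generator internally is an implementation detail not shown here):

```python
import random

def sample_tokens(seed: int, vocab=("a", "b", "c"), n: int = 5) -> list:
    """With the same seed, the RNG produces the same sequence of draws,
    so sampled generation becomes reproducible. Illustrative sketch."""
    rng = random.Random(seed)
    return [rng.choice(vocab) for _ in range(n)]

print(sample_tokens(0) == sample_tokens(0))  # → True
```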
A boolean parameter that determines whether the initial prompt should be included in the final output. By default, it is set to False, meaning only the newly generated text is returned.
This boolean parameter, when set to True, unloads the model from memory after text generation, freeing up resources. It is set to False by default, allowing for subsequent text generation tasks without reloading the model.
The output is a string that contains the text generated by the Pixtral model. This text is crafted based on the provided images and prompt, incorporating visual context into the narrative. The output is designed to be coherent and contextually relevant, making it suitable for applications like image captioning or storytelling.
Ensure that the number of [IMG] tokens in your prompt matches the number of images provided to achieve accurate and contextually relevant text generation.
Experiment with the temperature, top_p, and top_k parameters to find the right balance between creativity and coherence for your specific use case.
If generation fails because the number of [IMG] tokens in the prompt does not match the number of images provided, adjust the prompt so that it contains one [IMG] token for each image you are using.
If unload_after_generate is set to True, the model must be reloaded before subsequent generation tasks.
If you run out of memory, reduce max_new_tokens or use a smaller model to fit within the available memory constraints, and consider unloading the model after use to free up resources.
© Copyright 2024 RunComfy. All Rights Reserved.