Facilitates model quantization for AI performance optimization through configurable precision settings.
The T5QuantizationConfig | T5 Quantization Config node is designed to facilitate the quantization of models, particularly for optimizing performance and efficiency in AI tasks. Quantization reduces the precision of the numbers used to represent a model's parameters, which can significantly decrease the model's size and increase its inference speed. This node lets you configure various quantization settings, such as loading models in 8-bit or 4-bit precision, setting the threshold for LLM.int8() operations, and enabling specific optimizations like FP32 CPU offloading. By leveraging these configurations, you can tailor the quantization process to the specific needs of your AI models, balancing performance against resource utilization.
The quantization_mode parameter determines the mode of quantization applied to the model. The available options are "none", "load_in_8bit", and "load_in_4bit". Selecting "none" disables quantization, while "load_in_8bit" and "load_in_4bit" load the model in 8-bit and 4-bit precision, respectively. The default value is "none".
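Since the parameter names in this node mirror those of transformers.BitsAndBytesConfig, the mode choice can be pictured as a simple switch over that configuration class. The sketch below is illustrative only: the helper name build_quant_config and the assumption that the node wraps a bitsandbytes-style config are not stated in the node's documentation.

```python
# Minimal sketch, assuming the node's settings map onto transformers.BitsAndBytesConfig.
# build_quant_config is an illustrative name, not part of the node's API.
from transformers import BitsAndBytesConfig

def build_quant_config(quantization_mode: str = "none"):
    if quantization_mode == "load_in_8bit":
        return BitsAndBytesConfig(load_in_8bit=True)
    if quantization_mode == "load_in_4bit":
        return BitsAndBytesConfig(load_in_4bit=True)
    return None  # "none": the model is loaded at full precision
```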
The llm_int8_threshold parameter sets the outlier threshold used by LLM.int8() 8-bit quantization. It is a floating-point value that determines how sensitive the quantization process is to outlier activations. The default value is 6.0, and it can be adjusted to fine-tune the balance between model accuracy and performance.
The llm_int8_skip_modules parameter lets you specify modules that should be skipped during 8-bit quantization. It accepts a comma-separated string of module names. By default, this parameter is an empty string, meaning no modules are skipped.
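As a small illustration, a comma-separated string like this is typically turned into a list of module names before it is handed to a quantization backend. The helper below is hypothetical and only sketches that conversion.

```python
# Hypothetical helper: turn the node's comma-separated string into a list of
# module names. An empty string means no modules are skipped.
def parse_skip_modules(value: str):
    names = [name.strip() for name in value.split(",") if name.strip()]
    return names or None

# parse_skip_modules("lm_head, shared") -> ["lm_head", "shared"]
# parse_skip_modules("") -> None
```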
The llm_int8_enable_fp32_cpu_offload boolean parameter enables or disables offloading FP32 operations to the CPU during 8-bit quantization. Enabling it can help manage memory usage and improve performance on certain hardware configurations. The default value is False.
This boolean parameter indicates whether the model has FP16 weights, which can be useful for certain optimizations during the quantization process. The default value is False.
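Taken together, the 8-bit settings above correspond to the LLM.int8() options of a bitsandbytes-style configuration. The sketch below mirrors the defaults described in this section under that assumption; the skip-module name is only an example and depends on the model being loaded.

```python
# Sketch of an 8-bit configuration built from the settings described above,
# assuming they map onto transformers.BitsAndBytesConfig fields.
from transformers import BitsAndBytesConfig

int8_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,                  # outlier threshold for LLM.int8()
    llm_int8_skip_modules=["lm_head"],       # example module name; model-dependent
    llm_int8_enable_fp32_cpu_offload=False,  # keep FP32 parts on the GPU
    llm_int8_has_fp16_weight=False,          # do not keep FP16 main weights
)
```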
The bnb_4bit_compute_dtype parameter specifies the data type used for computations when loading the model in 4-bit precision. The default value is "float32", but it can be set to other data types supported by PyTorch, such as "float16".
This parameter defines the type of 4-bit quantization to be used. The default value is "fp4", which stands for floating-point 4-bit quantization.
This boolean parameter enables or disables the use of double quantization when loading the model in 4-bit precision. Double quantization can further reduce the model size but may impact accuracy. The default value is False.
This parameter specifies the storage format for the 4-bit quantized model. The default value is "uint8", which stands for unsigned 8-bit integer storage.
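The 4-bit settings combine in the same way. The sketch below again assumes a transformers.BitsAndBytesConfig-style object and uses the defaults described above, with the compute dtype switched to float16 as an example.

```python
# Sketch of a 4-bit configuration mirroring the parameters described above.
import torch
from transformers import BitsAndBytesConfig

int4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype; the node's default is float32
    bnb_4bit_quant_type="fp4",             # floating-point 4-bit quantization
    bnb_4bit_use_double_quant=False,       # enable to also compress quantization constants
    bnb_4bit_quant_storage=torch.uint8,    # storage dtype for the packed 4-bit weights
)
```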
This output parameter provides the configured quantization settings as a QuantizationConfig object. This object encapsulates all the specified quantization parameters and can be used to apply the quantization settings to a model. It is essential for optimizing the model's performance and resource utilization based on the configured settings.
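To illustrate how such a configuration object is typically consumed, the hedged example below passes one when loading a T5 text encoder with Hugging Face transformers. The checkpoint name is only an example, and int8_config refers to the 8-bit sketch shown earlier; neither is prescribed by this node.

```python
# Illustrative use of a quantization configuration with a T5 text encoder
# (checkpoint name is an example; int8_config comes from the sketch above).
from transformers import T5EncoderModel

model = T5EncoderModel.from_pretrained(
    "google/t5-v1_1-xxl",
    quantization_config=int8_config,
    device_map="auto",
)
```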
Usage tips:
- Adjust the llm_int8_threshold and bnb_4bit_compute_dtype parameters based on your specific requirements.
- If you run into memory issues, enable llm_int8_enable_fp32_cpu_offload to offload some operations to the CPU.
- Use the llm_int8_skip_modules parameter to exclude specific modules from quantization if they are critical for maintaining model accuracy.

Common errors and solutions:
- quantization_mode is not recognized: ensure that quantization_mode is set to one of the following: "none", "load_in_8bit", or "load_in_4bit".
- bnb_4bit_compute_dtype is not supported by PyTorch: ensure that bnb_4bit_compute_dtype is set to a valid PyTorch data type, such as "float32" or "float16".
- Modules specified in llm_int8_skip_modules do not exist in the model: verify the module names provided in llm_int8_skip_modules and ensure they match the actual module names in the model.
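For the dtype error above, a small early check along these lines can fail with a clear message before any model is loaded; resolve_compute_dtype is a hypothetical helper, not part of the node.

```python
# Hypothetical early check: resolve the configured dtype name against torch
# and reject anything that is not a real PyTorch dtype.
import torch

def resolve_compute_dtype(name: str) -> torch.dtype:
    dtype = getattr(torch, name, None)
    if not isinstance(dtype, torch.dtype):
        raise ValueError(f"bnb_4bit_compute_dtype {name!r} is not a valid PyTorch dtype")
    return dtype

# resolve_compute_dtype("float16") -> torch.float16
```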