BigDL-LLM transformers-style API
Hugging Face transformers AutoModel
You can apply BigDL-LLM optimizations to any Hugging Face Transformers model by using the standard AutoModel APIs.
AutoModelForCausalLM
- class bigdl.llm.transformers.AutoModelForCausalLM
Bases: bigdl.llm.transformers.model._BaseAutoModelClass
- classmethod from_pretrained(*args, **kwargs)
Load a model from a directory or the HF Hub. Setting the load_in_4bit or load_in_low_bit parameter loads the weights of the model’s linear layers in a low-bit format such as int4, int5, or int8.
The following arguments extend Hugging Face’s from_pretrained method:
- Parameters
load_in_4bit – boolean value. True loads the linear layers’ weights as symmetric int4 if the model is a regular fp16/bf16/fp32 model, and as asymmetric int4 if the model is a GPTQ model. Defaults to False.
load_in_low_bit – str value. Options are 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8', 'nf3', 'nf4', 'fp4', 'fp8', 'fp8_e4m3', 'fp8_e5m2', 'gguf_iq2_xxs', 'gguf_iq2_xs', 'fp16', or 'bf16'. 'sym_int4' means symmetric int4, 'asym_int4' means asymmetric int4, 'nf4' means 4-bit NormalFloat, and so on. The relevant low-bit optimizations are applied to the model.
optimize_model – boolean value. Whether to further optimize the low-bit LLM. Defaults to True.
modules_to_not_convert – list of str. Modules (nn.Module) to skip when applying model optimizations. Defaults to None.
speculative – boolean value. Whether to use speculative decoding. Defaults to False.
cpu_embedding – Whether to replace the Embedding layer. May need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
lightweight_bmm – Whether to replace the torch.bmm ops. May need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
imatrix – str value. Filename of an importance matrix pretrained on specific datasets, for use with the improved quantization methods recently added to llama.cpp.
model_hub – str value. Specifies the model hub; options are 'huggingface' and 'modelscope'. Defaults to 'huggingface'.
embedding_qtype – str value. The only option currently is 'q2_k'. Defaults to None. The relevant low-bit optimizations are applied to the nn.Embedding layer.
- Returns
a model instance
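The quantization parameters of from_pretrained can be combined as in the following minimal sketch. The helper names build_quant_kwargs and load_causal_lm and the model path in the usage note are illustrative assumptions, not part of BigDL-LLM; only from_pretrained, load_in_4bit, load_in_low_bit, and optimize_model come from this reference. The import is deferred so the helper can be exercised without BigDL-LLM installed.

```python
# Low-bit formats listed for `load_in_low_bit` in this reference.
LOW_BIT_OPTIONS = {
    "sym_int4", "asym_int4", "sym_int5", "asym_int5", "sym_int8",
    "nf3", "nf4", "fp4", "fp8", "fp8_e4m3", "fp8_e5m2",
    "gguf_iq2_xxs", "gguf_iq2_xs", "fp16", "bf16",
}


def build_quant_kwargs(low_bit=None, optimize_model=True):
    """Hypothetical helper: build keyword arguments for from_pretrained.

    With no `low_bit`, fall back to `load_in_4bit=True`.
    """
    if low_bit is None:
        return {"load_in_4bit": True, "optimize_model": optimize_model}
    if low_bit not in LOW_BIT_OPTIONS:
        raise ValueError(f"unsupported load_in_low_bit value: {low_bit!r}")
    return {"load_in_low_bit": low_bit, "optimize_model": optimize_model}


def load_causal_lm(model_path, low_bit=None):
    """Load a causal LM with BigDL-LLM low-bit optimizations applied."""
    # Deferred import: requires the bigdl-llm package to be installed.
    from bigdl.llm.transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_path, **build_quant_kwargs(low_bit)
    )
```

For example, load_causal_lm("path/to/model", low_bit="nf4") would request 4-bit NormalFloat weights, where "path/to/model" stands for a local checkpoint folder or a Hub repo id.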
- classmethod load_convert(q_k, optimize_model, *args, **kwargs)
- classmethod load_low_bit(pretrained_model_name_or_path, *model_args, **kwargs)
Load a low-bit optimized model (including INT4, INT5, and INT8) from a saved checkpoint.
- Parameters
pretrained_model_name_or_path – str value. Path to the optimized model checkpoint to load.
optimize_model – boolean value. Whether to further optimize the low-bit LLM. Defaults to True.
- Returns
a model instance
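A minimal sketch of reloading a previously saved low-bit checkpoint with load_low_bit follows. It assumes the checkpoint was saved as a directory, which this reference does not state; the helpers checkpoint_dir and load_saved_low_bit are hypothetical names used only for illustration.

```python
from pathlib import Path


def checkpoint_dir(path):
    """Hypothetical helper: return `path` as a Path if it is an existing directory."""
    ckpt = Path(path)
    # Assumption: the low-bit checkpoint is stored as a directory on disk.
    if not ckpt.is_dir():
        raise FileNotFoundError(f"low-bit checkpoint directory not found: {ckpt}")
    return ckpt


def load_saved_low_bit(path, optimize_model=True):
    """Reload a low-bit optimized model from a saved checkpoint."""
    # Deferred import: requires the bigdl-llm package to be installed.
    from bigdl.llm.transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.load_low_bit(
        str(checkpoint_dir(path)), optimize_model=optimize_model
    )
```

Checking the path first gives a clearer error than a failure deep inside the loader when the checkpoint location is mistyped.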
AutoModel
- class bigdl.llm.transformers.AutoModel
Bases: bigdl.llm.transformers.model._BaseAutoModelClass
- classmethod from_pretrained(*args, **kwargs)
Load a model from a directory or the HF Hub. Setting the load_in_4bit or load_in_low_bit parameter loads the weights of the model’s linear layers in a low-bit format such as int4, int5, or int8.
The following arguments extend Hugging Face’s from_pretrained method:
- Parameters
load_in_4bit – boolean value. True loads the linear layers’ weights as symmetric int4 if the model is a regular fp16/bf16/fp32 model, and as asymmetric int4 if the model is a GPTQ model. Defaults to False.
load_in_low_bit – str value. Options are 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8', 'nf3', 'nf4', 'fp4', 'fp8', 'fp8_e4m3', 'fp8_e5m2', 'gguf_iq2_xxs', 'gguf_iq2_xs', 'fp16', or 'bf16'. 'sym_int4' means symmetric int4, 'asym_int4' means asymmetric int4, 'nf4' means 4-bit NormalFloat, and so on. The relevant low-bit optimizations are applied to the model.
optimize_model – boolean value. Whether to further optimize the low-bit LLM. Defaults to True.
modules_to_not_convert – list of str. Modules (nn.Module) to skip when applying model optimizations. Defaults to None.
speculative – boolean value. Whether to use speculative decoding. Defaults to False.
cpu_embedding – Whether to replace the Embedding layer. May need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
lightweight_bmm – Whether to replace the torch.bmm ops. May need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
imatrix – str value. Filename of an importance matrix pretrained on specific datasets, for use with the improved quantization methods recently added to llama.cpp.
model_hub – str value. Specifies the model hub; options are 'huggingface' and 'modelscope'. Defaults to 'huggingface'.
embedding_qtype – str value. The only option currently is 'q2_k'. Defaults to None. The relevant low-bit optimizations are applied to the nn.Embedding layer.
- Returns
a model instance
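The cpu_embedding and lightweight_bmm flags are described as possibly needed on Windows GPU setups. The sketch below turns that note into a small, hypothetical helper. It simplifies "may need" to "always enable", and it assumes the Intel GPU device is named "xpu", which this reference does not state; treat both as assumptions to verify for your setup.

```python
import platform


def windows_gpu_kwargs(device="cpu", system=None):
    """Hypothetical helper: enable both flags only for GPU runs on Windows.

    `system` defaults to the current OS name and can be overridden for testing.
    """
    system = platform.system() if system is None else system
    # Assumption: "xpu" identifies the GPU device; adjust to your environment.
    on_windows_gpu = device == "xpu" and system == "Windows"
    return {"cpu_embedding": on_windows_gpu, "lightweight_bmm": on_windows_gpu}


def load_auto_model(model_path, device="cpu"):
    """Load a model through AutoModel with 4-bit weights and platform flags."""
    # Deferred import: requires the bigdl-llm package to be installed.
    from bigdl.llm.transformers import AutoModel

    return AutoModel.from_pretrained(
        model_path, load_in_4bit=True, **windows_gpu_kwargs(device)
    )
```

Both flags default to False in from_pretrained, so passing False explicitly on other platforms leaves the default behavior unchanged.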
- classmethod load_convert(q_k, optimize_model, *args, **kwargs)
- classmethod load_low_bit(pretrained_model_name_or_path, *model_args, **kwargs)
Load a low-bit optimized model (including INT4, INT5, and INT8) from a saved checkpoint.
- Parameters
pretrained_model_name_or_path – str value. Path to the optimized model checkpoint to load.
optimize_model – boolean value. Whether to further optimize the low-bit LLM. Defaults to True.
- Returns
a model instance
AutoModelForSpeechSeq2Seq
- class bigdl.llm.transformers.AutoModelForSpeechSeq2Seq
Bases: bigdl.llm.transformers.model._BaseAutoModelClass
- classmethod from_pretrained(*args, **kwargs)
Load a model from a directory or the HF Hub. Setting the load_in_4bit or load_in_low_bit parameter loads the weights of the model’s linear layers in a low-bit format such as int4, int5, or int8.
The following arguments extend Hugging Face’s from_pretrained method:
- Parameters
load_in_4bit – boolean value. True loads the linear layers’ weights as symmetric int4 if the model is a regular fp16/bf16/fp32 model, and as asymmetric int4 if the model is a GPTQ model. Defaults to False.
load_in_low_bit – str value. Options are 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8', 'nf3', 'nf4', 'fp4', 'fp8', 'fp8_e4m3', 'fp8_e5m2', 'gguf_iq2_xxs', 'gguf_iq2_xs', 'fp16', or 'bf16'. 'sym_int4' means symmetric int4, 'asym_int4' means asymmetric int4, 'nf4' means 4-bit NormalFloat, and so on. The relevant low-bit optimizations are applied to the model.
optimize_model – boolean value. Whether to further optimize the low-bit LLM. Defaults to True.
modules_to_not_convert – list of str. Modules (nn.Module) to skip when applying model optimizations. Defaults to None.
speculative – boolean value. Whether to use speculative decoding. Defaults to False.
cpu_embedding – Whether to replace the Embedding layer. May need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
lightweight_bmm – Whether to replace the torch.bmm ops. May need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
imatrix – str value. Filename of an importance matrix pretrained on specific datasets, for use with the improved quantization methods recently added to llama.cpp.
model_hub – str value. Specifies the model hub; options are 'huggingface' and 'modelscope'. Defaults to 'huggingface'.
embedding_qtype – str value. The only option currently is 'q2_k'. Defaults to None. The relevant low-bit optimizations are applied to the nn.Embedding layer.
- Returns
a model instance
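The modules_to_not_convert parameter can be used to keep selected layers out of low-bit conversion. The sketch below shows one way to pass it when loading a speech model. The helper names and the module name "proj_out" in the usage note are hypothetical; actual module names depend on the model architecture.

```python
def skip_modules_kwargs(skip=None, low_bit="sym_int4"):
    """Hypothetical helper: build kwargs that exclude some modules from conversion."""
    # Drop duplicate names while keeping the caller's order.
    modules = list(dict.fromkeys(skip or []))
    kwargs = {"load_in_low_bit": low_bit}
    if modules:
        kwargs["modules_to_not_convert"] = modules
    return kwargs


def load_speech_model(model_path, skip=None):
    """Load a speech seq2seq model, skipping the named modules."""
    # Deferred import: requires the bigdl-llm package to be installed.
    from bigdl.llm.transformers import AutoModelForSpeechSeq2Seq

    return AutoModelForSpeechSeq2Seq.from_pretrained(
        model_path, **skip_modules_kwargs(skip)
    )
```

A call such as load_speech_model("path/to/model", skip=["proj_out"]) would leave the module named "proj_out" unconverted, assuming the model has a module with that name.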
- classmethod load_convert(q_k, optimize_model, *args, **kwargs)
- classmethod load_low_bit(pretrained_model_name_or_path, *model_args, **kwargs)
Load a low-bit optimized model (including INT4, INT5, and INT8) from a saved checkpoint.
- Parameters
pretrained_model_name_or_path – str value. Path to the optimized model checkpoint to load.
optimize_model – boolean value. Whether to further optimize the low-bit LLM. Defaults to True.
- Returns
a model instance
AutoModelForSeq2SeqLM
- class bigdl.llm.transformers.AutoModelForSeq2SeqLM
Bases: bigdl.llm.transformers.model._BaseAutoModelClass
- classmethod from_pretrained(*args, **kwargs)
Load a model from a directory or the HF Hub. Setting the load_in_4bit or load_in_low_bit parameter loads the weights of the model’s linear layers in a low-bit format such as int4, int5, or int8.
The following arguments extend Hugging Face’s from_pretrained method:
- Parameters
load_in_4bit – boolean value. True loads the linear layers’ weights as symmetric int4 if the model is a regular fp16/bf16/fp32 model, and as asymmetric int4 if the model is a GPTQ model. Defaults to False.
load_in_low_bit – str value. Options are 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8', 'nf3', 'nf4', 'fp4', 'fp8', 'fp8_e4m3', 'fp8_e5m2', 'gguf_iq2_xxs', 'gguf_iq2_xs', 'fp16', or 'bf16'. 'sym_int4' means symmetric int4, 'asym_int4' means asymmetric int4, 'nf4' means 4-bit NormalFloat, and so on. The relevant low-bit optimizations are applied to the model.
optimize_model – boolean value. Whether to further optimize the low-bit LLM. Defaults to True.
modules_to_not_convert – list of str. Modules (nn.Module) to skip when applying model optimizations. Defaults to None.
speculative – boolean value. Whether to use speculative decoding. Defaults to False.
cpu_embedding – Whether to replace the Embedding layer. May need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
lightweight_bmm – Whether to replace the torch.bmm ops. May need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
imatrix – str value. Filename of an importance matrix pretrained on specific datasets, for use with the improved quantization methods recently added to llama.cpp.
model_hub – str value. Specifies the model hub; options are 'huggingface' and 'modelscope'. Defaults to 'huggingface'.
embedding_qtype – str value. The only option currently is 'q2_k'. Defaults to None. The relevant low-bit optimizations are applied to the nn.Embedding layer.
- Returns
a model instance
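The model_hub and embedding_qtype parameters each accept a small fixed set of values, so they can be validated before a load is attempted. The following is a sketch under that reading; hub_kwargs and load_seq2seq are hypothetical helper names, and combining these options with load_in_4bit is an illustrative choice rather than a documented recommendation.

```python
# Accepted values as listed in this reference.
MODEL_HUBS = ("huggingface", "modelscope")
EMBEDDING_QTYPES = ("q2_k",)


def hub_kwargs(model_hub="huggingface", embedding_qtype=None):
    """Hypothetical helper: validate hub and embedding options, return kwargs."""
    if model_hub not in MODEL_HUBS:
        raise ValueError(f"unsupported model_hub: {model_hub!r}")
    if embedding_qtype is not None and embedding_qtype not in EMBEDDING_QTYPES:
        raise ValueError(f"unsupported embedding_qtype: {embedding_qtype!r}")
    kwargs = {"model_hub": model_hub}
    if embedding_qtype is not None:
        kwargs["embedding_qtype"] = embedding_qtype
    return kwargs


def load_seq2seq(model_path, model_hub="huggingface", embedding_qtype=None):
    """Load a seq2seq LM from the chosen hub with 4-bit linear weights."""
    # Deferred import: requires the bigdl-llm package to be installed.
    from bigdl.llm.transformers import AutoModelForSeq2SeqLM

    return AutoModelForSeq2SeqLM.from_pretrained(
        model_path, load_in_4bit=True, **hub_kwargs(model_hub, embedding_qtype)
    )
```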
- classmethod load_convert(q_k, optimize_model, *args, **kwargs)
- classmethod load_low_bit(pretrained_model_name_or_path, *model_args, **kwargs)
Load a low-bit optimized model (including INT4, INT5, and INT8) from a saved checkpoint.
- Parameters
pretrained_model_name_or_path – str value. Path to the optimized model checkpoint to load.
optimize_model – boolean value. Whether to further optimize the low-bit LLM. Defaults to True.
- Returns
a model instance
Native Model
For the llama, chatglm, bloom, gptneox, and starcoder model families, you can also convert and run the LLM using the native (C++) implementation for maximum performance.
- class bigdl.llm.transformers.LlamaForCausalLM
Bases: bigdl.llm.transformers.modelling_bigdl._BaseGGMLClass
- classmethod from_pretrained(pretrained_model_name_or_path: str, native: bool = True, dtype: str = 'int4', *args, **kwargs)
- Parameters
pretrained_model_name_or_path – Path to the model checkpoint. When running with native int4, the path should point to a BigDL-LLM optimized ggml binary checkpoint converted by bigdl.llm.llm_convert. When running with transformers int4, the path should be a Hugging Face repo id to download or a Hugging Face checkpoint folder.
native – Whether to load the model as Native (ggml) int4 rather than as a BigDL-LLM optimized Transformer.
dtype – The quantized precision to convert to. Currently only int4 and int8 are supported, and int8 works only for llama, gptneox, and starcoder.
kwargs – Keyword arguments passed to the model instance.
- Returns
a model instance
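A minimal sketch of calling the native from_pretrained with its native and dtype arguments follows. The helpers native_load_kwargs and load_llama are hypothetical, and the checkpoint path is a placeholder: for native=True it is expected to be a ggml binary already converted by bigdl.llm.llm_convert, as described in the parameter list.

```python
# Precisions listed as supported for the native classes.
NATIVE_DTYPES = ("int4", "int8")


def native_load_kwargs(native=True, dtype="int4"):
    """Hypothetical helper: validate `dtype` and build from_pretrained kwargs."""
    if dtype not in NATIVE_DTYPES:
        raise ValueError(f"dtype must be one of {NATIVE_DTYPES}, got {dtype!r}")
    return {"native": native, "dtype": dtype}


def load_llama(checkpoint_path, native=True, dtype="int4"):
    """Load a llama-family model through the native (ggml) or transformers path."""
    # Deferred import: requires the bigdl-llm package to be installed.
    from bigdl.llm.transformers import LlamaForCausalLM

    return LlamaForCausalLM.from_pretrained(
        checkpoint_path, **native_load_kwargs(native, dtype)
    )
```

With native=False, the same call would instead expect a Hugging Face repo id or checkpoint folder as the path.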
- class bigdl.llm.transformers.ChatGLMForCausalLM
Bases: bigdl.llm.transformers.modelling_bigdl._BaseGGMLClass
- classmethod from_pretrained(pretrained_model_name_or_path: str, native: bool = True, dtype: str = 'int4', *args, **kwargs)
- Parameters
pretrained_model_name_or_path – Path to the model checkpoint. When running with native int4, the path should point to a BigDL-LLM optimized ggml binary checkpoint converted by bigdl.llm.llm_convert. When running with transformers int4, the path should be a Hugging Face repo id to download or a Hugging Face checkpoint folder.
native – Whether to load the model as Native (ggml) int4 rather than as a BigDL-LLM optimized Transformer.
dtype – The quantized precision to convert to. Currently only int4 and int8 are supported, and int8 works only for llama, gptneox, and starcoder.
kwargs – Keyword arguments passed to the model instance.
- Returns
a model instance
- class bigdl.llm.transformers.GptneoxForCausalLM
Bases: bigdl.llm.transformers.modelling_bigdl._BaseGGMLClass
- classmethod from_pretrained(pretrained_model_name_or_path: str, native: bool = True, dtype: str = 'int4', *args, **kwargs)
- Parameters
pretrained_model_name_or_path – Path to the model checkpoint. When running with native int4, the path should point to a BigDL-LLM optimized ggml binary checkpoint converted by bigdl.llm.llm_convert. When running with transformers int4, the path should be a Hugging Face repo id to download or a Hugging Face checkpoint folder.
native – Whether to load the model as Native (ggml) int4 rather than as a BigDL-LLM optimized Transformer.
dtype – The quantized precision to convert to. Currently only int4 and int8 are supported, and int8 works only for llama, gptneox, and starcoder.
kwargs – Keyword arguments passed to the model instance.
- Returns
a model instance
- class bigdl.llm.transformers.BloomForCausalLM
Bases: bigdl.llm.transformers.modelling_bigdl._BaseGGMLClass
- classmethod from_pretrained(pretrained_model_name_or_path: str, native: bool = True, dtype: str = 'int4', *args, **kwargs)
- Parameters
pretrained_model_name_or_path – Path to the model checkpoint. When running with native int4, the path should point to a BigDL-LLM optimized ggml binary checkpoint converted by bigdl.llm.llm_convert. When running with transformers int4, the path should be a Hugging Face repo id to download or a Hugging Face checkpoint folder.
native – Whether to load the model as Native (ggml) int4 rather than as a BigDL-LLM optimized Transformer.
dtype – The quantized precision to convert to. Currently only int4 and int8 are supported, and int8 works only for llama, gptneox, and starcoder.
kwargs – Keyword arguments passed to the model instance.
- Returns
a model instance
- class bigdl.llm.transformers.StarcoderForCausalLM
Bases: bigdl.llm.transformers.modelling_bigdl._BaseGGMLClass
- classmethod from_pretrained(pretrained_model_name_or_path: str, native: bool = True, dtype: str = 'int4', *args, **kwargs)
- Parameters
pretrained_model_name_or_path – Path to the model checkpoint. When running with native int4, the path should point to a BigDL-LLM optimized ggml binary checkpoint converted by bigdl.llm.llm_convert. When running with transformers int4, the path should be a Hugging Face repo id to download or a Hugging Face checkpoint folder.
native – Whether to load the model as Native (ggml) int4 rather than as a BigDL-LLM optimized Transformer.
dtype – The quantized precision to convert to. Currently only int4 and int8 are supported, and int8 works only for llama, gptneox, and starcoder.
kwargs – Keyword arguments passed to the model instance.
- Returns
a model instance
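The five native classes share one signature but differ in which dtype values they accept. The sketch below records the family-to-class mapping and the int8 restriction from the parameter descriptions in a small lookup. The helper names supports_dtype and native_class are hypothetical and not part of BigDL-LLM.

```python
import importlib

# Model family -> native class name, as listed in this section.
NATIVE_CLASSES = {
    "llama": "LlamaForCausalLM",
    "chatglm": "ChatGLMForCausalLM",
    "gptneox": "GptneoxForCausalLM",
    "bloom": "BloomForCausalLM",
    "starcoder": "StarcoderForCausalLM",
}

# Families for which int8 is documented to work.
INT8_FAMILIES = {"llama", "gptneox", "starcoder"}


def supports_dtype(family, dtype):
    """Return True if `dtype` is documented as supported for `family`."""
    if family not in NATIVE_CLASSES:
        raise ValueError(f"unknown model family: {family!r}")
    if dtype == "int4":
        return True
    if dtype == "int8":
        return family in INT8_FAMILIES
    return False


def native_class(family):
    """Look up the native class for `family` (requires bigdl-llm installed)."""
    module = importlib.import_module("bigdl.llm.transformers")
    return getattr(module, NATIVE_CLASSES[family])
```

A caller could check supports_dtype("bloom", "int8") before conversion and fall back to "int4" when it returns False.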