BigDL-LLM transformers-style API#
Hugging Face transformers AutoModel#
You can apply BigDL-LLM optimizations to any Hugging Face Transformers model by using the standard AutoModel APIs.
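For example, a minimal sketch of loading a causal LM with INT4 optimizations (the model id below is an illustrative placeholder; any Hugging Face causal LM should work):

```python
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder repo id or local folder

# Drop-in replacement for transformers.AutoModelForCausalLM: passing
# load_in_4bit=True quantizes the linear layers to symmetric int4 on load.
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("What is AI?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```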
AutoModelForCausalLM#
- class bigdl.llm.transformers.AutoModelForCausalLM[source]#
Bases:
bigdl.llm.transformers.model._BaseAutoModelClass
- classmethod from_pretrained(*args, **kwargs)#
Load a model from a directory or the HF Hub. With the load_in_4bit or load_in_low_bit parameter, the weights of the model's linear layers can be loaded in a low-bit format, such as int4, int5 or int8.
Several new arguments are added to extend Hugging Face's from_pretrained method, as follows:
- Parameters
  - load_in_4bit – boolean value. True means loading the weights of the model's linear layers to symmetric int4 if the model is a regular fp16/bf16/fp32 model, and to asymmetric int4 if the model is a GPTQ model. Defaults to False.
  - load_in_low_bit – str value, options are 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8', 'nf3', 'nf4', 'fp4', 'fp8', 'fp8_e4m3', 'fp8_e5m2', 'gguf_iq2_xxs', 'gguf_iq2_xs', 'fp16' or 'bf16'. 'sym_int4' means symmetric int4, 'asym_int4' means asymmetric int4, 'nf4' means 4-bit NormalFloat, etc. The relevant low-bit optimizations will be applied to the model.
  - optimize_model – boolean value, whether to further optimize the low-bit LLM model. Defaults to True.
  - modules_to_not_convert – list of str values, modules (nn.Module) that are skipped when conducting model optimizations. Defaults to None.
  - speculative – boolean value, whether to use speculative decoding. Defaults to False.
  - cpu_embedding – whether to replace the Embedding layer; may need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
  - lightweight_bmm – whether to replace the torch.bmm ops; may need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
  - imatrix – str value, filename of an importance matrix pretrained on specific datasets, for use with the improved quantization methods recently added to llama.cpp.
  - model_hub – str value, options are 'huggingface' and 'modelscope', specifying the model hub. Defaults to 'huggingface'.
  - embedding_qtype – str value; the only option for now is 'q2_k'. Defaults to None. The relevant low-bit optimizations will be applied to the nn.Embedding layer.
- Returns
a model instance
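A sketch of how the extended arguments combine; the module name in modules_to_not_convert is an assumption, so inspect your model to find the layers you actually want to keep in full precision:

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# Request 4-bit NormalFloat quantization and skip converting the output head.
# "lm_head" is an assumed module name; it varies by architecture.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",   # placeholder model id
    load_in_low_bit="nf4",
    optimize_model=True,
    modules_to_not_convert=["lm_head"],
)
```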
- classmethod load_convert(q_k, optimize_model, *args, **kwargs)#
- classmethod load_low_bit(pretrained_model_name_or_path, *model_args, **kwargs)#
Load a low-bit optimized model (including INT4, INT5 and INT8) from a saved checkpoint.
- Parameters
  - pretrained_model_name_or_path – str value, path of the saved optimized model checkpoint.
  - optimize_model – boolean value, whether to further optimize the low-bit LLM model. Defaults to True.
- Returns
a model instance
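A typical round trip, assuming the checkpoint was previously produced with the model instance's save_low_bit method (the save-side counterpart of load_low_bit in BigDL-LLM, not documented on this page); the model id and paths are placeholders:

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# First run: quantize once, then persist the low-bit weights.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # placeholder model id
    load_in_4bit=True,
)
model.save_low_bit("./llama-2-7b-int4")  # assumed save-side counterpart

# Later runs: load the already-quantized checkpoint directly,
# skipping the original full-precision download and conversion.
model = AutoModelForCausalLM.load_low_bit("./llama-2-7b-int4")
```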
AutoModel#
- class bigdl.llm.transformers.AutoModel[source]#
Bases:
bigdl.llm.transformers.model._BaseAutoModelClass
- classmethod from_pretrained(*args, **kwargs)#
Load a model from a directory or the HF Hub. With the load_in_4bit or load_in_low_bit parameter, the weights of the model's linear layers can be loaded in a low-bit format, such as int4, int5 or int8.
Several new arguments are added to extend Hugging Face's from_pretrained method, as follows:
- Parameters
  - load_in_4bit – boolean value. True means loading the weights of the model's linear layers to symmetric int4 if the model is a regular fp16/bf16/fp32 model, and to asymmetric int4 if the model is a GPTQ model. Defaults to False.
  - load_in_low_bit – str value, options are 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8', 'nf3', 'nf4', 'fp4', 'fp8', 'fp8_e4m3', 'fp8_e5m2', 'gguf_iq2_xxs', 'gguf_iq2_xs', 'fp16' or 'bf16'. 'sym_int4' means symmetric int4, 'asym_int4' means asymmetric int4, 'nf4' means 4-bit NormalFloat, etc. The relevant low-bit optimizations will be applied to the model.
  - optimize_model – boolean value, whether to further optimize the low-bit LLM model. Defaults to True.
  - modules_to_not_convert – list of str values, modules (nn.Module) that are skipped when conducting model optimizations. Defaults to None.
  - speculative – boolean value, whether to use speculative decoding. Defaults to False.
  - cpu_embedding – whether to replace the Embedding layer; may need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
  - lightweight_bmm – whether to replace the torch.bmm ops; may need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
  - imatrix – str value, filename of an importance matrix pretrained on specific datasets, for use with the improved quantization methods recently added to llama.cpp.
  - model_hub – str value, options are 'huggingface' and 'modelscope', specifying the model hub. Defaults to 'huggingface'.
  - embedding_qtype – str value; the only option for now is 'q2_k'. Defaults to None. The relevant low-bit optimizations will be applied to the nn.Embedding layer.
- Returns
a model instance
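A minimal sketch for a repo that ships custom modeling code; the repo id is a placeholder, and trust_remote_code is the standard Hugging Face from_pretrained argument:

```python
from bigdl.llm.transformers import AutoModel

# Models with custom code (e.g. ChatGLM-style repos) go through AutoModel
# with trust_remote_code=True.
model = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b",   # placeholder repo id
    load_in_4bit=True,
    trust_remote_code=True,
    cpu_embedding=True,    # may be needed on Windows GPU, see above
)
```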
- classmethod load_convert(q_k, optimize_model, *args, **kwargs)#
- classmethod load_low_bit(pretrained_model_name_or_path, *model_args, **kwargs)#
Load a low-bit optimized model (including INT4, INT5 and INT8) from a saved checkpoint.
- Parameters
  - pretrained_model_name_or_path – str value, path of the saved optimized model checkpoint.
  - optimize_model – boolean value, whether to further optimize the low-bit LLM model. Defaults to True.
- Returns
a model instance
AutoModelForSpeechSeq2Seq#
- class bigdl.llm.transformers.AutoModelForSpeechSeq2Seq[source]#
Bases:
bigdl.llm.transformers.model._BaseAutoModelClass
- classmethod from_pretrained(*args, **kwargs)#
Load a model from a directory or the HF Hub. With the load_in_4bit or load_in_low_bit parameter, the weights of the model's linear layers can be loaded in a low-bit format, such as int4, int5 or int8.
Several new arguments are added to extend Hugging Face's from_pretrained method, as follows:
- Parameters
  - load_in_4bit – boolean value. True means loading the weights of the model's linear layers to symmetric int4 if the model is a regular fp16/bf16/fp32 model, and to asymmetric int4 if the model is a GPTQ model. Defaults to False.
  - load_in_low_bit – str value, options are 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8', 'nf3', 'nf4', 'fp4', 'fp8', 'fp8_e4m3', 'fp8_e5m2', 'gguf_iq2_xxs', 'gguf_iq2_xs', 'fp16' or 'bf16'. 'sym_int4' means symmetric int4, 'asym_int4' means asymmetric int4, 'nf4' means 4-bit NormalFloat, etc. The relevant low-bit optimizations will be applied to the model.
  - optimize_model – boolean value, whether to further optimize the low-bit LLM model. Defaults to True.
  - modules_to_not_convert – list of str values, modules (nn.Module) that are skipped when conducting model optimizations. Defaults to None.
  - speculative – boolean value, whether to use speculative decoding. Defaults to False.
  - cpu_embedding – whether to replace the Embedding layer; may need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
  - lightweight_bmm – whether to replace the torch.bmm ops; may need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
  - imatrix – str value, filename of an importance matrix pretrained on specific datasets, for use with the improved quantization methods recently added to llama.cpp.
  - model_hub – str value, options are 'huggingface' and 'modelscope', specifying the model hub. Defaults to 'huggingface'.
  - embedding_qtype – str value; the only option for now is 'q2_k'. Defaults to None. The relevant low-bit optimizations will be applied to the nn.Embedding layer.
- Returns
a model instance
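A minimal sketch for a Whisper-style speech model; the repo id is a placeholder:

```python
from bigdl.llm.transformers import AutoModelForSpeechSeq2Seq
from transformers import WhisperProcessor

model_path = "openai/whisper-tiny"  # placeholder repo id

# Same from_pretrained extensions as above, applied to a speech seq2seq model.
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path, load_in_4bit=True)
processor = WhisperProcessor.from_pretrained(model_path)
```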
- classmethod load_convert(q_k, optimize_model, *args, **kwargs)#
- classmethod load_low_bit(pretrained_model_name_or_path, *model_args, **kwargs)#
Load a low-bit optimized model (including INT4, INT5 and INT8) from a saved checkpoint.
- Parameters
  - pretrained_model_name_or_path – str value, path of the saved optimized model checkpoint.
  - optimize_model – boolean value, whether to further optimize the low-bit LLM model. Defaults to True.
- Returns
a model instance
AutoModelForSeq2SeqLM#
- class bigdl.llm.transformers.AutoModelForSeq2SeqLM[source]#
Bases:
bigdl.llm.transformers.model._BaseAutoModelClass
- classmethod from_pretrained(*args, **kwargs)#
Load a model from a directory or the HF Hub. With the load_in_4bit or load_in_low_bit parameter, the weights of the model's linear layers can be loaded in a low-bit format, such as int4, int5 or int8.
Several new arguments are added to extend Hugging Face's from_pretrained method, as follows:
- Parameters
  - load_in_4bit – boolean value. True means loading the weights of the model's linear layers to symmetric int4 if the model is a regular fp16/bf16/fp32 model, and to asymmetric int4 if the model is a GPTQ model. Defaults to False.
  - load_in_low_bit – str value, options are 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8', 'nf3', 'nf4', 'fp4', 'fp8', 'fp8_e4m3', 'fp8_e5m2', 'gguf_iq2_xxs', 'gguf_iq2_xs', 'fp16' or 'bf16'. 'sym_int4' means symmetric int4, 'asym_int4' means asymmetric int4, 'nf4' means 4-bit NormalFloat, etc. The relevant low-bit optimizations will be applied to the model.
  - optimize_model – boolean value, whether to further optimize the low-bit LLM model. Defaults to True.
  - modules_to_not_convert – list of str values, modules (nn.Module) that are skipped when conducting model optimizations. Defaults to None.
  - speculative – boolean value, whether to use speculative decoding. Defaults to False.
  - cpu_embedding – whether to replace the Embedding layer; may need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
  - lightweight_bmm – whether to replace the torch.bmm ops; may need to be set to True when running BigDL-LLM on GPU on Windows. Defaults to False.
  - imatrix – str value, filename of an importance matrix pretrained on specific datasets, for use with the improved quantization methods recently added to llama.cpp.
  - model_hub – str value, options are 'huggingface' and 'modelscope', specifying the model hub. Defaults to 'huggingface'.
  - embedding_qtype – str value; the only option for now is 'q2_k'. Defaults to None. The relevant low-bit optimizations will be applied to the nn.Embedding layer.
- Returns
a model instance
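A minimal sketch for an encoder-decoder model; the repo id is a placeholder:

```python
from bigdl.llm.transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer

model_path = "google/flan-t5-large"  # placeholder repo id

# Here the low-bit format is chosen explicitly instead of load_in_4bit.
model = AutoModelForSeq2SeqLM.from_pretrained(model_path,
                                              load_in_low_bit="sym_int4")
tokenizer = AutoTokenizer.from_pretrained(model_path)
```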
- classmethod load_convert(q_k, optimize_model, *args, **kwargs)#
- classmethod load_low_bit(pretrained_model_name_or_path, *model_args, **kwargs)#
Load a low-bit optimized model (including INT4, INT5 and INT8) from a saved checkpoint.
- Parameters
  - pretrained_model_name_or_path – str value, path of the saved optimized model checkpoint.
  - optimize_model – boolean value, whether to further optimize the low-bit LLM model. Defaults to True.
- Returns
a model instance
Native Model#
For the llama/chatglm/bloom/gptneox/starcoder model families, you may also convert and run the LLM using the native (cpp) implementation for maximum performance. An end-to-end sketch follows the LlamaForCausalLM entry below; the other model classes follow the same pattern.
- class bigdl.llm.transformers.LlamaForCausalLM[source]#
Bases:
bigdl.llm.transformers.modelling_bigdl._BaseGGMLClass
- classmethod from_pretrained(pretrained_model_name_or_path: str, native: bool = True, dtype: str = 'int4', *args, **kwargs)#
- Parameters
  - pretrained_model_name_or_path – Path to the model checkpoint. If running with native int4, the path should be a converted BigDL-LLM optimized ggml binary checkpoint, produced by bigdl.llm.llm_convert. If running with transformers int4, the path should be the Hugging Face repo id to be downloaded or a Hugging Face checkpoint folder.
  - native – Whether to load the model as a native (ggml) int4 model or as a BigDL-LLM optimized Transformers model.
  - dtype – Which quantized precision to convert to. Currently only int4 and int8 are supported, and int8 only works for llama, gptneox and starcoder.
  - kwargs – Keyword arguments which will be passed to the model instance.
- Returns
a model instance
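A minimal sketch of the native flow: convert a Hugging Face checkpoint with bigdl.llm.llm_convert, then load the resulting ggml binary. The paths are placeholders, and the llm_convert keyword names (model, outfile, outtype, model_family) follow BigDL-LLM's conversion API; treat them as assumptions if your version differs:

```python
from bigdl.llm import llm_convert
from bigdl.llm.transformers import LlamaForCausalLM

# One-off conversion from a Hugging Face checkpoint to a BigDL-LLM
# optimized ggml binary; all paths are placeholders.
ggml_path = llm_convert(model="/path/to/llama-hf-checkpoint/",
                        outfile="/path/to/output/",
                        outtype="int4",
                        model_family="llama")

# Load and run the converted binary natively.
llm = LlamaForCausalLM.from_pretrained(ggml_path, native=True, dtype="int4")
```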
- class bigdl.llm.transformers.ChatGLMForCausalLM[source]#
Bases:
bigdl.llm.transformers.modelling_bigdl._BaseGGMLClass
- classmethod from_pretrained(pretrained_model_name_or_path: str, native: bool = True, dtype: str = 'int4', *args, **kwargs)#
- Parameters
  - pretrained_model_name_or_path – Path to the model checkpoint. If running with native int4, the path should be a converted BigDL-LLM optimized ggml binary checkpoint, produced by bigdl.llm.llm_convert. If running with transformers int4, the path should be the Hugging Face repo id to be downloaded or a Hugging Face checkpoint folder.
  - native – Whether to load the model as a native (ggml) int4 model or as a BigDL-LLM optimized Transformers model.
  - dtype – Which quantized precision to convert to. Currently only int4 and int8 are supported, and int8 only works for llama, gptneox and starcoder.
  - kwargs – Keyword arguments which will be passed to the model instance.
- Returns
a model instance
- class bigdl.llm.transformers.GptneoxForCausalLM[source]#
Bases:
bigdl.llm.transformers.modelling_bigdl._BaseGGMLClass
- classmethod from_pretrained(pretrained_model_name_or_path: str, native: bool = True, dtype: str = 'int4', *args, **kwargs)#
- Parameters
  - pretrained_model_name_or_path – Path to the model checkpoint. If running with native int4, the path should be a converted BigDL-LLM optimized ggml binary checkpoint, produced by bigdl.llm.llm_convert. If running with transformers int4, the path should be the Hugging Face repo id to be downloaded or a Hugging Face checkpoint folder.
  - native – Whether to load the model as a native (ggml) int4 model or as a BigDL-LLM optimized Transformers model.
  - dtype – Which quantized precision to convert to. Currently only int4 and int8 are supported, and int8 only works for llama, gptneox and starcoder.
  - kwargs – Keyword arguments which will be passed to the model instance.
- Returns
a model instance
- class bigdl.llm.transformers.BloomForCausalLM[source]#
Bases:
bigdl.llm.transformers.modelling_bigdl._BaseGGMLClass
- classmethod from_pretrained(pretrained_model_name_or_path: str, native: bool = True, dtype: str = 'int4', *args, **kwargs)#
- Parameters
  - pretrained_model_name_or_path – Path to the model checkpoint. If running with native int4, the path should be a converted BigDL-LLM optimized ggml binary checkpoint, produced by bigdl.llm.llm_convert. If running with transformers int4, the path should be the Hugging Face repo id to be downloaded or a Hugging Face checkpoint folder.
  - native – Whether to load the model as a native (ggml) int4 model or as a BigDL-LLM optimized Transformers model.
  - dtype – Which quantized precision to convert to. Currently only int4 and int8 are supported, and int8 only works for llama, gptneox and starcoder.
  - kwargs – Keyword arguments which will be passed to the model instance.
- Returns
a model instance
- class bigdl.llm.transformers.StarcoderForCausalLM[source]#
Bases:
bigdl.llm.transformers.modelling_bigdl._BaseGGMLClass
- classmethod from_pretrained(pretrained_model_name_or_path: str, native: bool = True, dtype: str = 'int4', *args, **kwargs)#
- Parameters
  - pretrained_model_name_or_path – Path to the model checkpoint. If running with native int4, the path should be a converted BigDL-LLM optimized ggml binary checkpoint, produced by bigdl.llm.llm_convert. If running with transformers int4, the path should be the Hugging Face repo id to be downloaded or a Hugging Face checkpoint folder.
  - native – Whether to load the model as a native (ggml) int4 model or as a BigDL-LLM optimized Transformers model.
  - dtype – Which quantized precision to convert to. Currently only int4 and int8 are supported, and int8 only works for llama, gptneox and starcoder.
  - kwargs – Keyword arguments which will be passed to the model instance.
- Returns
a model instance