BigDL-LLM transformers-style API

Hugging Face transformers AutoModel

You can apply BigDL-LLM optimizations to any Hugging Face Transformers model by using the standard AutoModel APIs.

AutoModelForCausalLM

class bigdl.llm.transformers.AutoModelForCausalLM

Bases: bigdl.llm.transformers.model._BaseAutoModelClass

classmethod from_pretrained(*args, **kwargs)

Load a model from a directory or the HF Hub. With the load_in_4bit or load_in_low_bit parameter, the weights of the model’s linear layers can be loaded in a low-bit format such as int4, int5 or int8.

The following arguments are added to extend Hugging Face’s from_pretrained method:

Parameters
  • load_in_4bit – boolean value. True loads the weights of linear layers as symmetric int4 if the model is a regular fp16/bf16/fp32 model, and as asymmetric int4 if the model is a GPTQ model. Defaults to False.

  • load_in_low_bit – str value. Options are 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8', 'nf3', 'nf4', 'fp4', 'fp8', 'fp8_e4m3', 'fp8_e5m2', 'iq2_xxs', 'iq2_xs', 'fp16' or 'bf16'. 'sym_int4' means symmetric int4, 'asym_int4' means asymmetric int4, 'nf4' means 4-bit NormalFloat, etc. The relevant low-bit optimizations will be applied to the model.

  • optimize_model – boolean value. Whether to further optimize the low-bit LLM model. Defaults to True.

  • modules_to_not_convert – list of str values. Modules (nn.Module) that are skipped during model optimization. Defaults to None.

  • speculative – boolean value. Whether to use speculative decoding. Defaults to False.

  • cpu_embedding – Whether to replace the Embedding layer. You may need to set it to True when running BigDL-LLM on a GPU on Windows. Defaults to False.

  • lightweight_bmm – Whether to replace the torch.bmm ops. You may need to set it to True when running BigDL-LLM on a GPU on Windows. Defaults to False.

  • imatrix – str value. Filename of an importance matrix pretrained on specific datasets, for use with the improved quantization methods recently added to llama.cpp.

  • model_hub – str value. Specifies the model hub; options are 'huggingface' and 'modelscope'. Defaults to 'huggingface'.

  • embedding_qtype – str value. The only option currently is 'q2_k'. Defaults to None. The relevant low-bit optimizations will be applied to the nn.Embedding layer.

Returns

a model instance
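
As an illustration of how load_in_4bit and load_in_low_bit might be chosen at call time, here is a minimal sketch. The helper names build_load_kwargs and load_causal_lm are hypothetical and not part of BigDL-LLM; the precision strings are copied from the load_in_low_bit description above, and the model path is a placeholder.

```python
# Precision strings copied from the load_in_low_bit parameter description.
LOW_BIT_OPTIONS = {
    "sym_int4", "asym_int4", "sym_int5", "asym_int5", "sym_int8",
    "nf3", "nf4", "fp4", "fp8", "fp8_e4m3", "fp8_e5m2",
    "iq2_xxs", "iq2_xs", "fp16", "bf16",
}


def build_load_kwargs(low_bit=None, **extra):
    """Build keyword arguments for from_pretrained (hypothetical helper).

    With low_bit=None the helper falls back to load_in_4bit=True;
    otherwise it validates the string and passes it as load_in_low_bit.
    """
    if low_bit is None:
        return {"load_in_4bit": True, **extra}
    if low_bit not in LOW_BIT_OPTIONS:
        raise ValueError(f"unsupported load_in_low_bit value: {low_bit!r}")
    return {"load_in_low_bit": low_bit, **extra}


def load_causal_lm(model_path, low_bit=None, **extra):
    """Load a causal LM with low-bit optimizations (hypothetical wrapper)."""
    # Imported lazily so this file can be imported without bigdl-llm installed.
    from bigdl.llm.transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_path, **build_load_kwargs(low_bit, **extra)
    )
```

A call such as load_causal_lm("/path/to/model", low_bit="nf4") would then request 4-bit NormalFloat weights, while load_causal_lm("/path/to/model") would use the load_in_4bit shortcut.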

classmethod load_convert(q_k, optimize_model, *args, **kwargs)

classmethod load_low_bit(pretrained_model_name_or_path, *model_args, **kwargs)

Load a low-bit optimized model (including INT4, INT5 and INT8) from a saved checkpoint.

Parameters
  • pretrained_model_name_or_path – str value. Path to the optimized model checkpoint.

  • optimize_model – boolean value. Whether to further optimize the low-bit LLM model. Defaults to True.

Returns

a model instance
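
One common pattern is to convert a model once and reuse the saved low-bit checkpoint on later runs. The sketch below is written under two assumptions that this page does not state: that the loaded model exposes a save_low_bit(path) method for writing the checkpoint, and that the checkpoint directory contains a config.json file, as Hugging Face save_pretrained directories typically do. The helper names are hypothetical.

```python
import os


def has_saved_checkpoint(low_bit_dir):
    """Return True if the directory looks like a saved checkpoint.

    Assumption: a saved low-bit checkpoint contains a config.json file.
    """
    return os.path.isfile(os.path.join(low_bit_dir, "config.json"))


def load_or_convert(model_path, low_bit_dir):
    """Reuse a saved low-bit checkpoint, or convert and save one."""
    # Imported lazily so this file can be imported without bigdl-llm installed.
    from bigdl.llm.transformers import AutoModelForCausalLM

    if has_saved_checkpoint(low_bit_dir):
        # Fast path: the weights are already stored in low-bit format.
        return AutoModelForCausalLM.load_low_bit(low_bit_dir)

    # Slow path: quantize the original checkpoint, then save it for next time.
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
    model.save_low_bit(low_bit_dir)  # assumed API, not documented on this page
    return model
```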

AutoModel

class bigdl.llm.transformers.AutoModel

Bases: bigdl.llm.transformers.model._BaseAutoModelClass

classmethod from_pretrained(*args, **kwargs)

Load a model from a directory or the HF Hub. With the load_in_4bit or load_in_low_bit parameter, the weights of the model’s linear layers can be loaded in a low-bit format such as int4, int5 or int8.

The following arguments are added to extend Hugging Face’s from_pretrained method:

Parameters
  • load_in_4bit – boolean value. True loads the weights of linear layers as symmetric int4 if the model is a regular fp16/bf16/fp32 model, and as asymmetric int4 if the model is a GPTQ model. Defaults to False.

  • load_in_low_bit – str value. Options are 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8', 'nf3', 'nf4', 'fp4', 'fp8', 'fp8_e4m3', 'fp8_e5m2', 'iq2_xxs', 'iq2_xs', 'fp16' or 'bf16'. 'sym_int4' means symmetric int4, 'asym_int4' means asymmetric int4, 'nf4' means 4-bit NormalFloat, etc. The relevant low-bit optimizations will be applied to the model.

  • optimize_model – boolean value. Whether to further optimize the low-bit LLM model. Defaults to True.

  • modules_to_not_convert – list of str values. Modules (nn.Module) that are skipped during model optimization. Defaults to None.

  • speculative – boolean value. Whether to use speculative decoding. Defaults to False.

  • cpu_embedding – Whether to replace the Embedding layer. You may need to set it to True when running BigDL-LLM on a GPU on Windows. Defaults to False.

  • lightweight_bmm – Whether to replace the torch.bmm ops. You may need to set it to True when running BigDL-LLM on a GPU on Windows. Defaults to False.

  • imatrix – str value. Filename of an importance matrix pretrained on specific datasets, for use with the improved quantization methods recently added to llama.cpp.

  • model_hub – str value. Specifies the model hub; options are 'huggingface' and 'modelscope'. Defaults to 'huggingface'.

  • embedding_qtype – str value. The only option currently is 'q2_k'. Defaults to None. The relevant low-bit optimizations will be applied to the nn.Embedding layer.

Returns

a model instance
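
The cpu_embedding and lightweight_bmm descriptions say these flags may be needed for GPU runs on Windows. A minimal sketch of switching them on by platform follows; it assumes the model is destined for a GPU, and the helper names windows_gpu_kwargs and load_auto_model are hypothetical. Whether both flags are actually required depends on the model and device.

```python
import platform


def windows_gpu_kwargs(system=None):
    """Return extra from_pretrained flags for a Windows GPU run.

    Hypothetical helper: it only inspects the operating system name and
    assumes the caller intends to run the model on a GPU.
    """
    system = platform.system() if system is None else system
    if system == "Windows":
        return {"cpu_embedding": True, "lightweight_bmm": True}
    return {}


def load_auto_model(model_path, low_bit="sym_int4"):
    """Load a model through AutoModel with platform-dependent flags."""
    # Imported lazily so this file can be imported without bigdl-llm installed.
    from bigdl.llm.transformers import AutoModel

    return AutoModel.from_pretrained(
        model_path,
        load_in_low_bit=low_bit,
        **windows_gpu_kwargs(),
    )
```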

classmethod load_convert(q_k, optimize_model, *args, **kwargs)

classmethod load_low_bit(pretrained_model_name_or_path, *model_args, **kwargs)

Load a low-bit optimized model (including INT4, INT5 and INT8) from a saved checkpoint.

Parameters
  • pretrained_model_name_or_path – str value. Path to the optimized model checkpoint.

  • optimize_model – boolean value. Whether to further optimize the low-bit LLM model. Defaults to True.

Returns

a model instance

AutoModelForSpeechSeq2Seq

class bigdl.llm.transformers.AutoModelForSpeechSeq2Seq

Bases: bigdl.llm.transformers.model._BaseAutoModelClass

classmethod from_pretrained(*args, **kwargs)

Load a model from a directory or the HF Hub. With the load_in_4bit or load_in_low_bit parameter, the weights of the model’s linear layers can be loaded in a low-bit format such as int4, int5 or int8.

The following arguments are added to extend Hugging Face’s from_pretrained method:

Parameters
  • load_in_4bit – boolean value. True loads the weights of linear layers as symmetric int4 if the model is a regular fp16/bf16/fp32 model, and as asymmetric int4 if the model is a GPTQ model. Defaults to False.

  • load_in_low_bit – str value. Options are 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8', 'nf3', 'nf4', 'fp4', 'fp8', 'fp8_e4m3', 'fp8_e5m2', 'iq2_xxs', 'iq2_xs', 'fp16' or 'bf16'. 'sym_int4' means symmetric int4, 'asym_int4' means asymmetric int4, 'nf4' means 4-bit NormalFloat, etc. The relevant low-bit optimizations will be applied to the model.

  • optimize_model – boolean value. Whether to further optimize the low-bit LLM model. Defaults to True.

  • modules_to_not_convert – list of str values. Modules (nn.Module) that are skipped during model optimization. Defaults to None.

  • speculative – boolean value. Whether to use speculative decoding. Defaults to False.

  • cpu_embedding – Whether to replace the Embedding layer. You may need to set it to True when running BigDL-LLM on a GPU on Windows. Defaults to False.

  • lightweight_bmm – Whether to replace the torch.bmm ops. You may need to set it to True when running BigDL-LLM on a GPU on Windows. Defaults to False.

  • imatrix – str value. Filename of an importance matrix pretrained on specific datasets, for use with the improved quantization methods recently added to llama.cpp.

  • model_hub – str value. Specifies the model hub; options are 'huggingface' and 'modelscope'. Defaults to 'huggingface'.

  • embedding_qtype – str value. The only option currently is 'q2_k'. Defaults to None. The relevant low-bit optimizations will be applied to the nn.Embedding layer.

Returns

a model instance

classmethod load_convert(q_k, optimize_model, *args, **kwargs)

classmethod load_low_bit(pretrained_model_name_or_path, *model_args, **kwargs)

Load a low-bit optimized model (including INT4, INT5 and INT8) from a saved checkpoint.

Parameters
  • pretrained_model_name_or_path – str value. Path to the optimized model checkpoint.

  • optimize_model – boolean value. Whether to further optimize the low-bit LLM model. Defaults to True.

Returns

a model instance

AutoModelForSeq2SeqLM

class bigdl.llm.transformers.AutoModelForSeq2SeqLM

Bases: bigdl.llm.transformers.model._BaseAutoModelClass

classmethod from_pretrained(*args, **kwargs)

Load a model from a directory or the HF Hub. With the load_in_4bit or load_in_low_bit parameter, the weights of the model’s linear layers can be loaded in a low-bit format such as int4, int5 or int8.

The following arguments are added to extend Hugging Face’s from_pretrained method:

Parameters
  • load_in_4bit – boolean value. True loads the weights of linear layers as symmetric int4 if the model is a regular fp16/bf16/fp32 model, and as asymmetric int4 if the model is a GPTQ model. Defaults to False.

  • load_in_low_bit – str value. Options are 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8', 'nf3', 'nf4', 'fp4', 'fp8', 'fp8_e4m3', 'fp8_e5m2', 'iq2_xxs', 'iq2_xs', 'fp16' or 'bf16'. 'sym_int4' means symmetric int4, 'asym_int4' means asymmetric int4, 'nf4' means 4-bit NormalFloat, etc. The relevant low-bit optimizations will be applied to the model.

  • optimize_model – boolean value. Whether to further optimize the low-bit LLM model. Defaults to True.

  • modules_to_not_convert – list of str values. Modules (nn.Module) that are skipped during model optimization. Defaults to None.

  • speculative – boolean value. Whether to use speculative decoding. Defaults to False.

  • cpu_embedding – Whether to replace the Embedding layer. You may need to set it to True when running BigDL-LLM on a GPU on Windows. Defaults to False.

  • lightweight_bmm – Whether to replace the torch.bmm ops. You may need to set it to True when running BigDL-LLM on a GPU on Windows. Defaults to False.

  • imatrix – str value. Filename of an importance matrix pretrained on specific datasets, for use with the improved quantization methods recently added to llama.cpp.

  • model_hub – str value. Specifies the model hub; options are 'huggingface' and 'modelscope'. Defaults to 'huggingface'.

  • embedding_qtype – str value. The only option currently is 'q2_k'. Defaults to None. The relevant low-bit optimizations will be applied to the nn.Embedding layer.

Returns

a model instance

classmethod load_convert(q_k, optimize_model, *args, **kwargs)

classmethod load_low_bit(pretrained_model_name_or_path, *model_args, **kwargs)

Load a low-bit optimized model (including INT4, INT5 and INT8) from a saved checkpoint.

Parameters
  • pretrained_model_name_or_path – str value. Path to the optimized model checkpoint.

  • optimize_model – boolean value. Whether to further optimize the low-bit LLM model. Defaults to True.

Returns

a model instance

Native Model

For the llama/chatglm/bloom/gptneox/starcoder model families, you may also convert and run the LLM using the native (cpp) implementation for maximum performance.

class bigdl.llm.transformers.LlamaForCausalLM

Bases: bigdl.llm.transformers.modelling_bigdl._BaseGGMLClass

classmethod from_pretrained(pretrained_model_name_or_path: str, native: bool = True, dtype: str = 'int4', *args, **kwargs)

Parameters
  • pretrained_model_name_or_path – Path to the model checkpoint. If running with native int4, the path should point to a BigDL-LLM optimized ggml binary checkpoint converted by bigdl.llm.llm_convert. If running with transformers int4, the path should be the Hugging Face repo id to download or the Hugging Face checkpoint folder.

  • native – Whether to load the model as native (ggml) int4 or as a BigDL-LLM optimized Transformers model.

  • dtype – The quantized precision to use. Currently only int4 and int8 are supported, and int8 only works for llama, gptneox and starcoder.

  • kwargs – Keyword arguments that will be passed to the model instance.

Returns

a model instance
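
The dtype restrictions above can be checked before loading. The following is a minimal sketch: the family names and the int8 restriction are taken from this section, while the helper names check_native_dtype and load_native_llama are hypothetical and the checkpoint path is a placeholder for a ggml binary produced by bigdl.llm.llm_convert.

```python
# Model families listed in this section as having a native implementation.
NATIVE_FAMILIES = {"llama", "chatglm", "bloom", "gptneox", "starcoder"}
# Families for which the dtype description says int8 works.
INT8_FAMILIES = {"llama", "gptneox", "starcoder"}


def check_native_dtype(family, dtype="int4"):
    """Validate a (family, dtype) pair and return the dtype (hypothetical helper)."""
    if family not in NATIVE_FAMILIES:
        raise ValueError(f"no native implementation listed for {family!r}")
    if dtype == "int4":
        return dtype
    if dtype == "int8" and family in INT8_FAMILIES:
        return dtype
    raise ValueError(f"dtype {dtype!r} is not supported for {family!r}")


def load_native_llama(ggml_path, dtype="int4"):
    """Load a converted ggml checkpoint with the native llama implementation."""
    # Imported lazily so this file can be imported without bigdl-llm installed.
    from bigdl.llm.transformers import LlamaForCausalLM

    return LlamaForCausalLM.from_pretrained(
        ggml_path,
        native=True,
        dtype=check_native_dtype("llama", dtype),
    )
```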