Hugging Face Transformers
Format
Load in Low Precision
You can apply INT4 optimizations to any Hugging Face Transformers model as follows:
# load a Hugging Face Transformers model with INT4 optimizations
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = '/path/to/model/'
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
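What `load_in_4bit=True` requests is 4-bit weight quantization. As a rough, self-contained illustration of the idea (this is not BigDL-LLM's actual kernel, and `quantize_int4`/`dequantize_int4` are hypothetical names), symmetric INT4 quantization stores each weight as one of 16 integer levels in [-8, 7] plus a shared floating-point scale:

```python
# Illustrative sketch only, not BigDL-LLM internals: symmetric INT4
# quantization maps each float weight to an integer in [-8, 7] using a
# shared scale, so a weight costs 4 bits instead of 16 or 32.

def quantize_int4(weights):
    """Quantize a list of floats to INT4 codes plus one scale."""
    # scale so the largest-magnitude weight maps near the edge of [-8, 7];
    # `or 1.0` guards against an all-zero input
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int4(codes, scale):
    """Recover approximate float weights from INT4 codes."""
    return [c * scale for c in codes]

weights = [0.42, -1.31, 0.07, 0.95]
codes, scale = quantize_int4(weights)
approx = dequantize_int4(codes, scale)
```

Rounding to the nearest level bounds the per-weight reconstruction error by half the scale, which is why low-bit models usually stay close to full-precision quality.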
After loading, you can run the optimized model exactly as you would the original Transformers model:
# run the optimized model
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
Save & Load
Once the model has been optimized with INT4 (or INT5/INT8), you can save the optimized model and load it back later without repeating the optimization:
# save the optimized low-bit model
model.save_low_bit(model_path)

# load the saved low-bit model
new_model = AutoModelForCausalLM.load_low_bit(model_path)
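As a conceptual sketch of why the low-bit save/load round trip helps (this is not BigDL-LLM's real on-disk format, and `save_low_bit_sketch`/`load_low_bit_sketch` are hypothetical names), persisting the already-quantized weights means a reload needs neither the original full-precision checkpoint nor a second quantization pass:

```python
# Conceptual sketch, NOT BigDL-LLM's actual serialization: persist the
# quantized codes and scale directly, then reload them as-is.
import os
import pickle
import tempfile

def save_low_bit_sketch(path, codes, scale):
    """Write quantized weights (INT4 codes + scale) straight to disk."""
    with open(path, "wb") as f:
        pickle.dump({"codes": codes, "scale": scale}, f)

def load_low_bit_sketch(path):
    """Read quantized weights back without re-quantizing anything."""
    with open(path, "rb") as f:
        blob = pickle.load(f)
    return blob["codes"], blob["scale"]

path = os.path.join(tempfile.mkdtemp(), "model_int4.bin")
save_low_bit_sketch(path, [2, -7, 0, 5], 0.187)
codes, scale = load_low_bit_sketch(path)
```

Because only the low-bit representation is stored, the saved artifact is also much smaller than the full-precision checkpoint it was derived from.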