Important
bigdl-llm has now become ipex-llm, and our future development will move to the IPEX-LLM project.
The BigDL Project#
BigDL-LLM#
bigdl-llm is a library for running LLMs (large language models) on Intel XPU (from laptop to GPU to cloud) using INT4/FP4/INT8/FP8 with very low latency [1] (for any PyTorch model).
Note
It is built on top of the excellent work of llama.cpp, gptq, bitsandbytes, qlora, etc.
Latest update 🔥#
[2024/03] 🔔🔔🔔 bigdl-llm has now become ipex-llm; see the migration guide here.
[2024/03] LangChain added support for bigdl-llm; see the details here.
[2024/02] bigdl-llm now supports directly loading models from ModelScope (魔搭).
[2024/02] bigdl-llm added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on an Intel GPU with 16GB VRAM.
[2024/02] Users can now use bigdl-llm through the Text-Generation-WebUI GUI.
[2024/02] bigdl-llm now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively.
[2024/02] bigdl-llm now supports a comprehensive list of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
[2024/01] Using bigdl-llm QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
[2023/12] bigdl-llm now supports ReLoRA (see “ReLoRA: High-Rank Training Through Low-Rank Updates”).
[2023/12] bigdl-llm now supports Mixtral-8x7B on both Intel GPU and CPU.
[2023/12] bigdl-llm now supports QA-LoRA (see “QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models”).
[2023/12] bigdl-llm now supports FP8 and FP4 inference on Intel GPU.
[2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into bigdl-llm is available.
[2023/11] bigdl-llm now supports vLLM continuous batching on both Intel GPU and CPU.
[2023/10] bigdl-llm now supports QLoRA finetuning on both Intel GPU and CPU.
[2023/10] bigdl-llm now supports FastChat serving on both Intel CPU and GPU.
[2023/09] bigdl-llm now supports Intel GPU (including Arc, Flex and MAX).
[2023/09] bigdl-llm tutorial is released.
Over 30 models have been verified on bigdl-llm, including LLaMA/LLaMA2, ChatGLM2/ChatGLM3, Mistral, Falcon, MPT, LLaVA, WizardCoder, Dolly, Whisper, Baichuan/Baichuan2, InternLM, Skywork, QWen/Qwen-VL, Aquila, MOSS and more; see the complete list here.
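To see why the INT2 entry in the update list matters for a 16GB GPU, a rough weight-memory estimate may help. The sketch below is only back-of-the-envelope arithmetic: the parameter count for Mixtral-8x7B (about 46.7 billion) is an assumed, approximate figure, and the calculation ignores activations, the KV cache and per-block quantization metadata, so real usage will be higher.

```python
# Back-of-the-envelope estimate of weight memory at different bit widths.
# Assumption: ~46.7e9 total parameters for Mixtral-8x7B (approximate figure).
# Ignores activations, KV cache and per-block quantization metadata.

def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)

MIXTRAL_8X7B_PARAMS = 46.7e9  # assumed, approximate

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name}: {weight_memory_gib(MIXTRAL_8X7B_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights need roughly 22 GiB at 4 bits but roughly 11 GiB at 2 bits, which is consistent with the claim that INT2 brings a model of this size within reach of 16GB of VRAM.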
bigdl-llm demos#
See the optimized performance of chatglm2-6b and llama-2-13b-chat models on 12th Gen Intel Core CPU and Intel Arc GPU below.
[Demo figures: chatglm2-6b and llama-2-13b-chat on 12th Gen Intel Core CPU; chatglm2-6b and llama-2-13b-chat on Intel Arc GPU]
bigdl-llm quickstart#
CPU Quickstart#
You may install bigdl-llm on Intel CPU as follows:
Note
See the CPU installation guide for more details.
pip install --pre --upgrade bigdl-llm[all]
Note
bigdl-llm has been tested on Python 3.9, 3.10 and 3.11.
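If you want to confirm that your interpreter is one of the tested versions before going further, a small check along these lines may help. This is an illustrative sketch, not part of bigdl-llm; the helper name `is_tested_python` is hypothetical.

```python
import sys

# Versions listed as tested in the note above.
TESTED_VERSIONS = {(3, 9), (3, 10), (3, 11)}

def is_tested_python(version_info=sys.version_info) -> bool:
    """Return True if the major.minor version is one of the tested versions."""
    return (version_info[0], version_info[1]) in TESTED_VERSIONS

if not is_tested_python():
    print(f"Warning: Python {sys.version_info[0]}.{sys.version_info[1]} "
          "is not in the tested set (3.9, 3.10, 3.11)")
```

Other versions may still work; the check only reports whether the current one is in the tested set.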
You can then apply INT4 optimizations to any Hugging Face Transformers model as follows.
# load a Hugging Face Transformers model with INT4 optimizations
from bigdl.llm.transformers import AutoModelForCausalLM
model_path = '/path/to/model/'  # placeholder path to the model
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
# run the optimized model on Intel CPU
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
# input_str is your prompt; the remaining arguments (...) are elided here
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
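For readers unfamiliar with what an "INT4 optimization" does to model weights, the following is a conceptual sketch of symmetric 4-bit quantization over one non-empty block of values. It is an illustration only: it is not bigdl-llm's implementation, the function names are hypothetical, and the actual formats and kernels used by the library may differ.

```python
from typing import List, Tuple

def quantize_int4(block: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-8, 7] using one shared scale per block."""
    max_abs = max(abs(x) for x in block)
    # Largest magnitude maps to 7; an all-zero block gets a dummy scale.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return q, scale

def dequantize_int4(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the 4-bit integers and the scale."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]
q, scale = quantize_int4(weights)
restored = dequantize_int4(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", q)
print("scale:", scale)
print("max reconstruction error:", max_err)
```

Each weight is stored in 4 bits plus a shared scale per block, and the reconstruction error is bounded by half the scale. That is the basic trade of memory for precision that low-bit inference relies on.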
GPU Quickstart#
You may install bigdl-llm on Intel GPU as follows:
Note
See the GPU installation guide for more details.
# the command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
Note
bigdl-llm has been tested on Python 3.9, 3.10 and 3.11.
You can then apply INT4 optimizations to any Hugging Face Transformers model on Intel GPU as follows.
# load a Hugging Face Transformers model with INT4 optimizations
from bigdl.llm.transformers import AutoModelForCausalLM
model_path = '/path/to/model/'  # placeholder path to the model
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
# run the optimized model on Intel GPU
model = model.to('xpu')
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
# input_str is your prompt; the remaining arguments (...) are elided here
input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
# move the generated IDs back to the CPU before decoding
output = tokenizer.batch_decode(output_ids.cpu())
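The CPU and GPU snippets differ only in where the model and inputs are placed, so a script meant to run on both can choose the device at runtime. The helper below is a minimal sketch, not part of bigdl-llm: the name `pick_device` is hypothetical, and it assumes that `torch.xpu.is_available()` is exposed by your PyTorch build. On some installations this may only be the case after `intel_extension_for_pytorch` has been imported.

```python
def pick_device() -> str:
    """Return 'xpu' if PyTorch reports an available Intel XPU, else 'cpu'."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to run on an XPU.
        return "cpu"
    # torch.xpu may be absent on builds without XPU support (assumption noted above).
    is_available = getattr(getattr(torch, "xpu", None), "is_available", None)
    return "xpu" if callable(is_available) and is_available() else "cpu"

device = pick_device()
print(f"Selected device: {device}")
```

The returned string can then be passed to `model.to(device)` and `input_ids.to(device)` in place of the hard-coded `'xpu'`.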
For more details, please refer to the bigdl-llm Document and API Doc.
Overview of the complete BigDL project#
BigDL seamlessly scales your data analytics & AI applications from laptop to cloud, with the following libraries:
LLM: Low-bit (INT3/INT4/INT5/INT8) large language model library for Intel CPU/GPU
Orca: Distributed Big Data & AI (TF & PyTorch) Pipeline on Spark and Ray
Nano: Transparent Acceleration of TensorFlow & PyTorch Programs on Intel CPU/GPU
DLlib: “Equivalent of Spark MLlib” for Deep Learning
Chronos: Scalable Time Series Analysis using AutoML
Friesian: End-to-End Recommendation Systems
PPML: Secure Big Data and AI (with SGX Hardware Security)
Choosing the right BigDL library#
[1] Performance varies by use, configuration and other factors. bigdl-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.