diff --git a/demos/bigdl-llm/README.md b/demos/bigdl-llm/README.md
new file mode 100644
index 00000000..84cd3fba
--- /dev/null
+++ b/demos/bigdl-llm/README.md
@@ -0,0 +1,104 @@
+# LLM inference in TEE
+
+LLM (Large Language Model) inference in a TEE can protect the model, the input prompt, and the output. The key questions are:
+
+1. Can LLM inference run in a TEE at all?
+2. Is the performance of LLM inference in a TEE (CPU only) acceptable?
+
+With the significant LLM inference speed-up brought by [BigDL-LLM](https://github.com/intel-analytics/BigDL/tree/main/python/llm) and the Occlum LibOS, high-performance and efficient LLM inference in a TEE is now achievable.
+
+The demo below runs chatglm2-6b model inference in a TEE with Occlum and BigDL-LLM.
+
+## Start the Occlum development container
+```bash
+docker run --rm -it --network host \
+    --device /dev/sgx_enclave --device /dev/sgx_provision \
+    occlum/occlum:latest-ubuntu20.04 bash
+```
+
+## Download the model
+
+First, download [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) and put it in a directory such as **/work/models**.
+
+## Install the required Python packages
+
+Just run the script [install_python_with_conda.sh](./install_python_with_conda.sh).
+It creates a conda Python 3.9 environment and installs **bigdl-llm[all]** and **bigdl-llm[serving]**.
+
+## Build the Occlum instance
+
+Besides the Python packages, the demo code and the model also have to be available inside the Occlum instance.
+
+The demo code below is copied from the BigDL-LLM [chatglm2](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) example.
+```
+./chatglm2/generate.py
+./chatglm2/streamchat.py
+```
+
+To simplify the Occlum instance creation, a hostfs mount is used to expose the previously downloaded chatglm2-6b model inside Occlum. Just run the script [build_occlum_instance.sh](./build_occlum_instance.sh) to create the Occlum instance.
+
+* Note: this demo protects the model in use (in TEE memory) but not the model at rest (in storage). Protecting the model in storage requires additional steps, such as model encryption and decryption.
+
+## Run the demo
+
+### Example 1: Predict Tokens using the `generate()` API
+
+```bash
+cd occlum_instance
+HF_DATASETS_CACHE=/root/cache \
+    occlum run /bin/python3 /chatglm2/generate.py \
+    --repo-id-or-model-path /models/chatglm2-6b
+```
+
+### Example 2: Stream Chat using the `stream_chat()` API
+
+```bash
+cd occlum_instance
+HF_DATASETS_CACHE=/root/cache \
+    occlum run /bin/python3 /chatglm2/streamchat.py \
+    --repo-id-or-model-path /models/chatglm2-6b
+```
+
+For both examples, refer to the BigDL-LLM [chatglm2](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) example for more details on the available arguments.
+
+
+## Do inference with a web UI
+
+[FastChat](https://github.com/lm-sys/FastChat#serving-with-web-gui) is an open platform for training, serving, and evaluating large language model based chatbots.
+
+BigDL-LLM also supports FastChat, with BigDL-LLM acting as the serving backend in the deployment. For details, please refer to [BigDL-LLM serving](https://github.com/intel-analytics/BigDL/tree/main/python/llm/src/bigdl/llm/serving).
+
+For this demo, the commands below show how to run an inference service in Occlum with a web UI.
+
+To load models with BigDL-LLM, the model name must include "bigdl". In our case, first create a soft link **chatglm2-6b-bigdl** pointing to **chatglm2-6b**, as shown below.
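+Assuming the model was downloaded to **/work/models** (the host directory that `build_occlum_instance.sh` mounts into Occlum as `/models`), the link can be created roughly like this; adjust the path if your model lives elsewhere:
+
+```bash
+# Create an alias whose name contains "bigdl" so that BigDL-LLM serving
+# recognizes the model; a relative symlink next to the checkpoint is enough.
+cd /work/models
+ln -s chatglm2-6b chatglm2-6b-bigdl
+```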
+
+### Serving with WebGUI
+
+To serve using the web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller to coordinate the web servers and model workers.
+
+#### Launch the Controller
+```bash
+./python-occlum/bin/python -m fastchat.serve.controller
+```
+
+This controller manages the distributed workers.
+
+#### Launch the model worker(s) in Occlum
+```bash
+cd occlum_instance
+occlum start
+HF_DATASETS_CACHE=/root/cache occlum exec /bin/python3 -m bigdl.llm.serving.model_worker --model-path /models/chatglm2-6b-bigdl --device cpu --host 0.0.0.0
+```
+Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker registers itself with the controller.
+
+#### Launch the Gradio web server in Occlum
+
+```bash
+occlum exec /bin/python3 -m fastchat.serve.gradio_web_server --host 0.0.0.0
+```
+
+This is the user interface that users interact with.
+
+By following these steps, you will be able to serve your models using the web UI with `BigDL-LLM` as the backend. You can now open your browser and chat with the model.
+
+
diff --git a/demos/bigdl-llm/build_occlum_instance.sh b/demos/bigdl-llm/build_occlum_instance.sh
new file mode 100755
index 00000000..c01364a7
--- /dev/null
+++ b/demos/bigdl-llm/build_occlum_instance.sh
@@ -0,0 +1,38 @@
+#!/bin/bash
+set -e
+
+BLUE='\033[1;34m'
+NC='\033[0m'
+
+script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
+python_dir="$script_dir/occlum_instance/image/opt/python-occlum"
+
+
+function build_instance()
+{
+    rm -rf occlum_instance && occlum new occlum_instance
+    pushd occlum_instance
+    rm -rf image
+    copy_bom -f ../llm.yaml --root image --include-dir /opt/occlum/etc/template
+
+    new_json="$(jq '.resource_limits.user_space_size = "60GB" |
+                .resource_limits.kernel_space_heap_size = "1GB" |
+                .resource_limits.max_num_of_threads = 500 |
+                .env.default += ["PYTHONHOME=/opt/python-occlum"] |
+                .env.default += ["PATH=/bin"] |
+                .env.default += ["HOME=/root"] |
+                .env.untrusted += ["HF_DATASETS_CACHE", "OMP_NUM_THREADS"]' Occlum.json)" && \
+    echo "${new_json}" > Occlum.json
+
+    # Mount the model via hostfs for test purposes only.
+    # In production, the model in storage should be protected, e.g. by encryption.
+    mkdir -p image/models
+    new_json="$(cat Occlum.json | jq '.mount+=[{"target": "/models", "type": "hostfs","source": "/work/models"}]')" && \
+    echo "${new_json}" > Occlum.json
+
+    occlum build
+    popd
+}
+
+build_instance
+
diff --git a/demos/bigdl-llm/chatglm2/generate.py b/demos/bigdl-llm/chatglm2/generate.py
new file mode 100644
index 00000000..3bae5c1a
--- /dev/null
+++ b/demos/bigdl-llm/chatglm2/generate.py
@@ -0,0 +1,71 @@
+#
+# Copyright 2016 The BigDL Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import torch
+import time
+import argparse
+import numpy as np
+
+from bigdl.llm.transformers import AutoModel, AutoModelForCausalLM
+from transformers import AutoTokenizer
+
+# You could tune the prompt based on your own model.
+# Here the prompt format follows https://huggingface.co/THUDM/chatglm2-6b/blob/main/modeling_chatglm.py#L1007
+CHATGLM_V2_PROMPT_FORMAT = "问:{prompt}\n\n答:"
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM2 model')
+    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/chatglm2-6b",
+                        help='The huggingface repo id for the ChatGLM2 model to be downloaded'
+                             ', or the path to the huggingface checkpoint folder')
+    parser.add_argument('--prompt', type=str, default="AI是什么?",
+                        help='Prompt to infer')
+    parser.add_argument('--n-predict', type=int, default=32,
+                        help='Max tokens to predict')
+
+    args = parser.parse_args()
+    model_path = args.repo_id_or_model_path
+
+    # Load the model in 4 bit,
+    # which converts the relevant layers in the model into INT4 format
+    model = AutoModel.from_pretrained(model_path,
+                                      load_in_4bit=True,
+                                      trust_remote_code=True)
+
+    # model = AutoModelForCausalLM.load_low_bit(model_path, trust_remote_code=True)
+
+    # Load the tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(model_path,
+                                              trust_remote_code=True)
+
+    # Generate predicted tokens
+    with torch.inference_mode():
+        prompt = CHATGLM_V2_PROMPT_FORMAT.format(prompt=args.prompt)
+        input_ids = tokenizer.encode(prompt, return_tensors="pt")
+        st = time.time()
+        # If your selected model is capable of utilizing previous key/value attentions
+        # to enhance decoding speed, but has `"use_cache": false` in its model config,
+        # it is important to set `use_cache=True` explicitly in the `generate` function
+        # to obtain optimal performance with BigDL-LLM INT4 optimizations
+        output = model.generate(input_ids,
+                                max_new_tokens=args.n_predict)
+        end = time.time()
+        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
+        print(f'Inference time: {end-st} s')
+        print('-'*20, 'Prompt', '-'*20)
+        print(prompt)
+        print('-'*20, 'Output', '-'*20)
+        print(output_str)
diff --git a/demos/bigdl-llm/chatglm2/streamchat.py b/demos/bigdl-llm/chatglm2/streamchat.py
new file mode 100644
index 00000000..3bbf5333
--- /dev/null
+++ b/demos/bigdl-llm/chatglm2/streamchat.py
@@ -0,0 +1,62 @@
+#
+# Copyright 2016 The BigDL Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import torch
+import time
+import argparse
+import numpy as np
+
+from bigdl.llm.transformers import AutoModel
+from transformers import AutoTokenizer
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Stream Chat for ChatGLM2 model')
+    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/chatglm2-6b",
+                        help='The huggingface repo id for the ChatGLM2 model to be downloaded'
+                             ', or the path to the huggingface checkpoint folder')
+    parser.add_argument('--question', type=str, default="晚上睡不着应该怎么办",
+                        help='Question you want to ask')
+    parser.add_argument('--disable-stream', action="store_true",
+                        help='Disable stream chat')
+
+    args = parser.parse_args()
+    model_path = args.repo_id_or_model_path
+    disable_stream = args.disable_stream
+
+    # Load the model in 4 bit,
+    # which converts the relevant layers in the model into INT4 format
+    model = AutoModel.from_pretrained(model_path,
+                                      load_in_4bit=True,
+                                      trust_remote_code=True)
+
+    # Load the tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(model_path,
+                                              trust_remote_code=True)
+
+    with torch.inference_mode():
+        if disable_stream:
+            # Chat
+            response, history = model.chat(tokenizer, args.question, history=[])
+            print('-'*20, 'Chat Output', '-'*20)
+            print(response)
+        else:
+            # Stream chat: print only the newly generated part of each partial response
+            response_ = ""
+            print('-'*20, 'Stream Chat Output', '-'*20)
+            for response, history in model.stream_chat(tokenizer, args.question, history=[]):
+                print(response.replace(response_, ""), end="")
+                response_ = response
diff --git a/demos/bigdl-llm/fastchat-webui.png b/demos/bigdl-llm/fastchat-webui.png
new file mode 100644
index 00000000..ff93808e
Binary files /dev/null and b/demos/bigdl-llm/fastchat-webui.png differ
diff --git a/demos/bigdl-llm/install_python_with_conda.sh b/demos/bigdl-llm/install_python_with_conda.sh
new file mode 100755
index 00000000..feb87070
--- /dev/null
+++ b/demos/bigdl-llm/install_python_with_conda.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+set -e
+script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
+
+# Install Python and its dependencies to the specified location
+[ -f Miniconda3-latest-Linux-x86_64.sh ] || wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
+[ -d miniconda ] || bash ./Miniconda3-latest-Linux-x86_64.sh -b -p $script_dir/miniconda
+$script_dir/miniconda/bin/conda create \
+    --prefix $script_dir/python-occlum -y \
+    python=3.9.11
+
+# Install BigDL-LLM
+$script_dir/python-occlum/bin/pip install --pre --upgrade "bigdl-llm[all]" "bigdl-llm[serving]"
diff --git a/demos/bigdl-llm/llm.yaml b/demos/bigdl-llm/llm.yaml
new file mode 100644
index 00000000..dd0424f4
--- /dev/null
+++ b/demos/bigdl-llm/llm.yaml
@@ -0,0 +1,26 @@
+includes:
+  - base.yaml
+targets:
+  - target: /bin
+    createlinks:
+      - src: /opt/python-occlum/bin/python3
+        linkname: python3
+    # copy:
+    #   - files:
+    #     - /opt/occlum/toolchains/busybox/glibc/busybox
+  # python packages
+  - target: /opt
+    copy:
+      - dirs:
+        - ../python-occlum
+  # python code
+  - target: /
+    copy:
+      - dirs:
+        - ../chatglm2
+  - target: /opt/occlum/glibc/lib
+    copy:
+      - files:
+        - /opt/occlum/glibc/lib/libnss_files.so.2
+        - /opt/occlum/glibc/lib/libnss_dns.so.2
+        - /opt/occlum/glibc/lib/libresolv.so.2
diff --git a/demos/bigdl-llm/model_convert.py b/demos/bigdl-llm/model_convert.py
new file mode 100644
index 00000000..1ef2c580
--- /dev/null
+++ b/demos/bigdl-llm/model_convert.py
@@ -0,0 +1,44 @@
+#
+# Copyright 2016 The BigDL Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import torch
+import time
+import argparse
+
+
+# Load the Hugging Face Transformers model with INT4 optimizations
+from bigdl.llm.transformers import AutoModelForCausalLM
+from transformers import AutoTokenizer
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Convert a ChatGLM2 model and save it with BigDL-LLM INT4 optimizations')
+    parser.add_argument('--model-path', type=str, default="THUDM/chatglm2-6b",
+                        help='The original model path')
+    parser.add_argument('--save-path', type=str, default="./",
+                        help='The converted model save path')
+
+    args = parser.parse_args()
+    model_path = args.model_path
+    save_path = args.save_path
+
+    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True)
+    # Load the tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(model_path,
+                                              trust_remote_code=True)
+    # Save the model with INT4 optimizations
+    model.save_low_bit(save_path)
+    tokenizer.save_pretrained(save_path)
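+
+# Example usage (the paths below are illustrative, not part of the demo scripts):
+#   python model_convert.py --model-path /work/models/chatglm2-6b \
+#                           --save-path /work/models/chatglm2-6b-int4
+# The saved INT4 checkpoint can later be loaded without re-conversion, e.g. via
+# AutoModelForCausalLM.load_low_bit(save_path, trust_remote_code=True)
+# (see the commented-out line in chatglm2/generate.py).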