[demos] Add llm inference demo

Zheng, Qi 2023-10-07 11:27:43 +08:00 committed by volcano
parent 7e0633116c
commit 1e472a67ed
8 changed files with 358 additions and 0 deletions

demos/bigdl-llm/README.md (new file, 104 lines)

@@ -0,0 +1,104 @@
# LLM inference in TEE
LLM (Large Language Model) inference in a TEE can protect the model, the input prompt, and the output. The key challenges are:
1. the performance of LLM inference in a TEE (CPU only)
2. whether LLM inference can run in a TEE at all
With the significant LLM inference speed-up brought by [BigDL-LLM](https://github.com/intel-analytics/BigDL/tree/main/python/llm) and the Occlum LibOS, high-performance and efficient LLM inference in a TEE can now be realized.
A ChatGLM2-6B model inference demo running in a TEE with Occlum and BigDL-LLM is introduced below.
## Start the Occlum development container
```bash
docker run --rm -it --network host \
--device /dev/sgx_enclave --device /dev/sgx_provision \
occlum/occlum:latest-ubuntu20.04 bash
```
## Download the model
First, download the [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) model and put it in a directory such as **/work/models**.
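If the model is not available locally yet, one way to fetch it (a minimal sketch, assuming `git-lfs` is installed and **/work/models** exists) is to clone the Hugging Face repository:
```bash
# Assumption: git-lfs is installed and /work/models exists
cd /work/models
git lfs install
git clone https://huggingface.co/THUDM/chatglm2-6b
```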
## Install required python packages
Just run the script [install_python_with_conda.sh](./install_python_with_conda.sh).
It creates a conda Python 3.9 environment and installs **bigdl-llm[all]** and **bigdl-llm[serving]**.
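After it finishes, you can optionally sanity-check the new environment (assuming it was created at `./python-occlum`, as the script does):
```bash
# Verify that bigdl-llm is importable in the freshly created environment
./python-occlum/bin/python -c "from bigdl.llm.transformers import AutoModel; print('bigdl-llm OK')"
```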
## Build Occlum instance
Besides the Python packages, the demo also requires the demo code and the model to be present in the Occlum instance.
The demo code below is copied from the BigDL-LLM [chatglm2](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) example.
```
./chatglm2/generate.py
./chatglm2/streamchat.py
```
To simplify the Occlum instance creation, a hostfs mount is used to mount the previously downloaded ChatGLM2-6B model into Occlum. Just run the script [build_occlum_instance.sh](./build_occlum_instance.sh) to create the Occlum instance.
* Note: this demo protects the model in use (TEE memory) but not in storage. To also protect the model in storage, additional operations such as model encryption and decryption may need to be introduced, as sketched below.
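For illustration only, one possible at-rest protection scheme is to encrypt the weight files on a trusted host and decrypt them again in a trusted environment before inference; the tool, key file, and paths below are assumptions and not part of this demo:
```bash
# Hypothetical at-rest protection sketch (not part of this demo);
# the key file and file names are placeholders.
WEIGHT=pytorch_model.bin   # repeat for every weight shard

# Encrypt a weight file on a trusted host before shipping it
openssl enc -aes-256-cbc -pbkdf2 \
    -in  /work/models/chatglm2-6b/$WEIGHT \
    -out /work/encrypted/$WEIGHT.enc \
    -pass file:/work/keys/model.key

# Decrypt it in a trusted environment before running inference
openssl enc -d -aes-256-cbc -pbkdf2 \
    -in  /work/encrypted/$WEIGHT.enc \
    -out /work/models/chatglm2-6b/$WEIGHT \
    -pass file:/work/keys/model.key
```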
## Run the demo
### Example 1, Predict Tokens using `generate()` API
```bash
cd occlum_instance
HF_DATASETS_CACHE=/root/cache \
occlum run /bin/python3 /chatglm2/generate.py \
--repo-id-or-model-path /models/chatglm2-6b
```
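Because the build script whitelists `OMP_NUM_THREADS` as an untrusted environment variable, you can also pass it through to tune CPU threading; the value below is only an example and should be adapted to your machine:
```bash
cd occlum_instance
OMP_NUM_THREADS=16 HF_DATASETS_CACHE=/root/cache \
occlum run /bin/python3 /chatglm2/generate.py \
       --repo-id-or-model-path /models/chatglm2-6b
```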
### Example 2, Stream Chat using `stream_chat()` API
```bash
cd occlum_instance
HF_DATASETS_CACHE=/root/cache \
occlum run /bin/python3 /chatglm2/streamchat.py \
--repo-id-or-model-path /models/chatglm2-6b
```
For both examples, more information about the available arguments can be found in the BigDL-LLM [chatglm2](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) example.
## Do inference with a web UI
[FastChat](https://github.com/lm-sys/FastChat#serving-with-web-gui) is an open platform for training, serving, and evaluating large language model based chatbots.
BigDL-LLM also supports FastChat, with BigDL-LLM acting as a serving backend in the deployment. For details, please refer to [BigDL-LLM serving](https://github.com/intel-analytics/BigDL/tree/main/python/llm/src/bigdl/llm/serving).
For this demo, the commands below show how to run an inference service in Occlum with a web UI.
In order to load models using BigDL-LLM, the model name should include "bigdl". In our case, first create a soft link **chatglm2-6b-bigdl** pointing to **chatglm2-6b**, as shown below.
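A minimal way to do this, assuming the model was downloaded to **/work/models** as described above:
```bash
# Create a "bigdl"-suffixed alias so the BigDL-LLM serving backend is selected
cd /work/models
ln -s chatglm2-6b chatglm2-6b-bigdl
```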
### Serving with WebGUI
To serve using the Web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller to coordinate the web server and model workers.
#### Launch the Controller
```bash
./python-occlum/bin/python -m fastchat.serve.controller
```
This controller manages the distributed workers.
#### Launch the model worker(s) in Occlum
```bash
cd occlum_instance
occlum start
HF_DATASETS_CACHE=/root/cache occlum exec /bin/python3 -m bigdl.llm.serving.model_worker --model-path /models/chatglm2-6b-bigdl --device cpu --host 0.0.0.0
```
Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will then register itself with the controller.
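To confirm the registration, you can optionally send a test prompt through the controller from the demo directory on the host; the model name below assumes the **chatglm2-6b-bigdl** soft link created earlier:
```bash
# Route a test message to the registered model worker via the controller
./python-occlum/bin/python -m fastchat.serve.test_message --model-name chatglm2-6b-bigdl
```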
#### Launch the Gradio web server in Occlum
```bash
occlum exec /bin/python3 -m fastchat.serve.gradio_web_server --host 0.0.0.0
```
This is the user interface that users will interact with.
By following these steps, you will be able to serve your models using the web UI with `BigDL-LLM` as the backend. You can open your browser and chat with a model now.
<img src="./fastchat-webui.png" width="70%">
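Optionally, and beyond the steps of this demo, FastChat can also expose an OpenAI-compatible REST API on top of the same controller and model worker; a sketch, assuming the controller above is running on this host and port 8000 is free:
```bash
# Serve an OpenAI-compatible API backed by the registered model worker
./python-occlum/bin/python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
```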

demos/bigdl-llm/build_occlum_instance.sh (new file, 38 lines)

@@ -0,0 +1,38 @@
#!/bin/bash
set -e
BLUE='\033[1;34m'
NC='\033[0m'
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
python_dir="$script_dir/occlum_instance/image/opt/python-occlum"
function build_instance()
{
    rm -rf occlum_instance && occlum new occlum_instance
    pushd occlum_instance
    rm -rf image
    copy_bom -f ../llm.yaml --root image --include-dir /opt/occlum/etc/template

    new_json="$(jq '.resource_limits.user_space_size = "60GB" |
                .resource_limits.kernel_space_heap_size = "1GB" |
                .resource_limits.max_num_of_threads = 500 |
                .env.default += ["PYTHONHOME=/opt/python-occlum"] |
                .env.default += ["PATH=/bin"] |
                .env.default += ["HOME=/root"] |
                .env.untrusted += ["HF_DATASETS_CACHE", "OMP_NUM_THREADS"]' Occlum.json)" && \
    echo "${new_json}" > Occlum.json

    # Mount the model as hostfs for test purposes.
    # In production, the model should be protected, e.g. by encryption.
    mkdir -p image/models
    new_json="$(cat Occlum.json | jq '.mount+=[{"target": "/models", "type": "hostfs","source": "/work/models"}]')" && \
    echo "${new_json}" > Occlum.json

    occlum build
    popd
}

build_instance

demos/bigdl-llm/chatglm2/generate.py (new file, 71 lines)

@@ -0,0 +1,71 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import torch
import time
import argparse
import numpy as np
from bigdl.llm.transformers import AutoModel, AutoModelForCausalLM
from transformers import AutoTokenizer
# you could tune the prompt based on your own model,
# here the prompt tuning refers to https://huggingface.co/THUDM/chatglm2-6b/blob/main/modeling_chatglm.py#L1007
CHATGLM_V2_PROMPT_FORMAT = "问:{prompt}\n\n答:"
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM2 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/chatglm2-6b",
                        help='The huggingface repo id for the ChatGLM2 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="AI是什么",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load the model in 4-bit mode,
    # which converts the relevant layers in the model into INT4 format
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      trust_remote_code=True)
    # model = AutoModelForCausalLM.load_low_bit(model_path, trust_remote_code=True)

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    # Generate predicted tokens
    with torch.inference_mode():
        prompt = CHATGLM_V2_PROMPT_FORMAT.format(prompt=args.prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt")
        st = time.time()
        # if your selected model is capable of utilizing previous key/value attentions
        # to enhance decoding speed, but has `"use_cache": false` in its model config,
        # it is important to set `use_cache=True` explicitly in the `generate` function
        # to obtain optimal performance with BigDL-LLM INT4 optimizations
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        end = time.time()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(output_str)

demos/bigdl-llm/chatglm2/streamchat.py (new file, 62 lines)

@@ -0,0 +1,62 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import torch
import time
import argparse
import numpy as np
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Stream Chat for ChatGLM2 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/chatglm2-6b",
                        help='The huggingface repo id for the ChatGLM2 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--question', type=str, default="晚上睡不着应该怎么办",
                        help='Question you want to ask')
    parser.add_argument('--disable-stream', action="store_true",
                        help='Disable stream chat')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    disable_stream = args.disable_stream

    # Load the model in 4-bit mode,
    # which converts the relevant layers in the model into INT4 format
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      trust_remote_code=True)

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    with torch.inference_mode():
        if disable_stream:
            # Chat
            response, history = model.chat(tokenizer, args.question, history=[])
            print('-'*20, 'Chat Output', '-'*20)
            print(response)
        else:
            # Stream chat: print only the newly generated suffix of each partial response
            response_ = ""
            print('-'*20, 'Stream Chat Output', '-'*20)
            for response, history in model.stream_chat(tokenizer, args.question, history=[]):
                print(response.replace(response_, ""), end="")
                response_ = response

demos/bigdl-llm/fastchat-webui.png (new binary file, 193 KiB, not shown)

demos/bigdl-llm/install_python_with_conda.sh (new file, 13 lines)

@@ -0,0 +1,13 @@
#!/bin/bash
set -e
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
# Install Python and dependencies to the specified location
[ -f Miniconda3-latest-Linux-x86_64.sh ] || wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
[ -d miniconda ] || bash ./Miniconda3-latest-Linux-x86_64.sh -b -p $script_dir/miniconda
$script_dir/miniconda/bin/conda create \
--prefix $script_dir/python-occlum -y \
python=3.9.11
# Install BigDL-LLM
$script_dir/python-occlum/bin/pip install --pre --upgrade "bigdl-llm[all]" "bigdl-llm[serving]"

demos/bigdl-llm/llm.yaml (new file, 26 lines)

@@ -0,0 +1,26 @@
includes:
  - base.yaml
targets:
  - target: /bin
    createlinks:
      - src: /opt/python-occlum/bin/python3
        linkname: python3
    # copy:
    #   - files:
    #     - /opt/occlum/toolchains/busybox/glibc/busybox
  # python packages
  - target: /opt
    copy:
      - dirs:
        - ../python-occlum
  # python code
  - target: /
    copy:
      - dirs:
        - ../chatglm2
  - target: /opt/occlum/glibc/lib
    copy:
      - files:
        - /opt/occlum/glibc/lib/libnss_files.so.2
        - /opt/occlum/glibc/lib/libnss_dns.so.2
        - /opt/occlum/glibc/lib/libresolv.so.2

@@ -0,0 +1,44 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import torch
import time
import argparse
# load Hugging Face Transformers model with INT4 optimizations
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert and save the ChatGLM2 model with INT4 (low-bit) optimizations')
    parser.add_argument('--model-path', type=str, default="THUDM/chatglm2-6b",
                        help='The original model path')
    parser.add_argument('--save-path', type=str, default="./",
                        help='The converted model save path')
    args = parser.parse_args()
    model_path = args.model_path
    save_path = args.save_path

    # Load the model and convert the relevant layers into INT4 format
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True)

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    # Save the model with INT4 optimizations
    model.save_low_bit(save_path)
    tokenizer.save_pretrained(save_path)
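For reference, a possible workflow with this conversion script (its file name and the save path below are assumptions, as they are not shown in this diff): convert the model once on the host, then point the inference scripts at the converted low-bit checkpoint and load it with `load_low_bit()`, as hinted by the commented-out line in `generate.py`.
```bash
# Hypothetical script name and paths -- adjust to your layout.
# 1) Convert and save the INT4 (low-bit) model once, using the env created earlier:
./python-occlum/bin/python save_low_bit_model.py \
    --model-path /work/models/chatglm2-6b \
    --save-path  /work/models/chatglm2-6b-int4

# 2) For later runs, pass the converted checkpoint to generate.py and switch to
#    AutoModelForCausalLM.load_low_bit(...) (see the commented line in generate.py)
#    to skip the on-the-fly conversion.
```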