[demos] Add llm inference demo

Zheng, Qi 2023-10-07 11:27:43 +08:00 committed by volcano
parent 7e0633116c
commit 1e472a67ed
8 changed files with 358 additions and 0 deletions

demos/bigdl-llm/README.md (new file, 104 lines)

@@ -0,0 +1,104 @@
# LLM inference in TEE
LLM (Large Language Model) inference in a TEE can protect the model, the input prompt, and the output. The key challenges are:
1. the performance of LLM inference in a TEE (CPU only)
2. whether LLM inference can run in a TEE at all
With the significant LLM inference speed-up brought by [BigDL-LLM](https://github.com/intel-analytics/BigDL/tree/main/python/llm) and the Occlum LibOS, high-performance and efficient LLM inference in a TEE can now be realized.
A ChatGLM2-6B model inference demo running in a TEE with Occlum and BigDL-LLM is introduced below.
## Start the Occlum development container
```bash
docker run --rm -it --network host \
--device /dev/sgx_enclave --device /dev/sgx_provision \
occlum/occlum:latest-ubuntu20.04 bash
```
## Download the model
First, download the [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) model and put it in a directory such as **/work/models**.
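If the model is not available locally yet, one way to fetch it (a minimal sketch, assuming `git-lfs` is installed and **/work/models** exists) is to clone the Hugging Face repository:
```bash
# Assumption: git-lfs is installed and /work/models exists
cd /work/models
git lfs install
git clone https://huggingface.co/THUDM/chatglm2-6b
```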
## Install required python packages
Just run the script [install_python_with_conda.sh](./install_python_with_conda.sh).
It creates a conda Python 3.9 environment and installs **bigdl-llm[all]** and **bigdl-llm[serving]**.
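After it finishes, you can optionally sanity-check the new environment (assuming it was created at `./python-occlum`, as the script does):
```bash
# Verify that bigdl-llm is importable in the freshly created environment
./python-occlum/bin/python -c "from bigdl.llm.transformers import AutoModel; print('bigdl-llm OK')"
```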
## Build Occlum instance
Besides the Python packages, the demo also requires the demo code and the model to be present in the Occlum instance.
The demo code below is copied from the BigDL-LLM [chatglm2](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) example.
```
./chatglm2/generate.py
./chatglm2/streamchat.py
```
To simplify the Occlum instance creation, a hostfs mount is used to mount the previously downloaded ChatGLM2-6B model into Occlum. Just run the script [build_occlum_instance.sh](./build_occlum_instance.sh) to create the Occlum instance.
* Note: this demo protects the model in use (TEE memory) but not in storage. To also protect the model in storage, additional operations such as model encryption and decryption may need to be introduced, as sketched below.
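For illustration only, one possible at-rest protection scheme is to encrypt the weight files on a trusted host and decrypt them again in a trusted environment before inference; the tool, key file, and paths below are assumptions and not part of this demo:
```bash
# Hypothetical at-rest protection sketch (not part of this demo);
# the key file and file names are placeholders.
WEIGHT=pytorch_model.bin   # repeat for every weight shard

# Encrypt a weight file on a trusted host before shipping it
openssl enc -aes-256-cbc -pbkdf2 \
    -in  /work/models/chatglm2-6b/$WEIGHT \
    -out /work/encrypted/$WEIGHT.enc \
    -pass file:/work/keys/model.key

# Decrypt it in a trusted environment before running inference
openssl enc -d -aes-256-cbc -pbkdf2 \
    -in  /work/encrypted/$WEIGHT.enc \
    -out /work/models/chatglm2-6b/$WEIGHT \
    -pass file:/work/keys/model.key
```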
## Run the demo
### Example 1, Predict Tokens using `generate()` API
```bash
cd occlum_instance
HF_DATASETS_CACHE=/root/cache \
occlum run /bin/python3 /chatglm2/generate.py \
--repo-id-or-model-path /models/chatglm2-6b
```
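Because the build script whitelists `OMP_NUM_THREADS` as an untrusted environment variable, you can also pass it through to tune CPU threading; the value below is only an example and should be adapted to your machine:
```bash
cd occlum_instance
OMP_NUM_THREADS=16 HF_DATASETS_CACHE=/root/cache \
occlum run /bin/python3 /chatglm2/generate.py \
       --repo-id-or-model-path /models/chatglm2-6b
```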
### Example 2, Stream Chat using `stream_chat()` API
```bash
cd occlum_instance
HF_DATASETS_CACHE=/root/cache \
occlum run /bin/python3 /chatglm2/streamchat.py \
--repo-id-or-model-path /models/chatglm2-6b
```
For both examples, more information about the available arguments can be found in the BigDL-LLM [chatglm2](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) example.
## Do inference with a web UI
[FastChat](https://github.com/lm-sys/FastChat#serving-with-web-gui) is an open platform for training, serving, and evaluating large language model based chatbots.
BigDL-LLM also supports FastChat, with BigDL-LLM acting as a serving backend in the deployment. For details, please refer to [BigDL-LLM serving](https://github.com/intel-analytics/BigDL/tree/main/python/llm/src/bigdl/llm/serving).
For this demo, the commands below show how to run an inference service in Occlum with a web UI.
In order to load models using BigDL-LLM, the model name should include "bigdl". In our case, first create a soft link **chatglm2-6b-bigdl** pointing to **chatglm2-6b**, as shown below.
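A minimal way to do this, assuming the model was downloaded to **/work/models** as described above:
```bash
# Create a "bigdl"-suffixed alias so the BigDL-LLM serving backend is selected
cd /work/models
ln -s chatglm2-6b chatglm2-6b-bigdl
```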
### Serving with WebGUI
To serve using the Web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller to coordinate the web server and model workers.
#### Launch the Controller
```bash
./python-occlum/bin/python -m fastchat.serve.controller
```
This controller manages the distributed workers.
#### Launch the model worker(s) in Occlum
```bash
cd occlum_instance
occlum start
HF_DATASETS_CACHE=/root/cache occlum exec /bin/python3 -m bigdl.llm.serving.model_worker --model-path /models/chatglm2-6b-bigdl --device cpu --host 0.0.0.0
```
Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will then register itself with the controller.
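To confirm the registration, you can optionally send a test prompt through the controller from the demo directory on the host; the model name below assumes the **chatglm2-6b-bigdl** soft link created earlier:
```bash
# Route a test message to the registered model worker via the controller
./python-occlum/bin/python -m fastchat.serve.test_message --model-name chatglm2-6b-bigdl
```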
#### Launch the Gradio web server in Occlum
```bash
occlum exec /bin/python3 -m fastchat.serve.gradio_web_server --host 0.0.0.0
```
This is the user interface that users will interact with.
By following these steps, you will be able to serve your models using the web UI with `BigDL-LLM` as the backend. You can open your browser and chat with a model now.
<img src="./fastchat-webui.png" width="70%">
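Optionally, and beyond the steps of this demo, FastChat can also expose an OpenAI-compatible REST API on top of the same controller and model worker; a sketch, assuming the controller above is running on this host and port 8000 is free:
```bash
# Serve an OpenAI-compatible API backed by the registered model worker
./python-occlum/bin/python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
```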

demos/bigdl-llm/build_occlum_instance.sh (new file, 38 lines)

@@ -0,0 +1,38 @@
#!/bin/bash
set -e
BLUE='\033[1;34m'
NC='\033[0m'
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
python_dir="$script_dir/occlum_instance/image/opt/python-occlum"
function build_instance()
{
    rm -rf occlum_instance && occlum new occlum_instance
    pushd occlum_instance
    rm -rf image
    copy_bom -f ../llm.yaml --root image --include-dir /opt/occlum/etc/template

    new_json="$(jq '.resource_limits.user_space_size = "60GB" |
                .resource_limits.kernel_space_heap_size = "1GB" |
                .resource_limits.max_num_of_threads = 500 |
                .env.default += ["PYTHONHOME=/opt/python-occlum"] |
                .env.default += ["PATH=/bin"] |
                .env.default += ["HOME=/root"] |
                .env.untrusted += ["HF_DATASETS_CACHE", "OMP_NUM_THREADS"]' Occlum.json)" && \
    echo "${new_json}" > Occlum.json

    # Mount the model as hostfs for test purposes.
    # In production, the model should be protected, e.g. by encryption.
    mkdir -p image/models
    new_json="$(cat Occlum.json | jq '.mount+=[{"target": "/models", "type": "hostfs","source": "/work/models"}]')" && \
    echo "${new_json}" > Occlum.json

    occlum build
    popd
}

build_instance

demos/bigdl-llm/chatglm2/generate.py (new file, 71 lines)

@@ -0,0 +1,71 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import torch
import time
import argparse
import numpy as np
from bigdl.llm.transformers import AutoModel, AutoModelForCausalLM
from transformers import AutoTokenizer
# you could tune the prompt based on your own model,
# here the prompt tuning refers to https://huggingface.co/THUDM/chatglm2-6b/blob/main/modeling_chatglm.py#L1007
CHATGLM_V2_PROMPT_FORMAT = "问:{prompt}\n\n答:"
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM2 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/chatglm2-6b",
                        help='The huggingface repo id for the ChatGLM2 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="AI是什么",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load the model in 4-bit mode,
    # which converts the relevant layers in the model into INT4 format
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      trust_remote_code=True)
    # model = AutoModelForCausalLM.load_low_bit(model_path, trust_remote_code=True)

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    # Generate predicted tokens
    with torch.inference_mode():
        prompt = CHATGLM_V2_PROMPT_FORMAT.format(prompt=args.prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt")
        st = time.time()
        # if your selected model is capable of utilizing previous key/value attentions
        # to enhance decoding speed, but has `"use_cache": false` in its model config,
        # it is important to set `use_cache=True` explicitly in the `generate` function
        # to obtain optimal performance with BigDL-LLM INT4 optimizations
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        end = time.time()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(output_str)

demos/bigdl-llm/chatglm2/streamchat.py (new file, 62 lines)

@@ -0,0 +1,62 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import torch
import time
import argparse
import numpy as np
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Stream Chat for ChatGLM2 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/chatglm2-6b",
                        help='The huggingface repo id for the ChatGLM2 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--question', type=str, default="晚上睡不着应该怎么办",
                        help='Question you want to ask')
    parser.add_argument('--disable-stream', action="store_true",
                        help='Disable stream chat')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    disable_stream = args.disable_stream

    # Load the model in 4-bit mode,
    # which converts the relevant layers in the model into INT4 format
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      trust_remote_code=True)

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    with torch.inference_mode():
        if disable_stream:
            # Chat
            response, history = model.chat(tokenizer, args.question, history=[])
            print('-'*20, 'Chat Output', '-'*20)
            print(response)
        else:
            # Stream chat: print only the newly generated suffix of each partial response
            response_ = ""
            print('-'*20, 'Stream Chat Output', '-'*20)
            for response, history in model.stream_chat(tokenizer, args.question, history=[]):
                print(response.replace(response_, ""), end="")
                response_ = response

demos/bigdl-llm/fastchat-webui.png (new binary file, 193 KiB, not shown)

demos/bigdl-llm/install_python_with_conda.sh (new file, 13 lines)

@@ -0,0 +1,13 @@
#!/bin/bash
set -e
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
# Install Python and dependencies to the specified location
[ -f Miniconda3-latest-Linux-x86_64.sh ] || wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
[ -d miniconda ] || bash ./Miniconda3-latest-Linux-x86_64.sh -b -p $script_dir/miniconda
$script_dir/miniconda/bin/conda create \
--prefix $script_dir/python-occlum -y \
python=3.9.11
# Install BigDL-LLM
$script_dir/python-occlum/bin/pip install --pre --upgrade "bigdl-llm[all]" "bigdl-llm[serving]"

demos/bigdl-llm/llm.yaml (new file, 26 lines)

@@ -0,0 +1,26 @@
includes:
  - base.yaml
targets:
  - target: /bin
    createlinks:
      - src: /opt/python-occlum/bin/python3
        linkname: python3
    # copy:
    #   - files:
    #     - /opt/occlum/toolchains/busybox/glibc/busybox
  # python packages
  - target: /opt
    copy:
      - dirs:
        - ../python-occlum
  # python code
  - target: /
    copy:
      - dirs:
        - ../chatglm2
  - target: /opt/occlum/glibc/lib
    copy:
      - files:
        - /opt/occlum/glibc/lib/libnss_files.so.2
        - /opt/occlum/glibc/lib/libnss_dns.so.2
        - /opt/occlum/glibc/lib/libresolv.so.2

@@ -0,0 +1,44 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import torch
import time
import argparse
# load Hugging Face Transformers model with INT4 optimizations
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert and save the ChatGLM2 model with INT4 (low-bit) optimizations')
    parser.add_argument('--model-path', type=str, default="THUDM/chatglm2-6b",
                        help='The original model path')
    parser.add_argument('--save-path', type=str, default="./",
                        help='The converted model save path')
    args = parser.parse_args()
    model_path = args.model_path
    save_path = args.save_path

    # Load the model and convert the relevant layers into INT4 format
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True)

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    # Save the model with INT4 optimizations
    model.save_low_bit(save_path)
    tokenizer.save_pretrained(save_path)
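For reference, a possible workflow with this conversion script (its file name and the save path below are assumptions, as they are not shown in this diff): convert the model once on the host, then point the inference scripts at the converted low-bit checkpoint and load it with `load_low_bit()`, as hinted by the commented-out line in `generate.py`.
```bash
# Hypothetical script name and paths -- adjust to your layout.
# 1) Convert and save the INT4 (low-bit) model once, using the env created earlier:
./python-occlum/bin/python save_low_bit_model.py \
    --model-path /work/models/chatglm2-6b \
    --save-path  /work/models/chatglm2-6b-int4

# 2) For later runs, pass the converted checkpoint to generate.py and switch to
#    AutoModelForCausalLM.load_low_bit(...) (see the commented line in generate.py)
#    to skip the on-the-fly conversion.
```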