
# LLM inference in TEE

LLM (Large Language Model) inference in a TEE can protect the model, the input prompt, and the output. The key challenges are:

1. Is the performance of LLM inference in a TEE (CPU) acceptable?
2. Can LLM inference run in a TEE at all?

With the significant LLM inference speed-up brought by BigDL-LLM and the Occlum LibOS, high-performance and efficient LLM inference in a TEE can now be realized.

A ChatGLM2-6B model inference demo running in a TEE with Occlum and BigDL-LLM is introduced below.

## Start the Occlum development container

```bash
docker run --rm -it --network host \
    --device /dev/sgx_enclave --device /dev/sgx_provision \
    occlum/occlum:latest-ubuntu20.04 bash
```

## Download the model

First, download the THUDM/chatglm2-6b model and put it in a directory such as `/work/models`.
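
For example, one way to fetch the model is with git (assuming `git-lfs` is installed; the target directory below is just an illustration):

```bash
# Download chatglm2-6b from Hugging Face into /work/models (illustrative path)
git lfs install
git clone https://huggingface.co/THUDM/chatglm2-6b /work/models/chatglm2-6b
```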

## Install required python packages

Just run the script `install_python_with_conda.sh`. It creates a conda Python 3.9 environment and installs `bigdl-llm[all]`.
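
Roughly, the script does something like the following (a sketch only; the actual env prefix and pip options may differ, so treat `install_python_with_conda.sh` as authoritative):

```bash
# Sketch only -- see install_python_with_conda.sh for the real steps.
# Create a Python 3.9 env under ./python-occlum (the path used later in this demo)
conda create --prefix ./python-occlum -y python=3.9
# Install BigDL-LLM with all optional dependencies into that env
./python-occlum/bin/pip install "bigdl-llm[all]"
```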

## Build Occlum instance

Besides the Python packages, the demo also requires the demo code and the model to be available inside the Occlum instance.

The demo code below is copied from BigDL-LLM chatglm2:

* `./chatglm2/generate.py`
* `./chatglm2/streamchat.py`

To simplify the Occlum instance creation, a hostfs mount is used to mount the previously downloaded chatglm2-6b model into Occlum. Just run the script `build_occlum_instance.sh` to create the Occlum instance; a rough sketch of its main steps follows the note below.

* Note: this demo protects the model in use (TEE memory) but does not protect the model in storage. To protect the model in storage, additional steps such as model encryption and decryption may need to be introduced.
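
For orientation, the build script's main steps look roughly like this (a sketch under assumptions: the bom file name and the exact `Occlum.json` tweaks are illustrative; consult `build_occlum_instance.sh` for the real values):

```bash
# Sketch of build_occlum_instance.sh -- file names and sizes are illustrative
occlum new occlum_instance && cd occlum_instance

# Copy the conda env and the demo scripts into the Occlum image
# (the bom file name below is hypothetical)
copy_bom -f ../bigdl-llm.yaml --root image --include-dir /opt/occlum/etc/template

# In Occlum.json: add a hostfs mount so the host model dir appears as /models,
# e.g. {"target": "/models", "type": "hostfs", "source": "/work/models"},
# and enlarge the memory limits (e.g. user_space_size) to fit the 6B model.

occlum build
```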

## Run the demo

### Example 1: Predict tokens using the `generate()` API

```bash
cd occlum_instance
HF_DATASETS_CACHE=/root/cache \
    occlum run /bin/python3 /chatglm2/generate.py \
    --repo-id-or-model-path /models/chatglm2-6b
```

### Example 2: Stream chat using the `stream_chat()` API

```bash
cd occlum_instance
HF_DATASETS_CACHE=/root/cache \
    occlum run /bin/python3 /chatglm2/streamchat.py \
    --repo-id-or-model-path /models/chatglm2-6b
```

For both examples, more information about the available arguments can be found in BigDL-LLM chatglm2.
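
For instance, a custom prompt can be passed to `generate.py`. The flag names below (`--prompt`, `--n-predict`) are assumed from the upstream BigDL-LLM chatglm2 example and may differ; check the script's argument parser if they are rejected:

```bash
# Illustrative only -- flag names assumed from the upstream BigDL-LLM example
cd occlum_instance
HF_DATASETS_CACHE=/root/cache \
    occlum run /bin/python3 /chatglm2/generate.py \
    --repo-id-or-model-path /models/chatglm2-6b \
    --prompt "What is AI?" --n-predict 64
```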

## Do inference with web UI

FastChat is an open platform for training, serving, and evaluating large language model based chatbots.

BigDL-LLM also supports FastChat, with BigDL-LLM used as the serving backend in the deployment. For details, please refer to BigDL-LLM serving.

For this demo, the commands below show how to run an inference service in Occlum with a web UI.

In order to load models using BigDL-LLM, the model name should include "bigdl". For example, the model vicuna-7b should be renamed to bigdl-7b. ChatGLM models are a special case: for these models, no change is needed after downloading, and the BigDL-LLM backend will be used automatically. For details, please refer to Models.
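
As an illustration (the paths below are hypothetical), renaming simply means renaming the model directory so its name includes "bigdl":

```bash
# Rename a non-ChatGLM model directory so BigDL-LLM is used as the backend
# (paths are illustrative)
mv /work/models/vicuna-7b /work/models/bigdl-7b
```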

### Serving with WebGUI

To serve using the Web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller to coordinate the web server and model workers.

#### Launch the Controller in a non-TEE env

```bash
./python-occlum/bin/python -m fastchat.serve.controller --host 0.0.0.0
```

This controller manages the distributed workers.

#### Launch the model worker(s) in Occlum

```bash
cd occlum_instance
occlum start
HF_DATASETS_CACHE=/root/cache occlum exec /bin/python3 \
    -m bigdl.llm.serving.model_worker \
    --model-path /models/chatglm2-6b --device cpu --host 0.0.0.0
```

Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will then register itself with the controller.
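
Optionally, you can verify from the non-TEE side that the worker has registered. The check below assumes FastChat's test client module is available and that the worker registered under the name `chatglm2-6b` (derived from the model path); adjust the name if your setup differs:

```bash
# Send a single test message through the controller to the registered worker
./python-occlum/bin/python -m fastchat.serve.test_message --model-name chatglm2-6b
```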

#### Launch the Gradio web server in Occlum

```bash
occlum exec /bin/python3 -m fastchat.serve.gradio_web_server --host 0.0.0.0
```

This is the user interface that users will interact with.

By following these steps, you will be able to serve your models using the web UI with BigDL-LLM as the backend. You can open your browser and chat with a model now.