[demos] Add benchmark example for llm demo
parent 06fd64e74c · commit 48b9c077ed
@@ -61,6 +61,33 @@ HF_DATASETS_CACHE=/root/cache \
For both examples, more details about the arguments can be found in the BigDL-LLM [chatglm2](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) example.
## LLM Inference Benchmark
Based on the [benchmark](https://github.com/intel-analytics/BigDL/tree/main/python/llm/dev/benchmark) demo from BigDL, a simple [benchmark](./benchmarks/) is provided to measure LLM inference performance both on the host and inside a TEE.
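
The wrapper reports two numbers: the latency of the first generated token (which includes the prompt prefill) and the average latency of every following token. The snippet below is only a rough, self-contained sketch of that idea for intuition, not the shipped `benchmark_util.py`; it assumes a `model` and `tokenizer` already loaded as in `bench.py` further down.

```python
import time

import torch


def rough_latency(model, tokenizer, prompt, max_new_tokens=32):
    """Rough first-token / rest-token timing (illustrative only)."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.inference_mode():
        # First token: prompt prefill plus a single decoding step.
        start = time.perf_counter()
        model.generate(input_ids, do_sample=False, max_new_tokens=1)
        first = time.perf_counter() - start

        # Full generation; the extra time is attributed to the later tokens.
        start = time.perf_counter()
        output = model.generate(input_ids, do_sample=False, max_new_tokens=max_new_tokens)
        total = time.perf_counter() - start

    n_new = output.shape[1] - input_ids.shape[1]
    rest_avg = (total - first) / max(n_new - 1, 1)
    print(f"first token: {first:.4f}s, rest average: {rest_avg:.4f}s ({n_new} tokens)")
```
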
The output will look like:

```
=========First token cost xx.xxxxs=========
=========Last token cost average xx.xxxxs (xx tokens in all)=========
```

In the commands below, **model_path** can be the path of a chatglm2-6b or Qwen-7B-Chat checkpoint.
**OMP_NUM_THREADS** is used to set the number of threads for OpenMP.
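
To confirm that the setting is actually picked up by the Python process (it has to be exported before the process starts), a quick check such as the following can be used; this is just a convenience snippet, not part of the demo itself.

```python
import os

import torch

# Both values should reflect the OMP_NUM_THREADS you exported.
print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS"))
print("torch intra-op threads =", torch.get_num_threads())
```
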
### Benchmark in Host
```bash
OMP_NUM_THREADS=16 ./python-occlum/bin/python \
    ./benchmarks/bench.py --repo-id-or-model-path <model_path>
```

### Benchmark in TEE
```bash
cd occlum_instance
OMP_NUM_THREADS=16 occlum run /bin/python3 \
    /benchmarks/bench.py --repo-id-or-model-path <model_path>
```

According to our benchmark results on an Intel Ice Lake server, LLM inference performance in a TEE is approximately 30% lower than on the host.
## Do inference with webui
demos/bigdl-llm/benchmarks/bench.py (new file, 22 lines)
@@ -0,0 +1,22 @@
import argparse
import torch
from bigdl.llm.transformers import AutoModel, AutoModelForCausalLM
from transformers import AutoTokenizer
from benchmark_util import BenchmarkWrapper

parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM2 model')
parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/chatglm2-6b",
                    help='The huggingface repo id for the ChatGLM2 model to be downloaded'
                         ', or the path to the huggingface checkpoint folder')

args = parser.parse_args()
model_path = args.repo_id_or_model_path

# Load the model with BigDL-LLM 4-bit quantization, then wrap it so that
# first-token and rest-token latencies are printed during generation.
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True)
model = BenchmarkWrapper(model, do_print=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
prompt = "今天睡不着怎么办"  # "What should I do if I can't sleep?"

with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, do_sample=False, max_new_tokens=512)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
demos/bigdl-llm/benchmarks/benchmark_util.py (new file, 4741 lines)
File diff suppressed because it is too large
@@ -12,3 +12,4 @@ $script_dir/miniconda/bin/conda create \
# Install BigDL LLM
$script_dir/python-occlum/bin/pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cpu
$script_dir/python-occlum/bin/pip install --pre --upgrade bigdl-llm[all] bigdl-llm[serving]
$script_dir/python-occlum/bin/pip install transformers_stream_generator einops
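
After the packages above are installed, a quick import check with the freshly installed interpreter (a hypothetical sanity check, not part of the install script) can catch a broken environment before running the demo:

```python
# Run with ./python-occlum/bin/python; every import should succeed.
import torch
import bigdl.llm
import transformers_stream_generator
import einops

print("torch", torch.__version__)
```
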
@@ -18,6 +18,7 @@ targets:
    copy:
      - dirs:
        - ../chatglm2
        - ../benchmarks
  - target: /opt/occlum/glibc/lib
    copy:
      - files: