
# Distributed PyTorch Demo

This project demonstrates how Occlum enables unmodified distributed PyTorch programs to run in SGX enclaves, on the basis of unmodified Python.

## Environment Variables for Distributed PyTorch Training

The following environment variables are related to distributed PyTorch training:

  1. `MASTER_ADDR`
  2. `MASTER_PORT`
  3. `WORLD_SIZE`
  4. `RANK`
  5. `OMP_NUM_THREADS`

`MASTER_ADDR` and `MASTER_PORT` specify a rendezvous point that all the training processes connect to.

`WORLD_SIZE` specifies how many training processes participate in the training.

`RANK` is the unique identifier of each training process.

`MASTER_ADDR`, `MASTER_PORT` and `WORLD_SIZE` should be identical for all the participants, while `RANK` should be unique to each.

`OMP_NUM_THREADS` can generally be set to the number of physical CPU cores. Note that in Occlum, the larger `OMP_NUM_THREADS` is, the more TCS (Thread Control Structures) and enclave memory are required.

Note that in most cases PyTorch uses multiple threads rather than multiple processes. If you observe a process fork (for example, from `DataLoader` workers), please set `num_workers=1`.
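
For reference, here is a minimal sketch, not taken from the demo's `mnist.py`, of how a training script consumes these variables: with `init_method="env://"`, `torch.distributed` reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK` from the environment.

```python
# Minimal sketch, assuming the environment variables above are already set.
import os
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",        # CPU-friendly backend used by this demo
    init_method="env://",  # rendezvous via MASTER_ADDR / MASTER_PORT
    world_size=int(os.environ["WORLD_SIZE"]),
    rank=int(os.environ["RANK"]),
)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is up")
```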

There is an environment variable called `GLOO_DEVICE_TRANSPORT` that can be used to specify the transport.

Its default value is `TCP`. If TLS is required to meet your security requirements, please also set the following environment variables:

  1. `GLOO_DEVICE_TRANSPORT=TCP_TLS`
  2. `GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY`
  3. `GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT`
  4. `GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE`

These environment variables are set as below in our demo:

  "env": {
    "default": [
      "GLOO_DEVICE_TRANSPORT=TCP_TLS",
      "GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key",
      "GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt",
      "GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=/ppml/certs/myCA.pem",

The certificate files above are generated by `openssl`. For details, please refer to the function `generate_ca_files` in the script `build_pytorch_occlum_instance.sh`.
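
If you want a quick sanity check of the generated files, the following hypothetical snippet (not part of the demo scripts) verifies that the key, certificate, and CA file load as a consistent TLS configuration:

```python
# Hypothetical sanity check, using the demo's example certificate paths.
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_verify_locations(cafile="/ppml/certs/myCA.pem")
ctx.load_cert_chain(certfile="/ppml/certs/test.crt",
                    keyfile="/ppml/certs/test.key")
print("TLS key/cert/CA files load cleanly")
```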

## How to Run

This tutorial is written under the assumption that you have Docker installed and use Occlum in a Docker container.

Occlum is compatible with glibc-based Python, so we use Miniconda to install Python and the PyTorch packages via conda. Miniconda is installed automatically by the `install_python_with_conda.sh` script, which also installs the Python and PyTorch packages required by this project. Here, we take the `occlum/occlum:0.29.3-ubuntu20.04` image as an example.

In the following example, we run a distributed PyTorch training job on the Fashion-MNIST dataset with 2 processes (one Occlum instance per process).

Thus, we set `WORLD_SIZE` to 2.

Generally, `MASTER_ADDR` can be set to the IP address of the process with `RANK` 0. In our case, the two processes run in the same container, so `MASTER_ADDR` can simply be set to `localhost`.
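
For orientation, here is a minimal sketch of the DistributedDataParallel (DDP) pattern that a script like the demo's `mnist.py` typically follows; the actual script differs in model and data handling.

```python
# Minimal DDP sketch with a stand-in model and batch; the demo's mnist.py
# trains a real network on Fashion-MNIST instead.
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo", init_method="env://")

model = DDP(nn.Linear(784, 10))  # gradients are all-reduced across ranks
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 784)         # stand-in batch of 64 flattened images
y = torch.randint(0, 10, (64,))  # stand-in labels
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()                  # backward pass triggers gloo all-reduce
optimizer.step()
```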

Step 1 (on the host): Start an Occlum container

```bash
docker pull occlum/occlum:0.29.3-ubuntu20.04
docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.29.3-ubuntu20.04 bash
```

Step 2 (in the Occlum container): Download Miniconda and install Python to the prefix location.

```bash
cd /root/demos/pytorch/distributed
bash ./install_python_with_conda.sh
```

Step 3 (in the Occlum container): Build the Distributed PyTorch Occlum instances

```bash
cd /root/demos/pytorch/distributed
bash ./build_pytorch_occlum_instance.sh
```

Step 4 (in the Occlum container): Run the PyTorch instance for node one

```bash
cd /root/demos/pytorch/distributed/occlum_instance
WORLD_SIZE=2 RANK=0 OMP_NUM_THREADS=16 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```

If successful, node one prints the log below and then waits for node two to join; `init_process_group` blocks until all `WORLD_SIZE` processes have joined the rendezvous.

```
Using distributed PyTorch with gloo backend
```

Step 5 (in the Occlum container): Run the PyTorch instance for node two

```bash
cd /root/demos/pytorch/distributed/occlum_instance
WORLD_SIZE=2 RANK=1 OMP_NUM_THREADS=16 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```

If everything goes well, nodes one and two print logs similar to the ones below.

After the dataset is downloaded, the training log looks like:

```
2022-12-05T09:40:05Z INFO     Train Epoch: 1 [0/469 (0%)]       loss=2.3037
2022-12-05T09:40:05Z INFO     Reducer buckets have been rebuilt in this iteration.
2022-12-05T09:40:06Z INFO     Train Epoch: 1 [10/469 (2%)]      loss=2.3117
2022-12-05T09:40:06Z INFO     Train Epoch: 1 [20/469 (4%)]      loss=2.2826
2022-12-05T09:40:06Z INFO     Train Epoch: 1 [30/469 (6%)]      loss=2.2904
2022-12-05T09:40:07Z INFO     Train Epoch: 1 [40/469 (9%)]      loss=2.2860
2022-12-05T09:40:07Z INFO     Train Epoch: 1 [50/469 (11%)]     loss=2.2784
2022-12-05T09:40:08Z INFO     Train Epoch: 1 [60/469 (13%)]     loss=2.2779
2022-12-05T09:40:08Z INFO     Train Epoch: 1 [70/469 (15%)]     loss=2.2689
2022-12-05T09:40:08Z INFO     Train Epoch: 1 [80/469 (17%)]     loss=2.2513
2022-12-05T09:40:09Z INFO     Train Epoch: 1 [90/469 (19%)]     loss=2.2536
...
```