# Distributed PyTorch Demo

This project demonstrates how Occlum enables _unmodified_ distributed [PyTorch](https://pytorch.org/) programs running in SGX enclaves, on the basis of _unmodified_ [Python](https://www.python.org).

## Environment variables for Distributed PyTorch model

There are a few environment variables related to distributed PyTorch training:

1. MASTER_ADDR
2. MASTER_PORT
3. WORLD_SIZE
4. RANK
5. OMP_NUM_THREADS

`MASTER_ADDR` and `MASTER_PORT` specify a rendezvous point that all the training processes connect to. `WORLD_SIZE` specifies how many training processes participate in the training. `RANK` is the unique identifier of each training process. `MASTER_ADDR`, `MASTER_PORT` and `WORLD_SIZE` must be identical for all participants, while `RANK` must be unique. `OMP_NUM_THREADS` can generally be set to the number of physical CPU cores, but note that in Occlum, the larger `OMP_NUM_THREADS` is, the more TCS and memory are required.

**Note that in most cases PyTorch only uses multiple threads. If you observe a process fork, please set `num_workers=1`.**

### TLS related environment variables

There is an environment variable called `GLOO_DEVICE_TRANSPORT` that can be used to specify the transport. The default value is TCP. If TLS is required to satisfy your security requirements, please also set the following environment variables:

1. GLOO_DEVICE_TRANSPORT=TCP_TLS
2. GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY
3. GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT
4. GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE

These environment variables are set as below in our demo.

```json
    "env": {
        "default": [
            "GLOO_DEVICE_TRANSPORT=TCP_TLS",
            "GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key",
            "GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt",
            "GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=/ppml/certs/myCA.pem",
```

The CA files above are generated by openssl. For details, please refer to the function **generate_ca_files** in the script [`build_pytorch_occlum_instance.sh`](./build_pytorch_occlum_instance.sh).
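For reference, a distributed PyTorch program picks up `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK` through `torch.distributed`'s standard `env://` initialization. The snippet below is a minimal sketch, not the demo's actual `mnist.py`; only the gloo backend and the environment variables above are taken from this demo.

```python
import torch.distributed as dist


def init_distributed():
    # init_method="env://" reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK
    # from the environment; gloo is the CPU-friendly backend used in this demo.
    dist.init_process_group(backend="gloo", init_method="env://")
    print(f"Using distributed PyTorch with gloo backend, "
          f"rank {dist.get_rank()} of {dist.get_world_size()}")


if __name__ == "__main__":
    # OMP_NUM_THREADS is consumed by the OpenMP runtime, not by
    # torch.distributed itself; keep it modest inside an Occlum enclave.
    init_distributed()
    dist.destroy_process_group()
```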
## How to Run

This tutorial is written under the assumption that you have Docker installed and use Occlum in a Docker container.

Occlum is compatible with glibc-supported Python, so we use miniconda to install Python and import the PyTorch packages with conda. Miniconda is installed automatically by the `install_python_with_conda.sh` script, which also installs the Python and PyTorch packages required by this project. We take `occlum/occlum:0.29.3-ubuntu20.04` as the example image.

In the following example, we run a distributed PyTorch training on the `Fashion-MNIST` dataset with 2 processes (Occlum instances), so we set `WORLD_SIZE` to 2. Generally, `MASTER_ADDR` can be set to the IP address of the process with RANK 0. In our case, the two processes run in the same container, so `MASTER_ADDR` can simply be set to `localhost`.

Step 1 (on the host): Start an Occlum container
```bash
docker pull occlum/occlum:0.29.3-ubuntu20.04
docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.29.3-ubuntu20.04 bash
```

Step 2 (in the Occlum container): Download miniconda and install Python to the prefix position
```bash
cd /root/demos/pytorch/distributed
bash ./install_python_with_conda.sh
```

Step 3 (in the Occlum container): Build the distributed PyTorch Occlum instances
```bash
cd /root/demos/pytorch/distributed
bash ./build_pytorch_occlum_instance.sh
```

Step 4 (in the Occlum container): Run the node one PyTorch instance
```bash
cd /root/demos/pytorch/distributed/occlum_instance
WORLD_SIZE=2 RANK=0 OMP_NUM_THREADS=16 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```

If successful, it waits for node two to join.
```log
Using distributed PyTorch with gloo backend
```

Step 5 (in the Occlum container): Run the node two PyTorch instance
```bash
cd /root/demos/pytorch/distributed/occlum_instance
WORLD_SIZE=2 RANK=1 OMP_NUM_THREADS=16 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```

If everything goes well, nodes one and two print similar logs as below.
```log
After downloading data
2022-12-05T09:40:05Z INFO     Train Epoch: 1 [0/469 (0%)]	loss=2.3037
2022-12-05T09:40:05Z INFO     Reducer buckets have been rebuilt in this iteration.
2022-12-05T09:40:06Z INFO     Train Epoch: 1 [10/469 (2%)]	loss=2.3117
2022-12-05T09:40:06Z INFO     Train Epoch: 1 [20/469 (4%)]	loss=2.2826
2022-12-05T09:40:06Z INFO     Train Epoch: 1 [30/469 (6%)]	loss=2.2904
2022-12-05T09:40:07Z INFO     Train Epoch: 1 [40/469 (9%)]	loss=2.2860
2022-12-05T09:40:07Z INFO     Train Epoch: 1 [50/469 (11%)]	loss=2.2784
2022-12-05T09:40:08Z INFO     Train Epoch: 1 [60/469 (13%)]	loss=2.2779
2022-12-05T09:40:08Z INFO     Train Epoch: 1 [70/469 (15%)]	loss=2.2689
2022-12-05T09:40:08Z INFO     Train Epoch: 1 [80/469 (17%)]	loss=2.2513
2022-12-05T09:40:09Z INFO     Train Epoch: 1 [90/469 (19%)]	loss=2.2536
...
```
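The `Reducer buckets have been rebuilt in this iteration.` line is printed by PyTorch's `DistributedDataParallel` (DDP) wrapper, which all-reduces gradients across the two ranks after every backward pass. The sketch below illustrates the general DDP training pattern that produces such logs; the model, dataset, and hyperparameters are placeholders and are not taken from the demo's `mnist.py`.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def train(model, dataset, epochs=3):
    dist.init_process_group(backend="gloo", init_method="env://")

    # DistributedSampler splits the dataset across ranks, so each of the
    # WORLD_SIZE processes trains on its own shard.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    # DDP synchronizes gradients across ranks after every backward pass;
    # on CPU-only enclaves no device_ids are passed.
    ddp_model = DDP(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(1, epochs + 1):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for data, target in loader:
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(data), target)
            loss.backward()  # gradients are all-reduced here by DDP
            optimizer.step()

    dist.destroy_process_group()
```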