# Distributed PyTorch Demo

This project demonstrates how Occlum enables _unmodified_ distributed [PyTorch](https://pytorch.org/) programs to run inside SGX enclaves, on top of an _unmodified_ [Python](https://www.python.org) runtime.
## Environment variables for distributed PyTorch training

There are a few environment variables related to distributed PyTorch training:

1. MASTER_ADDR
2. MASTER_PORT
3. WORLD_SIZE
4. RANK
5. OMP_NUM_THREADS

`MASTER_ADDR` and `MASTER_PORT` specify the rendezvous point to which all the training processes connect.

`WORLD_SIZE` specifies how many training processes participate in the training.

`RANK` is the unique identifier of each training process.

`MASTER_ADDR`, `MASTER_PORT` and `WORLD_SIZE` must be identical for all participants, while `RANK` must be unique to each process.

`OMP_NUM_THREADS` can generally be set to the number of physical CPU cores. Note, however, that in Occlum the larger `OMP_NUM_THREADS` is, the more TCS and memory the enclave requires.

**Note that in most cases PyTorch only uses multiple threads. If you observe a process fork, please set `num_workers=1`.**
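For reference, the sketch below shows how a training script typically consumes these variables. It is an illustrative assumption of what `mnist.py` does, not a copy of it: with the default `env://` initialization, `torch.distributed.init_process_group` reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` from the environment.

```python
import os

import torch.distributed as dist


def init_distributed():
    # With the default "env://" init method, init_process_group() picks up
    # MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK from the environment.
    dist.init_process_group(backend="gloo", init_method="env://")

    rank = dist.get_rank()               # equals the RANK env variable
    world_size = dist.get_world_size()   # equals WORLD_SIZE
    print(f"rank {rank}/{world_size} rendezvoused at "
          f"{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}")
    return rank, world_size


if __name__ == "__main__":
    init_distributed()
```

Because `init_process_group` blocks until all `WORLD_SIZE` processes have reached the rendezvous point, each process waits at this call until the others are started.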
### TLS related environment variables

There is an environment variable called `GLOO_DEVICE_TRANSPORT` that can be used to specify the transport for the Gloo backend.

Its default value is `TCP`. If TLS is required to meet your security requirements, please also set the following environment variables:

1. GLOO_DEVICE_TRANSPORT=TCP_TLS
2. GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY
3. GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT
4. GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE

These environment variables are set as below in our demo:

```json
"env": {
    "default": [
        "GLOO_DEVICE_TRANSPORT=TCP_TLS",
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key",
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt",
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=/ppml/certs/myCA.pem",
```

The certificate and key files above are generated with openssl. For details, please refer to the function **generate_ca_files** in the script [`build_pytorch_occlum_instance.sh`](./build_pytorch_occlum_instance.sh).
## How to Run

This tutorial is written under the assumption that you have Docker installed and use Occlum in a Docker container.

Occlum is compatible with glibc-based Python, so we use miniconda as the Python installation tool and install the PyTorch packages with conda. Miniconda is installed automatically by the `install_python_with_conda.sh` script, which also installs the Python and PyTorch packages required by this project. Here, we take the `occlum/occlum:0.29.3-ubuntu20.04` image as an example.

In the following example, we run a distributed PyTorch training job on the Fashion-MNIST dataset with 2 processes (Occlum instances). Thus, we set `WORLD_SIZE` to 2.

Generally, `MASTER_ADDR` can be set to the IP address of the process with `RANK` 0. In our case, the two processes run in the same container, so `MASTER_ADDR` can simply be set to `localhost`.
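The sketch below illustrates how such a two-process run typically shards the dataset and wraps the model. It is a simplified assumption of what `mnist.py` does rather than a copy of it; `FashionMNIST`, `DistributedSampler`, and `DistributedDataParallel` are standard torchvision/PyTorch APIs, and the batch size is just a placeholder.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms


def build_training_objects(model: torch.nn.Module):
    # Assumes torch.distributed.init_process_group() has already been called.
    # Each of the WORLD_SIZE processes sees a disjoint shard of Fashion-MNIST
    # via DistributedSampler; DistributedDataParallel averages gradients
    # across the processes over the gloo backend.
    dataset = datasets.FashionMNIST(
        "./data", train=True, download=True, transform=transforms.ToTensor()
    )
    sampler = DistributedSampler(dataset)   # splits the data by RANK / WORLD_SIZE
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)  # placeholder batch size

    ddp_model = DDP(model)  # CPU training with gloo, so no device_ids needed
    return loader, ddp_model
```

With `WORLD_SIZE=2`, each rank trains on its own half of the training set, and `DistributedDataParallel` keeps the two model replicas synchronized after every backward pass.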
Step 1 (on the host): Start an Occlum container
```bash
docker pull occlum/occlum:0.29.3-ubuntu20.04
docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.29.3-ubuntu20.04 bash
```

Step 2 (in the Occlum container): Download miniconda and install Python to the prefix location.
```bash
cd /root/demos/pytorch/distributed
bash ./install_python_with_conda.sh
```

Step 3 (in the Occlum container): Build the distributed PyTorch Occlum instances
```bash
cd /root/demos/pytorch/distributed
bash ./build_pytorch_occlum_instance.sh
```

Step 4 (in the Occlum container): Run the PyTorch instance on node one
```bash
cd /root/demos/pytorch/distributed/occlum_instance
WORLD_SIZE=2 RANK=0 OMP_NUM_THREADS=16 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```

If successful, node one waits for node two to join:
```log
Using distributed PyTorch with gloo backend
```
Step 5 (in the Occlum container): Run the PyTorch instance on node two
```bash
cd /root/demos/pytorch/distributed/occlum_instance
WORLD_SIZE=2 RANK=1 OMP_NUM_THREADS=16 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```

If everything goes well, nodes one and two produce logs similar to the following:
```log
After downloading data
2022-12-05T09:40:05Z INFO Train Epoch: 1 [0/469 (0%)] loss=2.3037
2022-12-05T09:40:05Z INFO Reducer buckets have been rebuilt in this iteration.
2022-12-05T09:40:06Z INFO Train Epoch: 1 [10/469 (2%)] loss=2.3117
2022-12-05T09:40:06Z INFO Train Epoch: 1 [20/469 (4%)] loss=2.2826
2022-12-05T09:40:06Z INFO Train Epoch: 1 [30/469 (6%)] loss=2.2904
2022-12-05T09:40:07Z INFO Train Epoch: 1 [40/469 (9%)] loss=2.2860
2022-12-05T09:40:07Z INFO Train Epoch: 1 [50/469 (11%)] loss=2.2784
2022-12-05T09:40:08Z INFO Train Epoch: 1 [60/469 (13%)] loss=2.2779
2022-12-05T09:40:08Z INFO Train Epoch: 1 [70/469 (15%)] loss=2.2689
2022-12-05T09:40:08Z INFO Train Epoch: 1 [80/469 (17%)] loss=2.2513
2022-12-05T09:40:09Z INFO Train Epoch: 1 [90/469 (19%)] loss=2.2536
...
```