diff --git a/demos/pytorch/distributed/README.md b/demos/pytorch/distributed/README.md
index 2ca35909..7d4abb96 100644
--- a/demos/pytorch/distributed/README.md
+++ b/demos/pytorch/distributed/README.md
@@ -9,6 +9,7 @@ There are a few environment variables that are related to distributed PyTorch tr
 2. MASTER_PORT
 3. WORLD_SIZE
 4. RANK
+5. OMP_NUM_THREADS
 
 `MASTER_ADDR` and `MASTER_PORT` specifies a rendezvous point where all the training processes will connect to.
 
@@ -18,6 +19,8 @@ There are a few environment variables that are related to distributed PyTorch tr
 
 The `MASTER_ADDR`, `MASTER_PORT` and `WORLD_SIZE` should be identical for all the participants while the `RANK` should be unique.
 
+`OMP_NUM_THREADS` generally can be set to the number of physical CPU core numbers. But in Occlum, the more `OMP_NUM_THREADS` is, the more TCS and memory are required.
+
 **Note that in most cases PyTorch only use multi-threads. If you find a process fork, please set `num_workers=1` env.**
 
 ### TLS related environment variables
@@ -75,7 +78,7 @@ bash ./build_pytorch_occlum_instance.sh
 Step 4 (in the Occlum container): Run node one PyTorch instance
 ```bash
 cd /root/demos/pytorch/distributed/occlum_instance
-WORLD_SIZE=2 RANK=0 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
+WORLD_SIZE=2 RANK=0 OMP_NUM_THREADS=16 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
 ```
 
 If successful, it will wait for the node two to join.
@@ -86,7 +89,7 @@ Using distributed PyTorch with gloo backend
 Step 5 (in the Occlum container): Run node two PyTorch instance
 ```bash
 cd /root/demos/pytorch/distributed/occlum_instance
-WORLD_SIZE=2 RANK=1 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
+WORLD_SIZE=2 RANK=1 OMP_NUM_THREADS=16 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
 ```
 
 If everything goes well, node one and two has similar logs as below.
diff --git a/demos/pytorch/distributed/build_pytorch_occlum_instance.sh b/demos/pytorch/distributed/build_pytorch_occlum_instance.sh
index caed2684..b78427b2 100755
--- a/demos/pytorch/distributed/build_pytorch_occlum_instance.sh
+++ b/demos/pytorch/distributed/build_pytorch_occlum_instance.sh
@@ -36,7 +36,7 @@ function build_instance()
     new_json="$(jq '.resource_limits.user_space_size = "4000MB" |
                     .resource_limits.kernel_space_heap_size = "256MB" |
                     .resource_limits.max_num_of_threads = 64 |
-                    .env.untrusted += [ "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "TORCH_CPP_LOG_LEVEL" ] |
+                    .env.untrusted += [ "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "OMP_NUM_THREADS", "HOME" ] |
                     .env.default += ["GLOO_DEVICE_TRANSPORT=TCP_TLS"] |
                     .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key"] |
                     .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt"] |
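Reviewer note: the README change says a larger `OMP_NUM_THREADS` costs more TCS and memory in Occlum, and the build script sets `max_num_of_threads = 64`. A minimal sketch of how a launcher might pick a value that stays under that limit is shown below. The helper names `pick_omp_threads` and `launch_env`, and the `reserved` headroom of 8 threads, are hypothetical and not part of this patch. The idea that OpenMP workers should leave headroom under `max_num_of_threads` is an assumption, not something the patch states.

```python
import os


def pick_omp_threads(physical_cores, max_enclave_threads=64, reserved=8):
    """Return an OMP_NUM_THREADS value capped by an enclave thread budget.

    max_enclave_threads mirrors resource_limits.max_num_of_threads from the
    build script. reserved is an assumed headroom for non-OpenMP threads
    (hypothetical value, tune for the real workload).
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    budget = max(1, max_enclave_threads - reserved)
    return min(physical_cores, budget)


def launch_env(rank, world_size, omp_threads, base_env=None):
    """Build the environment for one participant.

    WORLD_SIZE is shared by all participants, RANK is unique per node.
    Values are strings because environment variables are always strings.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "WORLD_SIZE": str(world_size),
            "RANK": str(rank),
            "OMP_NUM_THREADS": str(omp_threads),
        }
    )
    return env


if __name__ == "__main__":
    threads = pick_omp_threads(physical_cores=16)
    for rank in range(2):
        env = launch_env(rank, world_size=2, omp_threads=threads, base_env={})
        print(env)
```

The resulting dictionary could be passed as `env=` to `subprocess.run` around the `occlum run` command from Steps 4 and 5. The variables only reach the enclave because the patch lists them under `.env.untrusted`.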