[demos] Provide optional OMP_NUM_THREADS setting to distributed pytorch

Zheng, Qi 2022-12-16 16:37:27 +08:00 committed by volcano
parent 588b458268
commit ab14a3e479
2 changed files with 6 additions and 3 deletions

@@ -9,6 +9,7 @@ There are a few environment variables that are related to distributed PyTorch training
2. MASTER_PORT
3. WORLD_SIZE
4. RANK
+5. OMP_NUM_THREADS
`MASTER_ADDR` and `MASTER_PORT` specify the rendezvous point that all the training processes connect to.
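As background for the list above: when the training script builds its process group with `init_method="env://"` (a common pattern for scripts like `mnist.py`, though whether it does exactly this is an assumption here), PyTorch reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK` directly from the environment. A minimal single-process sketch, runnable on any host with PyTorch installed and independent of Occlum:

```bash
# Minimal sketch (not part of the demo): torch.distributed with init_method="env://"
# picks up MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK from the environment.
MASTER_ADDR=127.0.0.1 MASTER_PORT=29500 WORLD_SIZE=1 RANK=0 \
python3 -c "import torch.distributed as dist; dist.init_process_group(backend='gloo', init_method='env://'); print('rank', dist.get_rank(), 'of', dist.get_world_size())"
```

With `WORLD_SIZE=1` the single process rendezvouses with itself and prints `rank 0 of 1`.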
@@ -18,6 +19,8 @@ There are a few environment variables that are related to distributed PyTorch training
The `MASTER_ADDR`, `MASTER_PORT` and `WORLD_SIZE` should be identical for all the participants, while the `RANK` should be unique.
+`OMP_NUM_THREADS` can generally be set to the number of physical CPU cores. In Occlum, however, the larger `OMP_NUM_THREADS` is, the more TCS (SGX Thread Control Structures) and memory are required.
**Note that in most cases PyTorch only uses multiple threads. If you find extra processes being forked, please set `num_workers=1`.**
### TLS related environment variables
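One practical consequence of the `OMP_NUM_THREADS` line added above: the variable only reaches the enclave because the build script below whitelists it in `env.untrusted`, so it has to be supplied on the `occlum run` command line. A hedged way to confirm it takes effect, assuming PyTorch sizes its intra-op thread pool from `OMP_NUM_THREADS` (its usual OpenMP behaviour):

```bash
# Sketch only: check that the value is forwarded into the enclave and picked up.
cd /root/demos/pytorch/distributed/occlum_instance
OMP_NUM_THREADS=4 occlum run /bin/python3 -c "import torch; print(torch.get_num_threads())"
# Should print 4; a larger value costs more TCS and memory inside Occlum.
```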
@@ -75,7 +78,7 @@ bash ./build_pytorch_occlum_instance.sh
Step 4 (in the Occlum container): Run node one PyTorch instance
```bash
cd /root/demos/pytorch/distributed/occlum_instance
-WORLD_SIZE=2 RANK=0 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
+WORLD_SIZE=2 RANK=0 OMP_NUM_THREADS=16 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```
If successful, it will wait for node two to join.
@@ -86,7 +89,7 @@ Using distributed PyTorch with gloo backend
Step 5 (in the Occlum container): Run node two PyTorch instance
```bash
cd /root/demos/pytorch/distributed/occlum_instance
-WORLD_SIZE=2 RANK=1 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
+WORLD_SIZE=2 RANK=1 OMP_NUM_THREADS=16 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```
If everything goes well, nodes one and two produce logs similar to the ones below.
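The run commands above pass `OMP_NUM_THREADS=16`, while the build script change below keeps the instance at `max_num_of_threads = 64`. If a machine with many more cores warrants a higher `OMP_NUM_THREADS`, the thread budget may need to grow with it, in line with the TCS note earlier. A hypothetical adjustment (not part of this commit), reusing the script's own jq-then-rebuild pattern:

```bash
# Hypothetical tweak: enlarge the thread (TCS) budget before raising OMP_NUM_THREADS.
cd /root/demos/pytorch/distributed/occlum_instance
new_json="$(jq '.resource_limits.max_num_of_threads = 128' Occlum.json)" && \
    echo "${new_json}" > Occlum.json
occlum build   # rebuild so the new limit is applied to the enclave configuration
```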

@@ -36,7 +36,7 @@ function build_instance()
new_json="$(jq '.resource_limits.user_space_size = "4000MB" |
.resource_limits.kernel_space_heap_size = "256MB" |
.resource_limits.max_num_of_threads = 64 |
-.env.untrusted += [ "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "TORCH_CPP_LOG_LEVEL" ] |
+.env.untrusted += [ "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "OMP_NUM_THREADS", "HOME" ] |
.env.default += ["GLOO_DEVICE_TRANSPORT=TCP_TLS"] |
.env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key"] |
.env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt"] |
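To close the loop on the script change above, an illustrative check (not part of the commit) that the generated instance really whitelists the new variable; the `Occlum.json` location at the top of the instance directory is the standard one created by `occlum init`:

```bash
# Sketch: inspect the untrusted env whitelist of the built instance.
cd /root/demos/pytorch/distributed/occlum_instance
jq '.env.untrusted' Occlum.json
# Expected to list "OMP_NUM_THREADS" alongside MASTER_ADDR, MASTER_PORT, WORLD_SIZE,
# RANK and HOME.
```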