[demos] Provide optional OMP_NUM_THREADS setting to distributed pytorch
parent 588b458268 · commit ab14a3e479
@@ -9,6 +9,7 @@ There are a few environment variables that are related to distributed PyTorch training
 2. MASTER_PORT
 3. WORLD_SIZE
 4. RANK
+5. OMP_NUM_THREADS

 `MASTER_ADDR` and `MASTER_PORT` specify a rendezvous point that all the training processes connect to.

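For context on how these variables are consumed: with the `env://` init method, PyTorch reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK` from the environment. A minimal sketch (not part of the demo; the fallback values below are assumptions for a local single-process smoke test):

```python
import os
import torch.distributed as dist

# With init_method="env://", PyTorch reads the rendezvous settings from the
# environment. The setdefault() fallbacks are assumptions for a local
# single-process smoke test only; the demo sets these at launch time.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("RANK", "0")

dist.init_process_group(backend="gloo", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} joined")
dist.destroy_process_group()
```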
@@ -18,6 +19,8 @@ There are a few environment variables that are related to distributed PyTorch training

 The `MASTER_ADDR`, `MASTER_PORT` and `WORLD_SIZE` should be identical for all the participants, while the `RANK` should be unique.

+`OMP_NUM_THREADS` can generally be set to the number of physical CPU cores. But in Occlum, the larger `OMP_NUM_THREADS` is, the more TCS and memory are required.
+
 **Note that in most cases PyTorch only uses multiple threads. If you find a process fork, please set `num_workers=1`.**

 ### TLS related environment variables
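To see the effect of `OMP_NUM_THREADS` directly: PyTorch sizes its intra-op thread pool from it, and inside Occlum each extra thread consumes a TCS plus stack memory. A minimal sketch (not part of the demo):

```python
import os
import torch

# PyTorch sizes its intra-op thread pool from OMP_NUM_THREADS; inside
# Occlum every extra thread costs a TCS and stack memory.
print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS", "<unset>"))
print("intra-op threads:", torch.get_num_threads())

# The count can also be capped at runtime (4 is an arbitrary example value);
# call this before the thread pool is first used.
torch.set_num_threads(4)
print("intra-op threads now:", torch.get_num_threads())
```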
@@ -75,7 +78,7 @@ bash ./build_pytorch_occlum_instance.sh
 Step 4 (in the Occlum container): Run node one PyTorch instance
 ```bash
 cd /root/demos/pytorch/distributed/occlum_instance
-WORLD_SIZE=2 RANK=0 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
+WORLD_SIZE=2 RANK=0 OMP_NUM_THREADS=16 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
 ```

 If successful, it will wait for node two to join.
@@ -86,7 +89,7 @@ Using distributed PyTorch with gloo backend
 Step 5 (in the Occlum container): Run node two PyTorch instance
 ```bash
 cd /root/demos/pytorch/distributed/occlum_instance
-WORLD_SIZE=2 RANK=1 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
+WORLD_SIZE=2 RANK=1 OMP_NUM_THREADS=16 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
 ```

 If everything goes well, nodes one and two will have similar logs as below.
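To verify that the two ranks actually rendezvoused, a small all-reduce smoke test can be launched the same way as `mnist.py` (a sketch, not part of the demo; it assumes the same `MASTER_ADDR`/`MASTER_PORT`/`WORLD_SIZE`/`RANK` environment as the commands above):

```python
import torch
import torch.distributed as dist

# Assumes MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are set in the
# environment, exactly as in the occlum run commands above.
dist.init_process_group(backend="gloo", init_method="env://")

# Each rank contributes rank+1; with WORLD_SIZE=2 the sum is 1 + 2 = 3.
t = torch.ones(1) * (dist.get_rank() + 1)
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print("all_reduce result:", t.item())  # expect 3.0 with two ranks

dist.destroy_process_group()
```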
@@ -36,7 +36,7 @@ function build_instance()
 new_json="$(jq '.resource_limits.user_space_size = "4000MB" |
 .resource_limits.kernel_space_heap_size = "256MB" |
 .resource_limits.max_num_of_threads = 64 |
-.env.untrusted += [ "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "TORCH_CPP_LOG_LEVEL" ] |
+.env.untrusted += [ "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "OMP_NUM_THREADS", "HOME" ] |
 .env.default += ["GLOO_DEVICE_TRANSPORT=TCP_TLS"] |
 .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key"] |
 .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt"] |
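The `.env.untrusted` whitelist is what lets variables from the `occlum run` environment reach the enclave; anything not listed there (or given a value in `.env.default`) is dropped. A quick in-enclave check (a sketch, not part of the demo):

```python
import os

# Only variables whitelisted in .env.untrusted (or set via .env.default)
# should show up here when run inside the Occlum instance.
for key in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "OMP_NUM_THREADS"):
    print(key, "=", os.environ.get(key, "<not passed through>"))
```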