[demos] Add distributed pytorch demo
parent a5cdcc8045
commit 47bd1fd7af
27
.github/workflows/demo_test.yml
vendored
@@ -276,11 +276,34 @@ jobs:
         build-envs: 'OCCLUM_RELEASE_BUILD=1'
 
     - name: Build python and pytorch
-      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch; ./install_python_with_conda.sh"
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/standalone; ./install_python_with_conda.sh"
 
     - name: Run pytorch test
-      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch; SGX_MODE=SIM ./run_pytorch_on_occlum.sh"
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/standalone; SGX_MODE=SIM ./run_pytorch_on_occlum.sh"
 
+  Distributed_Pytorch_test:
+    runs-on: ubuntu-20.04
+    steps:
+    - uses: actions/checkout@v1
+      with:
+        submodules: true
+
+    - uses: ./.github/workflows/composite_action/sim
+      with:
+        container-name: ${{ github.job }}
+        build-envs: 'OCCLUM_RELEASE_BUILD=1'
+
+    - name: Build python and pytorch
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed; ./install_python_with_conda.sh"
+
+    - name: Build pytorch Occlum instance
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed; SGX_MODE=SIM ./build_pytorch_occlum_instance.sh"
+
+    - name: Start pytorch Occlum instance node one
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed/occlum_instance; WORLD_SIZE=2 RANK=0 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model &"
+
+    - name: Start pytorch Occlum instance node two
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed/occlum_instance_2; WORLD_SIZE=2 RANK=1 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model"
+
   Tensorflow_test:
     runs-on: ubuntu-20.04
@@ -22,7 +22,7 @@ This set of demos shows how real-world apps can be easily run inside SGX enclave
 * [grpc](grpc/): A client and server communicating through [gRPC](https://grpc.io), containing [glibc-supported demo](grpc/grpc_glibc) and [musl-supported demo](grpc/grpc_musl).
 * [https_server](https_server/): A HTTPS file server based on [Mongoose Embedded Web Server Library](https://github.com/cesanta/mongoose).
 * [openvino](openvino/) A benchmark of [OpenVINO Inference Engine](https://docs.openvinotoolkit.org/2019_R3/_docs_IE_DG_inference_engine_intro.html).
-* [pytorch](pytorch/): A demo of [PyTorch](https://pytorch.org/).
+* [pytorch](pytorch/): Demos of standalone and distributed [PyTorch](https://pytorch.org/).
 * [redis](redis/): A demo of [Redis](https://redis.io).
 * [sofaboot](sofaboot/): A demo of [SOFABoot](https://github.com/sofastack/sofa-boot), an open source Java development framework based on Spring Boot.
 * [sqlite](sqlite/) A demo of [SQLite](https://www.sqlite.org) SQL database engine.
107
demos/pytorch/distributed/README.md
Normal file
@@ -0,0 +1,107 @@
# Distributed PyTorch Demo

This project demonstrates how Occlum enables _unmodified_ distributed [PyTorch](https://pytorch.org/) programs to run in SGX enclaves, on the basis of _unmodified_ [Python](https://www.python.org).

## Environment variables for Distributed PyTorch model
There are a few environment variables related to distributed PyTorch training:

1. MASTER_ADDR
2. MASTER_PORT
3. WORLD_SIZE
4. RANK

`MASTER_ADDR` and `MASTER_PORT` specify a rendezvous point that all the training processes connect to.

`WORLD_SIZE` specifies how many training processes participate in the training.

`RANK` is the unique identifier of each training process.

`MASTER_ADDR`, `MASTER_PORT` and `WORLD_SIZE` should be identical for all participants, while `RANK` should be unique.
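
The rule above can be sketched in a few lines of Python. This is only an illustration: the helper name `make_rank_envs` is hypothetical and not part of the demo, and the sample address and port are the defaults that `build_pytorch_occlum_instance.sh` writes into `Occlum.json`.

```python
from typing import Dict, List


def make_rank_envs(master_addr: str, master_port: int, world_size: int) -> List[Dict[str, str]]:
    """Build one environment mapping per training process (hypothetical helper)."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    # Rendezvous settings shared by every participant.
    shared = {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world_size),
    }
    # Each process gets the shared settings plus its own unique RANK.
    return [dict(shared, RANK=str(rank)) for rank in range(world_size)]


envs = make_rank_envs("127.0.0.1", 29500, 2)
for env in envs:
    print(env)
```

Each mapping could be passed as the environment of one `occlum run` invocation; only the `RANK` entry differs between them.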

**Note that in most cases PyTorch only uses multiple threads. If you find that a process is forked, please set the `DataLoader` argument `num_workers=1`.**

### TLS related environment variables
There is an environment variable called `GLOO_DEVICE_TRANSPORT` that can be used to specify the transport.

The default value is TCP. If TLS is required to satisfy your security requirements, please also set the following environment variables:

1. GLOO_DEVICE_TRANSPORT=TCP_TLS
2. GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY
3. GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT
4. GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE

These environment variables are set as below in our demo.
```json
    "env": {
        "default": [
            "GLOO_DEVICE_TRANSPORT=TCP_TLS",
            "GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key",
            "GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt",
            "GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=/ppml/certs/myCA.pem",
```

The CA files above are generated by openssl. For details, refer to the `generate_ca_files` function in the script [`build_pytorch_occlum_instance.sh`](./build_pytorch_occlum_instance.sh).
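
A TLS configuration like this can be sanity-checked before building the instance. The snippet below is a minimal sketch, not part of the demo: `parse_env_list` and `missing_tls_vars` are hypothetical helpers, and the check only assumes what this README states (the transport defaults to TCP, and `TCP_TLS` needs the three extra variables).

```python
from typing import Dict, List

# Variables this README lists as required when the transport is TCP_TLS.
TLS_REQUIRED = (
    "GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY",
    "GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT",
    "GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE",
)


def parse_env_list(entries: List[str]) -> Dict[str, str]:
    """Turn ["KEY=value", ...] (the env.default layout) into a dict."""
    return dict(entry.split("=", 1) for entry in entries)


def missing_tls_vars(env: Dict[str, str]) -> List[str]:
    """Return the TLS variables that are still unset (hypothetical check)."""
    if env.get("GLOO_DEVICE_TRANSPORT", "TCP") != "TCP_TLS":
        return []  # plain TCP needs no key or certificate
    return [name for name in TLS_REQUIRED if not env.get(name)]


demo_env = parse_env_list([
    "GLOO_DEVICE_TRANSPORT=TCP_TLS",
    "GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key",
    "GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt",
    "GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=/ppml/certs/myCA.pem",
])
print(missing_tls_vars(demo_env))
```

An empty list means every variable needed for the selected transport is present. The sketch does not verify that the referenced files exist inside the Occlum image.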

## How to Run

This tutorial is written under the assumption that you have Docker installed and use Occlum in a Docker container.

Occlum is compatible with glibc-supported Python, so we use miniconda as the Python installation tool. You can import PyTorch packages using conda. Here, miniconda is automatically installed by the `install_python_with_conda.sh` script, which also installs the Python and PyTorch packages required by this project. We take `occlum/occlum:0.29.3-ubuntu20.04` as an example.

In the following example, we run distributed PyTorch training on the `Fashion-MNIST` dataset with 2 processes (Occlum instances).

Thus, we set `WORLD_SIZE` to 2.

Generally, `MASTER_ADDR` can be set to the IP address of the process with `RANK` 0. In our case, the two processes run in the same container, so `MASTER_ADDR` can simply be set to `localhost`.

Step 1 (on the host): Start an Occlum container
```bash
docker pull occlum/occlum:0.29.3-ubuntu20.04
docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.29.3-ubuntu20.04 bash
```

Step 2 (in the Occlum container): Download miniconda and install Python to the prefix location.
```bash
cd /root/demos/pytorch/distributed
bash ./install_python_with_conda.sh
```

Step 3 (in the Occlum container): Build the Distributed PyTorch Occlum instances
```bash
cd /root/demos/pytorch/distributed
bash ./build_pytorch_occlum_instance.sh
```

Step 4 (in the Occlum container): Run node one PyTorch instance
```bash
cd /root/demos/pytorch/distributed/occlum_instance
WORLD_SIZE=2 RANK=0 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```

If successful, it will wait for node two to join.
```log
Using distributed PyTorch with gloo backend
```

Step 5 (in the Occlum container): Run node two PyTorch instance
```bash
cd /root/demos/pytorch/distributed/occlum_instance_2
WORLD_SIZE=2 RANK=1 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```

If everything goes well, nodes one and two print logs similar to the following.
```log
After downloading data
2022-12-05T09:40:05Z INFO Train Epoch: 1 [0/469 (0%)] loss=2.3037
2022-12-05T09:40:05Z INFO Reducer buckets have been rebuilt in this iteration.
2022-12-05T09:40:06Z INFO Train Epoch: 1 [10/469 (2%)] loss=2.3117
2022-12-05T09:40:06Z INFO Train Epoch: 1 [20/469 (4%)] loss=2.2826
2022-12-05T09:40:06Z INFO Train Epoch: 1 [30/469 (6%)] loss=2.2904
2022-12-05T09:40:07Z INFO Train Epoch: 1 [40/469 (9%)] loss=2.2860
2022-12-05T09:40:07Z INFO Train Epoch: 1 [50/469 (11%)] loss=2.2784
2022-12-05T09:40:08Z INFO Train Epoch: 1 [60/469 (13%)] loss=2.2779
2022-12-05T09:40:08Z INFO Train Epoch: 1 [70/469 (15%)] loss=2.2689
2022-12-05T09:40:08Z INFO Train Epoch: 1 [80/469 (17%)] loss=2.2513
2022-12-05T09:40:09Z INFO Train Epoch: 1 [90/469 (19%)] loss=2.2536
...
```
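
The `469` in these log lines is the number of batches each node processes per epoch, which is a quick way to confirm that the data really is split across the two nodes. A rough sketch of the arithmetic follows, assuming the 60,000-image Fashion-MNIST training split, the default `--batch-size` of 64 in `mnist.py`, and even sharding with no samples dropped (matching `drop_last=False` on the `DistributedSampler` there). The helper name `batches_per_rank` is hypothetical.

```python
import math


def batches_per_rank(dataset_size: int, world_size: int, batch_size: int) -> int:
    """Batches each rank iterates per epoch when the dataset is sharded evenly."""
    # Each rank receives ceil(dataset_size / world_size) samples (nothing dropped).
    samples_per_rank = math.ceil(dataset_size / world_size)
    # The last, possibly smaller, batch is kept.
    return math.ceil(samples_per_rank / batch_size)


print(batches_per_rank(60000, 2, 64))  # 469, as in the log above
print(batches_per_rank(60000, 1, 64))  # 938, what a single process would show
```

If the log showed 938 batches per epoch instead, the process would be training on the whole dataset, which suggests that `WORLD_SIZE` was not picked up.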
|
55
demos/pytorch/distributed/build_pytorch_occlum_instance.sh
Executable file
@@ -0,0 +1,55 @@
#!/bin/bash
set -e

BLUE='\033[1;34m'
NC='\033[0m'

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
python_dir="$script_dir/occlum_instance/image/opt/python-occlum"


function generate_ca_files()
{
    cn_name=${1:-"localhost"}
    # Generate CA files
    openssl req -x509 -nodes -days 1825 -newkey rsa:2048 -keyout myCA.key -out myCA.pem -subj "/CN=${cn_name}"
    # Prepare test private key
    openssl genrsa -out test.key 2048
    # Use private key to generate a Certificate Sign Request
    openssl req -new -key test.key -out test.csr -subj "/C=CN/ST=Shanghai/L=Shanghai/O=Ant/CN=${cn_name}"
    # Use CA private key and CA file to sign test CSR
    openssl x509 -req -in test.csr -CA myCA.pem -CAkey myCA.key -CAcreateserial -out test.crt -days 825 -sha256
}

function build_instance()
{
    rm -rf occlum_instance* && occlum new occlum_instance
    pushd occlum_instance
    rm -rf image
    copy_bom -f ../pytorch.yaml --root image --include-dir /opt/occlum/etc/template

    if [ ! -d "$python_dir" ]; then
        echo "Error: cannot stat '$python_dir' directory"
        exit 1
    fi

    new_json="$(jq '.resource_limits.user_space_size = "4000MB" |
                    .resource_limits.kernel_space_heap_size = "256MB" |
                    .resource_limits.max_num_of_threads = 64 |
                    .env.untrusted += [ "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "TORCH_CPP_LOG_LEVEL" ] |
                    .env.default += ["GLOO_DEVICE_TRANSPORT=TCP_TLS"] |
                    .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key"] |
                    .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt"] |
                    .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=/ppml/certs/myCA.pem"] |
                    .env.default += ["PYTHONHOME=/opt/python-occlum"] |
                    .env.default += [ "MASTER_ADDR=127.0.0.1", "MASTER_PORT=29500" ] ' Occlum.json)" && \
    echo "${new_json}" > Occlum.json
    occlum build
    popd
}

generate_ca_files
build_instance

# Test instance for 2 nodes distributed pytorch training
cp -r occlum_instance occlum_instance_2
|
10
demos/pytorch/distributed/install_python_with_conda.sh
Executable file
@@ -0,0 +1,10 @@
#!/bin/bash
set -e
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

# Install python and dependencies to specified position
[ -f Miniconda3-latest-Linux-x86_64.sh ] || wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
[ -d miniconda ] || bash ./Miniconda3-latest-Linux-x86_64.sh -b -p $script_dir/miniconda
$script_dir/miniconda/bin/conda create --prefix $script_dir/python-occlum -y \
    python=3.8.10 numpy=1.21.5 scipy=1.7.3 scikit-learn=1.0 pandas=1.3 \
    Cython pytorch torchvision -c pytorch
|
210
demos/pytorch/distributed/mnist.py
Normal file
@@ -0,0 +1,210 @@
from __future__ import print_function

import argparse
import logging
import os
import time

from torchvision import datasets, transforms
from torch.utils.data.distributed import DistributedSampler
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))

RANK = int(os.environ.get("RANK", 0))

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            msg = "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                epoch, batch_idx, len(train_loader),
                100. * batch_idx / len(train_loader), loss.item())
            logging.info(msg)
            niter = epoch * len(train_loader) + batch_idx


def test(args, model, device, test_loader, epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # sum up batch loss
            test_loss += F.nll_loss(output, target, reduction="sum").item()
            # get the index of the max log-probability
            pred = output.max(1, keepdim=True)[1]
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    logging.info("{{metricName: accuracy, metricValue: {:.4f}}};{{metricName: loss, metricValue: {:.4f}}}\n".format(
        float(correct) / (len(test_loader.dataset) / WORLD_SIZE), test_loss))


def should_distribute():
    return dist.is_available() and WORLD_SIZE > 1


def is_distributed():
    return dist.is_available() and dist.is_initialized()


def main():
    # Training settings
    parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
    parser.add_argument("--batch-size", type=int, default=64, metavar="N",
                        help="input batch size for training (default: 64)")
    parser.add_argument("--test-batch-size", type=int, default=1000, metavar="N",
                        help="input batch size for testing (default: 1000)")
    parser.add_argument("--epochs", type=int, default=10, metavar="N",
                        help="number of epochs to train (default: 10)")
    parser.add_argument("--lr", type=float, default=0.01, metavar="LR",
                        help="learning rate (default: 0.01)")
    parser.add_argument("--momentum", type=float, default=0.5, metavar="M",
                        help="SGD momentum (default: 0.5)")
    parser.add_argument("--no-cuda", action="store_true", default=False,
                        help="disables CUDA training")
    parser.add_argument("--seed", type=int, default=1, metavar="S",
                        help="random seed (default: 1)")
    parser.add_argument("--log-interval", type=int, default=10, metavar="N",
                        help="how many batches to wait before logging training status")
    parser.add_argument("--log-path", type=str, default="",
                        help="Path to save logs. Print to StdOut if log-path is not set")
    parser.add_argument("--save-model", action="store_true", default=False,
                        help="For Saving the current Model")

    if dist.is_available():
        parser.add_argument("--backend", type=str, help="Distributed backend",
                            choices=[dist.Backend.GLOO,
                                     dist.Backend.NCCL, dist.Backend.MPI],
                            default=dist.Backend.GLOO)
    args = parser.parse_args()

    # Use this format (%Y-%m-%dT%H:%M:%SZ) to record timestamp of the metrics.
    # If log_path is empty print log to StdOut, otherwise print log to the file.
    if args.log_path == "":
        logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG)
    else:
        logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG,
            filename=args.log_path)

    use_cuda = not args.no_cuda and torch.cuda.is_available()
    if use_cuda:
        print("Using CUDA")

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    if should_distribute():
        print("Using distributed PyTorch with {} backend".format(
            args.backend), flush=True)
        dist.init_process_group(backend=args.backend)

    kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}

    print("Before downloading data", flush=True)
    train_data = datasets.FashionMNIST("./data",
                                       train=True,
                                       download=True,
                                       transform=transforms.Compose([
                                           transforms.ToTensor()
                                       ]))


    test_data = datasets.FashionMNIST("./data",
                                      train=True,
                                      download=True,
                                      transform=transforms.Compose([
                                          transforms.ToTensor()
                                      ]))
    if is_distributed():
        train_sampler = DistributedSampler(train_data, num_replicas=WORLD_SIZE, rank=RANK, shuffle=True, drop_last=False, seed=args.seed)
        test_sampler = DistributedSampler(test_data, num_replicas=WORLD_SIZE, rank=RANK, shuffle=True, drop_last=False, seed=args.seed)
        train_loader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size,sampler=train_sampler, **kwargs)
        test_loader = torch.utils.data.DataLoader(test_data, batch_size=args.test_batch_size, shuffle=False, **kwargs)
    else:
        train_loader = torch.utils.data.DataLoader(
            train_data,
            batch_size=args.batch_size, shuffle=True, **kwargs)
        test_loader = torch.utils.data.DataLoader(test_data,
            batch_size=args.test_batch_size, shuffle=False, **kwargs)

    print("After downloading data", flush=True)

    test_loader = torch.utils.data.DataLoader(
        datasets.FashionMNIST("./data",
                              train=False,
                              transform=transforms.Compose([
                                  transforms.ToTensor()
                              ])),
        batch_size=args.test_batch_size, shuffle=False, **kwargs)

    model = Net().to(device)

    if is_distributed():
        Distributor = nn.parallel.DistributedDataParallel
        model = Distributor(model)

    optimizer = optim.SGD(model.parameters(), lr=args.lr,
                          momentum=args.momentum)


    start = time.perf_counter()
    cpu_start = time.process_time()

    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(args, model, device, test_loader, epoch)

    cpu_end = time.process_time()
    end = time.perf_counter()
    print("CPU Elapsed time:", cpu_end - cpu_start)
    print("Elapsed time:", end - start)

    if (args.save_model):
        torch.save(model.state_dict(), "mnist_cnn.pt")

    if is_distributed():
        dist.destroy_process_group()


if __name__ == "__main__":
    main()
|
39
demos/pytorch/distributed/pytorch.yaml
Normal file
@@ -0,0 +1,39 @@
includes:
  - base.yaml
targets:
  - target: /bin
    createlinks:
      - src: /opt/python-occlum/bin/python3
        linkname: python3
    copy:
      - files:
          - /opt/occlum/toolchains/busybox/glibc/busybox
  # python packages
  - target: /opt
    copy:
      - dirs:
          - ../python-occlum
  # python code
  - target: /
    copy:
      - files:
          - ../mnist.py
  - target: /opt/occlum/glibc/lib
    copy:
      - files:
          - /lib/x86_64-linux-gnu/libnss_dns.so.2
          - /lib/x86_64-linux-gnu/libnss_files.so.2
  # etc files
  - target: /etc
    copy:
      - dirs:
          - /etc/ssl
      - files:
          - /etc/nsswitch.conf
  # CA files
  - target: /ppml/certs/
    copy:
      - files:
          - ../myCA.pem
          - ../test.key
          - ../test.crt
|
3
demos/pytorch/standalone/.gitignore
vendored
Normal file
@@ -0,0 +1,3 @@
occlum_instance/
miniconda/
Miniconda3*
@@ -10,22 +10,22 @@ Use the nn package to define our model as a sequence of layers. nn.Sequential is
 
 This tutorial is written under the assumption that you have Docker installed and use Occlum in a Docker container.
 
-Occlum is compatible with glibc-supported Python, we employ miniconda as python installation tool. You can import PyTorch packages using conda. Here, miniconda is automatically installed by install_python_with_conda.sh script, the required python and PyTorch packages for this project are also loaded by this script. Here, we take occlum/occlum:0.23.0-ubuntu18.04 as example.
+Occlum is compatible with glibc-supported Python, we employ miniconda as python installation tool. You can import PyTorch packages using conda. Here, miniconda is automatically installed by install_python_with_conda.sh script, the required python and PyTorch packages for this project are also loaded by this script. Here, we take occlum/occlum:0.29.3-ubuntu20.04 as example.
 
 Step 1 (on the host): Start an Occlum container
 ```
-docker pull occlum/occlum:0.23.0-ubuntu18.04
-docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.23.0-ubuntu18.04 bash
+docker pull occlum/occlum:0.29.3-ubuntu20.04
+docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.29.3-ubuntu20.04 bash
 ```
 
 Step 2 (in the Occlum container): Download miniconda and install python to prefix position.
 ```
-cd /root/demos/pytorch
+cd /root/demos/pytorch/standalone
 bash ./install_python_with_conda.sh
 ```
 
 Step 3 (in the Occlum container): Run the sample code on Occlum
 ```
-cd /root/demos/pytorch
+cd /root/demos/pytorch/standalone
 bash ./run_pytorch_on_occlum.sh
 ```