[demos] Add distributed pytorch demo
This commit is contained in:
parent a5cdcc8045
commit 47bd1fd7af

27  .github/workflows/demo_test.yml (vendored)
@@ -276,11 +276,34 @@ jobs:
         build-envs: 'OCCLUM_RELEASE_BUILD=1'

     - name: Build python and pytorch
-      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch; ./install_python_with_conda.sh"
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/standalone; ./install_python_with_conda.sh"

     - name: Run pytorch test
-      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch; SGX_MODE=SIM ./run_pytorch_on_occlum.sh"
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/standalone; SGX_MODE=SIM ./run_pytorch_on_occlum.sh"

+  Distributed_Pytorch_test:
+    runs-on: ubuntu-20.04
+    steps:
+    - uses: actions/checkout@v1
+      with:
+        submodules: true
+
+    - uses: ./.github/workflows/composite_action/sim
+      with:
+        container-name: ${{ github.job }}
+        build-envs: 'OCCLUM_RELEASE_BUILD=1'
+
+    - name: Build python and pytorch
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed; ./install_python_with_conda.sh"
+
+    - name: Build pytorch Occlum instance
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed; SGX_MODE=SIM ./build_pytorch_occlum_instance.sh"
+
+    - name: Start pytorch Occlum instance node one
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed/occlum_instance; WORLD_SIZE=2 RANK=0 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model &"
+
+    - name: Start pytorch Occlum instance node two
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed/occlum_instance_2; WORLD_SIZE=2 RANK=1 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model"
+
   Tensorflow_test:
     runs-on: ubuntu-20.04

demos/README.md
@@ -22,7 +22,7 @@ This set of demos shows how real-world apps can be easily run inside SGX enclave
 * [grpc](grpc/): A client and server communicating through [gRPC](https://grpc.io), containing [glibc-supported demo](grpc/grpc_glibc) and [musl-supported demo](grpc/grpc_musl).
 * [https_server](https_server/): A HTTPS file server based on [Mongoose Embedded Web Server Library](https://github.com/cesanta/mongoose).
 * [openvino](openvino/) A benchmark of [OpenVINO Inference Engine](https://docs.openvinotoolkit.org/2019_R3/_docs_IE_DG_inference_engine_intro.html).
-* [pytorch](pytorch/): A demo of [PyTorch](https://pytorch.org/).
+* [pytorch](pytorch/): Demos of standalone and distributed [PyTorch](https://pytorch.org/).
 * [redis](redis/): A demo of [Redis](https://redis.io).
 * [sofaboot](sofaboot/): A demo of [SOFABoot](https://github.com/sofastack/sofa-boot), an open source Java development framework based on Spring Boot.
 * [sqlite](sqlite/) A demo of [SQLite](https://www.sqlite.org) SQL database engine.

107  demos/pytorch/distributed/README.md (new file)
@@ -0,0 +1,107 @@
# Distributed PyTorch Demo

This project demonstrates how Occlum enables _unmodified_ distributed [PyTorch](https://pytorch.org/) programs to run in SGX enclaves, on the basis of _unmodified_ [Python](https://www.python.org).

## Environment variables for the distributed PyTorch model

A few environment variables are relevant to distributed PyTorch training:

1. MASTER_ADDR
2. MASTER_PORT
3. WORLD_SIZE
4. RANK

`MASTER_ADDR` and `MASTER_PORT` specify the rendezvous point that all the training processes connect to.

`WORLD_SIZE` specifies how many training processes participate in the training.

`RANK` is the unique identifier for each training process.

`MASTER_ADDR`, `MASTER_PORT` and `WORLD_SIZE` must be identical for all participants, while `RANK` must be unique to each process.
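
These four variables are exactly what PyTorch's default `env://` initialization method consumes. As a minimal illustrative sketch (not part of this demo's code), a process launched with them set joins the group like this:

```python
import torch.distributed as dist

# The default env:// init method reads MASTER_ADDR, MASTER_PORT,
# WORLD_SIZE and RANK from the environment to form the process group.
dist.init_process_group(backend="gloo", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} joined")
dist.destroy_process_group()
```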

**Note that in most cases PyTorch only uses multiple threads rather than multiple processes. If you do observe a process fork, set `num_workers=1` on the `DataLoader`.**
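
For reference, the worker count is a `DataLoader` argument rather than an environment variable; a tiny self-contained example (hypothetical tensors, not from `mnist.py`):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(8, 1), torch.zeros(8))
# num_workers controls how many worker processes the loader forks;
# keep it small if extra forks are a problem inside the enclave.
loader = DataLoader(dataset, batch_size=4, num_workers=1)
```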

### TLS-related environment variables

The environment variable `GLOO_DEVICE_TRANSPORT` can be used to specify the Gloo transport.

The default value is `TCP`. If TLS is required to satisfy your security requirements, please also set the following environment variables:

1. GLOO_DEVICE_TRANSPORT=TCP_TLS
2. GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY
3. GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT
4. GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE

These environment variables are set as below in our demo.
```json
"env": {
    "default": [
        "GLOO_DEVICE_TRANSPORT=TCP_TLS",
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key",
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt",
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=/ppml/certs/myCA.pem",
        ...
    ]
}
```

The CA files above are generated by openssl. For details, please refer to the function **generate_ca_files** in the script [`build_pytorch_occlum_instance.sh`](./build_pytorch_occlum_instance.sh).
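
As a quick, illustrative sanity check (not part of the demo scripts), you can verify inside the enclave that the configured TLS material is actually present before training starts:

```python
import os

# Hypothetical pre-flight check: each Gloo TLS variable must point to a
# file that exists inside the Occlum instance (copied in via pytorch.yaml).
for var in ("GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY",
            "GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT",
            "GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE"):
    path = os.environ.get(var, "")
    assert os.path.isfile(path), f"{var} does not point to a file: {path!r}"
```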

## How to Run

This tutorial is written under the assumption that you have Docker installed and use Occlum in a Docker container.

Occlum is compatible with glibc-supported Python, so we use miniconda as the Python installation tool, and PyTorch packages can be imported via conda. Miniconda is installed automatically by the `install_python_with_conda.sh` script, which also pulls in the Python and PyTorch packages required by this project. Here we take `occlum/occlum:0.29.3-ubuntu20.04` as an example.

In the following example, we run distributed PyTorch training on the `Fashion-MNIST` dataset with 2 processes (Occlum instances).

Thus, we set `WORLD_SIZE` to 2.

Generally, `MASTER_ADDR` can be set to the IP address of the process with RANK 0. In our case, the two processes run in the same container, so `MASTER_ADDR` can simply be set to `localhost`.

Step 1 (on the host): Start an Occlum container
```bash
docker pull occlum/occlum:0.29.3-ubuntu20.04
docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.29.3-ubuntu20.04 bash
```

Step 2 (in the Occlum container): Download miniconda and install Python to the prefix position.
```bash
cd /root/demos/pytorch/distributed
bash ./install_python_with_conda.sh
```

Step 3 (in the Occlum container): Build the distributed PyTorch Occlum instances
```bash
cd /root/demos/pytorch/distributed
bash ./build_pytorch_occlum_instance.sh
```

Step 4 (in the Occlum container): Run the node one PyTorch instance
```bash
cd /root/demos/pytorch/distributed/occlum_instance
WORLD_SIZE=2 RANK=0 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```

If successful, it waits for node two to join.
```log
Using distributed PyTorch with gloo backend
```

Step 5 (in the Occlum container): Run the node two PyTorch instance
```bash
cd /root/demos/pytorch/distributed/occlum_instance_2
WORLD_SIZE=2 RANK=1 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```

If everything goes well, nodes one and two produce logs similar to the following.
```log
After downloading data
2022-12-05T09:40:05Z INFO Train Epoch: 1 [0/469 (0%)] loss=2.3037
2022-12-05T09:40:05Z INFO Reducer buckets have been rebuilt in this iteration.
2022-12-05T09:40:06Z INFO Train Epoch: 1 [10/469 (2%)] loss=2.3117
2022-12-05T09:40:06Z INFO Train Epoch: 1 [20/469 (4%)] loss=2.2826
2022-12-05T09:40:06Z INFO Train Epoch: 1 [30/469 (6%)] loss=2.2904
2022-12-05T09:40:07Z INFO Train Epoch: 1 [40/469 (9%)] loss=2.2860
2022-12-05T09:40:07Z INFO Train Epoch: 1 [50/469 (11%)] loss=2.2784
2022-12-05T09:40:08Z INFO Train Epoch: 1 [60/469 (13%)] loss=2.2779
2022-12-05T09:40:08Z INFO Train Epoch: 1 [70/469 (15%)] loss=2.2689
2022-12-05T09:40:08Z INFO Train Epoch: 1 [80/469 (17%)] loss=2.2513
2022-12-05T09:40:09Z INFO Train Epoch: 1 [90/469 (19%)] loss=2.2536
...
```

55  demos/pytorch/distributed/build_pytorch_occlum_instance.sh (new executable file)
@@ -0,0 +1,55 @@
#!/bin/bash
set -e

BLUE='\033[1;34m'
NC='\033[0m'

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
python_dir="$script_dir/occlum_instance/image/opt/python-occlum"


function generate_ca_files()
{
    cn_name=${1:-"localhost"}
    # Generate CA files
    openssl req -x509 -nodes -days 1825 -newkey rsa:2048 -keyout myCA.key -out myCA.pem -subj "/CN=${cn_name}"
    # Prepare test private key
    openssl genrsa -out test.key 2048
    # Use private key to generate a Certificate Sign Request
    openssl req -new -key test.key -out test.csr -subj "/C=CN/ST=Shanghai/L=Shanghai/O=Ant/CN=${cn_name}"
    # Use CA private key and CA file to sign test CSR
    openssl x509 -req -in test.csr -CA myCA.pem -CAkey myCA.key -CAcreateserial -out test.crt -days 825 -sha256
}

function build_instance()
{
    rm -rf occlum_instance* && occlum new occlum_instance
    pushd occlum_instance
    rm -rf image
    copy_bom -f ../pytorch.yaml --root image --include-dir /opt/occlum/etc/template

    if [ ! -d "$python_dir" ]; then
        echo "Error: cannot stat '$python_dir' directory"
        exit 1
    fi
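
    # Enlarge the resource limits and whitelist the distributed-training
    # variables (.env.untrusted) so the host can pass them in at `occlum run` time;
    # TLS and rendezvous defaults are baked into .env.default.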
    new_json="$(jq '.resource_limits.user_space_size = "4000MB" |
        .resource_limits.kernel_space_heap_size = "256MB" |
        .resource_limits.max_num_of_threads = 64 |
        .env.untrusted += [ "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "TORCH_CPP_LOG_LEVEL" ] |
        .env.default += ["GLOO_DEVICE_TRANSPORT=TCP_TLS"] |
        .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key"] |
        .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt"] |
        .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=/ppml/certs/myCA.pem"] |
        .env.default += ["PYTHONHOME=/opt/python-occlum"] |
        .env.default += [ "MASTER_ADDR=127.0.0.1", "MASTER_PORT=29500" ] ' Occlum.json)" && \
        echo "${new_json}" > Occlum.json
    occlum build
    popd
}

generate_ca_files
build_instance

# Test instance for 2-node distributed pytorch training
cp -r occlum_instance occlum_instance_2

10  demos/pytorch/distributed/install_python_with_conda.sh (new executable file)
@@ -0,0 +1,10 @@
#!/bin/bash
set -e
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

# Install python and dependencies to the specified position
[ -f Miniconda3-latest-Linux-x86_64.sh ] || wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
[ -d miniconda ] || bash ./Miniconda3-latest-Linux-x86_64.sh -b -p $script_dir/miniconda
$script_dir/miniconda/bin/conda create --prefix $script_dir/python-occlum -y \
    python=3.8.10 numpy=1.21.5 scipy=1.7.3 scikit-learn=1.0 pandas=1.3 \
    Cython pytorch torchvision -c pytorch

210  demos/pytorch/distributed/mnist.py (new file)
@@ -0,0 +1,210 @@
from __future__ import print_function

import argparse
import logging
import os
import time

from torchvision import datasets, transforms
from torch.utils.data.distributed import DistributedSampler
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))

RANK = int(os.environ.get("RANK", 0))
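
# WORLD_SIZE and RANK are supplied by the launcher (for this demo, `occlum run`
# forwards them from the host via .env.untrusted in Occlum.json); without them
# the script falls back to a single-process run.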


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            msg = "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                epoch, batch_idx, len(train_loader),
                100. * batch_idx / len(train_loader), loss.item())
            logging.info(msg)
            niter = epoch * len(train_loader) + batch_idx


def test(args, model, device, test_loader, epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # sum up batch loss
            test_loss += F.nll_loss(output, target, reduction="sum").item()
            # get the index of the max log-probability
            pred = output.max(1, keepdim=True)[1]
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    logging.info("{{metricName: accuracy, metricValue: {:.4f}}};{{metricName: loss, metricValue: {:.4f}}}\n".format(
        float(correct) / (len(test_loader.dataset) / WORLD_SIZE), test_loss))


def should_distribute():
    return dist.is_available() and WORLD_SIZE > 1


def is_distributed():
    return dist.is_available() and dist.is_initialized()


def main():
    # Training settings
    parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
    parser.add_argument("--batch-size", type=int, default=64, metavar="N",
                        help="input batch size for training (default: 64)")
    parser.add_argument("--test-batch-size", type=int, default=1000, metavar="N",
                        help="input batch size for testing (default: 1000)")
    parser.add_argument("--epochs", type=int, default=10, metavar="N",
                        help="number of epochs to train (default: 10)")
    parser.add_argument("--lr", type=float, default=0.01, metavar="LR",
                        help="learning rate (default: 0.01)")
    parser.add_argument("--momentum", type=float, default=0.5, metavar="M",
                        help="SGD momentum (default: 0.5)")
    parser.add_argument("--no-cuda", action="store_true", default=False,
                        help="disables CUDA training")
    parser.add_argument("--seed", type=int, default=1, metavar="S",
                        help="random seed (default: 1)")
    parser.add_argument("--log-interval", type=int, default=10, metavar="N",
                        help="how many batches to wait before logging training status")
    parser.add_argument("--log-path", type=str, default="",
                        help="Path to save logs. Print to StdOut if log-path is not set")
    parser.add_argument("--save-model", action="store_true", default=False,
                        help="For Saving the current Model")

    if dist.is_available():
        parser.add_argument("--backend", type=str, help="Distributed backend",
                            choices=[dist.Backend.GLOO,
                                     dist.Backend.NCCL, dist.Backend.MPI],
                            default=dist.Backend.GLOO)
    args = parser.parse_args()

    # Use this format (%Y-%m-%dT%H:%M:%SZ) to record timestamp of the metrics.
    # If log_path is empty print log to StdOut, otherwise print log to the file.
    if args.log_path == "":
        logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG)
    else:
        logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG,
            filename=args.log_path)

    use_cuda = not args.no_cuda and torch.cuda.is_available()
    if use_cuda:
        print("Using CUDA")

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    if should_distribute():
        print("Using distributed PyTorch with {} backend".format(
            args.backend), flush=True)
        dist.init_process_group(backend=args.backend)
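        # With no explicit init_method, init_process_group() defaults to the
        # env:// rendezvous, i.e. the MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK
        # variables described in the README.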

    kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}

    print("Before downloading data", flush=True)
    train_data = datasets.FashionMNIST("./data",
                                       train=True,
                                       download=True,
                                       transform=transforms.Compose([
                                           transforms.ToTensor()
                                       ]))

    test_data = datasets.FashionMNIST("./data",
                                      train=True,
                                      download=True,
                                      transform=transforms.Compose([
                                          transforms.ToTensor()
                                      ]))
    if is_distributed():
        train_sampler = DistributedSampler(train_data, num_replicas=WORLD_SIZE, rank=RANK, shuffle=True, drop_last=False, seed=args.seed)
        test_sampler = DistributedSampler(test_data, num_replicas=WORLD_SIZE, rank=RANK, shuffle=True, drop_last=False, seed=args.seed)
        train_loader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size, sampler=train_sampler, **kwargs)
        test_loader = torch.utils.data.DataLoader(test_data, batch_size=args.test_batch_size, shuffle=False, **kwargs)
    else:
        train_loader = torch.utils.data.DataLoader(
            train_data,
            batch_size=args.batch_size, shuffle=True, **kwargs)
        test_loader = torch.utils.data.DataLoader(test_data,
                                                  batch_size=args.test_batch_size, shuffle=False, **kwargs)

    print("After downloading data", flush=True)

    test_loader = torch.utils.data.DataLoader(
        datasets.FashionMNIST("./data",
                              train=False,
                              transform=transforms.Compose([
                                  transforms.ToTensor()
                              ])),
        batch_size=args.test_batch_size, shuffle=False, **kwargs)

    model = Net().to(device)

    if is_distributed():
        Distributor = nn.parallel.DistributedDataParallel
        model = Distributor(model)

    optimizer = optim.SGD(model.parameters(), lr=args.lr,
                          momentum=args.momentum)

    start = time.perf_counter()
    cpu_start = time.process_time()

    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(args, model, device, test_loader, epoch)

    cpu_end = time.process_time()
    end = time.perf_counter()
    print("CPU Elapsed time:", cpu_end - cpu_start)
    print("Elapsed time:", end - start)

    if (args.save_model):
        torch.save(model.state_dict(), "mnist_cnn.pt")

    if is_distributed():
        dist.destroy_process_group()


if __name__ == "__main__":
    main()

39  demos/pytorch/distributed/pytorch.yaml (new file)
@@ -0,0 +1,39 @@
includes:
  - base.yaml
targets:
  - target: /bin
    createlinks:
      - src: /opt/python-occlum/bin/python3
        linkname: python3
    copy:
      - files:
        - /opt/occlum/toolchains/busybox/glibc/busybox
  # python packages
  - target: /opt
    copy:
      - dirs:
        - ../python-occlum
  # python code
  - target: /
    copy:
      - files:
        - ../mnist.py
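  # glibc name-service libraries, so code inside the enclave can resolve
  # hostnames such as MASTER_ADDR via DNS and /etc/hosts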
  - target: /opt/occlum/glibc/lib
    copy:
      - files:
        - /lib/x86_64-linux-gnu/libnss_dns.so.2
        - /lib/x86_64-linux-gnu/libnss_files.so.2
  # etc files
  - target: /etc
    copy:
      - dirs:
        - /etc/ssl
      - files:
        - /etc/nsswitch.conf
  # CA files
  - target: /ppml/certs/
    copy:
      - files:
        - ../myCA.pem
        - ../test.key
        - ../test.crt

3  demos/pytorch/standalone/.gitignore (vendored, new file)
@@ -0,0 +1,3 @@
occlum_instance/
miniconda/
Miniconda3*

demos/pytorch/standalone/README.md
@@ -10,22 +10,22 @@ Use the nn package to define our model as a sequence of layers. nn.Sequential is
 This tutorial is written under the assumption that you have Docker installed and use Occlum in a Docker container.

-Occlum is compatible with glibc-supported Python, we employ miniconda as python installation tool. You can import PyTorch packages using conda. Here, miniconda is automatically installed by install_python_with_conda.sh script, the required python and PyTorch packages for this project are also loaded by this script. Here, we take occlum/occlum:0.23.0-ubuntu18.04 as example.
+Occlum is compatible with glibc-supported Python, we employ miniconda as python installation tool. You can import PyTorch packages using conda. Here, miniconda is automatically installed by install_python_with_conda.sh script, the required python and PyTorch packages for this project are also loaded by this script. Here, we take occlum/occlum:0.29.3-ubuntu20.04 as example.

 Step 1 (on the host): Start an Occlum container
 ```
-docker pull occlum/occlum:0.23.0-ubuntu18.04
-docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.23.0-ubuntu18.04 bash
+docker pull occlum/occlum:0.29.3-ubuntu20.04
+docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.29.3-ubuntu20.04 bash
 ```

 Step 2 (in the Occlum container): Download miniconda and install python to prefix position.
 ```
-cd /root/demos/pytorch
+cd /root/demos/pytorch/standalone
 bash ./install_python_with_conda.sh
 ```

 Step 3 (in the Occlum container): Run the sample code on Occlum
 ```
-cd /root/demos/pytorch
+cd /root/demos/pytorch/standalone
 bash ./run_pytorch_on_occlum.sh
 ```