[demos] Add distributed pytorch demo

parent a5cdcc8045
commit 47bd1fd7af

.github/workflows/demo_test.yml (27 changed lines)
@@ -276,11 +276,34 @@ jobs:
         build-envs: 'OCCLUM_RELEASE_BUILD=1'
 
     - name: Build python and pytorch
-      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch; ./install_python_with_conda.sh"
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/standalone; ./install_python_with_conda.sh"
 
     - name: Run pytorch test
-      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch; SGX_MODE=SIM ./run_pytorch_on_occlum.sh"
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/standalone; SGX_MODE=SIM ./run_pytorch_on_occlum.sh"
 
+  Distributed_Pytorch_test:
+    runs-on: ubuntu-20.04
+    steps:
+    - uses: actions/checkout@v1
+      with:
+        submodules: true
+
+    - uses: ./.github/workflows/composite_action/sim
+      with:
+        container-name: ${{ github.job }}
+        build-envs: 'OCCLUM_RELEASE_BUILD=1'
+
+    - name: Build python and pytorch
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed; ./install_python_with_conda.sh"
+
+    - name: Build pytorch Occlum instance
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed; SGX_MODE=SIM ./build_pytorch_occlum_instance.sh"
+
+    - name: Start pytorch Occlum instance node one
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed/occlum_instance; WORLD_SIZE=2 RANK=0 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model &"
+
+    - name: Start pytorch Occlum instance node two
+      run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed/occlum_instance_2; WORLD_SIZE=2 RANK=1 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model"
 
   Tensorflow_test:
     runs-on: ubuntu-20.04
@@ -22,7 +22,7 @@ This set of demos shows how real-world apps can be easily run inside SGX enclave
 * [grpc](grpc/): A client and server communicating through [gRPC](https://grpc.io), containing [glibc-supported demo](grpc/grpc_glibc) and [musl-supported demo](grpc/grpc_musl).
 * [https_server](https_server/): A HTTPS file server based on [Mongoose Embedded Web Server Library](https://github.com/cesanta/mongoose).
 * [openvino](openvino/) A benchmark of [OpenVINO Inference Engine](https://docs.openvinotoolkit.org/2019_R3/_docs_IE_DG_inference_engine_intro.html).
-* [pytorch](pytorch/): A demo of [PyTorch](https://pytorch.org/).
+* [pytorch](pytorch/): Demos of standalone and distributed [PyTorch](https://pytorch.org/).
 * [redis](redis/): A demo of [Redis](https://redis.io).
 * [sofaboot](sofaboot/): A demo of [SOFABoot](https://github.com/sofastack/sofa-boot), an open source Java development framework based on Spring Boot.
 * [sqlite](sqlite/) A demo of [SQLite](https://www.sqlite.org) SQL database engine.

demos/pytorch/distributed/README.md (new file, 107 lines)

# Distributed PyTorch Demo

This project demonstrates how Occlum enables _unmodified_ distributed [PyTorch](https://pytorch.org/) programs to run in SGX enclaves, on the basis of _unmodified_ [Python](https://www.python.org).

## Environment variables for distributed PyTorch training
A few environment variables are relevant to distributed PyTorch training:

1. MASTER_ADDR
2. MASTER_PORT
3. WORLD_SIZE
4. RANK

`MASTER_ADDR` and `MASTER_PORT` specify the rendezvous point that all the training processes connect to.

`WORLD_SIZE` specifies how many training processes participate in the training.

`RANK` is the unique identifier of each training process.

`MASTER_ADDR`, `MASTER_PORT`, and `WORLD_SIZE` must be identical for all participants, while `RANK` must be unique to each one.

**Note that in most cases PyTorch only uses multiple threads. If you observe a process fork, please set `num_workers=1` for the data loaders.**
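
As a minimal illustration (a sketch for reference, not part of the demo code), this is how a training process consumes the four variables when `torch.distributed` uses its default `env://` rendezvous:

```python
import os

import torch.distributed as dist

# With the default init_method ("env://"), init_process_group() reads
# MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK from the environment.
dist.init_process_group(backend="gloo")

print("rank {} of {} ready (rendezvous at {}:{})".format(
    dist.get_rank(), dist.get_world_size(),
    os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"]))

dist.destroy_process_group()
```

In this demo, `WORLD_SIZE` and `RANK` are passed from the host at `occlum run` time, while `MASTER_ADDR` and `MASTER_PORT` fall back to the defaults baked into `Occlum.json`.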

### TLS related environment variables
An environment variable called `GLOO_DEVICE_TRANSPORT` can be used to specify the Gloo transport.

The default value is TCP. If TLS is required to satisfy your security requirements, please also set the following environment variables:

1. GLOO_DEVICE_TRANSPORT=TCP_TLS
2. GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY
3. GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT
4. GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE

These environment variables are set as below in our demo.
```json
  "env": {
    "default": [
      "GLOO_DEVICE_TRANSPORT=TCP_TLS",
      "GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key",
      "GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt",
      "GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=/ppml/certs/myCA.pem",
```

The CA and certificate files above are generated with openssl. For details, please refer to the function **generate_ca_files** in the script [`build_pytorch_occlum_instance.sh`](./build_pytorch_occlum_instance.sh).
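
If you want to sanity-check the files after the build script has generated them (an optional check, assuming the default file names used by the script), openssl can do it directly:

```bash
# Verify that the test certificate is signed by the generated CA
openssl verify -CAfile myCA.pem test.crt

# Inspect the certificate subject and validity period
openssl x509 -in test.crt -noout -subject -dates
```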

## How to Run

This tutorial is written under the assumption that you have Docker installed and use Occlum in a Docker container.

Occlum is compatible with glibc-supported Python, so we use miniconda as the Python installation tool and import the PyTorch packages with conda. Miniconda is installed automatically by the `install_python_with_conda.sh` script, which also installs the Python and PyTorch packages required by this project. Here, we take occlum/occlum:0.29.3-ubuntu20.04 as an example.

In the following example, we run a distributed PyTorch training job on the `Fashion-MNIST` dataset with 2 processes (Occlum instances).

Thus, we set `WORLD_SIZE` to 2.

Generally, `MASTER_ADDR` can be set to the IP address of the process with RANK 0. In our case, the two processes run in the same container, so `MASTER_ADDR` can simply be set to `localhost`.

Step 1 (on the host): Start an Occlum container
```bash
docker pull occlum/occlum:0.29.3-ubuntu20.04
docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.29.3-ubuntu20.04 bash
```
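
If your machine has no SGX device and you only want to try the demo in simulation mode (as the CI does), the container can, presumably, be started without mapping the SGX device, and the instances built with `SGX_MODE=SIM`. A hypothetical variant:

```bash
# Simulation-mode variant (assumption: no SGX hardware available)
docker run -it --name=pythonDemo occlum/occlum:0.29.3-ubuntu20.04 bash

# ...and in Step 3 below, build the instances in simulation mode
SGX_MODE=SIM bash ./build_pytorch_occlum_instance.sh
```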

Step 2 (in the Occlum container): Download miniconda and install python to the prefix position.
```bash
cd /root/demos/pytorch/distributed
bash ./install_python_with_conda.sh
```

Step 3 (in the Occlum container): Build the distributed PyTorch Occlum instances
```bash
cd /root/demos/pytorch/distributed
bash ./build_pytorch_occlum_instance.sh
```
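
If you are curious how the build script configured the enclave, you can inspect the resulting `Occlum.json` of an instance with `jq` (an optional check, not required by the demo):

```bash
# Environment passed into the enclave (defaults plus untrusted variables)
jq '.env' occlum_instance/Occlum.json

# Enclave resource limits set by the build script
jq '.resource_limits' occlum_instance/Occlum.json
```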

Step 4 (in the Occlum container): Run node one PyTorch instance
```bash
cd /root/demos/pytorch/distributed/occlum_instance
WORLD_SIZE=2 RANK=0 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```

If successful, it waits for node two to join.
```log
Using distributed PyTorch with gloo backend
```

Step 5 (in the Occlum container): Run node two PyTorch instance from the `occlum_instance_2` copy created by the build script
```bash
cd /root/demos/pytorch/distributed/occlum_instance_2
WORLD_SIZE=2 RANK=1 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model
```

If everything goes well, nodes one and two print logs similar to the ones below.
```log
After downloading data
2022-12-05T09:40:05Z INFO     Train Epoch: 1 [0/469 (0%)]       loss=2.3037
2022-12-05T09:40:05Z INFO     Reducer buckets have been rebuilt in this iteration.
2022-12-05T09:40:06Z INFO     Train Epoch: 1 [10/469 (2%)]      loss=2.3117
2022-12-05T09:40:06Z INFO     Train Epoch: 1 [20/469 (4%)]      loss=2.2826
2022-12-05T09:40:06Z INFO     Train Epoch: 1 [30/469 (6%)]      loss=2.2904
2022-12-05T09:40:07Z INFO     Train Epoch: 1 [40/469 (9%)]      loss=2.2860
2022-12-05T09:40:07Z INFO     Train Epoch: 1 [50/469 (11%)]     loss=2.2784
2022-12-05T09:40:08Z INFO     Train Epoch: 1 [60/469 (13%)]     loss=2.2779
2022-12-05T09:40:08Z INFO     Train Epoch: 1 [70/469 (15%)]     loss=2.2689
2022-12-05T09:40:08Z INFO     Train Epoch: 1 [80/469 (17%)]     loss=2.2513
2022-12-05T09:40:09Z INFO     Train Epoch: 1 [90/469 (19%)]     loss=2.2536
...
```
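
If node two cannot reach node one, a quick troubleshooting step (an assumption on our side, not part of the demo scripts) is to check inside the container that the rank-0 instance is listening on the rendezvous port, which defaults to 29500 in this demo:

```bash
# Requires iproute2; lists listening TCP sockets and filters for the rendezvous port
ss -ltn | grep 29500
```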
							
								
								
									

demos/pytorch/distributed/build_pytorch_occlum_instance.sh (new executable file, 55 lines)

#!/bin/bash
set -e

BLUE='\033[1;34m'
NC='\033[0m'

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}"  )" >/dev/null 2>&1 && pwd )"
python_dir="$script_dir/occlum_instance/image/opt/python-occlum"


function generate_ca_files()
{
    cn_name=${1:-"localhost"}
    # Generate CA files
    openssl req -x509 -nodes -days 1825 -newkey rsa:2048 -keyout myCA.key -out myCA.pem -subj "/CN=${cn_name}"
    # Prepare test private key
    openssl genrsa -out test.key 2048
    # Use private key to generate a Certificate Sign Request
    openssl req -new -key test.key -out test.csr -subj "/C=CN/ST=Shanghai/L=Shanghai/O=Ant/CN=${cn_name}"
    # Use CA private key and CA file to sign test CSR
    openssl x509 -req -in test.csr -CA myCA.pem -CAkey myCA.key -CAcreateserial -out test.crt -days 825 -sha256
}

function build_instance()
{
    rm -rf occlum_instance* && occlum new occlum_instance
    pushd occlum_instance
    rm -rf image
    copy_bom -f ../pytorch.yaml --root image --include-dir /opt/occlum/etc/template

    if [ ! -d $python_dir ];then
        echo "Error: cannot stat '$python_dir' directory"
        exit 1
    fi

    # Enlarge the enclave resource limits and pass the distributed-training
    # environment variables into the enclave via Occlum.json
    new_json="$(jq '.resource_limits.user_space_size = "4000MB" |
                    .resource_limits.kernel_space_heap_size = "256MB" |
                    .resource_limits.max_num_of_threads = 64 |
                    .env.untrusted += [ "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "TORCH_CPP_LOG_LEVEL" ] |
                    .env.default += ["GLOO_DEVICE_TRANSPORT=TCP_TLS"] |
                    .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key"] |
                    .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt"] |
                    .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=/ppml/certs/myCA.pem"] |
                    .env.default += ["PYTHONHOME=/opt/python-occlum"] |
                    .env.default += [ "MASTER_ADDR=127.0.0.1", "MASTER_PORT=29500" ] ' Occlum.json)" && \
    echo "${new_json}" > Occlum.json
    occlum build
    popd
}

generate_ca_files
build_instance

# Test instance for 2 nodes distributed pytorch training
cp -r occlum_instance occlum_instance_2
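
A note on usage: because the rendezvous variables are declared as untrusted environment variables, they can, presumably, be overridden from the host at launch time, so the two instances do not have to share a container. A hypothetical two-machine run could look like this (the IP address is a placeholder, not from the demo):

```bash
# On machine A (rank 0); 192.168.1.10 is a hypothetical address of this machine
cd occlum_instance
MASTER_ADDR=192.168.1.10 MASTER_PORT=29500 WORLD_SIZE=2 RANK=0 \
    occlum run /bin/python3 mnist.py --epochs 3 --no-cuda --seed 42

# On machine B (rank 1), pointing at machine A as the rendezvous point
cd occlum_instance_2
MASTER_ADDR=192.168.1.10 MASTER_PORT=29500 WORLD_SIZE=2 RANK=1 \
    occlum run /bin/python3 mnist.py --epochs 3 --no-cuda --seed 42
```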
							
								
								
									

demos/pytorch/distributed/install_python_with_conda.sh (new executable file, 10 lines)

#!/bin/bash
set -e
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}"  )" >/dev/null 2>&1 && pwd )"

# Install python and dependencies to the specified position
[ -f Miniconda3-latest-Linux-x86_64.sh ] || wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
[ -d miniconda ] || bash ./Miniconda3-latest-Linux-x86_64.sh -b -p $script_dir/miniconda
$script_dir/miniconda/bin/conda create --prefix $script_dir/python-occlum -y \
    python=3.8.10 numpy=1.21.5 scipy=1.7.3 scikit-learn=1.0 pandas=1.3 \
    Cython pytorch torchvision -c pytorch
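
A quick, optional way to confirm that the conda environment was created correctly (this check is our suggestion, not part of the demo scripts, and assumes you run it from the `distributed` directory):

```bash
# Print the PyTorch version from the freshly created environment
./python-occlum/bin/python3 -c "import torch; print(torch.__version__)"
```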
							
								
								
									

demos/pytorch/distributed/mnist.py (new file, 210 lines)

from __future__ import print_function

import argparse
import logging
import os
import time

from torchvision import datasets, transforms
from torch.utils.data.distributed import DistributedSampler
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))

RANK = int(os.environ.get("RANK", 0))

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            msg = "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                epoch, batch_idx, len(train_loader),
                100. * batch_idx / len(train_loader), loss.item())
            logging.info(msg)
            niter = epoch * len(train_loader) + batch_idx


def test(args, model, device, test_loader, epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # sum up batch loss
            test_loss += F.nll_loss(output, target, reduction="sum").item()
            # get the index of the max log-probability
            pred = output.max(1, keepdim=True)[1]
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    logging.info("{{metricName: accuracy, metricValue: {:.4f}}};{{metricName: loss, metricValue: {:.4f}}}\n".format(
        float(correct) / (len(test_loader.dataset) / WORLD_SIZE), test_loss))


def should_distribute():
    return dist.is_available() and WORLD_SIZE > 1


def is_distributed():
    return dist.is_available() and dist.is_initialized()


def main():
    # Training settings
    parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
    parser.add_argument("--batch-size", type=int, default=64, metavar="N",
                        help="input batch size for training (default: 64)")
    parser.add_argument("--test-batch-size", type=int, default=1000, metavar="N",
                        help="input batch size for testing (default: 1000)")
    parser.add_argument("--epochs", type=int, default=10, metavar="N",
                        help="number of epochs to train (default: 10)")
    parser.add_argument("--lr", type=float, default=0.01, metavar="LR",
                        help="learning rate (default: 0.01)")
    parser.add_argument("--momentum", type=float, default=0.5, metavar="M",
                        help="SGD momentum (default: 0.5)")
    parser.add_argument("--no-cuda", action="store_true", default=False,
                        help="disables CUDA training")
    parser.add_argument("--seed", type=int, default=1, metavar="S",
                        help="random seed (default: 1)")
    parser.add_argument("--log-interval", type=int, default=10, metavar="N",
                        help="how many batches to wait before logging training status")
    parser.add_argument("--log-path", type=str, default="",
                        help="Path to save logs. Print to StdOut if log-path is not set")
    parser.add_argument("--save-model", action="store_true", default=False,
                        help="For Saving the current Model")

    if dist.is_available():
        parser.add_argument("--backend", type=str, help="Distributed backend",
                            choices=[dist.Backend.GLOO,
                                     dist.Backend.NCCL, dist.Backend.MPI],
                            default=dist.Backend.GLOO)
    args = parser.parse_args()

    # Use this format (%Y-%m-%dT%H:%M:%SZ) to record timestamp of the metrics.
    # If log_path is empty print log to StdOut, otherwise print log to the file.
    if args.log_path == "":
        logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG)
    else:
        logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG,
            filename=args.log_path)

    use_cuda = not args.no_cuda and torch.cuda.is_available()
    if use_cuda:
        print("Using CUDA")

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    if should_distribute():
        print("Using distributed PyTorch with {} backend".format(
            args.backend), flush=True)
        dist.init_process_group(backend=args.backend)

    kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}

    print("Before downloading data", flush=True)
    train_data = datasets.FashionMNIST("./data",
                            train=True,
                            download=True,
                            transform=transforms.Compose([
                            transforms.ToTensor()
                            ]))


    test_data = datasets.FashionMNIST("./data",
                            train=True,
                            download=True,
                            transform=transforms.Compose([
                            transforms.ToTensor()
                            ]))
    if is_distributed():
        train_sampler = DistributedSampler(train_data, num_replicas=WORLD_SIZE, rank=RANK, shuffle=True, drop_last=False, seed=args.seed)
        test_sampler = DistributedSampler(test_data, num_replicas=WORLD_SIZE, rank=RANK, shuffle=True, drop_last=False, seed=args.seed)
        train_loader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size, sampler=train_sampler, **kwargs)
        test_loader = torch.utils.data.DataLoader(test_data, batch_size=args.test_batch_size, shuffle=False, **kwargs)
    else:
        train_loader = torch.utils.data.DataLoader(
            train_data,
            batch_size=args.batch_size, shuffle=True, **kwargs)
        test_loader = torch.utils.data.DataLoader(test_data,
        batch_size=args.test_batch_size, shuffle=False, **kwargs)

    print("After downloading data", flush=True)

    # Evaluation always uses the held-out test split; this loader replaces the
    # one created above.
    test_loader = torch.utils.data.DataLoader(
        datasets.FashionMNIST("./data",
                              train=False,
                              transform=transforms.Compose([
                                  transforms.ToTensor()
                              ])),
        batch_size=args.test_batch_size, shuffle=False, **kwargs)

    model = Net().to(device)

    if is_distributed():
        Distributor = nn.parallel.DistributedDataParallel
        model = Distributor(model)

    optimizer = optim.SGD(model.parameters(), lr=args.lr,
                          momentum=args.momentum)


    start = time.perf_counter()
    cpu_start = time.process_time()

    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(args, model, device, test_loader, epoch)

    cpu_end = time.process_time()
    end = time.perf_counter()
    print("CPU Elapsed time:", cpu_end - cpu_start)
    print("Elapsed time:", end - start)

    if (args.save_model):
        torch.save(model.state_dict(), "mnist_cnn.pt")

    if is_distributed():
        dist.destroy_process_group()


if __name__ == "__main__":
    main()
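
For a quick local smoke test outside Occlum (an optional step we suggest, assuming the conda environment created by `install_python_with_conda.sh` and the current directory `demos/pytorch/distributed`), the script also runs as a single, non-distributed process:

```bash
# WORLD_SIZE defaults to 1, so no process group is created
./python-occlum/bin/python3 mnist.py --epochs 1 --no-cuda --log-interval 100
```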
							
								
								
									

demos/pytorch/distributed/pytorch.yaml (new file, 39 lines)

includes:
  - base.yaml
targets:
  - target: /bin
    createlinks:
      - src: /opt/python-occlum/bin/python3
        linkname: python3
    copy:
      - files:
          - /opt/occlum/toolchains/busybox/glibc/busybox
  # python packages
  - target: /opt
    copy:
      - dirs:
          - ../python-occlum
  # python code
  - target: /
    copy:
      - files:
          - ../mnist.py
  - target: /opt/occlum/glibc/lib
    copy:
      - files:
          - /lib/x86_64-linux-gnu/libnss_dns.so.2
          - /lib/x86_64-linux-gnu/libnss_files.so.2
  # etc files
  - target: /etc
    copy:
      - dirs:
          - /etc/ssl
      - files:
          - /etc/nsswitch.conf
  # CA files
  - target: /ppml/certs/
    copy:
      - files:
          - ../myCA.pem
          - ../test.key
          - ../test.crt
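
If you prefer the training data baked into the enclave image instead of downloaded at runtime, a hypothetical extra target could be appended to this bom file (this is our assumption, not part of the demo, and presumes a pre-downloaded `data/FashionMNIST` directory next to `pytorch.yaml`):

```yaml
  # Hypothetical: bundle a pre-downloaded Fashion-MNIST dataset into the image
  - target: /data
    copy:
      - dirs:
          - ../data/FashionMNIST
```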
							
								
								
									

demos/pytorch/standalone/.gitignore (new file, 3 lines)

occlum_instance/
miniconda/
Miniconda3*

@@ -10,22 +10,22 @@ Use the nn package to define our model as a sequence of layers. nn.Sequential is
 
 This tutorial is written under the assumption that you have Docker installed and use Occlum in a Docker container.
 
-Occlum is compatible with glibc-supported Python, we employ miniconda as python installation tool. You can import PyTorch packages using conda. Here, miniconda is automatically installed by install_python_with_conda.sh script, the required python and PyTorch packages for this project are also loaded by this script. Here, we take occlum/occlum:0.23.0-ubuntu18.04 as example.
+Occlum is compatible with glibc-supported Python, so we use miniconda as the Python installation tool and import the PyTorch packages with conda. Miniconda is installed automatically by the install_python_with_conda.sh script, which also installs the Python and PyTorch packages required by this project. Here, we take occlum/occlum:0.29.3-ubuntu20.04 as an example.
 
 Step 1 (on the host): Start an Occlum container
 ```
-docker pull occlum/occlum:0.23.0-ubuntu18.04
-docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.23.0-ubuntu18.04 bash
+docker pull occlum/occlum:0.29.3-ubuntu20.04
+docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.29.3-ubuntu20.04 bash
 ```
 
 Step 2 (in the Occlum container): Download miniconda and install python to prefix position.
 ```
-cd /root/demos/pytorch
+cd /root/demos/pytorch/standalone
 bash ./install_python_with_conda.sh
 ```
 
 Step 3 (in the Occlum container): Run the sample code on Occlum
 ```
-cd /root/demos/pytorch
+cd /root/demos/pytorch/standalone
 bash ./run_pytorch_on_occlum.sh
 ```