[demos] Add distributed pytorch demo
parent a5cdcc8045
commit 47bd1fd7af

.github/workflows/demo_test.yml | 27
							| @ -276,11 +276,34 @@ jobs: | ||||
|         build-envs: 'OCCLUM_RELEASE_BUILD=1' | ||||
| 
 | ||||
|     - name: Build python and pytorch | ||||
|       run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch; ./install_python_with_conda.sh" | ||||
|       run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/standalone; ./install_python_with_conda.sh" | ||||
| 
 | ||||
|     - name: Run pytorch test | ||||
|       run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch; SGX_MODE=SIM ./run_pytorch_on_occlum.sh" | ||||
|       run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/standalone; SGX_MODE=SIM ./run_pytorch_on_occlum.sh" | ||||
| 
 | ||||
|   Distributed_Pytorch_test: | ||||
|     runs-on: ubuntu-20.04 | ||||
|     steps: | ||||
|     - uses: actions/checkout@v1 | ||||
|       with: | ||||
|         submodules: true | ||||
| 
 | ||||
|     - uses: ./.github/workflows/composite_action/sim | ||||
|       with: | ||||
|         container-name: ${{ github.job }} | ||||
|         build-envs: 'OCCLUM_RELEASE_BUILD=1' | ||||
| 
 | ||||
|     - name: Build python and pytorch | ||||
|       run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed; ./install_python_with_conda.sh" | ||||
| 
 | ||||
|     - name: Build pytorch Occlum instance | ||||
|       run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed; SGX_MODE=SIM ./build_pytorch_occlum_instance.sh" | ||||
| 
 | ||||
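|     # Node one is started in the background ("&") so that node two can be launched and join the rendezvous | ||||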
|     - name: Start pytorch Occlum instance node one | ||||
|       run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed/occlum_instance; WORLD_SIZE=2 RANK=0 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model &" | ||||
| 
 | ||||
|     - name: Start pytorch Occlum instance node two | ||||
|       run: docker exec ${{ github.job }} bash -c "cd /root/occlum/demos/pytorch/distributed/occlum_instance_2; WORLD_SIZE=2 RANK=1 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model" | ||||
| 
 | ||||
|   Tensorflow_test: | ||||
|     runs-on: ubuntu-20.04 | ||||
|  | ||||
| @ -22,7 +22,7 @@ This set of demos shows how real-world apps can be easily run inside SGX enclave | ||||
| * [grpc](grpc/): A client and server communicating through [gRPC](https://grpc.io), containing [glibc-supported demo](grpc/grpc_glibc) and [musl-supported demo](grpc/grpc_musl). | ||||
| * [https_server](https_server/): A HTTPS file server based on [Mongoose Embedded Web Server Library](https://github.com/cesanta/mongoose). | ||||
| * [openvino](openvino/) A benchmark of [OpenVINO Inference Engine](https://docs.openvinotoolkit.org/2019_R3/_docs_IE_DG_inference_engine_intro.html). | ||||
| * [pytorch](pytorch/): A demo of [PyTorch](https://pytorch.org/). | ||||
| * [pytorch](pytorch/): Demos of standalone and distributed [PyTorch](https://pytorch.org/). | ||||
| * [redis](redis/): A demo of [Redis](https://redis.io). | ||||
| * [sofaboot](sofaboot/): A demo of [SOFABoot](https://github.com/sofastack/sofa-boot), an open source Java development framework based on Spring Boot. | ||||
| * [sqlite](sqlite/) A demo of [SQLite](https://www.sqlite.org) SQL database engine. | ||||

demos/pytorch/distributed/README.md | 107 (new file)
							| @ -0,0 +1,107 @@ | ||||
| # Distributed PyTorch Demo | ||||
| 
 | ||||
| This project demonstrates how Occlum enables _unmodified_ distributed [PyTorch](https://pytorch.org/) programs to run inside SGX enclaves, on the basis of _unmodified_ [Python](https://www.python.org). | ||||
| 
 | ||||
| ## Environment variables for Distributed PyTorch model | ||||
| A few environment variables are related to distributed PyTorch training: | ||||
| 
 | ||||
| 1. MASTER_ADDR | ||||
| 2. MASTER_PORT | ||||
| 3. WORLD_SIZE | ||||
| 4. RANK | ||||
| 
 | ||||
| `MASTER_ADDR` and `MASTER_PORT` specify the rendezvous point that all training processes connect to. | ||||
|  | ||||
| `WORLD_SIZE` specifies how many processes participate in the training. | ||||
|  | ||||
| `RANK` is the unique identifier of each training process. | ||||
|  | ||||
| `MASTER_ADDR`, `MASTER_PORT` and `WORLD_SIZE` must be identical for all participants, while `RANK` must be unique to each process. | ||||
| 
 | ||||
| **Note that in most cases PyTorch only uses multiple threads. If you observe a process fork, please set `num_workers=1`.** | ||||
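|  | ||||
| As a reference, the minimal sketch below (not part of the demo) shows how these variables feed PyTorch's default `env://` rendezvous, which is the same mechanism `mnist.py` relies on when it calls `dist.init_process_group`. It assumes `WORLD_SIZE` and `RANK` are already exported, e.g. `WORLD_SIZE=2 RANK=0`. | ||||
|  | ||||
| ```python | ||||
| import os | ||||
| import torch.distributed as dist | ||||
|  | ||||
| # MASTER_ADDR/MASTER_PORT tell every process where rank 0 listens; | ||||
| # they must be identical on all participants. | ||||
| os.environ.setdefault("MASTER_ADDR", "127.0.0.1") | ||||
| os.environ.setdefault("MASTER_PORT", "29500") | ||||
|  | ||||
| # With no explicit init_method, PyTorch uses env:// and reads | ||||
| # MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK from the environment. | ||||
| dist.init_process_group(backend="gloo") | ||||
| print("rank", dist.get_rank(), "of", dist.get_world_size(), "is ready") | ||||
| dist.destroy_process_group() | ||||
| ``` | ||||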
| 
 | ||||
| ### TLS related environment variables | ||||
| There is an environment variable called `GLOO_DEVICE_TRANSPORT` that can be used to specify the Gloo transport. | ||||
| 
 | ||||
| The default value is TCP. If TLS is required to satisfy your security requirements, please also set the following environment variables: | ||||
| 
 | ||||
| 1. GLOO_DEVICE_TRANSPORT=TCP_TLS | ||||
| 2. GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY | ||||
| 3. GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT | ||||
| 4. GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE | ||||
| 
 | ||||
| These environment variables are set as below in our demo. | ||||
| ```json | ||||
|   "env": { | ||||
|     "default": [ | ||||
|       "GLOO_DEVICE_TRANSPORT=TCP_TLS", | ||||
|       "GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key", | ||||
|       "GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt", | ||||
|       "GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=/ppml/certs/myCA.pem", | ||||
| ``` | ||||
| 
 | ||||
| The CA and certificate files above are generated with openssl. For details, please refer to the function **generate_ca_files** in the script [`build_pytorch_occlum_instance.sh`](./build_pytorch_occlum_instance.sh). | ||||
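|  | ||||
| For reference, the essential openssl commands in that function look like the following (file names and subject fields follow the script's defaults): | ||||
|  | ||||
| ```bash | ||||
| # Create a self-signed CA (private key + certificate) | ||||
| openssl req -x509 -nodes -days 1825 -newkey rsa:2048 -keyout myCA.key -out myCA.pem -subj "/CN=localhost" | ||||
| # Create the test private key and a certificate signing request | ||||
| openssl genrsa -out test.key 2048 | ||||
| openssl req -new -key test.key -out test.csr -subj "/C=CN/ST=Shanghai/L=Shanghai/O=Ant/CN=localhost" | ||||
| # Sign the CSR with the CA to produce the certificate used by the Gloo TLS transport | ||||
| openssl x509 -req -in test.csr -CA myCA.pem -CAkey myCA.key -CAcreateserial -out test.crt -days 825 -sha256 | ||||
| ``` | ||||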
| 
 | ||||
| ## How to Run | ||||
| 
 | ||||
| This tutorial is written under the assumption that you have Docker installed and use Occlum in a Docker container. | ||||
| 
 | ||||
| Occlum is compatible with glibc-supported Python, so we use Miniconda as the Python installation tool and install the PyTorch packages with conda. Miniconda is installed automatically by the `install_python_with_conda.sh` script, which also installs the Python and PyTorch packages required for this project. Here, we take `occlum/occlum:0.29.3-ubuntu20.04` as an example. | ||||
| 
 | ||||
| In the following example, we will run distributed PyTorch training on the `Fashion-MNIST` dataset with 2 processes (Occlum instances). | ||||
| 
 | ||||
| Thus, we set `WORLD_SIZE` to 2. | ||||
| 
 | ||||
| Generally, `MASTER_ADDR` is set to the IP address of the process with `RANK` 0. In our case, both processes run in the same container, so `MASTER_ADDR` can simply be set to `localhost`. | ||||
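|  | ||||
| Since the build script adds `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK` to `env.untrusted` in `Occlum.json`, they can also be overridden at launch time. A hypothetical two-machine run (192.168.1.10 is a made-up address for the rank-0 host; the TLS certificates would need to be regenerated with the proper host name via `generate_ca_files`) could look like: | ||||
|  | ||||
| ```bash | ||||
| # On the rank-0 machine | ||||
| MASTER_ADDR=192.168.1.10 MASTER_PORT=29500 WORLD_SIZE=2 RANK=0 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model | ||||
| # On the rank-1 machine | ||||
| MASTER_ADDR=192.168.1.10 MASTER_PORT=29500 WORLD_SIZE=2 RANK=1 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model | ||||
| ``` | ||||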
| 
 | ||||
| Step 1 (on the host): Start an Occlum container | ||||
| ```bash | ||||
| docker pull occlum/occlum:0.29.3-ubuntu20.04 | ||||
| docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.29.3-ubuntu20.04 bash | ||||
| ``` | ||||
| 
 | ||||
| Step 2 (in the Occlum container): Download Miniconda and install Python to the prefix position. | ||||
| ```bash | ||||
| cd /root/demos/pytorch/distributed | ||||
| bash ./install_python_with_conda.sh | ||||
| ``` | ||||
| 
 | ||||
| Step 3 (in the Occlum container): Build the Distributed PyTorch Occlum instances | ||||
| ```bash | ||||
| cd /root/demos/pytorch/distributed | ||||
| bash ./build_pytorch_occlum_instance.sh | ||||
| ``` | ||||
| 
 | ||||
| Step 4 (in the Occlum container): Run node one PyTorch instance | ||||
| ```bash | ||||
| cd /root/demos/pytorch/distributed/occlum_instance | ||||
| WORLD_SIZE=2 RANK=0 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model | ||||
| ``` | ||||
| 
 | ||||
| If successful, it will wait for node two to join. | ||||
| ```log | ||||
| Using distributed PyTorch with gloo backend | ||||
| ``` | ||||
| 
 | ||||
| Step 5 (in the Occlum container): Run node two PyTorch instance | ||||
| ```bash | ||||
| cd /root/demos/pytorch/distributed/occlum_instance | ||||
| WORLD_SIZE=2 RANK=1 occlum run /bin/python3 mnist.py --epoch 3 --no-cuda --seed 42 --save-model | ||||
| ``` | ||||
| 
 | ||||
| If everything goes well, nodes one and two produce logs similar to the ones below. | ||||
| ```log | ||||
| After downloading data | ||||
| 2022-12-05T09:40:05Z INFO     Train Epoch: 1 [0/469 (0%)]       loss=2.3037 | ||||
| 2022-12-05T09:40:05Z INFO     Reducer buckets have been rebuilt in this iteration. | ||||
| 2022-12-05T09:40:06Z INFO     Train Epoch: 1 [10/469 (2%)]      loss=2.3117 | ||||
| 2022-12-05T09:40:06Z INFO     Train Epoch: 1 [20/469 (4%)]      loss=2.2826 | ||||
| 2022-12-05T09:40:06Z INFO     Train Epoch: 1 [30/469 (6%)]      loss=2.2904 | ||||
| 2022-12-05T09:40:07Z INFO     Train Epoch: 1 [40/469 (9%)]      loss=2.2860 | ||||
| 2022-12-05T09:40:07Z INFO     Train Epoch: 1 [50/469 (11%)]     loss=2.2784 | ||||
| 2022-12-05T09:40:08Z INFO     Train Epoch: 1 [60/469 (13%)]     loss=2.2779 | ||||
| 2022-12-05T09:40:08Z INFO     Train Epoch: 1 [70/469 (15%)]     loss=2.2689 | ||||
| 2022-12-05T09:40:08Z INFO     Train Epoch: 1 [80/469 (17%)]     loss=2.2513 | ||||
| 2022-12-05T09:40:09Z INFO     Train Epoch: 1 [90/469 (19%)]     loss=2.2536 | ||||
| ... | ||||
| ``` | ||||

demos/pytorch/distributed/build_pytorch_occlum_instance.sh | 55 (new file)
							| @ -0,0 +1,55 @@ | ||||
| #!/bin/bash | ||||
| set -e | ||||
| 
 | ||||
| BLUE='\033[1;34m' | ||||
| NC='\033[0m' | ||||
| 
 | ||||
| script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}"  )" >/dev/null 2>&1 && pwd )" | ||||
| python_dir="$script_dir/occlum_instance/image/opt/python-occlum" | ||||
| 
 | ||||
| 
 | ||||
| function generate_ca_files() | ||||
| { | ||||
|     cn_name=${1:-"localhost"} | ||||
|     # Generate CA files | ||||
|     openssl req -x509 -nodes -days 1825 -newkey rsa:2048 -keyout myCA.key -out myCA.pem -subj "/CN=${cn_name}" | ||||
|     # Prepare test private key | ||||
|     openssl genrsa -out test.key 2048 | ||||
|     # Use private key to generate a Certificate Sign Request | ||||
|     openssl req -new -key test.key -out test.csr -subj "/C=CN/ST=Shanghai/L=Shanghai/O=Ant/CN=${cn_name}" | ||||
|     # Use CA private key and CA file to sign test CSR | ||||
|     openssl x509 -req -in test.csr -CA myCA.pem -CAkey myCA.key -CAcreateserial -out test.crt -days 825 -sha256 | ||||
| } | ||||
| 
 | ||||
| function build_instance() | ||||
| { | ||||
|     rm -rf occlum_instance* && occlum new occlum_instance | ||||
|     pushd occlum_instance | ||||
|     rm -rf image | ||||
|     copy_bom -f ../pytorch.yaml --root image --include-dir /opt/occlum/etc/template | ||||
| 
 | ||||
|     if [ ! -d $python_dir ];then | ||||
|         echo "Error: cannot stat '$python_dir' directory" | ||||
|         exit 1 | ||||
|     fi | ||||
| 
 | ||||
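|     # Enlarge the resource limits and add the distributed-training (untrusted) and TLS (default) env vars to Occlum.json | ||||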
|     new_json="$(jq '.resource_limits.user_space_size = "4000MB" | | ||||
|                     .resource_limits.kernel_space_heap_size = "256MB" | | ||||
|                     .resource_limits.max_num_of_threads = 64 | | ||||
|                     .env.untrusted += [ "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "TORCH_CPP_LOG_LEVEL" ] | | ||||
|                     .env.default += ["GLOO_DEVICE_TRANSPORT=TCP_TLS"] | | ||||
|                     .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=/ppml/certs/test.key"] | | ||||
|                     .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=/ppml/certs/test.crt"] | | ||||
|                     .env.default += ["GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=/ppml/certs/myCA.pem"] | | ||||
|                     .env.default += ["PYTHONHOME=/opt/python-occlum"] | | ||||
|                     .env.default += [ "MASTER_ADDR=127.0.0.1", "MASTER_PORT=29500" ] ' Occlum.json)" && \ | ||||
|     echo "${new_json}" > Occlum.json | ||||
|     occlum build | ||||
|     popd | ||||
| } | ||||
| 
 | ||||
| generate_ca_files | ||||
| build_instance | ||||
| 
 | ||||
| # Test instance for 2 nodes distributed pytorch training | ||||
| cp -r occlum_instance occlum_instance_2 | ||||

demos/pytorch/distributed/install_python_with_conda.sh | 10 (new file)
							| @ -0,0 +1,10 @@ | ||||
| #!/bin/bash | ||||
| set -e | ||||
| script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}"  )" >/dev/null 2>&1 && pwd )" | ||||
| 
 | ||||
| # Install python and dependencies to specified position | ||||
| [ -f Miniconda3-latest-Linux-x86_64.sh ] || wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh | ||||
| [ -d miniconda ] || bash ./Miniconda3-latest-Linux-x86_64.sh -b -p $script_dir/miniconda | ||||
| $script_dir/miniconda/bin/conda create --prefix $script_dir/python-occlum -y \ | ||||
|     python=3.8.10 numpy=1.21.5 scipy=1.7.3 scikit-learn=1.0 pandas=1.3 \ | ||||
|     Cython pytorch torchvision -c pytorch | ||||

demos/pytorch/distributed/mnist.py | 210 (new file)
							| @ -0,0 +1,210 @@ | ||||
| from __future__ import print_function | ||||
| 
 | ||||
| import argparse | ||||
| import logging | ||||
| import os | ||||
| import time | ||||
| 
 | ||||
| from torchvision import datasets, transforms | ||||
| from torch.utils.data.distributed import DistributedSampler | ||||
| import torch | ||||
| import torch.distributed as dist | ||||
| import torch.nn as nn | ||||
| import torch.nn.functional as F | ||||
| import torch.optim as optim | ||||
| 
 | ||||
| WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1)) | ||||
| 
 | ||||
| RANK = int(os.environ.get("RANK", 0)) | ||||
| 
 | ||||
| class Net(nn.Module): | ||||
|     def __init__(self): | ||||
|         super(Net, self).__init__() | ||||
|         self.conv1 = nn.Conv2d(1, 20, 5, 1) | ||||
|         self.conv2 = nn.Conv2d(20, 50, 5, 1) | ||||
|         self.fc1 = nn.Linear(4*4*50, 500) | ||||
|         self.fc2 = nn.Linear(500, 10) | ||||
| 
 | ||||
|     def forward(self, x): | ||||
|         x = F.relu(self.conv1(x)) | ||||
|         x = F.max_pool2d(x, 2, 2) | ||||
|         x = F.relu(self.conv2(x)) | ||||
|         x = F.max_pool2d(x, 2, 2) | ||||
|         x = x.view(-1, 4*4*50) | ||||
|         x = F.relu(self.fc1(x)) | ||||
|         x = self.fc2(x) | ||||
|         return F.log_softmax(x, dim=1) | ||||
| 
 | ||||
| 
 | ||||
| def train(args, model, device, train_loader, optimizer, epoch): | ||||
|     model.train() | ||||
|     for batch_idx, (data, target) in enumerate(train_loader): | ||||
|         data, target = data.to(device), target.to(device) | ||||
|         optimizer.zero_grad() | ||||
|         output = model(data) | ||||
|         loss = F.nll_loss(output, target) | ||||
|         loss.backward() | ||||
|         optimizer.step() | ||||
|         if batch_idx % args.log_interval == 0: | ||||
|             msg = "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format( | ||||
|                 epoch, batch_idx, len(train_loader), | ||||
|                 100. * batch_idx / len(train_loader), loss.item()) | ||||
|             logging.info(msg) | ||||
|             niter = epoch * len(train_loader) + batch_idx | ||||
| 
 | ||||
| 
 | ||||
| def test(args, model, device, test_loader, epoch): | ||||
|     model.eval() | ||||
|     test_loss = 0 | ||||
|     correct = 0 | ||||
|     with torch.no_grad(): | ||||
|         for data, target in test_loader: | ||||
|             data, target = data.to(device), target.to(device) | ||||
|             output = model(data) | ||||
|             # sum up batch loss | ||||
|             test_loss += F.nll_loss(output, target, reduction="sum").item() | ||||
|             # get the index of the max log-probability | ||||
|             pred = output.max(1, keepdim=True)[1] | ||||
|             correct += pred.eq(target.view_as(pred)).sum().item() | ||||
| 
 | ||||
|     test_loss /= len(test_loader.dataset) | ||||
|     logging.info("{{metricName: accuracy, metricValue: {:.4f}}};{{metricName: loss, metricValue: {:.4f}}}\n".format( | ||||
|         float(correct) / (len(test_loader.dataset) / WORLD_SIZE), test_loss)) | ||||
| 
 | ||||
| 
 | ||||
| def should_distribute(): | ||||
|     return dist.is_available() and WORLD_SIZE > 1 | ||||
| 
 | ||||
| 
 | ||||
| def is_distributed(): | ||||
|     return dist.is_available() and dist.is_initialized() | ||||
| 
 | ||||
| 
 | ||||
| def main(): | ||||
|     # Training settings | ||||
|     parser = argparse.ArgumentParser(description="PyTorch MNIST Example") | ||||
|     parser.add_argument("--batch-size", type=int, default=64, metavar="N", | ||||
|                         help="input batch size for training (default: 64)") | ||||
|     parser.add_argument("--test-batch-size", type=int, default=1000, metavar="N", | ||||
|                         help="input batch size for testing (default: 1000)") | ||||
|     parser.add_argument("--epochs", type=int, default=10, metavar="N", | ||||
|                         help="number of epochs to train (default: 10)") | ||||
|     parser.add_argument("--lr", type=float, default=0.01, metavar="LR", | ||||
|                         help="learning rate (default: 0.01)") | ||||
|     parser.add_argument("--momentum", type=float, default=0.5, metavar="M", | ||||
|                         help="SGD momentum (default: 0.5)") | ||||
|     parser.add_argument("--no-cuda", action="store_true", default=False, | ||||
|                         help="disables CUDA training") | ||||
|     parser.add_argument("--seed", type=int, default=1, metavar="S", | ||||
|                         help="random seed (default: 1)") | ||||
|     parser.add_argument("--log-interval", type=int, default=10, metavar="N", | ||||
|                         help="how many batches to wait before logging training status") | ||||
|     parser.add_argument("--log-path", type=str, default="", | ||||
|                         help="Path to save logs. Print to StdOut if log-path is not set") | ||||
|     parser.add_argument("--save-model", action="store_true", default=False, | ||||
|                         help="For Saving the current Model") | ||||
| 
 | ||||
|     if dist.is_available(): | ||||
|         parser.add_argument("--backend", type=str, help="Distributed backend", | ||||
|                             choices=[dist.Backend.GLOO, | ||||
|                                      dist.Backend.NCCL, dist.Backend.MPI], | ||||
|                             default=dist.Backend.GLOO) | ||||
|     args = parser.parse_args() | ||||
| 
 | ||||
|     # Use this format (%Y-%m-%dT%H:%M:%SZ) to record timestamp of the metrics. | ||||
|     # If log_path is empty print log to StdOut, otherwise print log to the file. | ||||
|     if args.log_path == "": | ||||
|         logging.basicConfig( | ||||
|             format="%(asctime)s %(levelname)-8s %(message)s", | ||||
|             datefmt="%Y-%m-%dT%H:%M:%SZ", | ||||
|             level=logging.DEBUG) | ||||
|     else: | ||||
|         logging.basicConfig( | ||||
|             format="%(asctime)s %(levelname)-8s %(message)s", | ||||
|             datefmt="%Y-%m-%dT%H:%M:%SZ", | ||||
|             level=logging.DEBUG, | ||||
|             filename=args.log_path) | ||||
| 
 | ||||
|     use_cuda = not args.no_cuda and torch.cuda.is_available() | ||||
|     if use_cuda: | ||||
|         print("Using CUDA") | ||||
| 
 | ||||
|     torch.manual_seed(args.seed) | ||||
| 
 | ||||
|     device = torch.device("cuda" if use_cuda else "cpu") | ||||
| 
 | ||||
|     if should_distribute(): | ||||
|         print("Using distributed PyTorch with {} backend".format( | ||||
|             args.backend), flush=True) | ||||
|         dist.init_process_group(backend=args.backend) | ||||
| 
 | ||||
|     kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {} | ||||
| 
 | ||||
|     print("Before downloading data", flush=True) | ||||
|     train_data = datasets.FashionMNIST("./data", | ||||
|                             train=True, | ||||
|                             download=True, | ||||
|                             transform=transforms.Compose([ | ||||
|                             transforms.ToTensor() | ||||
|                             ])) | ||||
| 
 | ||||
| 
 | ||||
|     test_data = datasets.FashionMNIST("./data", | ||||
|                             train=True, | ||||
|                             download=True, | ||||
|                             transform=transforms.Compose([ | ||||
|                             transforms.ToTensor() | ||||
|                             ])) | ||||
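|     # When running distributed, shard the training data across ranks with DistributedSampler | ||||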
|     if is_distributed(): | ||||
|         train_sampler = DistributedSampler(train_data, num_replicas=WORLD_SIZE, rank=RANK, shuffle=True, drop_last=False, seed=args.seed) | ||||
|         test_sampler = DistributedSampler(test_data, num_replicas=WORLD_SIZE, rank=RANK, shuffle=True, drop_last=False, seed=args.seed) | ||||
|         train_loader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size,sampler=train_sampler, **kwargs) | ||||
|         test_loader = torch.utils.data.DataLoader(test_data, batch_size=args.test_batch_size, shuffle=False, **kwargs) | ||||
|     else: | ||||
|         train_loader = torch.utils.data.DataLoader( | ||||
|             train_data, | ||||
|             batch_size=args.batch_size, shuffle=True, **kwargs) | ||||
|         test_loader = torch.utils.data.DataLoader(test_data, | ||||
|         batch_size=args.test_batch_size, shuffle=False, **kwargs) | ||||
| 
 | ||||
|     print("After downloading data", flush=True) | ||||
| 
 | ||||
|     test_loader = torch.utils.data.DataLoader( | ||||
|         datasets.FashionMNIST("./data", | ||||
|                               train=False, | ||||
|                               transform=transforms.Compose([ | ||||
|                                   transforms.ToTensor() | ||||
|                               ])), | ||||
|         batch_size=args.test_batch_size, shuffle=False, **kwargs) | ||||
| 
 | ||||
|     model = Net().to(device) | ||||
| 
 | ||||
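|     # In distributed mode, wrap the model in DistributedDataParallel so gradients are synchronized across ranks | ||||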
|     if is_distributed(): | ||||
|         Distributor = nn.parallel.DistributedDataParallel | ||||
|         model = Distributor(model) | ||||
| 
 | ||||
|     optimizer = optim.SGD(model.parameters(), lr=args.lr, | ||||
|                           momentum=args.momentum) | ||||
| 
 | ||||
| 
 | ||||
|     start = time.perf_counter() | ||||
|     cpu_start = time.process_time() | ||||
| 
 | ||||
|     for epoch in range(1, args.epochs + 1): | ||||
|         train(args, model, device, train_loader, optimizer, epoch) | ||||
|         test(args, model, device, test_loader, epoch) | ||||
| 
 | ||||
|     cpu_end = time.process_time() | ||||
|     end = time.perf_counter() | ||||
|     print("CPU Elapsed time:", cpu_end - cpu_start) | ||||
|     print("Elapsed time:", end - start) | ||||
| 
 | ||||
|     if (args.save_model): | ||||
|         torch.save(model.state_dict(), "mnist_cnn.pt") | ||||
| 
 | ||||
|     if is_distributed(): | ||||
|         dist.destroy_process_group() | ||||
| 
 | ||||
| 
 | ||||
| if __name__ == "__main__": | ||||
|     main() | ||||

demos/pytorch/distributed/pytorch.yaml | 39 (new file)
							| @ -0,0 +1,39 @@ | ||||
| includes: | ||||
|   - base.yaml | ||||
| targets: | ||||
|   - target: /bin | ||||
|     createlinks: | ||||
|       - src: /opt/python-occlum/bin/python3 | ||||
|         linkname: python3 | ||||
|     copy: | ||||
|       - files: | ||||
|           - /opt/occlum/toolchains/busybox/glibc/busybox | ||||
|   # python packages | ||||
|   - target: /opt | ||||
|     copy:  | ||||
|       - dirs: | ||||
|           - ../python-occlum | ||||
|   # python code | ||||
|   - target: / | ||||
|     copy: | ||||
|       - files:  | ||||
|           - ../mnist.py | ||||
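|   # name service libraries needed for hostname/DNS resolution inside the enclave | ||||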
|   - target: /opt/occlum/glibc/lib | ||||
|     copy: | ||||
|       - files: | ||||
|           - /lib/x86_64-linux-gnu/libnss_dns.so.2 | ||||
|           - /lib/x86_64-linux-gnu/libnss_files.so.2 | ||||
|   # etc files | ||||
|   - target: /etc | ||||
|     copy: | ||||
|       - dirs: | ||||
|           - /etc/ssl | ||||
|       - files: | ||||
|           - /etc/nsswitch.conf | ||||
|   # CA files | ||||
|   - target: /ppml/certs/ | ||||
|     copy: | ||||
|       - files: | ||||
|           - ../myCA.pem | ||||
|           - ../test.key | ||||
|           - ../test.crt | ||||

demos/pytorch/standalone/.gitignore | 3 (new file)
							| @ -0,0 +1,3 @@ | ||||
| occlum_instance/ | ||||
| miniconda/ | ||||
| Miniconda3* | ||||
| @ -10,22 +10,22 @@ Use the nn package to define our model as a sequence of layers. nn.Sequential is | ||||
| 
 | ||||
| This tutorial is written under the assumption that you have Docker installed and use Occlum in a Docker container. | ||||
| 
 | ||||
| Occlum is compatible with glibc-supported Python, we employ miniconda as python installation tool. You can import PyTorch packages using conda. Here, miniconda is automatically installed by install_python_with_conda.sh script, the required python and PyTorch packages for this project are also loaded by this script. Here, we take occlum/occlum:0.23.0-ubuntu18.04 as example. | ||||
| Occlum is compatible with glibc-supported Python, so we use Miniconda as the Python installation tool and install the PyTorch packages with conda. Miniconda is installed automatically by the `install_python_with_conda.sh` script, which also installs the Python and PyTorch packages required for this project. Here, we take `occlum/occlum:0.29.3-ubuntu20.04` as an example. | ||||
| 
 | ||||
| Step 1 (on the host): Start an Occlum container | ||||
| ``` | ||||
| docker pull occlum/occlum:0.23.0-ubuntu18.04 | ||||
| docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.23.0-ubuntu18.04 bash | ||||
| docker pull occlum/occlum:0.29.3-ubuntu20.04 | ||||
| docker run -it --name=pythonDemo --device /dev/sgx/enclave occlum/occlum:0.29.3-ubuntu20.04 bash | ||||
| ``` | ||||
| 
 | ||||
| Step 2 (in the Occlum container): Download miniconda and install python to prefix position. | ||||
| ``` | ||||
| cd /root/demos/pytorch | ||||
| cd /root/demos/pytorch/standalone | ||||
| bash ./install_python_with_conda.sh | ||||
| ``` | ||||
| 
 | ||||
| Step 3 (in the Occlum container): Run the sample code on Occlum | ||||
| ``` | ||||
| cd /root/demos/pytorch | ||||
| cd /root/demos/pytorch/standalone | ||||
| bash ./run_pytorch_on_occlum.sh | ||||
| ``` | ||||