A relay notifier that observes the underlying endpoint is added as the
notifier for the socket. It broadcasts to its observers when either end
of the channel has IoEvents.
Read, write, connect, and accept each have a blocking and a nonblocking
mode. In blocking mode, they may block after the status lock is acquired,
resulting in a potential deadlock. This commit resolves the deadlock issue.
1. Add OCCLUM_COV to conditionally enable gcov profiling for libos Rust
code;
2. Add a makefile target to locally generate the coverage report in html
format.
1. Five new ioctl commands of /dev/sgx are added for occlum
applications to securely get and verify DCAP quote;
2. To simplify DCAP usage, not all the functions of the Intel DCAP
package are exposed to developers;
3. The test can only run on a platform with the DCAP driver installed;
4. A macro OCCLUM_DISABLE_DCAP is used to separate the DCAP code from
the other code;
5. Skip the DCAP test when the DCAP driver is not detected or in simulation mode.
1. Implement type-safe functions;
2. Improve the correctness of nearly all the functions;
3. Improve the readability by introducing Listener and Endpoint for StreamUnix;
4. Substitute RingBuf with Channel in Unix socket.
This bugfix ensures that when an object of Producer/Consumer for
channels is dropped, its shutdown method is called automatically. This ensures
that the peer of a Producer/Consumer gets notified and won't wait indefinitely.
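The idea can be sketched as a `Drop` implementation that calls `shutdown`. This is a minimal illustration, not the LibOS code: the `Shared` struct, the `new_channel` helper, and the `is_peer_shutdown` method are hypothetical names introduced here, and only the producer side is shown.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// State shared by the two ends of a channel (hypothetical, for illustration).
struct Shared {
    producer_shutdown: AtomicBool,
}

/// The sending end of a channel.
struct Producer {
    shared: Arc<Shared>,
}

/// The receiving end of a channel.
struct Consumer {
    shared: Arc<Shared>,
}

impl Producer {
    /// Marks this end as closed so that the peer can stop waiting for data.
    fn shutdown(&self) {
        self.shared.producer_shutdown.store(true, Ordering::Release);
    }
}

impl Drop for Producer {
    fn drop(&mut self) {
        // Dropping the producer always shuts it down, so the peer is
        // notified even if the owner never called shutdown explicitly.
        self.shutdown();
    }
}

impl Consumer {
    /// Returns true once the producer has been shut down or dropped.
    fn is_peer_shutdown(&self) -> bool {
        self.shared.producer_shutdown.load(Ordering::Acquire)
    }
}

/// Creates a connected producer/consumer pair.
fn new_channel() -> (Producer, Consumer) {
    let shared = Arc::new(Shared {
        producer_shutdown: AtomicBool::new(false),
    });
    (Producer { shared: shared.clone() }, Consumer { shared })
}
```

A real channel would also wake up any thread blocked on the peer end; the flag above only models the notification.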
The current tcmalloc has a memory leak issue, so make it optional. By
default, dlmalloc is used. Enable tcmalloc with the command below:
make TCMALLOC=Y
Usually, files are unregistered from an epoll file via the EPOLL_CTL_DEL command
explicitly. But for the sake of users' convenience, Linux supports
unregistering a file automatically from the epoll files that monitor the file
when the file is closed. This commit adds this capability.
When using the optimized string lib in Occlum, the memset function
uses the xmm0 register. As a result, the FP area initialization code would
modify the FP area before saving it. So just ignore the FP area
initialization code.
1. >> has higher precedence than &. Use parentheses to conduct & first;
2. In the latest Intel software developer's manual, cpuid leaf 06H EDX
is related to the logical processor.
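The precedence pitfall in item 1 can be reproduced in a few lines. This is an illustrative sketch with hypothetical function names, not the patched cpuid code; Rust follows the same rule as C here, with `>>` binding tighter than `&`.

```rust
/// Extracts bits 4..8 of `value`: the mask is applied first, then the shift.
fn high_nibble(value: u32) -> u32 {
    (value & 0xF0) >> 4
}

/// Buggy variant: without parentheses this parses as `value & (0xF0 >> 4)`,
/// i.e. `value & 0x0F`, which returns the low nibble instead.
fn high_nibble_buggy(value: u32) -> u32 {
    value & 0xF0 >> 4
}
```

For an input of `0xAB`, the parenthesized version yields `0x0A`, while the buggy one yields `0x0B`.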
Before this commit, the epoll implementation works by simply delegating to the
host OS through OCall. One major problem with this implementation is
that it can only handle files that are backed by a file of the host OS
(e.g., sockets), but not those that are mainly implemented by the LibOS
(e.g., pipes). Therefore, a new epoll implementation that can handle all
kinds of files is needed.
This commit completely rewrites the epoll implementation by leveraging
the new event subsystem. Now the new epoll can handle all file types:
1. Host files, e.g., sockets, eventfd;
2. LibOS files, e.g., pipes;
3. Hybrid files, e.g., epoll files.
For a new file type to support epoll, it only needs to implement no
more than four methods of the File trait:
* poll (required for all file types);
* notifier (required for all file types);
* host_fd (only required for host files);
* recv_host_events (only required for host files).
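As a rough sketch of what this looks like for a LibOS-only file, the trait below lists the four methods with default implementations for the host-only ones. The method names come from the list above; the signatures and the `IoEvents`, `IoNotifier`, and `Pipe` types are simplified assumptions, not the actual definitions in the LibOS.

```rust
/// I/O readiness flags (simplified stand-in for the LibOS type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct IoEvents(u32);

impl IoEvents {
    const EMPTY: IoEvents = IoEvents(0x0);
    const IN: IoEvents = IoEvents(0x1);
}

/// Placeholder for the notifier that broadcasts I/O events to observers.
struct IoNotifier;

trait File {
    /// Returns the events that are currently ready (all file types).
    fn poll(&self) -> IoEvents;

    /// Returns the notifier that an epoll file can observe (all file types).
    fn notifier(&self) -> Option<&IoNotifier>;

    /// Returns the host file descriptor (host files only).
    fn host_fd(&self) -> Option<i32> {
        None
    }

    /// Receives events reported by the host OS (host files only).
    fn recv_host_events(&self, _events: IoEvents) {}
}

/// A LibOS file: it implements only the two required methods.
struct Pipe {
    notifier: IoNotifier,
    has_data: bool,
}

impl File for Pipe {
    fn poll(&self) -> IoEvents {
        if self.has_data {
            IoEvents::IN
        } else {
            IoEvents::EMPTY
        }
    }

    fn notifier(&self) -> Option<&IoNotifier> {
        Some(&self.notifier)
    }
}
```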
1. Introduce channels, which provide an efficient means for IPC;
2. Leverage channels to rewrite pipe, improving the performance (3X),
robustness, and readability.
This pipe rewrite is not yet complete: more commits will be added to
implement poll and epoll for pipe.
An event can be anything ranging from the exit of a process (interesting
to `wait4`) to the arrival of a blocked signal (interesting to
`sigwaitinfo`), from the completion of a file operation (interesting to
`epoll`) to the change of a file status (interesting to `inotify`).
To meet the event-related demands from various subsystems, this event
subsystem is designed to provide a set of general-purpose primitives:
* `Waiter`, `Waker`, and `WaiterQueue` are primitives to put threads
to sleep and later wake them up.
* `Event`, `Observer`, and `Notifier` are primitives to handle and
broadcast events.
* `WaiterQueueObserver` implements the common pattern of waking up
threads once some interesting events happen.
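The relationship between `Event`, `Observer`, and `Notifier` might be sketched as follows. This is a simplified illustration under stated assumptions: the real primitives are likely generic and more elaborate, and the `Event` variants, the `CountingObserver` type, and the `observe` helper are hypothetical names used only for this example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, Weak};

/// A hypothetical, concrete event type used only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    ProcessExited(u32),
    SignalArrived(u8),
}

/// Something that wants to be told about events.
trait Observer: Send + Sync {
    fn on_event(&self, event: &Event);
}

/// Broadcasts events to all registered observers that are still alive.
struct Notifier {
    observers: Mutex<Vec<Weak<dyn Observer>>>,
}

impl Notifier {
    fn new() -> Self {
        Notifier { observers: Mutex::new(Vec::new()) }
    }

    fn register(&self, observer: Weak<dyn Observer>) {
        self.observers.lock().unwrap().push(observer);
    }

    fn broadcast(&self, event: &Event) {
        let mut observers = self.observers.lock().unwrap();
        // Notify live observers and forget the ones that were dropped.
        observers.retain(|weak| match weak.upgrade() {
            Some(observer) => {
                observer.on_event(event);
                true
            }
            None => false,
        });
    }

    fn num_observers(&self) -> usize {
        self.observers.lock().unwrap().len()
    }
}

/// An observer that simply counts the events it has seen.
struct CountingObserver {
    count: AtomicUsize,
}

impl CountingObserver {
    fn new() -> Arc<Self> {
        Arc::new(CountingObserver { count: AtomicUsize::new(0) })
    }

    fn count(&self) -> usize {
        self.count.load(Ordering::SeqCst)
    }
}

impl Observer for CountingObserver {
    fn on_event(&self, _event: &Event) {
        self.count.fetch_add(1, Ordering::SeqCst);
    }
}

/// Registers `observer` with `notifier` through a weak reference.
fn observe(notifier: &Notifier, observer: &Arc<CountingObserver>) {
    let as_dyn: Arc<dyn Observer> = observer.clone();
    notifier.register(Arc::downgrade(&as_dyn));
}
```

Holding observers by weak reference lets the notifier drop stale entries on its own, which is one possible design choice rather than a description of the actual implementation.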
Socket-related ocalls, e.g., sendto, sendmsg, and write, may cause SIGPIPE
on the host. Since the ocall is called by libos, this kind of signal should
be handled in libos. We ignore SIGPIPE in host and raise the same signal
in libos if the return value of the above ocalls is EPIPE. In this way
the signal is handled by libos.
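The EPIPE-to-SIGPIPE translation can be sketched like this. The `PendingSignals` struct and the `handle_send_result` function are hypothetical names for illustration; the numeric constants are the Linux x86-64 values.

```rust
const EPIPE: i32 = 32; // Linux value
const SIGPIPE: i32 = 13; // Linux value

/// Hypothetical record of the signals raised inside the LibOS.
#[derive(Default)]
struct PendingSignals {
    signals: Vec<i32>,
}

/// Converts the raw result of a socket ocall into a `Result`. If the host
/// reported EPIPE, SIGPIPE is raised inside the LibOS, because the host-side
/// SIGPIPE is ignored.
fn handle_send_result(
    ret: isize,
    errno: i32,
    pending: &mut PendingSignals,
) -> Result<usize, i32> {
    if ret >= 0 {
        return Ok(ret as usize);
    }
    if errno == EPIPE {
        pending.signals.push(SIGPIPE);
    }
    Err(errno)
}
```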
This commit mainly accomplishes two things:
1. Use a makefile to manage dependencies for `occlum build`, which can save a lot of time;
2. Move the dirs `build` and `run` out of `.occlum`, and remove the env var "OCCLUM_INSTANCE_DIR".
Rlimits are now consistent with the memory space limits defined in Occlum.json. A specific
memory size limit can be set for a child process with the `prlimit` syscall or with the `ulimit`
command in a shell script.
Struct sigaction has a field named sa_mask, which specifies the blocked
signals while executing the signal handler. Previously, this field was not
supported. This commit adds this missing feature.
There are scenarios where the available CPUs are less than all the CPUs
on the machine. Therefore, sched_get/setaffinity should be allowed when
the input buffer size is no less than the available CPUs but less than
all the CPUs.
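The relaxed size check might look like the sketch below. It assumes that the buffer is a CPU bitmask with one bit per CPU, as with `cpu_set_t`; the function name is hypothetical.

```rust
const EINVAL: i32 = 22; // Linux value
const BITS_PER_BYTE: usize = 8;

/// Accepts a CPU mask buffer as long as it can hold one bit for every
/// available CPU, even if it is too small for all CPUs on the machine.
fn check_cpuset_buf(buf_size_in_bytes: usize, available_cpus: usize) -> Result<(), i32> {
    if buf_size_in_bytes * BITS_PER_BYTE < available_cpus {
        Err(EINVAL)
    } else {
        Ok(())
    }
}
```

For example, a one-byte buffer is accepted when four CPUs are available, regardless of how many CPUs the machine has in total.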
This reverts commit 1e456f025d6b4e34a726180e7a27a04424fe79d1.
The reverted commit results in a segmentation fault when the application
munmaps its own stack. It should be re-applied after removing the dependency
of sysret on the user space stack.
The new interrupt subsystem breaks the simulation mode in two ways:
1. Signal 64 is not handled by Intel SGX SDK in simulation mode. An
unhandled real-time signal crashes the process.
2. The newly-enabled test case exit_group depends on interrupts. But
enclave interrupts, like enclave exceptions, are not supported in
simulation mode.
This commit ensures signal 64 is ignored by default and exit_group test
case is not enabled in simulation mode.
Before this commit, events like signals and exit_group are handled by
LibOS threads in a cooperative fashion: if the user code executed by a
LibOS thread does not invoke system calls (e.g., a busy loop), then the LibOS
won't have any opportunity to take control and handle events.
With the help from the POSIX signal-based interrupt mechanism of
Occlum's version of Intel SGX SDK, the LibOS can now interrupt the
execution of arbitrary user code in a LibOS thread by sending real-time
POSIX signals (the signal number is 64) to it. These signals are sent by
a helper thread spawned by the Occlum PAL. The helper thread periodically
enters the enclave to check if there are any LibOS threads with
pending events. If there are, the helper thread broadcasts POSIX signals to
them. When interrupted by a signal, the receiver LibOS thread may be in
one of the two previously problematic states in terms of event handling:
1. Executing non-cooperative user code (e.g., a busy loop). In this
case, the signal will trigger an interrupt handler inside the enclave,
which can then enter the LibOS kernel to deal with any pending events.
2. Executing an OCall that invokes blocking system calls (e.g., futex,
nanosleep, or blocking I/O). In this case, the signal will interrupt the
blocking system call so that the OCall can return back to the enclave.
Thanks to the new interrupt subsystem, some event-based system calls
are made robust. One such example is exit_group. We can now guarantee
that exit_group can force any thread in a process to exit.
The first bug is a race condition when acquiring the lock of a process's
parent. Code with this race condition looks like the following:
```rust
let process : ProcessRef = current!().process();
let parent : ProcessRef = process.parent();
let parent_guard : SgxMutexGuard<ProcessInner> = parent.inner();
// This assertion may fail because the process's parent may change to another
// process before the lock is acquired
assert!(parent.pid() == process.parent().pid());
```
The second bug is that when a process exits, its child processes are
not transferred to the idle process correctly.
1. Move the memory zeroization of mmap to munmap to increase mmap
performance;
2. Do memory zeroization during the drop of VMManager to guarantee that all
allocated memory is zeroized before the next allocation.
It turns out taking a lock in every system call is a significant
performance bottleneck. In light of this finding, we replace a mutex in
a critical path of system call with an atomic boolean.
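The general technique can be sketched as follows. The commit message does not say which state the mutex guarded, so the `ThreadState` struct and its `forced_exit` flag are purely hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical per-thread state that is checked on every system call.
struct ThreadState {
    // Previously something like `Mutex<bool>`; an atomic avoids taking a
    // lock on the hot path.
    forced_exit: AtomicBool,
}

impl ThreadState {
    fn new() -> Self {
        ThreadState { forced_exit: AtomicBool::new(false) }
    }

    /// Called rarely, e.g., by another thread that requests an exit.
    fn request_exit(&self) {
        self.forced_exit.store(true, Ordering::Release);
    }

    /// Called on every system call; a single atomic load, no lock.
    fn should_exit(&self) -> bool {
        self.forced_exit.load(Ordering::Acquire)
    }
}
```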
This rewrite serves three purposes:
1. Fix some subtle bugs in the old implementation;
2. Implement mremap using mmap and munmap so that mremap can automatically
enjoy new features (e.g., mprotect and memory permissions) once mmap and
munmap support them;
3. Write down the invariants held by VMManager explicitly so that the correctness
of the new implementation can be reasoned about more easily.
Update the occlum.json to align with the gen_enclave_conf design.
Below are the two updated structures:
"metadata": {
"product_id": 0,
"version_number": 0,
"debuggable": true
},
"resource_limits": {
"max_num_of_threads": 32,
"kernel_space_heap_size": "32MB",
"kernel_space_stack_size": "1MB",
"user_space_size": "256MB"
}
On lightweight Linux distributions such as Alpine, getpwuid()
returns NULL with errno set to ENOENT. This patch fixes the crash
caused by this situation.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Add an "untrusted" section for environment variables defined in Occlum.json. Environment
variables defined in "default" are shown in libos directly. Environment variables
defined in "untrusted" can be passed from occlum run or the PAL layer and can override
the values in "default", and thus are considered "untrusted".
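One way to read these rules is the merge function below. It is a sketch under assumptions: the function name and the representation of the two sections (key/value pairs for "default", variable names for "untrusted") are hypothetical, and host-provided variables that are not listed in "untrusted" are assumed to be dropped.

```rust
use std::collections::BTreeMap;

/// Builds the environment seen by the application.
/// `default` holds the trusted key/value pairs, `untrusted_names` lists the
/// variables the host may set, and `from_host` is what the host passed in.
fn merge_env(
    default: &[(&str, &str)],
    untrusted_names: &[&str],
    from_host: &[(&str, &str)],
) -> BTreeMap<String, String> {
    let mut env: BTreeMap<String, String> = default
        .iter()
        .map(|(name, value)| (name.to_string(), value.to_string()))
        .collect();
    for (name, value) in from_host {
        // Only variables declared as "untrusted" may come from the host;
        // they override the value in "default" if one exists.
        if untrusted_names.contains(name) {
            env.insert(name.to_string(), value.to_string());
        }
    }
    env
}
```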
Fix std::alloc::Alloc not found
The latest Rust renames the trait to std::alloc::AllocRef.
Update the docker files to support sgx 2.9.1
Remove the compilerRT dependency for rust sdk update
Before this commit, the three ECalls of the LibOS enclave do not give
the exact reason on error. In this commit, we modify the enclave entry code
to return the errno and list all possible values of errno in Enclave.edl.
In this commit, we add eight signal-related syscalls
* kill
* tkill
* tgkill
* rt_sigaction
* rt_sigreturn
* rt_sigprocmask
* rt_sigpending
* exit_group
We implement the following major features for signals:
* Generate, mask, and deliver signals
* Support user-defined signal handlers
* Support nested invocation of signal handlers
* Support passing arguments: signum, sigaction, and ucontext
* Support both process-directed and thread-directed signals
* Capture hardware exceptions and convert them to signals
* Deliver fatal signals (like SIGKILL) to kill processes gracefully
But we still have gaps, including but not limited to the points below:
* Convert #PF (page fault) and #GP (general protection) exceptions to signals
* Force delivery of signals via interrupt
* Support simulation mode
When only listen is called on a Unix socket, its underlying object is not
created, but its status becomes listening. Before this commit, closing the
socket at this point would cause a panic.
The next generation of Intel CPUs does not support Intel MPX. Enabling MPX
by default crashes the LibOS on startup. So we disable MPX by default. The
long term plan is to turn on/off MPX via compiling options.
This commit improves both the readability and correctness of the scheduling-related
system calls. In terms of readability, it extracts all scheduling-related code
out of the process/ directory and puts it in a sched/ directory. In terms
of correctness, the new scheduling subsystem introduces CpuSet and SchedAgent
types to maintain and manipulate CPU scheduler settings in a secure and robust way.
As a major rewrite of the process/thread subsystem, this commit:
1. Implements threads as a first-class object, which represents a group of OS resources
and a thread of execution;
2. Implements processes as a first-class object that manages threads and maintains
the parent-child relationship between processes;
3. Refactors the code in the process subsystem to follow the improved coding style and
conventions that emerged in recent commits;
4. Refactors the code in other subsystems to use the new process/thread subsystem.
SEFS depends on version 0.9 of bitvec crate, which has been yanked on crates.io
by the crate author for some reasons. To fix this, we upgrade to the latest
version of bitvec crate.
This commit introduces a unified logging strategy, summarized as below:
1. Use `error!` to mark errors or unexpected conditions, e.g., a
`Result::Err` returned from a system call.
2. Use `warn!` to warn about potentially problematic issues, e.g.,
executing a workaround or fake implementation.
3. Use `info!` to show important events (from users' perspective) in
normal execution, e.g., creating/exiting a process/thread.
4. Use `debug!` to track major events in normal execution, e.g., the
high-level arguments of a system call.
5. Use `trace!` to record the most detailed info, e.g., when a system
call enters and exits the LibOS.
Now one can specify the log level of the LibOS by setting the `OCCLUM_LOG_LEVEL`
environment variable. The possible values are "off", "error", "warn",
"info", "debug", and "trace".
However, for the sake of security, the log level of a release enclave
(DisableDebug = 1 in Enclave.xml) is always "off" (i.e., no log) regardless of
the log level specified by the untrusted environment.
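The level selection described above might be sketched like this. The `LogLevel` enum and the function name are hypothetical; the sketch assumes that a "debug" value exists alongside the `debug!` macro and that an unset or unrecognized value means "off".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LogLevel {
    Off,
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Picks the effective log level from the value of `OCCLUM_LOG_LEVEL`
/// (if any). A release enclave never logs, whatever the untrusted
/// environment asks for.
fn effective_log_level(requested: Option<&str>, is_release_enclave: bool) -> LogLevel {
    if is_release_enclave {
        return LogLevel::Off;
    }
    match requested {
        Some("error") => LogLevel::Error,
        Some("warn") => LogLevel::Warn,
        Some("info") => LogLevel::Info,
        Some("debug") => LogLevel::Debug,
        Some("trace") => LogLevel::Trace,
        // Assumption: "off", unset, and unknown values all disable logging.
        _ => LogLevel::Off,
    }
}
```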
This commit introduces a system call table, which brings several benefits:
1. The table is a centralized info hub where one can find an answer to every
question about system calls, e.g., what the number and arguments of a
system call are, whether it is implemented or supported, and if so, which
function actually implements it.
2. System call-related code can be automatically derived from the system call
table through a clever use of macros. In this way, the code avoids repeating
itself.
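A toy version of a macro-driven system call table is shown below. It is a sketch of the general technique only: the macro name, the table layout, and the handler functions are hypothetical and much simpler than what a real LibOS needs. The two system call numbers are the Linux x86-64 ones.

```rust
/// Generates a `SyscallNum` enum and a `dispatch` function from one table.
macro_rules! syscall_table {
    ($( ($name:ident = $num:literal, $handler:ident) ),* $(,)?) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        enum SyscallNum {
            $( $name = $num ),*
        }

        impl SyscallNum {
            /// Looks up a system call by its number.
            fn from_u32(num: u32) -> Option<Self> {
                match num {
                    $( $num => Some(SyscallNum::$name), )*
                    _ => None,
                }
            }
        }

        /// Routes a system call number to the function that implements it.
        fn dispatch(num: u32, arg: isize) -> Result<isize, &'static str> {
            match SyscallNum::from_u32(num) {
                $( Some(SyscallNum::$name) => $handler(arg), )*
                None => Err("ENOSYS"),
            }
        }
    };
}

// Hypothetical handlers with a deliberately simplified signature.
fn do_close(fd: isize) -> Result<isize, &'static str> {
    if fd >= 0 { Ok(0) } else { Err("EBADF") }
}

fn do_getpid(_unused: isize) -> Result<isize, &'static str> {
    Ok(1)
}

// The single source of truth: name, number, and implementing function.
syscall_table! {
    (Close = 3, do_close),
    (Getpid = 39, do_getpid),
}
```

Adding a system call then means adding one line to the table; the enum variant, the number lookup, and the dispatch arm are all derived from it.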
Before this commit, there are two strange bugs:
1. No backtraces are displayed on panic by Rust; and,
2. Thread local storage in Rust sometimes causes panics.
It turns out that the root cause of the two bugs is the same: Occlum's
patch to Intel SGX SDK that informs the SDK about the stack range of the current
LibOS user-level thread. The problem with this patch is that it modifies some
fundamental data structures and Rust SGX SDK does not know about the modification.
This causes Rust SGX SDK to panic under certain conditions.
To resolve the conflict for good, this commit gets rid of the patch to Intel
SGX SDK by updating SDK's stack ranges upon user/kernel switch.
1. Use arch_prctl to replace RDFSBASE/WRFSBASE
Ptrace can't get the right value if WRFSBASE is called, which
makes the debugger fail in simulation mode. Use arch_prctl
to replace these instructions in simulation mode.
2. Disable the busy thread in exit_group test
exit_group doesn't have a real implementation yet, and the exit_group
test causes a core dump under SGX simulation mode.
Disabling the busy loop thread makes the core dump disappear.
3. Add SDK lib path to LD_LIBRARY_PATH
The linker sometimes can't find urts_sim and uae_service_sim when
running. Explicitly add the path to LD_LIBRARY_PATH when running
the occlum command.
Signed-off-by: sanqian.hcy <sanqian.hcy@antfin.com>
This commit is a dummy implementation of file advisory locks.
Specifically, for regular files, fcntl `F_SETLK` (i.e., acquiring
or releasing locks) always succeeds and fcntl `F_GETLK` (i.e., testing locks)
always returns no locks.
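The dummy behavior can be sketched as follows. The `Flock` struct is a trimmed-down, hypothetical stand-in for `struct flock`, and the function and enum names are made up for this example; `F_UNLCK` uses the Linux value.

```rust
const F_UNLCK: i16 = 2; // Linux value: "no lock"
const F_WRLCK: i16 = 1; // Linux value: write lock

/// Simplified stand-in for `struct flock`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Flock {
    l_type: i16,
    l_start: i64,
    l_len: i64,
}

enum LockCmd {
    GetLk,
    SetLk,
}

/// Dummy advisory locking for regular files: setting a lock always
/// succeeds without recording anything, and testing a lock always
/// reports that no conflicting lock exists.
fn do_fcntl_lock(cmd: LockCmd, lock: &mut Flock) -> Result<(), i32> {
    match cmd {
        LockCmd::SetLk => Ok(()),
        LockCmd::GetLk => {
            lock.l_type = F_UNLCK;
            Ok(())
        }
    }
}
```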
1. Move the system call handling functions into the "syscalls.rs"
2. Split syscall memory safe implementations into small sub-modules
3. Move the unix_socket and io_multiplexing into "net"
4. Remove some unnecessary code
It is slow to allocate big buffers using SGX SDK's malloc. Even worse, it
consumes a large amount of precious trusted memory inside enclaves. This
commit avoids using trusted buffers and allocates untrusted buffers for
sendmsg/recvmsg directly via OCall, thus improving the performance of
sendmsg/recvmsg. Note that this optimization does not affect the security of
network data as it has to be sent/received via OCalls.
Before this commit, using custom C types in ECalls/OCalls defined in Occlum's
EDL was cumbersome. Now this issue is resolved by providing the `occlum_edl_types.h`
header file. There are two versions of this file: one is under
`src/libos/include/edl/` for LibOS, the other is under
`src/pal/include/edl/` for PAL. So now to define a new custom C type, just
edit the two versions of `occlum_edl_types.h` to define the type.
SGX SDK's sgx_init_quote may return SGX_ERROR_BUSY, which was previously not
handled. The implementation of ioctl for /dev/sgx is now fixed to handle this
error.
By providing Occlum PAL as a shared library, it is now possible to embed and
use Occlum in a user-controlled process (instead of an Occlum-controlled one).
The APIs of Occlum PAL can be found in `src/pal/include/occlum_pal_api.h`. The
Occlum PAL library, namely `libocclum-pal.so`, can be found in `.occlum/build/lib`.
To use the library, check out the source code of `occlum-run` (under
`src/run`), which can be seen as sample code for using the Occlum PAL
library.
* Fix readlink from `/proc/self/exe` to get absolute path of the executable file
* Add readlink from `/proc/self/fd/<fd>` to get the file's real path
Note that for now we only support reading links _statically_, meaning that even
if the file or any of its ancestors is moved after the file is opened, the
absolute paths obtained from the API do not change.
The output buffer given to getdents may not be large enough for the next directory
entry. If no directory entries have been loaded into the buffer, just return
EINVAL. Otherwise, return the total length of the directory entries already
loaded in the buffer.
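The return-value rule can be sketched with directory entries modeled as already-serialized byte slices. This is an illustration only; the function name is hypothetical and a real getdents also has to track the directory offset.

```rust
const EINVAL: i32 = 22; // Linux value

/// Copies as many whole entries as fit into `buf`.
/// Returns EINVAL if not even the first entry fits; otherwise returns the
/// total number of bytes written.
fn fill_dirents(entries: &[&[u8]], buf: &mut [u8]) -> Result<usize, i32> {
    let mut written = 0;
    for entry in entries {
        if entry.len() > buf.len() - written {
            if written == 0 {
                // The buffer cannot hold even one entry.
                return Err(EINVAL);
            }
            // Report what has been loaded so far; the caller can retry
            // for the remaining entries.
            break;
        }
        buf[written..written + entry.len()].copy_from_slice(entry);
        written += entry.len();
    }
    Ok(written)
}
```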
1. Add a separate net/ directory for the network subsystem;
2. Move some existing socket code to net/;
3. Implement sendmsg/recvmsg with OCalls;
4. Extend client/server test cases.
1. Introduce the new infrastructure for ioctl support
2. Refactor the old ioctls to use the new infrastructure
3. Implement builtin ioctls (e.g., TIOCGWINSZ and TIOCSWINSZ for stdout)
4. Implement non-builtin, driver-specific ioctls (e.g., ioctls for /dev/sgx)
1. Use epoll_wait to support epoll_pwait, as there is no signal mechanism yet
2. Fix the timeout to zero so that the call does not wait for any signal
to come, which speeds it up
3. Change the test case of server_epoll to use epoll_pwait
BACKGROUND
The exit_group syscall, which is implicitly called by libc after the main function
returns, kills all threads in a thread group, even if these threads are
running, sleeping, or waiting on a futex.
PROBLEM
In normal use cases, exit_group does nothing since a well-written program
should terminate all threads before the main function returns. But when this is
not the case, exit_group can clean up the mess.
Currently, Occlum does not implement exit_group, and the Occlum PAL process
waits for all tasks (i.e., SGX threads) to finish before exiting. So without
exit_group implemented, some tasks may still be running after the main task
exits. This causes the Occlum PAL process to wait forever.
WORKAROUND
To implement a real exit_group, we need signals to kill threads. But we do not
have signals yet. So we come up with a workaround: instead of waiting for all
tasks to finish in PAL, we just wait for the main task. As soon as the main
task exits, the PAL process terminates, killing the remaining tasks.
The original implementation of program loader is written under the assumption
that there are only two loadable segments per ELF, one is code, and the other
is data. But this assumption is unnecessary and proves to be wrong for an ELF
on Alpine Linux, which has two extra read-only, loadable segments for security
hardening. This commit clears the obstacle towards running unmodified
executables from Alpine Linux.
In addition to getting rid of the false assumption of two fixed loadable segments,
this commit improves the quality of the code related to program loading and
process initialization.
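Handling an arbitrary number of loadable segments can be sketched as a filter over the program headers. The `ProgramHeader` struct is a trimmed-down, hypothetical stand-in for an ELF64 program header, and the function name is made up; `PT_LOAD` and `PT_DYNAMIC` use the values from the ELF specification.

```rust
const PT_LOAD: u32 = 1; // ELF: loadable segment
const PT_DYNAMIC: u32 = 2; // ELF: dynamic linking info

/// Trimmed-down program header with only the fields used here.
struct ProgramHeader {
    p_type: u32,
    p_vaddr: u64,
    p_memsz: u64,
}

/// Returns the `[start, end)` address range that covers every loadable
/// segment, without assuming how many such segments there are.
fn loadable_range(headers: &[ProgramHeader]) -> Option<(u64, u64)> {
    let loadable: Vec<&ProgramHeader> = headers
        .iter()
        .filter(|ph| ph.p_type == PT_LOAD)
        .collect();
    let start = loadable.iter().map(|ph| ph.p_vaddr).min()?;
    let end = loadable.iter().map(|ph| ph.p_vaddr + ph.p_memsz).max()?;
    Some((start, end))
}
```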
* 'occlum init' no longer copies the signing key file.
* 'occlum build' supports setting the signing key and signing tool via args.
* 'occlum run' supports running the enclave in SGX release mode.
1. Now we support setting the app's env in Occlum.json, for example:
"env": [
"OCCLUM=yes",
"TEST=true"
]
2. Rewrite env test cases
3. Update Dockerfile to install "jq" tool
1. All generated, build files are now in a separate build directory;
2. The CLI tool supports three sub-commands: init, build, and run;
3. Refactor tests to use the new tool.
In addition, to ensure that all future Rust code complies with
`cargo fmt`, we add a Git post-commit hook that generates warnings
if the committed code is not formatted consistently.
* Add patch to Rust SGX SDK to enable integrity-only SgxFile
* Upgrade to the new SEFS extended with the integrity-only mode
* Use integrity-only SEFS for /bin and /lib in test
* Add the MAC of integrity-only SEFS to Occlum.json in test
* Mount multiple FS according to Occlum.json
* Check the MACs of integrity-only SEFS images
The old system call mechanism works by relocating the symbol __occlum_syscall
provided by libocclum_stub.so to the real entry point of the LibOS. This symbol
relocation is done by the program loader. Now, the new system call mechanism is
based on passing the entry point via the auxiliary vector. This new mechanism
is simpler and is more compatible with the upcoming support for ld.so.
Changes:
1. Fix a bug in serializing auxiliary vector in the stack of a user program;
2. Pass the syscall entry via the auxiliary vector;
3. Remove the relocation of the __occlum_syscall symbol;
4. Remove the dependency on libocclum_stub.so in tests.