According to POSIX, waitid() should fill the si_signo and si_pid members
with zeroes if there are no children that have already changed their
state by the time of the call. Let's just fill the whole structure
with zeroes to avoid leaking kernel memory.
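
A minimal sketch of the zero-fill, not the actual kernel code. FakeSigInfo
and fill_no_child_siginfo() are hypothetical stand-ins for siginfo_t and
for whatever the syscall does internally. Clearing the whole struct covers
si_signo and si_pid and also any bytes the kernel would otherwise leave
uninitialized:

```cpp
#include <cassert>
#include <cstring>

// Simplified stand-in for siginfo_t (hypothetical; the real struct has
// more members and a platform-specific layout).
struct FakeSigInfo {
    int si_signo;
    int si_pid;
    int si_status;
    char reserved[16];
};

// Hypothetical helper: clear every byte of the structure, so no stale
// memory contents are handed back to the caller.
inline void fill_no_child_siginfo(FakeSigInfo* info)
{
    assert(info != nullptr);
    std::memset(info, 0, sizeof(*info));
}
```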
sys$waitid() takes an explicit description of whether it's waiting for a single
process with the given PID, all of the children, a group, etc., and returns its
info as a siginfo_t.
It also doesn't automatically imply WEXITED, which clears up the confusion in
the kernel.
This patch introduces sys$perf_event() with two event types:
- PERF_EVENT_MALLOC
- PERF_EVENT_FREE
After the first call to sys$perf_event(), a process will begin keeping
these events in a buffer. When the process dies, that buffer will be
written out to "perfcore" in the current directory unless that filename
is already taken.
This is probably not the best way to do this, but it's a start and will
make it possible to start doing memory allocation profiling. :^)
Before putting itself back on the wait queue, the finalizer task will
now check if there's more work to do, and if so, do it first. :^)
This patch also puts a bunch of process/thread debug logging behind
PROCESS_DEBUG and THREAD_DEBUG since it was unbearable to debug this
stuff with all the spam.
If the waitee process is dead, we don't need to inspect the thread.
This fixes an issue with sys$waitpid() failing before reap() since
dead processes will have no remaining threads alive.
There was a race window in a bunch of syscalls between calling
Thread::from_tid() and checking if the found thread was in the same
process as the calling thread.
If the found thread object was destroyed at that point, there was a
use-after-free that could be exploited by filling the kernel heap with
something that looked like a thread object.
Memory validation is used to verify that user syscalls are allowed to
access a given memory range. Ring 0 threads never make syscalls, and
so will never end up in validation anyway.
We were allowing kmalloc memory accesses because kernel thread stacks
used to be allocated in kmalloc memory. Since that's no longer the
case, we can stop making exceptions for kmalloc in the validation code.
Move timeout management to the ReadBlocker and WriteBlocker classes.
Also get rid of the specialized ReceiveBlocker since it no longer does
anything that ReadBlocker can't do.
Vector::ensure_capacity() makes sure the underlying vector buffer can
contain all the data, but it doesn't update the Vector::size().
As a result, writev() would simply collect all the buffers to write,
and then do nothing.
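
The same pitfall can be shown with std::vector, used here only as an
analogy for AK::Vector: reserve() grows the capacity but leaves size() at
zero, so any loop over size() sees no buffers. The Buffer struct and both
function names are hypothetical, and resize() is just one possible way to
fix it:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one iovec-style buffer descriptor.
struct Buffer {
    const void* data { nullptr };
    std::size_t length { 0 };
};

// Buggy shape: capacity is reserved, but size() is still 0 afterwards.
inline std::size_t visible_buffers_after_reserve(std::size_t iov_count)
{
    std::vector<Buffer> vecs;
    vecs.reserve(iov_count);
    return vecs.size();
}

// Fixed shape: resize() both allocates and updates size().
inline std::size_t visible_buffers_after_resize(std::size_t iov_count)
{
    std::vector<Buffer> vecs;
    vecs.resize(iov_count);
    assert(vecs.capacity() >= iov_count);
    return vecs.size();
}
```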
Move all the fork-specific inheritance logic to sys$fork(), and all the
stuff for setting up stdio for non-fork ring 3 processes moves to
Process::create_user_process().
Also: we were setting up the PGID, SID and umask twice, and the code
for copying the open file descriptors was overly complicated. Now it's
just a simple Vector copy assignment. :^)
When using dbg() in the kernel, the output is automatically prefixed
with [Process(PID:TID)]. This makes it a lot easier to understand which
thread is generating the output.
This patch also cleans up some common logging messages and removes the
now-unnecessary "dbg() << *current << ..." pattern.
Sergey suggested that having a non-zero O_RDONLY would make some things
less confusing, and it seems like he's right about that.
We can now easily check read/write permissions separately instead of
dancing around with the bits.
This patch also fixes unveil() validation for O_RDWR which previously
forgot to check for "r" permission.
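
A rough sketch of why separate bits help. The SKETCH_* values and the
helper names below are assumptions for illustration, not necessarily the
values the kernel uses. With O_RDWR defined as the OR of the read and
write bits, an unveil-style check naturally requires both "r" and "w":

```cpp
#include <cassert>

// Assumed bit layout (hypothetical): each access mode gets its own bit.
constexpr int SKETCH_O_RDONLY = 1 << 0;
constexpr int SKETCH_O_WRONLY = 1 << 1;
constexpr int SKETCH_O_RDWR = SKETCH_O_RDONLY | SKETCH_O_WRONLY;

constexpr bool wants_read(int flags) { return (flags & SKETCH_O_RDONLY) != 0; }
constexpr bool wants_write(int flags) { return (flags & SKETCH_O_WRONLY) != 0; }

// Hypothetical permission check: every requested mode must be granted.
constexpr bool open_allowed(int flags, bool has_r, bool has_w)
{
    if (wants_read(flags) && !has_r)
        return false;
    if (wants_write(flags) && !has_w)
        return false;
    return true;
}

static_assert(wants_read(SKETCH_O_RDWR) && wants_write(SKETCH_O_RDWR),
    "O_RDWR implies both read and write");
```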
This syscall is a complement to pledge() and adds the same sort of
incremental relinquishing of capabilities for filesystem access.
The first call to unveil() will "drop a veil" on the process, and from
then on, only unveiled parts of the filesystem are visible to it.
Each call to unveil() specifies a path to either a directory or a file
along with permissions for that path. The permissions are a combination
of the following:
- r: Read access (like the "rpath" promise)
- w: Write access (like the "wpath" promise)
- x: Execute access
- c: Create/remove access (like the "cpath" promise)
Attempts to open a path that has not been unveiled will fail with
ENOENT. If the unveiled path lacks sufficient permissions, it will fail
with EACCES.
Like pledge(), subsequent calls to unveil() with the same path can only
remove permissions, not add them.
Once you call unveil(nullptr, nullptr), the veil is locked, and it's no
longer possible to unveil any more paths for the process, ever.
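
The rules above can be modeled roughly as follows. This is a userspace
toy, not the kernel implementation: VeilModel, its methods and the Perm*
constants are hypothetical, paths are matched exactly (no directory
prefix handling), and rejecting every call after locking is an
assumption:

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>

constexpr unsigned PermRead = 1u << 0;    // "r"
constexpr unsigned PermWrite = 1u << 1;   // "w"
constexpr unsigned PermExecute = 1u << 2; // "x"
constexpr unsigned PermCreate = 1u << 3;  // "c"

class VeilModel {
public:
    // Returns false if the veil is locked or the call would add permissions.
    bool unveil(const std::string& path, unsigned perms)
    {
        if (m_locked)
            return false;
        auto it = m_paths.find(path);
        if (it == m_paths.end()) {
            m_paths.emplace(path, perms);
            return true;
        }
        if (perms & ~it->second)
            return false; // Repeated calls may only remove permissions.
        it->second = perms;
        return true;
    }

    // Models unveil(nullptr, nullptr).
    void lock() { m_locked = true; }

    // Returns 0 on success, otherwise ENOENT or EACCES.
    int check_open(const std::string& path, unsigned wanted) const
    {
        if (m_paths.empty())
            return 0; // No veil has been dropped yet.
        auto it = m_paths.find(path);
        if (it == m_paths.end())
            return ENOENT;
        if (wanted & ~it->second)
            return EACCES;
        return 0;
    }

private:
    std::map<std::string, unsigned> m_paths;
    bool m_locked { false };
};
```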
This concept comes from OpenBSD, and their implementation does various
things differently, I'm sure. This is just a first implementation for
SerenityOS, and we'll keep improving on it as we go. :^)
uintptr_t is 32-bit or 64-bit depending on the target platform.
This will help us write pointer size agnostic code so that when the day
comes that we want to do a 64-bit port, we'll be in better shape.
Instead of restoring CR3 to the current process's paging scope when a
ProcessPagingScope goes out of scope, we now restore exactly whatever
the CR3 value was when we created the ProcessPagingScope.
This fixes breakage in situations where a process ends up with nested
ProcessPagingScopes. This was making profiling very fragile, and with
this change it's now possible to profile g++! :^)
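
A small sketch of the save-and-restore idea, with a plain global standing
in for the CR3 register. FakePagingScope and g_fake_cr3 are hypothetical
names, and the real class does more than this. Because each scope
restores exactly the value it saw on entry, nesting unwinds correctly:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the CPU's CR3 register (hypothetical; real code would
// read and write the actual register).
static std::uintptr_t g_fake_cr3 = 0;

class FakePagingScope {
public:
    explicit FakePagingScope(std::uintptr_t new_cr3)
        : m_previous_cr3(g_fake_cr3) // Remember whatever was active.
    {
        g_fake_cr3 = new_cr3;
    }

    // Restore the saved value, not "the current process's" value.
    ~FakePagingScope() { g_fake_cr3 = m_previous_cr3; }

    FakePagingScope(const FakePagingScope&) = delete;
    FakePagingScope& operator=(const FakePagingScope&) = delete;

private:
    std::uintptr_t m_previous_cr3;
};

// Walks through two nested scopes and checks each restore step.
inline bool nested_scopes_restore_correctly()
{
    g_fake_cr3 = 0x1000;
    {
        FakePagingScope outer(0x2000);
        {
            FakePagingScope inner(0x3000);
            assert(g_fake_cr3 == 0x3000);
        }
        if (g_fake_cr3 != 0x2000)
            return false; // Inner scope must restore the outer value.
    }
    return g_fake_cr3 == 0x1000;
}
```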
This will panic the kernel immediately if these functions are misused
so we can catch it and fix the misuse.
This patch fixes a couple of misuses:
- create_signal_trampolines() writes to a user-accessible page
above the 3GB address mark. We should really get rid of this
page but that's a whole other thing.
- CoW faults need to use copy_from_user rather than copy_to_user
since it's the *source* pointer that points to user memory.
- Inode faults need to use memcpy rather than copy_to_user since
we're copying a kernel stack buffer into a quickmapped page.
This should make the copy_to/from_user() functions slightly less useful
for exploitation. Before this, they were essentially just glorified
memcpy() with SMAP disabled. :^)
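
One way such a misuse check could look, assuming that "misuse" means the
user-side pointer does not lie entirely below the 3GB mark mentioned
above. The constant and function names are hypothetical, and the sketch
returns a bool where the kernel would panic:

```cpp
#include <cassert>
#include <cstdint>

// Assumed split between user and kernel address space (3GB).
constexpr std::uint64_t sketch_kernel_base = 0xc0000000;

// True if [address, address + size) lies entirely below the kernel base.
// Written to avoid overflow when address + size would wrap around.
constexpr bool is_user_range(std::uint64_t address, std::uint64_t size)
{
    return size <= sketch_kernel_base
        && address <= sketch_kernel_base - size;
}

static_assert(is_user_range(0x1000, 0x100), "low addresses are user memory");
static_assert(!is_user_range(sketch_kernel_base, 1), "kernel base is not");
```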
Previously, VFS::open() would only use the passed flags for permission checking
purposes, and Process::sys$open() would set them on the created FileDescription
explicitly. Now, they should be set by VFS::open() on any files being opened,
including files that the kernel opens internally.
This also lets us get rid of the explicit check for whether or not the returned
FileDescription was a preopen fd, and in fact, fixes a bug where a read-only
preopen fd without any other flags would be considered freshly opened (due to
O_RDONLY being indistinguishable from 0) and granted a new set of flags.
Kernel processes just do not need them.
This also avoids touching the file (sub)system early in the boot process when
initializing the colonel process.
Right now, permission flags passed to VFS::open() are effectively ignored, but
that is going to change.
* O_RDONLY is 0, but it's still nicer to pass it explicitly
* POSIX says that binding a Unix socket to a symlink shall fail with EADDRINUSE
It's now an error to sys$mmap() a file as writable if it's currently
mapped executable by anyone else.
It's also an error to sys$execve() a file that's currently mapped
writable by anyone else.
This fixes a race condition vulnerability where one program could make
modifications to an executable while another process was in the kernel,
in the middle of exec'ing the same executable.
Test: Kernel/elf-execve-mmap-race.cpp