This patch fixes some issues with the mmap() and mprotect() syscalls,
neither of whom were checking the permission bits of the underlying
files when mapping an inode MAP_SHARED.
This made it possible to subvert execution of any running program
by simply memory-mapping its executable and replacing some of the code.
Test: Kernel/mmap-write-into-running-programs-executable-file.cpp
This encourages callers to strongly reference file descriptions while
working with them.
This fixes a use-after-free issue where one thread would close() an
open fd while another thread was blocked on it becoming readable.
Test: Kernel/uaf-close-while-blocked-in-read.cpp
Before this, you could make the kernel copy memory from anywhere by
setting up an ELF executable with a program header specifying file
offsets outside the file.
Since ELFImage didn't even know how large it was, we had no clue that
we were copying things from outside the ELF.
Fix this by adding a size field to ELFImage and validating program
header ranges before memcpy()'ing to them.
The ELF code is definitely going to need more validation and checking.
This code had been misinterpreting the Multiboot ELF section headers
since the beginning. Furthermore QEMU wasn't even passing us any
headers at all, so this wasn't checking anything.
This patch introduces a helpful copy_string_from_user() function
that takes a bounded null-terminated string from userspace memory
and copies it into a String object.
Supervisor Mode Access Prevention (SMAP) is an x86 CPU feature that
prevents the kernel from accessing userspace memory. With SMAP enabled,
trying to read/write a userspace memory address while in the kernel
will now generate a page fault.
Since it's sometimes necessary to read/write userspace memory, there
are two new instructions that quickly switch the protection on/off:
STAC (disables protection) and CLAC (enables protection.)
These are exposed in kernel code via the stac() and clac() helpers.
There's also a SmapDisabler RAII object that can be used to ensure
that you don't forget to re-enable protection before returning to
userspace code.
THis patch also adds copy_to_user(), copy_from_user() and memset_user()
which are the "correct" way of doing things. These functions allow us
to briefly disable protection for a specific purpose, and then turn it
back on immediately after it's done. Going forward all kernel code
should be moved to using these and all uses of SmapDisabler are to be
considered FIXME's.
Note that we're not realizing the full potential of this feature since
I've used SmapDisabler quite liberally in this initial bring-up patch.
Our syscall calling convention only allows passing up to 3 arguments in
registers. For syscalls that take more arguments, we bake them into a
struct and pass a pointer to that struct instead.
When doing pointer validation, this is what we would do:
1) Validate the "params" struct
2) Validate "params->some_pointer"
3) ... other stuff ...
4) Use "params->some_pointer"
Since the parameter struct is stored in userspace, it can be modified
by userspace after validation has completed.
This was a recurring pattern in many syscalls that was further hidden
by me using structured binding declarations to give convenient local
names to things in the parameter struct:
auto& [some_pointer, ...] = *params;
memcpy(some_pointer, ...);
This devilishly makes "some_pointer" look like a local variable but
it's actually more like an alias for "params->some_pointer" and will
expand to a dereference when accessed!
This patch fixes the issues by explicitly copying out each member from
the parameter structs before validating them, and then never using
the "param" pointers beyond that.
Thanks to braindead for finding this bug! :^)
In order to ensure a specific owner and mode when the local socket
filesystem endpoint is instantiated, we need to be able to call
fchmod() and fchown() on a socket fd between socket() and bind().
This is because until we call bind(), there is no filesystem inode
for the socket yet.
We now have these API's in <Kernel/Random.h>:
- get_fast_random_bytes(u8* buffer, size_t buffer_size)
- get_good_random_bytes(u8* buffer, size_t buffer_size)
- get_fast_random<T>()
- get_good_random<T>()
Internally they both use x86 RDRAND if available, otherwise they fall
back to the same LCG we had in RandomDevice all along.
The main purpose of this patch is to give kernel code a way to better
express its needs for random data.
Randomness is something that will require a lot more work, but this is
hopefully a step in the right direction.
It was previously possible to write to read-only file descriptors,
and read from write-only file descriptors.
All FileDescription objects now start out non-readable + non-writable,
and whoever is creating them has to "manually" enable reading/writing
by calling set_readable() and/or set_writable() on them.
This code never worked, as was never used for anything. We can build
a much better SHM implementation on top of TmpFS or similar when we
get to the point when we need one.
Split a region into two/three if the desired mprotect range is a strict
subset of an existing region. We can then set the access bits on a new
region that is just our desired range and add both the new
desired subregion and the leftovers back to our page tables.