To make sure we don't lose changes, shared file mappings will now be
fully synced when they are unmapped, whether explicitly or implicitly
(by the program exiting/crashing/etc.)
This can incur a lot of work, since we don't keep track of dirty pages,
but that's something we can optimize down the road. :^)
This allows userspace to trigger a full (FIXME: or partial) flush of a
shared file
mapping to disk. We iterate over all the mapped pages in the VMObject
and write them out to the underlying inode, one by one. This is rather
naive, and there's lots of room for improvement.
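For illustration, a minimal sketch of the naive write-back loop (names
and signatures are approximations, not the actual code):

    // Walk every present physical page in the VMObject and write it
    // back to the inode at the corresponding byte offset.
    for (size_t page_index = 0; page_index < page_count(); ++page_index) {
        auto* physical_page = physical_pages()[page_index];
        if (!physical_page)
            continue; // never paged in, so nothing to flush
        u8 page_buffer[PAGE_SIZE];
        MM.copy_physical_page(*physical_page, page_buffer);
        TRY(inode().write_bytes(page_index * PAGE_SIZE, PAGE_SIZE,
            UserOrKernelBuffer::for_kernel_buffer(page_buffer), nullptr));
    }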
Note that shared file mappings are currently not possible since mmap()
returns ENOTSUP for PROT_WRITE+MAP_SHARED. That restriction will be
removed in a subsequent commit. :^)
This is a handy helper that copies out the full contents of a physical
page into a caller-provided buffer. It uses quickmapping internally
(and takes the MM lock for the duration.)
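A hedged sketch of what the helper can look like (approximate):

    void MemoryManager::copy_physical_page(PhysicalPage& physical_page,
                                           u8 page_buffer[PAGE_SIZE])
    {
        // Hold the MM lock while the per-CPU quickmap slot is in use.
        SpinlockLocker locker(s_mm_lock);
        u8* quickmapped = quickmap_page(physical_page);
        memcpy(page_buffer, quickmapped, PAGE_SIZE);
        unquickmap_page();
    }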
This isn't a complete conversion to ErrorOr<void>, but a good chunk.
The end goal here is to propagate buffer allocation failures to the
caller, and allow the use of TRY() with formatting functions.
Instead of signalling allocation failure with a bool return value
(false), we now use ErrorOr<void> and return ENOMEM as appropriate.
This allows us to use TRY() and MUST() with Vector. :^)
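Usage sketch:

    Vector<int> numbers;
    TRY(numbers.try_append(42));            // propagates ENOMEM on failure
    MUST(numbers.try_ensure_capacity(128)); // or crash where failure is not an option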
We now use AK::Error and AK::ErrorOr<T> in both kernel and userspace!
This was a slightly tedious refactoring that took a long time, so it's
not unlikely that some bugs crept in.
Nevertheless, it does pass basic functionality testing, and it's just
real nice to finally see the same pattern in all contexts. :^)
This small change simplifies the function a bit, but also fixes a
problem with it.
Let's take an example to see this:
Say we have a reserved range from 0xe0000 to 0xfffff (EBDA), and we
want to map the entire EBDA from the memory device (/dev/mem) into a
program. If the program tries to map more than 131072 bytes, the
current logic works: the start address is 0xe0000, which is of course
below the limit, so it passes the first two restrictions, and the
third if statement correctly fails the request for mapping more than
the allowed byte count.
However, let's take another scenario, where we try to mmap from
0xf0000 with a length of less than 131072 bytes but more than 65536.
In that case, we again pass the first two if statements, but the third
one passes too, because it doesn't take the offset of the requested
address from the start of the reserved range (0xe0000) into account.
As a result, a user can easily mmap 65535 bytes above 0x100000. This
might seem negligible, but it's still a severe bug that could
theoretically be exploited into an info leak or into tampering with
important kernel structures.
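For illustration, the corrected check has to validate the length
against what remains of the reserved range after the requested start
address, roughly like this (names hypothetical):

    auto offset_in_range = requested_address - range.start();
    if (length > range.length() - offset_in_range)
        return EINVAL;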
SonarCloud flagged this "Code Smell", where we are accessing these
static methods as if they are instance methods. While it is technically
possible, it is very confusing to read when you realize they are static
functions.
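For illustration (hypothetical example):

    struct Foo {
        static int count() { return 0; }
    };

    Foo foo;
    foo.count();  // legal, but reads like an instance method
    Foo::count(); // clearly a static call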
When testing the RTL8168 driver, it seems we can't allocate super pages
anymore. Either we expand the super pages range, or find a solution to
dynamically expand the range (or let drivers utilize other ranges).
This function was checking 1 byte after the provided range, which caused
it to reject valid userspace ranges that happened to end exactly at the
top of the user address space.
This fixes a long-standing issue with mysterious Optional errors in
Coredump::write_regions(). (It happened when trying to add a memory
region at the very top of the address space to a coredump.)
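A hedged sketch of the off-by-one (names approximate): the last byte
*inside* a range of `size` bytes starting at `base` is at
`base + size - 1`, and that's the address that should be validated:

    bool is_user_range(VirtualAddress base, size_t size)
    {
        // Checking base.offset(size) would test one byte *past* the
        // range, wrongly rejecting ranges that end exactly at the top
        // of the user address space. (Zero-size handling elided.)
        return is_user_address(base) && is_user_address(base.offset(size - 1));
    }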
This expands the reach of error propagation greatly throughout the
kernel. Sadly, it also exposes the fact that we're allocating (and
doing other fallible things) in constructors all over the place.
This patch doesn't attempt to address that of course. That's work for
our future selves.
The default template argument is only used in one place, and it
looks like it was probably just an oversight. The rest of the Kernel
code all uses u8 as the type. So let's make that the default and remove
the unused template argument, as there doesn't seem to be a reason to
allow the size to be customizable.
This commit moves the KResult and KResultOr objects to Kernel/API to
signify that they may now be freely used by userspace code at points
where a syscall-related error result is to be expected. It also exposes
KResult and KResultOr to the global namespace to make it nicer to use
for userspace code.
And also try_create<T> => try_make_ref_counted<T>.
A global "create" was a bit much. The new name matches make<T> better,
which we've used for making single-owner objects since forever.
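Usage sketch (Thing is a stand-in type):

    auto shared = TRY(try_make_ref_counted<Thing>(42)); // NonnullRefPtr<Thing>
    auto owned = make<Thing>(42);                       // NonnullOwnPtr<Thing>, single-owner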
We have seen cases where the map fails, but we return the region
to the caller, causing them to page fault later on when they touch
the region.
The fix is to always observe the return code of map/remap.
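Roughly, the shape of the fix (sketch; exact call sites vary):

    // Before, the result of map() was silently dropped. Now a failure
    // propagates to the caller instead of surfacing later as a page
    // fault in userspace.
    TRY(region->map(page_directory()));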
The quickmap_page() and unquickmap_page() functions are used to map a
single physical page at a kernel virtual address for temporary access.
These use the per-CPU quickmap buffer in the page tables, and access to
this is guarded by the MM lock. To prevent bugs, quickmap_page() should
not *take* the MM lock, but rather verify that it is already held!
This exposed two situations where we were using quickmap without holding
the MM lock during page fault handling. This patch fixes those issues
as well (which is great!) :^)
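A sketch of the new precondition (approximate):

    u8* MemoryManager::quickmap_page(PhysicalAddress const& physical_address)
    {
        VERIFY_INTERRUPTS_DISABLED();
        VERIFY(s_mm_lock.is_locked_by_current_processor()); // verify, don't take
        // ... map into the per-CPU quickmap slot as before ...
    }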
This has several benefits:
1) We no longer just blindly dereference a null pointer in various places
2) We will get nicer runtime error messages if the current process does
turn out to be null at the call site
3) GCC no longer complains about possible nullptr dereferences when
compiling without KUBSAN
The VMObject class now manages its own instance list (it was previously
a member of MemoryManager.) Removal from the list is done safely on the
last unref(), closing a race window in the previous implementation.
Note that VMObject::all_instances() now has its own lock instead of
using the global MM lock.
This makes for nicer handling of errors compared to checking whether a
RefPtr is null. Additionally, this paves the way for returning
different types of errors in the future.
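Sketch of the before/after shape (Thing hypothetical):

    // Before: failure signalled by a null RefPtr.
    RefPtr<Thing> thing = Thing::create();
    if (!thing)
        return ENOMEM;

    // After: the error travels in the ErrorOr and works with TRY().
    auto thing = TRY(Thing::try_create()); // NonnullRefPtr<Thing> on success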
First off: unregister the region from MemoryManager before unmapping it.
The order of operations here was a bit strange, presumably to avoid a
situation where a fault would happen while unmapping, and the fault
handler would find the MemoryManager region list in an invalid state.
Unregistering it before unmapping sidesteps the whole problem, and
allows us to easily fix another problem: a deadlock could occur due
to inconsistent acquisition order (PageDirectory must come before MM.)
We don't want to be holding the MM lock if it's a user region and we
have to consult the page directory, since that can lead to a deadlock
if we don't already have the page directory lock.
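A sketch of the consistent acquisition order (lock names approximate):

    // PageDirectory lock always before the MM lock:
    SpinlockLocker page_directory_locker(page_directory().get_lock());
    SpinlockLocker mm_locker(s_mm_lock);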
It may happen that CPU A manages to page in from the same inode
while we're just entering the same page fault handler on CPU B.
Handle it gracefully by checking if the data has already been paged in
(instead of VERIFY'ing that it hasn't) and then remap the page if that's
the case.
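A hedged sketch of the graceful path (names approximate):

    if (!vmobject_physical_page_slot.is_null()) {
        // Another CPU won the race and already paged the data in;
        // remap the page for this address space instead of asserting.
        remap_vmobject_page(page_index_in_vmobject);
        return PageFaultResponse::Continue;
    }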
Taking a reference or a pointer to a value that's not aligned properly
is undefined behavior. While `[[gnu::packed]]` ensures that reads from
and writes to fields of packed structs are safe operations, the
information about the reduced alignment is lost when creating pointers
to these values.
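For illustration:

    struct [[gnu::packed]] Example {
        u8 a;
        u32 b; // stored at offset 1, so only 1-byte aligned
    };

    Example example {};
    u32* pointer = &example.b; // -Waddress-of-packed-member fires here:
                               // the u32* claims 4-byte alignment that
                               // the storage doesn't actually have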
Weirdly enough, GCC's undefined behavior sanitizer doesn't flag these,
even though the doc of `-Waddress-of-packed-member` says that it usually
leads to UB. In contrast, x86_64 Clang does flag these, which renders
the 64-bit kernel unable to boot.
For now, the `address-of-packed-member` warning will only be enabled in
the kernel, as it is absolutely crucial there because of KUBSAN, but
might get excessively noisy for the userland in the future.
Also note that we can't append to `CMAKE_CXX_FLAGS` like we do for other
flags in the kernel, because flags added via `add_compile_options` come
after these, so the `-Wno-address-of-packed-member` in the root would
cancel it out.
When booting APs, we identity map a region at 0x8000 while doing the
initial bringup sequence. This is the only thing in the kernel that
requires an identity mapping, yet we had a bunch of generic APIs and a
dedicated VirtualRangeAllocator in every PageDirectory for this purpose.
This patch simplifies the situation by moving the identity mapping logic
to the AP boot code and removing the generic APIs.
...and also RangeAllocator => VirtualRangeAllocator.
This clarifies that the ranges we're dealing with are *virtual* memory
ranges and not anything else.