When cloning an AnonymousVMObject, committed COW pages are shared
between the parent and child object. Whichever object COWs first will
take the shared committed page, and if the other object ends up doing
a COW as well, it will notice that the page is no longer shared by
two objects and simply remap it as read/write.
When a child is cloned again, while still having shared committed
pages with its own parent, the grandchild object will now join the
sharing pool with its parent and grandparent. This means that the first
2 of 3 objects that COW will draw from the shared committed pages, and
the 3rd will remap read/write.
Previously, we would "fork" the shared committed pages when cloning,
which could lead to a situation where the grandparent held on to 1 of
the 3 needed shared committed pages. If both the child and grandchild
COW'ed, they wouldn't have enough pages, and since the grandparent
maintained an extra +1 ref count on the page, it wasn't possible to
remap read/write.
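A minimal accounting sketch of the shared pool described above (the
type and member names are hypothetical, not the actual kernel
implementation):
```
// Illustrative sketch only; names are assumptions.
struct SharedCommittedCowPagesPool {
    size_t committed_page_count { 0 };

    // An object that COWs while its page is still shared draws one page here.
    bool try_take_one()
    {
        if (committed_page_count == 0)
            return false;
        --committed_page_count;
        return true;
    }
};

// With three sharers (parent, child, grandchild), the pool holds two committed
// pages: the first two objects to COW take one each, and the third sees that
// its page is no longer shared and simply remaps it read/write.
```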
We currently always crash if a user attempts to unmap a range that
does not intersect with an existing region, no matter the size. This
happens because we never explicitly check whether the search for
intersecting regions found anything; instead we loop over the results,
which might be an empty vector. We then attempt to deallocate the
requested range from the `RangeAllocator` unconditionally, which is
invalid if the specified range is not managed by the RangeAllocator.
We will then assert while validating m_total_range.contains(..) against
the range we are requesting to deallocate.
The fix for this is straightforward: error out if we weren't able to
find any intersections.
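A minimal sketch of that check, with illustrative names (not the exact
syscall code):
```
// Illustrative sketch; the helper names and errno choice are assumptions.
auto intersecting_regions = space.find_regions_intersecting(range_to_unmap);
if (intersecting_regions.is_empty())
    return EINVAL; // Nothing to unmap, so don't touch the RangeAllocator.

for (auto* region : intersecting_regions) {
    // ... unmap/split each region and deallocate only ranges we own ...
}
```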
You can get stress-ng to attempt this pattern with the following
arguments, which will attempt to unmap 0x0 through some large offset:
```
stress-ng --vm-segv 1
```
Fixes: #8483
Co-authored-by: Federico Guerinoni <guerinoni.federico@gmail.com>
To add a new per-CPU data structure, add an ID for it to the
ProcessorSpecificDataID enum.
Then call ProcessorSpecific<T>::initialize() when you are ready to
construct the per-CPU data structure on the current CPU. It can then
be accessed via ProcessorSpecific<T>::get().
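A rough usage sketch; the example struct and enum member below are
hypothetical, only ProcessorSpecificDataID, initialize() and get() come
from the description above:
```
// 1. Add an ID for the new structure (hypothetical member shown):
//      enum class ProcessorSpecificDataID : u8 { ..., MyPerCPUCounters };

// 2. Define the per-CPU structure; how it is tied to its ID follows the
//    kernel's existing convention and is elided here.
struct MyPerCPUCounters {
    u64 page_faults_handled { 0 };
};

// 3. Once each CPU is ready, construct that CPU's instance:
ProcessorSpecific<MyPerCPUCounters>::initialize();

// 4. Afterwards, access the current CPU's instance from anywhere:
ProcessorSpecific<MyPerCPUCounters>::get().page_faults_handled++;
```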
This patch replaces the existing hard-coded mechanisms for Scheduler
and MemoryManager per-CPU data structures.
This enables further work on implementing KASLR by adding relocation
support to the pre-kernel and updating the kernel to be less dependent
on specific virtual memory layouts.
This allows us to specify virtual addresses for things that the kernel
will later access via virtual addresses. By doing this we can make the
kernel independent of specific physical addresses.
Previously the kernel relied on a fixed offset between virtual and
physical addresses based on the kernel's load address. This allows us
to specify an independent offset.
When we get a COW fault and discover that whoever we were COW'ing
together with has either COW'ed that page on their end or has
unmapped/exited, we simplify life for ourselves by clearing the COW
bit and keeping the page we already have. (No need to COW if the page
is not shared!)
The act of doing this does not return a committed page to the pool.
In fact, that committed page we had reserved for this purpose was used
up (allocated) by our COW buddy when they COW'ed the page.
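A hedged sketch of that fast path; the helper names below are
illustrative, not the actual fault-handler code:
```
// Illustrative only; helper names are assumptions.
PageFaultResponse handle_cow_fault(size_t page_index)
{
    auto& page = physical_page(page_index);
    if (page.ref_count() == 1) {
        // Our COW buddy already copied the page (or went away), so it's ours:
        // clear the COW bit and remap read/write. No committed page goes back
        // to the pool; the buddy consumed it when they COW'ed.
        clear_cow_bit(page_index);
        remap_page(page_index, /* read_write */ true);
        return PageFaultResponse::Continue;
    }
    // Still shared: take a committed page and copy as usual.
    return copy_page_on_write(page_index);
}
```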
This fixes a kernel panic when running TestLibCMkTemp. :^)
We don't need an entirely separate VMObject subclass to influence the
location of the physical pages.
Instead, we simply allocate enough physically contiguous memory first,
and then pass it to the AnonymousVMObject constructor that takes a span
of physical pages.
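Roughly the shape of the new flow (the allocation helper and factory
names are assumptions, not the exact kernel API):
```
// Illustrative sketch; names are assumptions.
auto contiguous_pages = allocate_physically_contiguous_pages(page_count);
if (contiguous_pages.is_empty())
    return ENOMEM;

// Hand the already-allocated pages to the regular AnonymousVMObject
// factory that accepts a span of physical pages.
auto vmobject = AnonymousVMObject::try_create_with_physical_pages(contiguous_pages.span());
```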
VMObject already has an IntrusiveList of all the Regions that map it.
We were keeping a counter in addition to this, and only using it in
a single place to avoid iterating over the list in case it only had
1 entry.
Simplify VMObject by removing this counter and always iterating the
list even if there's only 1 entry. :^)
This was previously used for a single debug logging statement during
memory purging. There are no remaining users of this weak pointer,
so let's get rid of it.
If a purgeable VM object is in the "volatile" state when we're asked
to make a COW clone of it, make life simpler by simply "purging"
the cloned object right away.
This effectively means that a fork()'ed child process will discover
its purgeable+volatile regions to be empty if/when it tries making
them non-volatile.
This patch changes the semantics of purgeable memory.
- AnonymousVMObject now has a "purgeable" flag. It can only be set when
constructing the object. (Previously, all anonymous memory was
effectively purgeable.)
- AnonymousVMObject now has a "volatile" flag. It covers the entire
range of physical pages. (Previously, we tracked ranges of volatile
pages, effectively making it a page-level concept.)
- Non-volatile objects maintain a physical page reservation via the
committed pages mechanism, to ensure full coverage for page faults.
- When an object is made volatile, it relinquishes any unused committed
pages immediately. If later made non-volatile again, we then attempt
to make a new committed pages reservation. If this fails, we return
ENOMEM to userspace.
mmap() now creates purgeable objects if passed the MAP_PURGEABLE option
together with MAP_ANONYMOUS. anon_create() memory is always purgeable.
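From userspace the new flag is used together with MAP_ANONYMOUS; a
minimal sketch (error handling omitted, and toggling volatility later
happens through a separate call not shown here):
```
#include <sys/mman.h>

// Create a purgeable, initially non-volatile anonymous mapping.
void* buffer = mmap(nullptr, 1024 * 1024, PROT_READ | PROT_WRITE,
    MAP_ANONYMOUS | MAP_PRIVATE | MAP_PURGEABLE, -1, 0);
```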
This patch greatly simplifies VMObject locking by doing two things:
1. Giving VMObject an IntrusiveList of all its mapping Region objects.
2. Removing VMObject::m_paging_lock in favor of VMObject::m_lock
Before (1), VMObject::for_each_region() was forced to acquire the
global MM lock (since it worked by walking MemoryManager's list of
all regions and checking for regions that pointed to itself.)
With each VMObject having its own list of Regions, VMObject's own
m_lock is all we need.
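With that list in place, for_each_region() can look roughly like this
(the locker type and list member name are assumptions):
```
// Illustrative sketch; locker type and member names are assumptions.
template<typename Callback>
void VMObject::for_each_region(Callback callback)
{
    ScopedSpinLock locker(m_lock);
    for (auto& region : m_regions)
        callback(region);
}
```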
Before (2), page fault handlers used a separate mutex for preventing
overlapping work. This design required multiple temporary unlocks
and was generally extremely hard to reason about.
Instead, page fault handlers now use VMObject's own m_lock as well.
Despite what the declaration would have us believe, these are not "u8*".
If they were, we wouldn't have to use the & operator to get their
address and then cast them to "u8*"/FlatPtr afterwards.
Since we're taking from the committed set of pages, there should never
be a reason for this call to fail.
Also add a Badge to disallow taking committed pages from anywhere but
the Region class.
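The Badge-restricted entry point looks roughly like this (the method
and member names are illustrative):
```
// Illustrative sketch; names are assumptions. Only code inside the Region
// class can construct a Badge<Region>, so only Region may call this.
NonnullRefPtr<PhysicalPage> AnonymousVMObject::allocate_committed_page(Badge<Region>)
{
    // We draw from pages committed up front, so this cannot fail here.
    return m_unused_committed_pages.take_one();
}
```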
We don't need to have a dedicated API for creating a VMObject with a
single page, the multi-page API option works in all cases.
Also make the API take a Span<NonnullRefPtr<PhysicalPage>> instead of
a NonnullRefPtrVector<PhysicalPage>.
Depending on the values, it might be difficult to figure out whether a
value is decimal or hexadecimal, so let's make this more obvious. This
also allows copying and pasting those numbers into GNOME Calculator and
probably other apps which auto-detect the base.
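For example, printing with an explicit 0x prefix makes the base
unambiguous (the call site below is illustrative, not from this change):
```
dbgln("Unable to reserve range: base=0x{:x}, size=0x{:x}", base, size);
```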
Move these to MM to simplify the flow of the syscall handler.
While here, also make sure we hold the process space lock for
the duration of the validation to avoid potential issues where
another thread attempts to modify the process space during the
validation. This will allow us to move the validation out of the
big process lock scope in a future change.
Additionally utilize the new no_lock variants of functions to avoid
unnecessary recursive process space spinlock acquisitions.
The entire process is not needed; just require the user to pass in the
Space. Also provide a no_lock variant to use when you already have the
VM/Space lock acquired, to avoid unnecessary recursive spinlock
acquisitions.
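The locked/no_lock split follows the usual pattern, sketched here with
illustrative names:
```
// Illustrative sketch; function names and locker type are assumptions.
bool Space::is_range_mapped(Range const& range)
{
    ScopedSpinLock locker(m_lock);
    return is_range_mapped_no_lock(range);
}

bool Space::is_range_mapped_no_lock(Range const& range)
{
    // Caller must already hold m_lock; no recursive acquisition here.
    return find_region_containing(range) != nullptr;
}
```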
This implements a simple bootloader that is capable of loading ELF64
kernel images. It does this by using QEMU/GRUB to load the kernel image
from disk and pass it to our bootloader as a Multiboot module.
The bootloader then parses the ELF image and sets it up appropriately.
The kernel's entry point is a C++ function with architecture-native
code.
Co-authored-by: Liav A <liavalb@gmail.com>
We should never request the removal of a region that we don't
currently own; all callers currently assert this themselves.
Instead, let's just push the assert down into the RedBlackTree removal
and assume that we will always successfully remove the region.