We now validate the full range of userspace memory passed into syscalls
instead of just checking that the first and last byte of the memory are
in process-owned regions.
This fixes an issue where it was possible to avoid rejection of invalid
addresses that sat between two valid ones, simply by passing a valid
address and a size large enough to put the end of the range at another
valid address.
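For illustration, here's roughly what the stronger check amounts to; every name in this sketch (validate_user_range, process_owns_page) is hypothetical, not the actual kernel API:

    #include <cstddef>
    #include <cstdint>

    struct Process;
    // Hypothetical helper: does the process own the page containing 'address'?
    bool process_owns_page(const Process&, uintptr_t address);

    // Walk the range one page at a time, so a hole between two valid pages
    // is caught instead of only probing the first and last byte.
    bool validate_user_range(const Process& process, uintptr_t base, size_t size)
    {
        if (size == 0)
            return true;
        uintptr_t end = base + size - 1;
        if (end < base)
            return false; // the range wraps around the address space
        const uintptr_t page_size = 4096;
        for (uintptr_t page = base / page_size; page <= end / page_size; ++page) {
            if (!process_owns_page(process, page * page_size))
                return false;
        }
        return true;
    }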
I added a little test utility that tries to provoke EFAULT in various
ways to help verify this. I'm sure we can think of more ways to test
this but it's at least a start. :^)
Thanks to mozjag for pointing out that this code was still lacking!
Incidentally this also makes backtraces work again.
Fixes #989.
We now refuse to boot on machines that don't support PAE since all
of our paging code depends on it.
Also let's only enable SSE and PGE support if the CPU advertises it.
Previously we assumed all hosts would have support for IA32_EFER.NXE.
This is mostly true for newer hardware, but older hardware will crash
and burn if you try to use this feature.
Now we check for support via CPUID.80000001[20].
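For reference, the probe is just a CPUID call; this is a rough sketch (a real check would also verify via leaf 0x80000000 that the extended leaf exists at all):

    #include <cstdint>

    // Leaf 0x80000001: EDX bit 20 is the NX (a.k.a. XD) flag.
    static bool cpu_supports_nx()
    {
        uint32_t eax, ebx, ecx, edx;
        asm volatile("cpuid"
                     : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     : "a"(0x80000001u));
        return (edx >> 20) & 1;
    }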
Introduce one more (CPU) indirection layer in the paging code: the page
directory pointer table (PDPT). Each PageDirectory now has 4 separate
PageDirectoryEntry arrays, governing 1 GB of VM each.
A really neat side-effect of this is that we can now share the physical
page containing the >=3GB kernel-only address space metadata between
all processes, instead of lazily cloning it on page faults.
This will give us access to the NX (No eXecute) bit, allowing us to
prevent execution of memory that's not supposed to be executed.
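For reference, here is how a 32-bit virtual address breaks down under this PAE layout; the helpers below are just a sketch of the index math, not the kernel's actual code:

    #include <cstdint>

    // [31:30] PDPT index  -> 4 entries, 1 GB of VM each
    // [29:21] PD index    -> 512 entries, 2 MB each
    // [20:12] PT index    -> 512 entries, 4 KB each
    // [11:0]  byte offset within the page
    constexpr unsigned pdpt_index(uint32_t vaddr) { return (vaddr >> 30) & 0x3; }
    constexpr unsigned pd_index(uint32_t vaddr) { return (vaddr >> 21) & 0x1ff; }
    constexpr unsigned pt_index(uint32_t vaddr) { return (vaddr >> 12) & 0x1ff; }
    constexpr unsigned page_offset(uint32_t vaddr) { return vaddr & 0xfff; }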
To enforce this, we create two separate mappings of the same underlying
physical page. A writable mapping for the kernel, and a read-only one
for userspace (the one returned by sys$get_kernel_info_page.)
The kernel is no longer identity mapped to the bottom 8 MiB of
memory; it is now mapped at the higher address of `0xc0000000`.
The lower ~1 MiB of memory (from GRUB's mmap), however, is still
identity mapped to provide an easy way for the kernel to get
physical pages for things such as DMA. These could later be
mapped to the higher address too, but I'm not sure how to
go about doing this elegantly without a lot of address subtractions.
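The "address subtractions" above boil down to the fixed offset between kernel virtual addresses and physical memory; a minimal sketch, with made-up names:

    #include <cstdint>

    // Assumes physical memory is reachable at a fixed +0xc0000000 offset in
    // the kernel's address space. Names are illustrative.
    constexpr uintptr_t kernel_virtual_base = 0xc0000000;

    constexpr uintptr_t physical_to_virtual(uintptr_t paddr) { return paddr + kernel_virtual_base; }
    constexpr uintptr_t virtual_to_physical(uintptr_t vaddr) { return vaddr - kernel_virtual_base; }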
VM regions can now be marked as stack regions, which is then validated
on syscall, and on page fault.
If a thread is caught with its stack pointer pointing into anything
that's *not* a Region with its stack bit set, we'll crash the whole
process with SIGSTKFLT.
Userspace must now allocate custom stacks by using mmap() with the new
MAP_STACK flag. This mechanism was first introduced in OpenBSD, and now
we have it too, yay! :^)
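Roughly what a userspace stack allocation looks like now (a sketch; flags beyond MAP_STACK are the usual anonymous-mapping ones and may differ in detail):

    #include <cstddef>
    #include <sys/mman.h>

    // Allocate a custom stack for a new thread; error handling kept minimal.
    void* allocate_thread_stack(size_t size)
    {
        void* stack = mmap(nullptr, size,
                           PROT_READ | PROT_WRITE,
                           MAP_ANONYMOUS | MAP_PRIVATE | MAP_STACK,
                           -1, 0);
        if (stack == MAP_FAILED)
            return nullptr;
        return stack;
    }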
Now the userspace page allocator will search through physical regions,
and stop the search as soon as it finds an available page.
Also remove an "address of" sign since we don't need it when
counting the size of physical regions.
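For illustration, the search itself amounts to something like this (types and fields are invented for the sketch; the real allocator's bookkeeping is more involved):

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    // Illustrative only: each physical region hands out pages until exhausted;
    // the allocator walks regions in order and stops at the first hit.
    struct PhysicalRegionSketch {
        uintptr_t base { 0 };
        size_t page_count { 0 };
        size_t next_free { 0 };

        std::optional<uintptr_t> take_free_page()
        {
            if (next_free >= page_count)
                return std::nullopt;
            return base + (next_free++) * 4096;
        }
    };

    std::optional<uintptr_t> allocate_user_physical_page(std::vector<PhysicalRegionSketch>& regions)
    {
        for (auto& region : regions) {
            if (auto page = region.take_free_page())
                return page; // stop at the first available page
        }
        return std::nullopt; // out of physical memory
    }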
After the page fault handler has found the region in which the fault
occurred, do the rest of the work in the region itself.
This patch also makes all fault types consistently crash the process
if a new page is needed but we're all out of pages.
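Schematically, the fault path now looks something like this (all names invented for the sketch):

    #include <cstdint>

    enum class PageFaultResponse { Continue, ShouldCrash };

    struct Region {
        // The region knows what backs it, so it decides how to resolve the
        // fault: zero-fill, copy-on-write, read from inode, or give up.
        PageFaultResponse handle_fault(uintptr_t /*faulting_address*/)
        {
            // ...region-specific handling would go here...
            return PageFaultResponse::Continue;
        }
    };

    Region* find_region_containing(uintptr_t address); // hypothetical lookup

    PageFaultResponse handle_page_fault(uintptr_t faulting_address)
    {
        auto* region = find_region_containing(faulting_address);
        if (!region)
            return PageFaultResponse::ShouldCrash; // no region owns this address
        return region->handle_fault(faulting_address);
    }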
We were always returning the full VM range of the partially-unmapped
Region to the range allocator. This caused us to re-use those addresses
for subsequent VM allocations.
This patch also skips creating a new VMObject in partial munmap().
Instead we just make split regions that point into the same VMObject.
This fixes the mysterious GCC ICE on large C++ programs.
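A sketch of what the split boils down to (types and names invented): the surviving pieces are new regions that reference the same VMObject at different offsets, so no page contents move.

    #include <cstddef>
    #include <memory>
    #include <utility>

    struct VMObjectSketch { /* owns the underlying physical pages */ };

    struct RegionSketch {
        std::shared_ptr<VMObjectSketch> vmobject;
        size_t offset_in_vmobject { 0 }; // bytes into the VMObject
        size_t size { 0 };               // bytes covered by this region
    };

    // Split 'region' around the unmapped hole [hole_offset, hole_offset + hole_size).
    std::pair<RegionSketch, RegionSketch> split_around_hole(const RegionSketch& region,
                                                            size_t hole_offset,
                                                            size_t hole_size)
    {
        RegionSketch lower { region.vmobject, region.offset_in_vmobject, hole_offset };
        RegionSketch upper { region.vmobject,
                             region.offset_in_vmobject + hole_offset + hole_size,
                             region.size - hole_offset - hole_size };
        return { lower, upper };
    }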
This simplifies the ownership model and makes Region easier to reason
about. Userspace Regions are now primarily kept by Process::m_regions.
Kernel Regions are kept in various OwnPtr<Region>s.
Regions now only ever get unmapped when they are destroyed.
This was a workaround to be able to build on case-insensitive file
systems where it might get confused about <string.h> vs <String.h>.
Let's just not support building that way, so String.h can have an
objectively nicer name. :^)
Remove the global hash tables and replace them with InlineLinkedLists.
This significantly reduces the kernel heap pressure from doing many
small mmap()'s.
Using a HashTable to track "all instances of Foo" is only useful if we
actually need to look up entries by some kind of index. And since they
are HashTable (not HashMap), the pointer *is* the index.
Since we have the pointer, we can just use it directly. Duh.
This increases sizeof(VMObject) by two pointers, but removes a global
table that had an entry for every VMObject, where the cost was higher.
It also avoids all the general hash tabling business when creating or
destroying VMObjects. Generally we should do more of this. :^)
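For illustration, the intrusive-list idea in isolation (a generic sketch, not the kernel's InlineLinkedList API; locking omitted):

    // Each object carries its own list hooks, so "all instances" tracking
    // costs two pointers per object and no heap allocations or hashing on
    // construction/destruction.
    struct TrackedObject {
        TrackedObject* prev { nullptr };
        TrackedObject* next { nullptr };

        TrackedObject()
        {
            // Link ourselves into the global list on construction.
            next = s_all_instances;
            if (s_all_instances)
                s_all_instances->prev = this;
            s_all_instances = this;
        }

        ~TrackedObject()
        {
            // Unlink on destruction; O(1), no table lookup needed.
            if (prev)
                prev->next = next;
            else
                s_all_instances = next;
            if (next)
                next->prev = prev;
        }

        static TrackedObject* s_all_instances;
    };

    TrackedObject* TrackedObject::s_all_instances = nullptr;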
KBuffers are now zero-filled on demand instead of up front. This means
that you can create a huge KBuffer and it will only take up VM, not
physical pages (until you access them.)
This allows the page fault code to find the owning PageDirectory and
corresponding process for faulting regions.
The mapping is implemented as a global hash map right now, which is
definitely not optimal. We can come up with something better when it
becomes necessary.
Sometimes you're only interested in either user OR kernel regions but
not both. Let's break this into two functions so the caller can choose
what he's interested in.
Instead of generating ByteBuffers and keeping those lying around, have
these filesystems generate KBuffers instead. These are way less spooky
to leave around for a while.
Since FileDescription will keep a generated file buffer around until
userspace has read the whole thing, this prevents trivially exhausting
the kmalloc heap by opening many files in /proc for example.
The code responsible for generating each /proc file is not perfectly
efficient, and many of them still use ByteBuffers internally, but those
at least go away when we return now. :^)
Generate a special page containing the "return from signal" trampoline code
on startup and then route signalled threads to it. This avoids a page
allocation in every process that ever receives a signal.
Region now has is_user_accessible(), which informs the memory manager how
to map these pages. Previously, we were just passing a "bool user_allowed"
to various functions and I'm not at all sure that any of that was correct.
All the Region constructors are now hidden, and you must go through one of
these helpers to construct a region:
- Region::create_user_accessible(...)
- Region::create_kernel_only(...)
That ensures that we don't accidentally create a Region without specifying
user accessibility. :^)
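The shape of the pattern, as a self-contained sketch (names and members invented; the real Region takes many more parameters):

    #include <cstddef>
    #include <memory>

    // The constructor is private, so every call site has to say whether the
    // region is user-accessible.
    class RegionSketch {
    public:
        static std::unique_ptr<RegionSketch> create_user_accessible(size_t size)
        {
            return std::unique_ptr<RegionSketch>(new RegionSketch(size, true));
        }
        static std::unique_ptr<RegionSketch> create_kernel_only(size_t size)
        {
            return std::unique_ptr<RegionSketch>(new RegionSketch(size, false));
        }
        bool is_user_accessible() const { return m_user_accessible; }

    private:
        RegionSketch(size_t size, bool user_accessible)
            : m_size(size)
            , m_user_accessible(user_accessible)
        {
        }
        size_t m_size { 0 };
        bool m_user_accessible { false };
    };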
I was messing around with this to tell the compiler that these functions
always return the same value no matter how many times you call them.
It doesn't really seem to improve code generation and it looks weird so
let's just get rid of it.
Instead of PDE's and PTE's being weird wrappers around dword*, just have
MemoryManager::ensure_pte() return a PageDirectoryEntry&, which in turn has
a PageTableEntry* entries().
I've been trying to understand how things ended up this way, and I suspect
it was because I inadvertently invoked the PageDirectoryEntry copy ctor in
the original work on this, which must have made me very confused...
Anyways, now things are a bit saner and we can move forward towards a better
future, etc. :^)
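Schematically, the types now look something like this (a sketch with invented names; the real entries carry many more flag accessors):

    #include <cstdint>

    struct PageTableEntrySketch {
        uint64_t raw { 0 };
        bool is_present() const { return raw & 1; }
    };

    struct PageDirectoryEntrySketch {
        uint64_t raw { 0 };
        bool is_present() const { return raw & 1; }

        // Bits 51:12 hold the physical address of a table of 512 PTEs.
        // Treating that directly as a pointer assumes the table is reachable
        // at that address (e.g. identity mapped), a simplification for this
        // sketch.
        PageTableEntrySketch* entries() const
        {
            return reinterpret_cast<PageTableEntrySketch*>(
                static_cast<uintptr_t>(raw & 0x000ffffffffff000ull));
        }
    };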
This makes no functional difference, but it makes it clear that
MemoryManager and PhysicalRegion take over the actual physical
page represented by this PhysicalPage instance.
This significantly reduces the pressure on the kernel heap when
allocating a lot of pages.
Previously, at about 250 MB allocated, the free page list would outgrow
the kernel's heap. Given that there is no longer a page list, this does
not happen.
The next barrier will be the kernel memory used by the page records for
in-use memory. This kicks in at about 1 GB.
This makes it possible to run Serenity with more than 64 MB of RAM.
Because each physical page is represented by a PhysicalPage object, and such
objects are allocated using kmalloc_eternal(), more RAM means more pressure
on kmalloc_eternal(), so we're gonna need a better strategy for this.
But for now, let's just celebrate that we can use the 128 MB of RAM we've
been telling QEMU to run with. :^)
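A rough back-of-envelope for why this hurts; the per-record size here is an assumption for illustration, not the actual sizeof(PhysicalPage):

    #include <cstddef>

    constexpr size_t ram_bytes = 128 * 1024 * 1024;      // 128 MB of RAM
    constexpr size_t page_size = 4096;
    constexpr size_t page_count = ram_bytes / page_size; // 32768 pages
    constexpr size_t assumed_record_bytes = 16;          // hypothetical per-PhysicalPage cost
    constexpr size_t metadata_bytes = page_count * assumed_record_bytes;

    static_assert(page_count == 32768, "one record per physical page");
    static_assert(metadata_bytes == 512 * 1024, "~512 KB of kmalloc_eternal, growing with RAM");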