We need to make sure other processors can grab the MM lock while we
wait, so release it when we might block. Reading the page from
disk may also block, so release it during that time as well.
Rather than walking all Thread instances and putting them into
a vector to be sorted by priority, queue them into priority sorted
linked lists as soon as they become ready to be executed.
Attempt to wake idle processors to get threads to be scheduled more quickly.
We don't want to wait until the next timer tick if we have processors that
aren't doing anything.
This eliminates the window between calling Processor::current and
the member function where a thread could be moved to another
processor. This is generally not as big of a concern as with
Processor::current_thread, but also slightly more light weight.
Change Thread::current to be a static function and read using the fs
register, which eliminates a window between Processor::current()
returning and calling a function on it, which can trigger preemption
and a move to a different processor, which then causes operating
on the wrong object.
We also need to store m_in_critical in the Thread upon switching,
and we need to restore it. This solves a problem where threads
moving between different processors could end up with an unexpected
value.
This allows us to determine what the previous mode (user or kernel)
was, e.g. in the timer interrupt. This is used e.g. to determine
whether a signal handler should be set up.
Fixes#5096
If we find ourselves with a user-accessible, non-shared Region backed by
a SharedInodeVMObject, that's pretty bad news, so let's just panic the
kernel instead of getting abused.
There might be a better place for this kind of check, so I've added a
FIXME about putting more thought into that.
This was exploitable since the shared flag determines whether inode
permission checks are applied in sys$mprotect().
The bug was pretty hard to spot due to default arguments being used
instead. This patch removes the default arguments to make explicit
at each call site what's being done.
When passing nullptr for either promises or execpromises to pledge(),
the expected behaviour is to not change their current value at all - we
were accidentally resetting them to 0, effectively dropping previously
pledge()'d promises.
We now move the execpromises state into the regular promises, and clear
the execpromises state.
Also make sure to duplicate the promise state on fork.
This fixes an issue where "su" would launch a shell which immediately
crashed due to not having pledged "stdio".
Let's force callers to provide a VM range when allocating a region.
This makes ENOMEM error handling more visible and removes implicit
VM allocation which felt a bit magical.
This tells the kernel that the process wants to use pledge, but without
pledging anything - effectively restricting it to syscalls that don't
require a certain promise. This is part of OpenBSD's pledge() as well,
which served as basis for Serenity's.
We were enabling interrupts too early, before the first context switch to
a thread was complete. This could then trigger another context switch
within the context switch, which lead to a crash.
There is a window between acquiring/releasing the lock with the atomic
variables and subsequently waiting or waking threads. With a single
processor this window was closed by using a critical section, but
this doesn't prevent other processors from running these code paths.
To solve this, set a flag in the WaitQueue while holding m_lock which
determines if threads should be blocked at all.
Instead of letting each File subclass do range allocation in their
mmap() override, do it up front in sys$mmap().
This makes us honor alignment requests for file-backed memory mappings
and simplifies the code somwhat.
This was done with the help of several scripts, I dump them here to
easily find them later:
awk '/#ifdef/ { print "#cmakedefine01 "$2 }' AK/Debug.h.in
for debug_macro in $(awk '/#ifdef/ { print $2 }' AK/Debug.h.in)
do
find . \( -name '*.cpp' -o -name '*.h' -o -name '*.in' \) -not -path './Toolchain/*' -not -path './Build/*' -exec sed -i -E 's/#ifdef '$debug_macro'/#if '$debug_macro'/' {} \;
done
# Remember to remove WRAPPER_GERNERATOR_DEBUG from the list.
awk '/#cmake/ { print "set("$2" ON)" }' AK/Debug.h.in
Booting old computers without RDRAND/RDSEED and without a disk makes
the system severely starved for entropy. Uses interrupts as a source
to side-step that issue.
Also warn whenever the system is starved of entropy, because that's
a non-obvious failure mode.
For some reason we were keeping the bits 04777 in file modes. That
doesn't seem right and I can't think of a reason why the set-uid bit
should be allowed to slip through.
(mode & S_IFDIR) is not enough to check if "mode" is a directory,
we have to check all the bits in the S_IFMT mask.
Use the is_directory() helper to fix this bug.
Since devices are enumerable and can compute their own name inside the
/dev hierarchy, there is no need to try and parse "root=/dev/xxx" by
hand.
This also makes any block device a candidate for the boot device, which
now includes ramdisk devices, so SerenityOS can now boot diskless too.
The disk image generated for QEMU is suitable, as long as it fits in
memory with room to spare for the rest of the system.
Besides removing the monolithic DevFSDeviceInode::determine_name()
method, being able to determine a device's name inside the /dev
hierarchy outside of DevFS has its uses.
The kernel ignored the first 8 MiB of RAM while parsing the memory map
because the kmalloc heaps and the super physical pages lived here. Move
all that stuff inside the .bss segment so that those memory regions are
accounted for, otherwise we risk overwriting boot modules placed next
to the kernel.
This was just an alias for "unix" that I added early on back when there
was some belief that we might be compatible with OpenBSD. We're clearly
never going to be compatible with their pledges so just drop the alias.
It was possible to signal a process while it was paging in an inode
backed VM object. This would cause the inode read to EINTR, and the
page fault handler would assert.
Solve this by simply not unblocking threads due to signals if they are
currently busy handling a page fault. This is probably not the best way
to solve this issue, so I've added a FIXME to that effect.
..and allow implicit creation of KResult and KResultOr from ErrnoCode.
This means that kernel functions that return those types can finally
do "return EINVAL;" and it will just work.
There's a handful of functions that still deal with signed integers
that should be converted to return KResults.
This way, if something goes wrong, we get to keep the actual error.
Also, KResults are nodiscard, so we have to deal with that in Ext2FS
instead of just silently ignoring I/O errors(!)
Similar to LibC storing an assertion message before aborting, process
death by pledge violation now sets a "pledge_violation" key with the
respective pledge name as value in its coredump metadata, which the
CrashReporter will then show.
Path resolution will now refuse to follow symlinks in some cases where
you don't own the symlink, or when it's in a sticky world-writable
directory and the link has a different owner than the directory.
The point of all this is to prevent classic TOCTOU bugs in /tmp etc.
Fixes#4934
Both ESP and GDTR are left undefined by the Multiboot specification and
OS images must not rely on these values to be valid. Fix the undefined
behaviors so that booting with PXELINUX does not triple-fault the CPU.
This file was useful for debugging a long time ago, but has bitrotted
at this point. Instead of updating it, let's just remove it since
nothing is using it.
This adds support for FUTEX_WAKE_OP, FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET,
FUTEX_REQUEUE, and FUTEX_CMP_REQUEUE, as well well as global and private
futex and absolute/relative timeouts against the appropriate clock. This
also changes the implementation so that kernel resources are only used when
a thread is blocked on a futex.
Global futexes are implemented as offsets in VMObjects, so that different
processes can share a futex against the same VMObject despite potentially
being mapped at different virtual addresses.
This sort-of matches what some other systems do and seems like a
generally sane thing to do instead of allowing programs to spawn a
child with a nearly full stack.
When freeing an inode, we were checking if it's a directory *after*
wiping the inode metadata. This caused us to forget updating the block
group descriptor with the new directory count.
We forgot to remove the automatic SMAP disablers after fixing up all
this code to not access userspace memory directly. Let's lock things
down at last. :^)
All users of this mechanism have been switched to anonymous files and
passing file descriptors with sendfd()/recvfd().
Shbufs got us where we are today, but it's time we say good-bye to them
and welcome a much more idiomatic replacement. :^)
The priority boosting mechanism has been broken for a very long time.
Let's remove it from the codebase and we can bring it back the day
someone feels like implementing it in a working way. :^)
These changes are arbitrarily divided into multiple commits to make it
easier to find potentially introduced bugs with git bisect.
This commit touches some dbg() calls which are enclosed in macros. This
should be fine because with the new constexpr stuff, we ensure that the
stuff actually compiles.
Killing remaining threads already happens in Process::die(), but
coredumps are only written in Process::finalize(). We need to keep a
reference to each of those threads to prevent them from being destructed
between those two functions, otherwise coredumps will only ever contain
information about the last remaining thread.
Fixes the underlying problem of #4778, though the UI will need
refinements to not show every thread's backtrace mashed together.
This is in preparation of adding (much) more process information to
coredumps. As we can only have one null-terminated char[] of arbitrary
length in each struct it's now a single JSON blob, which is a great fit:
easily extensible in the future and allows for key/value pairs and even
nested objects, which will be used e.g. for the process environment, for
example.
This patch adds a new AnonymousFile class which is a File backed by
an AnonymousVMObject that can only be mmap'ed and nothing else, really.
I'm hoping that this can become a replacement for shbufs. :^)
Problem:
- Many constructors are defined as `{}` rather than using the ` =
default` compiler-provided constructor.
- Some types provide an implicit conversion operator from `nullptr_t`
instead of requiring the caller to default construct. This violates
the C++ Core Guidelines suggestion to declare single-argument
constructors explicit
(https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c46-by-default-declare-single-argument-constructors-explicit).
Solution:
- Change default constructors to use the compiler-provided default
constructor.
- Remove implicit conversion operators from `nullptr_t` and change
usage to enforce type consistency without conversion.
The vast majority of programs don't ever need to use sys$ptrace(),
and it seems like a high-value system call to prevent a compromised
process from using.
This patch moves sys$ptrace() from the "proc" promise to its own,
new "ptrace" promise and updates the affected apps.
Problem:
- The implementation of `find` is coupled to the implementation of
`SinglyLinkedList`.
Solution:
- Decouple the implementation of `find` from the class by using a
generic `find` algorithm.
Problem:
- The implementation of `find` is coupled to the implementation of
`DoublyLinkedList`.
- `append` and `prepend` are implemented multiple times so that
r-value references can be moved from into the new node. This is
probably not called very often because a pr-value or x-value needs
to be used here.
Solution:
- Decouple the implementation of `find` from the class by using a
generic `find` algorithm.
- Make `append` and `prepend` be function templates so that they can
have binding references which can be forwarded.
This patch merges the profiling functionality in the kernel with the
performance events mechanism. A profiler sample is now just another
perf event, rather than a dedicated thing.
Since perf events were already per-process, this now makes profiling
per-process as well.
Processes with perf events would already write out a perfcore.PID file
to the current directory on death, but since we may want to profile
a process and then let it continue running, recorded perf events can
now be accessed at any time via /proc/PID/perf_events.
This patch also adds information about process memory regions to the
perfcore JSON format. This removes the need to supply a core dump to
the Profiler app for symbolication, and so the "profiler coredump"
mechanism is removed entirely.
There's still a hard limit of 4MB worth of perf events per process,
so this is by no means a perfect final design, but it's a nice step
forward for both simplicity and stability.
Fixes#4848Fixes#4849
When loading non position-independent programs, we now take care not to
load the dynamic loader at an address that collides with the location
the main program wants to load at.
Fixes#4847.
This will enable us to take the desired load address of non-position
independent programs into account when randomizing the load address
of the dynamic loader.
Trying to pass these onto the Terminal while handling an IRQ is a recipe
for disaster. Use Processor::deferred_call_queue to create an ad-hoc
"second half" of the interrupt handler.
Fixes#4889
SystemServer now creates the /tmp/coredump and /tmp/profiler_coredumps
directories at startup, ensuring that they are owned by root, and with
basic 0755 permissions.
The kernel will also now refuse to put core dumps in a directory that
doesn't fulfill the following criteria:
- Owned by 0:0
- Directory with sticky bit not set
- 0755 permissions
Fixes#4435Fixes#4850
We were not handling sticky parents properly in sys$rmdir(). Child
directories of a sticky parent should not be rmdir'able by just anyone.
Only the owner and root.
Fixes#4875.
Before this change, truncating an Ext2FS inode to a larger size than it
was before would give you uninitialized on-disk data.
Fix this by zeroing out all the new space when doing an inode resize.
This is pretty naively implemented via Inode::write_bytes() and there's
lots of room for cleverness here in the future.
These changes are arbitrarily divided into multiple commits to make it
easier to find potentially introduced bugs with git bisect.Everything:
The modifications in this commit were automatically made using the
following command:
find . -name '*.cpp' -exec sed -i -E 's/dbg\(\) << ("[^"{]*");/dbgln\(\1\);/' {} \;
We can now test a _very_ basic transaction via `do_debug_transfer()`.
This function merely attaches some TDs to the LSCTRL queue head
and points some input and output buffers. We then sense an interrupt
with USBSTS value of 1, meaning Interrupt On Completion
(of the transaction). At this point, the input buffer is filled with
some data.
According the USB spec/UHCI datasheet (as well as the Linux and
BSD source code), if we receive an IRQ and USBSTS is 0, then
the IRQ does not belong to us and we should immediately jump
out of the handler.
We can now read/write to the two root ports exposed to the
UHCI controller, and detect when a device is plugged in or
out via a kernel process that constantly scans the port
for any changes. This is very basic, but is a bit of fun to see
the kernel detecting hardware on the fly :^)
Implemented both Queue Heads and Transfer Descriptors. These
are required to actually perform USB transactions. The UHCI
driver sets up a pool of these that can be allocated when we
need them. It seems some drivers have these statically
allocated, so it might be worth looking into that, but
for now, the simple way seems to be to allocate them on
the fly as we need them, and then release them.
It seems that not setting the framelist address register
was causing the entire system to lock up as it generated an insane
interrupt storm in the IRQ handler for the UHCI controller.
We now allocate a 4KiB aligned page via
`MemoryManager::allocate_supervisor_physical_page()` and set every
value to 1. In effect, this creates a framelist with each entry
being a "TERMINATE" entry in which the controller stalls until its'
1mS time slice is up.
Some more registers have also been set for consistency, though it
seems like this don't need to be set explicitly in software.
This patch adds sys$abort() which immediately crashes the process with
SIGABRT. This makes assertion backtraces a lot nicer by removing all
the gunk that otherwise happens between __assertion_failed() and
actually crashing from the SIGABRT.
When ProcFS could no longer allocate KBuffer objects to serve calls to
read, it would just return 0, indicating EOF. This then triggered
parsing errors because code assumed it read the file.
Because read isn't supposed to return ENOMEM, change ProcFS to populate
the file data upon file open or seek to the beginning. This also means
that calls to open can now return ENOMEM if needed. This allows the
caller to either be able to successfully open the file and read it, or
fail to open it in the first place.
Commit a3a9016701 removed the PT_INTERP header
from Loader.so which cleaned up some kernel code in execve. Unfortunately
it prevents Loader.so from being run as an executable
There is a window between dropping the last reference and removing
a ProcFSInode from the lookup map. So, when looking up we need to
check if that Inode is being destructed.
If a TLB flush request is broadcast to other processors and the addresses
to flush are user mode addresses, we can ignore such a request on the
target processor if the page directory currently in use doesn't match
the addresses to be flushed. We still need to broadcast to all processors
in that case because the other processors may switch to that same page
directory at any time.
If we remap pages (e.g. lazy allocation) inside a VMObject that is
shared among more than one region, broadcast it to any other region
that may be mapping the same page.
Lazily committed shared memory was not working in situations where one
process would write to the memory and another would only read from it.
Since the reading process would never cause a write fault in the shared
region, we'd never notice that the writing process had added real
physical pages to the VMObject. This happened because the lazily
committed pages were marked "present" in the page table.
This patch solves the issue by always allocating shared memory up front
and not trying to be clever about it.
Before this change, we would sometimes map a region into the address
space with !is_shared(), and then moments later call set_shared(true).
I found this very confusing while debugging, so this patch makes us pass
the initial shared flag to the Region constructor, ensuring that it's in
the correct state by the time we first map the region.
Insert stack canaries to find stack corruptions in the kernel.
It looks like this was enabled in the past (842716a) but appears to have been
lost during the CMake conversion.
The `-fstack-protector-strong` variant was chosen because it catches more issues
than `-fstack-protector`, but doesn't have substantial performance impact like
`-fstack-protector-all`.
This fixes a kernel crash that occured when calling ptrace with PT_PEEK
on non paged-in memory.
The crash occurred because we were holding the scheduler lock while
trying to read from the disk's block device, which we do not allow.
Fixes#4740
We need to allocate all pages for the profiler right away so that
we don't trigger page faults in the timer interrupt handler to
allocate them.
Fixes#4734
We need to free the regions before reverting the paging scope to the
original one when rolling back changes due to an error. This fixes
silent memory corruption.
By designating a committed page pool we can guarantee to have physical
pages available for lazy allocation in mappings. However, when forking
we will overcommit. The assumption is that worst-case it's better for
the fork to die due to insufficient physical memory on COW access than
the parent that created the region. If a fork wants to ensure that all
memory is available (trigger a commit) then it can use madvise.
This also means that fork now can gracefully fail if we don't have
enough physical pages available.
This brings mmap more in line with other operating systems. Prior to
this, it was impossible to request memory that was definitely committed,
instead MAP_PURGEABLE would provide a region that was not actually
purgeable, but also not fully committed, which meant that using such memory
still could cause crashes when the underlying pages could no longer be
allocated.
This fixes some random crashes in low-memory situations where non-volatile
memory is mapped (e.g. malloc, tls, Gfx::Bitmap, etc) but when a page in
these regions is first accessed, there is insufficient physical memory
available to commit a new page.
Rather than lazily committing regions by default, we now commit
the entire region unless MAP_NORESERVE is specified.
This solves random crashes in low-memory situations where e.g. the
malloc heap allocated memory, but using pages that haven't been
used before triggers a crash when no more physical memory is available.
Use this flag to create large regions without actually committing
the backing memory. madvise() can be used to commit arbitrary areas
of such regions after creating them.
This adds the ability for a Region to define volatile/nonvolatile
areas within mapped memory using madvise(). This also means that
memory purging takes into account all views of the PurgeableVMObject
and only purges memory that is not needed by all of them. When calling
madvise() to change an area to nonvolatile memory, return whether
memory from that area was purged. At that time also try to remap
all memory that is requested to be nonvolatile, and if insufficient
pages are available notify the caller of that fact.
Instead of specifying the boot argument to be root=/dev/hdXY, now
one can write root=PARTUUID= with the right UUID, and if the partition
is found, the kernel will boot from it.
This feature is mainly used with GUID partitions, and is considered to
be the most reliable way for the kernel to identify partitions.
RTTI is still disabled for the Kernel, and for the Dynamic Loader. This
allows for much less awkward navigation of class heirarchies in LibCore,
LibGUI, LibWeb, and LibJS (eventually). Measured RootFS size increase
was < 1%, and libgui.so binary size was ~3.3%. The small binary size
increase here seems worth it :^)
Use the GNU LD option --no-dynamic-linker. This allows uncommenting some
code in the Kernel that gets upset if your ELF interpreter has its own
interpreter.
Compared to version 10 this fixes a bunch of formatting issues, mostly
around structs/classes with attributes like [[gnu::packed]], and
incorrect insertion of spaces in parameter types ("T &"/"T &&").
I also removed a bunch of // clang-format off/on and FIXME comments that
are no longer relevant - on the other hand it tried to destroy a couple of
neatly formatted comments, so I had to add some as well.
BlockCondition::unblock should return true if it unblocked at
least one thread, not if iterating the blockers had been stopped.
This is a regression introduced by 49a76164c.
Fixes#4670
This assertion cannot be safely/reliably made in the
~SharedInodeVMObject destructor. The problem is that
Inode::is_shared_vmobject holds a weak reference to the instance
that is being destroyed (ref count 0). Checking the pointer using
WeakPtr::unsafe_ptr will produce nullptr depending on timing in
this case, and WeakPtr::safe_ref will reliably produce a nullptr
as soon as the reference count drops to 0. The only case where
this assertion could succeed is when WeakPtr::unsafe_ptr returned
the pointer because it won the race against revoking it. And
because WeakPtr::safe_ref will always return a nullptr, we cannot
reliably assert this from the ~SharedInodeVMObject destructor.
Fixes#4621
If a heap expansion is triggered by allocating from e.g. the
RangeAllocator, which may be holding a spin lock, we cannot
immediately allocate another block of backup memory, which could
require the same locks to be acquired. So, defer allocating the
backup memory
Fixes#4675
When doing the cast to u64 on the page directory physical address,
the sign bit was being extended. This only beomes an issue when
crossing the 2 GiB boundary. At >= 2 GiB, the physical address
has the sign bit set. For example, 0x80000000.
This set all the reserved bits in the PDPTE, causing a GPF
when loading the PDPT pointer into CR3. The reserved bits are
presumably there to stop you writing out a physical address that
the CPU physically cannot handle, as the size of the reserved bits
is determined by the physical address width of the CPU.
This fixes this by casting to FlatPtr instead. I believe the sign
extension only happens when casting to a bigger type. I'm also using
FlatPtr because it's a pointer we're writing into the PDPTE.
sizeof(FlatPtr) will always be the same size as sizeof(void*).
This also now asserts that the physical address in the PDPTE is
within the max physical address the CPU supports. This is better
than getting a GPF, because CPU::handle_crash tries to do the same
operation that caused the GPF in the first place. That would cause
an infinite loop of GPFs until the stack was exhausted, causing a
triple fault.
As far as I know and tested, I believe we can now use the full 32-bit
physical range without crashing.
Fixes#4584. See that issue for the full debugging story.
The unblock_all variant used to ASSERT if a blocker didn't unblock,
but it wasn't clear from the name that it would do that. Because
the BlockCondition already asserts that no blockers are left at
destruction time, it would still catch blockers that haven't been
unblocked for whatever reason.
Fixes#4496
* Add SERENITY_ARCH option to CMake for selecting the target toolchain
* Port all build scripts but continue to use i686
* Update GitHub Actions cache to include BuildIt.sh
Anything above or equal to the 2 GB mark has the left most bit set
(0x8000...), which was falsely interpreted as negative due to
local_offset being signed.
This makes it unsigned by using FlatPtr. To check for underflow as
was intended, lets use Checked instead.
Fixes#4585
The partitioning code was very outdated, and required a full refactor.
The new subsystem removes duplicated code and uses more AK containers.
The most important change is that all implementations of the
PartitionTable class conform to one interface, which made it possible
to remove unnecessary code in the EBRPartitionTable class.
Finding partitions is now done in the StorageManagement singleton,
instead of doing so in init.cpp.
Also, now we don't try to find partitions on demand - the kernel will
try to detect if a StorageDevice is partitioned, and if so, will check
what is the partition table, which could be MBR, GUID or EBR.
Then, it will create DiskPartitionMetadata object for each partition
that is available in the partition table. This object will be used
by the partition enumeration code to create a DiskPartition with the
correct minor number.
The DevFS along with DevPtsFS give a complete solution for populating
device nodes in /dev. The main purpose of DevFS is to eliminate the
need of device nodes generation when building the system.
Later on, DevFS will assist with exposing disk partition nodes.
BlockBasedFileSystem::read_block method should get a reference of
a UserOrKernelBuffer.
If we need to force caching a block, we will call other method to do so.
clang trunk with -std=c++20 doesn't seem to properly look for an
aggregate initializer here when the type being constructed is a simple
aggregate (e.g. `struct Thing { int a; int b; };`). This template fails
to compile in a usage added 12/16/2020 in `AK/Trie.h`.
Both forms of initialization are supposed to call the
aggregate-initializers but direct-list-initialization delegating to
aggregate initializers is a new addition in c++20 that might not be
implemented yet.
The PIT is now also running at a rate of ~250 ticks/second, so rather
than assuming there are 1000 ticks/second we need to query the timer
being used for the actual frequency.
Fixes#4508
This was a goofy kernel API where you could assign an icon_id (int) to
a process which referred to a global shbuf with a 16x16 icon bitmap
inside it.
Instead of this, programs that want to display a process icon now
retrieve it from the process executable instead.
Dumping core can happen at the end of a profiling run, and in that case
we have to protect the target process and take the lock while iterating
over its region map.
Fixes#4509.
When the ExpandableHeap calls the remove_memory function, the
subheap is assumed to be removed and freed entirely. remove_memory
may drop the underlying memory at any time, but it also may cause
further allocation requests. Not removing it from the list before
calling remove_memory could cause a memory allocation in that
subheap while remove_memory is executing. which then causes issues
once the underlying memory is actually freed.
This can happen when an unveil follows another with a path that is a
sub-path of the other one:
```c++
unveil("/home/anon/.config/whoa.ini", "rw");
unveil("/home/anon", "r"); // this would fail, as "/home/anon" inherits
// the permissions of "/", which is None.
```
Problem:
- C functions with no arguments require a single `void` in the argument list.
Solution:
- Put the `void` in the argument list of functions in C header files.
This new flag controls two things:
- Whether the kernel will generate core dumps for the process
- Whether the EUID:EGID should own the process's files in /proc
Processes are automatically made non-dumpable when their EUID or EGID is
changed, either via syscalls that specifically modify those ID's, or via
sys$execve(), when a set-uid or set-gid program is executed.
A process can change its own dumpable flag at any time by calling the
new sys$prctl(PR_SET_DUMPABLE) syscall.
Fixes#4504.
This is instead of the UID:GID, since that was allowing some very bad
information leaks like spawning "su" as an unprivileged user and having
full /proc access to it.
Work towards #4504.
If the allocation fails (e.g ENOMEM) we want to simply return an error
from sys$execve() and continue executing the current executable.
This patch also moves make_userspace_stack_for_main_thread() out of the
Thread class since it had nothing in particular to do with Thread.
Process had a couple of members whose only purpose was holding on to
some temporary data while building the auxiliary vector. Remove those
members and move the vector building to a free function in execve.cpp
Make it possible to bail out of ELF::Image::for_each_program_header()
and then do exactly that if something goes wrong during executable
loading in the kernel.
Also make the errors we return slightly more nuanced than just ENOEXEC.