The stack is misaligned at this point for some reason, this is a hack
that makes the resulting object "correctly" aligned, thus avoiding a
KUBSAN error.
Mere mortals like myself cannot understand more than two lines of
assembly without a million comments explaining what's happening, so do
that and make sure no one has to go on a wild stack state chase when
hacking on these.
POSIX requires that sigaction() and friends set a _process-wide_ signal
handler, so move signal handlers and flags inside Process.
This also fixes a "pid/tid confusion" FIXME, as we can now send the
signal to the process and let that decide which thread should get the
signal (which is the thread with tid==pid, but that's now the Process's
problem).
Note that each thread still retains its signal mask, as that is local to
each thread.
These are not technically required, since the Thread constructor
already sets these, but they are set on i686, so let's try and keep
consistent behaviour between the different archs.
While investigating why gdb is failing when it calls `PT_CONTINUE`
against Serenity I noticed that the names of the programs in the
System Monitor didn't make sense. They were seemingly stale.
After inspecting the kernel code, it became apparent that the sequence
occurs as follows:
1. Debugger calls `fork()`
2. The forked child calls `PT_TRACE_ME`
3. The `PT_TRACE_ME` instructs the forked process to block in the
kernel waiting for a signal from the tracer on the next call
to `execve(..)`.
4. Debugger waits for forked child to spawn and stop, and then it
calls `PT_ATTACH` followed by `PT_CONTINUE` on the child.
5. Currently the `PT_CONTINUE` fails because of some other yet to
be found bug.
6. The process name is set immediately AFTER we are woken up by
the `PT_CONTINUE` which never happens in the case I'm debugging.
This chain of events leaves the process suspended, with the name of
the original (forked) process instead of the name we inherit from
the `execve(..)` call.
To avoid such confusion in the future, we set the new name before we
block waiting for the tracer.
Arguments larger than 32bit need to be passed as a pointer on a 32bit
architectures. sys$profiling_enable has u64 event_mask argument,
which means that it needs to be passed as an pointer. Previously upper
32bits were filled by garbage.
This matches the likes of the adopt_{own, ref}_if_nonnull family and
also frees up the name to allow us to eventually add OOM-fallible
versions of these functions.
Move the definitions for maximum argument and environment size to
Process.h from execve.cpp. This allows sysconf(_SC_ARG_MAX) to return
the actual argument maximum of 128 KiB to userspace.
Function-local `static constexpr` variables can be `constexpr`. This
can reduce memory consumption, binary size, and offer additional
compiler optimizations.
These changes result in a stripped x86_64 kernel binary size reduction
of 592 bytes.
Rename the bound socket accessor from socket() to bound_socket().
Also return RefPtr<LocalSocket> instead of a raw pointer, to make it
harder for callers to mess up.
Previously we would return a bytes written value of 0 if the writing end
of the socket was full. Now we either exit with EAGAIN if the socket
description is non-blocking, or block until the description can be
written to.
This is mostly a copy of the conditions in sys$write but with the "total
nwritten" parts removed as sys$sendmsg does not have that.
This commit removes the usage of HashMap in Mutex, thereby making Mutex
be allocation-free.
In order to achieve this several simplifications were made to Mutex,
removing unused code-paths and extra VERIFYs:
* We no longer support 'upgrading' a shared lock holder to an
exclusive holder when it is the only shared holder and it did not
unlock the lock before relocking it as exclusive. NOTE: Unlike the
rest of these changes, this scenario is not VERIFY-able in an
allocation-free way, as a result the new LOCK_SHARED_UPGRADE_DEBUG
debug flag was added, this flag lets Mutex allocate in order to
detect such cases when debugging a deadlock.
* We no longer support checking if a Mutex is locked by the current
thread when the Mutex was not locked exclusively, the shared version
of this check was not used anywhere.
* We no longer support force unlocking/relocking a Mutex if the Mutex
was not locked exclusively, the shared version of these functions
was not used anywhere.
We were marking the execing thread as Runnable near the end of
Process::do_exec().
This was necessary for exec in processes that had never been scheduled
yet, which is a specific edge case that only applies to the very first
userspace process (normally SystemServer). At this point, such threads
are in the Invalid state.
In the common case (normal userspace-initiated exec), making the current
thread Runnable meant that we switched away from its current state:
Running. As the thread is indeed running, that's a bogus change!
This created a short time window in which the thread state was bogus,
and any attempt to block the thread would panic the kernel (due to a
bogus thread state in Thread::block() leading to VERIFY_NOT_REACHED().)
Fix this by not touching the thread state in Process::do_exec()
and instead make the first userspace thread Runnable directly after
calling Process::exec() on it in try_create_userspace_process().
It's unfortunate that exec() can be called both on the current thread,
and on a new thread that has never been scheduled. It would be good to
not have the latter edge case, but fixing that will require larger
architectural changes outside the scope of this fix.
There are many assumptions in the stack that argc is not zero, and
argv[0] points to a valid string. The recent pwnkit exploit on Linux
was able to exploit this assumption in the `pkexec` utility
(a SUID-root binary) to escalate from any user to root.
By convention `execve(..)` should always be called with at least one
valid argument, so lets enforce that semantic to harden the system
against vulnerabilities like pwnkit.
Reference: https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
These checks in `sys$execve` could trip up the system whenever you try
to execute an `.so` file. For example, double-clicking `libwasm.so` in
Terminal crashes the kernel.
This changes the program header alignment checks to reflect the same
checks in LibELF, and passes the requested alignment on to
`::try_allocate_range()`.
This was easily done, as the Kernel and Userland don't actually share
any of the APIs exposed by it, so instead the Kernel APIs were moved to
the Kernel, and the Userland APIs stayed in LibKeyboard.
This has multiple advantages:
* The non OOM-fallible String is not longer used for storing the
character map name in the Kernel
* The kernel no longer has to link to the userland LibKeyboard code
* A lot of #ifdef KERNEL cruft can be removed from LibKeyboard
Since we don't return normally from this function, let's make it a
little extra difficult to accidentally leak something by leaving it on
the stack in this function.
This ensures that everything allocated on the stack in Process::exec()
gets cleaned up. We had a few leaks related to the parsing of shebang
(#!) executables that get fixed by this.
Since we don't return from sys$execve() when it's successful, we have to
take special care to tear down anything we've allocated.
Turns out we were not doing this for the full executable path itself.
Since this was only out of bounds of the specific field, not of the
whole struct, and because setting the hostname requires root privileges
this was not actually a security vulnerability.
This function is an extended version of `chmod(2)` that lets one control
whether to dereference symlinks, and specify a file descriptor to a
directory that will be used as the base for relative paths.
This modifies sys$chown to allow specifying whether or not to follow
symlinks and in which directory.
This was then used to implement lchown and fchownat in LibC and LibCore.
Previously we would crash the process immediately when a promise
violation was found during a syscall. This is error prone, as we
don't unwind the stack. This means that in certain cases we can
leak resources, like an OwnPtr / RefPtr tracked on the stack. Or
even leak a lock acquired in a ScopeLockLocker.
To remedy this situation we move the promise violation handling to
the syscall handler, right before we return to user space. This
allows the code to follow the normal unwind path, and grantees
there is no longer any cleanup that needs to occur.
The Process::require_promise() and Process::require_no_promises()
functions were modified to return ErrorOr<void> so we enforce that
the errors are always propagated by the caller.
This change lays the foundation for making the require_promise return
an error hand handling the process abort outside of the syscall
implementations, to avoid cases where we would leak resources.
It also has the advantage that it makes removes a gs pointer read
to look up the current thread, then process for every syscall. We
can instead go through the Process this pointer in most cases.
This change lays the foundation for making the require_promise return
an error hand handling the process abort outside of the syscall
implementations, to avoid cases where we would leak resources.
It also has the advantage that it makes removes a gs pointer read
to look up the current thread, then process for every syscall. We
can instead go through the Process this pointer in most cases.
I fell into this trap and tried to switch the syscalls to pass by
the `off_t` by register. I think it makes sense to add a clarifying
comment for future readers of the code, so they don't fall into the
same trap. :^)
In `sys$accept4()` and `get_sock_or_peer_name()` we were not
initializing the padding of the `sockaddr_un` struct, leading to
an kernel information leak if the
caller looked back at it's contents.
Before Fix:
37.766 Clipboard(11:11): accept4 Bytes:
2f746d702f706f7274616c2f636c6970626f61726440eac130e7fbc1e8abbfc
19c10ffc18440eac15485bcc130e7fbc1549feaca6c9deaca549feaca1bb0bc
03efdf62c0e056eac1b402d7acd010ffc14602000001b0bc030100000050bf0
5c24602000001e7fbc1b402d7ac6bdc
After Fix:
0.603 Clipboard(11:11): accept4 Bytes:
2f746d702f706f7274616c2f636c6970626f617264000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000
... instead of returning the maximum number of Processor objects that we
can allocate.
Some ports (e.g. gdb) rely on this information to determine the number
of worker threads to spawn. When gdb spawned 64 threads, the kernel
could not cope with generating backtraces for it, which prevented us
from debugging it properly.
This commit also removes the confusingly named
`Processor::processor_count` function so that this mistake can't happen
again.
As a small cleanup, this also makes `page_round_up` verify its
precondition with `page_round_up_would_wrap` (which callers are expected
to call), rather than having its own logic.
Fixes#11297.
Most other syscalls pass address arguments as `void*` instead of
`uintptr_t`, so let's do that here too. Besides improving consistency,
this commit makes `strace` correctly pretty-print these arguments in
hex.
This feature was introduced in version 4.17 of the Linux kernel, and
while it's not specified by POSIX, I think it will be a nice addition to
our system.
MAP_FIXED_NOREPLACE provides a less error-prone alternative to
MAP_FIXED: while regular fixed mappings would cause any intersecting
ranges to be unmapped, MAP_FIXED_NOREPLACE returns EEXIST instead. This
ensures that we don't corrupt our process's address space if something
is already at the requested address.
Note that the more portable way to do this is to use regular
MAP_ANONYMOUS, and check afterwards whether the returned address matches
what we wanted. This, however, has a large performance impact on
programs like Wine which try to reserve large portions of the address
space at once, as the non-matching addresses have to be unmapped
separately.
Previously we were assigning to Process::m_space before actually
entering the new address space (assigning it to CR3.)
If a thread was preempted by the scheduler while destroying the old
address space, we'd then attempt to resume the thread with CR3 pointing
at a partially destroyed address space.
We could then crash immediately in write_cr3(), right after assigning
the new value to CR3. I am hopeful that this may have been the bug
haunting our CI for months. :^)
Not much to say here, this is an implementation of this call that
accesses the actual limit constant that's used by the VirtualFileSystem
class.
As a side note, this is required for my eventual Qt port.
For setreuid and setresuid syscalls, -1 means to set the current
uid/euid/gid/egid value, to be more convenient for programming.
However, for other syscalls where we pass only one argument, there's no
justification to specify -1.
This behavior is identical to how Linux handles the value -1, and is
influenced by the fact that the manual pages for the group of one
argument syscalls that handle ID operations is ambiguous about this
topic.
Now that the userland has a compatiblity wrapper for select(), the
kernel doesn't need to implement this syscall natively. The poll()
interface been around since 1987, any code still using select()
should be slapped silly.
Note: the SerenityOS source tree mostly uses select() and not poll()
despite SerenityOS having support for poll() since early 2019...
This includes a new Thread::Blocker called SignalBlocker which blocks
until a signal of a matching type is pending. The current Blocker
implementation in the Kernel is very complicated, but cleaning it up is
a different yak for a different day.
As required by posix. Also rename Thread::clear_signals to
Thread::reset_signals_for_exec since it doesn't actually clear any
pending signals, but rather does execve related signal book-keeping.
Returning 'result.error().code()' erroneously creates an
ErrorOr<FlatPtr> of the positive errno code, which breaks our
error-returning convention.
This seems to be due to a forgotten minus-sign during the refactoring in
9e51e295cf. This latent bug was never
discovered, because currently the error-handling paths are rarely
exercised.
Also, remove incomplete, superfluous check.
Incomplete, because only the byte at the provided address was checked;
this misses the last bytes of the "jerk page".
Superfluous, because it is already correctly checked by peek_user_data
(which calls copy_from_user).
The caller/tracer should not typically attempt to read non-userspace
addresses, we don't need to "hot-path" it either.
Since we're iterating over multiple regions that interesect with the
requested range, just one of them having the requested access flags
is not enough to finish the syscall early.
The advices are almost always exclusive of one another, and while POSIX
does not define madvise, most other unix-like and *BSD systems also only
accept a singular value per call.
We were returning `int`s from two functions that caused `ErrorOr` to
not recognize the error codes as a special case. For example,
`ETIMEDOUT` was returned as the positive number 66 resulting in all
kinds of defective behavior.
As a result, SDL2's timer subsystem was not working at all, since the
`SDL_MUTEX_TIMEDOUT` value was never returned.
This allows userspace to trigger a full (FIXME) flush of a shared file
mapping to disk. We iterate over all the mapped pages in the VMObject
and write them out to the underlying inode, one by one. This is rather
naive, and there's lots of room for improvement.
Note that shared file mappings are currently not possible since mmap()
returns ENOTSUP for PROT_WRITE+MAP_SHARED. That restriction will be
removed in a subsequent commit. :^)
This feels like it was a refactor transition kind of conversion. The
places that were relying on it can easily be changed to explicitly ask
for the ptr() or a new vaddr() method on Userspace<T*>.
FlatPtr can still implicitly convert to Userspace<T> because the
constructor is not explicit, but there's quite a few more places that
are relying on that conversion.
A series of refactors changed Threads to always have a name, and to
store their name as a KString. Before the refactors a StringBuilder was
used to format the default thread name for a non-main thread, but it is
since unused. Remove it and the AK/String related header includes from
the thread syscall implementation file.
`off_t` is a 64-bit signed integer, so passing it in a register on i686
is not the best idea.
This fix gets us one step closer to making the LLVM port work.
Instead of signalling allocation failure with a bool return value
(false), we now use ErrorOr<void> and return ENOMEM as appropriate.
This allows us to use TRY() and MUST() with Vector. :^)
In particular, fstatvfs used to assume that a file that was earlier
opened using some path will forever be at that path. This is wrong, and
in the meantime new mounts and new filesystems could take up the
filename or directories, leading to a completely inaccurate result.
This commit improves the situation:
- All filesystem information is now always accurate.
- The mount flags *might* be erroneously zero, if the custody for the
open file is not available. I don't know when that might happen, but
it is definitely not the typical case.
We now use AK::Error and AK::ErrorOr<T> in both kernel and userspace!
This was a slightly tedious refactoring that took a long time, so it's
not unlikely that some bugs crept in.
Nevertheless, it does pass basic functionality testing, and it's just
real nice to finally see the same pattern in all contexts. :^)
Found due to smelly code in InodeFile::absolute_path.
In particular, this replaces the following misleading methods:
File::absolute_path
This method *never* returns an actual path, and if called on an
InodeFile (which is impossible), it would VERIFY_NOT_REACHED().
OpenFileDescription::try_serialize_absolute_path
OpenFileDescription::absolute_path
These methods do not guarantee to return an actual path (just like the
other method), and just like Custody::absolute_path they do not
guarantee accuracy. In particular, just renaming the method made a
TOCTOU bug obvious.
The new method signatures use KResultOr, just like
try_serialize_absolute_path() already did.
The OpenFileDescription class already offers the necessary functionlity,
so implementing this was only a matter of following the structure for
`read` while handling the additional `offset` argument.
Having these bits of code factored out not only prevents duplication
now, but will also allow us to implement pread without repeating
ourselves (too much).
Values in `ioctl` are given through a pointer, but ioctl's FIONBIO
implementation was interpreting this pointer as an integer directly.
This meant that programs using `ioctl` to set a file descriptor in
blocking mode met with incorrect behavior: they passed a non-null
pointer pointing to a value of 0, but the kernel interpreted the pointer
as a non-zero integer, thus making the file non-blocking.
This commit fixes this behavior by reading the value from the userspace
pointer and using that to set the non-blocking flag on the file
descriptor.
This bug was found while trying to run the openssl tool on serenity,
which used `ioctl` to ensure newly-created sockets are in blocking mode.
SonarCloud flagged this "Code Smell", where we are accessing these
static methods as if they are instance methods. While it is technically
possible, it is very confusing to read when you realize they are static
functions.
Previously, attempting to call sys$waitid on non-child processes
returned ECHILD.
That prevented debugging non-child processes by attaching to them during
runtime (as opposed to forking and debugging the child, which is what
was previously supported).
We now allow calling sys$waitid on a any process that is being traced
by us, even if it's not our child.
This change removes the halt and reboot syscalls, and create a new
mechanism to change the power state of the machine.
Instead of how power state was changed until now, put a SysFS node as
writable only for the superuser, that with a defined value, can result
in either reboot or poweroff.
In the future, a power group can be assigned to this node (which will be
the GroupID responsible for power management).
This opens an opportunity to permit to shutdown/reboot without superuser
permissions, so in the future, a userspace daemon can take control of
this node to perform power management operations without superuser
permissions, if we enforce different UserID/GroupID on that node.
This will somwhat help unify them also under the same SysFS directory in
the commit.
Also, it feels much more like this change reflects the reality that both
ACPI and the BIOS are part of the firmware on x86 computers.
These interfaces are broken for about 9 months, maybe longer than that.
At this point, this is just a dead code nobody tests or tries to use, so
let's remove it instead of keeping a stale code just for the sake of
keeping it and hoping someone will fix it.
To better justify this, I read that OpenBSD removed loadable kernel
modules in 5.7 release (2014), mainly for the same reason we do -
nobody used it so they had no good reason to maintain it.
Still, OpenBSD had LKMs being effectively working, which is not the
current state in our project for a long time.
An arguably better approach to minimize the Kernel image size is to
allow dropping drivers and features while compiling a new image.
This patch converts all the usage of AK::String around sys$execve() to
using KString instead, allowing us to catch and propagate OOM errors.
It also required changing the kernel CommandLine helper class to return
a vector of KString for the userspace init program arguments.
The current implementation of DevFS resembles the linux devtmpfs, and
not the traditional DevFS, so let's rename it to better represent the
direction of the development in regard to this filesystem.
The abbreviation for DevTmpFS is still "dev", because it doesn't add
value as a commandline option to make it longer.
In quick summary - DevFS in unix OSes is simply a static filesystem, so
device nodes are generated and removed by the kernel code. DevTmpFS
is a "modern reinvention" of the DevFS, so it is much more like a TmpFS
in the sense that not only it's stored entirely in RAM, but the userland
is responsible to add and remove devices nodes as it sees fit, and no
kernel code is directly being involved to keep the filesystem in sync.
This was a weird KBuffer API that assumed failure was impossible.
This patch converts it to a modern KResultOr<NonnullOwnPtr<KBuffer>> API
and updates the two clients to the new style.
Make use of the new FileDescription::try_serialize_absolute_path() to
avoid String in favor of KString throughout much of sys$execve() and
its helpers.
I had to look at this for a moment before I realized that sys$execve()
and the spawning of /bin/SystemServer at boot are taking two different
paths out of exec().
Add a comment to help the next person looking at it. :^)
Before this patch, we were leaking a ref on the open file description
used for the interpreter (the dynamic loader) in sys$execve().
This surfaced when adapting the syscall to use TRY(), since we were now
correctly transferring ownership of the interpreter to Process::exec()
and no longer holding on to a local copy of it (in `elf_result`).
Fixing the leak uncovered another problem. The interpreter description
would now get destroyed when returning from do_exec(), which led to a
kernel panic when attempting to acquire a mutex.
This happens because we're in a particularly delicate state when
returning from do_exec(). Everything is primed for the upcoming context
switch into the new executable image, and trying to block the thread
at this point will panic the kernel.
We fix this by destroying the interpreter description earlier in
do_exec(), at the point where we no longer need it.
When executing a dynamically linked program, we need to pass the main
program executable via a file descriptor to the dynamic loader.
Before this patch, we were allocating an FD for this purpose long after
it was safe to do anything fallible. If we were unable to allocate an
FD we would simply panic the kernel(!)
We now hoist the allocation so it can fail before we've committed to
a new executable.
Due to the use of ELF::Image::for_each_program_header(), we were
previously unable to use TRY() in the ELF loading code (since the return
statement inside TRY() would only return from the iteration callback.)
Once we commit to a new executable image in sys$execve(), we can no
longer return with an error to whoever called us from userspace.
We must make sure to surface any potential errors before that point.
This patch moves signal trampoline allocation before the commit.
A number of other things remain to be moved.
We previously allowed Thread to exist in a state where its m_name was
null, and had to work around that in various places.
This patch removes that possibility and forces those who would create a
thread (or change the name of one) to provide a NonnullOwnPtr<KString>
with the name.
- Renamed try_create_absolute_path() => try_serialize_absolute_path()
- Use KResultOr and TRY() to propagate errors
- Don't call this when it's only for debug logging
- Use KResultOr and TRY() to propagate errors
- Check for OOM errors
- Move allocation out of constructors
There's still a lot more to do here, as SysFS is still quite brittle
in the face of memory pressure.
This expands the reach of error propagation greatly throughout the
kernel. Sadly, it also exposes the fact that we're allocating (and
doing other fallible things) in constructors all over the place.
This patch doesn't attempt to address that of course. That's work for
our future selves.