Commit graph

248 commits

Author SHA1 Message Date
Liav A
633006926f Kernel: Make the Jails' internal design a lot more sane
This is done with 2 major steps:
1. Remove JailManagement singleton and use a structure that resembles
    what we have with the Process object. This is required later for the
    second step in this commit, but on its own, is a major change that
    removes this clunky singleton that had no real usage by itself.
2. Use IntrusiveLists to keep references to Process objects in the same
    Jail so it will be much more straightforward to iterate on this kind
    of objects when needed. Previously we locked the entire Process list
    and we did a simple pointer comparison to check if the checked
    Process we iterate on is in the same Jail or not, which required
    taking multiple Spinlocks in a very clumsy and heavyweight way.
2023-03-12 10:21:59 -06:00
Andreas Kling
d1371d66f7 Kernel: Use non-locking {Nonnull,}RefPtr for OpenFileDescription
This patch switches away from {Nonnull,}LockRefPtr to the non-locking
smart pointers throughout the kernel.

I've looked at the handful of places where these were being persisted
and I don't see any race situations.

Note that the process file descriptor table (Process::m_fds) was already
guarded via MutexProtected.
2023-03-07 00:30:12 +01:00
Andreas Kling
359d6e7b0b Everywhere: Stop using NonnullOwnPtrVector
Same as NonnullRefPtrVector: weird semantics, questionable benefits.
2023-03-06 23:46:35 +01:00
Sam Atkins
fe7b08dad7 Kernel: Protect Process::m_name with a spinlock
This also lets us remove the `get_process_name` and `set_process_name`
syscalls from the big lock. :^)
2023-02-06 20:36:53 +01:00
Timon Kruiper
b941bd55d9 Kernel: Add Syscalls/execve.cpp to aarch64 build 2023-01-27 20:47:08 +00:00
Timon Kruiper
1fbf562e7e Kernel: Add ThreadRegisters::set_exec_state and use it in execve.cpp
Using this abstraction it is possible to compile this file for aarch64.
2023-01-27 20:47:08 +00:00
Timon Kruiper
12322670cb Kernel: Use InterruptsState abstraction in execve.cpp
This was using the x86_64 specific cpu_flags abstraction, which is not
compatible with aarch64.
2023-01-27 20:47:08 +00:00
Andrew Kaster
ddea37b521 Kernel+LibC: Move name length constants to Kernel/API from limits.h
Reduce inclusion of limits.h as much as possible at the same time.

This does mean that kmalloc.h is now including Kernel/API/POSIX/limits.h
instead of LibC/limits.h, but the scope could be limited a lot more.
Basically every file in the kernel includes kmalloc.h, and needs the
limits.h include for PAGE_SIZE.
2023-01-21 10:43:59 -07:00
Liav A
04221a7533 Kernel: Mark Process::jail() method as const
We really don't want callers of this function to accidentally change
the jail, or even worse - remove the Process from an attached jail.
To ensure this never happens, we can just declare this method as const
so nobody can mutate it this way.
2023-01-07 03:44:59 +03:30
yyny
9ca979846c Kernel: Add sid and pgid to Credentials
There are places in the kernel that would like to have access
to `pgid` credentials in certain circumstances.

I haven't found any use cases for `sid` yet, but `sid` and `pgid` are
both changed with `sys$setpgid`, so it seemed sensical to add it.

In Linux, `man 7 credentials` also mentions both the session id and
process group id, so this isn't unprecedented.
2023-01-03 18:13:11 +01:00
Liav A
e598f22768 Kernel: Disallow executing SUID binaries if process is jailed
Check if the process we are currently running is in a jail, and if that
is the case, fail early with the EPERM error code.

Also, as Brian noted, we should also disallow attaching to a jail in
case of already running within a setid executable, as this leaves the
user with false thinking of being secure (because you can't exec new
setid binaries), but the current program is still marked setid, which
means that at the very least we gained permissions while we didn't
expect it, so let's block it.
2022-12-30 15:49:37 -05:00
Liav A
5ff318cf3a Kernel: Remove i686 support 2022-12-28 11:53:41 +01:00
Agustin Gianni
ac40090583 Kernel: Add the auxiliary vector to the stack size validation
This patch validates that the size of the auxiliary vector does not
exceed `Process::max_auxiliary_size`. The auxiliary vector is a range
of memory in userspace stack where the kernel can pass information to
the process that will be created via `Process:do_exec`.

The reason the kernel needs to validate its size is that the about to
be created process needs to have remaining space on the stack.
Previously only `argv` and `envp` were taken into account for the
size validation, with this patch, the size of `auxv` is also
checked. All three elements contain values that a user (or an
attacker) can specify.

This patch adds the constant `Process::max_auxiliary_size` which is
defined to be one eight of the user-space stack size. This is the
approach taken by `Process:max_arguments_size` and
`Process::max_environment_size` which are used to check the sizes
of `argv` and `envp`.
2022-12-14 15:09:28 +00:00
sin-ack
ef6921d7c7 Kernel+LibC+LibELF: Set stack size based on PT_GNU_STACK during execve
Some programs explicitly ask for a different initial stack size than
what the OS provides. This is implemented in ELF by having a
PT_GNU_STACK header which has its p_memsz set to the amount that the
program requires. This commit implements this policy by reading the
p_memsz of the header and setting the main thread stack size to that.
ELF::Image::validate_program_headers ensures that the size attribute is
a reasonable value.
2022-12-11 19:55:37 -07:00
Liav A
718ae68621 Kernel+LibCore+LibC: Implement support for forcing unveil on exec
To accomplish this, we add another VeilState which is called
LockedInherited. The idea is to apply exec unveil data, similar to
execpromises of the pledge syscall, on the current exec'ed program
during the execve sequence. When applying the forced unveil data, the
veil state is set to be locked but the special state of LockedInherited
ensures that if the new program tries to unveil paths, the request will
silently be ignored, so the program will continue running without
receiving an error, but is still can only use the paths that were
unveiled before the exec syscall. This in turn, allows us to use the
unveil syscall with a special utility to sandbox other userland programs
in terms of what is visible to them on the filesystem, and is usable on
both programs that use or don't use the unveil syscall in their code.
2022-11-26 12:42:15 -07:00
Liav A
5e062414c1 Kernel: Add support for jails
Our implementation for Jails resembles much of how FreeBSD jails are
working - it's essentially only a matter of using a RefPtr in the
Process class to a Jail object. Then, when we iterate over all processes
in various cases, we could ensure if either the current process is in
jail and therefore should be restricted what is visible in terms of
PID isolation, and also to be able to expose metadata about Jails in
/sys/kernel/jails node (which does not reveal anything to a process
which is in jail).

A lifetime model for the Jail object is currently plain simple - there's
simpy no way to manually delete a Jail object once it was created. Such
feature should be carefully designed to allow safe destruction of a Jail
without the possibility of releasing a process which is in Jail from the
actual jail. Each process which is attached into a Jail cannot leave it
until the end of a Process (i.e. when finalizing a Process). All jails
are kept being referenced in the JailManagement. When a last attached
process is finalized, the Jail is automatically destroyed.
2022-11-05 18:00:58 -06:00
Liav A
965afba320 Kernel/FileSystem: Add a few missing includes
In preparation to future commits, we need to ensure that
OpenFileDescription.h doesn't include the VirtualFileSystem.h file to
avoid include loops.
2022-10-22 16:57:52 -04:00
Andreas Kling
cf16b2c8e6 Kernel: Wrap process address spaces in SpinlockProtected
This forces anyone who wants to look into and/or manipulate an address
space to lock it. And this replaces the previous, more flimsy, manual
spinlock use.

Note that pointers *into* the address space are not safe to use after
you unlock the space. We've got many issues like this, and we'll have
to track those down as wlel.
2022-08-24 14:57:51 +02:00
Anthony Iacono
f86b671de2 Kernel: Use Process::credentials() and remove user ID/group ID helpers
Move away from using the group ID/user ID helpers in the process to
allow for us to take advantage of the immutable credentials instead.
2022-08-22 12:46:32 +02:00
Andreas Kling
c3351d4b9f Kernel: Make VirtualFileSystem functions take credentials as input
Instead of getting credentials from Process::current(), we now require
that they be provided as input to the various VFS functions.

This ensures that an atomic set of credentials is used throughout an
entire VFS operation.
2022-08-21 16:02:24 +02:00
Andreas Kling
8ed06ad814 Kernel: Guard Process "protected data" with a spinlock
This ensures that both mutable and immutable access to the protected
data of a process is serialized.

Note that there may still be multiple TOCTOU issues around this, as we
have a bunch of convenience accessors that make it easy to introduce
them. We'll need to audit those as well.
2022-08-21 12:25:14 +02:00
Andreas Kling
728c3fbd14 Kernel: Use RefPtr instead of LockRefPtr for Custody
By protecting all the RefPtr<Custody> objects that may be accessed from
multiple threads at the same time (with spinlocks), we remove the need
for using LockRefPtr<Custody> (which is basically a RefPtr with a
built-in spinlock.)
2022-08-21 12:25:14 +02:00
Andreas Kling
122d7d9533 Kernel: Add Credentials to hold a set of user and group IDs
This patch adds a new object to hold a Process's user credentials:

- UID, EUID, SUID
- GID, EGID, SGID, extra GIDs

Credentials are immutable and child processes initially inherit the
Credentials object from their parent.

Whenever a process changes one or more of its user/group IDs, a new
Credentials object is constructed.

Any code that wants to inspect and act on a set of credentials can now
do so without worrying about data races.
2022-08-20 18:32:50 +02:00
Andreas Kling
11eee67b85 Kernel: Make self-contained locking smart pointers their own classes
Until now, our kernel has reimplemented a number of AK classes to
provide automatic internal locking:

- RefPtr
- NonnullRefPtr
- WeakPtr
- Weakable

This patch renames the Kernel classes so that they can coexist with
the original AK classes:

- RefPtr => LockRefPtr
- NonnullRefPtr => NonnullLockRefPtr
- WeakPtr => LockWeakPtr
- Weakable => LockWeakable

The goal here is to eventually get rid of the Lock* classes in favor of
using external locking.
2022-08-20 17:20:43 +02:00
Undefine
97cc33ca47 Everywhere: Make the codebase more architecture aware 2022-07-27 21:46:42 +00:00
Hendiadyoin1
ad904cdcab Kernel: Use find_last_split_view to get the executable name in do_exec 2022-07-15 12:42:43 +02:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
Tim Schumacher
add4dd3589 Kernel: Do a POSIX-correct signal handler reset on exec 2022-07-05 20:58:38 +03:00
Luke Wilde
1682b0b6d8 Kernel: Remove big lock from sys$set_coredump_metadata
The only requirement for this syscall is to make
Process::m_coredump_properties SpinlockProtected.
2022-04-09 21:51:16 +02:00
Andreas Kling
9250ac0c24 Kernel: Randomize non-specific VM allocations done by sys$execve()
Stuff like TLS regions, main thread stacks, etc. All deserve to be
randomized unless the ELF requires specific placement. :^)
2022-04-04 00:42:18 +02:00
Andreas Kling
858b196c59 Kernel: Unbreak ASLR in the new RegionTree world
Functions that allocate and/or place a Region now take a parameter
that tells it whether to randomize unspecified addresses.
2022-04-03 21:51:58 +02:00
Andreas Kling
07f3d09c55 Kernel: Make VM allocation atomic for userspace regions
This patch move AddressSpace (the per-process memory manager) to using
the new atomic "place" APIs in RegionTree as well, just like we did for
MemoryManager in the previous commit.

This required updating quite a few places where VM allocation and
actually committing a Region object to the AddressSpace were separated
by other code.

All you have to do now is call into AddressSpace once and it'll take
care of everything for you.
2022-04-03 21:51:58 +02:00
Idan Horowitz
086969277e Everywhere: Run clang-format 2022-04-01 21:24:45 +01:00
Andreas Kling
580d89f093 Kernel: Put Process unveil state in a SpinlockProtected container
This makes path resolution safe to perform without holding the big lock.
2022-03-08 00:19:49 +01:00
Idan Horowitz
011bd06053 Kernel: Set CS selector when initializing thread context on x86_64
These are not technically required, since the Thread constructor
already sets these, but they are set on i686, so let's try and keep
consistent behaviour between the different archs.
2022-02-27 00:38:00 +02:00
Brian Gianforcaro
70f3fa2dd2 Kernel: Set new process name in do_exec before waiting for the tracer
While investigating why gdb is failing when it calls `PT_CONTINUE`
against Serenity I noticed that the names of the programs in the
System Monitor didn't make sense. They were seemingly stale.

After inspecting the kernel code, it became apparent that the sequence
occurs as follows:

    1. Debugger calls `fork()`
    2. The forked child calls `PT_TRACE_ME`
    3. The `PT_TRACE_ME` instructs the forked process to block in the
       kernel waiting for a signal from the tracer on the next call
       to `execve(..)`.
    4. Debugger waits for forked child to spawn and stop, and then it
       calls `PT_ATTACH` followed by `PT_CONTINUE` on the child.
    5. Currently the `PT_CONTINUE` fails because of some other yet to
       be found bug.
    6. The process name is set immediately AFTER we are woken up by
       the `PT_CONTINUE` which never happens in the case I'm debugging.

This chain of events leaves the process suspended, with the name  of
the original (forked) process instead of the name we inherit from
the `execve(..)` call.

To avoid such confusion in the future, we set the new name before we
block waiting for the tracer.
2022-02-19 18:04:32 -08:00
Ali Mohammad Pur
a1cb2c371a AK+Kernel: OOM-harden most parts of Trie
The only part of Unveil that can't handle OOM gracefully is the
String::formatted() use in the node metadata.
2022-02-15 18:03:02 +02:00
Idan Horowitz
c8ab7bde3b Kernel: Use try_make_weak_ptr() instead of make_weak_ptr() 2022-02-13 23:02:57 +01:00
Idan Horowitz
d6ea6c39a7 AK+Kernel: Rename try_make_weak_ptr to make_weak_ptr_if_nonnull
This matches the likes of the adopt_{own, ref}_if_nonnull family and
also frees up the name to allow us to eventually add OOM-fallible
versions of these functions.
2022-02-13 23:02:57 +01:00
Andrew Kaster
b4a7d148b1 Kernel: Expose maximum argument limit in sysconf
Move the definitions for maximum argument and environment size to
Process.h from execve.cpp. This allows sysconf(_SC_ARG_MAX) to return
the actual argument maximum of 128 KiB to userspace.
2022-02-13 22:06:54 +02:00
Lenny Maiorani
c6acf64558 Kernel: Change static constexpr variables to constexpr where possible
Function-local `static constexpr` variables can be `constexpr`. This
can reduce memory consumption, binary size, and offer additional
compiler optimizations.

These changes result in a stripped x86_64 kernel binary size reduction
of 592 bytes.
2022-02-09 21:04:51 +00:00
Andreas Kling
3845c90e08 Kernel: Remove unnecessary includes from Thread.h
...and deal with the fallout by adding missing includes everywhere.
2022-01-30 16:21:59 +01:00
Idan Horowitz
e28af4a2fc Kernel: Stop using HashMap in Mutex
This commit removes the usage of HashMap in Mutex, thereby making Mutex
be allocation-free.

In order to achieve this several simplifications were made to Mutex,
removing unused code-paths and extra VERIFYs:
 * We no longer support 'upgrading' a shared lock holder to an
   exclusive holder when it is the only shared holder and it did not
   unlock the lock before relocking it as exclusive. NOTE: Unlike the
   rest of these changes, this scenario is not VERIFY-able in an
   allocation-free way, as a result the new LOCK_SHARED_UPGRADE_DEBUG
   debug flag was added, this flag lets Mutex allocate in order to
   detect such cases when debugging a deadlock.
 * We no longer support checking if a Mutex is locked by the current
   thread when the Mutex was not locked exclusively, the shared version
   of this check was not used anywhere.
 * We no longer support force unlocking/relocking a Mutex if the Mutex
   was not locked exclusively, the shared version of these functions
   was not used anywhere.
2022-01-29 16:45:39 +01:00
Andreas Kling
b56646e293 Kernel: Switch process file descriptor table from spinlock to mutex
There's no reason for this to use a spinlock. Instead, let's allow
threads to block if someone else is using the descriptor table.
2022-01-29 02:17:09 +01:00
Andreas Kling
8ebec2938c Kernel: Convert process file descriptor table to a SpinlockProtected
Instead of manually locking in the various member functions of
Process::OpenFileDescriptions, simply wrap it in a SpinlockProtected.
2022-01-29 02:17:06 +01:00
Andreas Kling
31c1094577 Kernel: Don't mess with thread state in Process::do_exec()
We were marking the execing thread as Runnable near the end of
Process::do_exec().

This was necessary for exec in processes that had never been scheduled
yet, which is a specific edge case that only applies to the very first
userspace process (normally SystemServer). At this point, such threads
are in the Invalid state.

In the common case (normal userspace-initiated exec), making the current
thread Runnable meant that we switched away from its current state:
Running. As the thread is indeed running, that's a bogus change!
This created a short time window in which the thread state was bogus,
and any attempt to block the thread would panic the kernel (due to a
bogus thread state in Thread::block() leading to VERIFY_NOT_REACHED().)

Fix this by not touching the thread state in Process::do_exec()
and instead make the first userspace thread Runnable directly after
calling Process::exec() on it in try_create_userspace_process().

It's unfortunate that exec() can be called both on the current thread,
and on a new thread that has never been scheduled. It would be good to
not have the latter edge case, but fixing that will require larger
architectural changes outside the scope of this fix.
2022-01-27 11:18:25 +01:00
Brian Gianforcaro
e954b4bdd4 Kernel: Return error from sys$execve() when called with zero arguments
There are many assumptions in the stack that argc is not zero, and
argv[0] points to a valid string. The recent pwnkit exploit on Linux
was able to exploit this assumption in the `pkexec` utility
(a SUID-root binary) to escalate from any user to root.

By convention `execve(..)` should always be called with at least one
valid argument, so lets enforce that semantic to harden the system
against vulnerabilities like pwnkit.

Reference: https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
2022-01-26 13:05:59 +01:00
Idan Horowitz
d1433c35b0 Kernel: Handle OOM failures in find_shebang_interpreter_for_executable 2022-01-26 02:37:03 +02:00
Idan Horowitz
8cf0e4a5e4 Kernel: Eliminate allocations from generate_auxiliary_vector 2022-01-26 02:37:03 +02:00
Jelle Raaijmakers
df73e8b46b Kernel: Allow program headers to align on multiples of PAGE_SIZE
These checks in `sys$execve` could trip up the system whenever you try
to execute an `.so` file. For example, double-clicking `libwasm.so` in
Terminal crashes the kernel.

This changes the program header alignment checks to reflect the same
checks in LibELF, and passes the requested alignment on to
`::try_allocate_range()`.
2022-01-23 00:11:56 +02:00