Instead of getting credentials from Process::current(), we now require
that they be provided as input to the various VFS functions.
This ensures that an atomic set of credentials is used throughout an
entire VFS operation.
This ensures that both mutable and immutable access to the protected
data of a process is serialized.
Note that there may still be multiple TOCTOU issues around this, as we
have a bunch of convenience accessors that make it easy to introduce
them. We'll need to audit those as well.
By protecting all the RefPtr<Custody> objects that may be accessed from
multiple threads at the same time (with spinlocks), we remove the need
for using LockRefPtr<Custody> (which is basically a RefPtr with a
built-in spinlock.)
Instead, allocate when acquiring the lock on m_fds struct, which is
safer to do in terms of safely mutating the m_fds struct, because we
don't use the big process lock in this syscall.
This patch adds the NGROUPS_MAX constant and enforces it in
sys$setgroups() to ensure that no process has more than 32 supplementary
group IDs.
The number doesn't mean anything in particular, just had to pick a
number. Perhaps one day we'll have a reason to change it.
Now that these operate on the neatly atomic and immutable Credentials
object, they should no longer require the process big lock for
synchronization. :^)
This patch adds a new object to hold a Process's user credentials:
- UID, EUID, SUID
- GID, EGID, SGID, extra GIDs
Credentials are immutable and child processes initially inherit the
Credentials object from their parent.
Whenever a process changes one or more of its user/group IDs, a new
Credentials object is constructed.
Any code that wants to inspect and act on a set of credentials can now
do so without worrying about data races.
Until now, our kernel has reimplemented a number of AK classes to
provide automatic internal locking:
- RefPtr
- NonnullRefPtr
- WeakPtr
- Weakable
This patch renames the Kernel classes so that they can coexist with
the original AK classes:
- RefPtr => LockRefPtr
- NonnullRefPtr => NonnullLockRefPtr
- WeakPtr => LockWeakPtr
- Weakable => LockWeakable
The goal here is to eventually get rid of the Lock* classes in favor of
using external locking.
This fixes an issue where a sharing process would map the "lazy
committed page" early and then get stuck with that page even after
it had been replaced in the VMObject by a page fault.
Regressed in 27c1135d30, which made it
happen every time with the backing bitmaps used for WebContent.
Make sure we reject the unveil attempt with EPERM if the veil was locked
by another thread while we were parsing argument (and not holding the
veil state spinlock.)
Thanks Brian for spotting this! :^)
Amendment to #14907.
Path resolution may do blocking I/O so we must not do it while holding
a spinlock. There are tons of problems like this throughout the kernel
and we need to find and fix all of them.
This matches out general macro use, and specifically other verification
macros like VERIFY(), VERIFY_NOT_REACHED(), VERIFY_INTERRUPTS_ENABLED(),
and VERIFY_INTERRUPTS_DISABLED().
Instead of locking it twice, we now frontload all the work that doesn't
touch the fd table, and then only lock it towards the end of the
syscall.
The benefit here is simplicity. The downside is that we do a bit of
unnecessary work in the EMFILE error case, but we don't need to optimize
that case anyway.
If the final copy_to_user() call fails when writing the file descriptors
to the output array, we have to make sure the file descriptors don't
remain in the process file descriptor table. Otherwise they are
basically leaked, as userspace is not aware of them.
This matches the behavior of our sys$socketpair() implementation.
We don't need to explicitly check for EMFILE conditions before doing
anything in sys$pipe(). The fd allocation code will take care of it
for us anyway.
This fixes an issue where failing the fork due to OOM or other error,
we'd end up destroying the Process too early. By the time we got to
WaitBlockerSet::finalize(), it was long gone.
This argument is always set to description.is_blocking(), but
description is also given as a separate argument, so there's no point
to piping it through separately.
While null StringViews are just as bad, these prevent the removal of
StringView(char const*) as that constructor accepts a nullptr.
No functional changes.
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
Until the thread is first set as Runnable at the end of sys$fork, its
state is Invalid, and as a result, the Finalizer which is searching for
Dying threads will never find it if the syscall short-circuits due to
an error condition like OOM. This also meant the parent Process of the
thread would be leaked as well.
The extra argument to fcntl is a pointer in the case of F_GETLK/F_SETLK
and we were pulling out a u32, leading to pointer truncation on x86_64.
Among other things, this fixes Assistant on x86_64 :^)
The previous check for valid how values assumed this field was a bitmap
and that SHUT_RDWR was simply a bitwise or of SHUT_RD and SHUT_WR,
which is not the case.
`sigsuspend` was previously implemented using a poll on an empty set of
file descriptors. However, this broke quite a few assumptions in
`SelectBlocker`, as it verifies at least one file descriptor to be
ready after waking up and as it relies on being notified by the file
descriptor.
A bare-bones `sigsuspend` may also be implemented by relying on any of
the `sigwait` functions, but as `sigsuspend` features several (currently
unimplemented) restrictions on how returns work, it is a syscall on its
own.
When updating the signal mask, there is a small frame where we might set
up the receiving process for handing the signal and therefore remove
that signal from the list of pending signals before SignalBlocker has a
chance to block. In turn, this might cause SignalBlocker to never notice
that the signal arrives and it will never unblock once blocked.
Track the currently handled signal separately and include it when
determining if SignalBlocker should be unblocking.
As with the previous commit, we put a distinction between filesystems
that require a file description and those which don't, but now in a much
more readable mechanism - all initialization properties as well as the
create static method are grouped to create the FileSystemInitializer
structure. Then when we need to initialize an instance, we iterate over
a table of these structures, checking for matching structure and then
validating the given arguments from userspace against the requirements
to ensure we can create a valid instance of the requested filesystem.
We do this by putting a distinction between two types of filesystems -
the first type is backed in RAM, and includes TmpFS, ProcFS, SysFS,
DevPtsFS and DevTmpFS. Because these filesystems are backed in RAM,
trying to mount them doesn't require source open file description.
The second type is filesystems that are backed by a file, therefore the
userspace program has to open them (hence it has a open file description
on them) and provide the appropriate source open file description.
By putting this distinction, we can early check if the user tried to
mount the second type of filesystems without a valid file description,
and fail with EBADF then.
Otherwise, we can proceed to either mount either type of filesystem,
provided that the fs_type is valid.
Implement futimes() in terms of utimensat(). Now, utimensat() strays
from POSIX compliance because it also accepts a combination of a file
descriptor of a regular file and an empty path. utimensat() then uses
this file descriptor instead of the path to update the last access
and/or modification time of a file. That being said, its prior behavior
remains intact.
With the new behavior of utimensat(), `path` must point to a valid
string; given a null pointer instead of an empty string, utimensat()
sets `errno` to `EFAULT` and returns a failure.
Coverage tools like LLVM's source-based coverage or GNU's --coverage
need to be able to write out coverage files from any binary, regardless
of its security posture. Not ignoring these pledges and veils means we
can't get our coverage data out without playing some serious tricks.
However this is pretty terrible for normal exeuction, so only skip these
checks when we explicitly configured userspace for coverage.
This keeps us from accidentally overwriting an already set region name,
for example when we are mapping a file (as, in this case, the file name
is already stored in the region).
This syscall ends up disabling interrupts while changing the time,
and the clock is a global resource anyway, so preventing threads in the
same process from running wouldn't solve anything.
This patch move AddressSpace (the per-process memory manager) to using
the new atomic "place" APIs in RegionTree as well, just like we did for
MemoryManager in the previous commit.
This required updating quite a few places where VM allocation and
actually committing a Region object to the AddressSpace were separated
by other code.
All you have to do now is call into AddressSpace once and it'll take
care of everything for you.
RegionTree holds an IntrusiveRedBlackTree of Region objects and vends a
set of APIs for allocating memory ranges.
It's used by AddressSpace at the moment, and will be used by MM soon.
This patch stops using VirtualRangeAllocator in AddressSpace and instead
looks for holes in the region tree when allocating VM space.
There are many benefits:
- VirtualRangeAllocator is non-intrusive and would call kmalloc/kfree
when used. This new solution is allocation-free. This was a source
of unpleasant MM/kmalloc deadlocks.
- We consolidate authority on what the address space looks like in a
single place. Previously, we had both the range allocator *and* the
region tree both being used to determine if an address was valid.
Now there is only the region tree.
- Deallocation of VM when splitting regions is no longer complicated,
as we don't need to keep two separate trees in sync.
8233da3398 introduced a not-so-subtle bug
where an application with an existing pledge set containing `no_error`
could elevate its pledge set by pledging _anything_, this commit makes
sure that no new promise is accepted.
This makes pledge() ignore promises that would otherwise cause it to
fail with EPERM, which is very useful for allowing programs to run under
a "jail" so to speak, without having them termiate early due to a
failing pledge() call.
The obsolete ttyname and ptsname syscalls are removed.
LibC doesn't rely on these anymore, and it helps simplifying the Kernel
in many places, so it's an overall an improvement.
In addition to that, /proc/PID/tty node is removed too as it is not
needed anymore by userspace to get the attached TTY of a process, as
/dev/tty (which is already a character device) represents that as well.