Commit graph

7390 commits

Author SHA1 Message Date
Andreas Kling
a3b2b20782 Kernel: Remove global MM lock in favor of SpinlockProtected
Globally shared MemoryManager state is now kept in a GlobalData struct
and wrapped in SpinlockProtected.

A small set of members are left outside the GlobalData struct as they
are only set during boot initialization, and then remain constant.
This allows us to access those members without taking any locks.
2022-08-26 01:04:51 +02:00
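A minimal sketch of the pattern described above, assuming a SpinlockProtected-style wrapper whose value is only reachable through a with() callback. Spinlock, SpinlockProtected, GlobalData and MemoryManagerSketch here are simplified, hypothetical stand-ins written with the standard library, not the kernel's actual classes. Boot-time constants such as the page total live outside the protected struct and can be read without locking.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>

// Minimal busy-waiting spinlock (hypothetical stand-in for the kernel's Spinlock).
class Spinlock {
public:
    void lock()
    {
        while (m_locked.exchange(true, std::memory_order_acquire)) {
            // Spin until the current holder calls unlock().
        }
    }
    void unlock() { m_locked.store(false, std::memory_order_release); }

private:
    std::atomic<bool> m_locked { false };
};

// The wrapped value is only reachable through with(), which holds the
// lock for exactly as long as the callback runs.
template<typename T>
class SpinlockProtected {
public:
    template<typename... Args>
    explicit SpinlockProtected(Args&&... args)
        : m_value(std::forward<Args>(args)...)
    {
    }

    template<typename Callback>
    decltype(auto) with(Callback callback)
    {
        Guard guard(m_lock);
        return callback(m_value);
    }

private:
    struct Guard {
        explicit Guard(Spinlock& spinlock)
            : m_spinlock(spinlock)
        {
            m_spinlock.lock();
        }
        ~Guard() { m_spinlock.unlock(); }
        Spinlock& m_spinlock;
    };

    Spinlock m_lock;
    T m_value;
};

// Hypothetical subset of globally shared MM state.
struct GlobalData {
    std::size_t free_pages { 0 };
    std::size_t used_pages { 0 };
};

class MemoryManagerSketch {
public:
    explicit MemoryManagerSketch(std::size_t total_pages)
        : m_total_pages(total_pages)
        , m_global_data(GlobalData { total_pages, 0 })
    {
    }

    // Set once during boot and never changed: safe to read without a lock.
    std::size_t total_pages() const { return m_total_pages; }

    bool allocate_page()
    {
        return m_global_data.with([](GlobalData& data) {
            if (data.free_pages == 0)
                return false;
            --data.free_pages;
            ++data.used_pages;
            return true;
        });
    }

    std::size_t used_pages()
    {
        return m_global_data.with([](GlobalData& data) { return data.used_pages; });
    }

private:
    std::size_t const m_total_pages;
    SpinlockProtected<GlobalData> m_global_data;
};
```

The design point is that forgetting to lock becomes impossible rather than merely discouraged: there is no accessor that hands out GlobalData without going through with().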
Andreas Kling
2c72d495a3 Kernel: Use RefPtr instead of LockRefPtr for PhysicalPage
I believe this to be safe, as the main thing that LockRefPtr provides
over RefPtr is safe copying from a shared LockRefPtr instance. I've
inspected the uses of RefPtr<PhysicalPage> and it seems they're all
guarded by external locking. Some of it is less obvious, but this is
an area where we're making continuous headway.
2022-08-24 18:35:41 +02:00
Andreas Kling
5a804b9a1d Kernel: Make PhysicalPage::ref() use relaxed memory order
When incrementing a reference count, it should be sufficient to use
relaxed ordering. Note that unref() still uses acquire-release.
2022-08-24 18:35:41 +02:00
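The ordering choice can be illustrated with std::atomic. This is a hypothetical PageSketch class, not the kernel's PhysicalPage. Incrementing is done by a caller that already holds a reference, so only atomicity is needed. Decrementing uses acquire-release so that whichever thread drops the last reference sees every other thread's writes before the object is freed.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical intrusively ref-counted object, illustrating memory orders only.
class PageSketch {
public:
    void ref()
    {
        // The caller already owns a reference, so the object cannot be freed
        // concurrently; relaxed ordering is enough for the increment.
        m_ref_count.fetch_add(1, std::memory_order_relaxed);
    }

    // Returns true when the last reference was dropped (the caller would free).
    bool unref()
    {
        // Release publishes this thread's writes to the object; acquire makes
        // other threads' writes visible to the thread that destroys it.
        return m_ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }

    unsigned ref_count() const { return m_ref_count.load(std::memory_order_relaxed); }

private:
    std::atomic<unsigned> m_ref_count { 1 };
};
```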
Andreas Kling
0dd88fd836 Kernel: Remove unnecessary forward declaration of s_mm_lock 2022-08-24 14:57:51 +02:00
Andreas Kling
ac3ea277aa Kernel: Don't take MM lock in ~PageDirectory()
We don't need the MM lock to unregister a PageDirectory from the CR3
map. This is already protected by the CR3 map's own lock.
2022-08-24 14:57:51 +02:00
Andreas Kling
5beed613ca Kernel: Don't take MM lock in MemoryManager::dump_kernel_regions()
We have to hold the region tree lock while dumping its regions anyway,
and taking the MM lock here was unnecessary.
2022-08-24 14:57:51 +02:00
Andreas Kling
05156cac94 Kernel: Don't take MM lock in MemoryManager::enter_address_space()
We're not accessing any of the MM members here. Also remove some
redundant code to update CR3, since it calls activate_page_directory()
which does exactly the same thing.
2022-08-24 14:57:51 +02:00
Andreas Kling
2607a6a4bd Kernel: Update comment about what the MM lock protects 2022-08-24 14:57:51 +02:00
Andreas Kling
da24a937f5 Kernel: Don't wrap AddressSpace's RegionTree in SpinlockProtected
Now that AddressSpace itself is always SpinlockProtected, we don't
need to also wrap the RegionTree. Whoever has the AddressSpace locked
is free to poke around its tree.
2022-08-24 14:57:51 +02:00
Andreas Kling
d3e8eb5918 Kernel: Make file-backed memory regions remember description permissions
This allows sys$mprotect() to honor the original readable & writable
flags of the open file description as they were at the point we did the
original sys$mmap().

IIUC, this is what Dr. POSIX wants us to do:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/mprotect.html

Also, remove the bogus and racy "W^X" checking we did against mappings
based on their current inode metadata. If we want to do this, we can do
it properly. For now, it was not only racy, but also did blocking I/O
while holding a spinlock.
2022-08-24 14:57:51 +02:00
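A minimal sketch of the idea, assuming a hypothetical FileBackedRegion that snapshots the description's readable/writable flags at mmap time, and hypothetical prot_* constants in place of <sys/mman.h>. The checks loosely follow the EACCES cases on the linked POSIX page (read access required; write access required for PROT_WRITE on a shared mapping); the exact rules the kernel applies may differ.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical protection bits, so the sketch stays self-contained.
constexpr int prot_read = 0x1;
constexpr int prot_write = 0x2;
constexpr int prot_exec = 0x4;

struct FileBackedRegion {
    // Snapshot of the open file description's mode, taken at mmap time.
    bool description_was_readable { false };
    bool description_was_writable { false };
    bool shared { false };
    int prot { 0 };
};

// Returns 0 on success or a negative errno, in the usual kernel style.
int mprotect_region(FileBackedRegion& region, int new_prot)
{
    if ((new_prot & prot_read) && !region.description_was_readable)
        return -EACCES;
    // Writes to a shared mapping reach the file, so the description
    // must have been opened for writing.
    if ((new_prot & prot_write) && region.shared && !region.description_was_writable)
        return -EACCES;
    region.prot = new_prot;
    return 0;
}
```

Because the flags are captured once, a later chmod of the inode cannot change the outcome, and no inode metadata lookup (with its blocking I/O) is needed while holding a spinlock.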
Andreas Kling
30861daa93 Kernel: Simplify the File memory-mapping API
Before this change, we had File::mmap() which did all the work of
setting up a VMObject, and then creating a Region in the current
process's address space.

This patch simplifies the interface by removing the region part.
Files now only have to return a suitable VMObject from
vmobject_for_mmap(), and then sys$mmap() itself will take care of
actually mapping it into the address space.

This fixes an issue where we'd try to block on I/O (for inode metadata
lookup) while holding the address space spinlock. It also reduces time
spent holding the address space lock.
2022-08-24 14:57:51 +02:00
Andreas Kling
cf16b2c8e6 Kernel: Wrap process address spaces in SpinlockProtected
This forces anyone who wants to look into and/or manipulate an address
space to lock it. And this replaces the previous, more flimsy, manual
spinlock use.

Note that pointers *into* the address space are not safe to use after
you unlock the space. We've got many issues like this, and we'll have
to track those down as well.
2022-08-24 14:57:51 +02:00
Andreas Kling
d6ef18f587 Kernel: Don't hog the MM lock while unmapping regions
We were holding the MM lock across all of the region unmapping code.
This was previously necessary since the quickmaps used during unmapping
required holding the MM lock.

Now that it's no longer necessary, we can leave the MM lock alone here.
2022-08-24 14:57:51 +02:00
Andreas Kling
dc9d2c1b10 Kernel: Wrap RegionTree objects in SpinlockProtected
This makes locking them much more straightforward, and we can remove
a bunch of confusing use of AddressSpace::m_lock. That lock will also
be converted to use of SpinlockProtected in a subsequent patch.
2022-08-24 14:57:51 +02:00
James Bellamy
9c1ee8cbd1 Kernel: Remove big lock from sys$socket
With the implementation of the credentials object the socket syscall no
longer needs the big lock.
2022-08-23 20:29:50 +02:00
Timon Kruiper
d62bd3c635 Kernel/aarch64: Properly initialize T0SZ and T1SZ fields in TCR_EL1
By default these two fields were zero, so whether they would internally
be fixed up to a valid value was implementation-defined. The ARM
processor in the Raspberry Pi (and QEMU 6.x) would fix up these values,
whereas QEMU 7.x no longer does, and generates a translation fault instead.

For more context see the relevant QEMU issue:
 - https://gitlab.com/qemu-project/qemu/-/issues/1157

Fixes #14856
2022-08-23 09:23:27 -04:00
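For reference, T0SZ occupies bits [5:0] and T1SZ bits [21:16] of TCR_EL1, and each encodes a region size of 2^(64 - TxSZ) bytes. The sketch below only computes a register value; the 48-bit virtual address size and the helper names are assumptions for illustration, and writing the result to TCR_EL1 would still require an msr instruction.

```cpp
#include <cassert>
#include <cstdint>

// TCR_EL1 field positions (ARMv8-A): T0SZ is bits [5:0], T1SZ is bits [21:16].
constexpr std::uint64_t tcr_t0sz_shift = 0;
constexpr std::uint64_t tcr_t1sz_shift = 16;
constexpr std::uint64_t tcr_txsz_mask = 0x3f;

// TxSZ is a size offset: the translated region covers 2^(64 - TxSZ) bytes.
constexpr std::uint64_t txsz_for_va_bits(unsigned va_bits) { return 64 - va_bits; }

// Hypothetical helper: set both fields explicitly instead of leaving them zero.
constexpr std::uint64_t set_tcr_region_sizes(std::uint64_t tcr, unsigned va_bits)
{
    std::uint64_t const txsz = txsz_for_va_bits(va_bits) & tcr_txsz_mask;
    tcr &= ~((tcr_txsz_mask << tcr_t0sz_shift) | (tcr_txsz_mask << tcr_t1sz_shift));
    return tcr | (txsz << tcr_t0sz_shift) | (txsz << tcr_t1sz_shift);
}
```

With 48-bit address spaces both fields become 16, giving TCR bits 0x100010 for these two fields.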
Samuel Bowman
91574ed677 Kernel: Fix boot profiling
Boot profiling was previously broken due to init_stage2() passing the
event mask to sys$profiling_enable() via kernel pointer, but a user
pointer is expected.

To fix this, I added Process::profiling_enable(), which takes a u64
rather than a Userspace<u64 const*>, as an alternative to
Process::sys$profiling_enable(). It's a bit of a hack, but it works.
2022-08-23 11:48:50 +02:00
Anthony Iacono
ec3d8a7a18 Kernel: Remove unused Process::in_group() 2022-08-23 01:01:48 +02:00
Andreas Kling
434d77cd43 Kernel/ProcFS: Silently ignore attempts to update ProcFS timestamps
We have to override Inode::update_timestamps() for ProcFS inodes,
otherwise we'll get the default behavior of erroring with ENOTIMPL.
2022-08-23 01:00:40 +02:00
Andreas Kling
5307e1bf01 Kernel/SysFS: Silently ignore attempts to update SysFS timestamps
We have to override Inode::update_timestamps() for SysFS inodes,
otherwise we'll get the default behavior of erroring with ENOTIMPL.
2022-08-23 00:55:41 +02:00
Andreas Kling
4c081e0479 Kernel/x86: Protect the CR3->PD map with a spinlock
This can be accessed from multiple CPUs at the same time, so relying on
the interrupt flag is clearly insufficient.
2022-08-22 17:56:03 +02:00
Andreas Kling
6cd3695761 Kernel: Stop taking MM lock while using regular quickmaps
You're still required to disable interrupts though, as the mappings are
per-CPU. This exposed the fact that our CR3 lookup map is insufficiently
protected (but we'll address that in a separate commit.)
2022-08-22 17:56:03 +02:00
Andreas Kling
c8375c51ff Kernel: Stop taking MM lock while using PD/PT quickmaps
This is no longer required as these quickmaps are now per-CPU. :^)
2022-08-22 17:56:03 +02:00
Andreas Kling
a838fdfd88 Kernel: Make the page table quickmaps per-CPU
While the "regular" quickmap (used to temporarily map a physical page
at a known address for quick access) has been per-CPU for a while,
we also have the PD (page directory) and PT (page table) quickmaps
used by the memory management code to edit page tables. These have been
global, which meant that SMP systems had to keep fighting over them.

This patch makes *all* quickmaps per-CPU. We reserve virtual addresses
for up to 64 CPUs worth of quickmaps for now.

Note that all quickmaps are still protected by the MM lock, and we'll
have to fix that too, before seeing any real throughput improvements.
2022-08-22 17:56:03 +02:00
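One way to lay out per-CPU quickmap slots is a fixed virtual window indexed by CPU number. The base address and the three-pages-per-CPU layout (regular, PD, PT) below are hypothetical; only the 64-CPU limit comes from the commit message. Since each CPU only touches its own slots, two CPUs never contend for the same mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout constants; the real addresses and layout differ.
constexpr std::uint64_t quickmap_region_base = 0x2000'0000;
constexpr std::uint64_t page_size = 4096;
constexpr std::size_t max_cpus = 64;

// Assumed: one page each for the regular, PD and PT quickmaps of every CPU.
enum class QuickmapKind : std::uint64_t { Regular = 0, PageDirectory = 1, PageTable = 2 };
constexpr std::uint64_t quickmap_slots_per_cpu = 3;

// Total virtual address space reserved up front for all CPUs.
constexpr std::uint64_t quickmap_region_size = max_cpus * quickmap_slots_per_cpu * page_size;

std::uint64_t quickmap_address(std::size_t cpu, QuickmapKind kind)
{
    assert(cpu < max_cpus);
    auto const slot = cpu * quickmap_slots_per_cpu + static_cast<std::uint64_t>(kind);
    return quickmap_region_base + slot * page_size;
}
```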
Andreas Kling
930dedfbd8 Kernel: Make sys$utime() and sys$utimensat() not take the big lock 2022-08-22 17:56:03 +02:00
Andreas Kling
280694bb46 Kernel: Update atime/ctime/mtime timestamps atomically
Instead of having three separate APIs (one for each timestamp),
there's now only Inode::update_timestamps() and it takes 3x optional
timestamps. The non-empty timestamps are updated while holding the inode
mutex, and the outside world no longer has to look at intermediate
timestamp states.
2022-08-22 17:56:03 +02:00
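A minimal sketch of a single update entry point taking three optional timestamps, assuming a hypothetical InodeSketch class, a std::mutex standing in for the inode mutex, and timestamps modeled as plain integers. Because all non-empty values are written under one lock, a reader never observes a half-applied update.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <optional>

using Timestamp = std::int64_t; // hypothetical: seconds since the epoch

struct Timestamps {
    Timestamp atime { 0 };
    Timestamp ctime { 0 };
    Timestamp mtime { 0 };
};

class InodeSketch {
public:
    // Only the provided timestamps are updated, all under the same lock.
    void update_timestamps(std::optional<Timestamp> atime, std::optional<Timestamp> ctime, std::optional<Timestamp> mtime)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        if (atime)
            m_timestamps.atime = *atime;
        if (ctime)
            m_timestamps.ctime = *ctime;
        if (mtime)
            m_timestamps.mtime = *mtime;
    }

    // Returns a consistent copy of all three values.
    Timestamps timestamps() const
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        return m_timestamps;
    }

private:
    mutable std::mutex m_mutex;
    Timestamps m_timestamps;
};
```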
Andreas Kling
35b2e9c663 Kernel: Make sys$mknod() not take the big lock 2022-08-22 17:56:03 +02:00
Anthony Iacono
f86b671de2 Kernel: Use Process::credentials() and remove user ID/group ID helpers
Move away from using the group ID/user ID helpers in the process to
allow for us to take advantage of the immutable credentials instead.
2022-08-22 12:46:32 +02:00
Andreas Kling
42435ce5e4 Kernel: Make sys$recvfrom() with MSG_DONTWAIT not so racy
Instead of temporarily changing the open file description's "blocking"
flag while doing a non-waiting recvfrom, we now plumb the currently
wanted blocking behavior all the way through to the underlying socket.
2022-08-21 16:45:42 +02:00
Andreas Kling
8997c6a4d1 Kernel: Make Socket::connect() take credentials as input 2022-08-21 16:35:03 +02:00
Andreas Kling
51318d51a4 Kernel: Make Socket::bind() take credentials as input 2022-08-21 16:33:09 +02:00
Andreas Kling
8d0bd3f225 Kernel: Make LocalSocket do chown/chmod through VFS
This ensures that all the permissions checks are made against the
provided credentials. Previously we were just calling through directly
to the inode setters, which did no security checks!
2022-08-21 16:22:34 +02:00
Andreas Kling
dbe182f1c6 Kernel: Make Inode::resolve_as_link() take credentials as input 2022-08-21 16:17:13 +02:00
Andreas Kling
006f753647 Kernel: Make File::{chown,chmod} take credentials as input
...instead of getting them from Process::current(). :^)
2022-08-21 16:15:29 +02:00
Andreas Kling
c3351d4b9f Kernel: Make VirtualFileSystem functions take credentials as input
Instead of getting credentials from Process::current(), we now require
that they be provided as input to the various VFS functions.

This ensures that an atomic set of credentials is used throughout an
entire VFS operation.
2022-08-21 16:02:24 +02:00
James Bellamy
9744dedb50 Kernel: Use credentials object in Socket set_origin/acceptor 2022-08-21 14:55:01 +02:00
James Bellamy
2686640baf Kernel: Use credentials object in LocalSocket constructor 2022-08-21 14:55:01 +02:00
James Bellamy
386642ffcf Kernel: Use credentials object in VirtualFileSystem
Use credentials object in mknod, create, mkdir, and symlink
2022-08-21 14:55:01 +02:00
James Bellamy
8ef5dbed21 Kernel: Use credentials object in Coredump::try_create_target_file 2022-08-21 14:55:01 +02:00
Andreas Kling
18abba2c4d Kernel: Make sys$getppid() not take the big lock
This only needs to access the process PPID, which is protected by the
"protected data" lock.
2022-08-21 13:29:36 +02:00
Andreas Kling
8ed06ad814 Kernel: Guard Process "protected data" with a spinlock
This ensures that both mutable and immutable access to the protected
data of a process is serialized.

Note that there may still be multiple TOCTOU issues around this, as we
have a bunch of convenience accessors that make it easy to introduce
them. We'll need to audit those as well.
2022-08-21 12:25:14 +02:00
Andreas Kling
728c3fbd14 Kernel: Use RefPtr instead of LockRefPtr for Custody
By protecting all the RefPtr<Custody> objects that may be accessed from
multiple threads at the same time (with spinlocks), we remove the need
for using LockRefPtr<Custody> (which is basically a RefPtr with a
built-in spinlock.)
2022-08-21 12:25:14 +02:00
Liav A
5331d243c6 Kernel/Syscall: Make anon_create to not use Process::allocate_fd method
Instead, allocate the fd while holding the lock on the m_fds struct.
This is a safer way to mutate m_fds, since this syscall doesn't take
the big process lock.
2022-08-21 10:56:48 +01:00
Andreas Kling
619ac65302 Kernel: Get GID from credentials object in sys$setgroups()
I missed one instance of these. Thanks Anthony Iacono for spotting it!
2022-08-20 22:41:49 +02:00
Andreas Kling
9eeee24a39 Kernel+LibC: Enforce a limit on the number of supplementary group IDs
This patch adds the NGROUPS_MAX constant and enforces it in
sys$setgroups() to ensure that no process has more than 32 supplementary
group IDs.

The number doesn't mean anything in particular, just had to pick a
number. Perhaps one day we'll have a reason to change it.
2022-08-20 22:39:56 +02:00
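A minimal sketch of the limit check, assuming a hypothetical setgroups_sketch() helper and a lowercase ngroups_max constant (named that way to avoid clashing with a host NGROUPS_MAX macro). Returning EINVAL mirrors what Linux's setgroups(2) reports for an oversized list; that this kernel uses the same errno is an assumption.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <vector>

using GroupID = unsigned;

constexpr std::size_t ngroups_max = 32; // the limit chosen in the commit above

// Returns 0 on success or a negative errno.
int setgroups_sketch(std::vector<GroupID>& extra_gids, std::size_t count, GroupID const* gids)
{
    if (count > ngroups_max)
        return -EINVAL;
    if (count > 0 && gids == nullptr)
        return -EFAULT;
    extra_gids.assign(gids, gids + count);
    return 0;
}
```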
Andreas Kling
998c1152ef Kernel: Mark syscalls that get/set user/group ID as not needing big lock
Now that these operate on the neatly atomic and immutable Credentials
object, they should no longer require the process big lock for
synchronization. :^)
2022-08-20 18:36:47 +02:00
Andreas Kling
122d7d9533 Kernel: Add Credentials to hold a set of user and group IDs
This patch adds a new object to hold a Process's user credentials:

- UID, EUID, SUID
- GID, EGID, SGID, extra GIDs

Credentials are immutable and child processes initially inherit the
Credentials object from their parent.

Whenever a process changes one or more of its user/group IDs, a new
Credentials object is constructed.

Any code that wants to inspect and act on a set of credentials can now
do so without worrying about data races.
2022-08-20 18:32:50 +02:00
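A minimal sketch of immutable, shared credentials, assuming std::shared_ptr<Credentials const> for sharing and a std::mutex guarding the pointer swap (the kernel would use its own smart pointers and a spinlock). ProcessSketch and set_euid() are hypothetical. A child simply shares its parent's object, and any ID change publishes a new object, so existing snapshots never change underneath their holders.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

using UserID = unsigned;
using GroupID = unsigned;

// Immutable set of IDs, matching the field list in the commit message.
class Credentials {
public:
    Credentials(UserID uid, UserID euid, UserID suid, GroupID gid, GroupID egid, GroupID sgid, std::vector<GroupID> extra_gids)
        : m_uid(uid)
        , m_euid(euid)
        , m_suid(suid)
        , m_gid(gid)
        , m_egid(egid)
        , m_sgid(sgid)
        , m_extra_gids(std::move(extra_gids))
    {
    }

    UserID uid() const { return m_uid; }
    UserID euid() const { return m_euid; }
    UserID suid() const { return m_suid; }
    GroupID gid() const { return m_gid; }
    GroupID egid() const { return m_egid; }
    GroupID sgid() const { return m_sgid; }
    std::vector<GroupID> const& extra_gids() const { return m_extra_gids; }

private:
    UserID const m_uid;
    UserID const m_euid;
    UserID const m_suid;
    GroupID const m_gid;
    GroupID const m_egid;
    GroupID const m_sgid;
    std::vector<GroupID> const m_extra_gids;
};

class ProcessSketch {
public:
    // A child process would be constructed from its parent's credentials().
    explicit ProcessSketch(std::shared_ptr<Credentials const> credentials)
        : m_credentials(std::move(credentials))
    {
    }

    // Callers get a stable snapshot they can inspect without further locking.
    std::shared_ptr<Credentials const> credentials() const
    {
        std::lock_guard<std::mutex> guard(m_lock);
        return m_credentials;
    }

    // Changing any ID constructs and publishes a brand-new Credentials object.
    void set_euid(UserID euid)
    {
        std::lock_guard<std::mutex> guard(m_lock);
        auto const old = m_credentials;
        m_credentials = std::make_shared<Credentials>(old->uid(), euid, old->suid(), old->gid(), old->egid(), old->sgid(), old->extra_gids());
    }

private:
    mutable std::mutex m_lock;
    std::shared_ptr<Credentials const> m_credentials;
};
```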
Andreas Kling
bec314611d Kernel: Move InodeMetadata methods out of line 2022-08-20 17:20:44 +02:00
Andreas Kling
11eee67b85 Kernel: Make self-contained locking smart pointers their own classes
Until now, our kernel has reimplemented a number of AK classes to
provide automatic internal locking:

- RefPtr
- NonnullRefPtr
- WeakPtr
- Weakable

This patch renames the Kernel classes so that they can coexist with
the original AK classes:

- RefPtr => LockRefPtr
- NonnullRefPtr => NonnullLockRefPtr
- WeakPtr => LockWeakPtr
- Weakable => LockWeakable

The goal here is to eventually get rid of the Lock* classes in favor of
using external locking.
2022-08-20 17:20:43 +02:00
Andreas Kling
e475263113 AK+Kernel: Add AK::AtomicRefCounted and use everywhere in the kernel
Instead of having two separate implementations of AK::RefCounted, one
for userspace and one for kernelspace, there is now RefCounted and
AtomicRefCounted.
2022-08-20 17:15:52 +02:00