Joining dead threads is allowed for two main reasons:
- Thread join behavior should not be racy when a thread is joined and
exiting at roughly the same time. This is common behavior when threads
are given a signal to end (meaning they are going to exit ASAP) and
then joined.
- POSIX requires that exited threads are joinable (at least, there is no
language in the specification forbidding it).
The behavior is still well-defined; e.g. it doesn't allow a dead
detached thread to be joined or a thread to be joined more than once.
Because we used a Vector, even if creating a Mount object succeeded, we
were still at risk that appending to the actual mounts Vector could fail
due to an OOM condition. To guard against this, the mount table is now
an IntrusiveList, which guarantees that once allocation of a Mount
object has succeeded, inserting that object into the list will succeed
as well, allowing us to fail early in case of an OOM condition.
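A minimal sketch of the idea, assuming AK's IntrusiveList API (the
member and function names here are illustrative, not the actual kernel
ones):

// The list node is embedded in the Mount object itself, so inserting an
// already-allocated Mount into the list cannot fail.
class Mount {
public:
    IntrusiveListNode<Mount> m_vfs_list_node;
    // ... mount point custody, guest filesystem, flags ...
};
using MountTable = IntrusiveList<&Mount::m_vfs_list_node>;

ErrorOr<void> add_mount(MountTable& table)
{
    // The only fallible step is allocating the Mount object...
    auto* mount = new (nothrow) Mount();
    if (!mount)
        return ENOMEM;
    // ...after that, appending to the intrusive list always succeeds.
    table.append(*mount);
    return {};
}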
From now on, we don't allow jailed processes to open all device nodes
in /dev; jailed processes may only open /dev/full, /dev/zero, /dev/null,
and various TTY and PTY devices (but not the virtual consoles), so we
essentially restrict what applications can do while they are in jail.
The motivation for this type of restriction is to ensure that even if
remote code execution occurs, the damage that can be done is very small.
We don't restrict reading and writing on device nodes that were already
opened, because that restriction doesn't seem useful, especially in the
case where we do want to provide an OpenFileDescription to such a device
but nothing beyond that.
`SysFSComponentRegistry`, `ProcFSComponentRegistry` and
`attach_null_device` "just work" already; let's include them to match
x86_64 as closely as possible.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
This patch adds a way to ask the allocator to skip its internal
scrubbing memset operation. Before this change, kcalloc() would scrub
twice: once internally in kmalloc() and then again in kcalloc().
The same mechanism already existed in LibC malloc, and this patch
brings it over to the kernel heap allocator as well.
This solves one FIXME in kcalloc(). :^)
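A hedged sketch of the mechanism; the enum, macro, and function names
below are illustrative, not necessarily the kernel's real ones:

enum class CallerWillInitializeMemory {
    No,
    Yes,
};

void* kmalloc_impl(size_t size, CallerWillInitializeMemory caller_will_initialize_memory)
{
    void* ptr = allocate_from_heap(size); // placeholder for the real allocation path
    if (ptr && caller_will_initialize_memory == CallerWillInitializeMemory::No)
        memset(ptr, KMALLOC_SCRUB_BYTE, size); // only scrub when the caller won't overwrite it
    return ptr;
}

void* kcalloc(size_t count, size_t size)
{
    size_t total_size;
    if (__builtin_mul_overflow(count, size, &total_size))
        return nullptr;
    // kcalloc() zeroes the memory itself, so the internal scrub would be redundant.
    void* ptr = kmalloc_impl(total_size, CallerWillInitializeMemory::Yes);
    if (ptr)
        memset(ptr, 0, total_size);
    return ptr;
}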
This solves one of the security issues mentioned in issue #15996.
We simply don't allow creating hardlinks to paths that were not unveiled
as writable, to prevent bypassing the restrictions on a path that was
unveiled as non-writable.
Instead, allow userspace to decide on the coredump directory path. By
default, SystemServer sets it to /tmp/coredump, but users can now change
this by writing a new path to the sysfs node at
/sys/kernel/variables/coredump_directory, and can read this node to
check where coredumps are currently generated.
By default, disallow reading of values in that directory. Later on, we
will sparingly enable read access to specific files.
The idea that led to this mechanism was suggested by Jean-Baptiste
Boric (also known as boricj in GitHub), to prevent access to sensitive
information in the SysFS if someone adds a new file in the /sys/kernel
directory.
There's simply no benefit in allowing sandboxed programs to change the
power state of the machine, so disallow writes to the mentioned node to
prevent malicious programs from requesting that.
Previously we tried to determine if `fd` refers to a non-regular file by
doing a stat() operation on the file.
This didn't work out very well since many File subclasses don't
actually implement stat() but instead fall back to failing with EBADF.
This patch fixes the issue by checking for regular files with
File::is_regular_file() instead.
We used size_t, which is a type that is guaranteed to be large enough
to hold an array index, but uintptr_t is designed to hold pointer
values, which is what stack guards are.
To accomplish this, we add another VeilState called LockedInherited.
The idea is to apply exec unveil data, similar to the execpromises of
the pledge syscall, to the exec'ed program during the execve sequence.
When applying the forced unveil data, the veil state is set to locked,
but the special LockedInherited state ensures that if the new program
tries to unveil paths, the request is silently ignored, so the program
continues running without receiving an error, but it can still only use
the paths that were unveiled before the exec syscall. This, in turn,
allows us to use the unveil syscall with a special utility to sandbox
other userland programs in terms of what is visible to them on the
filesystem, regardless of whether those programs call the unveil syscall
in their own code.
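A rough sketch of the state machine, with illustrative names (the real
checks live in the unveil syscall handler):

enum class VeilState {
    None,            // unveil() was never called
    Dropped,         // unveil() was called at least once
    Locked,          // veil is locked; further unveil() calls fail
    LockedInherited, // veil was forced across execve(); unveil() calls are ignored
};

ErrorOr<void> handle_unveil_request(VeilState veil_state /* , path, permissions */)
{
    if (veil_state == VeilState::LockedInherited)
        return {}; // silently ignore; the program keeps the inherited veil
    if (veil_state == VeilState::Locked)
        return EPERM; // a self-locked veil still rejects further unveiling
    // ... apply the requested unveil node ...
    return {};
}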
Because the ".." entry in a directory is a separate inode, if a
directory is renamed to a new location, then we should update this entry
the point to the new parent directory as well.
Co-authored-by: Liav A <liavalb@gmail.com>
Each GenericInterruptHandler now tracks the number of calls that each
CPU has serviced.
This takes care of a FIXME in the /sys/kernel/interrupts generator.
Also, the lsirq command line tool now displays per-CPU call counts.
This patch fixes some include problems on aarch64. aarch64 is still
currently broken but this will get us back to the underlying problem
of FloatExtractor.
We now disallow jail creation from a process within a jail because
there is simply no valid use case to allow it, and we will probably not
enable this behavior (which is considered a bug) again.
Although there was no "real" security issue with this bug, as a process
would still be denied from joining that jail, it did reveal information
about the number of jails that are or were present in the system.
Add support for async transfers by using a separate kernel task to poll
a list of active async transfers on a set time interval, and invoke
their user-provided callback function when they are complete. Also add
support for the interrupt class of transfers, building off of this async
functionality.
Our implementation of Jails resembles how FreeBSD jails work - it's
essentially only a matter of the Process class holding a RefPtr to a
Jail object. Then, when we iterate over all processes in various cases,
we can check whether the current process is in a jail and therefore
should be restricted in terms of PID isolation, and we can also expose
metadata about Jails in the /sys/kernel/jails node (which does not
reveal anything to a process which is in jail).
The lifetime model of the Jail object is currently quite simple - there
is no way to manually delete a Jail object once it was created. Such a
feature should be carefully designed to allow safe destruction of a Jail
without the possibility of releasing a process which is in the Jail from
the actual jail. Each process which is attached to a Jail cannot leave
it until the end of the Process (i.e. when the Process is finalized).
All jails are kept referenced by the JailManagement. When the last
attached process is finalized, the Jail is automatically destroyed.
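A hedged sketch of the core of this design (member and function names
are illustrative, not the actual kernel ones):

class Jail : public RefCounted<Jail> {
    // jail name, jail index, ...
};

struct Process {
    // ...
    RefPtr<Jail> attached_jail; // null when the process is not in any jail
};

// PID-isolation check used when iterating over all processes:
bool is_visible_to(Process const& other, Process const& current)
{
    if (!current.attached_jail)
        return true; // not jailed: everything is visible
    return other.attached_jail == current.attached_jail; // jailed: same-jail only
}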
This adds try_* methods to AK::SinglyLinkedList and
AK::SinglyLinkedListWithCount and updates the network stack to use
those to gracefully handle allocation failures.
Refs #6369.
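A sketch of how a call site changes, assuming the try_ variants return
ErrorOr<void>:

#include <AK/Error.h>
#include <AK/SinglyLinkedList.h>

ErrorOr<void> enqueue_value(SinglyLinkedList<int>& queue, int value)
{
    // Previously: queue.append(value); // aborts on allocation failure
    // Now the failure propagates to the caller instead:
    TRY(queue.try_append(value));
    return {};
}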
This decreases the number of bytes necessary to capture the variables
for this lambda. The next step will be to remove dynamic allocations
from AK::Function which depends on this change to keep the size of
AK::Function objects reasonable.
This is intended to reflect the POSIX sched_setparam API, which has
some cryptic language
(https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_08_04_01
) that, as far as I can tell, implies we should prioritize process
scheduling policies over thread scheduling policies. Technically this
means that a process must have its own set of policies that is
considered first by the scheduler, but it seems unlikely anyone relies
on this behavior in practice. So we just override all threads' policies,
making them (at least before calls to pthread_setschedparam) behave
exactly as specified on the surface.
The priority range was changed several years ago, but the
userland-reported limits were just forgotten :skeleyak:. Move the thread
priority constants into an API header so that userland can use them
properly.
The syscalls are renamed as they no longer reflect the exact POSIX
functionality. They can now handle setting/getting scheduler parameters
for both threads and processes.
The kernel image grew so much that it wasn't possible to jump to the C++
symbol anymore, since this generated a 'relocation truncated' error when
linking.
Let's put the power_state global node into the /sys/kernel directory,
because that directory represents all global nodes and variables related
to the kernel. It's also a mutable node, which is more at home in that
directory, given that all other files in the /sys/firmware directory are
just firmware blobs and are not mutable at all.
Now that all global nodes are located in the /sys/kernel directory, we
can safely drop the global nodes in /proc, which includes both /proc/net
and /proc/sys directories as well.
This in fact leaves the ProcFS with only subdirectories for processes
and the "self" symbolic link, which reflects the currently running
process.
The ProcFS is an utter mess currently, so let's start moving out things
that are not related to per-process information. To ensure it's done in
a sane manner, we start by duplicating all /proc global nodes into the
/sys/kernel/ directory; then we will move Userland to use the new
directory so the old nodes can be removed from the /proc directory.
If a program needs to execute a dynamic executable, then it should
unveil /usr/lib/Loader.so by itself and not rely on the Kernel to allow
using this binary in disregard of the unveil promises made by the
running parent program.
Previously we didn't send the SIGPIPE signal to processes when
sendto()/sendmsg()/etc. returned EPIPE. And now we do.
This also adds support for MSG_NOSIGNAL to suppress the signal.
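A minimal model of the decision (not the actual kernel code):

#include <cerrno>
#include <sys/socket.h>

// Deliver SIGPIPE for EPIPE unless the caller opted out with MSG_NOSIGNAL.
bool should_deliver_sigpipe(int send_errno, int msg_flags)
{
    return send_errno == EPIPE && !(msg_flags & MSG_NOSIGNAL);
}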
This commit reaches the goal of "safely discarding" a filesystem by
doing the following:
1. Stop using the s_file_system_map HashMap, as it was an unsafe way to
access FileSystem pointers. Instead, make sure to register all
FileSystems at the VFS layer, with an IntrusiveList, to avoid problems
related to OOM conditions.
2. Make sure to cleanly remove the DiskCache object from a BlockBased
filesystem, so the destructor of such an object will not need to do so
at destruction time.
3. For ext2 filesystems, don't cache the root inode in the m_inode_cache
HashMap. The reason for this is that when unmounting an ext2 filesystem,
we check the cache to see if there's a reference to a cached inode and,
if that's the case, we fail with EBUSY. If we kept m_root_inode
referenced in the m_inode_cache map as well, we would have 2 references
to that object, which would lead to failing with EBUSY. Also, it's much
simpler to always ask for a root inode and get it immediately from
m_root_inode, instead of looking up that inode in the cache.
The idea is to enable mounting FileSystem objects across multiple mounts
in contrast to what happened until now - each mount has its own unique
FileSystem object being attached to it.
Consider the situation of mounting a block device at 2 different mount
points in the system; there were a couple of critical flaws due to how
the previous "design" worked:
1. BlockBasedFileSystem(s) that pointed to the same actual device had
separate DiskCache objects attached to them. Because the instances were
not synchronized by any means, corruption of the filesystem was most
likely achievable by a simple cache flush of either of the instances.
2. For superblock-oriented filesystems (such as the ext2 filesystem),
lack of synchronization between both instances can lead to severe
corruption in the superblock, which could render the entire filesystem
unusable.
3. Flags of a specific filesystem implementation (for example, with xfs
on Linux, one can instruct to mount it with the discard option) must be
honored across multiple mounts, to ensure expected behavior against a
particular filesystem.
This patch puts the foundations in place to start fixing the issues
mentioned above. However, there are still major issues to solve, so this
is only a start.
We now have a separately allocated structure for the bookkeeping
information in the QueueHead and TransferDescriptor UHCI structures.
This way, we can support 64-bit pointers in UHCI, fixing a problem where
32-bit pointers would truncate the upper 32-bits of the (virtual)
address of the descriptor, causing a crash.
Co-authored-by: b14ckcat <b14ckcat@protonmail.com>
This flag doesn't conform to any POSIX standard, nor is it found in any
other OS out there. The idea behind this mount flag is to ensure that only
non-regular files will be placed in a filesystem, which includes device
nodes, symbolic links, directories, FIFOs and sockets. Currently, the
only valid case for using this mount flag is for TmpFS instances, where
we want to mount a TmpFS but disallow any kind of regular file and only
allow other types of files on the filesystem.
Although this code worked quite well, it is considered a duplication of
the TmpFS code, which is better tested and works quite well for a
variety of cases. The only valid reason to keep this filesystem was that
it enforces that no regular files can be created in the filesystem at
all. Later on, we will re-introduce this feature in a sane manner.
Therefore, this can be safely removed now that SystemServer no longer
uses this filesystem type.
Using absolute paths is considered an abstraction-layer violation
between the kernel and userspace, so let's not hardcode the path to
children PID directories; instead, we can use relative links to them.
Having this function return `nullptr` explicitly triggers the compiler's
inbuilt checker, as it knows the destination is null. Having this as a
static (scoped) variable for now circumvents this problem.
Decompose the current monolithic USBD Pipe interface into several
subclasses, one for each pair of endpoint type & direction. This is to
make it more clear what data and functionality belongs to which Pipe
type, and prevent nonsensical things like trying to execute a control
transfer on a non-control pipe. This is important, because the Pipe
class is the interface by which USB device drivers will interact with
the HCD, so the clearer and more explicit this interface is the better.
Allocate DMA buffer pages for use within the USBD Pipe class, and allow
for the user to specify the size of this buffer, rounding up to the
next page boundary.
This sets up the RPi::Timer to trigger an interrupt every 4ms using one
of the comparators. The actual time is calculated by looking at the main
counter of the RPi::Timer using the Timer::update_time function.
A stub for Scheduler::timer_tick is also added, since the TimeManagement
code now calls the function.
Instead of just having a giant KBuffer that is not easily resizable, we
use multiple AnonymousVMObjects stored in a Vector.
The idea is to avoid doing a giant memcpy or memset each time we need to
allocate or de-allocate memory for TmpFS inodes; instead, we allocate
only the desired block range when trying to write to it. Therefore, it
is also possible to have data holes in the inode content when a write
skips over one or more whole data blocks, making memory usage much more
efficient.
To ensure we don't run out of virtual memory range, we don't allocate a
Region in advance for each TmpFSInode, but instead try to allocate a
Region per IO operation, and then use that Region to map the VMObjects
in the IO loop.
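A simplified standalone model of the sparse storage idea (the kernel
uses AnonymousVMObjects rather than heap arrays, and the block size here
is arbitrary):

#include <array>
#include <cstdint>
#include <memory>
#include <vector>

constexpr size_t block_size = 4096;
using Block = std::array<uint8_t, block_size>;

struct SparseInodeContent {
    std::vector<std::unique_ptr<Block>> blocks; // nullptr == data hole (reads as zeroes)

    void write_byte(size_t offset, uint8_t value)
    {
        size_t index = offset / block_size;
        if (index >= blocks.size())
            blocks.resize(index + 1);                  // extend with holes, no data allocated yet
        if (!blocks[index])
            blocks[index] = std::make_unique<Block>(); // allocate only the touched block
        (*blocks[index])[offset % block_size] = value;
    }

    uint8_t read_byte(size_t offset) const
    {
        size_t index = offset / block_size;
        if (index >= blocks.size() || !blocks[index])
            return 0; // hole
        return (*blocks[index])[offset % block_size];
    }
};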
This makes it easier to differentiate between cases where certain
functionality is not implemented vs. cases where a code location
should really be unreachable.
It costs us nothing, and some utilities (such as the well-known file
utility) rely on the exposed file size (after doing lstat on it) to show
anything useful besides saying the file is "empty".
In a few places we check `!Processor::in_critical()` to validate
that the current processor doesn't hold any kernel spinlocks.
Instead, let's give this check a first-class name for readability.
I'll also be adding more of these, so I would rather add more usages
of a nice API than of this implicit/assumed logic.
This change ensures that the scheduler doesn't depend on platform- or
arch-specific code when it initializes itself; rather, we ensure at
compile time that the appropriate code is generated to find the correct
arch-specific current-time methods.
For some odd reason we used to return a PhysicalPtr for the
page_table_base result, but when setting it we accepted only a 32-bit
value, so we truncated valid 64-bit addresses into 32-bit addresses.
With this commit applied, PageDirectories can now be located beyond the
4 GiB barrier.
This was found by sin-ack, so he should be appropriately credited for
this fix with a Co-authored-by tag.
Co-authored-by: sin-ack <sin-ack@users.noreply.github.com>
There is no particular reason why this section should be marked as
`NOBITS` (as it might very well include initialized values), and it
resolves 90% of the mismatches between the input and output sections,
which LLD now warns about when linking.
Doesn't use them in libc headers so that those don't have to pull in
AK/Platform.h.
AK_COMPILER_GCC is set _only_ for gcc, not for clang too. (__GNUC__ is
defined in clang builds as well.) Using AK_COMPILER_GCC simplifies
things some.
AK_COMPILER_CLANG isn't as much of a win, other than that it's
consistent with AK_COMPILER_GCC.
Nobody uses this because the x86 prekernel environment corrupts the
ramdisk image prior to running the actual kernel. In the future we can
ensure that the prekernel doesn't corrupt the ramdisk if we want to
bring support back. In addition to that, we could just use a RAM-based
filesystem to load whatever is needed, like on Linux, without the need
for an additional filesystem driver.
For the mentioned corruption problem, look at issue #9893.
The BootFramebufferConsole class maps the framebuffer using the
MemoryManager, so to be able to draw the logo, we need to get this
mapped framebuffer. This commit adds an unsafe API for that.
The MemoryManager now works, so we can use the same code as on x86 to
map the framebuffer. Since it uses the MemoryManager, the initialization
of the BootFramebufferConsole has to happen after the MemoryManager is
working.
For the initial page tables we only need to identity map the kernel
image, the rest of the memory will be managed by the MemoryManager. The
linker script is updated to get the kernel image start and end
addresses.
The page table and page directory formats are architecture specific, so
move the headers into the Arch directory. Also move the aarch64 page
table constants from aarch64/MMU.cpp to aarch64/PageDirectory.h.
When an exception happens it is sometimes hard to figure out where
exactly the exception happened, so use the frame pointer of the trap
frame to print a backtrace.
We no longer need to lock m_inode_lock in the SharedInodeVMObject code,
as the write_bytes and read_bytes methods of the Inode class now do this
for us.
According to Dr. POSIX, we should allow calling mmap on inodes even for
ranges that currently don't map to any actual data. Trying to read or
write to those ranges should result in SIGBUS being sent to the thread
that performed the violating memory access.
To implement this restriction, we simply check if read_bytes on an Inode
returns 0, which means we have nothing valid to map to the program,
hence it should receive a SIGBUS in that case.
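A hedged sketch of the fault-path decision, with simplified types and
signatures (the real logic lives in the memory-management page fault
handler):

// Returns true if the faulting thread should receive SIGBUS because the
// inode has no data backing the faulting page.
ErrorOr<bool> should_send_sigbus_for_inode_page(Inode& inode, size_t offset_in_inode, Bytes page_buffer)
{
    auto nread = TRY(inode.read_bytes(offset_in_inode, page_buffer.size(), page_buffer.data(), nullptr));
    return nread == 0; // nothing valid to map -> SIGBUS
}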
This value will be used later on by WindowServer to reject resolutions
that will request a mapping that will overflow the hardware framebuffer
max length.
This class is intended to replace all IOAddress usages in the Kernel
codebase altogether. The idea is to ensure IO can be done in an
arch-specific manner that is determined mostly at compile time, but to
still be able to use most of the Kernel code in non-x86 builds. Specific
devices that rely on x86-specific IO instructions are already placed in
the Arch/x86 directory and are omitted for non-x86 builds.
The reason this works so well is the fact that x86 IO space acts in a
similar fashion to the traditional memory space being available in most
CPU architectures - the x86 IO space is essentially just an array of
bytes like the physical memory address space, but requires x86 IO
instructions to load and store data. Therefore, many devices allow host
software to interact with the hardware registers in both ways, with a
noticeable trend even in the modern x86 hardware to move away from the
old x86 IO space to exclusively using memory-mapped IO.
Therefore, the IOWindow class encapsulates both methods for x86 builds.
The idea is to allow PCI devices to be used in either way in x86 builds,
so when trying to map an IOWindow on a PCI BAR, the Kernel will try to
find the proper method being declared with the PCI BAR flags.
For old PCI hardware on non-x86 builds this might turn into a problem,
as we can't use port-mapped IO, so the Kernel will gracefully fail with
the ENOTSUP error code in that case, as there's really nothing we can do
about it.
For general IO, the read{8,16,32} and write{8,16,32} methods are
available as a convenient API for other places in the Kernel. There are
simply no direct 64-bit IO API methods yet, as they're not needed right
now and would not be arch-agnostic anyway - the x86 IO space doesn't
support generating 64-bit cycles on the IO bus and instead requires two
32-bit accesses. If for whatever reason it becomes necessary to do IO in
such a manner, it could probably be added with some neat tricks. It is
recommended to use the Memory::TypedMapping struct if direct 64-bit IO
is actually needed.
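A hypothetical usage sketch (the factory name and its arguments may
differ from the real API): a PCI driver maps a BAR once and then no
longer cares whether it is port I/O or memory-mapped I/O.

static ErrorOr<void> probe_example_pci_device(PCI::DeviceIdentifier const& device_identifier)
{
    auto io_window = TRY(IOWindow::create_for_pci_device_bar(device_identifier, PCI::HeaderType0BaseRegister::BAR0));
    u8 status = io_window->read8(0x0); // 8-bit read at offset 0 within the window
    io_window->write32(0x4, 0x1);      // 32-bit write at offset 4 within the window
    (void)status;
    return {};
}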
The APICTimer, HPET and RTC (the RTC timer is in the context of the PC
RTC here) are timers that exist only in x86 platforms, therefore, we
move the handling code and the initialization code to the Arch/x86/Time
directory. Other related code patterns in the TimeManagement singleton
and in the Random.cpp file are guarded with #ifdef to ensure they are
only compiled for x86 builds.
The new VGAIOArbiter class is now responsible for issuing x86-specific
instructions to control VGA hardware through the old ISA ports. This allows
us to ensure the GraphicsManagement code doesn't use x86-specific code,
thus allowing it to be compiled within non-x86 kernel builds.
The BootFramebufferConsole highly depends on using the m_lock spinlock,
therefore setting and changing the cursor state should be done under
that spinlock too to avoid crashing.
Only the Console code in the Graphics directory should be able to call
on these methods. The set_cursor method stays public as VirtualConsole
uses that method to change the cursor position.
This device is supposed to be used in microvm and ISA-PC machine types,
and we assume that if we are able to probe for the QEMU BGA version of
0xB0C5, then we have an existing ISA Bochs VGA adapter to utilize.
To ensure we don't instantiate the driver for non-isa-vga devices, we
check that PCI is disabled (because the hardware IO test probe failed),
so we can be sure that we use this special handling code only in the
QEMU microvm and ISA-PC machine types. Unfortunately, this means that if
for some reason the isa-vga device is attached to the i440FX or Q35
machine types, we simply are not able to drive the device in such setups
at all.
To determine the amount of VRAM available, we read the VBE register at
offset 0xA. That register holds the amount of VRAM divided by 64K, so we
need to multiply the value in our code to get back the actual VRAM size.
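The arithmetic, as a tiny helper (the register read itself is omitted):

// The VBE register at offset 0xA reports the VRAM size in 64 KiB units.
size_t vram_size_in_bytes(u16 vram_capacity_register_value)
{
    return static_cast<size_t>(vram_capacity_register_value) * 64 * 1024;
}
// e.g. a register value of 256 means 256 * 64 KiB = 16 MiB of VRAM.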
The isa-vga device requires us to hardcode the framebuffer physical
address to 0xE0000000, and that address is not expected to change in the
future as many other projects rely on the isa-vga framebuffer to be
present at that physical memory address.
We should aim to reliably determine if PCI hardware exists or not, and
we should consider the ACPI MCFG table in that test. Although it is
unusual to see a hardware setup where the PCI host bridge does not
respond to x86 IO instructions, it is expected to happen at least on the
QEMU microvm machine type as the host bridge only responds to memory
mapped IO requests. Therefore, we first test if ACPI is enabled, and we
try to use it to fetch the MCFG table. Later on we could also add FDT
parsing as part of the PCI IO test which would be useful for the QEMU
microvm machine type.
We use a ScopeGuard to ensure we always set a console of some sort if we
exit early from the initialization sequence in the GraphicsManagement
code. We do so to ensure we can boot into text mode console in an ISA-PC
machine type, because earlier we failed with an assertion due to not
setting any console for VirtualConsole to use.
ISA IDE controllers don't support Bus-master DMA as this feature is only
available for PCI IDE controllers. Therefore, don't try to use DMA mode
for such hardware.
The code in init.cpp is specific to the x86 initialization sequence, so
move it to the Arch/x86 directory, in the same fashion as the aarch64
pattern.
The PIC and APIC code are specific to x86 platforms, so move them out of
the general Interrupts directory to Arch/x86/common/Interrupts directory
instead.
That code heavily relies on x86-specific instructions, and while other
CPU architectures and platforms can have PCI IDE controllers, currently
we don't support those, so this code is a special case which needs to be
in the Arch/x86 directory.
In the future it could be put back to the original place when we make it
more generic and suitable for other platforms.
The i8042 controller with its attached devices, the PS2 keyboard and
mouse, rely on x86-specific IO instructions to work. Therefore, move
them to the Arch/x86 directory to make it easier to omit the handling
code of these devices.
The ISA IDE controller code makes sense to compile only in an x86 build
as it relies on access to the x86 IO space. For other architectures, we
can just omit the code, as there's no way we could use it there anyway.
To ensure we can omit the code easily, we move it to the Arch/x86
directory.
The AHCI code doesn't rely on x86 IO at all as it only uses memory
mapped IO so we can simply remove the header.
We also simply don't use x86 IO in the Intel graphics driver, so we can
simply remove the include of the x86 IO header there too.
Everything else was a bunch of stale includes to the x86 IO header and
are actually not necessary, so let's remove them to make it easier to
compile non-x86 Kernel builds.
The VMWare backdoor handling code involves many x86-specific
instructions and therefore should be in the Arch/x86 directory. This
ensures we can easily omit the code in compile-time for non-x86 builds.
It seems more correct to let each platform define its own sequence of
initialization of the PCI bus, so let's remove the #if flags and just
put the entire Initializer.cpp file in the appropriate code directory.
Only use the Bochs debug output if we compile an x86 build, since the
Bochs debug output relies on x86-specific instructions.
We also remove the CONSOLE_OUT_TO_BOCHS_DEBUG_PORT flag, as we always
compile the Bochs debug output for x86 builds: we always want to include
this capability, as it is very handy and doesn't hurt bare-metal
hardware or cause any other problem besides taking a small amount of CPU
cycles.
The simple PCI::HostBridge class implements access to the PCI
configuration space by using x86 IO instructions. Therefore, it should
be put in the Arch/x86/PCI directory so it can be easily omitted for
non-x86 builds.
kprintf should not really care about the hardware-specific details of
each UART or serial port out there, so instead of using x86-specific
instructions, let's ensure that we only compile the debug output code
that is relevant for the targeted platform.
The RTC and CMOS are currently only supported on x86 platforms and use
specific x86 instructions to perform certain x86-platform-only
operations, so we move them to the Arch/x86-specific directory.
Many code patterns and hardware procedures rely on reliable delays with
microsecond granularity. Such delays are valid use cases, but they
should not rely on x86-specific code, so we allow the proper
platform-specific code for invoking such delays to be determined at
compile time.
We only use the RTC code in the Kernel, so it doesn't make sense to
keep the RTC namespace outside of it. In addition, we will later need to
use the RTC in an x86-specific manner, and this will help us use the
code in that fashion.
We move the QEMU and VirtualBox shutdown sequences to a separate file,
and also move the i8042 reboot code sequence to another file.
This allows us to abstract specific methods from the power state node
code of the SysFS filesystem, to allow other architectures to put their
methods there too in the future.
Using the IO address space is only relevant for x86 machines, so let's
not compile instructions to access the PCI configuration space when we
don't target x86 platforms.
This remained undetected for a long time as HeaderCheck is disabled by
default. This commit makes the following file compile again:
// file: compile_me.cpp
#include <Kernel/API/POSIX/ucontext.h>
// That's it, this was enough to cause a compilation error.
According to Dr. POSIX, we should allow calling mmap on inodes even for
ranges that currently don't map to any actual data. Trying to read or
write to those ranges should result in SIGBUS being sent to the thread
that performed the violating memory access.
We make these methods non-virtual because we want to ensure we properly
enforce locking of the m_inode_lock mutex. Also, for write operations,
we want to call prepare_to_write_data before the actual write. The
previous design required us to ensure the callers did that at various
places, which led to hard-to-find bugs. By moving everything to a place
where we call prepare_to_write_data only once, we eliminate the
possibility of forgetting to call it on some code path in the kernel.
The block list required a bit of work; now the only method declared
const that bypasses its const-ness is read_bytes, which calls a new
method named compute_block_list_with_exclusive_locking that takes care
of proper locking before trying to update the block list data of the
ext2 inode.
Before this patch, we supported two methods to address a boot device:
1. Specifying root=/dev/hdXY, where X is a-z letter which corresponds to
a boot device, and Y as number from 1 to 16, to indicate the partition
number, which can be omitted to instruct the kernel to use a raw device
rather than a partition on a raw device.
2. Specifying root=PARTUUID: with the GUID string of a GUID partition.
In the case of an existing storage device with GPT partitions, this is
most likely the safest option to ensure booting from persistent storage.
While option 2 is more advanced and reliable, the first option has 2
caveats:
1. The string prefix "/dev/hd" doesn't mean anything besides being a
convention on Linux installations that was adopted in Serenity. In
Serenity we don't mount DevTmpFS before we mount the boot device on /,
so the kernel doesn't really access /dev anyway; this convention is only
a big misleading relic that can easily make the user assume we access
/dev early on boot.
2. Although this convention resembles the simple Linux convention, it
is quite limited in specifying a correct boot device across hardware
setup changes, so option 2 was recommended to ensure the system is
always bootable.
With these caveats in mind, this commit tries to fix the problem by
adding more addressing options as well as removing the first addressing
option mentioned above.
To sum it up, there are 4 addressing options:
1. Hardware relative address - Each instance of StorageController is
assigned an index number relative to the type of hardware it handles,
which makes it possible to address storage devices with a prefix of the
command set ("ata" for ATA, "nvme" for NVMe, "ramdisk" for plain
memory), then a number for the parent controller's relative hardware
index, another number for the LUN target_id, and a third number for the
LUN disk_id.
2. LUN address - Similar to the previous option, but instead we rely on
the parent controller absolute index for the first number.
3. Block device major and minor numbers - by specifying the major and
minor numbers, the kernel can simply try to get the corresponding block
device and use it as the boot device.
4. GUID string, in the same fashion as before: the user uses the
"PARTUUID:" string prefix and adds the GUID of the GPT partition.
For the new address modes 1 and 2, the user can also choose to specify
a partition out of the selected boot device. To do that, the user
appends a semicolon character followed by the string "partX", where X is
the partition number. We start counting from 0, so the first partition
number is 0 and not 1 in the kernel boot argument.
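Some hypothetical command-line examples; the prefixes and the ";partX"
suffix follow the description above, while the ':' separators between
the numbers are an assumption here:

root=ata0:0:0          # hardware-relative: first ATA controller, LUN target 0, disk 0
root=nvme0:1:0;part0   # same scheme for NVMe, selecting the first partition (counted from 0)
root=PARTUUID:<GUID-of-the-GPT-partition>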
This reworks the way the UHCI schedule is set up to handle interrupt
transfers, creating 11 queue heads each assigned a different
period/latency, so that interrupt transfers can be linked into the
schedule with their specified period more easily.
Modifies the way the UHCI schedule is set up & modified to allow for
multiple transfers of the same type, from one or more devices, to be
queued up and handled simultaneously.
Otherwise we would be holding the MM global data lock and the Process
address space locks in reversed order to the rest of the system, which
can lead to deadlocks.
This is a left-over from back when we didn't have any locking on the
global Process list, nor did we have SMP support, so this acted as some
kind of locking mechanism. We now have proper locks around the Process
list, so this is no longer relevant.
Change the name of set_serial_debug(bool on_or_off) to
set_serial_debug_enabled(bool desired_state). This is to make the names
more expressive and less unclear as to what the function does, as it
only sets the enabled state.
Likewise, change the name of get_serial_debug() to
is_serial_debug_enabled() in order to make clear from the name that
this is simply the state of s_serial_debug_enabled.
Change the name of serial_debug to s_serial_debug_enabled since this is
a static bool describing this state.
Finally, change the signature of set_serial_debug_enabled to return a
bool, as this is more logical and understandable.
Now that the Spinlock code does not depend on architecture-specific
code anymore, we can move it back to the Locking folder. This also means
that the Spinlock implementation is now used for the aarch64 kernel.
This commit updates the lock function from Spinlock and
RecursiveSpinlock to return the InterruptsState of the processor,
instead of the processor flags. The unlock functions would only look at
the interrupt flag of the processor flags, so we now use the
InterruptsState enum to clarify the intent, and such that we can use the
same Spinlock code for the aarch64 build.
To not break the build, all the call sites are updated as well.
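The resulting call-site pattern, simplified (lock-rank and template
details omitted):

template<typename LockType>
void do_protected_work(LockType& lock)
{
    InterruptsState previous_interrupts_state = lock.lock();
    // ... critical section runs with the lock held ...
    lock.unlock(previous_interrupts_state);
}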
This commit adds the concept of an InterruptsState to the kernel. This
will be used to make the Spinlock code architecture independent. A new
Processor.cpp file is added such that we don't have to duplicate the
code.
Globally shared MemoryManager state is now kept in a GlobalData struct
and wrapped in SpinlockProtected.
A small set of members are left outside the GlobalData struct as they
are only set during boot initialization, and then remain constant.
This allows us to access those members without taking any locks.
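A hedged sketch of the access pattern, assuming AK's SpinlockProtected
wrapper (member names are illustrative and lock-rank details are
omitted):

struct GlobalData {
    // ... globally shared MemoryManager state ...
};

class MemoryManagerSketch {
public:
    void example_access()
    {
        m_global_data.with([&](GlobalData& data) {
            // The spinlock is held for the duration of this lambda.
            (void)data;
        });
    }

private:
    SpinlockProtected<GlobalData> m_global_data {};
};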
I believe this to be safe, as the main thing that LockRefPtr provides
over RefPtr is safe copying from a shared LockRefPtr instance. I've
inspected the uses of RefPtr<PhysicalPage> and it seems they're all
guarded by external locking. Some of it is less obvious, but this is
an area where we're making continuous headway.
We're not accessing any of the MM members here. Also remove some
redundant code to update CR3, since it calls activate_page_directory()
which does exactly the same thing.
Now that AddressSpace itself is always SpinlockProtected, we don't
need to also wrap the RegionTree. Whoever has the AddressSpace locked
is free to poke around its tree.
This allows sys$mprotect() to honor the original readable & writable
flags of the open file description as they were at the point we did the
original sys$mmap().
IIUC, this is what Dr. POSIX wants us to do:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/mprotect.html
Also, remove the bogus and racy "W^X" checking we did against mappings
based on their current inode metadata. If we want to do this, we can do
it properly. For now, it was not only racy, but also did blocking I/O
while holding a spinlock.
Before this change, we had File::mmap() which did all the work of
setting up a VMObject, and then creating a Region in the current
process's address space.
This patch simplifies the interface by removing the region part.
Files now only have to return a suitable VMObject from
vmobject_for_mmap(), and then sys$mmap() itself will take care of
actually mapping it into the address space.
This fixes an issue where we'd try to block on I/O (for inode metadata
lookup) while holding the address space spinlock. It also reduces time
spent holding the address space lock.
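Roughly, the interface narrows to something like this (return and
parameter types are approximations, not the exact kernel signature):

class File {
public:
    // Before: File::mmap() built the VMObject *and* the Region itself.
    // After: the file only supplies a VMObject; sys$mmap() creates the Region.
    virtual ErrorOr<NonnullRefPtr<Memory::VMObject>> vmobject_for_mmap(
        Process&, Memory::VirtualRange const&, u64& offset, bool shared);
};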
This forces anyone who wants to look into and/or manipulate an address
space to lock it. And this replaces the previous, more flimsy, manual
spinlock use.
Note that pointers *into* the address space are not safe to use after
you unlock the space. We've got many issues like this, and we'll have
to track them down as well.
We were holding the MM lock across all of the region unmapping code.
This was previously necessary since the quickmaps used during unmapping
required holding the MM lock.
Now that it's no longer necessary, we can leave the MM lock alone here.
This makes locking them much more straightforward, and we can remove
a bunch of confusing use of AddressSpace::m_lock. That lock will also
be converted to use of SpinlockProtected in a subsequent patch.
By default these 2 fields were zero, which made it rely on
implementation-defined behavior whether these fields would internally be
set to the correct value. The ARM processor in the Raspberry Pi (and
QEMU 6.x) would actually fix up these values, whereas QEMU 7.x no longer
does that, and a translation fault is generated instead.
For more context see the relevant QEMU issue:
- https://gitlab.com/qemu-project/qemu/-/issues/1157
Fixes #14856
Boot profiling was previously broken due to init_stage2() passing the
event mask to sys$profiling_enable() via kernel pointer, but a user
pointer is expected.
To fix this, I added Process::profiling_enable() as an alternative to
Process::sys$profiling_enable which takes a u64 rather than a
Userspace<u64 const*>. It's a bit of a hack, but it works.