The definitions were being defined already by `BootInfo.h` and that was
being included here via transitive includes. The extern definitions of
the variables do not have the `READONLY_AFTER_INIT` attribute in
`BootInfo.h`. This causes conflicting definitions of the same variable.
The `READONLY_AFTER_INIT` specifier is not needed for extern variables
as it only effects their linkage, not their actual use, so just use the
versions in `BootInfo.h` instead of re-declaring.
These were easy to pick-up as these pointers are assigned during the
construction point and are never changed afterwards.
This small change to these pointers will ensure that our code will not
accidentally assign these pointers with a new object which is always a
kind of bug we will want to prevent.
When switching to the new address space, we also have to switch the
Process::m_master_tls_* variables as they may refer to a region in
the old address space.
This was causing `su` to not run correctly.
Regression from 65641187ff.
Specifically this commit implements two setters set_userspace_sp and
set_ip in RegisterState.h, and also adds a stack pointer getter (sp) in
ThreadRegisters.h. Contributed by konrad, thanks for that.
Setting the page table base register (ttbr0_el1) is not enough, and will
not flush the TLB caches, in contrary with x86_64 where setting the CR3
register will actually flush the caches. This commit adds the necessary
code to properly flush the TLB caches when context switching. This
commit also changes Processor::flush_tlb_local to use the vmalle1
variant, as previously we would be flushing the tlb's of all the cores
in the inner-shareable domain.
Previously we had a race condition in the page fault handling: We were
relying on the affected Region staying alive while handling the page
fault, but this was not actually guaranteed, as an munmap from another
thread could result in the region being removed concurrently.
This commit closes that hole by extending the lifetime of the region
affected by the page fault until the handling of the page fault is
complete. This is achieved by maintaing a psuedo-reference count on the
region which counts the number of in-progress page faults being handled
on this region, and extending the lifetime of the region while this
counter is non zero.
Since both the increment of the counter by the page fault handler and
the spin loop waiting for it to reach 0 during Region destruction are
serialized using the appropriate AddressSpace spinlock, eventual
progress is guaranteed: As soon as the region is removed from the tree
no more page faults on the region can start.
And similarly correctness is ensured: The counter is incremented under
the same lock, so any page faults that are being handled will have
already incremented the counter before the region is deallocated.
This replaces the previous owning address space pointer. This commit
should not change any of the existing functionality, but it lays down
the groundwork needed to let us properly access the region table under
the address space spinlock during page fault handling.
Instead of setting up the new address space on it's own, and only swap
to the new address space at the end, we now immediately swap to the new
address space (while still keeping the old one alive) and only revert
back to the old one if we fail at any point.
This is done to ensure that the process' active address space (aka the
contents of m_space) always matches actual address space in use by it.
That should allow us to eventually make the page fault handler process-
aware, which will let us properly lock the process address space lock.
The current way we handle sync commands is very ugly and depends on lot
of preconditions. Now that we have an end_io handler for a request, we
can use WaitQueue to do sync commands more elegantly.
This does depend on block layer sending one request at a time but this
change is a step forward towards better IO handling.
There was a private variable named m_current_request which was used to
track a single request at a time. This guarantee is given by the block
layer where we wait on each IO. This design will break down in the
driver once the block layer removes that constraint.
Redesign the IO handling in a completely asynchronous way by maintaining
requests up to queue depth. NVMeIO struct is introduced to track an IO
submitted along with other information such whether the IO is still
being processed and an endio callback which will be called during the
end of a request.
A hashmap private variable is created which will key based on the
command id of a request with a value of NVMeIO. endio handler will come
in handy if we are doing a sync request and we want to wake up the wait
queue during the end.
This change also simplified the code by removing some special condition
in submit_sqe function, etc that were marked as FIXME for a long time.
Using sq_tail as cid makes an inherent assumption that we send only
one IO at a time. Use an atomic variable instead for command id of a
submission queue entry.
As sq_tail is not used as cid anymore, remove m_prev_sq_tail which used
to hold the last used sq_tail value.
The SID was duplicated between the process credentials and protected
data. And to make matters worse, the credentials SID was not updated in
sys$setsid.
This patch fixes this by removing the SID from protected data and
updating the credentials SID everywhere.
This closes two race windows:
- ProcessGroup removed itself from the "all process groups" list in its
destructor. It was possible to walk the list between the last unref()
and the destructor invocation, and grab a pointer to a ProcessGroup
that was about to get deleted.
- sys$setsid() could end up creating a process group that already
existed, as there was a race window between checking if the PGID
is used, and actually creating a ProcessGroup with that PGID.
Now that it's no longer using LockRefPtr, we can actually move it into
protected data. (LockRefPtr couldn't be stored there because protected
data is immutable at times, and LockRefPtr uses some of its own bits
for locking.)
This syscall sends a signal to other threads or itself. This mechanism
is already guarded by locking mechanisms, and widely used within the
kernel without help from the big lock.
...and also make the Process tick counters clock_t instead of u32.
It seems harmless to get interrupted in the middle of reading these
counters and reporting slightly fewer ticks in some category.
Expand the following types from 32-bit to 64-bit:
- blkcnt_t
- blksize_t
- dev_t
- nlink_t
- suseconds_t
- clock_t
This matches their size on other 64-bit systems.
This syscall is only concerned with the current thread (except in the
case of a pledge violation, when it will add some details about that
to the process coredump metadata. That stuff is already serialized.)
This syscall operates on the file descriptor table, and on individual
open file descriptions. Both of those are already protected by scoped
locking mechanisms.
This syscall had a TOCTOU where it checked the peer's PPID before
locking the protected data (where the PPID is stored).
After closing the race window, we can mark the syscall as not needing
the big lock.
There was a whole bunch of ref counting churn coming from Mutex, which
had a RefPtr<Thread> m_holder to (mostly) point at the thread holding
the mutex.
Since we never actually dereference the m_holder value, but only use it
for identity checks against thread pointers, we can store it as an
uintptr_t and skip the ref counting entirely.
Threads can't die while holding a mutex anyway, so there's no risk of
them going missing on us.
These were stored in a bunch of places. The main one that's a bit iffy
is the Mutex::m_holder one, which I'm going to simplify in a subsequent
commit.
In Plan9FS and WorkQueue, we can't make the NNRPs const due to
initialization order problems. That's probably doable with further
cleanup, but left as an exercise for our future selves.
Before starting this, I expected the thread blockers to be a problem,
but as it turns out they were super straightforward (for once!) as they
don't mutate the thread after initiating a block, so they can just use
simple const-ified NNRPs.
- Instead of taking the first new thread as an out-parameter, we now
bundle the process and its first thread in a struct and use that
as the return value.
- Make all Process factory functions return ErrorOr. Use this to convert
some places to more TRY().
- Drop the "try_" prefix on Process factory functions.
The only persistent one of these was Thread::m_process and that never
changes after initialization. Make it const to enforce this and switch
everything over to RefPtr & NonnullRefPtr.
- The host custody never changes after initialization, so there's no
need to protect it with a spinlock.
- To enforce the fact that some members don't change after
initialization, make them const.
There was only one permanent storage location for these: as a member
in the Mount class.
That member is never modified after Mount initialization, so we don't
need to worry about races there.
This matches x86_64's behaviour in common_trap_exit. (called from
thread_context_first_enter)
Currently thread_context_first_enter is only called when creating new
processes from scratch, in which case this doesn't change the actual
behaviour. But once thread_context_first_enter is called as part of
execve support, this will ensure the Thread's m_current_trap is set
correctly to the new trap frame.
The details of the specific interrupt bits that must be turned on are
irrelevant to the sys$execve implementation. Abstract it away to the
Processor implementations using the InterruptsState enum.
Forked processes already have an existing value for the link register,
which we can't overwrite. But since they're forked the original link
register value that points to exit_kernel_thread was already saved
somewhere on the stack, so it's ok not to set it.
This commit fixes a kernel panic that happened when unmounting
a disk due to an invalid memory access.
This was because `DiskCache` initializes two linked lists that use
an argument `KBuffer` as the storage for their elements.
Since the member `KBuffer` was declared after the two lists,
when `DiskCache`'s destructor was called, then `KBuffer`'s destructor
was called before the ones of the two lists, causing a page fault in
the kernel.
Instead, only update it when the Caps Lock key event is generated and
remapping to the Ctrl key is enabled.
This fixes a bug that when enabling remapping Caps Lock key to the Ctrl
key, the original Ctrl key is no longer usable.
This is an implementation that tries to follow the spec as closely as
possible, and works with Qemu's Intel HDA and some bare metal HDA
controllers out there. Compiling with `INTEL_HDA_DEBUG=on` will provide
a lot of detailed information that could help us getting this to work
on more bare metal controllers as well :^)
Output format is limited to `i16` samples for now.
This reverts commit 187723776a.
This was reverted because it was needed until the aarch64 port
got an SD card driver
Co-authored-by: Ollrogge <nils-ollrogge@outlook.de>
It appeared that we sometimes failed to invoke synchronous commands on
the GPU. To temporarily fix this, wait 10 milliseconds for commands to
complete before failing.
According to the specification, modesetting can be invoked with no need
for flushing the framebuffer nor with DMA to transfer the framebuffer
rendering.
This configuration exposes a suboptimal mechanism to access other
VirtIO device configurations. It is also the only configuration to use a
zero length for a configuration structure, and specify a valid BAR which
triggered a kernel panic when attaching a virtio-gpu-pci device before
95b15e4901 was applied.
The real solution for that problem is to ignore this configuration type
because we never actually use it. It means that we can VERIFY that all
other configuration types have a valid length, as being expected.
Before, the mapping of our HBA region would be done in the constructor.
Since this can fail, I moved it into initialize().
Additionally, we now use the TypedMapping helper for mapping the HBA
instead of doing it manually. This actually uncovered a bug where we
would ignore any possible offset into the page we were mapping, which
caused us to miss the mapped registers entirely.
It makes much more sense to have these actions being performed via the
prctl syscall, as they both require 2 plain arguments to be passed to
the syscall layer, and in contrast to most syscalls, we don't get in
these removed syscalls an automatic representation of Userspace<T>, but
two FlatPtr(s) to perform casting on them in the prctl syscall which is
suited to what has been done in the removed syscalls.
Also, it makes sense to have these actions in the prctl syscall, because
they are strongly related to the process control concept of the prctl
syscall.
Use the same pattern for Ramdisk similar to other storage devices during
device initialization. This will propagate errors if the Ramdisk fails
to initialize.
Storage controllers are initialized during init and are never modified.
NonnullRefPtr can be safely used instead of the NonnullLockRefPtr. This
also fixes one of the UB issue that was there when using an NVMe device
because of NonnullLockRefPtr.
We can add proper locking when we need to modify the storage controllers
after init.
Any userspace cpp file that included <syscall.h> would end up with
a large glob of Kernel headers included, all the way down to
Kernel/Arch/x86_64/CPU.h and friends.
Only the kernel needs RegisterState, so hide it from userspace.
This is done with 2 major steps:
1. Remove JailManagement singleton and use a structure that resembles
what we have with the Process object. This is required later for the
second step in this commit, but on its own, is a major change that
removes this clunky singleton that had no real usage by itself.
2. Use IntrusiveLists to keep references to Process objects in the same
Jail so it will be much more straightforward to iterate on this kind
of objects when needed. Previously we locked the entire Process list
and we did a simple pointer comparison to check if the checked
Process we iterate on is in the same Jail or not, which required
taking multiple Spinlocks in a very clumsy and heavyweight way.
The LUN.target_id parameter points to a NVMe Namespace which starts from
1 and not 0. Fix the document to reflect the same while addressing a
nvme device in the boot parameters
This was mostly straightforward, as all the storage locations are
guarded by some related mutex.
The use of old-school associated mutexes instead of MutexProtected
is unfortunate, but the process to modernize such code is ongoing.
There's no plan to support ATAPI in the foreseeable future. ATAPI is
considered mostly as an extension to pass SCSI commands over ATA-link
compatible channel (which could be a physical SATA or PATA).
ATAPI is mostly used for controlling optical drives which are considered
obsolete in 2023, and require an entire SCSI abstraction layer we don't
exhibit with bypassing ioctls for sending specific SCSI commands in many
control-flow sequences for actions being taken for such hardware.
Therefore, let's make it clear we don't support ATAPI (SCSI over ATA)
unless someone picks it up and proves otherwise that this can be done
cleanly and also in a relevant way to our project.
These configurations are simply invalid. Ignoring those allow us to boot
with the virtio-gpu-pci device (in addition to the already supported
virtio-vga PCI device).
This patch switches away from {Nonnull,}LockRefPtr to the non-locking
smart pointers throughout the kernel.
I've looked at the handful of places where these were being persisted
and I don't see any race situations.
Note that the process file descriptor table (Process::m_fds) was already
guarded via MutexProtected.
This class had slightly confusing semantics and the added weirdness
doesn't seem worth it just so we can say "." instead of "->" when
iterating over a vector of NNRPs.
This patch replaces NonnullRefPtrVector<T> with Vector<NNRP<T>>.
Before of this patch, we looked at the unveil data of the FinalizerTask,
which naturally doesn't have any unveil restrictions, therefore allowing
an unveil bypass for a process that enabled performance coredumps.
To ensure we always check the dumped process unveil data, an option to
pass a Process& has been added to a couple of methods in the class of
VirtualFileSystem.
This allows us to get rid of an include to LibC/sys/ttydefaults.h in the
Kernel TTY implementation.
Also, move ttydefchars static const struct to another file called
Kernel/API/ttydefaultschars.h, so it could be used too in the Kernel TTY
implementation without the need to include anything from LibC.
We don't actually need the va_list and other stdarg definitions in the
kernel, because we actually don't use the "pure" printf interface in any
kernel code at all, but we retain the snprintf declaration because the
libstdc++ library still need it to be declared and extern'ed.
Instead of using a special case of the annotate_mapping syscall, let's
introduce a new prctl option to disallow further annotations of Regions
as new syscall Region(s).
Since the ProcFS doesn't hold many global objects within it, the need
for a fully-structured design of backing components and a registry like
with the SysFS is no longer true.
To acommodate this, let's remove all backing store and components of the
ProcFS, so now it resembles what we had in the early days of ProcFS in
the project - a mostly-static filesystem, with very small amount of
kmalloc allocations needed.
We still use the inode index mechanism to understand the role of each
inode, but this is done in a much "static"ier way than before.
Before doing a check if offset_in_region + num_bytes of the transfer
descriptor are together more than NUM_TRANSFER_REGION_PAGES * PAGE_SIZE,
check that addition of both of these parameters will not simply overflow
which could lead to out-of-bounds read/write.
Fixes#17518.
Support all the available clocks in clock_getres(). Also, fix this
function to use the actual ticks per second value, not the constant
`_SC_CLK_TCK` (which is always equal to 8) and move the resolution
computation logic to TimeManagement.
Dealing with the specific details of how to program a PLL should be done
in a separate file to ensure we can easily expand it to support future
generations of the Intel graphics device.
ENODEV better represents the fact that there might be no display device
(e.g. a monitor) connected to the connector, therefore we should return
this error.
Another reason to not use ENOTIMPL is that it's a requirement for all
DisplayConnectors to put a valid EDID in place even for a hardware we
don't currently support mode-setting in runtime.
Instead of doing that on the IntelDisplayPlane class, let's have this in
derived classes so these classes can decide how to use the settings that
were provided before calling the enable method.
It became apparent to me that future generations of the Intel graphics
chipset utilize the same register set as part of the Transcoder register
set. Therefore, it should be included now in the Transcoder class.
In the real world, graphics hardware tend to have multiple display
connectors. However, usually the connectors share one register space but
still keeping different PLL timings and display lanes.
This new class should represent a group of multiple display connectors
working together in the same Intel graphics adapter. This opens an
opportunity to abstract the interface so we could support future Intel
iGPU generations.
This is also a preparation before the driver can support newer devices
and utilize their capabilities.
The mentioned preparation is applied in a these aspects:
1. The code is splitted into more classes to adjust to future expansion.
2 classes are introduced: IntelDisplayPlane and IntelDisplayTranscoder,
so the IntelDisplayPlane controls the plane registers and second class
controls the pipeline (transcoder, encoder) registers. On gen4 it's not
really useful because there are probably one plane and one encoder to
care about, but in future generations, there are likely to be multiple
transcoders and planes to accommodate multi head support.
2. The set_edid_bytes method in the DisplayConnector class can now be
told to not assume the provided EDID bytes are always invalid. Therefore
it can refrain from printing error messages if this flag parameter is
true. This is useful for supporting real hardware situation when on boot
not all ports are connected to a monitor, which can result in floating
bus condition (essentially all the bytes we read are 0xFF).
3. An IntelNativeDisplayConnector could now be set to flag other types
of connections such as eDP (embedded DisplayPort), Analog output, etc.
This is important because on the Intel gen4 graphics we could assume to
have one analog output connector, but on future generations this is very
likely to not be the case, as there might be no VGA outputs, but rather
only an eDP connector which is converted to VGA by a design choice of
the motherboard manufacturer.
4. Add ConnectorIndex to IntelNativeDisplayConnector class - Currently
this is used to verify we always handle the correct connector when doing
modesetting.
Later, it will be used to locate special settings needed when handling
connector requests.
5. Prepare to support more types of display planes. For example, the
Intel Skylake register set for display planes is a bit different, so
let's ensure we can properly support it in the near future.
This subdirectory is meant to hold all constant data related to the
kernel. This means that this data is never meant to updated and is
relevant from system boot to system shutdown.
Move the inodes of "load_base", "cmdline" and "system_mode" to that
directory. All nodes under this new subdirectory are generated during
boot, and therefore don't require calling kmalloc each time we need to
read them. Locking is also not necessary, because these nodes and their
data are completely static once being generated.
This is considered somewhat an abstraction layer violation, because we
should always let userspace to decide on the root filesystem mount flags
because it allows the user to configure the mount table to preferences
that they desire.
Now that SystemServer is modified to re-mount the root mount with the
desired flags, we can just mount the root filesystem without assuming
special flags.
The check of ensuring we are not trying to read beyond the end of the
inode data buffer is already there, it's just that we need to disallow
further reading if the read offset equals to the inode data size.
Apparently we lacked this important check from the beginning of this
piece of code. This check is crucial to ensure we only give back data
being related to the FATInode data buffer and nothing beyond it.
This is necessary to support the wayland protocol.
I also moved the CMSG_* macros to the kernel API since they are used in
both kernel and userspace.
this does not break ntpquery/SCM_TIMESTAMP.
There was a bug in which bound Inodes would lose all their references
(because localsocket does not reference them), and they would be
deallocated, and clients would get ECONNREFUSED as a result. now
LocalSocket has a strong reference to inode so that the inode will live
as long as the socket, and Inode has a weak reference to the socket,
because if the socket stops being referenced anywhere it should not be
bound.
This still prevents the reference loop that
220b7dd779 was trying to fix.
Even though we currently build all of Userland and the Kernel with the
-mstrict-align flag, the compiler will still emit unaligned memory
accesses. To work around this, we disable the check for now. See
https://github.com/SerenityOS/serenity/issues/17516 for the relevant
issue.
This commit adds Processor::set_thread_specific_data, and this function
is used to factor out architecture specific implementation of setting
the thread specific data. This function is implemented for
aarch64 and x86_64, and the callsites are changed to use this function
instead.
This new method is meant to be used in both userspace and kernel code.
The idea is to allow printing of a verbose message and then returning an
errno code which is the proper mechanism for kernel code because we
should almost always assume that such error will be propagated back to
userspace in some way, so the userspace code could reasonably decode it.
For userspace code however, this new method is meant to be a simple
wrapper for Error::from_string_view, because for most invocations, it's
much more useful to have a verbose & literal error than a errno code, so
we simply ignore that errno code completely in such context.
Returning literal strings is not the proper action here, because we
should always assume that error could be propagated back to userland, so
we need to keep a valid errno when returning an Error.
Returning literal strings is not the proper action here, because we
should always assume that error could be propagated back to userland, so
we need to keep a valid errno when returning an Error.
For example, consider cases where we want to propagate errors only in
specific instances:
auto result = read_data(); // something like ErrorOr<ByteBuffer>
if (result.is_error() && result.error().code() != EINTR)
continue;
auto bytes = TRY(result);
The TRY invocation will currently copy the byte buffer when the
expression (in this case, just a local variable) is stored into
_temporary_result.
This patch binds the expression to a reference to prevent such copies.
In less trival invocations (such as TRY(some_function()), this will
incur only temporary lifetime extensions, i.e. no functional change.
This is needed so we can retrieve the registers of a traced
thread that was attached to while it was running.
Attaching with ptrace to a running thread sends SIGSTOP to it.
This adds the necessary code to init.cpp to be able to execute the first
userspace process. To do this, first the filesystem code is initialized,
which will use the ramdisk embedded into the kernel image. Then the
first userspace process, /bin/SystemServer is executed. :^)
The ramdisk code is used as it is useful for the bring-up of the aarch64
port, however once the kernel has support for better ram-based
filesystems, the ramdisk code will be removed again.