The first byte of the EBDA structure contains the size of the EBDA
in 1 KiB units. We were incorrectly using the word at offset 0x413
of the BDA which specifies the number of KiB before the EBDA structure.
Contradictory to the comment above it, this while loop was actually
clearing the selectors above or equal to the edited one (instead of
the selectors that were skipped when the gdt was extended), this wasn't
really an issue so far, as all calls to this function did extend the
GDT, which meant this condition was always false, but future calls to
this function that will try to edit an existing entry would fail.
Previously, one could put '\b' in a keymap, but in non-Terminal
applications, it would just insert a literal '\b' character instead of
behaving like backspace. This patch modifes
`visible_code_point_to_key_code` to include backspace, as well as
renaming it to `code_point_to_key_code` since '\b' is not a visible
character. Additionally, `KeyboardDevice::key_state_changed` has been
rearranged to apply the user's keymap before checking for things like
caps lock.
The function `KString::must_create()` can only be enforced
during early boot (that is, when `g_in_early_boot` is true), hence
the use of this function during runtime causes a `VERIFY` to assert,
leading to a Kernel Panic.
We should instead use `TRY()` along with `try_create()` to prevent
this from crashing whenever a USB device is inserted into the system,
and we don't have enough memory to allocate the device's KString.
Since NVME devices end with a digit that indicates the node index we
cannot simply append a partition index. Instead, there will be a "p"
character as separator, e.g. /dev/nvme0n1p3 for the 3rd partition.
So, if the early device name ends in a digit we need to add this
separater before matching for the partition index.
If the partition index is omitted (as is the default) the root file
system is on a disk without any partition table (e.g. using QEMU).
This enables booting from the correct partition on an NVMe drive by
setting the command line variable root to e.g. root=/dev/nvme0n1p1
We need to use the volatile keyword when mapping the device registers,
or the compiler may optimize access, which lead to this QEMU error:
pci_nvme_ub_mmiord_toosmall in nvme_mmio_read: MMIO read smaller than
32-bits, offset=0x0
This modifies sys$chown to allow specifying whether or not to follow
symlinks and in which directory.
This was then used to implement lchown and fchownat in LibC and LibCore.
Add a basic NVMe driver support to serenity
based on NVMe spec 1.4.
The driver can support multiple NVMe drives (subsystems).
But in a NVMe drive, the driver can support one controller
with multiple namespaces.
Each core will get a separate NVMe Queue.
As the system lacks MSI support, PIN based interrupts are
used for IO.
Tested the NVMe support by replacing IDE driver
with the NVMe driver :^)
Calls to link_up() in the E1000 driver would read the link state
directly from the hardware on every call. This had negative
performance impact in high throughput situations since link_up()
is called every time an IP packet's route is resolved.
This patch takes inspiration from the RTL8139 network adapter where
the link state is stored in a bool and only updated when the hardware
generates an interrupt related to link state change.
After this change I measured a ~9% increase in TCP Tx throughput
using:
cat /dev/zero | nc <host_IP> <host_port> from the Serenity VM to my
host machine
By using the binary from our build of binutils, we can be sure that `nm`
supports demangling symbols, so we can avoid spawning a separate
`c++filt` process.
This change adds a thread member variable to track if we have a pending
promise violation on a kernel thread. This ensures that all code
properly propagates promise violations up to the syscall handler.
Suggested-by: Andreas Kling <kling@serenityos.org>
Previously we would crash the process immediately when a promise
violation was found during a syscall. This is error prone, as we
don't unwind the stack. This means that in certain cases we can
leak resources, like an OwnPtr / RefPtr tracked on the stack. Or
even leak a lock acquired in a ScopeLockLocker.
To remedy this situation we move the promise violation handling to
the syscall handler, right before we return to user space. This
allows the code to follow the normal unwind path, and grantees
there is no longer any cleanup that needs to occur.
The Process::require_promise() and Process::require_no_promises()
functions were modified to return ErrorOr<void> so we enforce that
the errors are always propagated by the caller.
This change lays the foundation for making the require_promise return
an error hand handling the process abort outside of the syscall
implementations, to avoid cases where we would leak resources.
It also has the advantage that it makes removes a gs pointer read
to look up the current thread, then process for every syscall. We
can instead go through the Process this pointer in most cases.
This change lays the foundation for making the require_promise return
an error hand handling the process abort outside of the syscall
implementations, to avoid cases where we would leak resources.
It also has the advantage that it makes removes a gs pointer read
to look up the current thread, then process for every syscall. We
can instead go through the Process this pointer in most cases.
I fell into this trap and tried to switch the syscalls to pass by
the `off_t` by register. I think it makes sense to add a clarifying
comment for future readers of the code, so they don't fall into the
same trap. :^)
This is required for SlavePTY's custom unref handler to function
correctly, as otherwise a SlavePTY held in a File RefPtr would call
the base's (RefCounted<>) unref method instead of SlavePTY's version.
It looks like type types are small enough that there is no padding.
So there didn't happen to be an info leak here, but lets zero initialize
just to be on the safe side, and make auditing easier.
In `sys$accept4()` and `get_sock_or_peer_name()` we were not
initializing the padding of the `sockaddr_un` struct, leading to
an kernel information leak if the
caller looked back at it's contents.
Before Fix:
37.766 Clipboard(11:11): accept4 Bytes:
2f746d702f706f7274616c2f636c6970626f61726440eac130e7fbc1e8abbfc
19c10ffc18440eac15485bcc130e7fbc1549feaca6c9deaca549feaca1bb0bc
03efdf62c0e056eac1b402d7acd010ffc14602000001b0bc030100000050bf0
5c24602000001e7fbc1b402d7ac6bdc
After Fix:
0.603 Clipboard(11:11): accept4 Bytes:
2f746d702f706f7274616c2f636c6970626f617264000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000
In FB_IOCTL_GET_PROPERTIES we were not initializing the padding of the
struct, leading to the potential of an kernel information leak if the
caller looked back at it's contents.
Lets just be extra paranoid and zero initialize all these structs
in we store on the stack while handling ioctls(..).
... instead of returning the maximum number of Processor objects that we
can allocate.
Some ports (e.g. gdb) rely on this information to determine the number
of worker threads to spawn. When gdb spawned 64 threads, the kernel
could not cope with generating backtraces for it, which prevented us
from debugging it properly.
This commit also removes the confusingly named
`Processor::processor_count` function so that this mistake can't happen
again.
Since RefCounted automatically calls a method named `will_be_destoyed`
on classes that have one, so there's no need to have a custom
implementation of unref in File.
This will allow File and it's descendants to use RefCounted instead of
having a custom implementation of unref. (Since RefCounted calls
will_be_destroyed automatically)
This commit also removes an erroneous call to `before_removing` in
AHCIPort, this is a duplicate call, as the only reference to the device
is immediately dropped following the call, which in turns calls
`before_removing` via File::unref.
Custody's unref is one of many implementions of ListedRefCounted's
behaviour in the Kernel, which results in avoidable bugs caused by
the fragmentation of the implementations. This commit starts the work
of replacing all custom implementations with ListedRefCounted by
porting Custody to it.
Add a kernel data segment and make the user code segment come after
the data segment. We need the GDT to be in a certain order to support
the syscall and sysret instruction pair.
We've finally gotten kmalloc to a point where it feels decent enough
to drop this comment.
There's still a lot of room for improvement, and we'll continue working
on it.
This was a premature optimization from the early days of SerenityOS.
The eternal heap was a simple bump pointer allocator over a static
byte array. My original idea was to avoid heap fragmentation and improve
data locality, but both ideas were rooted in cargo culting, not data.
We would reserve 4 MiB at boot and only ended up using ~256 KiB, wasting
the rest.
This patch replaces all kmalloc_eternal() usage by regular kmalloc().
Since a socket can be accessed by multiple threads concurrently, we need
to protect shared data behind the socket mutex.
There's very likely more places where we need to fix this, the purpose
of this patch is to fix a VERIFY() failure in getsockopt() seen on CI.
We were doing this dance in notify_watchers():
set_metadata_dirty(true);
set_metadata_dirty(false);
This was done in order to force out inode watcher events immediately.
Unfortunately, this was racy, as if SyncTask got scheduled at the wrong
moment, it would try to flush metadata for a clean inode. This then got
trapped by the VERIFY() statement in Inode::sync_all():
VERIFY(inode.is_metadata_dirty());
This patch fixes the issue by replacing notify_watchers() with lazy
metadata notifications like all other filesystems.
Much like the existing in6addr_any global and the IN6ADDR_ANY_INIT
macro, our LibC is also expected to export the in6addr_loopback global
and the IN6ADDR_LOOPBACK_INIT constant.
These were found by the stress-ng port.
This patch adds generic slab allocators to kmalloc. In this initial
version, the slab sizes are 16, 32, 64, 128, 256 and 512 bytes.
Slabheaps are backed by 64 KiB block-aligned blocks with freelists,
similar to what we do in LibC malloc and LibJS Heap.
There are no more users of the C-style kfree() API in the kernel,
so let's get rid of it and enjoy the new world where we always know
how much memory we are freeing. :^)
This patch does two things:
- Combines kmalloc_aligned() and kmalloc_aligned_cxx(). Templatizing
the alignment parameter doesn't seem like a valuable enough
optimization to justify having two almost-identical implementations.
- Stores the real allocation size of an aligned allocation along with
the other alignment metadata, and uses it to call kfree_sized()
instead of kfree().
This class was misusing the outdate Lockable template and didn't take
advantage of the lock/resource separation mechanism fully anyway.
Since the underlying PRNG has its own SpinLock, and we already use that
for synchronization everywhere anyway, we can simply remove the Lockable
inheritance from this class.
Currently the APIC class is constructed irrespective of whether it
is used or not.
So, move APIC initialization from init to the InterruptManagement
class and construct the APIC class only when it is needed.
Since we allocate the subheap in the first page of the given storage
let's assert that the subheap can actually fit in a single page, to
prevent the possible future headache of trying to debug the cause of
random kernel memory corruption :^)
This avoids getting caught with our pants down when heap expansion fails
due to missing page tables. It also avoids a circular dependency on
kmalloc() by way of HashMap::set() in MemoryManager::ensure_pte().
If the data passed to sys$write() is backed by a not-yet-paged-in inode
mapping, we could end up in a situation where we get a page fault when
trying to copy data from userspace.
If that page fault handler tried reading from an inode that someone else
had locked while waiting for the disk cache lock, we'd deadlock.
This patch fixes the issue by copying the userspace data into a local
buffer before acquiring the disk cache lock. This is not ideal since it
incurs an extra copy, and I'm sure we can think of a better solution
eventually.
This was a frequent cause of startup deadlocks on x86_64 for me. :^)
Previously, the heap expansion logic could end up calling kmalloc
recursively, which was quite messy and hard to reason about.
This patch redesigns heap expansion so that it's kmalloc-free:
- We make a single large virtual range allocation at startup
- When expanding, we bump allocate VM from that region
- When expanding, we populate page tables directly ourselves,
instead of going via MemoryManager.
This makes heap expansion a great deal simpler. However, do note that it
introduces two new flaws that we'll need to deal with eventually:
- The single virtual range allocation is limited to 64 MiB and once
exhausted, kmalloc() will fail. (Actually, it will PANIC for now..)
- The kmalloc heap can no longer shrink once expanded. Subheaps stay
in place once constructed.
This was used to return a pre-locked UDPSocket in one place, but there
was really no need for that mechanism in the first place since the
caller ends up locking the socket anyway.
The function to protect ksyms after initialization, is only used during
boot of the system, so it can be UNMAP_AFTER_INIT as well.
This requires we switch the order of the init sequence, so we now call
`MM.protect_ksyms_after_init()` before `MM.unmap_text_after_init()`.
As a small cleanup, this also makes `page_round_up` verify its
precondition with `page_round_up_would_wrap` (which callers are expected
to call), rather than having its own logic.
Fixes#11297.
Most other syscalls pass address arguments as `void*` instead of
`uintptr_t`, so let's do that here too. Besides improving consistency,
this commit makes `strace` correctly pretty-print these arguments in
hex.
This feature was introduced in version 4.17 of the Linux kernel, and
while it's not specified by POSIX, I think it will be a nice addition to
our system.
MAP_FIXED_NOREPLACE provides a less error-prone alternative to
MAP_FIXED: while regular fixed mappings would cause any intersecting
ranges to be unmapped, MAP_FIXED_NOREPLACE returns EEXIST instead. This
ensures that we don't corrupt our process's address space if something
is already at the requested address.
Note that the more portable way to do this is to use regular
MAP_ANONYMOUS, and check afterwards whether the returned address matches
what we wanted. This, however, has a large performance impact on
programs like Wine which try to reserve large portions of the address
space at once, as the non-matching addresses have to be unmapped
separately.
This error only ever gets propagated to the userspace if
MAP_FIXED_NOREPLACE is requested, as MAP_FIXED unmaps intersecting
ranges beforehand, and non-fixed mmap() calls will just fall back to
allocating anywhere.
Linux specifies MAP_FIXED_NOREPLACE to return EEXIST when it can't
allocate, we now match that behavior.
Previously we were assigning to Process::m_space before actually
entering the new address space (assigning it to CR3.)
If a thread was preempted by the scheduler while destroying the old
address space, we'd then attempt to resume the thread with CR3 pointing
at a partially destroyed address space.
We could then crash immediately in write_cr3(), right after assigning
the new value to CR3. I am hopeful that this may have been the bug
haunting our CI for months. :^)
Instead, wait until we transition back to userspace. This stops
userspace from being able to suspend a thread indefinitely while it's
running in kernelspace (potentially holding some blocking mutex.)
Add the `posix_madvise(..)` LibC implementation that just forwards
to the normal `madvise(..)` implementation.
Also define a few POSIX_MADV_DONTNEED and POSIX_MADV_NORMAL as they
are part of the POSIX API for `posix_madvise(..)`.
This is needed by the `fio` port.
These 2 members are required by POSIX and are also used by some ports.
Zero is a valid value for both of these, so no further work to support
them is required.
The Prekernel's memory is only accessed until MemoryManager has been
initialized. Keeping them around afterwards is both unnecessary and bad,
as it prevents the userland from using the 0x100000-0x155000 virtual
address range.
Co-authored-by: Idan Horowitz <idan.horowitz@gmail.com>
As PROT_NONE regions can't be accessed by processes, and their only real
use is for reserving ranges of virtual memory, there's no point in
including them in coredumps.
Since this range is mapped in already in the kernel page directory, we
can initialize it before jumping into the first kernel process which
lets us avoid mapping in the range into init_stage2's address space.
This brings us half-way to removing the shared bottom 2 MiB mapping in
every process, leaving only the Prekernel.
The sa_family field in SIOCGIFHWADDR specifies the underlying network
interface's device type, this is hardcoded to generic "Ethernet" right
now, as we don't have a nice way to query it.
In order to reduce our reliance on __builtin_{ffs, clz, ctz, popcount},
this commit removes all calls to these functions and replaces them with
the equivalent functions in AK/BuiltinWrappers.h.
Not much to say here, this is an implementation of this call that
accesses the actual limit constant that's used by the VirtualFileSystem
class.
As a side note, this is required for my eventual Qt port.
As pointed out by BertalanD on Discord, POSIX specifies that
_SC_SYMLOOP_MAX (implemented in the following commit) always needs to be
equal or more than _POSIX_SYMLOOP_MAX (8, defined in
LibC/bits/posix1_lim.h), hence I've increased it to that value to
comply with the standard.
The move to header is required for the following commit - to make this
constant accessible outside of the VFS class, namely in sysconf.
For setreuid and setresuid syscalls, -1 means to set the current
uid/euid/gid/egid value, to be more convenient for programming.
However, for other syscalls where we pass only one argument, there's no
justification to specify -1.
This behavior is identical to how Linux handles the value -1, and is
influenced by the fact that the manual pages for the group of one
argument syscalls that handle ID operations is ambiguous about this
topic.
Unsurprisingly, the /proc/PID/stacks/TID stack walk had the same
arbitrary memory read problem as the perf event stack walk.
It would be nice if the kernel had a single stack walk implementation,
but that's outside the scope of this commit.
Since perfcore files can be generated during process finalization,
we can't just allow them to contain sensitive kernel information
if they're gonna be owned by the process's own UID+GID.
So instead, perfcores are now owned by 0:0. This is not the most
ergonomic solution, but I'm not sure what we could do to make it nicer.
We'll have to think more about that. In the meantime, this patches up
a kernel info leak. :^)
When walking the stack to generate a perf_event sample, we now check
if a userspace stack frame points back into kernel memory.
It was possible to use this as an arbitrary kernel memory read. :^)
We can leave the .ksyms section mapped-but-read-only and then have the
symbols index simply point into it.
Note that we manually insert null-terminators into the symbols section
while parsing it.
This gets rid of ~950 KiB of kmalloc_eternal() at startup. :^)
We used to build with -Os in order to fit within a certain size, but
there isn't really a good reason for that kind of restriction.
Switching to -O2 yields a significant improvement in throughput,
for example `test-js` is roughly 20% faster on my machine. :^)
This fixes at least half of our LibC includes in the kernel. The source
of truth for errno codes and their description strings now lives in
Kernel/API/POSIX/errno.h as an enumeration, which LibC includes.
Before this commit, we only checked the receive buffer on the socket,
which is unused on datagram streams. Now we return the actual size of
the datagram without the protocol headers, which required the protocol
to tell us what the size of the payload is.
This small change allows to use the IOAPIC by default without to enable
SMP mode, which emulates Uni-Processor setup with IOAPIC instead of
using the PIC.
This opens the opportunity to utilize other types of interrupts like MSI
and MSI-X interrupts.
The MADT data could be on unaligned boundary - for example, a GSI number
(u32) on unaligned address which leads to a KUBSAN error and halting the
system.
Using the phrase "create" doesn't give information on whether the object
must be allocated or a failure to do so can be handled gracefully.
Therefore, we must use better phrase for such purpose, so "must_create"
for the allocate-and-construct static methods is definitely good choice.
Instead, allocate before constructing the object and pass NonnullOwnPtr
of KString to the object if needed. Some classes can determine their
names as they have a known attribute to look for or have a static name.
- dump_backtrace was using ebp instead of rbp on x86_64, only using the
lower 32-bits of rbp.
- The symbol loader was only fetching half of the pointer from the
symbol table. (8 chars instead of 16 chars)
Since it's possible to determine where the small zones will start to
occur for each PhysicalRegion, we can use arithmetic so that the call
time for both large and small zones is identical.
This file refers to the controlling terminal associated with the current
process. It's specified by POSIX, and is used by ports like openssh to
interface with the terminal even if the standard input/output is
redirected to somewhere else.
Our implementation leverages ProcFS's existing facilities to create
process-specific symbolic links. In our setup, `/dev/tty` is a symbolic
link to `/proc/self/tty`, which itself is a symlink to the appropriate
`/dev/pts` entry. If no TTY is attached, `/dev/tty` is left dangling.
Now that the userland has a compatiblity wrapper for select(), the
kernel doesn't need to implement this syscall natively. The poll()
interface been around since 1987, any code still using select()
should be slapped silly.
Note: the SerenityOS source tree mostly uses select() and not poll()
despite SerenityOS having support for poll() since early 2019...
This includes a new Thread::Blocker called SignalBlocker which blocks
until a signal of a matching type is pending. The current Blocker
implementation in the Kernel is very complicated, but cleaning it up is
a different yak for a different day.
As required by posix. Also rename Thread::clear_signals to
Thread::reset_signals_for_exec since it doesn't actually clear any
pending signals, but rather does execve related signal book-keeping.
This option is already enabled when building Lagom, so let's enable it
for the main build too. We will no longer be surprised by Lagom Clang
CI builds failing while everything compiles locally.
Furthermore, the stronger `-Wsuggest-override` warning is enabled in
this commit, which enforces the use of the `override` keyword in all
classes, not just those which already have some methods marked as
`override`. This works with both GCC and Clang.
It's not enough to just find the largest-address-not-above the argument,
we must also check that the found region actually contains the argument.
Regressed in a23edd42b8, thanks to Idan
for pointing this out.
Most of the time, we will be freeing physical pages within the
full-sized zones. We can do some simple math to find the right zone
immediately instead of looping through the zones, checking each one.
We still do loop through the slack/remainder zones at the end.
There's probably an even nicer way to solve this, but this is already a
nice improvement. :^)
We were already doing this for userspace memory regions (in the
Memory::AddressSpace class), so let's do it for kernel regions as well.
This gives a nice speed-up on test-js and probably basically everything
else as well. :^)
There is no use to create a temporary String of a char const* to just
cast it to a StringView on SysFSComponent construction again.
Also this could have lead to a UAF bug.
If we're sending an urgent signal (i.e. due to unexpected conditions)
and the Process did not setup any signal handler, we should immediately
terminate the Thread, to ensure the current trap frame is preserved for
the impending core dump.
Returning 'result.error().code()' erroneously creates an
ErrorOr<FlatPtr> of the positive errno code, which breaks our
error-returning convention.
This seems to be due to a forgotten minus-sign during the refactoring in
9e51e295cf. This latent bug was never
discovered, because currently the error-handling paths are rarely
exercised.
Also, remove incomplete, superfluous check.
Incomplete, because only the byte at the provided address was checked;
this misses the last bytes of the "jerk page".
Superfluous, because it is already correctly checked by peek_user_data
(which calls copy_from_user).
The caller/tracer should not typically attempt to read non-userspace
addresses, we don't need to "hot-path" it either.