If there's no fds to copy in a message with proper space for an
SCM_RIGHTS set of cmsg headers, then don't try to copy them.
This avoids a Kernel panic when recvmsg-ing, as copy_to_user(p, 0, 0)
hits a VERIFY.
These changes are compatible with clang-format 16 and will be mandatory
when we eventually bump clang-format version. So, since there are no
real downsides, let's commit them now.
Small position independent code model (which we end up using after this
change) is suitable for us since the kernel is not expected to grow more
than 2Gb in size. This might be a bit risky since this model is not
mentioned anywhere except for System V ABI document but experiments show
that the kernel compiled with this change works just fine.
This otherwise caused a race condition between the signal dispatcher
(which sets sepc to the signal trampoline) and sepc being updated in the
trap handler.
We obviously have to keep the sepc set by the signal dispatcher and not
increment it afterwards.
While this clutters Process.cpp a tiny bit, I feel that it's worth it:
- 2x speed on the kcov_loop benchmark. Likely more during fuzzing.
- Overall code complexity is going down with this change.
- By reducing the code reachable from __sanitizer_cov_trace_pc code,
we can now instrument more code.
Sticking this to the function source has multiple benefits:
- We instrument more code, by not excluding entire files.
- NO_SANITIZE_COVERAGE can be used in Header files.
- Keeping the info with the source code, means if a function or
file is moved around, the NO_SANITIZE_COVERAGE moves with it.
This reverts commit 9dbec601b0.
For KCOV to be performant (or at least not even slower) we need to
mmap the PC buffer from both user and kernel space at the same time.
You can't mmap a character device, so this change didn't make sense.
Plus even if we did invent a new method to exfiltrate the coverage
information out of the kernel, it would be incompatible with existing
kernel fuzzers. That would be kind of annoying. 🙃
This simple delay loop uses the time CSR to wait for the given amount
of time. The tick frequency of the CSR is read from the
/cpus/timebase-frequency devicetree property.
Instead, rewrite the region page fault handling code to not use
PageFault::type() on RISC-V.
I split Region::handle_fault into having a RISC-V-specific
implementation, as I am not sure if I cover all page fault handling edge
cases by solely relying on MM's own region metadata.
We should probably also take the processor-provided page fault reason
into account, if we decide to merge these two implementations in the
future.
This commit also removes the unnecessary user_sp RegisterState member.
We never use the kernel stack pointer on entry, so we can simply always
store the stack pointer of the previous privilege mode in sp.
Also remove the sp member from mcontext, as RISC-V doesn't have a
dedicated stack pointer register.
sp is defined to be x2 (x[1] in our case) by the ABI.
I probably accidentally included sp while copying the struct from
aarch64.
sepc has to be incremented before the call to syscall_handler,
as we otherwise would return to the ecall instruction, resulting in an
infinite trap loop.
We can't increment it after syscall_handler, as sepc might get changed
while handling the syscall.
The value of this field is incremented by one, as a value of 0 for this
field means 1 entry supported.
A value of 0xffff for CAP.MQES would incorrectly by truncated to 0x0000,
if we don't increase the bit width of the return type.
RFC9293 states that from the TimeWait state the TCPSocket
should wait the MSL (2mins) for delayed segments to expire
so that their sequence numbers do not clash with a new
connection's sequence numbers using the same ip address
and port number. The wait also ensures the remote TCP peer
has received the ACK to their FIN segment.
Please be aware that I only have NIC with chip version 6 so
this is the only one that I have tested. Rest was implemented
via looking at Linux rtl8169 driver. Also thanks to IdanHo
for some initial work.
Either we mount from a loop device or other source, the user might want
to obfuscate the given source for security reasons, so this option will
ensure this will happen.
If passed during a mount, the source will be hidden when reading from
the /sys/kernel/df node.
Similarly to DevPtsFS, this filesystem is about exposing loop device
nodes easily in /dev/loop, so userspace doesn't need to do anything in
order to use new devices immediately.
Instead of returning a raw pointer, which could be technically invalid
when using it in the caller function, we return a valid RefPtr of such
device.
This ensures that the code in DevPtsFS is now safe from a rare race
condition in which the SlavePTY device is gone but we still have a
pointer to it.
This device is a block device that allows a user to effectively treat an
Inode as a block device.
The static construction method is given an OpenFileDescription reference
but validates that:
- The description has a valid custody (so it's not some arbitrary file).
Failing this requirement will yield EINVAL.
- The description custody points to an Inode which is a regular file, as
we only support (seekable) regular files. Failing this requirement
will yield ENOTSUP.
LoopDevice can be used to mount a regular file on the filesystem like
other supported types of (physical) block devices.
Instead of using C-arrays, and manually counting their lengths, use
AK::Array. And pass these arrays around as spans, instead of as pointer-
and-length pairs.
Such operation is almost equivalent to writing on an Inode, so lock the
Inode m_inode_lock exclusively.
All FileSystem Inode implementations then override a new method called
truncate_locked which should implement the actual truncating.
thread_context_first_enter reuses the context restoring code in the
trap handler, just like other arches already do.
The `ld x2, 1*8(sp)` is unnecessary in the trap handler, as the stack
pointer should be equal to the stack pointer slot in the RegisterState
if the trap is from supervisor mode (and we currently don't support
user traps).
This load will however make us unable to reuse that code for
thread_context_first_enter.
This commit adds two functions which save/restore the entire FPU state.
On RISC-V, you only need to save the floating pointer registers
themselves and the fcsr CSR, which contains the entire state of the F/D
extensions.
Some real hardware apparently uses smaller BAR sizes than sizeof(HBA)
with a completely filled port_regs member.
Change the port_regs array to a flexible array member, so we don't panic
while verifying that the BAR size is large enough to map this struct.
Accesses to this array are already bounds checked against
AHCI::Limits::MaxPorts.
Allowing creation of StorageDevicePartition objects for any arbitrary
BlockDevice objects means that we could technically create a
StorageDevicePartition for another StorageDevicePartition which is
obviously not the intention for this code. Instead, require to pass a
StorageDevice reference to ensure this cannot happen.
It is expected that these class members will be set when the object is
created (so they're set in the class constructor method) and never
change again, as its the driver responsibility to find these values
before creating a StorageDevice object.
This makes it easier to rely on these values later on as we don't expect
them to ever change for a StorageDevice object during its lifetime.
It calculated the disk size with the zero-based max addressable block
value.
For example, for a disk device with a block size of 512 bytes that has 2
LBAs so it can address LBA 0 and LBA 1 (so m_max_addressable_block is 1)
the calculated disk size will be 512 instead of 1024 bytes.
We remove can_read() and can_write(), as both of these methods should be
implemented for proper blocking support.
For our case, the previous code will simply block the user if they tries
to read beyond the max addressable offset, which is not a correct
behavior.
Instead, just do proper EOF guarding when calling read() and write() on
such objects.
Add a method for matehmatical operations when verifying IO operation
boundaries.
Also, make max_addressable_block method non-virtual, since no other
derived class actually has ever overrided this method.
RFC9293 states that a closed socket should reply to all non-RST
packets with an RST. This change implements this behaviour as
specified in section 3.5.2 in bullet point 1.
The former automatically adapts the prefix to binary and octal
output, and is what we already use in the majority of cases.
Patch generated by:
rg -l '0x\{' | xargs sed -i '' -e 's/0x{:/{:#/'
I ran it 4 times (until it stopped changing things) since each
invocation only converted one instance per line.
No behavior change.
According to the Intel Software Developer's Manual Volume 3A section
6.12.1.3, the interrupt gate type means the IF flag is cleared to
prevent nested interruption. The trap gate type does not modify the
IF flag. Thus the gate type argument is important when constructing
an IDTEntry.
On x86_64 and x86, storage_segment (bit 12 counting from 0)
is always 0 according to the Intel Software Developer's Manual,
volume 3A, section 6.11 and section 6.14.1. It has therefore
been removed as a parameter from IDTEntry's constructor and
hardwired to 0.
In a TTY's non-canonical mode, data availability can be configured by
setting VMIN and VTIME to determine the minimum amount of bytes to read
and the timeout between bytes, respectively. Some ports (such as SRB2)
set VMIN to 0 which effectively makes reading a TTY such as stdin a
non-blocking read. We didn't support this, causing ports to hang as soon
as they try to read stdin without any data available.
Add a very duct-tapey implementation for the case where VMIN == 0 by
overwriting the TTY's description's blocking status; 3 FIXMEs are
included to make sure we clean this up some day.
This makes it possible to use MakeIndexSequqnce in functions like:
template<typename T, size_t N>
constexpr auto foo(T (&a)[N])
This means AK/StdLibExtraDetails.h must now include AK/Types.h
for size_t, which means AK/Types.h can no longer include
AK/StdLibExtras.h (which arguably it shouldn't do anyways),
which requires rejiggering some things.
(IMHO Types.h shouldn't use AK::Details metaprogramming at all.
FlatPtr doesn't necessarily have to use Conditional<> and ssize_t could
maybe be in its own header or something. But since it's tangential to
this PR, going with the tried and true "lift things that cause the
cycle up to the top" approach.)
This easily led to kernel deadlocks if the stopped thread held an
important global mutex (like the disk cache lock) while blocking.
Resolve this by ensuring stopped threads have a chance to return to the
userland boundary before actually stopping.
Locking a mutex while holding a spinlock is always wrong, but in the
case of the scheduler lock, it also causes an assertion failure. (Which
would be triggered by 2 separate threads trying to ptrace at the same
time).
This helps ensure no one accidentally accesses m_requests without first
locking it's spinlock. In fact this change fixed such a case, since
process_cq() implicitly assumed the caller locked the lock, which was
not the case for NVMePollQueue::submit_sqe().
Due to an incorrect lambda scope capture declaration, we would copy the
result status at the start of the function, before it actually got
updated with the final status. Capture it by reference instead to
ensure we report the updated result.
Instead of assuming data races won't occur and trying to somehow verify
it with manual un-atomic tracking, we can just use a recursive spinlock
instead of a normal one, to resolve the original deadlock.
Most of the actual logic is identical, with the only real difference
being that one wraps it with an async work item.
Merge the implementations to reduce duplications (which will also
require the fixes in the next commits to only be done once).
Multiple fields in sstatus are defined as WPRI "Reserved Writes Preserve
Values, Reads Ignore Values", which means we have to preserve their
values when writing to other fields in the same CSR.
We don't really need to touch any fields except SIE.
Interrupts are probably already disabled, but just to be safe,
disable them explicitly.
We need to handle the character map to set the code point before we can
reassign the correct key to the queued_event.key. This fixes keyboard
shortcuts using the incorrect keys based on the keyboard layout.
Automarks are similar to bookmarks placed by the terminal, allowing the
user to selectively remove a single command and its output from the
terminal scrollback.
This commit implements a single way to add marks: automatically placing
them when the shell becomes interactive.
To make sure the shell behaves correctly after its expected prompt
position changes, the terminal layer forces a resize event to be passed
to the shell on such (possibly) partial clears; this also has the nice
side effect of fixing the disappearing prompt on the preexisting "clear
including history" action: Fixes#4192.