If the final copy_to_user() call fails when writing the file descriptors
to the output array, we have to make sure the file descriptors don't
remain in the process file descriptor table. Otherwise they are
basically leaked, as userspace is not aware of them.
This matches the behavior of our sys$socketpair() implementation.
We don't need to explicitly check for EMFILE conditions before doing
anything in sys$pipe(). The fd allocation code will take care of it
for us anyway.
We ensure that when we call SharedInodeVMObject::sync we lock the inode
lock before calling Inode virtual write_bytes method directly to avoid
assertion on the unlocked inode lock, as it was regressed recently. This
is not a complete fix as the need to lock from each path before calling
the write_bytes method should be avoided because it can lead to
hard-to-find bugs, and this commit only fixes the problem temporarily.
Previously, when starved for pages, *all* clean file-backed memory
would be released, which is quite excessive.
This patch instead releases just 1 page, since only 1 page is needed
to satisfy the request to `allocate_physical_page()`
Previously, we could only release *all* clean pages.
This patch makes it possible to release a specific amount of clean
pages. If the attempted number of pages to release is more than the
amount of clean pages, all clean pages will be released.
At the point at which we try to map the Region it was already added to
the Process region tree, so we have to make sure to remove it before
freeing it in the mapping failure path, otherwise the tree will contain
a dangling pointer to the free'd instance.
This fixes an issue where failing the fork due to OOM or other error,
we'd end up destroying the Process too early. By the time we got to
WaitBlockerSet::finalize(), it was long gone.
Until now, our only backup plan when running out of physical pages
was to try and purge volatile memory. If that didn't work out, we just
hung userspace out to dry with an ENOMEM.
This patch improves the situation by also considering clean, file-backed
pages (that we could page back in from disk).
This could be better in many ways, but it already allows us to boot to
WindowServer with 256 MiB of RAM. :^)
This enum was created to help put distinction between the commandset and
the interface type, as ATAPI devices are simply ATA devices utilizing
the SCSI commandset. Because we don't support ATAPI, putting such type
of distinction is pointless, so let's remove this for now.
We don't really support ATAPI (SCSI packets over ATA channels) and it's
uncertain if we ever will support such type of media. For this reason,
there's basically no reason to keep this code.
If we ever introduce ATAPI support into the Kernel, we can simply put
this back into the codebase.
In the near future, we will be able to figure out connections between
storage devices and their partitions, so there's no need to hardcode 16
partitions per storage device - each storage device should be able to
have "infinite" count of partitions in it, and we should be able to use
and figure out about them.
This does not need to be a critical dmesg, as the system stays up
it makes more sense for it to be a normal dmesg message.
Luke mentioned this on discord, they really deserve the credit :^)
Reported-by: Luke Wilde <lukew@serenityos.org>
This deadlock was introduced with the creation of this API. The lock
order is such that we always need to take the page directory lock
before we ever take the MM lock.
This function violated that, as both Region creation and region
destruction require the pd and mm locks, but with the mm lock
already acquired we deadlocked with SMP mode enabled while other
threads were allocating regions.
With this change SMP boots to the desktop successfully for me,
(and then subsequently has other issues). :^)
This makes sure that the debug message are properly aligned when running
the kernel bare-metal on a Raspberry Pi. While we are here, also move
the function out of line.
We were previously rejecting `SO_REUSEADDR` with an `ENOPROTOOPT`, but
that made QEMU unhappy. Instead, just silently discard it and print a
FIXME message in case anybody wonders what went wrong if the system
won't reuse an address.
Instead of requiring each FileSystem implementation to call this method
when trying to write data, do the calls at 2 points to avoid further
calls (or lack of them due to not remembering to use it) at other files
and locations in the codebase.
Currently the SysFS node for USB devices is only initialized for USB
hubs, which means it will cause a kernel crash upon being dereferenced
in a non-hub device. This fixes the problem by making initialization
happen for all USB devices.
We should actually start counting from the parent directory and not from
the symbolic link as it will represent a wrong count of hops from the
actual mountpoint.
The symlinks in /sys/dev/block and /sys/dev/char worked only by luck,
because I have set it to the wrong parent directory which is the
/sys/dev directory, so with the symlink it was 3 hops to /sys, together
with the root directory, therefore, everything seemed to work.
Now that the device symlinks in /sys/dev/block and /sys/dev/char are set
to the right parent directory and we start measure hops from root
directory with the parent directory of a symlink, everything seem to
work correctly now.
Now that the infrastructure of the Graphics subsystem is quite stable,
it is time to try to fix a long-standing problem, which is the lack of
locking on display connector devices. Reading and writing from multiple
processes to a framebuffer controlled by the display connector is not a
huge problem - it could be solved with POSIX locking.
The real problem is some program that will try to do ioctl operations on
a display connector without the WindowServer being aware of that which
can lead to very bad situations, for example - assuming a framebuffer is
encoded at a known resolution and certain display timings, but another
process changed the ModeSetting of the display connector, leading to
inconsistency on the properties of the current ModeSetting.
To solve this, there's a new "master" ioctl to take "ownership" and
another one to release that ownership of a display connector device. To
ensure we will not hold a Process object forever just because it has an
ownership over a display connector, we hold it with a weak reference,
and if the process is gone, someone else can take an ownership.
This header file represents the entire interface between the kernel and
userland, and as such, no longer should be called FB.h but something
that represents the whole graphics subsystem.
Everything in Kernel/Storage/Partition but DiskPartition has been moved
into LibPartiton. This makes the Partition directory unnecessary so
DiskPartition is moved up into Kernel/Storage.
This commit creates a new library LibPartition which will contain
partition related code sharable between Kernel and Userland and
includes DiskPartitionMetadata as the first shared class.
This argument is always set to description.is_blocking(), but
description is also given as a separate argument, so there's no point
to piping it through separately.
The interrupts enabled check in the Kernel mutex is there so that we
don't lock mutexes within a spinlock, because mutexes reenable
interrupts and that will mess up the spinlock in more ways than one if
the thread moves processors. This check is guarded behind a debug flag
because it's too hard to fix all the problems at once, but we regressed
and weren't even getting to init stage 2 with it enabled. With this
commit, we get to stage 2 again. In early boot, there are no interrupts
enabled and spinlocks used, so we can sort of kind of safely ignore the
interrupt state. There might be a better solution with another boot
state flag that checks whether APs are up (because they have interrupts
enabled from the start) but that seems overkill.
Right now the TD and QH descriptor pools look to be susceptible
to a race condition in the event they are accessed simultaneously
by separate threads making USB transfers. This fix does not seem to
add any noticeable overhead.
IDEChannel which is an ATAPort derived class holded a NonnullRefPtr to a
parent IDEController, although we can easily defer the usage of it to
not be in the IDEChannel code at all, so it allows to keep NonnullRefPtr
to the parent ATAController in the ATAPort base class and only there.
This abstraction layer is mainly for ATA ports (AHCI ports, IDE ports).
The goal is to create a convenient and flexible framework so it's
possible to expand to support other types of controller (e.g. Intel PIIX
and ICH IDE controllers) and to abstract operations that are possible on
each component.
Currently only the ATA IDE code is affected by this, making it much
cleaner and readable - the ATA bus mastering code is moved to the
ATAPort code so more implementations in the near future can take
advantage of such functionality easily.
In addition to that, the hierarchy of the ATA IDE code resembles more of
the SATA AHCI code now, which means the IDEChannel class is solely
responsible for getting interrupts, passing them for further processing
in the ATAPort code to take care of the rest of the handling logic.
We do that to increase clarity of the major and secondary components in
the subsystem. To ensure it's even more understandable, we rename the
files to better represent the class within them and to remove redundancy
in the name.
Also, some includes are removed from the general components of the ATA
components' classes.
We are able to read the EDID from SysFS, therefore there's no need to
provide this ioctl on a DisplayConnector anymore.
Also, now we can simply require the video pledge to be set before doing
any ioctl on a DisplayConnector.
It is starting to get a little messy with how each device can try to add
or remove itself to either /sys/dev/block or /sys/dev/char directories.
To better do this, we introduce 4 virtual methods to take care of that,
so until we ensure all nodes in /sys/dev/block and /sys/dev/char are
actual symlinks, we allow the Device base class to call virtual methods
upon insertion or before being destroying, so it add itself elegantly to
either of these directories or remove itself when needed.
For special cases where we need to create symlinks, we have two virtual
methods to be called otherwise to do almost the same thing mentioned
before, but to use symlinks instead.
Under normal conditions (when mounting SysFS in /sys), there will be a
new directory in the /sys/devices directory called "graphics".
For now, under that directory there will be only a sub-directory called
"connectors" which will contain all DisplayConnectors' details, each in
its own sub-directory too, distinguished in naming with its minor
number.
Therefore, /sys/devices/graphics/connectors/MINOR_NUMBER/ will contain:
- General device attributes such as mutable_mode_setting_capable,
double_buffering_capable, flush_support, partial_flush_support and
refresh_rate_support. These values are exposed in the ioctl interface
of the DisplayConnector class too, but these can be useful later on
for command line utilities that want/need to expose these basic
settings.
- The EDID blob, simply named "edid". This will help userspace to fetch
the edid without the need of using the ioctl interface later on.
This change in fact does the following:
1. Use support for symlinks between /sys/dev/block/ storage device
identifier nodes and devices in /sys/devices/storage/{LUN}.
2. Add basic nodes in a /sys/devices/storage/{LUN} directory, to let
userspace to know about the device and its details.
These methods are essentially splitted from the after_inserting method
and the will_be_destroyed method so later on we can allow Storage
devices to override the after_inserting method and the will_be_destroyed
method while still being able to use shared functionality as before,
such as adding the device to and removing it from the device list.
This enforces us to remove duplicated code across the SysFS code. This
results in great simplification of how the SysFS works now, because we
enforce one way to treat SysFSDirectory objects.
This will be used later on to help connecting a node at /sys/dev/block/
that represents a Storage device to a directory in /sys/devices/storage/
with details on that device in that directory.
These methods will be used later on to introduce symbolic links support
in the SysFS, so the kernel will be able to resolve relative paths of
components in filesystem based on using the m_parent_directory pointer
in each SysFSComponent object.
LUN address is essentially how people used to address SCSI devices back
in the day we had these devices more in use. However, SCSI was taken as
an abstraction layer for many Unix and Unix-like systems, so it still
common to see LUN addresses in use. In Serenity, we don't really provide
such abstraction layer, and therefore until now, we didn't use LUNs too.
However (again), this changes, as we want to let users to address their
devices under SysFS easily. LUNs make sense in that regard, because they
can be easily adapted to different interfaces besides SCSI.
For example, for legacy ATA hard drive being connected to the first IDE
controller which was enumerated on the PCI bus, and then to the primary
channel as slave device, the LUN address would be 0:0:1.
To make this happen, we add unique ID number to each StorageController,
which increments by 1 for each new instance of StorageController. Then,
we adapt the ATA and NVMe devices to use these numbers and generate LUN
in the construction time.
This folder in the SysFS code represents everything related to /sys/dev,
which is a directory meant to be a convenient interface to track all IDs
of all block and character devices (ID = major:minor numbers).
This bug was probably around for a very long time, but it is noticeable
only under VirtualBox as it generated an non fatal error which caused a
kernel panic because we VERIFYed the wrong lock to be locked.
There's no point in keeping this method as we don't really care if a
graphics adapter is VGA compatible or not because we don't use this
method anymore.
There's no real value in separating physical pages to supervisor and
user types, so let's remove the concept and just let everyone to use
"user" physical pages which can be allocated from any PhysicalRegion
we want to use. Later on, we will remove the "user" prefix as this
prefix is not needed anymore.
We are limited on the amount of supervisor pages we can allocate, so
don't allocate from that pool. Supervisor pages are always below 16 MiB
barrier so using those was crucial when we used devices like the ISA
SoundBlaster 16 card, because that device required very low physical
addresses to be used.
This used to be needed to protect accesses to Process::all_instances.
That list now has a more granular lock, so we don't need to take the
scheduler lock.
This fixes a crash when we try to access a locked Thread::m_fds in the
loop, which calls Thread::block, which then asserts that the scheduler
lock must not be locked by the current process.
Fixes#13617
We should not allocate a kernel region inside the constructor of the
VGATextModeConsole class. We do use MUST() because allocation cannot
fail at this point, but that happens in the static factory method
instead.
The original intention was to support other types of consoles based on
standard VGA modes, but it never came to an implementation, nor we need
such feature at all.
Therefore, this class is not needed and can be removed.
While null StringViews are just as bad, these prevent the removal of
StringView(char const*) as that constructor accepts a nullptr.
No functional changes.
This prevents us from needing a sv suffix, and potentially reduces the
need to run generic code for a single character (as contains,
starts_with, ends_with etc. for a char will be just a length and
equality check).
No functional changes.
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
This commit moves the length calculations out to be directly on the
StringView users. This is an important step towards the goal of removing
StringView(char const*), as it moves the responsibility of calculating
the size of the string to the user of the StringView (which will prevent
naive uses causing OOB access).
We never supported VGA framebuffers and that folder was a big misleading
part of the graphics subsystem.
We do support bare-bones VGA text console (80x25), but that only happens
to be supported because we can't be 100% sure we can always initialize
framebuffer so in the worst scenario we default to plain old VGA console
so the user can still use its own machine.
Therefore, the only remaining parts of VGA is in the GraphicsManagement
code to help driving the VGA text console if needed.
Uncommitted pages (shared zero pages) can not contain any existing data
and can not be modified, so there's no point to committing a bunch of
extra pages to cover for them in the forked child.
Since both the parent process and child process hold a reference to the
COW committed set, once the child process exits, the committed COW
pages are effectively leaked, only being slowly re-claimed each time
the parent process writes to one of them, realizing it's no longer
shared, and uncommitting it.
In order to mitigate this we now hold a weak reference the parent
VMObject from which the pages are cloned, and we use it on destruction
when available to drop the reference to the committed set from it as
well.
Until the thread is first set as Runnable at the end of sys$fork, its
state is Invalid, and as a result, the Finalizer which is searching for
Dying threads will never find it if the syscall short-circuits due to
an error condition like OOM. This also meant the parent Process of the
thread would be leaked as well.
The extra argument to fcntl is a pointer in the case of F_GETLK/F_SETLK
and we were pulling out a u32, leading to pointer truncation on x86_64.
Among other things, this fixes Assistant on x86_64 :^)
This is not explicitly specified by POSIX, but is supported by other
*nixes, already supported by our sys$bind, and expected by various
programs. While were here, also clean up the user memory copies a bit.
We were previously assuming that the how value was a bitfield, but that
is not the case, so we must explicitly check for SHUT_RDWR when
deciding on the read and write shutdowns.
The previous check for valid how values assumed this field was a bitmap
and that SHUT_RDWR was simply a bitwise or of SHUT_RD and SHUT_WR,
which is not the case.
`sigsuspend` was previously implemented using a poll on an empty set of
file descriptors. However, this broke quite a few assumptions in
`SelectBlocker`, as it verifies at least one file descriptor to be
ready after waking up and as it relies on being notified by the file
descriptor.
A bare-bones `sigsuspend` may also be implemented by relying on any of
the `sigwait` functions, but as `sigsuspend` features several (currently
unimplemented) restrictions on how returns work, it is a syscall on its
own.
When updating the signal mask, there is a small frame where we might set
up the receiving process for handing the signal and therefore remove
that signal from the list of pending signals before SignalBlocker has a
chance to block. In turn, this might cause SignalBlocker to never notice
that the signal arrives and it will never unblock once blocked.
Track the currently handled signal separately and include it when
determining if SignalBlocker should be unblocking.
I haven't found any POSIX specification on this, but the Linux kernel
appears to handle it like that.
This is required by QEMU, as it just bulk-unlocks all its file locking
bytes without checking first if they are held.
Access to RDTSC is occasionally restricted to give malware one less
option to accurately time attacks (side-channels, etc.).
However, QEMU requires access to the timestamp counter for the exact
same reason (which is accurately timing its CPU ticks), so lets just
enable it for now.
The initialize_hba method now calls the reset method to reset the HBA
and initialize each AHCIPort. Also, after full HBA reset we need to turn
on the AHCI functionality of the HBA and global interrupts since they
are cleared to 0 according to the specification in the GHC register.
Instead of doing this in a parent class like the AHCIController, let's
do that directly in the AHCIPort class as that class is the only user of
these sort of physical pages. While it seems like we waste an entire 4KB
of physical RAM for each allocation, this could serve us later on if we
want to fetch other types of logs from the ATA device.
The way AHCIPortHandler held AHCIPorts and even provided them with
physical pages for the ATA identify buffer just felt wrong.
To fix this, AHCIPortHandler is not a ref-counted object anymore. This
solves the big part of the problem, because AHCIPorts can't hold a
reference to this object anymore, only the AHCIController can do that.
Then, most of the responsibilities are shifted to the AHCIController,
making the AHCIPortHandler a handler of port interrupts only.
The AHCI code is not very good at OOM conditions, so this is a first
step towards OOM correctness. We should not allocate things inside C++
constructors because we can't catch OOM failures, so most allocation
code inside constructors is exported to a different function.
Also, don't use a HashMap for holding RefPtr of AHCIPort objects in
AHCIPortHandler because this structure is not very OOM-friendly. Instead
use a fixed Array of 32 RefPtrs, as at most we can have 32 AHCI ports
per AHCI controller.
To prevent a race condition in case we received the ARP response in the
window between creating and initializing the Thread Blocker and the
actual blocking, we were checking if the IP address was updated in the
ARP table just before starting to block.
Unfortunately, the condition was partially flipped, which meant that if
the table was updated with the IP address we would still end up
blocking, at which point we would never end unblocking again, which
would result in LookupServer locking up as well.
Currently when allocating buffers for USB transfers, it is done
once for every transfer rather than once upon creation of the
USB device. This commit changes that by moving allocation of buffers
to the USB Pipe class where they can be reused.
In the same fashion like in the Linux kernel, we support pre-initialized
framebuffers that were set up by either the BIOS or the bootloader.
These framebuffers can be backed by any kind of video hardware, and are
not tied to VGA hardware at all. Therefore, this code should be in a
separate sub-folder in the Graphics subsystem to indicate this.
The flag will automatically initialize all variables to a pattern based
on it's type. The goal being here is to eradicate an entire bug class
of issues that can originate from uninitialized stack memory.
Some examples include:
- Kernel information disclosure, where uninitialized struct members
or struct padding is copied back to usermode, leaking kernel
information such as stack or heap addresses, or secret data like
stack cookies.
- Control flow based on uninitialized memory can cause a variety of
issues at runtime, including stack corruptions like buffer
overflows, heap corruptions due to deleting stray pointers.
Even basic logic bugs can result from control flow operating on
uninitialized data.
As of GCC 12 this flag is now supported.
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=a25e0b5e6ac8a77a71c229e0a7b744603365b0e9
Clang has already supported it for a few releases.
https://reviews.llvm.org/D54604
When the size of the audio data was not a multiple of a page size,
subtracting the page size from this unsigned variable would underflow it
close to 2^32 and be clamped to the page size again. This would lead to
writes into garbage addresses because of an incorrect write size,
interestingly only causing the write() call to error out.
Using saturating math neatly fixes this problem and allows buffer
lengths that are not a multiple of a page size.
Currently CursorStyle enum handles both the styles and the steadiness or
blinking of the terminal caret, which doubles the amount of its entries.
This commit changes CursorStyle to CursorShape and moves the blinking
option to a seperate boolean value.
The RDGSBASE userspace instruction allows programs to read the contents
of the gs segment register which contains a kernel pointer to the base
of the current Processor struct.
Since we don't use this instruction in Serenity at the moment, we can
simply disable it for now to ensure we don't break KASLR. Support can
later be restored once proper swapping of the contents of gs is done on
userspace/kernel boundaries.
This is basically unchanged since the beginning of 2020, which is a year
before we had proper ASLR.
Now that we have a proper ASLR implementation, we can turn this down a
bit, as it is no longer our only protection against predictable dynamic
loader addresses, and it actually obstructs the default loading address
of x86_64 quite frequently.
There's nothing stopping a userspace program from keeping a bunch of
threads around with a custom signal stack in a suspended state with
their normal thread stack mprotected to PROT_NONE.
OpenJDK seems to do this, for example.