This patch replaces the remaining users of this API with the new
try_copy_kstring_from_user() instead. Note that we still convert to a
String for continued processing, and I've added FIXME about continuing
work on using KString all the way.
This makes the following scenario impossible with an SMP setup:
1) CPU A enters unref() and decrements the link count to 0.
2) CPU B sees the process in the process list and ref()s it.
3) CPU A removes the process from the list and continues destructing.
4) CPU B is now holding a destructed Process object.
By holding the process list lock before doing anything with it, we
ensure that other CPUs can't find this process in the middle of it being
destructed.
This allows us to 1) let go of the Process when an inode is ref'ing for
ProcFSExposedComponent related reasons, and 2) change our ref/unref
implementation.
A hub can technically have up to 255 ports, given that bNbrPorts is a
u8 and the DeviceRemovable field is a VLA to support up to 255 ports.
Source: USB 2.0 Specification Section 11.23.2.1
That means this enum is not going to scale well in terms of size.
Replacing it with a raw u8 allows me to remove the two port assumption
and a cast.
Previously it would create a contiguous AVMO manually and pass it to
MM. This uses supervisor pages that quickly run out as they never get
returned and crash the system.
Instead, use allocate_kernel_region as we're only allocating a page so
it will be contiguous and will be returned when destroyed.
A potentially better solution would be to use a pool of transfers to
avoid all the allocations. This just prevents the system from crashing
within ~5 seconds from the continuous hub polling.
This patch begins the work of sharing types and macros between Kernel
and LibC instead of duplicating them via the kludge in UnixTypes.h.
The basic idea is that the Kernel vends various POSIX headers via
Kernel/API/POSIX/ and LibC simply #include's them to get the macros.
This is a bug that went unnoticed for a long time, so the exposed values
in SysFS PCI device directories were incorrect because the assigned PCI
address was simply the host bridge always.
Also, the bus typing should really be two hexadecimal digits and not 4
digits.
This patch removes KResult::operator int() and deals with the fallout.
This forces a lot of code to be more explicit in its handling of errors,
greatly improving readability.
The C++ standard specifies that `free` and `operator delete` should
be callable with nullptr. The non-aligned `kfree` already handles this,
but because of the pointer arithmetic to obtain the allocation start
pointer, the aligned version would produce undefined behavior.
In e7fb70b05, regular kmalloc was changed to return nullptr on
allocation failure instead of crashing. The `kmalloc_aligned_cxx`
wrapper used by the aligned operator new should do the same.
Enable the LOCK_DEBUG functionality for these new APIs, as it looks
like we want to move the whole system to use this in the not so distant
future. :^)
The LOCK_DEBUG conditional code is pretty ugly for a feature that we
only use rarely. We can remove a significant amount of this code by
utilizing a zero sized fake type when not building in LOCK_DEBUG mode.
This lets us keep the same API, but just let the compiler optimize it
away when don't actually care about the location the caller came from.
Calling error() on KResult is a mistake I made in 7ba991dc37, so
instead of doing that, which triggers an assertion if an error occured,
in Inode::read_entire method with VERIFY(nread <= sizeof(buffer)), we
really should just return the KResult and not to call error() on it.
When TCP sockets successfully establish a connection, any SO_ERROR
should be cleared back to success. For example, SO_ERROR gets set to
EINPROGRESS on asynchronous connect calls and should be cleared when
the socket moves to the Established state.
This allows for commands like netstat to reference /proc/net and
identify a connection's owning process. Process information is limited
to superusers and user owned processes.
Now that we have a significant amount of code paths handling OOM, lets
enable kmalloc and friends to actually return nullptr. This way we can
start stressing these paths and validating all of they work as expected.
The implementation uses try_copy_kstring_from_user to allocate a kernel
string using, but does not use the length of the resulting string.
The size parameter to the syscall is untrusted, as try copy kstring will
attempt to perform a `safe_strlen(..)` on the user mode string and use
that value for the allocated length of the KString instead. The bug is
that we are printing the kstring, but with the usermode size argument.
During fuzzing this resulted in us walking off the end of the allocated
KString buffer printing garbage (or any kernel data!), until we stumbled
in to the KSym region and hit a fatal page fault.
This is technically a kernel information disclosure, but (un)fortunately
the disclosure only happens to the Bochs debug port, and or the serial
port if serial debugging is enabled. As far as I can tell it's not
actually possible for an untrusted attacker to use this to do something
nefarious, as they would need access to the host. If they have host
access then they can already do much worse things :^).
The only two paths for copying strings in the kernel should be going
through the existing Userspace<char const*>, or StringArgument methods.
Lets enforce this by removing the option for using the raw cstring APIs
that were previously available.
Since the InodeIndex encapsulates a 64 bit value, it is correct to
ensure that the Kernel is exposing the entire value and the LibC is
aware of it.
This commit requires an entire re-compile because it's essentially a
change in the Kernel ABI, together with a corresponding change in LibC.
The compiler can re-order the structure (class) members if that's
necessary, so if we make Process to inherit from ProcFSExposedComponent,
even if the declaration is to inherit first from ProcessBase, then from
ProcFSExposedComponent and last from Weakable<Process>, the members of
class ProcFSExposedComponent (including the Ref-counted parts) are the
first members of the Process class.
This problem made it impossible to safely use the current toggling
method with the write-protection bit on the ProcessBase members, so
instead of inheriting from it, we make its members the last ones in the
Process class so we can safely locate and modify the corresponding page
write protection bit of these values.
We make sure that the Process class doesn't expand beyond 8192 bytes and
the protected values are always aligned on a page boundary.
If you want to record perf events, just enable profiling. This allows us
to add random perf events to programs without littering the file system
with perfcore files.
Making userspace provide a global string ID was silly, and made the API
extremely difficult to use correctly in a global profiling context.
Instead, simply make the kernel do the string ID allocation for us.
This also allows us to convert the string storage to a Vector in the
kernel (and an array in the JSON profile data.)
This syscall allows userspace to register a keyed string that appears in
a new "strings" JSON object in profile output.
This will be used to add custom strings to profile signposts. :^)
We have to disable interrupts before capturing the current Processor*,
or we risk storing the wrong one if we get preempted and resume on a
different CPU.
Caught by the VERIFY in RecursiveSpinLock::unlock()
Currently, to append a UTF-16 view to a StringBuilder, callers must
first convert the view to UTF-8 and then append the copy. Add a UTF-16
overload so callers do not need to hold an entire copy in memory.
This allows tracing the syscalls made by a thread through the kernel's
performance event framework, which is similar in principle to strace.
Currently, this merely logs a stack backtrace to the current thread's
performance event buffer whenever a syscall is made, if profiling is
enabled. Future improvements could include tracing the arguments and
the return value, for example.
Previously, non-blocking read operations on pipes returned EOF instead
of EAGAIN and non-blocking write operations blocked. This fixes
pkgsrc's bmake "eof on job pipe!" error message when running in
non-compatibility mode.
As pointed out by 8infy, this mechanism is racy:
WRITER:
1. ++update1;
2. write_data();
3. ++update2;
READER:
1. do { auto saved = update1;
2. read_data();
3. } while (saved != update2);
The following sequence can lead to a bogus/partial read:
R1 R2 R3
W1 W2 W3
We close this race by incrementing the second update counter first:
WRITER:
1. ++update2;
2. write_data();
3. ++update1;
This fixes the placeholder stub for the SO_ERROR via getsockopt. It
leverages the m_so_error value that each socket maintains. The SO_ERROR
option obtains and then clears this field, which is useful when checking
for errors that occur between socket calls. This uses an integer value
to return the SO_ERROR status.
Resolves#146
This patch adds a vDSO-like mechanism for exposing the current time as
an array of per-clock-source timestamps.
LibC's clock_gettime() calls sys$map_time_page() to map the kernel's
"time page" into the process address space (at a random address, ofc.)
This is only done on first call, and from then on the timestamps are
fetched from the time page.
This first patch only adds support for CLOCK_REALTIME, but eventually
we should be able to support all clock sources this way and get rid of
sys$clock_gettime() in the kernel entirely. :^)
Accesses are synchronized using two atomic integers that are incremented
at the start and finish of the kernel's time page update cycle.
Leave interrupts enabled so that we can still process IRQs. Critical
sections should only prevent preemption by another thread.
Co-authored-by: Tom <tomut@yahoo.com>
By making these functions static we close a window where we could get
preempted after calling Processor::current() and move to another
processor.
Co-authored-by: Tom <tomut@yahoo.com>
This removes Pipes dependency on the UHCIController by introducing a
controller base class. This will be used to implement other controllers
such as OHCI.
Additionally, there can be multiple instances of a UHCI controller.
For example, multiple UHCI instances can be required for systems with
EHCI controllers. EHCI relies on using multiple of either UHCI or OHCI
controllers to drive USB 1.x devices.
This means UHCIController can no longer be a singleton. Multiple
instances of it can now be created and passed to the device and then to
the pipe.
To handle finding and creating these instances, USBManagement has been
introduced. It has the same pattern as the other management classes
such as NetworkManagement.
Port2 logic was errantly using `portsc1` registers, meaning that
the port wouldn't be reset properly. In effect, this puts devices
connected to Port2 in an undefined state.
Processing SMP messages outside of non-SMP mode is a waste of time,
and now that we don't rely on the side effects of calling the message
processing function, let's stop calling it entirely. :^)
We were previously relying on a side effect of the critical section in
smp_process_pending_messages(): when exiting that section, it would
process any pending deferred calls.
Instead of relying on that, make the deferred invocations explicit by
calling deferred_call_execute_pending() in exit_trap().
This ensures that deferred calls get processed before entering the
scheduler at the end of exit_trap(). Since thread unblocking happens
via deferred calls, the threads don't have to wait until the next
scheduling opportunity when they could be ready *now*. :^)
This was the main reason Tom's SMP branch ran slowly in non-SMP mode.
Enter a critical section in Processor::exit_trap so that processing
SMP messages doesn't enable interrupts upon leaving. We need to delay
this until the end where we call into the Scheduler if exiting the
trap results in being outside of a critical section and irq handler.
Co-authored-by: Tom <tomut@yahoo.com>
First off: unregister the region from MemoryManager before unmapping it.
The order of operations here was a bit strange, presumably to avoid a
situation where a fault would happen while unmapping, and the fault
handler would find the MemoryManager region list in an invalid state.
Unregistering it before unmapping sidesteps the whole problem, and
allows us to easily fix another problem: a deadlock could occur due
to inconsistent acquisition order (PageDirectory must come before MM.)
We don't want to be holding the MM lock if it's a user region and we
have to consult the page directory, since that can lead to a deadlock
if we don't already have the page directory lock.
- Use the receiver's per-CPU entry in the message, instead of the
sender's. (Using the sender's entry wasn't safe for broadcast
messages since the same entry ended up on multiple message queues.)
- Retry the CAS until it *succeeds* instead of *fails*. This closes a
race window, and also ensures a correct return value. The return value
is used by the caller to decide whether to broadcast an IPI.
This was the main reason smp=on was so slow. We had CPUs busy-waiting
until someone else triggered an IPI and moved things along.
- Add a CPU pause hint to the spin loop. :^)
It may happen that CPU A manages to page in from the same inode
while we're just entering the same page fault handler on CPU B.
Handle it gracefully by checking if the data has already been paged in
(instead of VERIFY'ing that it hasn't) and then remap the page if that's
the case.
Due to a boolean mistake in smp_return_to_pool(), we didn't retry
pushing the message onto the freelist after a failed attempt.
This caused the message pool to eventually become completely empty
after enough contentious access attempts.
This patch also adds a pause hint to the CPU in the failed attempt
code path.
The kernel panic handler now parses the kernels boot_mode to decide how
to handle the panic. So the previous logic could end up in an panic loop
until we blew out the kernel stack.
Instead only validate the kernel's boot mode once per boot, after
initializing the kernel command line.
Taking a reference or a pointer to a value that's not aligned properly
is undefined behavior. While `[[gnu::packed]]` ensures that reads from
and writes to fields of packed structs is a safe operation, the
information about the reduced alignment is lost when creating pointers
to these values.
Weirdly enough, GCC's undefined behavior sanitizer doesn't flag these,
even though the doc of `-Waddress-of-packed-member` says that it usually
leads to UB. In contrast, x86_64 Clang does flag these, which renders
the 64-bit kernel unable to boot.
For now, the `address-of-packed-member` warning will only be enabled in
the kernel, as it is absolutely crucial there because of KUBSAN, but
might get excessively noisy for the userland in the future.
Also note that we can't append to `CMAKE_CXX_FLAGS` like we do for other
flags in the kernel, because flags added via `add_compile_options` come
after these, so the `-Wno-address-of-packed-member` in the root would
cancel it out.
Kernels built with Clang seem to be quite allocation-heavy compared to
their GCC counterparts. We would sometimes end up crashing during boot
because the eternal ranges had no free capacity.
The Clang error message reads like this (`-Wdeprecated-array-compare`):
> error: comparison between two arrays is deprecated; to compare
> array addresses, use unary '+' to decay operands to pointers.
The maximum number of virtual consoles is determined at compile time,
so we can pre-allocate that many slots, dodging some heap allocations.
Furthermore, virtual consoles are never destroyed, so it's fine to
simply store a raw pointer to the currently active one.
This commit implements the ISO 9660 filesystem as specified in ECMA 119.
Currently, it only supports the base specification and Joliet or Rock
Ridge support is not present. The filesystem will normalize all
filenames to be lowercase (same as Linux).
The filesystem can be mounted directly from a file. Loop devices are
currently not supported by SerenityOS.
Special thanks to Lubrsi for testing on real hardware and providing
profiling help.
Co-Authored-By: Luke <luke.wilde@live.co.uk>
I had to move the thread list out of the protected base area of Process
so that it could live with its lock (which needs to be mutable).
Ideally it would live in the protected area, so maybe we can figure out
a way to do that later.
When I laid down the foundation for the start of the big process lock
separation, I added asserts to all system call implementations to
validate we hold the big process lock in the locations we think we
should be.
Adding that assert to sys$profiling_enable broke boot time profiling as
we were never holding the lock on boot. Even though it's not technically
required, lets make sure to hold the lock while enabling to appease the
assert.
VirtualConsole::m_lock was only used in a single place: on_tty_write()
That function is already protected by a global lock, so this second
lock served no purpose whatsoever.